HDFS
HDFS creates an abstraction of resources; let me simplify it for you. Similar to
virtualization, you can see HDFS logically as a single unit for storing Big Data, but
in reality you are storing your data across multiple nodes in a distributed fashion.
HDFS follows a master-slave architecture: the NameNode is the master node and the
DataNodes are the slaves.
NameNode
It is the master daemon that maintains and manages the DataNodes (slave nodes). It
records the metadata of all the files stored in the cluster, e.g. the location of stored
blocks, the size of the files, permissions, hierarchy, etc. It records each and every
change that takes place to the file system metadata.
For example, if a file is deleted in HDFS, the NameNode will immediately record this
in the EditLog. It regularly receives a Heartbeat and a block report from all the
DataNodes in the cluster to ensure that the DataNodes are alive. It keeps a record of
all the blocks in HDFS and of the nodes on which these blocks are stored.
DataNode
These are slave daemons which run on each slave machine. The actual data is
stored on DataNodes. They are responsible for serving read and write requests from
the clients. They are also responsible for creating, deleting, and replicating blocks
based on the decisions taken by the NameNode.
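To make the division of labor concrete, here is a minimal sketch of talking to HDFS
from Python through the WebHDFS interface, using the hdfs package from PyPI; the
NameNode address, port and user below are assumptions for illustration.

from hdfs import InsecureClient  # WebHDFS client from the hdfs PyPI package

# NameNode WebHDFS address and user are assumptions for this sketch
client = InsecureClient('http://namenode-host:50070', user='hadoop')

# The client asks the NameNode where blocks live; the bytes themselves
# are streamed to and from the DataNodes.
client.upload('/data/words.txt', 'words.txt')
print(client.list('/data'))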
YARN
YARN performs all your processing activities by allocating resources and scheduling
tasks. It has two major daemons, i.e. ResourceManager and NodeManager.
ResourceManager
It is a cluster-level component (one for each cluster) and runs on the master
machine. It manages resources and schedules applications running on top of YARN.
NodeManager
It is a node-level component (one on each node) and runs on each slave machine. It
is responsible for managing containers and monitoring resource utilization in each
container. It also keeps track of node health and log management. It continuously
communicates with the ResourceManager to remain up-to-date. On top of this stack,
you can perform parallel processing on HDFS using MapReduce.
To learn more about Hadoop, you can go through this Hadoop Tutorial blog. Now that
we are all set with the Hadoop introduction, let's move on to the Spark introduction.
Apache Spark vs Hadoop: Introduction to Apache Spark
Apache Spark is a framework for real-time data analytics in a distributed computing
environment. It executes in-memory computations to increase the speed of data
processing. It is faster for processing large-scale data as it exploits in-memory
computations and other optimizations. As a consequence, it requires high processing
power.
1. Spark Core – Spark Core is the base engine for large-scale parallel and
distributed data processing. Additional libraries built atop the core allow
diverse workloads for streaming, SQL, and machine learning. It is responsible
for memory management and fault recovery; scheduling, distributing and
monitoring jobs on a cluster; and interacting with storage systems.
2. Spark Streaming – Spark Streaming is the component of Spark which is used to
process real-time streaming data. Thus, it is a useful addition to the core
Spark API. It enables high-throughput and fault-tolerant stream processing of
live data streams
3. Spark SQL: Spark SQL is a new module in Spark which integrates relational
processing with Spark’s functional programming API. It supports querying
data either via SQL or via the Hive Query Language. For those of you familiar
with RDBMS, Spark SQL will be an easy transition from your earlier tools
where you can extend the boundaries of traditional relational data
processing.
4. GraphX: GraphX is the Spark API for graphs and graph-parallel computation.
It extends the Spark RDD abstraction by introducing the Resilient Distributed
Property Graph: a directed multigraph with properties attached to each vertex
and edge.
5. MLlib (Machine Learning): MLlib stands for Machine Learning Library. Spark
MLlib is used to perform machine learning in Apache Spark.
As you can see, Spark comes packed with high-level libraries, including support for
R, SQL, Python, Scala, Java, etc. These standard libraries enable seamless
integration in complex workflows. On top of this, it also allows various sets of
services to integrate with it, like MLlib, GraphX, SQL + Data Frames, and Streaming
services, to increase its capabilities.
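As a quick illustration of the Spark Core API described above, here is a minimal
PySpark word count sketch; the local master and the input path /tmp/words.txt are
assumptions for illustration.

from pyspark import SparkContext

sc = SparkContext("local[2]", "wordcount-sketch")   # local master, 2 threads (assumption)
lines = sc.textFile("/tmp/words.txt")               # assumed input file
counts = (lines.flatMap(lambda line: line.split())  # split lines into words
               .map(lambda word: (word, 1))         # pair each word with a count of 1
               .reduceByKey(lambda a, b: a + b))    # sum the counts per word
print(counts.collect())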
To learn more about Apache Spark, you can go through this Spark Tutorial blog. Now
the ground is all set for Apache Spark vs Hadoop. Let’s move ahead and compare
Apache Spark with Hadoop on different parameters to understand their strengths.
Performance
Spark is fast because of its in-memory processing. It can also use disk for data that
doesn't all fit into memory. Spark's in-memory processing delivers near real-time
analytics. This makes Spark suitable for credit card processing systems, machine
learning, security analytics and Internet of Things sensors.
Hadoop was originally set up to continuously gather data from multiple sources
without worrying about the type of data, and to store it across a distributed
environment. MapReduce uses batch processing. MapReduce was never built for
real-time processing; the main idea behind YARN is parallel processing over a
distributed dataset.
The problem with comparing the two is that they perform processing differently.
Ease of Use
Spark comes with user-friendly APIs for Scala, Java, Python, and Spark SQL. Spark
SQL is very similar to SQL, so it becomes easier for SQL developers to learn it. Spark
also provides an interactive shell for developers to query, perform other actions, and
get immediate feedback.
You can ingest data into Hadoop easily, either by using the shell or by integrating it
with multiple tools like Sqoop and Flume. YARN is just a processing framework, and
it can be integrated with multiple tools like Hive and Pig. Hive is a data warehousing
component which performs reading, writing and managing of large data sets in a
distributed environment using a SQL-like interface. You can go through this Hadoop
ecosystem blog to learn about the various tools that can be integrated with Hadoop.
Costs
Hadoop and Spark are both Apache open source projects, so there's no cost for the
software. Cost is only associated with the infrastructure. Both products are
designed in such a way that they can run on commodity hardware with a low TCO.
Now you may be wondering about the ways in which they are different. Storage and
processing in Hadoop is disk-based, and Hadoop uses standard amounts of memory.
So, with Hadoop we need a lot of disk space as well as faster disks. Hadoop also
requires multiple systems to distribute the disk I/O.
Due to Apache Spark's in-memory processing, it requires a lot of memory, but it can
deal with a standard speed and amount of disk. Disk space is a relatively
inexpensive commodity, and since Spark does not use disk I/O for processing, it
instead requires large amounts of RAM for executing everything in memory. Thus, a
Spark system incurs more cost.
But yes, one important thing to keep in mind is that Spark's technology reduces the
number of required systems. It needs significantly fewer systems, each of which
costs more. So, there will be a point at which Spark reduces the cost per unit of
computation even with the additional RAM requirement.
Data Processing
There are two types of data processing: Batch Processing & Stream Processing.
Batch Processing: Batch processing has been crucial to the big data world. In the
simplest terms, batch processing is working with high data volumes collected over a
period. In batch processing, data is first collected, and then processed results are
produced at a later stage.
Batch processing is an efficient way of processing large, static data sets. Generally,
we perform batch processing for archived data sets, for example, calculating the
average income of a country or evaluating the change in e-commerce over the last
decade.
Stream processing: Stream processing is the current trend in the big data world. The
need of the hour is speed and real-time information, which is what stream processing
delivers. Because batch processing does not allow businesses to quickly react to
changing business needs in real time, stream processing has seen rapid growth in
demand.
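As a small sketch of what stream processing looks like in Spark, the following
PySpark Streaming snippet counts words over one-second micro-batches; the socket
source on localhost:9999 is an assumption for illustration.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-sketch")
ssc = StreamingContext(sc, batchDuration=1)          # one-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)      # assumed live text source
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                      # print each batch's counts
ssc.start()
ssc.awaitTermination()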
Spark performs similar operations, but it uses in-memory processing and optimizes
the steps. GraphX allows users to view the same data as graphs and as collections.
Users can also transform and join graphs with Resilient Distributed Datasets (RDDs).
Fault Tolerance
Hadoop and Spark both provide fault tolerance, but they take different approaches.
For both HDFS and YARN, the master daemons (i.e. the NameNode and the
ResourceManager, respectively) check the heartbeats of the slave daemons (i.e. the
DataNodes and the NodeManagers, respectively). If any slave daemon fails, the
master daemon reschedules all pending and in-progress operations to another slave.
This method is effective, but it can significantly increase the completion times of
operations even with a single failure. As Hadoop uses commodity hardware, another
way in which HDFS ensures fault tolerance is by replicating data.
As we discussed above, RDDs are the building blocks of Apache Spark, and RDDs
provide fault tolerance to Spark. They can refer to any dataset present in an external
storage system like HDFS, HBase, or a shared filesystem, and they can be operated
on in parallel. RDDs can persist a dataset in memory across operations, which makes
future actions up to 10 times faster. If an RDD is lost, it will automatically be
recomputed by using the original transformations. This is how Spark provides
fault tolerance.
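A minimal sketch of this behavior: persisting an RDD keeps its partitions in memory
across actions, and any lost partition is rebuilt from the recorded lineage; the HDFS
path below is an assumption.

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "lineage-sketch")
raw = sc.textFile("hdfs:///tmp/words.txt")       # lineage starts at stable storage (assumed path)
counts = (raw.flatMap(lambda l: l.split())
             .map(lambda w: (w, 1))
             .reduceByKey(lambda a, b: a + b))
counts.persist(StorageLevel.MEMORY_ONLY)         # keep partitions in memory across actions
print(counts.count())                            # first action materializes the RDD
print(counts.take(5))                            # reuses cached partitions; lost ones are recomputed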
Security
Spark currently supports authentication via a shared secret. Spark can integrate
with HDFS and it can use HDFS ACLs and file-level permissions. Spark can also run
on YARN leveraging the capability of Kerberos.
Real-time data analysis means processing data generated by real-time event
streams coming in at the rate of millions of events per second, Twitter data for
instance. The strength of Spark lies in its ability to support streaming of data along
with distributed processing. This is a useful combination that delivers near real-
time processing of data. MapReduce lacks such an advantage, as it was designed to
perform batch-cum-distributed processing on large amounts of data. Real-time data
can still be processed on MapReduce, but its speed is nowhere close to that of
Spark.
Spark claims to process data 100x faster than MapReduce in memory, and 10x faster
when going to disk.
Graph Processing:
Most graph processing algorithms, like PageRank, perform multiple iterations over
the same data, and this requires a message passing mechanism. We need to program
MapReduce explicitly to handle such multiple iterations over the same data.
Roughly, it works like this: read data from the disk and, after a particular iteration,
write results to HDFS, and then read the data from HDFS for the next iteration.
This is very inefficient since it involves reading and writing data to the disk, which
involves heavy I/O operations and data replication across the cluster for fault
tolerance. Also, each MapReduce iteration has very high latency, and the next
iteration can begin only after the previous job has completely finished.
Also, evaluating the score of a particular node requires message passing, carrying
the scores of its neighboring nodes. These computations need messages from a
node's neighbors (or data across multiple stages of the job), a mechanism that
MapReduce lacks.
Different graph processing tools such as Pregel and GraphLab were designed in
order to address the need for an efficient platform for graph processing algorithms.
These tools are fast and scalable, but are not efficient for creation and post-
processing of these complex multi-stage algorithms.
Almost all machine learning algorithms work iteratively. As we have seen earlier,
iterative algorithms involve I/O bottlenecks in MapReduce implementations.
MapReduce uses coarse-grained tasks (task-level parallelism) that are too heavy for
iterative algorithms. Spark, with the help of Mesos (a distributed system kernel),
caches the intermediate dataset after each iteration and runs multiple iterations on
this cached dataset, which reduces the I/O and helps to run the algorithm faster in a
fault-tolerant manner.
Spark has a built-in scalable machine learning library called MLlib, which contains
high-quality algorithms that leverage iterations and yield better results than the
one-pass approximations sometimes used on MapReduce.
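For instance, a minimal MLlib sketch of an iterative algorithm: k-means re-reads the
same points on every iteration, so caching the RDD is what keeps the loop fast; the
toy points are made up for illustration.

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext("local", "mllib-sketch")
points = sc.parallelize([[0.0, 0.0], [1.0, 1.0],           # toy data (assumption)
                         [9.0, 8.0], [8.0, 9.0]]).cache()  # cached: re-read each iteration
model = KMeans.train(points, 2, maxIterations=10)          # iterates over the cached dataset
print(model.clusterCenters)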
Resource sharing across the cluster increases throughput and utilization (source)
Mesos is essentially a data center kernel, which means it is the software that actually
isolates the running workloads from each other. It still needs additional tooling to let
engineers get their workloads running on the system and to manage when those jobs
actually run. Otherwise, some workloads might consume all the resources, or
important workloads might get bumped by less important workloads that happen to
require more resources. Hence Mesos needs more than just a kernel: the Chronos
scheduler, a cron replacement for automatically starting and stopping services (and
handling failures), runs on top of Mesos. The other part of the Mesos ecosystem is
Marathon, which provides an API for starting, stopping and scaling services (and
Chronos could be one of those services).
Workloads in Chronos and Marathon (source)
Architecture
Mesos consists of a master process that manages slave daemons running on each cluster node, and
frameworks that run tasks on these slaves. The master implements fine-grained sharing across
frameworks using resource offers. Each resource offer is a list of free resources on multiple slaves.
The master decides how many resources to offer to each framework according to an organizational
policy, such as fair sharing or priority. To support a diverse set of inter-framework allocation
policies, Mesos lets organizations define their own policies via a pluggable allocation module.
Mesos architecture with two running frameworks (source)
Each framework running on Mesos consists of two components: a scheduler that registers with the
master to be offered resources, and an executor process that is launched on slave nodes to run the
framework's tasks. While the master determines how many resources to offer to each framework,
the frameworks' schedulers select which of the offered resources to use. When a framework accepts
offered resources, it passes Mesos a description of the tasks it wants to launch on them.
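The offer flow can be sketched as a framework scheduler. The snippet below uses the
legacy Mesos Python bindings (mesos.interface / mesos.native) and is only a sketch
of accepting or declining offers, not a complete framework; the resource amounts and
the command are assumptions.

from mesos.interface import Scheduler, mesos_pb2     # legacy Python bindings (assumption)

class SleepScheduler(Scheduler):
    def resourceOffers(self, driver, offers):
        for offer in offers:
            cpus = sum(r.scalar.value for r in offer.resources if r.name == 'cpus')
            if cpus < 1:
                driver.declineOffer(offer.id)        # hand the offer back to the master
                continue
            task = mesos_pb2.TaskInfo()              # describe one task on this slave
            task.task_id.value = 'task-' + offer.id.value
            task.slave_id.value = offer.slave_id.value
            task.name = 'sleep'
            task.command.value = 'sleep 10'
            cpu = task.resources.add()
            cpu.name, cpu.type = 'cpus', mesos_pb2.Value.SCALAR
            cpu.scalar.value = 1
            driver.launchTasks(offer.id, [task])     # accept the offer with this task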
DevOps tooling
Vamp is a deployment and workflow tool for container orchestration systems,
including Mesos/Marathon. It brings canary releasing, A/B testing, auto-scaling and
self-healing through a web UI, CLI and REST API.
Aurora
An important tool that has evolved out of the Mesos environment is Aurora, which
recently graduated from the Apache Incubator and is now a full Apache project
(Figure 1). According to the project website, "Aurora runs applications and services
across a shared pool of machines, and is responsible for keeping them running,
forever. When machines experience failure, Aurora intelligently reschedules those
jobs onto healthy machines" [2]. In other words, Aurora is a little like an init tool for
data centers and cloud-based virtual environments.
Figure 1: Aurora is a Mesos Framework;
Mesos is in turn an Apache project.
The Aurora project has many fathers: in addition to its kinship with Apache and
Mesos, Aurora was initially supported by Twitter, and Google was at least indirectly
an inspiration for the project. The beginnings of Aurora date back to 2010. Bill
Farner, a member of the research team at Twitter, launched a project to facilitate
the operation of Twitter's infrastructure. The IT landscape of the short message
service had grown considerably at that time. The operations team was faced with
thousands of computers and hundreds of applications, and added to this was the
constant rollout of new software versions.
Bill Farner had previously worked at Google and had some experience working with Google's Borg
cluster manager [3]. In the early years, development took place only within Twitter and behind closed
doors. However, more and more employees contributed to the development, and Aurora became
increasingly important for the various Twitter services. Eventually, the opening of the project in the
direction of the open source community was a natural step to maintain such a fast-growing software
project. Aurora has been part of the Apache family since 2013.
What is Singularity
Singularity is a platform that enables deploying and running services and scheduled
jobs in the cloud or data centers. Combined with Apache Mesos, it provides efficient
management of the underlying processes' life cycles and effective use of cluster
resources.
Singularity is an essential part of the HubSpot Platform and is ideal for deploying micro-services. It
is optimized to manage thousands of concurrently running processes in hundreds of servers.
How it Works
Singularity is an Apache Mesos framework. It runs as a task scheduler on top of Mesos Clusters
taking advantage of Apache Mesos' scalability, fault-tolerance, and resource isolation. Apache
Mesos is a cluster manager that simplifies the complexity of running different types of applications
on a shared pool of servers. In Mesos terminology, Mesos applications that use the Mesos APIs to
schedule tasks in a cluster are called frameworks.
There are different types of frameworks and most frameworks concentrate on a specific type of task
(e.g. long-running vs scheduled cron-type jobs) or supporting a specific domain and relevant
technology (e.g. data processing with hadoop jobs vs data processing with spark).
Singularity tries to be more generic by combining long-running tasks and job
scheduling functionality in one framework to support many of the common process
types that developers need to deploy every day to build modern web applications and
services. While Mesos allows multiple frameworks to run in parallel, having a
consistent and uniform set of abstractions and APIs for handling deployments across
the organization greatly simplifies the PaaS architecture. Additionally, it reduces the
amount of framework boilerplate that must be supported, as all Mesos frameworks
must keep state, handle failures, and properly interact with the Mesos APIs. These
are the main reasons HubSpot engineers initiated the development of a new
framework. As of this moment, Singularity supports the following process types:
Web Services. These are long running processes which expose an API and may run with
multiple load balanced instances. Singularity supports automatic configurable health
checking of the instances at the process and API endpoint level as well as load balancing.
Singularity will automatically restart these tasks when they fail or exit.
Workers. These are long running processes, similar to web services, but do not expose an
API. Queue consumers are a common type of worker processes. Singularity does automatic
health checking, cool-down and restart of worker instances.
Scheduled (CRON-type) Jobs. These are tasks that periodically run according to a
provided CRON schedule. Scheduled jobs will not be restarted when they fail unless
instructed to do so. Singularity will run them again on the next scheduling cycle.
On-Demand Processes. These are manually run processes that will be deployed and ready
to run but Singularity will not automatically run them. Users can start them through an API
call or using the Singularity Web UI, which allows them to pass command line parameters
on-demand.
Singularity Components
Mesos frameworks have two major components. A scheduler component that registers with the
Mesos master to be offered resources and an executor component that is launched on cluster slave
nodes by the Mesos slave process to run the framework tasks.
The Mesos master determines how many resources are offered to each framework and the
framework scheduler selects which of the offered resources to use to run the required tasks. Mesos
slaves do not directly run the tasks but delegate the running to the appropriate executor that has
knowledge about the nature of the allocated task and the special handling that might be required.
As depicted in the figure, Singularity implements the two basic framework components as well as a
few more to solve common complex / tedious problems such as task cleanup and log tailing /
archiving without requiring developers to implement it for each task they want to run:
Singularity Scheduler
The scheduler is the core of Singularity: a DropWizard API that implements the Mesos Scheduler
Driver. The scheduler matches client deploy requests to Mesos resource offers and acts as a web
service offering a JSON REST API for accepting deploy requests.
Clients use the Singularity API to register the type of deployable item that they want to run (web
service, worker, cron job) and the corresponding runtime settings (cron schedule, # of instances,
whether instances are load balanced, rack awareness, etc.).
After a deployable item (a request, in API terms) has been registered, clients can post Deploy
requests for that item. Deploy requests contain information about the command to run, the executor
to use, executor specific data, required cpu, memory and port resources, health check URLs and a
variety of other runtime configuration options. The Singularity scheduler will then attempt to match
Mesos offers (which in turn include resources as well as rack information and what else is running
on slave hosts) with its list of Deploy requests that have yet to be fulfilled.
Rollback of failed deploys, health checking and load balancing are also part of the
advanced functionality the Singularity Scheduler offers. A new deploy for a long-
running service will run as shown in the diagram below.
When a service or worker instance fails in a new deploy, the Singularity scheduler will rollback all
instances to the version running before the deploy, keeping the deploys always consistent. After the
scheduler makes sure that a Mesos task (corresponding to a service instance) has entered the
TASK_RUNNING state it will use the provided health check URL and the specified health check
timeout settings to perform health checks. If health checks go well, the next step is to perform load
balancing of service instances. Load balancing is attempted only if the corresponding deployable
item has been defined to be loadBalanced. To perform load balancing between service instances,
Singularity supports a rich integration with a specific Load Balancer API. Singularity will post
requests to the Load Balancer API to add the newly deployed service instances and to remove those
that were previously running. Check Integration with Load Balancers to learn more. Singularity also
provides generic webhooks which allow third party integrations, which can be registered to follow
request, deploy, or task updates.
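A hypothetical sketch of this flow from a client's point of view, using Python's
requests; the base URL, endpoint paths and field names below are assumptions drawn
from the description above, not a verified API reference.

import requests

API = 'http://singularity-host:7099/singularity/api'   # assumed base URL

# 1. Register the deployable item (a "request" in API terms)
requests.post(API + '/requests', json={
    'id': 'my-web-service',
    'requestType': 'SERVICE',
    'instances': 2,
    'loadBalanced': True,
})

# 2. Post a deploy for that request
requests.post(API + '/deploys', json={
    'deploy': {
        'requestId': 'my-web-service',
        'id': 'deploy-1',
        'command': 'java -jar service.jar',
        'resources': {'cpus': 1, 'memoryMb': 512, 'numPorts': 1},
        'healthcheckUri': '/health',   # used after TASK_RUNNING, as described above
    },
})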
Slave Placement
When matching a Mesos resource offer to a deploy, Singularity can use one of several strategies to
determine if the host in the offer is appropriate for the task in question, or SlavePlacement in
Singularity terms. Available placement strategies are:
GREEDY: uses whatever slaves are available
SEPARATE_BY_DEPLOY/SEPARATE: ensures no two instances / tasks of the same
request and deploy id are ever placed on the same slave
SEPARATE_BY_REQUEST: ensures no two tasks belonging to the same request
(regardless of deploy id) are placed on the same host
OPTIMISTIC: attempts to spread out tasks but may schedule some on the same slave
SPREAD_ALL_SLAVES: ensures the task is running on every slave. Same behaviour
as SEPARATE_BY_DEPLOY, but autoscales the Request so that the number of
instances stays equal to the number of slaves.
Slave placement can also be impacted by slave attributes. There are three scenarios that Singularity
supports:
1. Specific Slaves -> For a certain request, only run it on slaves with matching attributes - In
this case, you would specify requiredSlaveAttributes in the json for your request, and
the tasks for that request would only be scheduled on slaves that have all of those attributes.
2. Reserved Slaves -> Reserve a slave for specific requests, only run those requests on those
slaves - In your Singularity config, specify the reserveSlavesWithAttributes field.
Singularity will then only schedule tasks on slaves with those attributes if the request's
required attributes also match those.
3. Test Group of Slaves -> Reserve a slave for specific requests, but don't restrict the requests
to that slave - In your Singularity config, specify the reserveSlavesWithAttributes
field as in the previous example. But, in the request json, specify the
allowedSlaveAttributes field. Then, the request will be allowed to run elsewhere in
the cluster, but will also have the matching attributes to run on the reserved slave.
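For example, a request JSON combining these fields might look like the following
sketch; the attribute names and values are hypothetical.

{
  "id": "my-web-service",
  "requestType": "SERVICE",
  "instances": 2,
  "requiredSlaveAttributes": { "zone": "us-east-1a" },
  "allowedSlaveAttributes": { "reserved_group": "test" }
}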
Dpark
Then just ``pip install dpark`` (``sudo`` may be needed if you encounter a permission
problem).
Example for word counting (wc.py):
from dpark import DparkContext

ctx = DparkContext()
file = ctx.textFile("/tmp/words.txt")                      # read the input file
words = file.flatMap(lambda x: x.split()).map(lambda x: (x, 1))
wc = words.reduceByKey(lambda x, y: x + y).collectAsMap()  # sum counts per word
print(wc)
This script can run locally or on a Mesos cluster without any modification, just using
different command-line arguments:
$ python wc.py
$ python wc.py -m process
$ python wc.py -m host[:port]
See examples/ for more use cases.
Configuration
DPark can run with Mesos 0.9 or higher.
If a $MESOS_MASTER environment variable is set, you can use a shortcut and run DPark with
Mesos just by typing
$ python wc.py -m mesos
$MESOS_MASTER can be any scheme of Mesos master, such as
$ export MESOS_MASTER=zk://zk1:2181,zk2:2181,zk3:2181/mesos_master
In order to speed up shuffling, you should deploy Nginx at port 5055 for accessing data in
DPARK_WORK_DIR (default is /tmp/dpark), such as:
server {
listen 5055;
server_name localhost;
root /tmp/dpark/;
}
UI
2 DAGs:
1. stage graph: a stage is a running unit containing a set of tasks, each of which
runs the same ops for one split of an RDD.
UI when running
Just open the URL from the log, e.g. start listening on Web UI http://server_01:40812 .
UI after running
1. Before the run, configure LOGHUB & LOGHUB_PATH_FORMAT in dpark.conf and
pre-create LOGHUB_DIR.

rdd = DparkContext().makeRDD([(1, 1)]).map(m).groupByKey()
rdd.map(m).collect()
rdd.map(m).collect()

Nodes are combined iff they have the same lineage, forming a logic tree inside a
stage; each node then contains a PIPELINE of RDDs.

rdd1 = get_rdd()
rdd2 = dc.union([get_rdd() for i in range(2)])
rdd3 = get_rdd().groupByKey()
dc.union([rdd1, rdd2, rdd3]).collect()
Exelixi
Exelixi is a distributed framework based on Apache Mesos, mostly implemented in Python
using gevent for high-performance concurrency. It is intended to run cluster computing
jobs (partitioned batch jobs, which include some messaging) in pure Python. By default, it
runs genetic algorithms at scale. However, it can handle a broad range of other problem
domains by using the --uow command line option to override the UnitOfWorkFactory class
definition.
Please see the project wiki for more details, including a tutorial on how to build Mesos-
based frameworks.
Quick Start
To check out the GA on a laptop (with Python 2.7 installed), simply run:
./src/ga.py
Otherwise, to run at scale, the following steps will help you get Exelixi running on Apache
Mesos. For help in general with command line options:
./src/exelixi.py -h
The following instructions are based on using the Elastic Mesos service, which uses Ubuntu
Linux servers running on Amazon AWS. Even so, the basic outline of steps shown here
apply in general.
First, launch an Apache Mesos cluster. Once you have confirmation that your cluster is
running (e.g., Elastic Mesos sends you an email message with a list of masters and slaves),
then use ssh to login on any of the masters:
ssh -A -l ubuntu <master-public-ip>
You must install the Python bindings for Apache Mesos. The default version of Mesos used
in this code changes as there are updates to Elastic Mesos, since the tutorials are based on
that service. You can check http://mesosphere.io/downloads/ for the latest. If you run
Mesos in a different environment, simply make a one-line change to the EGG environment
variable in the bin/local_install.sh script. You also need to install the Exelixi source.
On the Mesos master, download the master branch of the Exelixi code repo on GitHub and
install the required libraries:
wget https://github.com/ceteri/exelixi/archive/master.zip ; \
unzip master.zip ; \
cd exelixi-master ; \
./bin/local_install.sh
If you've customized the code by forking your own GitHub code repo, then substitute that
download URL instead. Alternatively, if you've customized by subclassing the
uow.UnitOfWorkFactory default GA, then place that Python source file into the src/
subdirectory.
Next, run the installation command on the master, to set up each of the slaves:
./src/exelixi.py -m localhost:5050 -w 2
Once everything has been set up successfully, the log file in exelixi.log will show a line:
all worker services launched and init tasks completed
Flink
Apache Flink is an open source streaming platform which gives you tremendous
capabilities to run real-time data processing pipelines in a fault-tolerant way at a scale of
millions of events per second.
The key point is that it does all this using the minimum possible resources at single millisecond
latencies.
So how does it manage that and what makes it better than other solutions in the same domain?
Fault tolerance
Flink provides robust fault tolerance using checkpointing (periodically saving internal
state to external sources such as HDFS).
Moreover, Flink's checkpointing mechanism can be made incremental (saving only the
changes and not the whole state), which really reduces the amount of data in HDFS and
the I/O duration. The checkpointing overhead is almost negligible, which enables users to
have large states inside Flink applications.
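As a minimal sketch, checkpointing is switched on per job; the snippet below uses the
PyFlink DataStream API found in newer Flink releases, and the interval is an
assumption.

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(10000)   # snapshot operator state every 10 seconds (assumed interval)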
Flink also provides a high-availability setup through ZooKeeper. This is for re-spawning
the job in cases where the driver (known as the JobManager in Flink) crashes due to some
error.
SQL Support
Like Spark Streaming, Flink also provides a SQL API interface, which makes writing a job
easier for people from a non-programming background. Flink SQL is maturing day by day
and is already being used by companies such as Uber and Alibaba to do analytics on
real-time data.
Environment Support
A Flink job can be run in a distributed system or on a local machine. The program can run
on Mesos, YARN and Kubernetes, as well as in standalone mode (e.g. in Docker containers).
Since Flink 1.4, Hadoop is not a prerequisite, which opens up a number of possibilities for
places to run a Flink job.
Awesome community
Flink has a great dev community, which allows for frequent new features and bug fixes as
well as great tools to ease the developer effort further. Some of these tools are:
Flink Tensorflow – Run Tensorflow graphs as a Flink process
Flink HTM – Anomaly detection in a stream in Flink
Tink – A temporal graph library built on top of Flink
Flink SQL and Complex Event Processing (CEP) were also initially developed by Alibaba
and contributed back to Flink.
Apache Hama
Apache Hama is a BSP (Bulk Synchronous Parallel) computing framework
on top of HDFS (Hadoop Distributed File System) for massive scientific
computations such as matrix, graph and network algorithms.
This release is the first release as a top-level project, and contains two
significant new features (Message Compressor, a complete clone of
Google's Pregel) and many improvements in computing system
performance and durability.
MPI
The Open MPI Project is an open source Message Passing Interface implementation that is developed and
maintained by a consortium of academic, research, and industry partners. Open MPI is therefore able to
combine the expertise, technologies, and resources from all across the High Performance Computing
community in order to build the best MPI library available. Open MPI offers advantages for system and
software vendors, application developers and computer science researchers.
NoSQL databases are broadly categorized into four types:
1. Key-Value Store – It has a big hash table of keys & values. {Example: Riak,
Amazon S3 (Dynamo)}
2. Document-based Store – It stores documents made up of tagged elements.
{Example: CouchDB}
3. Column-based Store – Each storage block contains data from only one column.
{Example: HBase, Cassandra}
4. Graph-based – A network database that uses edges and nodes to represent and
store data. {Example: Neo4J}
1. Key Value Store NoSQL Database
The schema-less format of a key value database like Riak is just about what you need for
your storage needs. The key can be synthetic or auto-generated while the value can be
String, JSON, BLOB (basic large object) etc.
The key-value type basically uses a hash table in which there exists a unique key and a
pointer to a particular item of data. A bucket is a logical group of keys, but buckets don't
physically group the data; there can be identical keys in different buckets.
When we reflect back on the CAP theorem, it becomes quite clear that key-value
stores are great on the Availability and Partition-tolerance aspects but definitely lack in
Consistency.
Example: Consider the data subset represented in the following table. Here the key is the
name of a country with a 3Pillar presence, while the value is a list of addresses of 3Pillar
centers in that country.

| Key | Value |
| --- | --- |
| "India" | {"B-25, Sector-58, Noida, India – 201301"} |
| "Romania" | {"IMPS Moara Business Center, Buftea No. 1, Cluj-Napoca, 400606", "City Business Center, Coriolan Brediceanu No. 10, Building B, Timisoara, 300011"} |
| "US" | {"3975 Fair Ridge Drive. Suite 200 South, Fairfax, VA 22033"} |
This key/value type of database allows clients to read and write values using a key, as
follows:
Get(key) returns the value associated with the key.
Put(key, value) associates the value with the key.
Multi-get(key1, key2, .., keyN) returns the list of values associated with the list of keys.
Delete(key) removes the entry for the key from the data store.
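A toy in-memory sketch of this interface, with buckets as logical key groups; it
models the general idea described above, not any specific product.

class KVStore:
    def __init__(self):
        self.buckets = {}                      # bucket name -> hash table of keys

    def put(self, bucket, key, value):
        self.buckets.setdefault(bucket, {})[key] = value

    def multi_get(self, bucket, *keys):
        table = self.buckets.get(bucket, {})
        return [table.get(k) for k in keys]    # one value (or None) per requested key

    def delete(self, bucket, key):
        self.buckets.get(bucket, {}).pop(key, None)

store = KVStore()
store.put('offices', 'India', ['B-25, Sector-58, Noida, India - 201301'])
store.put('offices', 'US', ['3975 Fair Ridge Drive. Suite 200 South, Fairfax, VA 22033'])
print(store.multi_get('offices', 'India', 'US'))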
While a key/value type of database seems helpful in some cases, it has some weaknesses
as well. One is that the model will not provide any kind of traditional database capabilities
(such as atomicity of transactions, or consistency when multiple transactions are executed
simultaneously). Such capabilities must be provided by the application itself.
Secondly, as the volume of data increases, maintaining unique values as keys may
become more difficult; addressing this issue requires the introduction of some complexity
in generating character strings that will remain unique among an extremely large set of
keys.
Riak and Amazon’s Dynamo are the most popular key-value store NoSQL databases.
2. Document Store NoSQL Database
The data which is a collection of key value pairs is compressed as a document store quite
similar to a key-value store, but the only difference is that the values stored (referred to as
“documents”) provide some structure and encoding of the managed data. XML, JSON
(Java Script Object Notation), BSON (which is a binary encoding of JSON objects) are
some common standard encodings.
The following example shows data values collected as a “document” representing the
names of specific retail stores. Note that while the three examples all represent locations,
the representative models are different.
{ officeName: "3Pillar Noida",
  address: { Street: "B-25", City: "Noida", State: "UP", Pincode: "201301" }
}
{ officeName: "3Pillar Timisoara",
  address: { Boulevard: "Coriolan Brediceanu No. 10", Block: "B, 1st Floor",
             City: "Timisoara", Pincode: "300011" }
}
{ officeName: "3Pillar Cluj",
  address: { Latitude: "40.748328", Longitude: "-73.985560" }
}
One key difference between a key-value store and a document store is that the latter
embeds attribute metadata associated with stored content, which essentially provides a
way to query the data based on the contents. For example, in the above example, one
could search for all documents in which "City" is "Noida", which would deliver a result set
containing all documents associated with any "3Pillar Office" that is in that particular city.
Apache CouchDB is an example of a document store. CouchDB uses JSON to store data,
JavaScript as its query language (using MapReduce), and HTTP for an API. Data and
relationships are not stored in tables, as is the norm with conventional relational
databases; instead, they are a collection of independent documents.
The fact that document style databases are schema-less makes adding fields to JSON
documents a simple task without having to define changes first.
Couchbase and MongoDB are the most popular document based databases.
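A minimal pymongo sketch of the "City is Noida" query discussed above; the host,
database and collection names are assumptions.

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')    # assumed local server
offices = client.companydb.offices                   # assumed database and collection
offices.insert_one({'officeName': '3Pillar Noida',
                    'address': {'Street': 'B-25', 'City': 'Noida',
                                'State': 'UP', 'Pincode': '201301'}})
for doc in offices.find({'address.City': 'Noida'}):  # query on embedded document attributes
    print(doc['officeName'])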
3. Column Store NoSQL Database
While most relational DBMSs store data in rows, the benefit of storing data in columns is
fast search/access and data aggregation. Relational databases store a single row as a
continuous disk entry, and different rows are stored in different places on disk, while
columnar databases store all the cells corresponding to a column as a continuous disk
entry, thus making search/access faster.
For example, querying the titles of a million articles would be a painstaking task with a
relational database, as it would go over each location to get the item titles. On the other
hand, with just one disk access, the titles of all the items can be obtained.
Data Model
Key: the permanent name of the record. Keys have different numbers of columns, so
the database can scale in an irregular way.
Keyspace: This defines the outermost level of an organization, typically the name of
the application. For example, ‘3PillarDataBase’ (database name).
Column: It has an ordered list of elements aka tuple with a name and a value defined.
The best known examples are Google's BigTable, and HBase & Cassandra, which were
inspired by BigTable.
BigTable, for instance, is a high-performance, compressed and proprietary data storage
system owned by Google. Modeled this way, the offices example from above looks like the
following:
{
  3PillarNoida: {
    address: {
      city: Noida,
      pincode: 201301
    },
    details: {
      strength: 250,
      projects: 20
    }
  },
  3PillarCluj: {
    address: {
      city: Cluj,
      pincode: 400606
    },
    details: {
      strength: 200,
      projects: 15
    }
  },
  3PillarTimisoara: {
    address: {
      city: Timisoara,
      pincode: 300011
    },
    details: {
      strength: 150,
      projects: 10
    }
  },
  3PillarFairfax: {
    address: {
      city: Fairfax,
      pincode: VA 22033
    },
    details: {
      strength: 100,
      projects: 5
    }
  }
}
Google’s BigTable, HBase and Cassandra are the most popular column store based
databases.
4. Graph Base NoSQL Database
In a Graph Base NoSQL Database, you will not find the rigid format of SQL or the tables
and columns representation, a flexible graphical representation is instead used which is
perfect to address scalability concerns. Graph structures are used with edges, nodes and
properties which provides index-free adjacency. Data can be easily transformed from one
model to the other using a Graph Base NoSQL database.
These databases use edges and nodes to represent and store data.
The nodes are organised by some relationships with one another, which are
represented by edges between the nodes.
Both the nodes and the relationships have some defined properties.
The following are some of the features of the graph based database, which are explained
on the basis of the example below:
Labeled, directed, attributed multi-graph: the graph contains nodes which are
labelled properly with some properties, and these nodes have some relationship with one
another, which is shown by the directional edges. For example, in the following
representation, "Alice knows Bob" is shown by an edge that also has some properties.
While relational database models can replicate the graphical ones, the edge would require
a join, which is a costly proposition.
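The "Alice knows Bob" example can be written as Cypher via the official Neo4j Python
driver; the Bolt URI and credentials below are assumptions.

from neo4j import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687',
                              auth=('neo4j', 'password'))   # assumed credentials
with driver.session() as session:
    # an edge with its own property, attached to two labeled nodes
    session.run("CREATE (a:Person {name: 'Alice'})-"
                "[:KNOWS {since: 2010}]->(b:Person {name: 'Bob'})")
    for record in session.run("MATCH (a:Person)-[k:KNOWS]->(b:Person) "
                              "RETURN a.name, k.since, b.name"):
        print(record)
driver.close()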
UseCase–
Any 'Recommended for You' rating you see on e-commerce websites (book/video renting
sites) is often derived by taking into account how other users have rated the product in
question. Arriving at such a use case is made easy using graph databases.
InfoGrid and Infinite Graph are the most popular graph-based databases. InfoGrid allows
the connection of as many edges (Relationships) and nodes (MeshObjects) as needed,
making it easier to represent hyperlinked and complex sets of information.
There are two kinds of GraphDatabase offered by InfoGrid; these include the following:
NetMeshBase – It is ideally suited for large distributed graphs and has additional
capabilities to communicate with other similar NetMeshBases.
System Properties Comparison Cassandra vs. HBase vs. MongoDB

| Property | Cassandra | HBase | MongoDB |
| --- | --- | --- | --- |
| Implementation language | Java | Java | C++ |
| Server operating systems | BSD, Linux, OS X, Windows | Linux, Unix, Windows (using Cygwin) | Linux, OS X, Solaris, Windows |
| Data scheme | schema-free | schema-free | schema-free (documents of the same collection often follow the same structure; optionally impose all or part of a schema by defining a JSON schema) |
| Typing (predefined data types such as float or date) | yes | no | yes: string, integer, double, decimal, boolean, date, object_id, geospatial |
| XML support (processing data in XML format, e.g. XML data structures and/or XPath, XQuery or XSLT) | no | no | no |
| Secondary indexes | restricted (only equality queries, not always the best performing solution) | no | yes |
| SQL | SQL-like SELECT, DML and DDL statements (CQL) | no | read-only SQL queries via the MongoDB Connector for BI |
| APIs and other access methods | proprietary protocol, CQL (Cassandra Query Language, an SQL-like language), Thrift | Java API, RESTful HTTP API, Thrift | proprietary protocol using JSON |
| Supported programming languages | C#, C++, Clojure, Erlang, Go, Haskell, Java, JavaScript, Node.js, Perl, PHP, Python, Ruby, Scala | C, C#, C++, Groovy, Java, PHP, Python, Scala | Actionscript*, C, C#, C++, Clojure*, ColdFusion*, D*, Dart*, Delphi*, Erlang, Go*, Groovy*, Haskell, Java, JavaScript, Lisp*, Lua*, MatLab*, Perl, PHP, PowerShell*, Prolog*, Python, R*, Ruby, Scala, Smalltalk* (* inofficial driver) |
| Server-side scripts (stored procedures) | no | yes (coprocessors in Java) | JavaScript |
| Triggers | yes | yes | no |
| Partitioning methods (storing different data on different nodes) | Sharding (no "single point of failure") | Sharding | Sharding |
| Replication methods (redundantly storing data on multiple nodes) | selectable replication factor (representation of geographical distribution of servers is possible) | selectable replication factor | Master-slave replication |
| MapReduce (API for user-defined Map/Reduce methods) | yes | yes | yes |
| Consistency concepts (in a distributed system) | Eventual Consistency or Immediate Consistency (can be decided individually for each write operation) | Immediate Consistency | Eventual Consistency or Immediate Consistency (can be decided individually for each write operation) |
| Foreign keys (referential integrity) | no | no | no (typically not used; similar functionality with DBRef possible) |
| Transaction concepts (data integrity after non-atomic manipulations) | no | no (atomicity and isolation are supported for single operations) | Multi-document ACID transactions with snapshot isolation |
| Concurrency (support for concurrent manipulation of data) | yes | yes | yes |
| Durability (support for making data persistent) | yes | yes | yes (optional) |
| In-memory capabilities (option to hold some or all structures in memory only) | no | no | yes (in-memory storage engine introduced with MongoDB version 3.2) |
| User concepts (access control) | access rights for users can be defined per object | Access Control Lists (ACL), implementation based on Hadoop and ZooKeeper | access rights for users and roles |
Benchmarking NoSQL Databases: Cassandra vs. MongoDB vs. HBase vs. Couchbase
Understanding the performance behavior of a NoSQL database like Apache Cassandra™ under
various conditions is critical. Conducting a formal proof of concept (POC) in the environment in
which the database will run is the best way to evaluate platforms. POC processes that include the
right benchmarks such as production configurations, parameters and anticipated data and concurrent
user workloads give both IT and business stakeholders powerful insight about platforms under
consideration and a view for how business applications will perform in production.
Independent benchmark analyses and testing of various NoSQL platforms under big data,
production-level workloads have been performed over the years and have consistently identified
Apache Cassandra as the platform of choice for businesses interested in adopting NoSQL as the
database for modern Web, mobile and IOT applications.
One benchmark analysis (Solving Big Data Challenges for Enterprise Application
Performance Management) by engineers at the University of Toronto, evaluating six
different data stores, found Apache Cassandra the "clear winner throughout our
experiments". Also, End Point Corporation, a database and open source consulting
company, benchmarked the top NoSQL databases, including Apache Cassandra, Apache
HBase, Couchbase, and MongoDB, using a variety of different workloads on AWS EC2.
The databases involved were:
Apache Cassandra: Highly scalable, high performance distributed database designed to handle
large amounts of data across many commodity servers, providing high availability with no single
point of failure.
Apache HBase: Open source, non-relational, distributed database modeled after Google’s BigTable
and is written in Java. It is developed as part of Apache Software Foundation’s Apache Hadoop
project and runs on top of HDFS (Hadoop Distributed File System), providing BigTable-like
capabilities for Hadoop.
MongoDB: Cross-platform document-oriented database system that eschews the traditional table-
based relational database structure in favor of JSON-like documents with dynamic schemas making
the integration of data in certain types of applications easier and faster.
End Point conducted the benchmark of these NoSQL database options on Amazon Web Services
EC2 instances, which is an industry-standard platform for hosting horizontally scalable services. In
order to minimize the effect of AWS CPU and I/O variability, End Point performed each test 3 times
on 3 different days. New EC2 instances were used for each test run to further reduce the impact of
any “lame instance” or “noisy neighbor” effects sometimes experienced in cloud environments, on
any one test.
NoSQL Database Performance Testing Results
When it comes to performance, it should be noted that there is (to date) no single "winner
takes all" among the top NoSQL databases, or any other NoSQL engine for that matter.
Depending on the use case and deployment conditions, it is almost always possible for one
NoSQL database to outperform another and yet lag its competitor when the rules of
engagement change. Here are a couple of snapshots of the performance benchmark to
give you a sense of how each NoSQL database stacks up.
Throughput by Workload
Each workload appears below with the throughput/operations-per-second (more is better) graphed
vertically, the number of nodes used for the workload displayed horizontally, and a table with the
result numbers following each graph.
Load process
For load, Couchbase, HBase, and MongoDB all had to be configured for non-durable writes to
complete in a reasonable amount of time, with Cassandra being the only database performing
durable write operations. Therefore, the numbers below for Couchbase, HBase, and MongoDB
represent non-durable write metrics.
What is HBase?
HBase supports random, real-time read/write access, with a goal of hosting very
large tables atop clusters of commodity hardware. Key aspects of how HBase works
include the following:
HBase uses ZooKeeper for coordination of “truth” across the cluster. As region
servers come online, they register themselves with ZooKeeper as members of the
cluster. Region servers have shards of data (partitions of a database table) called
“regions”.
When a change is made to a row, it is updated in a persistent Write-Ahead-Log
(WAL) file and Memstore, the sorted memory cache for HBase. Once Memstore
fills, its changes are “flushed” to HFiles in HDFS. The WAL ensures that HBase
does not lose the change if Memstore loses its data before it is written to an HFile.
During a read, HBase checks to see if the data exists first in Memstore, which can
provide the fastest response with direct memory access. If the data is not in
Memstore, HBase will retrieve the data from the HFile.
HFiles are replicated by HDFS, typically to at least 3 nodes. HBase always writes to
the local node first and then replicates to other nodes. In the event of a node failure,
HBase will assign the regions to another node that has a local HFile copy replicated
by HDFS.
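A small write/read round trip through the Thrift-based happybase client shows the
path described above; the host, table and column family are assumptions.

import happybase

connection = happybase.Connection('hbase-thrift-host')  # assumed Thrift server host
table = connection.table('users')                       # assumed table with family 'cf'
table.put(b'row-1', {b'cf:name': b'Alice'})             # lands in the WAL and Memstore first
print(table.row(b'row-1'))                              # served from Memstore or HFiles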
Splice Machine has chosen to replace the storage engine in Apache Derby (our
customized SQL database) with HBase to leverage its ability to scale out on
commodity hardware. HBase co-processors are used to embed Splice Machine in
each distributed HBase region (i.e., data shard). This enables Splice Machine to
achieve massive parallelization by pushing the computation down to each data
shard.
Redis:
When speed is a priority
If you need a high performance database, Redis is hard to beat. Redis can be a good way to increase
the speed of an existing application.
Not a typical standalone database
Redis is typically not a standalone database for most companies. At Craigslist, Redis is
used alongside their existing primary database (MongoDB). While it is possible for a
company to use Redis by itself, it's uncommon.
When you have a well-planned design
With Redis, you have a variety of data structures that include hashes, sets, and lists but you have to
define explicitly how the data will be stored. You have to plan your design and decide in advance
how you want to store and then organize your data. Redis is not well suited for prototyping. If you
regularly do prototyping, MongoDB is probably a better choice.
If your database will have a predictable size (or will stay the same size), Redis can be used to
increase lookup speed. If you know your key, then lookup will be very fast. For applications that
need real-time data, Redis can be great because it can look up information about specific keys
quickly.
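A quick redis-py sketch of that key-based lookup; host and port are the usual
defaults, but still assumptions here.

import redis

r = redis.Redis(host='localhost', port=6379)   # assumed local instance
r.set('session:42', 'alice')                   # write against a known key
print(r.get('session:42'))                     # constant-time lookup by key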
The size of your data is stable
While it is possible to do master-slave replication using Redis, distributing your keys
between multiple instances must be done within your application. This makes Redis
harder to scale than MongoDB.
MongoDB:
More flexible
If you’re not certain how you will query your data, MongoDB is probably a better choice.
MongoDB is better for prototyping and is helpful when you need to get data into a database and
you’re not certain of the final design from the beginning.
Easier to learn
With MongoDB, it is easier to query data because MongoDB uses a more consistent structure.
MongoDB typically has a shorter learning curve.
Better use of large, growing data sets
If you need to store data in large data sets with the potential for rapid growth, MongoDB is a good
choice. MongoDB is also easier to administer as the database grows.
If you want to have an even better understanding of MongoDB and its differences from
other NoSQL databases, the comparisons below may help.
System Properties Comparison GraphDB vs. Neo4j
Our visitors often compare GraphDB and Neo4j with Microsoft Azure Cosmos DB,
Amazon Neptune and MongoDB.
Editorial information provided by DB-Engines

| Property | GraphDB | Neo4j |
| --- | --- | --- |
| Description | Enterprise RDF and graph database with efficient reasoning, cluster and external index synchronization support | Open source graph database |
| Primary database model | Graph DBMS, RDF store | Graph DBMS |
| DB-Engines Ranking | Score 0.86; Rank #152 overall, #10 Graph DBMS, #6 RDF stores | Score 47.86; Rank #22 overall, #1 Graph DBMS |
| Website | graphdb.ontotext.com | neo4j.com |
| Technical documentation | graphdb.ontotext.com/documentation | neo4j.com/docs |
| Developer | Ontotext | Neo4j, Inc. |
| Initial release | 2000 | 2007 |
| Current release | 8.8, January 2019 | 3.5.1, December 2018 |
| License | commercial | Open Source |
| Cloud-based only | no | no |
| Implementation language | Java | Java, Scala |
| Server operating systems | All OS with a Java VM; Linux, OS X, Windows | Linux, OS X, Solaris, Windows |
| Data scheme | schema-free and OWL/RDFS-schema support | schema-free and schema-optional |
| Typing | yes | yes |
| XML support | no | no |
| Secondary indexes | yes; supports real-time synchronization and indexing in SOLR/Elasticsearch/Lucene, plus GeoSPARQL geometry data indexes | yes |
| SQL | no; SPARQL is used as query language | no |
| APIs and other access methods | GeoSPARQL, Java API, RDF4J API, RIO, Sail API, Sesame REST HTTP Protocol, SPARQL 1.1 | Bolt protocol, Cypher query language, Java API, Neo4j-OGM, RESTful HTTP API, Spring Data Neo4j, TinkerPop 3 |
| Supported programming languages | .Net, C#, Clojure, Java, JavaScript (Node.js), PHP, Python, Ruby, Scala | .Net, Clojure, Elixir, Go, Groovy, Haskell, Java, JavaScript, Perl, PHP, Python, Ruby, Scala |
| Server-side scripts | Java Server Plugin | yes |
| Triggers | no | yes |
| Partitioning methods | none | none |
| Replication methods | Master-master replication | Causal Clustering using Raft protocol |
| MapReduce | no | no |
| Consistency concepts | Immediate Consistency or Eventual Consistency (configurable in cluster mode per master or individual client request) | Causal and Eventual Consistency configurable in Causal Cluster setup; Immediate Consistency in stand-alone mode |
| Foreign keys | yes | yes |
| Transaction concepts | ACID | ACID |
| Concurrency | yes | yes |
| Durability | yes | yes |
| User concepts | yes | Users, roles and permissions; pluggable authentication with supported standards (LDAP, Active Directory, Kerberos) |
More information provided by the system vendors (excerpts):
Specific characteristics - GraphDB: "GraphDB Enterprise is a high-performance semantic repository created by Ontotext..." Neo4j: "Neo4j is a native graph database platform that is built to store, query, analyze..."
Competitive advantages - GraphDB: "GraphDB allows you to link text and data in big knowledge graphs. It offers an easy..." Neo4j: "Neo4j database is the only transactional database that combines everything you need..."
Typical application scenarios - GraphDB: "Metadata enrichment and management, linked open data publishing, semantic inferencing..." Neo4j: "Real-Time Recommendations, Master Data Management, Identity and Access Management, Network..."
Key customers - GraphDB: "BBC, Press Association, Financial Times, DK, Euromoney, The British Museum, Getty..." Neo4j: "Over 300 commercial customers and over 750 startups use Neo4j. Flagship customers..."
Market metrics - GraphDB: "GraphDB is the most utilized semantic triplestore for mission-critical enterprise..." Neo4j: "Neo4j boasts the world's largest graph database ecosystem with more than a 15 million..."
Licensing and pricing models - GraphDB: "GraphDB-Free is free to use. SE and Enterprise are licensed per CPU-core used. Perpetual,..." Neo4j: "GPL v3 license that can be used all the places where you might use MySQL. Neo4j Commercial..."
System Properties Comparison Neo4j vs. OrientDB

                          Neo4j                                      OrientDB
Description               Open source graph database                 Multi-model DBMS (Document, Graph,
                                                                     Key/Value)
Primary database model    Graph DBMS                                 Document store; Graph DBMS;
                                                                     Key-value store
DB-Engines Ranking        Score 47.86; Rank #22 overall,             Score 6.05; Rank #52 overall,
                          #1 Graph DBMS                              #8 Document stores, #3 Graph DBMS,
                                                                     #9 Key-value stores
Website                   neo4j.com                                  orientdb.com
Technical documentation   neo4j.com/docs                             orientdb.org/docs/3.0.x
Developer                 Neo4j, Inc.                                OrientDB LTD; CallidusCloud
Initial release           2007                                       2010
Current release           3.5.1, December 2018                       3.0.12, December 2018
License                   Open Source GPL version 3; commercial      Open Source Apache version 2
                          licenses available
Cloud-based only          no                                         no
Implementation language   Java, Scala                                Java
Server operating systems  Linux; OS X; Solaris; Windows              All OS with a Java JDK (>= JDK 6);
                                                                     can also be used server-less as an
                                                                     embedded Java database
Data scheme               schema-free and schema-optional            schema-free; schema can be enforced
                                                                     for the whole record ("schema-full")
                                                                     or for some fields only
                                                                     ("schema-hybrid")
Typing                    yes                                        yes
XML support               no                                         -
Secondary indexes         yes                                        yes; pluggable indexing subsystem,
                                                                     by default Apache Lucene
SQL                       no                                         SQL-like query language, no joins
APIs and other access     Bolt protocol, Cypher query language,      Java API, RESTful HTTP/JSON API,
methods                   Java API, Neo4j-OGM Object Graph           TinkerPop technology stack with
                          Mapper, RESTful HTTP API, Spring Data      Blueprints, Gremlin, Pipes
                          Neo4j, TinkerPop 3
Supported programming     .Net, Clojure, Elixir, Go, Groovy,         .Net, C, C#, C++, Clojure, Java,
languages                 Haskell, Java, JavaScript, Perl, PHP,      JavaScript, JavaScript (Node.js),
                          Python, Ruby, Scala                        PHP, Python, Ruby, Scala
Server-side scripts       yes (user-defined procedures and           Java, JavaScript
                          functions)
Triggers                  yes (via event handler)                    Hooks
Partitioning methods      none                                       Sharding
Replication methods       Causal Clustering using Raft protocol,     Master-master replication
                          available in the Enterprise version only
MapReduce                 no                                         no; could be achieved with
                                                                     distributed queries
Consistency concepts      Causal and Eventual Consistency            -
                          configurable in Causal Cluster setup;
                          Immediate Consistency in stand-alone
                          mode
Foreign keys              yes (relationships in graphs)              yes (relationships in graphs)
Transaction concepts      ACID                                       ACID
Concurrency               yes                                        yes
Durability                yes                                        yes
User concepts             Users, roles and permissions;              Access rights for users and roles;
                          pluggable authentication with              record-level security configurable
                          supported standards (LDAP, Active
                          Directory, Kerberos)
In this post, we are going to discuss the Apache Hadoop 2.x architecture and how its components work in detail.
HDFS
YARN
MapReduce
These three are also known as the Three Pillars of Hadoop 2. The major new component here is YARN, which is a real game-changing component in the Big Data Hadoop system.
2. HDFS
On the main master node, the HDFS component is also known as the NameNode; the NameNode is used to store metadata.
In Hadoop 2.x, some more nodes act as master nodes, as shown in the above diagram. Each of these second-level master nodes has three components:
1. Node Manager
2. Application Master
3. Data Node
Each of these second-level master nodes in turn manages one or more slave nodes, as shown in the above diagram. These slave nodes have two components:
1. Node Manager
2. HDFS
Here the HDFS component is also known as the DataNode; the DataNode component is used to store our application's actual Big Data. These nodes do not contain an Application Master component.
Hadoop 2.x Components In-detail Architecture
2. Application Manager
Application Master: negotiates resources from the Resource Manager and works with the Node Managers to execute and monitor the tasks of a single application.
Node Manager: runs on each node and is responsible for launching and monitoring the containers on that node.
Container: each master node or slave node contains a set of containers. In this diagram, the main node's NameNode is not showing its containers; however, it also contains a set of containers. A container is a portion of a node's resources (primarily memory) allocated by YARN, on either a master node or a slave node.
In Hadoop 2.x, the container plays a role similar to the fixed map/reduce slots of Hadoop 1.x. We will see the major differences between these two components later.
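To make this concrete, here is a quick way to inspect these daemons on a running cluster (a sketch; the exact daemons and output depend on your installation):
$ jps                    # lists the Hadoop daemons running on this machine (NameNode, DataNode, ResourceManager, NodeManager, ...)
$ hdfs dfsadmin -report  # asks the NameNode to report the DataNodes it manages
$ yarn node -list        # asks the ResourceManager for the registered NodeManagers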
Hive
Installation and Configuration
Requirements
Java 1.7
Note: Hive versions 1.2 onward require Java 1.7 or newer. Hive versions 0.14 to 1.1 work with Java 1.6 as well. Users are strongly advised to start moving to Java 1.8 (see HIVE-8607).
$ cd hive-x.y.z
$ export HIVE_HOME={{pwd}}
$ export PATH=$HIVE_HOME/bin:$PATH
The Hive GIT repository for the most recent Hive code is located here (the master branch):
$ git clone https://git-wip-us.apache.org/repos/asf/hive.git
$ cd hive
$ cd packaging/target/apache-hive-{version}-SNAPSHOT-bin/apache-hive-{version}-SNAPSHOT-bin
$ ls
LICENSE
NOTICE
README.txt
RELEASE_NOTES.txt
$ svn co http://svn.apache.org/repos/asf/hive/branches/branch-{version} hive
$ cd hive
$ ls
LICENSE
NOTICE
README.txt
RELEASE_NOTES.txt
Running Hive
$ export HADOOP_HOME=<hadoop-install-dir>
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
$ $HIVE_HOME/bin/hive
$ $HIVE_HOME/bin/hiveserver2
$ $HIVE_HOME/bin/beeline -u jdbc:hive2://$HS2_HOST:$HS2_PORT
$ $HIVE_HOME/bin/beeline -u jdbc:hive2://
Running HCatalog
$ $HIVE_HOME/hcatalog/sbin/hcat_server.sh
$ $HIVE_HOME/hcatalog/bin/hcat
$ $HIVE_HOME/hcatalog/sbin/webhcat_server.sh
$ bin/hive --hiveconf x1=y1 --hiveconf x2=y2        // sets the variables x1 and x2 to y1 and y2 respectively
$ bin/hiveserver2 --hiveconf x1=y1 --hiveconf x2=y2 // sets server-side variables x1 and x2 to y1 and y2 respectively
$ bin/beeline --hiveconf x1=y1 --hiveconf x2=y2     // sets client-side variables x1 and x2 to y1 and y2 respectively
Runtime Configuration
beeline> SET mapred.job.tracker=myhost.mycompany.com:50030;
Hive Logging
/tmp/<user.name>/hive.log
Note: In local mode, prior to Hive 0.13.0 the log file name was ".log" instead of "hive.log". This bug was fixed in release 0.13.0 (see HIVE-5528 and HIVE-5676).
hive.log.dir=<other_location>
bin/hive --hiveconf hive.root.logger=INFO,console      // for HiveCLI (deprecated)
bin/hiveserver2 --hiveconf hive.root.logger=INFO,console
Alternatively, the user can change the logging level only by using:
bin/hive --hiveconf hive.root.logger=INFO,DRFA         // for HiveCLI (deprecated)
bin/hiveserver2 --hiveconf hive.root.logger=INFO,DRFA
bin/hive --hiveconf hive.root.logger=INFO,DAILY        // for HiveCLI (deprecated)
bin/hiveserver2 --hiveconf hive.root.logger=INFO,DAILY
Audit Logs
Perf Logger
log4j.logger.org.apache.hadoop.hive.ql.log.PerfLogger=DEBUG
If the logger level has already been set to DEBUG at
root via hive.root.logger, the above setting is not
required to see the performance logs.
DDL Operations
lists all the tables that end with 's'. The pattern matching follows Java regular expressions; see the documentation at http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html.
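For reference, the command this passage describes is the Getting Started example:
hive> SHOW TABLES '.*s';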
Metadata Store
NOTES:
SQL Operations
Example Queries
GROUP BY
JOIN
MULTITABLE INSERT
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.*
WHERE src.key < 100
STREAMING
CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
or:
curl --remote-name http://files.grouplens.org/datasets/movielens/ml-100k.zip
unzip ml-100k.zip
And load u.data into the table that was just created:
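The load statement from the Hive tutorial (replace <path> with wherever you unzipped the dataset):
LOAD DATA LOCAL INPATH '<path>/u.data' OVERWRITE INTO TABLE u_data;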
Create weekday_mapper.py:
import sys
import datetime

# read tab-separated ratings from stdin and replace the timestamp with the weekday
for line in sys.stdin:
  line = line.strip()
  userid, movieid, rating, unixtime = line.split('\t')
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print('\t'.join([userid, movieid, rating, str(weekday)]))
CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
add FILE weekday_mapper.py;
INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;
SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;
Sqoop ships with a help tool. To display a list of all available tools,
type the following command:
$ sqoop help
usage: sqoop COMMAND [ARGS]
Available commands:
codegen Generate code to interact with
database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display
the results
export Export an HDFS directory to a
database table
help List available commands
import Import a table from a database to
HDFS
import-all-tables Import tables from a database to
HDFS
import-mainframe Import mainframe datasets to HDFS
list-databases List available databases on a server
list-tables List available tables in a database
version Display version information
See 'sqoop help COMMAND' for information on a specific
command.
You can display help for a specific tool by entering: sqoop help
(tool-name); for example, sqoop help import.
You can also add the --help argument to any command: sqoop
import --help.
7. sqoop-import
7.1. Purpose
7.2. Syntax
7.2.1. Connecting to a Database Server
7.2.2. Selecting the Data to Import
7.2.3. Free-form Query Imports
7.2.4. Controlling Parallelism
7.2.5. Controlling Distributed Cache
7.2.6. Controlling the Import Process
7.2.7. Controlling transaction isolation
7.2.8. Controlling type mapping
7.2.9. Schema name handling
7.2.10. Incremental Imports
7.2.11. File Formats
7.2.12. Large Objects
7.2.13. Importing Data Into Hive
7.2.14. Importing Data Into HBase
7.2.15. Importing Data Into Accumulo
7.2.16. Additional Import Configuration Properties
7.3. Example Invocations
7.1. Purpose
The import tool imports an individual table from an RDBMS to
HDFS. Each row from a table is represented as a separate record in
HDFS. Records can be stored as text files (one record per line), or
in binary representation as Avro or SequenceFiles.
7.2. Syntax
7.2.1. Connecting to a Database Server
7.2.2. Selecting the Data to Import
7.2.3. Free-form Query Imports
7.2.4. Controlling Parallelism
7.2.5. Controlling Distributed Cache
7.2.6. Controlling the Import Process
7.2.7. Controlling transaction isolation
7.2.8. Controlling type mapping
7.2.9. Schema name handling
7.2.10. Incremental Imports
7.2.11. File Formats
7.2.12. Large Objects
7.2.13. Importing Data Into Hive
7.2.14. Importing Data Into HBase
7.2.15. Importing Data Into Accumulo
7.2.16. Additional Import Configuration Properties
$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)
While the Hadoop generic arguments must precede any import
arguments, you can type the import arguments in any order with
respect to one another.
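For example, a basic import of a table named EMPLOYEES in the corp database (host and table names are illustrative):
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES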
Note
In this document, arguments are grouped into collections organized by
function. Some collections are present in several tools (for example, the
"common" arguments). An extended description of their functionality is
given only on the first presentation in this document.
Note
The facility of using free-form query in the current version of Sqoop is
limited to simple queries where there are no ambiguous projections and
no OR conditions in the WHERE clause. Use of complex queries such as
queries that have sub-queries or joins leading to ambiguous projections can
lead to unexpected results.
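For instance, the user guide's free-form query example imports the results of a join, with Sqoop substituting split conditions for the $CONDITIONS token:
$ sqoop import \
    --query 'SELECT a.*, b.* FROM a JOIN b ON (a.id == b.id) WHERE $CONDITIONS' \
    --split-by a.id --target-dir /user/foo/joinresults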
If the target table does not exist, the Sqoop job will exit with an
error, unless the --accumulo-create-table parameter is specified.
Otherwise, you should create the target table before running an
import.
Sqoop currently serializes all values to Accumulo by converting
each field to its string representation (as if you were importing to
HDFS in text mode), and then inserts the UTF-8 bytes of this string
in the target cell.
By default, no visibility is applied to the resulting cells in Accumulo, so the data will be visible to any Accumulo user. Use the --accumulo-visibility parameter to specify a visibility token to apply to all rows in the import job.
For performance tuning, use the optional --accumulo-buffer-size and --accumulo-max-latency parameters. See Accumulo's documentation for an explanation of the effects of these parameters.
In order to connect to an Accumulo instance, you must specify the location of a Zookeeper ensemble using the --accumulo-zookeepers parameter, the name of the Accumulo instance (--accumulo-instance), and the username and password to connect with (--accumulo-user and --accumulo-password respectively).
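Putting these together, an import into Accumulo might look like the following sketch (connect string, table names, servers and credentials are all illustrative):
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --accumulo-table employees --accumulo-column-family info \
    --accumulo-zookeepers zk1:2181,zk2:2181 --accumulo-instance accumulo \
    --accumulo-user sqoopuser --accumulo-password secret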
Table 11. Code generation arguments:
Argument                 Description
--bindir <dir>           Output directory for compiled objects
--class-name <name>      Sets the generated class name. This overrides --package-name.
                         When combined with --jar-file, sets the input class.
--jar-file <file>        Disable code generation; use specified jar
--outdir <dir>           Output directory for generated code
--package-name <name>    Put auto-generated classes in this package
--map-column-java <m>    Override default mapping from SQL type to Java type for configured columns.
8. sqoop-import-all-tables
8.1. Purpose
8.2. Syntax
8.3. Example Invocations
8.1. Purpose
The import-all-tables tool imports a set of tables from an
RDBMS to HDFS. Data from each table is stored in a separate
directory in HDFS.
For the import-all-tables tool to be useful, the following
conditions must be met:
Each table must have a single-column primary key or the --autoreset-to-one-mapper option must be used.
You must intend to import all columns of each table.
You must not intend to use non-default splitting column, nor
impose any conditions via a WHERE clause.
8.2. Syntax
$ sqoop import-all-tables (generic-args) (import-args)
$ sqoop-import-all-tables (generic-args) (import-args)
Although the Hadoop generic arguments must precede any import arguments, the import arguments can be entered in any order with respect to one another.
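For example (connect string is illustrative):
$ sqoop import-all-tables --connect jdbc:mysql://db.foo.com/corp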
Table 13. Common arguments
Argument                             Description
--connect <jdbc-uri>                 Specify JDBC connect string
--connection-manager <class-name>    Specify connection manager class to use
--driver <class-name>                Manually specify JDBC driver class to use
--hadoop-mapred-home <dir>           Override $HADOOP_MAPRED_HOME
--help                               Print usage instructions
--password-file                      Set path for a file containing the authentication password
-P                                   Read password from console
--password <password>                Set authentication password
--username <username>                Set authentication username
--verbose                            Print more information while working
--connection-param-file <filename>   Optional properties file that provides connection parameters
--relaxed-isolation                  Set connection transaction isolation to read uncommitted for the mappers.
9. sqoop-import-mainframe
9.1. Purpose
9.2. Syntax
9.2.1. Connecting to a Mainframe
9.2.2. Selecting the Files to Import
9.2.3. Controlling Parallelism
9.2.4. Controlling Distributed Cache
9.2.5. Controlling the Import Process
9.2.6. File Formats
9.2.7. Importing Data Into Hive
9.2.8. Importing Data Into HBase
9.2.9. Importing Data Into Accumulo
9.2.10. Additional Import Configuration Properties
9.3. Example Invocations
9.1. Purpose
The import-mainframe tool imports all sequential datasets in a partitioned dataset (PDS) on a mainframe to HDFS. A PDS is akin to
a directory on the open systems. The records in a dataset can
contain only character data. Records will be stored with the entire
record as a single text field.
9.2. Syntax
9.2.1. Connecting to a Mainframe
9.2.2. Selecting the Files to Import
9.2.3. Controlling Parallelism
9.2.4. Controlling Distributed Cache
9.2.5. Controlling the Import Process
9.2.6. File Formats
9.2.7. Importing Data Into Hive
9.2.8. Importing Data Into HBase
9.2.9. Importing Data Into Accumulo
9.2.10. Additional Import Configuration Properties
$ sqoop import-mainframe (generic-args) (import-args)
$ sqoop-import-mainframe (generic-args) (import-args)
While the Hadoop generic arguments must precede any import
arguments, you can type the import arguments in any order with
respect to one another.
Table 19. Common arguments
Argument                             Description
--connect <hostname>                 Specify mainframe host to connect
--connection-manager <class-name>    Specify connection manager class to use
--hadoop-mapred-home <dir>           Override $HADOOP_MAPRED_HOME
--help                               Print usage instructions
--password-file                      Set path for a file containing the authentication password
-P                                   Read password from console
--password <password>                Set authentication password
--username <username>                Set authentication username
--verbose                            Print more information while working
--connection-param-file <filename>   Optional properties file that provides connection parameters
Example:
$ sqoop import-mainframe --connect z390 --username david --password 12345
Table 20. Import control arguments:
Argument Description
--as-avrodatafile Imports data to Avro Data Files
--as-sequencefile Imports data to SequenceFiles
--as-textfile Imports data as plain text (default)
--as-parquetfile Imports data to Parquet Files
--delete-target-dir Delete the import target directory if it exists
-m,--num-mappers <n> Use n map tasks to import in parallel
--target-dir <dir> HDFS destination dir
--warehouse-dir <dir> HDFS parent for table destination
-z,--compress Enable compression
--compression-codec <c> Use Hadoop codec (default gzip)
If the target table does not exist, the Sqoop job will exit with an
error, unless the --accumulo-create-table parameter is specified.
Otherwise, you should create the target table before running an
import.
Sqoop currently serializes all values to Accumulo by converting
each field to its string representation (as if you were importing to
HDFS in text mode), and then inserts the UTF-8 bytes of this string
in the target cell.
By default, no visibility is applied to the resulting cells in Accumulo, so the data will be visible to any Accumulo user. Use the --accumulo-visibility parameter to specify a visibility token to apply to all rows in the import job.
For performance tuning, use the optional --accumulo-buffer-size and --accumulo-max-latency parameters. See Accumulo's documentation for an explanation of the effects of these parameters.
In order to connect to an Accumulo instance, you must specify the location of a Zookeeper ensemble using the --accumulo-zookeepers parameter, the name of the Accumulo instance (--accumulo-instance), and the username and password to connect with (--accumulo-user and --accumulo-password respectively).
Table 26. Code generation arguments:
Argument                 Description
--bindir <dir>           Output directory for compiled objects
--class-name <name>      Sets the generated class name. This overrides --package-name.
                         When combined with --jar-file, sets the input class.
--jar-file <file>        Disable code generation; use specified jar
--outdir <dir>           Output directory for generated code
--package-name <name>    Put auto-generated classes in this package
--map-column-java <m>    Override default mapping from SQL type to Java type for configured columns.
10. sqoop-export
10.1. Purpose
10.2. Syntax
10.3. Inserts vs. Updates
10.4. Exports and Transactions
10.5. Failed Exports
10.6. Example Invocations
10.1. Purpose
The export tool exports a set of files from HDFS back to an
RDBMS. The target table must already exist in the database. The
input files are read and parsed into a set of records according to the
user-specified delimiters.
The default operation is to transform these into a set
of INSERT statements that inject the records into the database. In
"update mode," Sqoop will generate UPDATE statements that
replace existing records in the database, and in "call mode" Sqoop
will make a stored procedure call for each record.
10.2. Syntax
$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)
Although the Hadoop generic arguments must precede any export arguments, the export arguments can be entered in any order with respect to one another.
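For example, a basic export populating the bar table from files under /results/bar_data, and an update-mode variant that matches rows on the id column (names are illustrative):
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data --update-key id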
Table 27. Common arguments
Argument                             Description
--connect <jdbc-uri>                 Specify JDBC connect string
--connection-manager <class-name>    Specify connection manager class to use
--driver <class-name>                Manually specify JDBC driver class to use
--hadoop-mapred-home <dir>           Override $HADOOP_MAPRED_HOME
--help                               Print usage instructions
--password-file                      Set path for a file containing the authentication password
-P                                   Read password from console
--password <password>                Set authentication password
--username <username>                Set authentication username
--verbose                            Print more information while working
--connection-param-file <filename>   Optional properties file that provides connection parameters
--relaxed-isolation                  Set connection transaction isolation to read uncommitted for the mappers.
11.1. Purpose
Validate the data copied, for either import or export, by comparing the row counts from the source and the target after the copy.
11.2. Introduction
There are 3 basic interfaces: ValidationThreshold - Determines if
the error margin between the source and target are acceptable:
Absolute, Percentage Tolerant, etc. Default implementation is
AbsoluteValidationThreshold which ensures the row counts from
source and targets are the same.
ValidationFailureHandler - Responsible for handling failures: log an
error/warning, abort, etc. Default implementation is
LogOnFailureHandler that logs a warning message to the configured
logger.
Validator - Drives the validation logic by delegating the decision to
ValidationThreshold and delegating failure handling to
ValidationFailureHandler. The default implementation is
RowCountValidator which validates the row counts from source and
the target.
11.3. Syntax
$ sqoop import (generic-args) (import-args)
$ sqoop export (generic-args) (export-args)
Validation arguments are part of import and export arguments.
11.4. Configuration
The validation framework is extensible and pluggable. It comes with
default implementations but the interfaces can be extended to allow
custom implementations by passing them as part of the command
line arguments as described below.
Validator.
Property: validator
Description: Driver for validation,
must implement
org.apache.sqoop.validation.Validator
Supported values: The value has to be a fully qualified
class name.
Default value:
org.apache.sqoop.validation.RowCountValidator
Validation Threshold.
Property: validation-threshold
Description: Drives the decision based on the
validation meeting the
threshold or not. Must implement
org.apache.sqoop.validation.ValidationThreshold
Supported values: The value has to be a fully qualified
class name.
Default value:
org.apache.sqoop.validation.AbsoluteValidationThreshold
Validation Failure Handler.
Property: validation-failurehandler
Description: Responsible for handling failures, must
implement
org.apache.sqoop.validation.ValidationFailureHandler
Supported values: The value has to be a fully qualified
class name.
Default value:
org.apache.sqoop.validation.AbortOnFailureHandler
11.5. Limitations
Validation currently only validates data copied from a single table
into HDFS. The following are the limitations in the current
implementation:
all-tables option
free-form query option
Data imported into Hive, HBase or Accumulo
table import with --where argument
incremental imports
11.6. Example Invocations
A basic import of a table named EMPLOYEES in the corp database
that uses validation to validate the row counts:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp \
--table EMPLOYEES --validate
A basic export to populate a table named bar with validation
enabled:
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data --validate
Another example that overrides the validation args:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --validate --validator org.apache.sqoop.validation.RowCountValidator \
    --validation-threshold org.apache.sqoop.validation.AbsoluteValidationThreshold \
    --validation-failurehandler org.apache.sqoop.validation.AbortOnFailureHandler
13. sqoop-job
13.1. Purpose
13.2. Syntax
13.3. Saved jobs and passwords
13.4. Saved jobs and incremental imports
13.1. Purpose
The job tool allows you to create and work with saved
jobs. Saved jobs remember the parameters used to
specify a job, so they can be re-executed by invoking the job by its
handle.
If a saved job is configured to perform an incremental import, state
regarding the most recently imported rows is updated in the saved
job to allow the job to continually import only the newest rows.
13.2. Syntax
$ sqoop job (generic-args) (job-args) [-- [subtool-name]
(subtool-args)]
$ sqoop-job (generic-args) (job-args) [-- [subtool-name]
(subtool-args)]
Although the Hadoop generic arguments must precede any job arguments, the job arguments can be entered in any order with respect to one another.
Table 33. Job management options:
Argument            Description
--create <job-id>   Define a new saved job with the specified job-id (name). A second Sqoop
                    command-line, separated by a --, should be specified; this defines the saved job.
--delete <job-id>   Delete a saved job.
--exec <job-id>     Given a job defined with --create, run the saved job.
--show <job-id>     Show the parameters for a saved job.
--list              List all saved jobs
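For example, the lifecycle of a saved job might look like this (database and table names are illustrative):
$ sqoop job --create myjob -- import --connect jdbc:mysql://example.com/db --table mytable
$ sqoop job --list
$ sqoop job --show myjob
$ sqoop job --exec myjob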
14. sqoop-metastore
14.1. Purpose
14.2. Syntax
14.1. Purpose
The metastore tool configures Sqoop to host a shared metadata
repository. Multiple users and/or remote users can define and
execute saved jobs (created with sqoop job) defined in this
metastore.
Clients must be configured to connect to the metastore in sqoop-site.xml or with the --meta-connect argument.
14.2. Syntax
$ sqoop metastore (generic-args) (metastore-args)
$ sqoop-metastore (generic-args) (metastore-args)
Although the Hadoop generic arguments must precede any metastore arguments, the metastore arguments can be entered in any order with respect to one another.
Table 36. Metastore management options:
Argument Description
--shutdown Shuts down a running metastore instance on the same machine.
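For example, a client can create a saved job against a shared metastore via --meta-connect (hostname is illustrative; 16000 is the default metastore port):
$ sqoop job --create myjob \
    --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop \
    -- import --connect jdbc:mysql://example.com/db --table mytable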
15. sqoop-merge
15.1. Purpose
15.2. Syntax
15.1. Purpose
The merge tool allows you to combine two datasets where entries
in one dataset should overwrite entries of an older dataset. For
example, an incremental import run in last-modified mode will
generate multiple datasets in HDFS where successively newer data
appears in each dataset. The merge tool will "flatten" two datasets
into one, taking the newest available records for each primary key.
15.2. Syntax
$ sqoop merge (generic-args) (merge-args)
$ sqoop-merge (generic-args) (merge-args)
Although the Hadoop generic arguments must precede any merge arguments, the merge arguments can be entered in any order with respect to one another.
Table 37. Merge options:
Argument               Description
--class-name <class>   Specify the name of the record-specific class to use during the merge job.
--jar-file <file>      Specify the name of the jar to load the record class from.
--merge-key <col>      Specify the name of a column to use as the merge key.
--new-data <path>      Specify the path of the newer dataset.
--onto <path>          Specify the path of the older dataset.
--target-dir <path>    Specify the target path for the output of the merge job.
The merge tool runs a MapReduce job that takes two directories as
input: a newer dataset, and an older one. These are specified
with --new-data and --onto respectively. The output of the
MapReduce job will be placed in the directory in HDFS specified
by --target-dir.
When merging the datasets, it is assumed that there is a unique
primary key value in each record. The column for the primary key is
specified with --merge-key. Multiple rows in the same dataset
should not have the same primary key, or else data loss may occur.
To parse the dataset and extract the key column, the auto-generated class from a previous import must be used. You should specify the class name and jar file with --class-name and --jar-file. If this is not available, you can recreate the class using the codegen tool.
The merge tool is typically run after an incremental import with the
date-last-modified mode (sqoop import --incremental
lastmodified …).
Supposing two incremental imports were performed, where some
older data is in an HDFS directory named older and newer data is
in an HDFS directory named newer, these could be merged like so:
$ sqoop merge --new-data newer --onto older --target-dir
merged \
--jar-file datatypes.jar --class-name Foo --merge-key
id
This would run a MapReduce job where the value in the id column
of each row is used to join rows; rows in the newer dataset will be
used in preference to rows in the older dataset.
This can be used with SequenceFile-, Avro-, and text-based incremental imports. The file types of the newer and older datasets must be the same.
16. sqoop-codegen
16.1. Purpose
16.2. Syntax
16.3. Example Invocations
16.1. Purpose
The codegen tool generates Java classes which encapsulate and
interpret imported records. The Java definition of a record is
instantiated as part of the import process, but can also be
performed separately. For example, if Java source is lost, it can be
recreated. New versions of a class can be created which use
different delimiters between fields, and so on.
16.2. Syntax
$ sqoop codegen (generic-args) (codegen-args)
$ sqoop-codegen (generic-args) (codegen-args)
Although the Hadoop generic arguments must precede any codegen arguments, the codegen arguments can be entered in any order with respect to one another.
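For example, regenerating the record class for the employees table (connect string is illustrative):
$ sqoop codegen --connect jdbc:mysql://db.example.com/corp --table employees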
Table 38. Common arguments
Argument                             Description
--connect <jdbc-uri>                 Specify JDBC connect string
--connection-manager <class-name>    Specify connection manager class to use
--driver <class-name>                Manually specify JDBC driver class to use
--hadoop-mapred-home <dir>           Override $HADOOP_MAPRED_HOME
--help                               Print usage instructions
--password-file                      Set path for a file containing the authentication password
-P                                   Read password from console
--password <password>                Set authentication password
--username <username>                Set authentication username
--verbose                            Print more information while working
--connection-param-file <filename>   Optional properties file that provides connection parameters
--relaxed-isolation                  Set connection transaction isolation to read uncommitted for the mappers.
17. sqoop-create-hive-table
17.1. Purpose
17.2. Syntax
17.3. Example Invocations
17.1. Purpose
The create-hive-table tool populates a Hive metastore with a definition for a table based on a database table previously imported to HDFS, or one planned to be imported. This effectively performs the "--hive-import" step of sqoop-import without running the preceding import.
If data was already loaded to HDFS, you can use this tool to finish
the pipeline of importing the data to Hive. You can also create Hive
tables with this tool; data then can be imported and populated into
the target after a preprocessing step run by the user.
17.2. Syntax
$ sqoop create-hive-table (generic-args) (create-hive-
table-args)
$ sqoop-create-hive-table (generic-args) (create-hive-
table-args)
Although the Hadoop generic arguments must precede any create-hive-table arguments, the create-hive-table arguments can be entered in any order with respect to one another.
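For example, defining a Hive table emps based on the database table employees (names are illustrative):
$ sqoop create-hive-table --connect jdbc:mysql://db.example.com/corp \
    --table employees --hive-table emps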
Table 43. Common arguments
Argument                             Description
--connect <jdbc-uri>                 Specify JDBC connect string
--connection-manager <class-name>    Specify connection manager class to use
--driver <class-name>                Manually specify JDBC driver class to use
--hadoop-mapred-home <dir>           Override $HADOOP_MAPRED_HOME
--help                               Print usage instructions
--password-file                      Set path for a file containing the authentication password
-P                                   Read password from console
--password <password>                Set authentication password
--username <username>                Set authentication username
--verbose                            Print more information while working
--connection-param-file <filename>   Optional properties file that provides connection parameters
--relaxed-isolation                  Set connection transaction isolation to read uncommitted for the mappers.
18. sqoop-eval
18.1. Purpose
18.2. Syntax
18.3. Example Invocations
18.1. Purpose
The eval tool allows users to quickly run simple SQL queries
against a database; results are printed to the console. This allows
users to preview their import queries to ensure they import the
data they expect.
Warning
The eval tool is provided for evaluation purposes only. You can use it to verify a database connection from within Sqoop or to test simple queries. It is not supposed to be used in production workflows.
18.2. Syntax
$ sqoop eval (generic-args) (eval-args)
$ sqoop-eval (generic-args) (eval-args)
Although the Hadoop generic arguments must precede any eval arguments, the eval arguments can be entered in any order with respect to one another.
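For example, previewing ten records from the employees table (connect string is illustrative):
$ sqoop eval --connect jdbc:mysql://db.example.com/corp \
    --query "SELECT * FROM employees LIMIT 10"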
Table 46. Common arguments
Argument                             Description
--connect <jdbc-uri>                 Specify JDBC connect string
--connection-manager <class-name>    Specify connection manager class to use
--driver <class-name>                Manually specify JDBC driver class to use
--hadoop-mapred-home <dir>           Override $HADOOP_MAPRED_HOME
--help                               Print usage instructions
--password-file                      Set path for a file containing the authentication password
-P                                   Read password from console
--password <password>                Set authentication password
--username <username>                Set authentication username
--verbose                            Print more information while working
--connection-param-file <filename>   Optional properties file that provides connection parameters
--relaxed-isolation                  Set connection transaction isolation to read uncommitted for the mappers.
19. sqoop-list-databases
19.1. Purpose
19.2. Syntax
19.3. Example Invocations
19.1. Purpose
List database schemas present on a server.
19.2. Syntax
$ sqoop list-databases (generic-args) (list-databases-
args)
$ sqoop-list-databases (generic-args) (list-databases-
args)
Although the Hadoop generic arguments must precede any list-databases arguments, the list-databases arguments can be entered in any order with respect to one another.
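For example (connect string is illustrative):
$ sqoop list-databases --connect jdbc:mysql://database.example.com/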
Table 48. Common arguments
Argument                             Description
--connect <jdbc-uri>                 Specify JDBC connect string
--connection-manager <class-name>    Specify connection manager class to use
--driver <class-name>                Manually specify JDBC driver class to use
--hadoop-mapred-home <dir>           Override $HADOOP_MAPRED_HOME
--help                               Print usage instructions
--password-file                      Set path for a file containing the authentication password
-P                                   Read password from console
--password <password>                Set authentication password
--username <username>                Set authentication username
--verbose                            Print more information while working
--connection-param-file <filename>   Optional properties file that provides connection parameters
--relaxed-isolation                  Set connection transaction isolation to read uncommitted for the mappers.
Flume vs. Kafka vs. Kinesis - A Detailed Guide on Hadoop Ingestion Tools
As the amount of data available for systems to analyze increases by the day, the need for newer, faster ways to capture all this data in continuous streams is also arising. Apache Hadoop is possibly one of the most widely used frameworks for distributed storage and processing of Big Data sets. And with the help of various ingestion tools for Hadoop, it is now possible to capture raw sensor data as binary streams.
Three of the most popular Hadoop ingestion tools include Flume, Kafka and Kinesis. This post
aims at discussing the pros and cons of using each tool - from initial capturing of data to
monitoring and scaling.
Before we dive into this further, let us understand what a binary stream is. Most data that becomes available - user logs, logs from IoT devices, and so on - consists of streams of text events generated by some user action. This data can be broken into chunks based on the event that happened - a user clicks a button, changes a setting, and so on. A binary data stream is one in which, instead of breaking the stream down by events, the data is collected in a continuous stream at a specific rate. The ingestion tools in question capture this data and then push the serialized data out to Hadoop.
Now, back to the ingestion tools. Both Flume and Kafka are Apache projects, whereas Kinesis is a fully managed service provided by Amazon.
Apache Flume:
Flume also allows configuring multiple collector hosts for continued availability in case of a
collector failure.
Apache Kafka:
Kafka is gaining popularity in the enterprise space as the ingestion tool to use. A streaming
interface on Kafka is called a producer. Kafka also provides many producer implementations
and also lets you implement your own interface. With Kafka, you need to build your
consumer’s ability to plug into the data - there is no default monitoring implementation.
Scalability on Kafka is achieved by using partitions configured right inside the producer. Data is distributed across nodes in the cluster, and higher throughput requires more partitions. The tricky part here can be selecting the right partition scheme; generally, metadata from the source is used to partition the streams in a logical manner.
The best thing about Kafka is its resiliency via distributed replicas. These replicas do not affect throughput in any way, and Kafka is a hot favorite among most enterprises.
AWS Kinesis:
Kinesis is similar to Kafka in many ways. It is a fully managed service which integrates really
well with other AWS services. This makes it easy to scale and process incoming information.
Kinesis, unlike Flume and Kafka, only provides example implementations; there are no default producers available.
The one disadvantage Kinesis has compared to Kafka is that it is a cloud service. This introduces latency when communicating with an on-premise source, compared to an on-premise Kafka implementation.
The final choice of ingestion tool really depends on your use case. If you want a highly fault-tolerant, DIY solution and have developers to support it, Kafka is definitely the way to go. If you need something more out-of-the-box, use Kinesis or Flume. There again, choose wisely depending on how the data will be consumed: Kafka and Kinesis pull data, whereas Flume pushes it out using something called data sinks.
Apache Storm - also used for data streaming, but generally for shorter-term processing; it can be an add-on to your existing Hadoop environment
Chukwa (a Hadoop subproject) - devoted to large-scale log collection and analysis. It is built on top of HDFS and MapReduce and is highly scalable. It also includes a powerful monitoring toolkit
Streaming data gives a business the opportunity to identify real-time business value. Knowing
the big players and which one works best for your use case is a great enabler for you to make
the right architectural decisions.
Flume:
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating
and moving large amounts of log data from many different sources to a centralized data store.
The use of Apache Flume is not only restricted to log data aggregation. Since
data sources are customizable, Flume can be used to transport massive
quantities of event data including but not limited to network traffic data, social-
media-generated data, email messages and pretty much any data source
possible.
Apache Flume is a top level project at the Apache Software Foundation.
System Requirements
Architecture
Flume allows a user to build multi-hop flows where events travel through
multiple agents before reaching the final destination. It also allows fan-in and
fan-out flows, contextual routing and backup routes (fail-over) for failed hops.
Reliability
The events are staged in a channel on each agent. The events are then delivered to the next agent or terminal repository (like HDFS) in the flow. The events are removed from a channel only after they are stored in the channel of the next agent or in the terminal repository. This is how the single-hop message delivery semantics in Flume provide end-to-end reliability of the flow.
Flume uses a transactional approach to guarantee the reliable delivery of the
events. The sources and sinks encapsulate in a transaction the
storage/retrieval, respectively, of the events placed in or provided by a
transaction provided by the channel. This ensures that the set of events are
reliably passed from point to point in the flow. In the case of a multi-hop flow,
the sink from the previous hop and the source from the next hop both have
their transactions running to ensure that the data is safely stored in the
channel of the next hop.
Recoverability
The events are staged in the channel, which manages recovery from failure.
Flume supports a durable file channel which is backed by the local file system.
There's also a memory channel which simply stores the events in an in-memory queue, which is faster, but any events still left in the memory channel when an agent process dies can't be recovered.
Setup
Setting up an agent
Flume agent configuration is stored in a local configuration file. This is a text
file that follows the Java properties file format. Configurations for one or more
agents can be specified in the same configuration file. The configuration file
includes properties of each source, sink and channel in an agent and how they
are wired together to form data flows.
The agent needs to know what individual components to load and how they are
connected in order to constitute the flow. This is done by listing the names of
each of the sources, sinks and channels in the agent, and then specifying the
connecting channel for each sink and source. For example, an agent flows
events from an Avro source called avroWeb to HDFS sink hdfs-cluster1 via a
file channel called file-channel. The configuration file will contain names of
these components and file-channel as a shared channel for both avroWeb
source and hdfs-cluster1 sink.
Starting an agent
An agent is started using a shell script called flume-ng which is located in the
bin directory of the Flume distribution. You need to specify the agent name,
the config directory, and the config file on the command line:
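$ bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template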
Now the agent will start running source and sinks configured in the given
properties file.
A simple example
This configuration defines a single agent named a1. a1 has a source that
listens for data on port 44444, a channel that buffers event data in memory,
and a sink that logs event data to the console. The configuration file names the
various components, then describes their types and configuration parameters.
A given configuration file might define several named agents; when a given
Flume process is launched a flag is passed telling it which named agent to
manifest.
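Patterned on the single-node example in the Flume user guide, the configuration being described looks like this:
# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1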
Given this configuration file, we can start Flume as follows:
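$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console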
Note that in a full deployment we would typically include one more option: --conf=<conf-dir>. The <conf-dir> directory would include a shell script flume-env.sh and potentially a log4j properties file. In this example, we pass a Java option to force Flume to log to the console and we go without a custom environment script.
From a separate terminal, we can then telnet to port 44444 and send Flume an event:
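$ telnet localhost 44444
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
Hello world! <ENTER>
OK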
The original Flume terminal will output the event in a log message.
NB: this currently works for values only, not for keys (i.e., only on the "right side" of the = sign in the config lines).
This can be enabled via Java system properties on agent invocation by setting propertiesImplementation = org.apache.flume.node.EnvVarResolverProperties.
For example:
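a1.sources.r1.port = ${NC_PORT}

$ NC_PORT=44444 bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console -DpropertiesImplementation=org.apache.flume.node.EnvVarResolverProperties
(Here the netcat source's port is taken from the environment variable NC_PORT, following the user guide's example.)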
Logging the raw stream of data flowing through the ingest pipeline is not
desired behaviour in many production environments because this may result in
leaking sensitive data or security related configurations, such as secret keys, to
Flume log files. By default, Flume will not log such information. On the other
hand, if the data pipeline is broken, Flume will attempt to provide clues for
debugging the problem.
One way to debug problems with event pipelines is to set up an
additional Memory Channel connected to a Logger Sink, which will output all
event data to the Flume logs. In some situations, however, this approach is
insufficient.
In order to enable logging of event- and configuration-related data, some Java
system properties must be set in addition to log4j properties.
To enable configuration-related logging, set the Java system property -
Dorg.apache.flume.log.printconfig=true. This can either be passed on the
command line or by setting this in the JAVA_OPTS variable in flume-env.sh.
To enable data logging, set the Java system property -
Dorg.apache.flume.log.rawdata=true in the same way described above. For
most components, the log4j logging level must also be set to DEBUG or TRACE
to make event-specific logging appear in the Flume logs.
Here is an example of enabling both configuration logging and raw data logging
while also setting the Log4j loglevel to DEBUG for console output:
$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -
Dflume.root.logger=DEBUG,console -Dorg.apache.flume.log.printconfig=true -
Dorg.apache.flume.log.rawdata=true
- /flume
|- /a1 [Agent config file]
|- /a2 [Agent config file]
Once the configuration file is uploaded, start the agent with following options
$ bin/flume-ng agent –conf conf -z zkhost:2181,zkhost1:2181 -p
/flume –name a1 -Dflume.root.logger=INFO,console
Argument Name   Default   Description
z               -         Zookeeper connection string. Comma separated list of hostname:port
p               /flume    Base path in Zookeeper to store agent configurations
Flume has a fully plugin-based architecture. While Flume ships with many out-of-the-box sources, channels, sinks, serializers, and the like, many implementations exist which ship separately from Flume.
While it has always been possible to include custom Flume components by
adding their jars to the FLUME_CLASSPATH variable in the flume-env.sh file,
Flume now supports a special directory called plugins.d which automatically
picks up plugins that are packaged in a specific format. This allows for easier
management of plugin packaging issues as well as simpler debugging and
troubleshooting of several classes of issues, especially library dependency
conflicts.
The plugins.d directory
The plugins.d directory is located at $FLUME_HOME/plugins.d. At startup time,
the flume-ng start script looks in the plugins.d directory for plugins that
conform to the below format and includes them in proper paths when starting
up java.
plugins.d/
plugins.d/custom-source-1/
plugins.d/custom-source-1/lib/my-source.jar
plugins.d/custom-source-1/libext/spring-core-2.5.6.jar
plugins.d/custom-source-2/
plugins.d/custom-source-2/lib/custom.jar
plugins.d/custom-source-2/native/gettext.so
Data ingestion
RPC
An Avro client included in the Flume distribution can send a given file to Flume
Avro source using avro RPC mechanism:
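$ bin/flume-ng avro-client -H localhost -p 41414 -F /usr/logs/log.10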
The above command will send the contents of /usr/logs/log.10 to the Flume source listening on that port.
Executing commands
There's an exec source that executes a given command and consumes the output - a single 'line' of output, i.e. text followed by a carriage return ('\r') or line feed ('\n') or both together.
Network streams
Flume supports the following mechanisms to read data from popular log
stream types, such as:
1. Avro
2. Thrift
3. Syslog
4. Netcat
In order to flow the data across multiple agents or hops, the sink of the
previous agent and source of the current hop need to be avro type with the
sink pointing to the hostname (or IP address) and port of the source.
Consolidation
This can be achieved in Flume by configuring a number of first tier agents with
an avro sink, all pointing to an avro source of single agent (Again you could
use the thrift sources/sinks/clients in such a scenario). This source on the
second tier agent consolidates the received events into a single channel which
is consumed by a sink to its final destination.
Flume supports multiplexing the event flow to one or more destinations. This is
achieved by defining a flow multiplexer that can replicate or selectively route
an event to one or more channels.
The above example shows a source from agent “foo” fanning out the flow to
three different channels. This fan out can be replicating or multiplexing. In
case of replicating flow, each event is sent to all three channels. For the
multiplexing case, an event is delivered to a subset of available channels when
an event’s attribute matches a preconfigured value. For example, if an event
attribute called “txnType” is set to “customer”, then it should go to channel1
and channel3, if it’s “vendor” then it should go to channel2, otherwise
channel3. The mapping can be set in the agent’s configuration file.
Configuration
As mentioned in the earlier section, Flume agent configuration is read from a
file that resembles a Java property file format with hierarchical property
settings.
For example, an agent named agent_foo is reading data from an external avro
client and sending it to HDFS via a memory channel. The config file
weblog.config could look like:
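# list the sources, sinks and channels for the agent
agent_foo.sources = avro-AppSrv-source
agent_foo.sinks = hdfs-Cluster1-sink
agent_foo.channels = mem-channel-1

# set channel for source
agent_foo.sources.avro-AppSrv-source.channels = mem-channel-1

# set channel for sink
agent_foo.sinks.hdfs-Cluster1-sink.channel = mem-channel-1
(The component names follow the examples in the Flume user guide.)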
After defining the flow, you need to set properties of each source, sink and
channel. This is done in the same hierarchical namespace fashion where you
set the component type and other values for the properties specific to each
component:
The property “type” needs to be set for each component for Flume to
understand what kind of object it needs to be. Each source, sink and channel
type has its own set of properties required for it to function as intended. All
those need to be set as needed. In the previous example, we have a flow from
avro-AppSrv-source to hdfs-Cluster1-sink through the memory channel mem-
channel-1. Here’s an example that shows configuration of each of those
components:
agent_foo.sources = avro-AppSrv-source
agent_foo.sinks = hdfs-Cluster1-sink
agent_foo.channels = mem-channel-1
# properties of avro-AppSrv-source
agent_foo.sources.avro-AppSrv-source.type = avro
agent_foo.sources.avro-AppSrv-source.bind = localhost
agent_foo.sources.avro-AppSrv-source.port = 10000
# properties of mem-channel-1
agent_foo.channels.mem-channel-1.type = memory
agent_foo.channels.mem-channel-1.capacity = 1000
agent_foo.channels.mem-channel-1.transactionCapacity = 100
# properties of hdfs-Cluster1-sink
agent_foo.sinks.hdfs-Cluster1-sink.type = hdfs
agent_foo.sinks.hdfs-Cluster1-sink.hdfs.path =
hdfs://namenode/flume/webdata
#...
A single Flume agent can contain several independent flows. You can list
multiple sources, sinks and channels in a config. These components can be
linked to form multiple flows:
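# list the sources, sinks and channels for the agent
agent_foo.sources = avro-AppSrv-source1 exec-tail-source2
agent_foo.sinks = hdfs-Cluster1-sink1 avro-forward-sink2
agent_foo.channels = mem-channel-1 file-channel-2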
Then you can link the sources and sinks to their corresponding channels (for sources) or channel (for sinks) to set up two different flows. For example, if you need to set up two flows in an agent, one going from an external avro client to external HDFS and another from the output of a tail to an avro sink, then here's a config to do that:
# flow #1 configuration
agent_foo.sources.avro-AppSrv-source1.channels = mem-channel-1
agent_foo.sinks.hdfs-Cluster1-sink1.channel = mem-channel-1

# flow #2 configuration
agent_foo.sources.exec-tail-source2.channels = file-channel-2
agent_foo.sinks.avro-forward-sink2.channel = file-channel-2
To set up a multi-tier flow, you need to have an avro/thrift sink of the first hop pointing to an avro/thrift source of the next hop. This will result in the first Flume agent forwarding events to the next Flume agent. For example, if you are periodically sending files (1 file per event) using an avro client to a local Flume agent, then this local agent can forward them to another agent that has the storage mounted.
Weblog agent config:
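# weblog agent: avro source -> file channel -> avro sink pointing at the collector
agent_foo.sources = avro-AppSrv-source
agent_foo.sinks = avro-forward-sink
agent_foo.channels = file-channel

agent_foo.sources.avro-AppSrv-source.channels = file-channel
agent_foo.sinks.avro-forward-sink.channel = file-channel

# avro sink properties (the host and port of the next-hop agent are illustrative)
agent_foo.sinks.avro-forward-sink.type = avro
agent_foo.sinks.avro-forward-sink.hostname = 10.1.1.100
agent_foo.sinks.avro-forward-sink.port = 10000

HDFS agent config:
# hdfs agent: avro source -> memory channel -> hdfs sink
agent_foo.sources = avro-collection-source
agent_foo.sinks = hdfs-sink
agent_foo.channels = mem-channel

agent_foo.sources.avro-collection-source.channels = mem-channel
agent_foo.sinks.hdfs-sink.channel = mem-channel

# avro source properties
agent_foo.sources.avro-collection-source.type = avro
agent_foo.sources.avro-collection-source.bind = 10.1.1.100
agent_foo.sources.avro-collection-source.port = 10000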
Here we link the avro-forward-sink from the weblog agent to the avro-collection-source of the hdfs agent. This will result in the events coming from the external appserver source eventually getting stored in HDFS.
As discussed in the previous section, Flume supports fanning out the flow from one source to multiple channels. There are two modes of fan out, replicating and multiplexing. In the replicating flow, the event is sent to all the configured channels. In case of multiplexing, the event is sent to only a subset of qualifying channels. To fan out the flow, one needs to specify a list of channels for a source and the policy for fanning it out. This is done by adding a channel "selector" that can be replicating or multiplexing, and then specifying the selection rules if it's a multiplexer. If you don't specify a selector, then by default it's replicating:
<Agent>.sources.<Source1>.selector.type = replicating
The multiplexing selector has a further set of properties to bifurcate the flow. This requires specifying a mapping of an event attribute to a set of channels. The selector checks for each configured attribute in the event header. If it matches the specified value, then that event is sent to all the channels mapped to that value. If there's no match, then the event is sent to the set of channels configured as default:
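A sketch of such a selector configuration, matching the "State" example described next (source and channel names assumed):

agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
agent_foo.sources.avro-AppSrv-source1.selector.header = State
agent_foo.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping.NY = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.default = mem-channel-1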
The selector checks for a header called "State". If the value is "CA" then it's sent to mem-channel-1, if it's "AZ" then it goes to file-channel-2, or if it's "NY" then both. If the "State" header is not set or doesn't match any of the three, then it goes to mem-channel-1, which is designated as 'default'.
The selector also supports optional channels. To specify optional channels for a
header, the config parameter ‘optional’ is used in the following way:
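For instance, continuing the "State" example (channel names assumed from that example), the channels could be marked optional for "CA" events like this:

agent_foo.sources.avro-AppSrv-source1.selector.optional.CA = mem-channel-1 file-channel-2

A failure to write an event to an optional channel is simply ignored, whereas a failure on a required channel causes the transaction to be retried.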
SSL/TLS support
The global SSL parameters can be set via JSSE system properties, e.g.:

export JAVA_OPTS="$JAVA_OPTS -Djavax.net.ssl.keyStore=/path/to/keystore.jks"
export JAVA_OPTS="$JAVA_OPTS -Djavax.net.ssl.keyStorePassword=password"
Flume uses the system properties defined in JSSE (Java Secure Socket Extension), so this is a standard way for setting up SSL. On the other hand, specifying passwords in system properties means that the passwords can be seen in the process list. For cases where that is not acceptable, it is also possible to define the parameters in environment variables. In that case Flume initializes the JSSE system properties from the corresponding environment variables internally.
The SSL environment variables can either be set in the shell environment
before starting Flume or in conf/flume-env.sh. (Although, using the command
line is inadvisable because the commands including the passwords will be
saved to the command history.)
export FLUME_SSL_KEYSTORE_PATH=/path/to/keystore.jks
export FLUME_SSL_KEYSTORE_PASSWORD=password
Please note:
SSL must be enabled at component level. Specifying the global SSL parameters alone will not have any effect.
If the global SSL parameters are specified at multiple levels, the priority is the following (from higher to lower):
1. component parameters in agent config
2. system properties
3. environment variables
If SSL is enabled for a component, but the SSL parameters are not specified in any of the ways described above, then
in case of keystores: configuration error
in case of truststores: the default truststore will be used (jssecacerts / cacerts in Oracle JDK)
The truststore password is optional in all cases. If not specified, then no integrity check will be performed on the truststore when it is opened by the JDK.
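For example, enabling SSL on a single Avro source at component level might look like this (paths and passwords are placeholders):

a1.sources.r1.type = avro
a1.sources.r1.ssl = true
a1.sources.r1.keystore = /path/to/keystore.jks
a1.sources.r1.keystore-password = password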
Sources and sinks can have a batch size parameter that determines the
maximum number of events they process in one batch. This happens within a
channel transaction that has an upper limit called transaction capacity. Batch
size must be smaller than the channel’s transaction capacity. There is an
explicit check to prevent incompatible settings. This check happens whenever
the configuration is read.
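As an illustration, the following sketch keeps the sink batch size within the channel's transaction capacity (values are arbitrary):

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
# sink batch size kept below the channel's transactionCapacity
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.batchSize = 100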
Flume Sources
Avro Source
Listens on Avro port and receives events from external Avro client streams.
When paired with the built-in Avro Sink on another (previous hop) Flume
agent, it can create tiered collection topologies. Required properties are
in bold.
Property Name          Default  Description
channels               –
type                   –        The component type name, needs to be avro
bind                   –        hostname or IP address to listen on
port                   –        Port # to bind to
threads                –        Maximum number of worker threads to spawn
selector.type
selector.*
interceptors           –        Space-separated list of interceptors
interceptors.*
compression-type       none     This can be "none" or "deflate". The compression-type must match the compression-type of the matching AvroSink.
ssl                    false    Set this to true to enable SSL encryption. If SSL is enabled, you must also specify a "keystore" and a "keystore-password", either through component level parameters (see below) or as global SSL parameters (see the SSL/TLS support section).
keystore               –        This is the path to a Java keystore file. If not specified here, then the global keystore will be used (if defined, otherwise configuration error).
keystore-password      –        The password for the Java keystore. If not specified here, then the global keystore password will be used (if defined, otherwise configuration error).
keystore-type          JKS      The type of the Java keystore. This can be "JKS" or "PKCS12". If not specified here, then the global keystore type will be used (if defined, otherwise the default is JKS).
exclude-protocols      SSLv3    Space-separated list of SSL/TLS protocols to exclude. SSLv3 will always be excluded in addition to the protocols specified.
include-protocols      –        Space-separated list of SSL/TLS protocols to include. The enabled protocols will be the included protocols without the excluded protocols. If included-protocols is empty, it includes every supported protocol.
exclude-cipher-suites  –        Space-separated list of cipher suites to exclude.
include-cipher-suites  –        Space-separated list of cipher suites to include. The enabled cipher suites will be the included cipher suites without the excluded cipher suites. If included-cipher-suites is empty, it includes every supported cipher suite.
ipFilter               false    Set this to true to enable ipFiltering for netty.
ipFilterRules          –        Define N netty ipFilter pattern rules with this config.
Example for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
Example of ipFilterRules
ipFilterRules defines N netty ipFilters, separated by commas. A pattern rule must be in this format:
<'allow' or 'deny'>:<'ip' or 'name' for computer name>:<pattern>, i.e. allow/deny:ip/name:pattern
example: ipFilterRules=allow:ip:127.*,allow:name:localhost,deny:ip:*
Note that the first rule to match applies, as the examples below show for a client on the localhost.
"allow:name:localhost,deny:ip:*" will allow the client on localhost and deny clients from any other ip, whereas "deny:name:localhost,allow:ip:*" will deny the client on localhost and allow clients from any other ip.
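Wired into an Avro source config, this might look like the following sketch (rules as in the example above):

a1.sources.r1.type = avro
a1.sources.r1.ipFilter = true
a1.sources.r1.ipFilterRules = allow:ip:127.*,allow:name:localhost,deny:ip:*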
Thrift Source
Listens on Thrift port and receives events from external Thrift client streams.
When paired with the built-in ThriftSink on another (previous hop) Flume
agent, it can create tiered collection topologies. Thrift source can be configured
to start in secure mode by enabling kerberos authentication. agent-principal
and agent-keytab are the properties used by the Thrift source to authenticate
to the kerberos KDC. Required properties are in bold.
Property Name          Default  Description
channels               –
type                   –        The component type name, needs to be thrift
bind                   –        hostname or IP address to listen on
port                   –        Port # to bind to
threads                –        Maximum number of worker threads to spawn
selector.type
selector.*
interceptors           –        Space-separated list of interceptors
interceptors.*
ssl                    false    Set this to true to enable SSL encryption. If SSL is enabled, you must also specify a "keystore" and a "keystore-password", either through component level parameters (see below) or as global SSL parameters (see the SSL/TLS support section).
keystore               –        This is the path to a Java keystore file. If not specified here, then the global keystore will be used (if defined, otherwise configuration error).
keystore-password      –        The password for the Java keystore. If not specified here, then the global keystore password will be used (if defined, otherwise configuration error).
keystore-type          JKS      The type of the Java keystore. This can be "JKS" or "PKCS12". If not specified here, then the global keystore type will be used (if defined, otherwise the default is JKS).
exclude-protocols      SSLv3    Space-separated list of SSL/TLS protocols to exclude. SSLv3 will always be excluded in addition to the protocols specified.
include-protocols      –        Space-separated list of SSL/TLS protocols to include. The enabled protocols will be the included protocols without the excluded protocols. If included-protocols is empty, it includes every supported protocol.
exclude-cipher-suites  –        Space-separated list of cipher suites to exclude.
include-cipher-suites  –        Space-separated list of cipher suites to include. The enabled cipher suites will be the included cipher suites without the excluded cipher suites.
kerberos               false    Set to true to enable kerberos authentication. In kerberos mode, agent-principal and agent-keytab are required for successful authentication. The Thrift source in secure mode will accept connections only from Thrift clients that have kerberos enabled and are successfully authenticated to the kerberos KDC.
agent-principal        –        The kerberos principal used by the Thrift Source to authenticate to the kerberos KDC.
agent-keytab           –        The keytab location used by the Thrift Source in combination with the agent-principal to authenticate to the kerberos KDC.
Example for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = thrift
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
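A sketch of the same source in secure mode (the principal, keytab path and realm are placeholders):

a1.sources.r1.kerberos = true
a1.sources.r1.agent-principal = flume/thrifthost.example.com@EXAMPLE.COM
a1.sources.r1.agent-keytab = /path/to/keytabs/flume.keytab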
Exec Source
Exec source runs a given Unix command on start-up and expects that process to continuously produce data on standard out (stderr is simply discarded, unless the property logStdErr is set to true). If the process exits for any reason, the source also exits and will produce no further data. This means configurations such as cat [named pipe] or tail -F [file] are going to produce the desired results, whereas date probably will not - the former two commands produce streams of data whereas the latter produces a single event and exits.
Required properties are in bold.
Property Name    Default      Description
channels         –
type             –            The component type name, needs to be exec
command          –            The command to execute
shell            –            A shell invocation used to run the command, e.g. /bin/sh -c. Required only for commands relying on shell features like wildcards, back ticks, pipes etc.
restartThrottle  10000        Amount of time (in millis) to wait before attempting a restart
restart          false        Whether the executed cmd should be restarted if it dies
logStdErr        false        Whether the command's stderr should be logged
batchSize        20           The max number of lines to read and send to the channel at a time
batchTimeout     3000         Amount of time (in milliseconds) to wait, if the buffer size was not reached, before data is pushed downstream
selector.type    replicating  replicating or multiplexing
selector.*                    Depends on the selector.type value
interceptors     –            Space-separated list of interceptors
interceptors.*
Warning
The problem with ExecSource and other asynchronous sources is that the source cannot guarantee that if there is a failure to put the event into the Channel the client knows about it. In such cases, the data will be lost. For instance, one of the most commonly requested features is the tail -F [file]-like use case where an application writes to a log file on disk and Flume tails the file, sending each line as an event. While this is possible, there's an obvious problem; what happens if the channel fills up and Flume can't send an event? Flume has no way of indicating to the application writing the log file that it needs to retain the log or that the event hasn't been sent, for some reason. If this doesn't make sense, you need only know this: your application can never guarantee data has been received when using a unidirectional asynchronous interface such as ExecSource. As an extension of this warning, there is zero guarantee of event delivery when using this source. For stronger reliability guarantees, consider the Spooling Directory Source, Taildir Source or direct integration with Flume via the SDK.
Example for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1
The ‘shell’ config is used to invoke the ‘command’ through a command shell
(such as Bash or Powershell). The ‘command’ is passed as an argument to
‘shell’ for execution. This allows the ‘command’ to use features from the shell
such as wildcards, back ticks, pipes, loops, conditionals etc. In the absence of
the ‘shell’ config, the ‘command’ will be invoked directly. Common values for
‘shell’ : ‘/bin/sh -c’, ‘/bin/ksh -c’, ‘cmd /c’, ‘powershell -Command’, etc.
a1.sources.tailsource-1.type = exec
a1.sources.tailsource-1.shell = /bin/bash -c
a1.sources.tailsource-1.command = for i in /path/*.txt; do cat $i; done
JMS Source
JMS Source reads messages from a JMS destination such as a queue or topic.
Being a JMS application it should work with any JMS provider, but it has only been tested with ActiveMQ. The JMS source provides configurable batch size, message selector, user/pass, and message-to-flume-event converter. Note that the vendor-provided JMS jars should be included in the Flume classpath using the plugins.d directory (preferred), the --classpath option on the command line, or via the FLUME_CLASSPATH variable in flume-env.sh.
Required properties are in bold.
Property Name              Default  Description
channels                   –
type                       –        The component type name, needs to be jms
initialContextFactory      –        Initial Context Factory, e.g: org.apache.activemq.jndi.ActiveMQInitialContextFactory
connectionFactory          –        The JNDI name the connection factory should appear as
providerURL                –        The JMS provider URL
destinationName            –        Destination name
destinationType            –        Destination type (queue or topic)
messageSelector            –        Message selector to use when creating the consumer
userName                   –        Username for the destination/provider
passwordFile               –        File containing the password for the destination/provider
batchSize                  100      Number of messages to consume in one batch
converter.type             DEFAULT  Class to use to convert messages to flume events. See below.
converter.*                –        Converter properties.
converter.charset          UTF-8    Default converter only. Charset to use when converting JMS TextMessages to byte arrays.
createDurableSubscription  false    Whether to create a durable subscription. Durable subscription can only be used with destinationType topic. If true, "clientId" and "durableSubscriptionName" have to be specified.
clientId                   –        JMS client identifier set on Connection right after it is created. Required for durable subscriptions.
durableSubscriptionName    –        Name used to identify the durable subscription. Required for durable subscriptions.
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = jms
a1.sources.r1.channels = c1
a1.sources.r1.initialContextFactory =
org.apache.activemq.jndi.ActiveMQInitialContextFactory
a1.sources.r1.connectionFactory = GenericConnectionFactory
a1.sources.r1.providerURL = tcp://mqserver:61616
a1.sources.r1.destinationName = BUSINESS_DATA
a1.sources.r1.destinationType = QUEUE
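For a durable subscription on a topic, a sketch might add the following (the destination, clientId and subscription name are placeholders):

a1.sources.r1.destinationName = BUSINESS_EVENTS
a1.sources.r1.destinationType = TOPIC
a1.sources.r1.createDurableSubscription = true
a1.sources.r1.clientId = flume-jms-client-1
a1.sources.r1.durableSubscriptionName = flumeDurableSub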
Spooling Directory Source
This source lets you ingest data by placing files to be ingested into a "spooling" directory on disk. This source will watch the specified directory for new files, and will parse events out of new files as they appear. The event parsing logic is pluggable. After a given file has been fully read into the channel, completion is indicated by renaming the file (the default), by deleting it, or by recording it in the trackerDir, depending on configuration.
Unlike the Exec source, this source is reliable and will not miss data, even if
Flume is restarted or killed. In exchange for this reliability, only immutable,
uniquely-named files must be dropped into the spooling directory. Flume tries
to detect these problem conditions and will fail loudly if they are violated:
1. If a file is written to after being placed into the spooling directory, Flume
will print an error to its log file and stop processing.
2. If a file name is reused at a later time, Flume will print an error to its log
file and stop processing.
To avoid the above issues, it may be useful to add a unique identifier (such as
a timestamp) to log file names when they are moved into the spooling
directory.
Despite the reliability guarantees of this source, there are still cases in which
events may be duplicated if certain downstream failures occur. This is
consistent with the guarantees offered by other Flume components.
Property Name             Default      Description
channels                  –
type                      –            The component type name, needs to be spooldir.
spoolDir                  –            The directory from which to read files.
fileSuffix                .COMPLETED   Suffix to append to completely ingested files.
deletePolicy              never        When to delete completed files: never or immediate.
fileHeader                false        Whether to add a header storing the absolute path filename.
fileHeaderKey             file         Header key to use when appending absolute path filename to event header.
basenameHeader            false        Whether to add a header storing the basename of the file.
basenameHeaderKey         basename     Header key to use when appending basename of file to event header.
includePattern            ^.*$         Regular expression specifying which files to include. It can be used together with ignorePattern. If a file matches both ignorePattern and includePattern regex, the file is ignored.
ignorePattern             ^$           Regular expression specifying which files to ignore (skip). It can be used together with includePattern. If a file matches both ignorePattern and includePattern regex, the file is ignored.
trackerDir                .flumespool  Directory to store metadata related to processing of files. If this path is not an absolute path, then it is interpreted as relative to the spoolDir.
trackingPolicy            rename       The tracking policy defines how file processing is tracked. It can be "rename" or "tracker_dir". This parameter is only effective if the deletePolicy is "never". "rename" - After processing, files get renamed according to the fileSuffix parameter. "tracker_dir" - Files are not renamed but a new empty file is created in the trackerDir. The new tracker file name is derived from the ingested one plus the fileSuffix.
consumeOrder              oldest       The order in which files in the spooling directory will be consumed: oldest, youngest or random. In case of oldest and youngest, the last modified time of the files will be used to compare the files. In case of a tie, the file with the smallest lexicographical order will be consumed first. In case of random, any file will be picked randomly. When using oldest and youngest, the whole directory will be scanned to pick the oldest/youngest file, which might be slow if there are a large number of files, while using random may cause old files to be consumed very late if new files keep coming in the spooling directory.
pollDelay                 500          Delay (in milliseconds) used when polling for new files.
recursiveDirectorySearch  false        Whether to monitor sub directories for new files to read.
maxBackoff                4000         The maximum time (in millis) to wait between consecutive attempts to write to the channel(s) if the channel is full. The source will start at a low backoff and increase it exponentially each time the channel throws a ChannelException, up to the value specified by this parameter.
batchSize                 100          Granularity at which to batch transfer to the channel.
inputCharset              UTF-8        Character set used by deserializers that treat the input file as text.
decodeErrorPolicy         FAIL         What to do when we see a non-decodable character in the input file. FAIL: Throw an exception and fail to parse the file. REPLACE: Replace the unparseable character with the "replacement character" char, typically Unicode U+FFFD. IGNORE: Drop the unparseable character sequence.
deserializer              LINE         Specify the deserializer used to parse the file into events. Defaults to parsing each line as an event. The class specified must implement EventDeserializer.Builder.
deserializer.*                         Varies per event deserializer.
bufferMaxLines            –            (Obsolete) This option is now ignored.
bufferMaxLineLength       5000         (Deprecated) Maximum length of a line in the commit buffer. Use deserializer.maxLineLength instead.
selector.type             replicating  replicating or multiplexing
selector.*                             Depends on the selector.type value
interceptors              –            Space-separated list of interceptors
interceptors.*
Example for an agent named a1:
a1.channels = ch-1
a1.sources = src-1
a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
a1.sources.src-1.fileHeader = true
Event Deserializers
The following event deserializers ship with Flume.
LINE
This deserializer generates one event per line of text input.
AVRO
This deserializer is able to read an Avro container file, and it generates one
event per Avro record in the file. Each event is annotated with a header that
indicates the schema used. The body of the event is the binary Avro record
data, not including the schema or the rest of the container file elements.
Note that if the spool directory source must retry putting one of these events
onto a channel (for example, because the channel is full), then it will reset and
retry from the most recent Avro container file sync point. To reduce potential
event duplication in such a failure scenario, write sync markers more
frequently in your Avro input files.
BlobDeserializer
This deserializer reads a Binary Large Object (BLOB) per event, typically one
BLOB per file. For example a PDF or JPG file. Note that this approach is not
suitable for very large objects because the entire BLOB is buffered in RAM.
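A sketch of selecting this deserializer on a spooling directory source; the builder class name below is the one shipped in the morphline solr sink module, and the length limit is illustrative:

a1.sources.src-1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
a1.sources.src-1.deserializer.maxBlobLength = 100000000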
Taildir Source
Note
This source is provided as a preview feature. It does not work on Windows.
Watch the specified files, and tail them in near real-time once new lines are detected appended to each file. If the new lines are still being written, this source will retry reading them while waiting for the write to complete.
This source is reliable and will not miss data even when the tailed files rotate. It periodically writes the last read position of each file to the given position file in JSON format. If Flume is stopped or down for some reason, it can restart tailing from the position written in the existing position file.
In other use cases, this source can also start tailing from an arbitrary position for each file using the given position file. When there is no position file on the specified path, it will start tailing from the first line of each file by default.
Files will be consumed in order of their modification time. The file with the oldest modification time will be consumed first.
This source does not rename or delete or do any modifications to the file being tailed. Currently this source does not support tailing binary files. It reads text files line by line.
Property Name                        Default                         Description
channels                             –
type                                 –                               The component type name, needs to be TAILDIR.
filegroups                           –                               Space-separated list of file groups. Each file group indicates a set of files to be tailed.
filegroups.<filegroupName>           –                               Absolute path of the file group. Regular expression (and not file system patterns) can be used for filename only.
positionFile                         ~/.flume/taildir_position.json  File in JSON format to record the inode, the absolute path and the last position of each tailing file.
headers.<filegroupName>.<headerKey>  –                               Header value which is set with the header key. Multiple headers can be specified for one file group.
byteOffsetHeader                     false                           Whether to add the byte offset of a tailed line to a header called 'byteoffset'.
skipToEnd                            false                           Whether to skip the position to EOF in the case of files not written on the position file.
idleTimeout                          120000                          Time (ms) to close inactive files. If the closed file has new lines appended to it, this source will automatically re-open it.
writePosInterval                     3000                            Interval time (ms) to write the last position of each file on the position file.
batchSize                            100                             Max number of lines to read and send to the channel at a time. Using the default is usually fine.
maxBatchCount                        Long.MAX_VALUE                  Controls the number of batches being read consecutively from the same file. If the source is tailing multiple files and one of them is written at a fast rate, it can prevent other files from being processed, because the busy file would be read in an endless loop. In this case lower this value.
backoffSleepIncrement                1000                            The increment for time delay before reattempting to poll for new data, when the last attempt did not find any new data.
maxBackoffSleep                      5000                            The max time delay between each reattempt to poll for new data, when the last attempt did not find any new data.
cachePatternMatching                 true                            Listing directories and applying the filename regex pattern may be time consuming for directories containing thousands of files. Caching the list of matching files can improve performance. The order in which files are consumed will also be cached. Requires that the file system keeps track of modification times with at least a 1-second granularity.
fileHeader                           false                           Whether to add a header storing the absolute path filename.
fileHeaderKey                        file                            Header key to use when appending absolute path filename to event header.
Example for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /var/log/test1/example.log
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.filegroups.f2 = /var/log/test2/.*log.*
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.headers.f2.headerKey2 = value2-2
a1.sources.r1.fileHeader = true
a1.sources.r1.maxBatchCount = 1000
Twitter 1% firehose Source (experimental)
Warning
This source is highly experimental and may change between minor versions of Flume. Use at your own risk.
Experimental source that connects via the Streaming API to the 1% sample twitter firehose, continuously downloads tweets, converts them to Avro format and sends Avro events to a downstream Flume sink. Requires the consumer and access tokens and secrets of a Twitter developer account. Required properties are in bold.
Property Name           Default  Description
channels                –
type                    –        The component type name, needs to be org.apache.flume.source.twitter.TwitterSource
consumerKey             –        OAuth consumer key
consumerSecret          –        OAuth consumer secret
accessToken             –        OAuth access token
accessTokenSecret       –        OAuth token secret
maxBatchSize            1000     Maximum number of twitter messages to put in a single batch
maxBatchDurationMillis  1000     Maximum number of milliseconds to wait before closing a batch
Example for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = org.apache.flume.source.twitter.TwitterSource
a1.sources.r1.channels = c1
a1.sources.r1.consumerKey = YOUR_TWITTER_CONSUMER_KEY
a1.sources.r1.consumerSecret = YOUR_TWITTER_CONSUMER_SECRET
a1.sources.r1.accessToken = YOUR_TWITTER_ACCESS_TOKEN
a1.sources.r1.accessTokenSecret = YOUR_TWITTER_ACCESS_TOKEN_SECRET
a1.sources.r1.maxBatchSize = 10
a1.sources.r1.maxBatchDurationMillis = 200
Kafka Source
Kafka Source is an Apache Kafka consumer that reads messages from Kafka
topics. If you have multiple Kafka sources running, you can configure them
with the same Consumer Group so each will read a unique set of partitions for
the topics. This currently supports Kafka server releases 0.10.1.0 or higher.
Testing was done up to 2.0.1, which was the highest available version at the time of the release.
Note
The Kafka Source overrides two Kafka consumer parameters: auto.commit.enable is set to "false" by the source, and every batch is committed. The Kafka source guarantees an at-least-once strategy of message retrieval. The duplicates can be present when the source starts. The Kafka Source also provides defaults for the key.deserializer (org.apache.kafka.common.serialization.StringSerializer) and value.deserializer (org.apache.kafka.common.serialization.ByteArraySerializer). Modification of these parameters is not recommended.
Deprecated Properties
Property Name            Default  Description
topic                    –        Use kafka.topics
groupId                  flume    Use kafka.consumer.group.id
zookeeperConnect         –        Is no longer supported by the kafka consumer client since 0.9.x. Use kafka.bootstrap.servers to establish connection with the kafka cluster.
migrateZookeeperOffsets  true     When no Kafka stored offset is found, look up the offsets in Zookeeper and commit them to Kafka. This should be true to support seamless Kafka client migration from older versions of Flume. Once migrated this can be set to false, though that should generally not be required. If no Zookeeper offset is found, the Kafka configuration kafka.consumer.auto.offset.reset defines how offsets are handled. Check Kafka documentation for details.
Example for topic subscription by comma-separated topic list.
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.channels = channel1
tier1.sources.source1.batchSize = 5000
tier1.sources.source1.batchDurationMillis = 2000
tier1.sources.source1.kafka.bootstrap.servers = localhost:9092
tier1.sources.source1.kafka.topics = test1, test2
tier1.sources.source1.kafka.consumer.group.id = custom.g.id
Example for topic subscription by regex:
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.channels = channel1
tier1.sources.source1.kafka.bootstrap.servers = localhost:9092
tier1.sources.source1.kafka.topics.regex = ^topic[0-9]$
# the default kafka.consumer.group.id=flume is used
Warning
There is a performance degradation when SSL is enabled, the magnitude of which depends on the CPU type and the JVM implementation. Reference: Kafka security overview and the jira for tracking this issue: KAFKA-2561.
Example configuration with server side authentication and data encryption:
a1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.source1.kafka.bootstrap.servers = kafka-1:9093,kafka-2:9093,kafka-3:9093
a1.sources.source1.kafka.topics = mytopic
a1.sources.source1.kafka.consumer.group.id = flume-consumer
a1.sources.source1.kafka.consumer.security.protocol = SSL
# optional, the global truststore can be used alternatively
a1.sources.source1.kafka.consumer.ssl.truststore.location = /path/to/truststore.jks
a1.sources.source1.kafka.consumer.ssl.truststore.password = <password to access the truststore>
Specifying the truststore is optional here; the global truststore can be used instead. For more details about the global SSL setup, see the SSL/TLS support section.
Note: By default the property ssl.endpoint.identification.algorithm is not defined, so hostname verification is not performed. In order to enable hostname verification, set the following property:
a1.sources.source1.kafka.consumer.ssl.endpoint.identification.algorithm = HTTPS
Once enabled, clients will verify the server’s fully qualified domain name
(FQDN) against one of the following two fields:
1. Common Name (CN) https://tools.ietf.org/html/rfc6125#section-2.3
2. Subject Alternative Name
(SAN) https://tools.ietf.org/html/rfc5280#section-4.2.1.6
If client side authentication is also required then additionally the following
needs to be added to Flume agent configuration or the global SSL setup can be
used (see SSL/TLS support section). Each Flume agent has to have its client
certificate which has to be trusted by Kafka brokers either individually or by
their signature chain. Common example is to sign each client certificate by a
single Root CA which in turn is trusted by Kafka brokers.
# optional, the global keystore can be used alternatively
a1.sources.source1.kafka.consumer.ssl.keystore.location = /path/to/client.keystore.jks
a1.sources.source1.kafka.consumer.ssl.keystore.password = <password to access the keystore>
# needed only if the key is protected with a password different from the keystore password
a1.sources.source1.kafka.consumer.ssl.key.password = <password to access the key>
For Kerberos, the JAAS file's location (and optionally the system-wide kerberos configuration) can be set in flume-env.sh:
JAVA_OPTS="$JAVA_OPTS -Djava.security.krb5.conf=/path/to/krb5.conf"
JAVA_OPTS="$JAVA_OPTS -Djava.security.auth.login.config=/path/to/flume_jaas.conf"
Example secure configuration using SASL_PLAINTEXT:
a1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.source1.kafka.bootstrap.servers = kafka-1:9093,kafka-2:9093,kafka-3:9093
a1.sources.source1.kafka.topics = mytopic
a1.sources.source1.kafka.consumer.group.id = flume-consumer
a1.sources.source1.kafka.consumer.security.protocol = SASL_PLAINTEXT
a1.sources.source1.kafka.consumer.sasl.mechanism = GSSAPI
a1.sources.source1.kafka.consumer.sasl.kerberos.service.name = kafka
Example secure configuration using SASL_SSL:
a1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.source1.kafka.bootstrap.servers = kafka-1:9093,kafka-2:9093,kafka-3:9093
a1.sources.source1.kafka.topics = mytopic
a1.sources.source1.kafka.consumer.group.id = flume-consumer
a1.sources.source1.kafka.consumer.security.protocol = SASL_SSL
a1.sources.source1.kafka.consumer.sasl.mechanism = GSSAPI
a1.sources.source1.kafka.consumer.sasl.kerberos.service.name = kafka
# optional, the global truststore can be used alternatively
a1.sources.source1.kafka.consumer.ssl.truststore.location = /path/to/truststore.jks
a1.sources.source1.kafka.consumer.ssl.truststore.password = <password to access the truststore>
Sample JAAS file. For reference of its content please see client config sections
of the desired authentication mechanism (GSSAPI/PLAIN) in Kafka
documentation of SASL configuration. Since the Kafka Source may also connect
to Zookeeper for offset migration, the “Client” section was also added to this
example. This won’t be needed unless you require offset migration, or you
require this section for other secure components. Also please make sure that
the operating system user of the Flume processes has read privileges on the
jaas and keytab files.
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
keyTab="/path/to/keytabs/flume.keytab"
principal="flume/flumehost1.example.com@YOURKERBEROSREALM";
};
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
keyTab="/path/to/keytabs/flume.keytab"
principal="flume/flumehost1.example.com@YOURKERBEROSREALM";
};
NetCat TCP Source
A netcat-like source that listens on a given port and turns each line of text into an event. Acts like nc -k -l [host] [port]. In other words, it opens a specified port and listens for data. The expectation is that the supplied data is newline separated text. Each line of text is turned into a Flume event and sent via the connected channel.
Required properties are in bold.
Property Name    Default      Description
channels         –
type             –            The component type name, needs to be netcat
bind             –            Host name or IP address to bind to
port             –            Port # to bind to
max-line-length  512          Max line length per event body (in bytes)
ack-every-event  true         Respond with an "OK" for every event received
selector.type    replicating  replicating or multiplexing
selector.*                    Depends on the selector.type value
interceptors     –            Space-separated list of interceptors
interceptors.*
Example for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1
NetCat UDP Source
As per the original NetCat (TCP) source, this source listens on a given port and turns each line of text into an event that is sent via the connected channel. Acts like nc -u -k -l [host] [port].
Required properties are in bold.
Property Name        Default      Description
channels             –
type                 –            The component type name, needs to be netcatudp
bind                 –            Host name or IP address to bind to
port                 –            Port # to bind to
remoteAddressHeader  –
selector.type        replicating  replicating or multiplexing
selector.*                        Depends on the selector.type value
interceptors         –            Space-separated list of interceptors
interceptors.*
Example for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = netcatudp
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1
Sequence Generator Source
A simple sequence generator that continuously generates events with a counter that starts from 0 and increments by 1; useful mainly for testing. Example for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = seq
a1.sources.r1.channels = c1
Syslog Sources
Reads syslog data and generates Flume events. The UDP source treats an entire message as a single event. The TCP sources create a new event for each string of characters separated by a newline ('\n').
Required properties are in bold.
Syslog TCP Source
For example, a syslog TCP source for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 5140
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1
Multiport Syslog TCP Source
This is a newer, faster, multi-port capable version of the Syslog TCP source.
Note that the ports configuration setting has replaced port. Multi-port
capability means that it can listen on many ports at once in an efficient
manner. This source uses the Apache Mina library to do that. Provides support
for RFC-3164 and many common RFC-5424 formatted messages. Also provides
the capability to configure the character set used on a per-port basis.
Property Name          Default          Description
channels               –
type                   –                The component type name, needs to be multiport_syslogtcp
host                   –                Host name or IP address to bind to.
ports                  –                Space-separated list (one or more) of ports to bind to.
eventSize              2500             Maximum size of a single event line, in bytes.
keepFields             none             Setting this to 'all' will preserve the Priority, Timestamp and Hostname in the body of the event. A space-separated list of fields to include is allowed as well. Currently, the following fields can be included: priority, version, timestamp, hostname. The values 'true' and 'false' have been deprecated in favor of 'all' and 'none'.
portHeader             –                If specified, the port number will be stored in the header of each event using the header name specified here. This allows for interceptors and channel selectors to customize routing logic based on the incoming port.
clientIPHeader         –                If specified, the IP address of the client will be stored in the header of each event using the header name specified here. This allows for interceptors and channel selectors to customize routing logic based on the IP address of the client. Do not use the standard Syslog header names here (like _host_) because the event header will be overridden in that case.
clientHostnameHeader   –                If specified, the host name of the client will be stored in the header of each event using the header name specified here. This allows for interceptors and channel selectors to customize routing logic based on the host name of the client. Retrieving the host name may involve a name service reverse lookup which may affect the performance. Do not use the standard Syslog header names here (like _host_) because the event header will be overridden in that case.
charset.default        UTF-8            Default character set used while parsing syslog events into strings.
charset.port.<port>    –                Character set is configurable on a per-port basis.
batchSize              100              Maximum number of events to attempt to process per request loop. Using the default is usually fine.
readBufferSize         1024             Size of the internal Mina read buffer. Provided for performance tuning. Using the default is usually fine.
numProcessors          (auto-detected)  Number of processors available on the system for use while processing messages. Default is to auto-detect # of CPUs using the Java Runtime API. Mina will spawn 2 request-processing threads per detected CPU, which is often reasonable.
selector.type          replicating      replicating, multiplexing, or custom
selector.*             –                Depends on the selector.type value
interceptors           –                Space-separated list of interceptors.
interceptors.*
ssl                    false            Set this to true to enable SSL encryption. If SSL is enabled, you must also specify a "keystore" and a "keystore-password", either through component level parameters (see below) or as global SSL parameters (see the SSL/TLS support section).
keystore               –                This is the path to a Java keystore file. If not specified here, then the global keystore will be used (if defined, otherwise configuration error).
keystore-password      –                The password for the Java keystore. If not specified here, then the global keystore password will be used (if defined, otherwise configuration error).
keystore-type          JKS              The type of the Java keystore. This can be "JKS" or "PKCS12". If not specified here, then the global keystore type will be used (if defined, otherwise the default is JKS).
exclude-protocols      SSLv3            Space-separated list of SSL/TLS protocols to exclude. SSLv3 will always be excluded in addition to the protocols specified.
include-protocols      –                Space-separated list of SSL/TLS protocols to include. The enabled protocols will be the included protocols without the excluded protocols. If included-protocols is empty, it includes every supported protocol.
exclude-cipher-suites  –                Space-separated list of cipher suites to exclude.
include-cipher-suites  –                Space-separated list of cipher suites to include. The enabled cipher suites will be the included cipher suites without the excluded cipher suites. If included-cipher-suites is empty, it includes every supported cipher suite.
For example, a multiport syslog TCP source for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = multiport_syslogtcp
a1.sources.r1.channels = c1
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.ports = 10001 10002 10003
a1.sources.r1.portHeader = port
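Character sets can also be set per port on the same source, e.g. (ports and charsets here are illustrative):

a1.sources.r1.charset.default = UTF-8
a1.sources.r1.charset.port.10001 = ISO-8859-1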
Syslog UDP Source
For example, a syslog UDP source for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = syslogudp
a1.sources.r1.port = 5140
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1
HTTP Source
A source which accepts Flume Events by HTTP POST and GET. GET should be used for experimentation only. HTTP requests are converted into flume events by a pluggable "handler" which must implement the HTTPSourceHandler interface. This handler takes a HttpServletRequest and returns a list of flume events. All events handled from one Http request are committed to the channel in one transaction, thus allowing for increased efficiency on channels like the file channel. If the handler throws an exception, this source will return an HTTP status of 400. If the channel is full, or the source is unable to append events to the channel, the source will return an HTTP 503 - Temporarily unavailable status.
All events sent in one post request are considered to be one batch and inserted
into the channel in one transaction.
This source is based on Jetty 9.4 and offers the ability to set additional Jetty-
specific parameters which will be passed directly to the Jetty components.
Deprecated Properties
Property Name     Default  Description
keystorePassword  –        Use keystore-password. Deprecated value will be overwritten with the new one.
excludeProtocols  SSLv3    Use exclude-protocols. Deprecated value will be overwritten with the new one.
enableSSL         false    Use ssl. Deprecated value will be overwritten with the new one.
N.B. Jetty-specific settings are set using the setter-methods on the objects listed above. For full details see the Javadoc for these classes (QueuedThreadPool, HttpConfiguration, SslContextFactory and ServerConnector).
When using Jetty-specific settings, the named properties above will take precedence (for example excludeProtocols will take precedence over SslContextFactory.ExcludeProtocols). All properties will be in initial lower case.
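A sketch of setting Jetty-specific parameters on an HTTP source; the property names follow the setter names of the Jetty classes above, and the values are illustrative:

a1.sources.r1.HttpConfiguration.sendServerVersion = false
a1.sources.r1.ServerConnector.idleTimeout = 300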
An example http source for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = http
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1
a1.sources.r1.handler = org.example.rest.RestHandler