HDFS
HDFS creates an abstraction of resources; let me simplify it for you. Similar to
virtualization, you can see HDFS logically as a single unit for storing Big Data, but
in reality you are storing your data across multiple nodes in a distributed fashion.
HDFS follows a master-slave architecture: the NameNode is the master node and the
DataNodes are the slaves.
NameNode
It is the master daemon that maintains and manages the DataNodes (slave nodes). It
records the metadata of all the files stored in the cluster, e.g. the location of stored
blocks, the size of the files, permissions, hierarchy, etc. It records each and every
change that takes place to the file system metadata.
For example, if a file is deleted in HDFS, the NameNode will immediately record this
in the EditLog. It regularly receives a Heartbeat and a block report from all the
DataNodes in the cluster to ensure that the DataNodes are alive. It keeps a record of
all the blocks in HDFS and of the nodes on which these blocks are stored.
DataNode
These are slave daemons which run on each slave machine. The actual data is
stored on DataNodes. They are responsible for serving read and write requests from
the clients. They are also responsible for creating, deleting, and replicating blocks
based on the decisions taken by the NameNode.
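To make the division of labor concrete, here is a minimal sketch of talking to HDFS
from Python through the WebHDFS interface, using the hdfs package from PyPI; the
NameNode address, port and user below are assumptions for illustration.

from hdfs import InsecureClient  # WebHDFS client from the hdfs PyPI package

# NameNode WebHDFS address and user are assumptions for this sketch
client = InsecureClient('http://namenode-host:50070', user='hadoop')

# The client asks the NameNode where blocks live; the bytes themselves
# are streamed to and from the DataNodes.
client.upload('/data/words.txt', 'words.txt')
print(client.list('/data'))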
YARN
YARN performs all your processing activities by allocating resources and scheduling
tasks. It has two major daemons, i.e. ResourceManager and NodeManager.
ResourceManager
It is a cluster-level component (one for each cluster) and runs on the master
machine. It manages resources and schedules applications running on top of YARN.
NodeManager
It is a node-level component (one on each node) and runs on each slave machine. It
is responsible for managing containers and monitoring resource utilization in each
container. It also keeps track of node health and log management. It continuously
communicates with the ResourceManager to remain up-to-date. On top of this stack,
you can perform parallel processing on HDFS using MapReduce.
To learn more about Hadoop, you can go through this Hadoop Tutorial blog. Now that
we are all set with the Hadoop introduction, let's move on to the Spark introduction.
Apache Spark vs Hadoop: Introduction to Apache Spark
Apache Spark is a framework for real-time data analytics in a distributed computing
environment. It executes in-memory computations to increase the speed of data
processing. It is faster for processing large-scale data as it exploits in-memory
computations and other optimizations. As a consequence, it requires high processing
power.
1. Spark Core – Spark Core is the base engine for large-scale parallel and
distributed data processing. Additional libraries built atop the core allow
diverse workloads for streaming, SQL, and machine learning. It is responsible
for memory management and fault recovery; scheduling, distributing and
monitoring jobs on a cluster; and interacting with storage systems.
2. Spark Streaming – Spark Streaming is the component of Spark which is used to
process real-time streaming data. Thus, it is a useful addition to the core
Spark API. It enables high-throughput and fault-tolerant stream processing of
live data streams
3. Spark SQL: Spark SQL is a new module in Spark which integrates relational
processing with Spark’s functional programming API. It supports querying
data either via SQL or via the Hive Query Language. For those of you familiar
with RDBMS, Spark SQL will be an easy transition from your earlier tools
where you can extend the boundaries of traditional relational data
processing.
4. GraphX: GraphX is the Spark API for graphs and graph-parallel computation.
It extends the Spark RDD abstraction by introducing the Resilient Distributed
Property Graph: a directed multigraph with properties attached to each vertex
and edge.
5. MLlib (Machine Learning): MLlib stands for Machine Learning Library. Spark
MLlib is used to perform machine learning in Apache Spark.
As you can see, Spark comes packed with high-level libraries, including support for
R, SQL, Python, Scala, Java, etc. These standard libraries enable seamless
integration in complex workflows. On top of this, it also allows various sets of
services to integrate with it, like MLlib, GraphX, SQL + Data Frames, and Streaming
services, to increase its capabilities.
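As a quick illustration of the Spark Core API described above, here is a minimal
PySpark word count sketch; the local master and the input path /tmp/words.txt are
assumptions for illustration.

from pyspark import SparkContext

sc = SparkContext("local[2]", "wordcount-sketch")   # local master, 2 threads (assumption)
lines = sc.textFile("/tmp/words.txt")               # assumed input file
counts = (lines.flatMap(lambda line: line.split())  # split lines into words
               .map(lambda word: (word, 1))         # pair each word with a count of 1
               .reduceByKey(lambda a, b: a + b))    # sum the counts per word
print(counts.collect())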
To learn more about Apache Spark, you can go through this Spark Tutorial blog. Now
the ground is all set for Apache Spark vs Hadoop. Let’s move ahead and compare
Apache Spark with Hadoop on different parameters to understand their strengths.
Performance
Spark is fast because of its in-memory processing. It can also use disk for data that
doesn't all fit into memory. Spark's in-memory processing delivers near real-time
analytics. This makes Spark suitable for credit card processing systems, machine
learning, security analytics and Internet of Things sensors.
Hadoop was originally set up to continuously gather data from multiple sources
without worrying about the type of data, and to store it across a distributed
environment. MapReduce uses batch processing. MapReduce was never built for
real-time processing; the main idea behind YARN is parallel processing over a
distributed dataset.
The problem with comparing the two is that they perform processing differently.
Ease of Use
Spark comes with user-friendly APIs for Scala, Java, Python, and Spark SQL. Spark
SQL is very similar to SQL, so it becomes easier for SQL developers to learn it. Spark
also provides an interactive shell for developers to query, perform other actions, and
get immediate feedback.
You can ingest data into Hadoop easily, either by using the shell or by integrating it
with multiple tools like Sqoop and Flume. YARN is just a processing framework, and
it can be integrated with multiple tools like Hive and Pig. Hive is a data warehousing
component which performs reading, writing and managing of large data sets in a
distributed environment using a SQL-like interface. You can go through this Hadoop
ecosystem blog to learn about the various tools that can be integrated with Hadoop.
Costs
Hadoop and Spark are both Apache open source projects, so there's no cost for the
software. Cost is only associated with the infrastructure. Both products are
designed in such a way that they can run on commodity hardware with a low TCO.
Now you may be wondering about the ways in which they are different. Storage and
processing in Hadoop is disk-based, and Hadoop uses standard amounts of memory.
So, with Hadoop we need a lot of disk space as well as faster disks. Hadoop also
requires multiple systems to distribute the disk I/O.
Due to Apache Spark's in-memory processing, it requires a lot of memory, but it can
deal with a standard speed and amount of disk. Disk space is a relatively
inexpensive commodity, and since Spark does not use disk I/O for processing, it
instead requires large amounts of RAM for executing everything in memory. Thus, a
Spark system incurs more cost.
But yes, one important thing to keep in mind is that Spark's technology reduces the
number of required systems. It needs significantly fewer systems, each of which
costs more. So, there will be a point at which Spark reduces the cost per unit of
computation even with the additional RAM requirement.
Data Processing
There are two types of data processing: Batch Processing & Stream Processing.
Batch Processing: Batch processing has been crucial to the big data world. In the
simplest terms, batch processing is working with high data volumes collected over a
period. In batch processing, data is first collected, and then processed results are
produced at a later stage.
Batch processing is an efficient way of processing large, static data sets. Generally,
we perform batch processing for archived data sets, for example, calculating the
average income of a country or evaluating the change in e-commerce over the last
decade.
Stream processing: Stream processing is the current trend in the big data world. The
need of the hour is speed and real-time information, which is what stream processing
delivers. Because batch processing does not allow businesses to quickly react to
changing business needs in real time, stream processing has seen rapid growth in
demand.
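As a small sketch of what stream processing looks like in Spark, the following
PySpark Streaming snippet counts words over one-second micro-batches; the socket
source on localhost:9999 is an assumption for illustration.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-sketch")
ssc = StreamingContext(sc, batchDuration=1)          # one-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)      # assumed live text source
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                      # print each batch's counts
ssc.start()
ssc.awaitTermination()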
Spark performs similar operations, but it uses in-memory processing and optimizes
the steps. GraphX allows users to view the same data as graphs and as collections.
Users can also transform and join graphs with Resilient Distributed Datasets (RDDs).
Fault Tolerance
Hadoop and Spark both provide fault tolerance, but they take different approaches.
For both HDFS and YARN, the master daemons (i.e. the NameNode and the
ResourceManager, respectively) check the heartbeats of the slave daemons (i.e. the
DataNodes and the NodeManagers, respectively). If any slave daemon fails, the
master daemon reschedules all pending and in-progress operations to another slave.
This method is effective, but it can significantly increase the completion times of
operations even with a single failure. As Hadoop uses commodity hardware, another
way in which HDFS ensures fault tolerance is by replicating data.
As we discussed above, RDDs are the building blocks of Apache Spark, and RDDs
provide fault tolerance to Spark. They can refer to any dataset present in an external
storage system like HDFS, HBase, or a shared filesystem, and they can be operated
on in parallel. RDDs can persist a dataset in memory across operations, which makes
future actions up to 10 times faster. If an RDD is lost, it will automatically be
recomputed by using the original transformations. This is how Spark provides
fault tolerance.
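A minimal sketch of this behavior: persisting an RDD keeps its partitions in memory
across actions, and any lost partition is rebuilt from the recorded lineage; the HDFS
path below is an assumption.

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "lineage-sketch")
raw = sc.textFile("hdfs:///tmp/words.txt")       # lineage starts at stable storage (assumed path)
counts = (raw.flatMap(lambda l: l.split())
             .map(lambda w: (w, 1))
             .reduceByKey(lambda a, b: a + b))
counts.persist(StorageLevel.MEMORY_ONLY)         # keep partitions in memory across actions
print(counts.count())                            # first action materializes the RDD
print(counts.take(5))                            # reuses cached partitions; lost ones are recomputed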
Security
Spark currently supports authentication via a shared secret. Spark can integrate
with HDFS and it can use HDFS ACLs and file-level permissions. Spark can also run
on YARN leveraging the capability of Kerberos.
Real-time data analysis means processing data generated by real-time event
streams coming in at the rate of millions of events per second, Twitter data for
instance. The strength of Spark lies in its ability to support streaming of data along
with distributed processing. This is a useful combination that delivers near real-
time processing of data. MapReduce lacks such an advantage, as it was designed to
perform batch-cum-distributed processing on large amounts of data. Real-time data
can still be processed on MapReduce, but its speed is nowhere close to that of
Spark.
Spark claims to process data 100x faster than MapReduce in memory, and 10x faster
when going to disk.
Graph Processing:
Most graph processing algorithms, like PageRank, perform multiple iterations over
the same data, and this requires a message passing mechanism. We need to program
MapReduce explicitly to handle such multiple iterations over the same data.
Roughly, it works like this: read data from the disk and, after a particular iteration,
write results to HDFS, and then read the data from HDFS for the next iteration.
This is very inefficient since it involves reading and writing data to the disk, which
involves heavy I/O operations and data replication across the cluster for fault
tolerance. Also, each MapReduce iteration has very high latency, and the next
iteration can begin only after the previous job has completely finished.
Also, evaluating the score of a particular node requires message passing, carrying
the scores of its neighboring nodes. These computations need messages from a
node's neighbors (or data across multiple stages of the job), a mechanism that
MapReduce lacks.
Different graph processing tools such as Pregel and GraphLab were designed in
order to address the need for an efficient platform for graph processing algorithms.
These tools are fast and scalable, but are not efficient for creation and post-
processing of these complex multi-stage algorithms.
Almost all machine learning algorithms work iteratively. As we have seen earlier,
iterative algorithms involve I/O bottlenecks in MapReduce implementations.
MapReduce uses coarse-grained tasks (task-level parallelism) that are too heavy for
iterative algorithms. Spark, with the help of Mesos (a distributed system kernel),
caches the intermediate dataset after each iteration and runs multiple iterations on
this cached dataset, which reduces the I/O and helps to run the algorithm faster in a
fault-tolerant manner.
Spark has a built-in scalable machine learning library called MLlib, which contains
high-quality algorithms that leverage iterations and yield better results than the
one-pass approximations sometimes used on MapReduce.
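For instance, a minimal MLlib sketch of an iterative algorithm: k-means re-reads the
same points on every iteration, so caching the RDD is what keeps the loop fast; the
toy points are made up for illustration.

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext("local", "mllib-sketch")
points = sc.parallelize([[0.0, 0.0], [1.0, 1.0],           # toy data (assumption)
                         [9.0, 8.0], [8.0, 9.0]]).cache()  # cached: re-read each iteration
model = KMeans.train(points, 2, maxIterations=10)          # iterates over the cached dataset
print(model.clusterCenters)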
Resource sharing across the cluster increases throughput and utilization (source)
Mesos is essentially a data center kernel, which means it is the software that actually
isolates the running workloads from each other. It still needs additional tooling to let
engineers get their workloads running on the system and to manage when those jobs
actually run. Otherwise, some workloads might consume all the resources, or
important workloads might get bumped by less important workloads that happen to
require more resources. Hence Mesos needs more than just a kernel: the Chronos
scheduler, a cron replacement for automatically starting and stopping services (and
handling failures), runs on top of Mesos. The other part of the Mesos ecosystem is
Marathon, which provides an API for starting, stopping and scaling services (and
Chronos could be one of those services).
Workloads in Chronos and Marathon (source)
Architecture
Mesos consists of a master process that manages slave daemons running on each cluster node, and
frameworks that run tasks on these slaves. The master implements fine-grained sharing across
frameworks using resource offers. Each resource offer is a list of free resources on multiple slaves.
The master decides how many resources to offer to each framework according to an organizational
policy, such as fair sharing or priority. To support a diverse set of inter-framework allocation
policies, Mesos lets organizations define their own policies via a pluggable allocation module.
Mesos architecture with two running frameworks (source)
Each framework running on Mesos consists of two components: a scheduler that registers with the
master to be offered resources, and an executor process that is launched on slave nodes to run the
framework's tasks. While the master determines how many resources to offer to each framework,
the frameworks' schedulers select which of the offered resources to use. When a framework accepts
offered resources, it passes Mesos a description of the tasks it wants to launch on them.
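The offer flow can be sketched as a framework scheduler. The snippet below uses the
legacy Mesos Python bindings (mesos.interface / mesos.native) and is only a sketch
of accepting or declining offers, not a complete framework; the resource amounts and
the command are assumptions.

from mesos.interface import Scheduler, mesos_pb2     # legacy Python bindings (assumption)

class SleepScheduler(Scheduler):
    def resourceOffers(self, driver, offers):
        for offer in offers:
            cpus = sum(r.scalar.value for r in offer.resources if r.name == 'cpus')
            if cpus < 1:
                driver.declineOffer(offer.id)        # hand the offer back to the master
                continue
            task = mesos_pb2.TaskInfo()              # describe one task on this slave
            task.task_id.value = 'task-' + offer.id.value
            task.slave_id.value = offer.slave_id.value
            task.name = 'sleep'
            task.command.value = 'sleep 10'
            cpu = task.resources.add()
            cpu.name, cpu.type = 'cpus', mesos_pb2.Value.SCALAR
            cpu.scalar.value = 1
            driver.launchTasks(offer.id, [task])     # accept the offer with this task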
DevOps tooling
Vamp is a deployment and workflow tool for container orchestration systems,
including Mesos/Marathon. It brings canary releasing, A/B testing, auto-scaling and
self-healing through a web UI, CLI and REST API.
Aurora
An important tool that has evolved out of the Mesos environment is Aurora, which
recently graduated from the Apache Incubator and is now a full Apache project
(Figure 1). According to the project website, "Aurora runs applications and services
across a shared pool of machines, and is responsible for keeping them running,
forever. When machines experience failure, Aurora intelligently reschedules those
jobs onto healthy machines" [2]. In other words, Aurora is a little like an init tool for
data centers and cloud-based virtual environments.
Figure 1: Aurora is a Mesos Framework;
Mesos is in turn an Apache project.
The Aurora project has many fathers: in addition to its kinship with Apache and
Mesos, Aurora was initially supported by Twitter, and Google was at least indirectly
an inspiration for the project. The beginnings of Aurora date back to 2010. Bill
Farner, a member of the research team at Twitter, launched a project to facilitate
the operation of Twitter's infrastructure. The IT landscape of the short message
service had grown considerably at that time. The operations team was faced with
thousands of computers and hundreds of applications, and added to this was the
constant rollout of new software versions.
Bill Farner had previously worked at Google and had some experience working with Google's Borg
cluster manager [3]. In the early years, development took place only within Twitter and behind closed
doors. However, more and more employees contributed to the development, and Aurora became
increasingly important for the various Twitter services. Eventually, the opening of the project in the
direction of the open source community was a natural step to maintain such a fast-growing software
project. Aurora has been part of the Apache family since 2013.
What is Singularity
Singularity is a platform that enables deploying and running services and scheduled
jobs in the cloud or data centers. Combined with Apache Mesos, it provides efficient
management of the underlying processes' life cycles and effective use of cluster
resources.
Singularity is an essential part of the HubSpot Platform and is ideal for deploying micro-services. It
is optimized to manage thousands of concurrently running processes in hundreds of servers.
How it Works
Singularity is an Apache Mesos framework. It runs as a task scheduler on top of Mesos Clusters
taking advantage of Apache Mesos' scalability, fault-tolerance, and resource isolation. Apache
Mesos is a cluster manager that simplifies the complexity of running different types of applications
on a shared pool of servers. In Mesos terminology, Mesos applications that use the Mesos APIs to
schedule tasks in a cluster are called frameworks.
There are different types of frameworks and most frameworks concentrate on a specific type of task
(e.g. long-running vs scheduled cron-type jobs) or supporting a specific domain and relevant
technology (e.g. data processing with hadoop jobs vs data processing with spark).
Singularity tries to be more generic by combining long-running tasks and job
scheduling functionality in one framework to support many of the common process
types that developers need to deploy every day to build modern web applications and
services. While Mesos allows multiple frameworks to run in parallel, having a
consistent and uniform set of abstractions and APIs for handling deployments across
the organization greatly simplifies the PaaS architecture. Additionally, it reduces the
amount of framework boilerplate that must be supported, as all Mesos frameworks
must keep state, handle failures, and properly interact with the Mesos APIs. These
are the main reasons HubSpot engineers initiated the development of a new
framework. As of this moment, Singularity supports the following process types:
Web Services. These are long running processes which expose an API and may run with
multiple load balanced instances. Singularity supports automatic configurable health
checking of the instances at the process and API endpoint level as well as load balancing.
Singularity will automatically restart these tasks when they fail or exit.
Workers. These are long running processes, similar to web services, but do not expose an
API. Queue consumers are a common type of worker processes. Singularity does automatic
health checking, cool-down and restart of worker instances.
Scheduled (CRON-type) Jobs. These are tasks that periodically run according to a
provided CRON schedule. Scheduled jobs will not be restarted when they fail unless
instructed to do so. Singularity will run them again on the next scheduling cycle.
On-Demand Processes. These are manually run processes that will be deployed and ready
to run but Singularity will not automatically run them. Users can start them through an API
call or using the Singularity Web UI, which allows them to pass command line parameters
on-demand.
Singularity Components
Mesos frameworks have two major components. A scheduler component that registers with the
Mesos master to be offered resources and an executor component that is launched on cluster slave
nodes by the Mesos slave process to run the framework tasks.
The Mesos master determines how many resources are offered to each framework and the
framework scheduler selects which of the offered resources to use to run the required tasks. Mesos
slaves do not directly run the tasks but delegate the running to the appropriate executor that has
knowledge about the nature of the allocated task and the special handling that might be required.
As depicted in the figure, Singularity implements the two basic framework components as well as a
few more to solve common complex / tedious problems such as task cleanup and log tailing /
archiving without requiring developers to implement it for each task they want to run:
Singularity Scheduler
The scheduler is the core of Singularity: a DropWizard API that implements the Mesos Scheduler
Driver. The scheduler matches client deploy requests to Mesos resource offers and acts as a web
service offering a JSON REST API for accepting deploy requests.
Clients use the Singularity API to register the type of deployable item that they want to run (web
service, worker, cron job) and the corresponding runtime settings (cron schedule, # of instances,
whether instances are load balanced, rack awareness, etc.).
After a deployable item (a request, in API terms) has been registered, clients can post Deploy
requests for that item. Deploy requests contain information about the command to run, the executor
to use, executor specific data, required cpu, memory and port resources, health check URLs and a
variety of other runtime configuration options. The Singularity scheduler will then attempt to match
Mesos offers (which in turn include resources as well as rack information and what else is running
on slave hosts) with its list of Deploy requests that have yet to be fulfilled.
Rollback of failed deploys, health checking and load balancing are also part of the
advanced functionality the Singularity Scheduler offers. A new deploy for a long-
running service will run as shown in the diagram below.
When a service or worker instance fails in a new deploy, the Singularity scheduler will rollback all
instances to the version running before the deploy, keeping the deploys always consistent. After the
scheduler makes sure that a Mesos task (corresponding to a service instance) has entered the
TASK_RUNNING state it will use the provided health check URL and the specified health check
timeout settings to perform health checks. If health checks go well, the next step is to perform load
balancing of service instances. Load balancing is attempted only if the corresponding deployable
item has been defined to be loadBalanced. To perform load balancing between service instances,
Singularity supports a rich integration with a specific Load Balancer API. Singularity will post
requests to the Load Balancer API to add the newly deployed service instances and to remove those
that were previously running. Check Integration with Load Balancers to learn more. Singularity also
provides generic webhooks which allow third party integrations, which can be registered to follow
request, deploy, or task updates.
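A hypothetical sketch of this flow from a client's point of view, using Python's
requests; the base URL, endpoint paths and field names below are assumptions drawn
from the description above, not a verified API reference.

import requests

API = 'http://singularity-host:7099/singularity/api'   # assumed base URL

# 1. Register the deployable item (a "request" in API terms)
requests.post(API + '/requests', json={
    'id': 'my-web-service',
    'requestType': 'SERVICE',
    'instances': 2,
    'loadBalanced': True,
})

# 2. Post a deploy for that request
requests.post(API + '/deploys', json={
    'deploy': {
        'requestId': 'my-web-service',
        'id': 'deploy-1',
        'command': 'java -jar service.jar',
        'resources': {'cpus': 1, 'memoryMb': 512, 'numPorts': 1},
        'healthcheckUri': '/health',   # used after TASK_RUNNING, as described above
    },
})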
Slave Placement
When matching a Mesos resource offer to a deploy, Singularity can use one of several strategies to
determine if the host in the offer is appropriate for the task in question, or SlavePlacement in
Singularity terms. Available placement strategies are:
GREEDY: uses whatever slaves are available
SEPARATE_BY_DEPLOY/SEPARATE: ensures no two instances / tasks of the same
request and deploy id are ever placed on the same slave
SEPARATE_BY_REQUEST: ensures no two tasks belonging to the same request
(regardless of deploy id) are placed on the same host
OPTIMISTIC: attempts to spread out tasks but may schedule some on the same slave
SPREAD_ALL_SLAVES: ensures the task is running on every slave. Same behaviour
as SEPARATE_BY_DEPLOY, but autoscales the Request so that the number of
instances stays equal to the number of slaves.
Slave placement can also be impacted by slave attributes. There are three scenarios that Singularity
supports:
1. Specific Slaves -> For a certain request, only run it on slaves with matching attributes - In
this case, you would specify requiredSlaveAttributes in the json for your request, and
the tasks for that request would only be scheduled on slaves that have all of those attributes.
2. Reserved Slaves -> Reserve a slave for specific requests, only run those requests on those
slaves - In your Singularity config, specify the reserveSlavesWithAttributes field.
Singularity will then only schedule tasks on slaves with those attributes if the request's
required attributes also match those.
3. Test Group of Slaves -> Reserve a slave for specific requests, but don't restrict the requests
to that slave - In your Singularity config, specify the reserveSlavesWithAttributes
field as in the previous example. But, in the request json, specify the
allowedSlaveAttributes field. Then, the request will be allowed to run elsewhere in
the cluster, but will also have the matching attributes to run on the reserved slave.
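For example, a request JSON combining these fields might look like the following
sketch; the attribute names and values are hypothetical.

{
  "id": "my-web-service",
  "requestType": "SERVICE",
  "instances": 2,
  "requiredSlaveAttributes": { "zone": "us-east-1a" },
  "allowedSlaveAttributes": { "reserved_group": "test" }
}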
Dpark
Then just ``pip install dpark`` (``sudo`` may be needed if you encounter a permission
problem).
Example for word counting (wc.py):
from dpark import DparkContext

ctx = DparkContext()
file = ctx.textFile("/tmp/words.txt")                      # read the input file
words = file.flatMap(lambda x: x.split()).map(lambda x: (x, 1))
wc = words.reduceByKey(lambda x, y: x + y).collectAsMap()  # sum counts per word
print(wc)
This script can run locally or on a Mesos cluster without any modification, just using
different command-line arguments:
$ python wc.py
$ python wc.py -m process
$ python wc.py -m host[:port]
See examples/ for more use cases.
Configuration
DPark can run with Mesos 0.9 or higher.
If a $MESOS_MASTER environment variable is set, you can use a shortcut and run DPark with
Mesos just by typing
$ python wc.py -m mesos
$MESOS_MASTER can be any scheme of Mesos master, such as
$ export MESOS_MASTER=zk://zk1:2181,zk2:2181,zk3:2181/mesos_master
In order to speed up shuffling, you should deploy Nginx at port 5055 for accessing data in
DPARK_WORK_DIR (default is /tmp/dpark), such as:
server {
listen 5055;
server_name localhost;
root /tmp/dpark/;
}
UI
2 DAGs:
1. stage graph: a stage is a running unit containing a set of tasks, each of which
runs the same ops for one split of an RDD.
UI when running
Just open the URL from the log, e.g. start listening on Web UI http://server_01:40812 .
UI after running
1. Before the run, configure LOGHUB & LOGHUB_PATH_FORMAT in dpark.conf and
pre-create LOGHUB_DIR.

rdd = DparkContext().makeRDD([(1, 1)]).map(m).groupByKey()
rdd.map(m).collect()
rdd.map(m).collect()

Nodes are combined iff they have the same lineage, forming a logic tree inside a
stage; each node then contains a PIPELINE of RDDs.

rdd1 = get_rdd()
rdd2 = dc.union([get_rdd() for i in range(2)])
rdd3 = get_rdd().groupByKey()
dc.union([rdd1, rdd2, rdd3]).collect()
Exelixi
Exelixi is a distributed framework based on Apache Mesos, mostly implemented in Python
using gevent for high-performance concurrency. It is intended to run cluster computing
jobs (partitioned batch jobs, which include some messaging) in pure Python. By default, it
runs genetic algorithms at scale. However, it can handle a broad range of other problem
domains by using the --uow command line option to override the UnitOfWorkFactory class
definition.
Please see the project wiki for more details, including a tutorial on how to build Mesos-
based frameworks.
Quick Start
To check out the GA on a laptop (with Python 2.7 installed), simply run:
./src/ga.py
Otherwise, to run at scale, the following steps will help you get Exelixi running on Apache
Mesos. For help in general with command line options:
./src/exelixi.py -h
The following instructions are based on using the Elastic Mesos service, which uses Ubuntu
Linux servers running on Amazon AWS. Even so, the basic outline of steps shown here
apply in general.
First, launch an Apache Mesos cluster. Once you have confirmation that your cluster is
running (e.g., Elastic Mesos sends you an email message with a list of masters and slaves),
then use ssh to login on any of the masters:
ssh -A -l ubuntu <master-public-ip>
You must install the Python bindings for Apache Mesos. The default version of Mesos used
in this code changes as there are updates to Elastic Mesos, since the tutorials are based on
that service. You can check http://mesosphere.io/downloads/ for the latest. If you run
Mesos in a different environment, simply make a one-line change to the EGG environment
variable in the bin/local_install.sh script. You also need to install the Exelixi source.
On the Mesos master, download the master branch of the Exelixi code repo on GitHub and
install the required libraries:
wget https://github.com/ceteri/exelixi/archive/master.zip ; \
unzip master.zip ; \
cd exelixi-master ; \
./bin/local_install.sh
If you've customized the code by forking your own GitHub code repo, then substitute that
download URL instead. Alternatively, if you've customized by subclassing the
uow.UnitOfWorkFactory default GA, then place that Python source file into the src/
subdirectory.
Next, run the installation command on the master, to set up each of the slaves:
./src/exelixi.py -m localhost:5050 -w 2
Once everything has been set up successfully, the log file in exelixi.log will show a line:
all worker services launched and init tasks completed
Flink
Apache Flink is an open source streaming platform which gives you tremendous
capabilities to run real-time data processing pipelines in a fault-tolerant way at a scale of
millions of events per second.
The key point is that it does all this using the minimum possible resources at single millisecond
latencies.
So how does it manage that and what makes it better than other solutions in the same domain?
Fault tolerance
Flink provides robust fault tolerance using checkpointing (periodically saving internal
state to external sources such as HDFS).
Moreover, Flink's checkpointing mechanism can be made incremental (saving only the
changes and not the whole state), which really reduces the amount of data in HDFS and
the I/O duration. The checkpointing overhead is almost negligible, which enables users to
have large states inside Flink applications.
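As a minimal sketch, checkpointing is switched on per job; the snippet below uses the
PyFlink DataStream API found in newer Flink releases, and the interval is an
assumption.

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(10000)   # snapshot operator state every 10 seconds (assumed interval)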
Flink also provides a high-availability setup through ZooKeeper. This is for re-spawning
the job in cases where the driver (known as the JobManager in Flink) crashes due to some
error.
SQL Support
Like Spark Streaming, Flink also provides a SQL API interface, which makes writing a job
easier for people from a non-programming background. Flink SQL is maturing day by day
and is already being used by companies such as Uber and Alibaba to do analytics on
real-time data.
Environment Support
A Flink job can be run in a distributed system or on a local machine. The program can run
on Mesos, YARN and Kubernetes, as well as in standalone mode (e.g. in Docker containers).
Since Flink 1.4, Hadoop is not a prerequisite, which opens up a number of possibilities for
places to run a Flink job.
Awesome community
Flink has a great dev community, which allows for frequent new features and bug fixes as
well as great tools to ease the developer effort further. Some of these tools are:
Flink Tensorflow – Run Tensorflow graphs as a Flink process
Flink HTM – Anomaly detection in a stream in Flink
Tink – A temporal graph library built on top of Flink
Flink SQL and Complex Event Processing (CEP) were also initially developed by Alibaba
and contributed back to Flink.
Apache Hama
Apache Hama is a BSP (Bulk Synchronous Parallel) computing framework
on top of HDFS (Hadoop Distributed File System) for massive scientific
computations such as matrix, graph and network algorithms.
This release is the first release as a top-level project, and contains two
significant new features (Message Compressor, a complete clone of
Google's Pregel) and many improvements in computing system
performance and durability.
MPI
The Open MPI Project is an open source Message Passing Interface implementation that is developed and
maintained by a consortium of academic, research, and industry partners. Open MPI is therefore able to
combine the expertise, technologies, and resources from all across the High Performance Computing
community in order to build the best MPI library available. Open MPI offers advantages for system and
software vendors, application developers and computer science researchers.
NoSQL databases are broadly categorized into four types:
1. Key-Value Store – It has a big hash table of keys & values. {Example: Riak,
Amazon S3 (Dynamo)}
2. Document-based Store – It stores documents made up of tagged elements.
{Example: CouchDB}
3. Column-based Store – Each storage block contains data from only one column.
{Example: HBase, Cassandra}
4. Graph-based – A network database that uses edges and nodes to represent and
store data. {Example: Neo4J}
1. Key Value Store NoSQL Database
The schema-less format of a key value database like Riak is just about what you need for
your storage needs. The key can be synthetic or auto-generated while the value can be
String, JSON, BLOB (basic large object) etc.
The key-value type basically uses a hash table in which there exists a unique key and a
pointer to a particular item of data. A bucket is a logical group of keys, but buckets don't
physically group the data; there can be identical keys in different buckets.
When we reflect back on the CAP theorem, it becomes quite clear that key-value
stores are great on the Availability and Partition-tolerance aspects but definitely lack in
Consistency.
Example: Consider the data subset represented in the following table. Here the key is the
name of a country with a 3Pillar presence, while the value is a list of addresses of 3Pillar
centers in that country.

| Key | Value |
| --- | --- |
| "India" | {"B-25, Sector-58, Noida, India – 201301"} |
| "Romania" | {"IMPS Moara Business Center, Buftea No. 1, Cluj-Napoca, 400606", "City Business Center, Coriolan Brediceanu No. 10, Building B, Timisoara, 300011"} |
| "US" | {"3975 Fair Ridge Drive. Suite 200 South, Fairfax, VA 22033"} |
This key/value type of database allows clients to read and write values using a key, as
follows:
Get(key) returns the value associated with the key.
Put(key, value) associates the value with the key.
Multi-get(key1, key2, .., keyN) returns the list of values associated with the list of keys.
Delete(key) removes the entry for the key from the data store.
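A toy in-memory sketch of this interface, with buckets as logical key groups; it
models the general idea described above, not any specific product.

class KVStore:
    def __init__(self):
        self.buckets = {}                      # bucket name -> hash table of keys

    def put(self, bucket, key, value):
        self.buckets.setdefault(bucket, {})[key] = value

    def multi_get(self, bucket, *keys):
        table = self.buckets.get(bucket, {})
        return [table.get(k) for k in keys]    # one value (or None) per requested key

    def delete(self, bucket, key):
        self.buckets.get(bucket, {}).pop(key, None)

store = KVStore()
store.put('offices', 'India', ['B-25, Sector-58, Noida, India - 201301'])
store.put('offices', 'US', ['3975 Fair Ridge Drive. Suite 200 South, Fairfax, VA 22033'])
print(store.multi_get('offices', 'India', 'US'))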
While a key/value type of database seems helpful in some cases, it has some weaknesses
as well. One is that the model will not provide any kind of traditional database capabilities
(such as atomicity of transactions, or consistency when multiple transactions are executed
simultaneously). Such capabilities must be provided by the application itself.
Secondly, as the volume of data increases, maintaining unique values as keys may
become more difficult; addressing this issue requires the introduction of some complexity
in generating character strings that will remain unique among an extremely large set of
keys.
Riak and Amazon’s Dynamo are the most popular key-value store NoSQL databases.
2. Document Store NoSQL Database
The data which is a collection of key value pairs is compressed as a document store quite
similar to a key-value store, but the only difference is that the values stored (referred to as
“documents”) provide some structure and encoding of the managed data. XML, JSON
(Java Script Object Notation), BSON (which is a binary encoding of JSON objects) are
some common standard encodings.
The following example shows data values collected as a “document” representing the
names of specific retail stores. Note that while the three examples all represent locations,
the representative models are different.
{ officeName: "3Pillar Noida",
  address: { Street: "B-25", City: "Noida", State: "UP", Pincode: "201301" }
}
{ officeName: "3Pillar Timisoara",
  address: { Boulevard: "Coriolan Brediceanu No. 10", Block: "B, 1st Floor",
             City: "Timisoara", Pincode: "300011" }
}
{ officeName: "3Pillar Cluj",
  address: { Latitude: "40.748328", Longitude: "-73.985560" }
}
One key difference between a key-value store and a document store is that the latter
embeds attribute metadata associated with stored content, which essentially provides a
way to query the data based on the contents. For example, in the above example, one
could search for all documents in which "City" is "Noida", which would deliver a result set
containing all documents associated with any "3Pillar Office" that is in that particular city.
Apache CouchDB is an example of a document store. CouchDB uses JSON to store data,
JavaScript as its query language (using MapReduce), and HTTP for an API. Data and
relationships are not stored in tables, as is the norm with conventional relational
databases; instead, they are a collection of independent documents.
The fact that document style databases are schema-less makes adding fields to JSON
documents a simple task without having to define changes first.
Couchbase and MongoDB are the most popular document based databases.
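A minimal pymongo sketch of the "City is Noida" query discussed above; the host,
database and collection names are assumptions.

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')    # assumed local server
offices = client.companydb.offices                   # assumed database and collection
offices.insert_one({'officeName': '3Pillar Noida',
                    'address': {'Street': 'B-25', 'City': 'Noida',
                                'State': 'UP', 'Pincode': '201301'}})
for doc in offices.find({'address.City': 'Noida'}):  # query on embedded document attributes
    print(doc['officeName'])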
3. Column Store NoSQL Database
While most relational DBMSs store data in rows, the benefit of storing data in columns is
fast search/access and data aggregation. Relational databases store a single row as a
continuous disk entry, and different rows are stored in different places on disk, while
columnar databases store all the cells corresponding to a column as a continuous disk
entry, thus making search/access faster.
For example, querying the titles of a million articles would be a painstaking task with a
relational database, as it would go over each location to get the item titles. On the other
hand, with just one disk access, the titles of all the items can be obtained.
Data Model
Key: the permanent name of the record. Keys have different numbers of columns, so
the database can scale in an irregular way.
Keyspace: This defines the outermost level of an organization, typically the name of
the application. For example, ‘3PillarDataBase’ (database name).
Column: It has an ordered list of elements aka tuple with a name and a value defined.
The best known examples are Google's BigTable, and HBase & Cassandra, which were
inspired by BigTable.
BigTable, for instance, is a high-performance, compressed and proprietary data storage
system owned by Google. Modeled this way, the offices example from above looks like the
following:
{
  3PillarNoida: {
    address: {
      city: Noida,
      pincode: 201301
    },
    details: {
      strength: 250,
      projects: 20
    }
  },
  3PillarCluj: {
    address: {
      city: Cluj,
      pincode: 400606
    },
    details: {
      strength: 200,
      projects: 15
    }
  },
  3PillarTimisoara: {
    address: {
      city: Timisoara,
      pincode: 300011
    },
    details: {
      strength: 150,
      projects: 10
    }
  },
  3PillarFairfax: {
    address: {
      city: Fairfax,
      pincode: VA 22033
    },
    details: {
      strength: 100,
      projects: 5
    }
  }
}
Google’s BigTable, HBase and Cassandra are the most popular column store based
databases.
4. Graph Base NoSQL Database
In a Graph Base NoSQL Database, you will not find the rigid format of SQL or the tables
and columns representation, a flexible graphical representation is instead used which is
perfect to address scalability concerns. Graph structures are used with edges, nodes and
properties which provides index-free adjacency. Data can be easily transformed from one
model to the other using a Graph Base NoSQL database.
These databases use edges and nodes to represent and store data.
The nodes are organised by some relationships with one another, which are
represented by edges between the nodes.
Both the nodes and the relationships have some defined properties.
The following are some of the features of the graph based database, which are explained
on the basis of the example below:
Labeled, directed, attributed multi-graph: the graph contains nodes which are
labelled properly with some properties, and these nodes have some relationship with one
another, which is shown by the directional edges. For example, in the following
representation, "Alice knows Bob" is shown by an edge that also has some properties.
While relational database models can replicate the graphical ones, the edge would require
a join, which is a costly proposition.
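The "Alice knows Bob" example can be written as Cypher via the official Neo4j Python
driver; the Bolt URI and credentials below are assumptions.

from neo4j import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687',
                              auth=('neo4j', 'password'))   # assumed credentials
with driver.session() as session:
    # an edge with its own property, attached to two labeled nodes
    session.run("CREATE (a:Person {name: 'Alice'})-"
                "[:KNOWS {since: 2010}]->(b:Person {name: 'Bob'})")
    for record in session.run("MATCH (a:Person)-[k:KNOWS]->(b:Person) "
                              "RETURN a.name, k.since, b.name"):
        print(record)
driver.close()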
UseCase–
Any 'Recommended for You' rating you see on e-commerce websites (book/video renting
sites) is often derived by taking into account how other users have rated the product in
question. Arriving at such a use case is made easy using graph databases.
InfoGrid and Infinite Graph are the most popular graph-based databases. InfoGrid allows
the connection of as many edges (Relationships) and nodes (MeshObjects) as needed,
making it easier to represent hyperlinked and complex sets of information.
There are two kinds of GraphDatabase offered by InfoGrid; these include the following:
NetMeshBase – It is ideally suited for large distributed graphs and has additional
capabilities to communicate with other similar NetMeshBases.
System Properties Comparison Cassandra vs. HBase vs. MongoDB

| Property | Cassandra | HBase | MongoDB |
| --- | --- | --- | --- |
| Implementation language | Java | Java | C++ |
| Server operating systems | BSD, Linux, OS X, Windows | Linux, Unix, Windows (using Cygwin) | Linux, OS X, Solaris, Windows |
| Data scheme | schema-free | schema-free | schema-free (documents of the same collection often follow the same structure; optionally impose all or part of a schema by defining a JSON schema) |
| Typing (predefined data types such as float or date) | yes | no | yes: string, integer, double, decimal, boolean, date, object_id, geospatial |
| XML support (processing data in XML format, e.g. XML data structures and/or XPath, XQuery or XSLT) | no | no | no |
| Secondary indexes | restricted (only equality queries, not always the best performing solution) | no | yes |
| SQL | SQL-like SELECT, DML and DDL statements (CQL) | no | read-only SQL queries via the MongoDB Connector for BI |
| APIs and other access methods | proprietary protocol, CQL (Cassandra Query Language, an SQL-like language), Thrift | Java API, RESTful HTTP API, Thrift | proprietary protocol using JSON |
| Supported programming languages | C#, C++, Clojure, Erlang, Go, Haskell, Java, JavaScript, Node.js, Perl, PHP, Python, Ruby, Scala | C, C#, C++, Groovy, Java, PHP, Python, Scala | Actionscript*, C, C#, C++, Clojure*, ColdFusion*, D*, Dart*, Delphi*, Erlang, Go*, Groovy*, Haskell, Java, JavaScript, Lisp*, Lua*, MatLab*, Perl, PHP, PowerShell*, Prolog*, Python, R*, Ruby, Scala, Smalltalk* (* inofficial driver) |
| Server-side scripts (stored procedures) | no | yes (coprocessors in Java) | JavaScript |
| Triggers | yes | yes | no |
| Partitioning methods (storing different data on different nodes) | Sharding (no "single point of failure") | Sharding | Sharding |
| Replication methods (redundantly storing data on multiple nodes) | selectable replication factor (representation of geographical distribution of servers is possible) | selectable replication factor | Master-slave replication |
| MapReduce (API for user-defined Map/Reduce methods) | yes | yes | yes |
| Consistency concepts (in a distributed system) | Eventual Consistency or Immediate Consistency (can be decided individually for each write operation) | Immediate Consistency | Eventual Consistency or Immediate Consistency (can be decided individually for each write operation) |
| Foreign keys (referential integrity) | no | no | no (typically not used; similar functionality with DBRef possible) |
| Transaction concepts (data integrity after non-atomic manipulations) | no | no (atomicity and isolation are supported for single operations) | Multi-document ACID transactions with snapshot isolation |
| Concurrency (support for concurrent manipulation of data) | yes | yes | yes |
| Durability (support for making data persistent) | yes | yes | yes (optional) |
| In-memory capabilities (option to hold some or all structures in memory only) | no | no | yes (in-memory storage engine introduced with MongoDB version 3.2) |
| User concepts (access control) | access rights for users can be defined per object | Access Control Lists (ACL), implementation based on Hadoop and ZooKeeper | access rights for users and roles |
Benchmarking NoSQL Databases: Cassandra vs. MongoDB vs. HBase vs. Couchbase
Understanding the performance behavior of a NoSQL database like Apache Cassandra™ under
various conditions is critical. Conducting a formal proof of concept (POC) in the environment in
which the database will run is the best way to evaluate platforms. POC processes that include the
right benchmarks such as production configurations, parameters and anticipated data and concurrent
user workloads give both IT and business stakeholders powerful insight about platforms under
consideration and a view for how business applications will perform in production.
Independent benchmark analyses and testing of various NoSQL platforms under big data,
production-level workloads have been performed over the years and have consistently identified
Apache Cassandra as the platform of choice for businesses interested in adopting NoSQL as the
database for modern Web, mobile and IOT applications.
One benchmark analysis (Solving Big Data Challenges for Enterprise Application
Performance Management) by engineers at the University of Toronto, evaluating six
different data stores, found Apache Cassandra the "clear winner throughout our
experiments". Also, End Point Corporation, a database and open source consulting
company, benchmarked the top NoSQL databases, including Apache Cassandra, Apache
HBase, Couchbase, and MongoDB, using a variety of different workloads on AWS EC2.
The databases involved were:
Apache Cassandra: Highly scalable, high performance distributed database designed to handle
large amounts of data across many commodity servers, providing high availability with no single
point of failure.
Apache HBase: Open source, non-relational, distributed database modeled after Google’s BigTable
and is written in Java. It is developed as part of Apache Software Foundation’s Apache Hadoop
project and runs on top of HDFS (Hadoop Distributed File System), providing BigTable-like
capabilities for Hadoop.
MongoDB: Cross-platform document-oriented database system that eschews the traditional table-
based relational database structure in favor of JSON-like documents with dynamic schemas making
the integration of data in certain types of applications easier and faster.
End Point conducted the benchmark of these NoSQL database options on Amazon Web Services
EC2 instances, which is an industry-standard platform for hosting horizontally scalable services. In
order to minimize the effect of AWS CPU and I/O variability, End Point performed each test 3 times
on 3 different days. New EC2 instances were used for each test run to further reduce the impact of
any “lame instance” or “noisy neighbor” effects sometimes experienced in cloud environments, on
any one test.
NoSQL Database Performance Testing Results
When it comes to performance, it should be noted that there is (to date) no single "winner
takes all" among the top NoSQL databases, or any other NoSQL engine for that matter.
Depending on the use case and deployment conditions, it is almost always possible for one
NoSQL database to outperform another and yet lag its competitor when the rules of
engagement change. Here are a couple of snapshots of the performance benchmark to
give you a sense of how each NoSQL database stacks up.
Throughput by Workload
Each workload appears below with the throughput/operations-per-second (more is better) graphed
vertically, the number of nodes used for the workload displayed horizontally, and a table with the
result numbers following each graph.
Load process
For load, Couchbase, HBase, and MongoDB all had to be configured for non-durable writes to
complete in a reasonable amount of time, with Cassandra being the only database performing
durable write operations. Therefore, the numbers below for Couchbase, HBase, and MongoDB
represent non-durable write metrics.
What is HBase?
HBase supports random, real-time read/write access, with a goal of hosting very
large tables atop clusters of commodity hardware. Key aspects of how HBase works
include the following:
HBase uses ZooKeeper for coordination of “truth” across the cluster. As region
servers come online, they register themselves with ZooKeeper as members of the
cluster. Region servers have shards of data (partitions of a database table) called
“regions”.
When a change is made to a row, it is updated in a persistent Write-Ahead-Log
(WAL) file and Memstore, the sorted memory cache for HBase. Once Memstore
fills, its changes are “flushed” to HFiles in HDFS. The WAL ensures that HBase
does not lose the change if Memstore loses its data before it is written to an HFile.
During a read, HBase checks to see if the data exists first in Memstore, which can
provide the fastest response with direct memory access. If the data is not in
Memstore, HBase will retrieve the data from the HFile.
HFiles are replicated by HDFS, typically to at least 3 nodes. HBase always writes to
the local node first and then replicates to other nodes. In the event of a node failure,
HBase will assign the regions to another node that has a local HFile copy replicated
by HDFS.
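A small write/read round trip through the Thrift-based happybase client shows the
path described above; the host, table and column family are assumptions.

import happybase

connection = happybase.Connection('hbase-thrift-host')  # assumed Thrift server host
table = connection.table('users')                       # assumed table with family 'cf'
table.put(b'row-1', {b'cf:name': b'Alice'})             # lands in the WAL and Memstore first
print(table.row(b'row-1'))                              # served from Memstore or HFiles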
Splice Machine has chosen to replace the storage engine in Apache Derby (our
customized SQL database) with HBase to leverage its ability to scale out on
commodity hardware. HBase co-processors are used to embed Splice Machine in
each distributed HBase region (i.e., data shard). This enables Splice Machine to
achieve massive parallelization by pushing the computation down to each data
shard.
Redis:
When speed is a priority
If you need a high performance database, Redis is hard to beat. Redis can be a good way to increase
the speed of an existing application.
Not a typical standalone database
Redis is typically not a standalone database for most companies. At Craigslist, Redis is
used alongside their existing primary database (MongoDB). While it is possible for a
company to use Redis by itself, it's uncommon.
When you have a well-planned design
With Redis, you have a variety of data structures that include hashes, sets, and lists but you have to
define explicitly how the data will be stored. You have to plan your design and decide in advance
how you want to store and then organize your data. Redis is not well suited for prototyping. If you
regularly do prototyping, MongoDB is probably a better choice.
If your database will have a predictable size (or will stay the same size), Redis can be used to
increase lookup speed. If you know your key, then lookup will be very fast. For applications that
need real-time data, Redis can be great because it can look up information about specific keys
quickly.
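A quick redis-py sketch of that key-based lookup; host and port are the usual
defaults, but still assumptions here.

import redis

r = redis.Redis(host='localhost', port=6379)   # assumed local instance
r.set('session:42', 'alice')                   # write against a known key
print(r.get('session:42'))                     # constant-time lookup by key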
The size of your data is stable
While it is possible to do master-slave replication using Redis, distributing your keys
between multiple instances must be done within your application. This makes Redis
harder to scale than MongoDB.
MongoDB:
More flexible
If you’re not certain how you will query your data, MongoDB is probably a better choice.
MongoDB is better for prototyping and is helpful when you need to get data into a database and
you’re not certain of the final design from the beginning.
Easier to learn
With MongoDB, it is easier to query data because MongoDB uses a more consistent structure.
MongoDB typically has a shorter learning curve.
Better use of large, growing data sets
If you need to store data in large data sets with the potential for rapid growth, MongoDB is a good
choice. MongoDB is also easier to administer as the database grows.
If you want to have an even better understanding of MongoDB and its differences from
other NoSQL databases, the comparisons below may help.
System Properties Comparison GraphDB vs. Neo4j
Our visitors often compare GraphDB and Neo4j with Microsoft Azure Cosmos DB,
Amazon Neptune and MongoDB.
Editorial information provided by DB-Engines

| Property | GraphDB | Neo4j |
| --- | --- | --- |
| Description | Enterprise RDF and graph database with efficient reasoning, cluster and external index synchronization support | Open source graph database |
| Primary database model | Graph DBMS, RDF store | Graph DBMS |
| DB-Engines Ranking | Score 0.86; Rank #152 overall, #10 Graph DBMS, #6 RDF stores | Score 47.86; Rank #22 overall, #1 Graph DBMS |
| Website | graphdb.ontotext.com | neo4j.com |
| Technical documentation | graphdb.ontotext.com/documentation | neo4j.com/docs |
| Developer | Ontotext | Neo4j, Inc. |
| Initial release | 2000 | 2007 |
| Current release | 8.8, January 2019 | 3.5.1, December 2018 |
| License | commercial | Open Source |
| Cloud-based only | no | no |
| Implementation language | Java | Java, Scala |
| Server operating systems | All OS with a Java VM; Linux, OS X, Windows | Linux, OS X, Solaris, Windows |
| Data scheme | schema-free and OWL/RDFS-schema support | schema-free and schema-optional |
| Typing | yes | yes |
| XML support | no | no |
| Secondary indexes | yes; supports real-time synchronization and indexing in SOLR/Elasticsearch/Lucene, plus GeoSPARQL geometry data indexes | yes |
| SQL | no; SPARQL is used as query language | no |
| APIs and other access methods | GeoSPARQL, Java API, RDF4J API, RIO, Sail API, Sesame REST HTTP Protocol, SPARQL 1.1 | Bolt protocol, Cypher query language, Java API, Neo4j-OGM, RESTful HTTP API, Spring Data Neo4j, TinkerPop 3 |
| Supported programming languages | .Net, C#, Clojure, Java, JavaScript (Node.js), PHP, Python, Ruby, Scala | .Net, Clojure, Elixir, Go, Groovy, Haskell, Java, JavaScript, Perl, PHP, Python, Ruby, Scala |
| Server-side scripts | Java Server Plugin | yes |
| Triggers | no | yes |
| Partitioning methods | none | none |
| Replication methods | Master-master replication | Causal Clustering using Raft protocol |
| MapReduce | no | no |
| Consistency concepts | Immediate Consistency or Eventual Consistency (configurable in cluster mode per master or individual client request) | Causal and Eventual Consistency configurable in Causal Cluster setup; Immediate Consistency in stand-alone mode |
| Foreign keys | yes | yes |
| Transaction concepts | ACID | ACID |
| Concurrency | yes | yes |
| Durability | yes | yes |
| User concepts | yes | Users, roles and permissions; pluggable authentication with supported standards (LDAP, Active Directory, Kerberos) |
More information provided by the system vendors (excerpts):
Specific characteristics - GraphDB: "GraphDB Enterprise is a high-performance semantic repository created by Ontotext..." Neo4j: "Neo4j is a native graph database platform that is built to store, query, analyze..."
Competitive advantages - GraphDB: "GraphDB allows you to link text and data in big knowledge graphs. It offers an easy..." Neo4j: "Neo4j database is the only transactional database that combines everything you need..."
Typical application scenarios - GraphDB: "Metadata enrichment and management, linked open data publishing, semantic inferencing..." Neo4j: "Real-Time Recommendations, Master Data Management, Identity and Access Management, Network..."
Key customers - GraphDB: "BBC, Press Association, Financial Times, DK, Euromoney, The British Museum, Getty..." Neo4j: "Over 300 commercial customers and over 750 startups use Neo4j. Flagship customers..."
Market metrics - GraphDB: "GraphDB is the most utilized semantic triplestore for mission-critical enterprise..." Neo4j: "Neo4j boasts the world's largest graph database ecosystem with more than a 15 million..."
Licensing and pricing models - GraphDB: "GraphDB-Free is free to use. SE and Enterprise are licensed per CPU-core used. Perpetual,..." Neo4j: "GPL v3 license that can be used all the places where you might use MySQL. Neo4j Commercial..."
System Properties Comparison Neo4j vs. OrientDB

                          Neo4j                                      OrientDB
Description               Open source graph database                 Multi-model DBMS (Document, Graph,
                                                                     Key/Value)
Primary database model    Graph DBMS                                 Document store; Graph DBMS;
                                                                     Key-value store
DB-Engines Ranking        Score 47.86; Rank #22 overall,             Score 6.05; Rank #52 overall,
                          #1 Graph DBMS                              #8 Document stores, #3 Graph DBMS,
                                                                     #9 Key-value stores
Website                   neo4j.com                                  orientdb.com
Technical documentation   neo4j.com/docs                             orientdb.org/docs/3.0.x
Developer                 Neo4j, Inc.                                OrientDB LTD; CallidusCloud
Initial release           2007                                       2010
Current release           3.5.1, December 2018                       3.0.12, December 2018
License                   Open Source GPL version 3; commercial      Open Source Apache version 2
                          licenses available
Cloud-based only          no                                         no
Implementation language   Java, Scala                                Java
Server operating systems  Linux; OS X; Solaris; Windows              All OS with a Java JDK (>= JDK 6);
                                                                     can also be used server-less as an
                                                                     embedded Java database
Data scheme               schema-free and schema-optional            schema-free; schema can be enforced
                                                                     for the whole record ("schema-full")
                                                                     or for some fields only
                                                                     ("schema-hybrid")
Typing                    yes                                        yes
XML support               no                                         -
Secondary indexes         yes                                        yes; pluggable indexing subsystem,
                                                                     by default Apache Lucene
SQL                       no                                         SQL-like query language, no joins
APIs and other access     Bolt protocol, Cypher query language,      Java API, RESTful HTTP/JSON API,
methods                   Java API, Neo4j-OGM Object Graph           TinkerPop technology stack with
                          Mapper, RESTful HTTP API, Spring Data      Blueprints, Gremlin, Pipes
                          Neo4j, TinkerPop 3
Supported programming     .Net, Clojure, Elixir, Go, Groovy,         .Net, C, C#, C++, Clojure, Java,
languages                 Haskell, Java, JavaScript, Perl, PHP,      JavaScript, JavaScript (Node.js),
                          Python, Ruby, Scala                        PHP, Python, Ruby, Scala
Server-side scripts       yes (user-defined procedures and           Java, JavaScript
                          functions)
Triggers                  yes (via event handler)                    Hooks
Partitioning methods      none                                       Sharding
Replication methods       Causal Clustering using Raft protocol,     Master-master replication
                          available in the Enterprise version only
MapReduce                 no                                         no; could be achieved with
                                                                     distributed queries
Consistency concepts      Causal and Eventual Consistency            -
                          configurable in Causal Cluster setup;
                          Immediate Consistency in stand-alone
                          mode
Foreign keys              yes (relationships in graphs)              yes (relationships in graphs)
Transaction concepts      ACID                                       ACID
Concurrency               yes                                        yes
Durability                yes                                        yes
User concepts             Users, roles and permissions;              Access rights for users and roles;
                          pluggable authentication with              record-level security configurable
                          supported standards (LDAP, Active
                          Directory, Kerberos)
In this post, we are going to discuss the Apache Hadoop 2.x architecture and how its components work in detail.
HDFS
YARN
MapReduce
These three are also known as the Three Pillars of Hadoop 2. The major new component here is YARN, which is a real game-changing component in the Big Data Hadoop system.
2. HDFS
On the main master node, the HDFS component is also known as the NameNode; the NameNode is used to store metadata.
In Hadoop 2.x, some more nodes act as master nodes, as shown in the above diagram. Each of these second-level master nodes has three components:
1. Node Manager
2. Application Master
3. Data Node
Each of these second-level master nodes in turn manages one or more slave nodes, as shown in the above diagram. These slave nodes have two components:
1. Node Manager
2. HDFS
Here the HDFS component is also known as the DataNode; the DataNode component is used to store our application's actual Big Data. These nodes do not contain an Application Master component.
Hadoop 2.x Components In-detail Architecture
2. Application Manager
Application Master: negotiates resources from the Resource Manager and works with the Node Managers to execute and monitor the tasks of a single application.
Node Manager: runs on each node and is responsible for launching and monitoring the containers on that node.
Container: each master node or slave node contains a set of containers. In this diagram, the main node's NameNode is not showing its containers; however, it also contains a set of containers. A container is a portion of a node's resources (primarily memory) allocated by YARN, on either a master node or a slave node.
In Hadoop 2.x, the container plays a role similar to the fixed map/reduce slots of Hadoop 1.x. We will see the major differences between these two components later.
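To make this concrete, here is a quick way to inspect these daemons on a running cluster (a sketch; the exact daemons and output depend on your installation):
$ jps                    # lists the Hadoop daemons running on this machine (NameNode, DataNode, ResourceManager, NodeManager, ...)
$ hdfs dfsadmin -report  # asks the NameNode to report the DataNodes it manages
$ yarn node -list        # asks the ResourceManager for the registered NodeManagers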
Hive
Installation and Configuration
Requirements
Java 1.7
Note: Hive versions 1.2 onward require Java 1.7 or newer. Hive versions 0.14 to 1.1 work with Java 1.6 as well. Users are strongly advised to start moving to Java 1.8 (see HIVE-8607).
$ cd hive-x.y.z
$ export HIVE_HOME={{pwd}}
$ export PATH=$HIVE_HOME/bin:$PATH
The Hive GIT repository for the most recent Hive code is located here (the master branch):
$ git clone https://git-wip-us.apache.org/repos/asf/hive.git
$ cd hive
$ cd packaging/target/apache-hive-{version}-SNAPSHOT-bin/apache-hive-{version}-SNAPSHOT-bin
$ ls
LICENSE
NOTICE
README.txt
RELEASE_NOTES.txt
$ svn co http://svn.apache.org/repos/asf/hive/branches/branch-{version} hive
$ cd hive
$ ls
LICENSE
NOTICE
README.txt
RELEASE_NOTES.txt
Running Hive
$ export HADOOP_HOME=<hadoop-install-dir>
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
$ $HIVE_HOME/bin/hive
$ $HIVE_HOME/bin/hiveserver2
$ $HIVE_HOME/bin/beeline -u jdbc:hive2://$HS2_HOST:$HS2_PORT
$ $HIVE_HOME/bin/beeline -u jdbc:hive2://
Running HCatalog
$ $HIVE_HOME/hcatalog/sbin/hcat_server.sh
$ $HIVE_HOME/hcatalog/bin/hcat
$ $HIVE_HOME/hcatalog/sbin/webhcat_server.sh
$ bin/hive --hiveconf x1=y1 --hiveconf x2=y2        // sets the variables x1 and x2 to y1 and y2 respectively
$ bin/hiveserver2 --hiveconf x1=y1 --hiveconf x2=y2 // sets server-side variables x1 and x2 to y1 and y2 respectively
$ bin/beeline --hiveconf x1=y1 --hiveconf x2=y2     // sets client-side variables x1 and x2 to y1 and y2 respectively
Runtime Configuration
beeline> SET mapred.job.tracker=myhost.mycompany.com:50030;
Hive Logging
/tmp/<user.name>/hive.log
Note: In local mode, prior to Hive 0.13.0 the log file name was ".log" instead of "hive.log". This bug was fixed in release 0.13.0 (see HIVE-5528 and HIVE-5676).
hive.log.dir=<other_location>
bin/hive --hiveconf hive.root.logger=INFO,console      // for HiveCLI (deprecated)
bin/hiveserver2 --hiveconf hive.root.logger=INFO,console
Alternatively, the user can change the logging level only by using:
bin/hive --hiveconf hive.root.logger=INFO,DRFA         // for HiveCLI (deprecated)
bin/hiveserver2 --hiveconf hive.root.logger=INFO,DRFA
bin/hive --hiveconf hive.root.logger=INFO,DAILY        // for HiveCLI (deprecated)
bin/hiveserver2 --hiveconf hive.root.logger=INFO,DAILY
Audit Logs
Perf Logger
log4j.logger.org.apache.hadoop.hive.ql.log.PerfLogger=DEBUG
If the logger level has already been set to DEBUG at
root via hive.root.logger, the above setting is not
required to see the performance logs.
DDL Operations
lists all the tables that end with 's'. The pattern matching follows Java regular expressions; see the documentation at http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html.
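For reference, the command this passage describes is the Getting Started example:
hive> SHOW TABLES '.*s';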
Metadata Store
NOTES:
SQL Operations
Example Queries
GROUP BY
JOIN
MULTITABLE INSERT
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.*
WHERE src.key < 100
STREAMING
CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
or:
curl --remote-name http://files.grouplens.org/datasets/movielens/ml-100k.zip
unzip ml-100k.zip
And load u.data into the table that was just created:
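The load statement from the Hive tutorial (replace <path> with wherever you unzipped the dataset):
LOAD DATA LOCAL INPATH '<path>/u.data' OVERWRITE INTO TABLE u_data;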
Create weekday_mapper.py:
import sys
import datetime

# read tab-separated ratings from stdin and replace the timestamp with the weekday
for line in sys.stdin:
  line = line.strip()
  userid, movieid, rating, unixtime = line.split('\t')
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print('\t'.join([userid, movieid, rating, str(weekday)]))
CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
add FILE weekday_mapper.py;
INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;
SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;
Sqoop ships with a help tool. To display a list of all available tools,
type the following command:
$ sqoop help
usage: sqoop COMMAND [ARGS]
Available commands:
codegen Generate code to interact with
database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display
the results
export Export an HDFS directory to a
database table
help List available commands
import Import a table from a database to
HDFS
import-all-tables Import tables from a database to
HDFS
import-mainframe Import mainframe datasets to HDFS
list-databases List available databases on a server
list-tables List available tables in a database
version Display version information
See 'sqoop help COMMAND' for information on a specific
command.
You can display help for a specific tool by entering: sqoop help
(tool-name); for example, sqoop help import.
You can also add the --help argument to any command: sqoop
import --help.
7. sqoop-import
7.1. Purpose
7.2. Syntax
7.2.1. Connecting to a Database Server
7.2.2. Selecting the Data to Import
7.2.3. Free-form Query Imports
7.2.4. Controlling Parallelism
7.2.5. Controlling Distributed Cache
7.2.6. Controlling the Import Process
7.2.7. Controlling transaction isolation
7.2.8. Controlling type mapping
7.2.9. Schema name handling
7.2.10. Incremental Imports
7.2.11. File Formats
7.2.12. Large Objects
7.2.13. Importing Data Into Hive
7.2.14. Importing Data Into HBase
7.2.15. Importing Data Into Accumulo
7.2.16. Additional Import Configuration Properties
7.3. Example Invocations
7.1. Purpose
The import tool imports an individual table from an RDBMS to
HDFS. Each row from a table is represented as a separate record in
HDFS. Records can be stored as text files (one record per line), or
in binary representation as Avro or SequenceFiles.
7.2. Syntax
7.2.1. Connecting to a Database Server
7.2.2. Selecting the Data to Import
7.2.3. Free-form Query Imports
7.2.4. Controlling Parallelism
7.2.5. Controlling Distributed Cache
7.2.6. Controlling the Import Process
7.2.7. Controlling transaction isolation
7.2.8. Controlling type mapping
7.2.9. Schema name handling
7.2.10. Incremental Imports
7.2.11. File Formats
7.2.12. Large Objects
7.2.13. Importing Data Into Hive
7.2.14. Importing Data Into HBase
7.2.15. Importing Data Into Accumulo
7.2.16. Additional Import Configuration Properties
$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)
While the Hadoop generic arguments must precede any import
arguments, you can type the import arguments in any order with
respect to one another.
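For example, a basic import of a table named EMPLOYEES in the corp database (host and table names are illustrative):
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES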
Note
In this document, arguments are grouped into collections organized by
function. Some collections are present in several tools (for example, the
"common" arguments). An extended description of their functionality is
given only on the first presentation in this document.
Note
The facility of using free-form query in the current version of Sqoop is
limited to simple queries where there are no ambiguous projections and
no OR conditions in the WHERE clause. Use of complex queries such as
queries that have sub-queries or joins leading to ambiguous projections can
lead to unexpected results.
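For instance, the user guide's free-form query example imports the results of a join, with Sqoop substituting split conditions for the $CONDITIONS token:
$ sqoop import \
    --query 'SELECT a.*, b.* FROM a JOIN b ON (a.id == b.id) WHERE $CONDITIONS' \
    --split-by a.id --target-dir /user/foo/joinresults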
If the target table does not exist, the Sqoop job will exit with an
error, unless the --accumulo-create-table parameter is specified.
Otherwise, you should create the target table before running an
import.
Sqoop currently serializes all values to Accumulo by converting
each field to its string representation (as if you were importing to
HDFS in text mode), and then inserts the UTF-8 bytes of this string
in the target cell.
By default, no visibility is applied to the resulting cells in Accumulo, so the data will be visible to any Accumulo user. Use the --accumulo-visibility parameter to specify a visibility token to apply to all rows in the import job.
For performance tuning, use the optional --accumulo-buffer-size and --accumulo-max-latency parameters. See Accumulo's documentation for an explanation of the effects of these parameters.
In order to connect to an Accumulo instance, you must specify the location of a Zookeeper ensemble using the --accumulo-zookeepers parameter, the name of the Accumulo instance (--accumulo-instance), and the username and password to connect with (--accumulo-user and --accumulo-password respectively).
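Putting these together, an import into Accumulo might look like the following sketch (connect string, table names, servers and credentials are all illustrative):
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --accumulo-table employees --accumulo-column-family info \
    --accumulo-zookeepers zk1:2181,zk2:2181 --accumulo-instance accumulo \
    --accumulo-user sqoopuser --accumulo-password secret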
Table 11. Code generation arguments:
Argument                 Description
--bindir <dir>           Output directory for compiled objects
--class-name <name>      Sets the generated class name. This overrides --package-name.
                         When combined with --jar-file, sets the input class.
--jar-file <file>        Disable code generation; use specified jar
--outdir <dir>           Output directory for generated code
--package-name <name>    Put auto-generated classes in this package
--map-column-java <m>    Override default mapping from SQL type to Java type for configured columns.
8. sqoop-import-all-tables
8.1. Purpose
8.2. Syntax
8.3. Example Invocations
8.1. Purpose
The import-all-tables tool imports a set of tables from an
RDBMS to HDFS. Data from each table is stored in a separate
directory in HDFS.
For the import-all-tables tool to be useful, the following
conditions must be met:
Each table must have a single-column primary key or the --autoreset-to-one-mapper option must be used.
You must intend to import all columns of each table.
You must not intend to use non-default splitting column, nor
impose any conditions via a WHERE clause.
8.2. Syntax
$ sqoop import-all-tables (generic-args) (import-args)
$ sqoop-import-all-tables (generic-args) (import-args)
Although the Hadoop generic arguments must precede any import arguments, the import arguments can be entered in any order with respect to one another.
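For example (connect string is illustrative):
$ sqoop import-all-tables --connect jdbc:mysql://db.foo.com/corp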
Table 13. Common arguments
Argument                             Description
--connect <jdbc-uri>                 Specify JDBC connect string
--connection-manager <class-name>    Specify connection manager class to use
--driver <class-name>                Manually specify JDBC driver class to use
--hadoop-mapred-home <dir>           Override $HADOOP_MAPRED_HOME
--help                               Print usage instructions
--password-file                      Set path for a file containing the authentication password
-P                                   Read password from console
--password <password>                Set authentication password
--username <username>                Set authentication username
--verbose                            Print more information while working
--connection-param-file <filename>   Optional properties file that provides connection parameters
--relaxed-isolation                  Set connection transaction isolation to read uncommitted for the mappers.
9. sqoop-import-mainframe
9.1. Purpose
9.2. Syntax
9.2.1. Connecting to a Mainframe
9.2.2. Selecting the Files to Import
9.2.3. Controlling Parallelism
9.2.4. Controlling Distributed Cache
9.2.5. Controlling the Import Process
9.2.6. File Formats
9.2.7. Importing Data Into Hive
9.2.8. Importing Data Into HBase
9.2.9. Importing Data Into Accumulo
9.2.10. Additional Import Configuration Properties
9.3. Example Invocations
9.1. Purpose
The import-mainframe tool imports all sequential datasets in a partitioned dataset (PDS) on a mainframe to HDFS. A PDS is akin to
a directory on the open systems. The records in a dataset can
contain only character data. Records will be stored with the entire
record as a single text field.
9.2. Syntax
9.2.1. Connecting to a Mainframe
9.2.2. Selecting the Files to Import
9.2.3. Controlling Parallelism
9.2.4. Controlling Distributed Cache
9.2.5. Controlling the Import Process
9.2.6. File Formats
9.2.7. Importing Data Into Hive
9.2.8. Importing Data Into HBase
9.2.9. Importing Data Into Accumulo
9.2.10. Additional Import Configuration Properties
$ sqoop import-mainframe (generic-args) (import-args)
$ sqoop-import-mainframe (generic-args) (import-args)
While the Hadoop generic arguments must precede any import
arguments, you can type the import arguments in any order with
respect to one another.
Table 19. Common arguments
Argument                             Description
--connect <hostname>                 Specify mainframe host to connect
--connection-manager <class-name>    Specify connection manager class to use
--hadoop-mapred-home <dir>           Override $HADOOP_MAPRED_HOME
--help                               Print usage instructions
--password-file                      Set path for a file containing the authentication password
-P                                   Read password from console
--password <password>                Set authentication password
--username <username>                Set authentication username
--verbose                            Print more information while working
--connection-param-file <filename>   Optional properties file that provides connection parameters
Example:
$ sqoop import-mainframe --connect z390 --username david --password 12345
Table 20. Import control arguments:
Argument Description
--as-avrodatafile Imports data to Avro Data Files
--as-sequencefile Imports data to SequenceFiles
--as-textfile Imports data as plain text (default)
--as-parquetfile Imports data to Parquet Files
--delete-target-dir Delete the import target directory if it exists
-m,--num-mappers <n> Use n map tasks to import in parallel
--target-dir <dir> HDFS destination dir
--warehouse-dir <dir> HDFS parent for table destination
-z,--compress Enable compression
--compression-codec <c> Use Hadoop codec (default gzip)
If the target table does not exist, the Sqoop job will exit with an
error, unless the --accumulo-create-table parameter is specified.
Otherwise, you should create the target table before running an
import.
Sqoop currently serializes all values to Accumulo by converting
each field to its string representation (as if you were importing to
HDFS in text mode), and then inserts the UTF-8 bytes of this string
in the target cell.
By default, no visibility is applied to the resulting cells in Accumulo, so the data will be visible to any Accumulo user. Use the --accumulo-visibility parameter to specify a visibility token to apply to all rows in the import job.
For performance tuning, use the optional --accumulo-buffer-size and --accumulo-max-latency parameters. See Accumulo's documentation for an explanation of the effects of these parameters.
In order to connect to an Accumulo instance, you must specify the location of a Zookeeper ensemble using the --accumulo-zookeepers parameter, the name of the Accumulo instance (--accumulo-instance), and the username and password to connect with (--accumulo-user and --accumulo-password respectively).
Table 26. Code generation arguments:
Argument                 Description
--bindir <dir>           Output directory for compiled objects
--class-name <name>      Sets the generated class name. This overrides --package-name.
                         When combined with --jar-file, sets the input class.
--jar-file <file>        Disable code generation; use specified jar
--outdir <dir>           Output directory for generated code
--package-name <name>    Put auto-generated classes in this package
--map-column-java <m>    Override default mapping from SQL type to Java type for configured columns.
10. sqoop-export
10.1. Purpose
10.2. Syntax
10.3. Inserts vs. Updates
10.4. Exports and Transactions
10.5. Failed Exports
10.6. Example Invocations
10.1. Purpose
The export tool exports a set of files from HDFS back to an
RDBMS. The target table must already exist in the database. The
input files are read and parsed into a set of records according to the
user-specified delimiters.
The default operation is to transform these into a set
of INSERT statements that inject the records into the database. In
"update mode," Sqoop will generate UPDATE statements that
replace existing records in the database, and in "call mode" Sqoop
will make a stored procedure call for each record.
10.2. Syntax
$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)
Although the Hadoop generic arguments must precede any export arguments, the export arguments can be entered in any order with respect to one another.
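For example, a basic export populating the bar table from files under /results/bar_data, and an update-mode variant that matches rows on the id column (names are illustrative):
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data --update-key id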
Table 27. Common arguments
Argument                             Description
--connect <jdbc-uri>                 Specify JDBC connect string
--connection-manager <class-name>    Specify connection manager class to use
--driver <class-name>                Manually specify JDBC driver class to use
--hadoop-mapred-home <dir>           Override $HADOOP_MAPRED_HOME
--help                               Print usage instructions
--password-file                      Set path for a file containing the authentication password
-P                                   Read password from console
--password <password>                Set authentication password
--username <username>                Set authentication username
--verbose                            Print more information while working
--connection-param-file <filename>   Optional properties file that provides connection parameters
--relaxed-isolation                  Set connection transaction isolation to read uncommitted for the mappers.
11.1. Purpose
Validate the data copied, for either import or export, by comparing the row counts from the source and the target after the copy.
11.2. Introduction
There are 3 basic interfaces: ValidationThreshold - Determines if
the error margin between the source and target are acceptable:
Absolute, Percentage Tolerant, etc. Default implementation is
AbsoluteValidationThreshold which ensures the row counts from
source and targets are the same.
ValidationFailureHandler - Responsible for handling failures: log an
error/warning, abort, etc. Default implementation is
LogOnFailureHandler that logs a warning message to the configured
logger.
Validator - Drives the validation logic by delegating the decision to
ValidationThreshold and delegating failure handling to
ValidationFailureHandler. The default implementation is
RowCountValidator which validates the row counts from source and
the target.
11.3. Syntax
$ sqoop import (generic-args) (import-args)
$ sqoop export (generic-args) (export-args)
Validation arguments are part of import and export arguments.
11.4. Configuration
The validation framework is extensible and pluggable. It comes with
default implementations but the interfaces can be extended to allow
custom implementations by passing them as part of the command
line arguments as described below.
Validator.
Property: validator
Description: Driver for validation,
must implement
org.apache.sqoop.validation.Validator
Supported values: The value has to be a fully qualified
class name.
Default value:
org.apache.sqoop.validation.RowCountValidator
Validation Threshold.
Property: validation-threshold
Description: Drives the decision based on the
validation meeting the
threshold or not. Must implement
org.apache.sqoop.validation.ValidationThreshold
Supported values: The value has to be a fully qualified
class name.
Default value:
org.apache.sqoop.validation.AbsoluteValidationThreshold
Validation Failure Handler.
Property: validation-failurehandler
Description: Responsible for handling failures, must
implement
org.apache.sqoop.validation.ValidationFailureHandler
Supported values: The value has to be a fully qualified
class name.
Default value:
org.apache.sqoop.validation.AbortOnFailureHandler
11.5. Limitations
Validation currently only validates data copied from a single table
into HDFS. The following are the limitations in the current
implementation:
all-tables option
free-form query option
Data imported into Hive, HBase or Accumulo
table import with --where argument
incremental imports
11.6. Example Invocations
A basic import of a table named EMPLOYEES in the corp database
that uses validation to validate the row counts:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp \
--table EMPLOYEES --validate
A basic export to populate a table named bar with validation
enabled:
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data --validate
Another example that overrides the validation args:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --validate --validator org.apache.sqoop.validation.RowCountValidator \
    --validation-threshold org.apache.sqoop.validation.AbsoluteValidationThreshold \
    --validation-failurehandler org.apache.sqoop.validation.AbortOnFailureHandler
13. sqoop-job
13.1. Purpose
13.2. Syntax
13.3. Saved jobs and passwords
13.4. Saved jobs and incremental imports
13.1. Purpose
The job tool allows you to create and work with saved
jobs. Saved jobs remember the parameters used to
specify a job, so they can be re-executed by invoking the job by its
handle.
If a saved job is configured to perform an incremental import, state
regarding the most recently imported rows is updated in the saved
job to allow the job to continually import only the newest rows.
13.2. Syntax
$ sqoop job (generic-args) (job-args) [-- [subtool-name]
(subtool-args)]
$ sqoop-job (generic-args) (job-args) [-- [subtool-name]
(subtool-args)]
Although the Hadoop generic arguments must precede any job arguments, the job arguments can be entered in any order with respect to one another.
Table 33. Job management options:
Argument            Description
--create <job-id>   Define a new saved job with the specified job-id (name). A second Sqoop
                    command-line, separated by a --, should be specified; this defines the saved job.
--delete <job-id>   Delete a saved job.
--exec <job-id>     Given a job defined with --create, run the saved job.
--show <job-id>     Show the parameters for a saved job.
--list              List all saved jobs
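For example, the lifecycle of a saved job might look like this (database and table names are illustrative):
$ sqoop job --create myjob -- import --connect jdbc:mysql://example.com/db --table mytable
$ sqoop job --list
$ sqoop job --show myjob
$ sqoop job --exec myjob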
14. sqoop-metastore
14.1. Purpose
14.2. Syntax
14.1. Purpose
The metastore tool configures Sqoop to host a shared metadata
repository. Multiple users and/or remote users can define and
execute saved jobs (created with sqoop job) defined in this
metastore.
Clients must be configured to connect to the metastore in sqoop-site.xml or with the --meta-connect argument.
14.2. Syntax
$ sqoop metastore (generic-args) (metastore-args)
$ sqoop-metastore (generic-args) (metastore-args)
Although the Hadoop generic arguments must precede any metastore arguments, the metastore arguments can be entered in any order with respect to one another.
Table 36. Metastore management options:
Argument Description
--shutdown Shuts down a running metastore instance on the same machine.
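For example, a client can create a saved job against a shared metastore via --meta-connect (hostname is illustrative; 16000 is the default metastore port):
$ sqoop job --create myjob \
    --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop \
    -- import --connect jdbc:mysql://example.com/db --table mytable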
15. sqoop-merge
15.1. Purpose
15.2. Syntax
15.1. Purpose
The merge tool allows you to combine two datasets where entries
in one dataset should overwrite entries of an older dataset. For
example, an incremental import run in last-modified mode will
generate multiple datasets in HDFS where successively newer data
appears in each dataset. The merge tool will "flatten" two datasets
into one, taking the newest available records for each primary key.
15.2. Syntax
$ sqoop merge (generic-args) (merge-args)
$ sqoop-merge (generic-args) (merge-args)
Although the Hadoop generic arguments must precede any merge arguments, the merge arguments can be entered in any order with respect to one another.
Table 37. Merge options:
Argument               Description
--class-name <class>   Specify the name of the record-specific class to use during the merge job.
--jar-file <file>      Specify the name of the jar to load the record class from.
--merge-key <col>      Specify the name of a column to use as the merge key.
--new-data <path>      Specify the path of the newer dataset.
--onto <path>          Specify the path of the older dataset.
--target-dir <path>    Specify the target path for the output of the merge job.
The merge tool runs a MapReduce job that takes two directories as
input: a newer dataset, and an older one. These are specified
with --new-data and --onto respectively. The output of the
MapReduce job will be placed in the directory in HDFS specified
by --target-dir.
When merging the datasets, it is assumed that there is a unique
primary key value in each record. The column for the primary key is
specified with --merge-key. Multiple rows in the same dataset
should not have the same primary key, or else data loss may occur.
To parse the dataset and extract the key column, the auto-generated class from a previous import must be used. You should specify the class name and jar file with --class-name and --jar-file. If this is not available, you can recreate the class using the codegen tool.
The merge tool is typically run after an incremental import with the
date-last-modified mode (sqoop import --incremental
lastmodified …).
Supposing two incremental imports were performed, where some
older data is in an HDFS directory named older and newer data is
in an HDFS directory named newer, these could be merged like so:
$ sqoop merge --new-data newer --onto older --target-dir
merged \
--jar-file datatypes.jar --class-name Foo --merge-key
id
This would run a MapReduce job where the value in the id column
of each row is used to join rows; rows in the newer dataset will be
used in preference to rows in the older dataset.
This can be used with SequenceFile-, Avro-, and text-based incremental imports. The file types of the newer and older datasets must be the same.
16. sqoop-codegen
16.1. Purpose
16.2. Syntax
16.3. Example Invocations
16.1. Purpose
The codegen tool generates Java classes which encapsulate and
interpret imported records. The Java definition of a record is
instantiated as part of the import process, but can also be
performed separately. For example, if Java source is lost, it can be
recreated. New versions of a class can be created which use
different delimiters between fields, and so on.
16.2. Syntax
$ sqoop codegen (generic-args) (codegen-args)
$ sqoop-codegen (generic-args) (codegen-args)
Although the Hadoop generic arguments must precede any codegen arguments, the codegen arguments can be entered in any order with respect to one another.
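For example, regenerating the record class for the employees table (connect string is illustrative):
$ sqoop codegen --connect jdbc:mysql://db.example.com/corp --table employees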
Table 38. Common arguments
Argument                             Description
--connect <jdbc-uri>                 Specify JDBC connect string
--connection-manager <class-name>    Specify connection manager class to use
--driver <class-name>                Manually specify JDBC driver class to use
--hadoop-mapred-home <dir>           Override $HADOOP_MAPRED_HOME
--help                               Print usage instructions
--password-file                      Set path for a file containing the authentication password
-P                                   Read password from console
--password <password>                Set authentication password
--username <username>                Set authentication username
--verbose                            Print more information while working
--connection-param-file <filename>   Optional properties file that provides connection parameters
--relaxed-isolation                  Set connection transaction isolation to read uncommitted for the mappers.
17. sqoop-create-hive-table
17.1. Purpose
17.2. Syntax
17.3. Example Invocations
17.1. Purpose
The create-hive-table tool populates a Hive metastore with a definition for a table based on a database table previously imported to HDFS, or one planned to be imported. This effectively performs the "--hive-import" step of sqoop-import without running the preceding import.
If data was already loaded to HDFS, you can use this tool to finish
the pipeline of importing the data to Hive. You can also create Hive
tables with this tool; data then can be imported and populated into
the target after a preprocessing step run by the user.
17.2. Syntax
$ sqoop create-hive-table (generic-args) (create-hive-
table-args)
$ sqoop-create-hive-table (generic-args) (create-hive-
table-args)
Although the Hadoop generic arguments must precede any create-hive-table arguments, the create-hive-table arguments can be entered in any order with respect to one another.
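For example, defining a Hive table emps based on the database table employees (names are illustrative):
$ sqoop create-hive-table --connect jdbc:mysql://db.example.com/corp \
    --table employees --hive-table emps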
Table 43. Common arguments
Argument                             Description
--connect <jdbc-uri>                 Specify JDBC connect string
--connection-manager <class-name>    Specify connection manager class to use
--driver <class-name>                Manually specify JDBC driver class to use
--hadoop-mapred-home <dir>           Override $HADOOP_MAPRED_HOME
--help                               Print usage instructions
--password-file                      Set path for a file containing the authentication password
-P                                   Read password from console
--password <password>                Set authentication password
--username <username>                Set authentication username
--verbose                            Print more information while working
--connection-param-file <filename>   Optional properties file that provides connection parameters
--relaxed-isolation                  Set connection transaction isolation to read uncommitted for the mappers.
18. sqoop-eval
18.1. Purpose
18.2. Syntax
18.3. Example Invocations
18.1. Purpose
The eval tool allows users to quickly run simple SQL queries
against a database; results are printed to the console. This allows
users to preview their import queries to ensure they import the
data they expect.
Warning
The eval tool is provided for evaluation purposes only. You can use it to verify a database connection from within Sqoop or to test simple queries. It is not supposed to be used in production workflows.
18.2. Syntax
$ sqoop eval (generic-args) (eval-args)
$ sqoop-eval (generic-args) (eval-args)
Although the Hadoop generic arguments must precede any eval arguments, the eval arguments can be entered in any order with respect to one another.
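For example, previewing ten records from the employees table (connect string is illustrative):
$ sqoop eval --connect jdbc:mysql://db.example.com/corp \
    --query "SELECT * FROM employees LIMIT 10"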
Table 46. Common arguments
Argument                             Description
--connect <jdbc-uri>                 Specify JDBC connect string
--connection-manager <class-name>    Specify connection manager class to use
--driver <class-name>                Manually specify JDBC driver class to use
--hadoop-mapred-home <dir>           Override $HADOOP_MAPRED_HOME
--help                               Print usage instructions
--password-file                      Set path for a file containing the authentication password
-P                                   Read password from console
--password <password>                Set authentication password
--username <username>                Set authentication username
--verbose                            Print more information while working
--connection-param-file <filename>   Optional properties file that provides connection parameters
--relaxed-isolation                  Set connection transaction isolation to read uncommitted for the mappers.
19. sqoop-list-databases
19.1. Purpose
19.2. Syntax
19.3. Example Invocations
19.1. Purpose
List database schemas present on a server.
19.2. Syntax
$ sqoop list-databases (generic-args) (list-databases-
args)
$ sqoop-list-databases (generic-args) (list-databases-
args)
Although the Hadoop generic arguments must precede any list-databases arguments, the list-databases arguments can be entered in any order with respect to one another.
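For example (connect string is illustrative):
$ sqoop list-databases --connect jdbc:mysql://database.example.com/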
Table 48. Common arguments
Argument                             Description
--connect <jdbc-uri>                 Specify JDBC connect string
--connection-manager <class-name>    Specify connection manager class to use
--driver <class-name>                Manually specify JDBC driver class to use
--hadoop-mapred-home <dir>           Override $HADOOP_MAPRED_HOME
--help                               Print usage instructions
--password-file                      Set path for a file containing the authentication password
-P                                   Read password from console
--password <password>                Set authentication password
--username <username>                Set authentication username
--verbose                            Print more information while working
--connection-param-file <filename>   Optional properties file that provides connection parameters
--relaxed-isolation                  Set connection transaction isolation to read uncommitted for the mappers.
Flume vs. Kafka vs. Kinesis - A Detailed Guide on Hadoop Ingestion Tools
As the amount of data available for systems to analyze increases by the day, the need for newer, faster ways to capture all this data in continuous streams is also arising. Apache Hadoop is possibly one of the most widely used frameworks for distributed storage and processing of Big Data sets. And with the help of various ingestion tools for Hadoop, it is now possible to capture raw sensor data as binary streams.
Three of the most popular Hadoop ingestion tools include Flume, Kafka and Kinesis. This post
aims at discussing the pros and cons of using each tool - from initial capturing of data to
monitoring and scaling.
Before we dive into this further, let us understand what a binary stream is. Most data that becomes available - user logs, logs from IoT devices, and so on - consists of streams of text events generated by some user action. This data can be broken into chunks based on the event that happened - a user clicks a button, changes a setting, and so on. A binary data stream is one in which, instead of breaking the stream down by events, the data is collected in a continuous stream at a specific rate. The ingestion tools in question capture this data and then push the serialized data out to Hadoop.
Now, back to the ingestion tools. Both Flume and Kafka are Apache projects, whereas Kinesis is a fully managed service provided by Amazon.
Apache Flume:
Flume also allows configuring multiple collector hosts for continued availability in case of a
collector failure.
Apache Kafka:
Kafka is gaining popularity in the enterprise space as the ingestion tool to use. A streaming
interface on Kafka is called a producer. Kafka also provides many producer implementations
and also lets you implement your own interface. With Kafka, you need to build your
consumer’s ability to plug into the data - there is no default monitoring implementation.
Scalability on Kafka is achieved by using partitions configured right inside the producer. Data is distributed across nodes in the cluster, and higher throughput requires more partitions. The tricky part here can be selecting the right partition scheme; generally, metadata from the source is used to partition the streams in a logical manner.
The best thing about Kafka is its resiliency via distributed replicas. These replicas do not affect throughput in any way, and Kafka is a hot favorite among most enterprises.
AWS Kinesis:
Kinesis is similar to Kafka in many ways. It is a fully managed service which integrates really
well with other AWS services. This makes it easy to scale and process incoming information.
Kinesis, unlike Flume and Kafka, only provides example implementations; there are no default producers available.
The one disadvantage Kinesis has compared to Kafka is that it is a cloud service. This introduces latency when communicating with an on-premise source, compared to an on-premise Kafka implementation.
The final choice of ingestion tool really depends on your use case. If you want a highly fault-tolerant, DIY solution and have developers to support it, Kafka is definitely the way to go. If you need something more out-of-the-box, use Kinesis or Flume. There again, choose wisely depending on how the data will be consumed: Kafka and Kinesis pull data, whereas Flume pushes it out using something called data sinks.
Apache Storm - also used for data streaming, but generally for shorter-term processing; it can be an add-on to your existing Hadoop environment
Chukwa (a Hadoop subproject) - devoted to large-scale log collection and analysis. It is built on top of HDFS and MapReduce and is highly scalable. It also includes a powerful monitoring toolkit
Streaming data gives a business the opportunity to identify real-time business value. Knowing
the big players and which one works best for your use case is a great enabler for you to make
the right architectural decisions.
Flume:
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating
and moving large amounts of log data from many different sources to a centralized data store.
The use of Apache Flume is not only restricted to log data aggregation. Since
data sources are customizable, Flume can be used to transport massive
quantities of event data including but not limited to network traffic data, social-
media-generated data, email messages and pretty much any data source
possible.
Apache Flume is a top level project at the Apache Software Foundation.
System Requirements
Architecture
Flume allows a user to build multi-hop flows where events travel through
multiple agents before reaching the final destination. It also allows fan-in and
fan-out flows, contextual routing and backup routes (fail-over) for failed hops.
Reliability
The events are staged in a channel on each agent. The events are then delivered to the next agent or terminal repository (like HDFS) in the flow. The events are removed from a channel only after they are stored in the channel of the next agent or in the terminal repository. This is how the single-hop message delivery semantics in Flume provide end-to-end reliability of the flow.
Flume uses a transactional approach to guarantee the reliable delivery of the
events. The sources and sinks encapsulate in a transaction the
storage/retrieval, respectively, of the events placed in or provided by a
transaction provided by the channel. This ensures that the set of events are
reliably passed from point to point in the flow. In the case of a multi-hop flow,
the sink from the previous hop and the source from the next hop both have
their transactions running to ensure that the data is safely stored in the
channel of the next hop.
Recoverability
The events are staged in the channel, which manages recovery from failure.
Flume supports a durable file channel which is backed by the local file system.
There's also a memory channel which simply stores the events in an in-memory queue, which is faster, but any events still left in the memory channel when an agent process dies can't be recovered.
Setup
Setting up an agent
Flume agent configuration is stored in a local configuration file. This is a text
file that follows the Java properties file format. Configurations for one or more
agents can be specified in the same configuration file. The configuration file
includes properties of each source, sink and channel in an agent and how they
are wired together to form data flows.
The agent needs to know what individual components to load and how they are
connected in order to constitute the flow. This is done by listing the names of
each of the sources, sinks and channels in the agent, and then specifying the
connecting channel for each sink and source. For example, an agent flows
events from an Avro source called avroWeb to HDFS sink hdfs-cluster1 via a
file channel called file-channel. The configuration file will contain names of
these components and file-channel as a shared channel for both avroWeb
source and hdfs-cluster1 sink.
Starting an agent
An agent is started using a shell script called flume-ng which is located in the
bin directory of the Flume distribution. You need to specify the agent name,
the config directory, and the config file on the command line:
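$ bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template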
Now the agent will start running source and sinks configured in the given
properties file.
A simple example
This configuration defines a single agent named a1. a1 has a source that
listens for data on port 44444, a channel that buffers event data in memory,
and a sink that logs event data to the console. The configuration file names the
various components, then describes their types and configuration parameters.
A given configuration file might define several named agents; when a given
Flume process is launched a flag is passed telling it which named agent to
manifest.
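Patterned on the single-node example in the Flume user guide, the configuration being described looks like this:
# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1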
Given this configuration file, we can start Flume as follows:
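$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console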
Note that in a full deployment we would typically include one more option: --conf=<conf-dir>. The <conf-dir> directory would include a shell script flume-env.sh and potentially a log4j properties file. In this example, we pass a Java option to force Flume to log to the console and we go without a custom environment script.
From a separate terminal, we can then telnet to port 44444 and send Flume an event:
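$ telnet localhost 44444
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
Hello world! <ENTER>
OK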
The original Flume terminal will output the event in a log message.
NB: this currently works for values only, not for keys (i.e., only on the "right side" of the = sign in the config lines).
This can be enabled via Java system properties on agent invocation by setting propertiesImplementation = org.apache.flume.node.EnvVarResolverProperties.
For example:
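a1.sources.r1.port = ${NC_PORT}

$ NC_PORT=44444 bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console -DpropertiesImplementation=org.apache.flume.node.EnvVarResolverProperties
(Here the netcat source's port is taken from the environment variable NC_PORT, following the user guide's example.)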
Logging the raw stream of data flowing through the ingest pipeline is not
desired behaviour in many production environments because this may result in
leaking sensitive data or security related configurations, such as secret keys, to
Flume log files. By default, Flume will not log such information. On the other
hand, if the data pipeline is broken, Flume will attempt to provide clues for
debugging the problem.
One way to debug problems with event pipelines is to set up an
additional Memory Channel connected to a Logger Sink, which will output all
event data to the Flume logs. In some situations, however, this approach is
insufficient.
In order to enable logging of event- and configuration-related data, some Java
system properties must be set in addition to log4j properties.
To enable configuration-related logging, set the Java system property -
Dorg.apache.flume.log.printconfig=true. This can either be passed on the
command line or by setting this in the JAVA_OPTS variable in flume-env.sh.
To enable data logging, set the Java system property -
Dorg.apache.flume.log.rawdata=true in the same way described above. For
most components, the log4j logging level must also be set to DEBUG or TRACE
to make event-specific logging appear in the Flume logs.
Here is an example of enabling both configuration logging and raw data logging
while also setting the Log4j loglevel to DEBUG for console output:
$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -
Dflume.root.logger=DEBUG,console -Dorg.apache.flume.log.printconfig=true -
Dorg.apache.flume.log.rawdata=true
- /flume
|- /a1 [Agent config file]
|- /a2 [Agent config file]
Once the configuration file is uploaded, start the agent with following options
$ bin/flume-ng agent –conf conf -z zkhost:2181,zkhost1:2181 -p
/flume –name a1 -Dflume.root.logger=INFO,console
Argument Name   Default   Description
z               -         Zookeeper connection string. Comma separated list of hostname:port
p               /flume    Base path in Zookeeper to store agent configurations
Flume has a fully plugin-based architecture. While Flume ships with many out-of-the-box sources, channels, sinks, serializers, and the like, many implementations exist which ship separately from Flume.
While it has always been possible to include custom Flume components by
adding their jars to the FLUME_CLASSPATH variable in the flume-env.sh file,
Flume now supports a special directory called plugins.d which automatically
picks up plugins that are packaged in a specific format. This allows for easier
management of plugin packaging issues as well as simpler debugging and
troubleshooting of several classes of issues, especially library dependency
conflicts.
The plugins.d directory
The plugins.d directory is located at $FLUME_HOME/plugins.d. At startup time,
the flume-ng start script looks in the plugins.d directory for plugins that
conform to the below format and includes them in proper paths when starting
up java.
plugins.d/
plugins.d/custom-source-1/
plugins.d/custom-source-1/lib/my-source.jar
plugins.d/custom-source-1/libext/spring-core-2.5.6.jar
plugins.d/custom-source-2/
plugins.d/custom-source-2/lib/custom.jar
plugins.d/custom-source-2/native/gettext.so
Data ingestion
RPC
An Avro client included in the Flume distribution can send a given file to Flume
Avro source using avro RPC mechanism:
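$ bin/flume-ng avro-client -H localhost -p 41414 -F /usr/logs/log.10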
The above command will send the contents of /usr/logs/log.10 to the Flume source listening on that port.
Executing commands
There's an exec source that executes a given command and consumes the output - a single 'line' of output, i.e. text followed by a carriage return ('\r') or line feed ('\n') or both together.
Network streams
Flume supports the following mechanisms to read data from popular log
stream types, such as:
1. Avro
2. Thrift
3. Syslog
4. Netcat
In order to flow the data across multiple agents or hops, the sink of the
previous agent and source of the current hop need to be avro type with the
sink pointing to the hostname (or IP address) and port of the source.
Consolidation
This can be achieved in Flume by configuring a number of first tier agents with
an avro sink, all pointing to an avro source of single agent (Again you could
use the thrift sources/sinks/clients in such a scenario). This source on the
second tier agent consolidates the received events into a single channel which
is consumed by a sink to its final destination.
Flume supports multiplexing the event flow to one or more destinations. This is
achieved by defining a flow multiplexer that can replicate or selectively route
an event to one or more channels.
The above example shows a source from agent “foo” fanning out the flow to
three different channels. This fan out can be replicating or multiplexing. In
case of replicating flow, each event is sent to all three channels. For the
multiplexing case, an event is delivered to a subset of available channels when
an event’s attribute matches a preconfigured value. For example, if an event
attribute called “txnType” is set to “customer”, then it should go to channel1
and channel3, if it’s “vendor” then it should go to channel2, otherwise
channel3. The mapping can be set in the agent’s configuration file.
Configuration
As mentioned in the earlier section, Flume agent configuration is read from a
file that resembles a Java property file format with hierarchical property
settings.
For example, an agent named agent_foo is reading data from an external avro
client and sending it to HDFS via a memory channel. The config file
weblog.config could look like:
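# list the sources, sinks and channels for the agent
agent_foo.sources = avro-AppSrv-source
agent_foo.sinks = hdfs-Cluster1-sink
agent_foo.channels = mem-channel-1

# set channel for source
agent_foo.sources.avro-AppSrv-source.channels = mem-channel-1

# set channel for sink
agent_foo.sinks.hdfs-Cluster1-sink.channel = mem-channel-1
(The component names follow the examples in the Flume user guide.)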
After defining the flow, you need to set properties of each source, sink and
channel. This is done in the same hierarchical namespace fashion where you
set the component type and other values for the properties specific to each
component:
The property “type” needs to be set for each component for Flume to
understand what kind of object it needs to be. Each source, sink and channel
type has its own set of properties required for it to function as intended. All
those need to be set as needed. In the previous example, we have a flow from
avro-AppSrv-source to hdfs-Cluster1-sink through the memory channel mem-
channel-1. Here’s an example that shows configuration of each of those
components:
agent_foo.sources = avro-AppSrv-source
agent_foo.sinks = hdfs-Cluster1-sink
agent_foo.channels = mem-channel-1
# properties of avro-AppSrv-source
agent_foo.sources.avro-AppSrv-source.type = avro
agent_foo.sources.avro-AppSrv-source.bind = localhost
agent_foo.sources.avro-AppSrv-source.port = 10000
# properties of mem-channel-1
agent_foo.channels.mem-channel-1.type = memory
agent_foo.channels.mem-channel-1.capacity = 1000
agent_foo.channels.mem-channel-1.transactionCapacity = 100
# properties of hdfs-Cluster1-sink
agent_foo.sinks.hdfs-Cluster1-sink.type = hdfs
agent_foo.sinks.hdfs-Cluster1-sink.hdfs.path =
hdfs://namenode/flume/webdata
#...
A single Flume agent can contain several independent flows. You can list
multiple sources, sinks and channels in a config. These components can be
linked to form multiple flows:
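# list the sources, sinks and channels for the agent
agent_foo.sources = avro-AppSrv-source1 exec-tail-source2
agent_foo.sinks = hdfs-Cluster1-sink1 avro-forward-sink2
agent_foo.channels = mem-channel-1 file-channel-2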
Then you can link the sources and sinks to their corresponding channels (for sources) or channel (for sinks) to set up two different flows. For example, if you need to set up two flows in an agent, one going from an external avro client to external HDFS and another from the output of a tail to an avro sink, then here's a config to do that:
# flow #1 configuration
agent_foo.sources.avro-AppSrv-source1.channels = mem-channel-1
agent_foo.sinks.hdfs-Cluster1-sink1.channel = mem-channel-1

# flow #2 configuration
agent_foo.sources.exec-tail-source2.channels = file-channel-2
agent_foo.sinks.avro-forward-sink2.channel = file-channel-2
To set up a multi-tier flow, you need to have an avro/thrift sink of the first hop pointing to an avro/thrift source of the next hop. This will result in the first Flume agent forwarding events to the next Flume agent. For example, if you are periodically sending files (1 file per event) using an avro client to a local Flume agent, then this local agent can forward them to another agent that has the storage mounted.
Weblog agent config:
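# weblog agent: avro source -> file channel -> avro sink pointing at the collector
agent_foo.sources = avro-AppSrv-source
agent_foo.sinks = avro-forward-sink
agent_foo.channels = file-channel

agent_foo.sources.avro-AppSrv-source.channels = file-channel
agent_foo.sinks.avro-forward-sink.channel = file-channel

# avro sink properties (the host and port of the next-hop agent are illustrative)
agent_foo.sinks.avro-forward-sink.type = avro
agent_foo.sinks.avro-forward-sink.hostname = 10.1.1.100
agent_foo.sinks.avro-forward-sink.port = 10000

HDFS agent config:
# hdfs agent: avro source -> memory channel -> hdfs sink
agent_foo.sources = avro-collection-source
agent_foo.sinks = hdfs-sink
agent_foo.channels = mem-channel

agent_foo.sources.avro-collection-source.channels = mem-channel
agent_foo.sinks.hdfs-sink.channel = mem-channel

# avro source properties
agent_foo.sources.avro-collection-source.type = avro
agent_foo.sources.avro-collection-source.bind = 10.1.1.100
agent_foo.sources.avro-collection-source.port = 10000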
Here we link the avro-forward-sink from the weblog agent to the avro-collection-source of the hdfs agent. This will result in the events coming from the external appserver source eventually getting stored in HDFS.
As discussed in the previous section, Flume supports fanning out the flow from one source to multiple channels. There are two modes of fan out, replicating and multiplexing. In the replicating flow, the event is sent to all the configured channels. In case of multiplexing, the event is sent to only a subset of qualifying channels. To fan out the flow, one needs to specify a list of channels for a source and the policy for fanning it out. This is done by adding a channel "selector" that can be replicating or multiplexing, and then specifying the selection rules if it's a multiplexer. If you don't specify a selector, then by default it's replicating:
<Agent>.sources.<Source1>.selector.type = replicating
The multiplexing selector has a further set of properties to bifurcate the flow. This requires specifying a mapping of an event attribute to a set of channels. The selector checks for each configured attribute in the event header. If it matches the specified value, then that event is sent to all the channels mapped to that value. If there's no match, then the event is sent to the set of channels configured as default:
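A sketch of such a selector configuration, matching the "State" example described next (source and channel names assumed):

agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
agent_foo.sources.avro-AppSrv-source1.selector.header = State
agent_foo.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping.NY = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.default = mem-channel-1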
The selector checks for a header called "State". If the value is "CA" then it's sent to mem-channel-1, if it's "AZ" then it goes to file-channel-2, or if it's "NY" then both. If the "State" header is not set or doesn't match any of the three, then it goes to mem-channel-1, which is designated as 'default'.
The selector also supports optional channels. To specify optional channels for a
header, the config parameter ‘optional’ is used in the following way:
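For instance, continuing the "State" example (channel names assumed from that example), the channels could be marked optional for "CA" events like this:

agent_foo.sources.avro-AppSrv-source1.selector.optional.CA = mem-channel-1 file-channel-2

A failure to write an event to an optional channel is simply ignored, whereas a failure on a required channel causes the transaction to be retried.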
SSL/TLS support
The global SSL parameters can be set via JSSE system properties, e.g.:

export JAVA_OPTS="$JAVA_OPTS -Djavax.net.ssl.keyStore=/path/to/keystore.jks"
export JAVA_OPTS="$JAVA_OPTS -Djavax.net.ssl.keyStorePassword=password"
Flume uses the system properties defined in JSSE (Java Secure Socket Extension), so this is a standard way for setting up SSL. On the other hand, specifying passwords in system properties means that the passwords can be seen in the process list. For cases where that is not acceptable, it is also possible to define the parameters in environment variables. In that case Flume initializes the JSSE system properties from the corresponding environment variables internally.
The SSL environment variables can either be set in the shell environment
before starting Flume or in conf/flume-env.sh. (Although, using the command
line is inadvisable because the commands including the passwords will be
saved to the command history.)
export FLUME_SSL_KEYSTORE_PATH=/path/to/keystore.jks
export FLUME_SSL_KEYSTORE_PASSWORD=password
Please note:
SSL must be enabled at component level. Specifying the global SSL parameters alone will not have any effect.
If the global SSL parameters are specified at multiple levels, the priority is the following (from higher to lower):
1. component parameters in agent config
2. system properties
3. environment variables
If SSL is enabled for a component, but the SSL parameters are not specified in any of the ways described above, then
in case of keystores: configuration error
in case of truststores: the default truststore will be used (jssecacerts / cacerts in Oracle JDK)
The truststore password is optional in all cases. If not specified, then no integrity check will be performed on the truststore when it is opened by the JDK.
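For example, enabling SSL on a single Avro source at component level might look like this (paths and passwords are placeholders):

a1.sources.r1.type = avro
a1.sources.r1.ssl = true
a1.sources.r1.keystore = /path/to/keystore.jks
a1.sources.r1.keystore-password = password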
Sources and sinks can have a batch size parameter that determines the
maximum number of events they process in one batch. This happens within a
channel transaction that has an upper limit called transaction capacity. Batch
size must be smaller than the channel’s transaction capacity. There is an
explicit check to prevent incompatible settings. This check happens whenever
the configuration is read.
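As an illustration, the following sketch keeps the sink batch size within the channel's transaction capacity (values are arbitrary):

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
# sink batch size kept below the channel's transactionCapacity
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.batchSize = 100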
Flume Sources
Avro Source
Listens on Avro port and receives events from external Avro client streams.
When paired with the built-in Avro Sink on another (previous hop) Flume
agent, it can create tiered collection topologies. Required properties are
in bold.
Property Name          Default  Description
channels               –
type                   –        The component type name, needs to be avro
bind                   –        hostname or IP address to listen on
port                   –        Port # to bind to
threads                –        Maximum number of worker threads to spawn
selector.type
selector.*
interceptors           –        Space-separated list of interceptors
interceptors.*
compression-type       none     This can be "none" or "deflate". The compression-type must match the compression-type of the matching AvroSink.
ssl                    false    Set this to true to enable SSL encryption. If SSL is enabled, you must also specify a "keystore" and a "keystore-password", either through component level parameters (see below) or as global SSL parameters (see the SSL/TLS support section).
keystore               –        This is the path to a Java keystore file. If not specified here, then the global keystore will be used (if defined, otherwise configuration error).
keystore-password      –        The password for the Java keystore. If not specified here, then the global keystore password will be used (if defined, otherwise configuration error).
keystore-type          JKS      The type of the Java keystore. This can be "JKS" or "PKCS12". If not specified here, then the global keystore type will be used (if defined, otherwise the default is JKS).
exclude-protocols      SSLv3    Space-separated list of SSL/TLS protocols to exclude. SSLv3 will always be excluded in addition to the protocols specified.
include-protocols      –        Space-separated list of SSL/TLS protocols to include. The enabled protocols will be the included protocols without the excluded protocols. If included-protocols is empty, it includes every supported protocol.
exclude-cipher-suites  –        Space-separated list of cipher suites to exclude.
include-cipher-suites  –        Space-separated list of cipher suites to include. The enabled cipher suites will be the included cipher suites without the excluded cipher suites. If included-cipher-suites is empty, it includes every supported cipher suite.
ipFilter               false    Set this to true to enable ipFiltering for netty.
ipFilterRules          –        Define N netty ipFilter pattern rules with this config.
Example for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
Example of ipFilterRules
ipFilterRules defines N netty ipFilters, separated by commas. A pattern rule must be in this format:
<'allow' or 'deny'>:<'ip' or 'name' for computer name>:<pattern>, i.e. allow/deny:ip/name:pattern
example: ipFilterRules=allow:ip:127.*,allow:name:localhost,deny:ip:*
Note that the first rule to match applies, as the examples below show for a client on the localhost.
"allow:name:localhost,deny:ip:*" will allow the client on localhost and deny clients from any other ip, whereas "deny:name:localhost,allow:ip:*" will deny the client on localhost and allow clients from any other ip.
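Wired into an Avro source config, this might look like the following sketch (rules as in the example above):

a1.sources.r1.type = avro
a1.sources.r1.ipFilter = true
a1.sources.r1.ipFilterRules = allow:ip:127.*,allow:name:localhost,deny:ip:*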
Thrift Source
Listens on Thrift port and receives events from external Thrift client streams.
When paired with the built-in ThriftSink on another (previous hop) Flume
agent, it can create tiered collection topologies. Thrift source can be configured
to start in secure mode by enabling kerberos authentication. agent-principal
and agent-keytab are the properties used by the Thrift source to authenticate
to the kerberos KDC. Required properties are in bold.
Property Name          Default  Description
channels               –
type                   –        The component type name, needs to be thrift
bind                   –        hostname or IP address to listen on
port                   –        Port # to bind to
threads                –        Maximum number of worker threads to spawn
selector.type
selector.*
interceptors           –        Space-separated list of interceptors
interceptors.*
ssl                    false    Set this to true to enable SSL encryption. If SSL is enabled, you must also specify a "keystore" and a "keystore-password", either through component level parameters (see below) or as global SSL parameters (see the SSL/TLS support section).
keystore               –        This is the path to a Java keystore file. If not specified here, then the global keystore will be used (if defined, otherwise configuration error).
keystore-password      –        The password for the Java keystore. If not specified here, then the global keystore password will be used (if defined, otherwise configuration error).
keystore-type          JKS      The type of the Java keystore. This can be "JKS" or "PKCS12". If not specified here, then the global keystore type will be used (if defined, otherwise the default is JKS).
exclude-protocols      SSLv3    Space-separated list of SSL/TLS protocols to exclude. SSLv3 will always be excluded in addition to the protocols specified.
include-protocols      –        Space-separated list of SSL/TLS protocols to include. The enabled protocols will be the included protocols without the excluded protocols. If included-protocols is empty, it includes every supported protocol.
exclude-cipher-suites  –        Space-separated list of cipher suites to exclude.
include-cipher-suites  –        Space-separated list of cipher suites to include. The enabled cipher suites will be the included cipher suites without the excluded cipher suites.
kerberos               false    Set to true to enable kerberos authentication. In kerberos mode, agent-principal and agent-keytab are required for successful authentication. The Thrift source in secure mode will accept connections only from Thrift clients that have kerberos enabled and are successfully authenticated to the kerberos KDC.
agent-principal        –        The kerberos principal used by the Thrift Source to authenticate to the kerberos KDC.
agent-keytab           –        The keytab location used by the Thrift Source in combination with the agent-principal to authenticate to the kerberos KDC.
Example for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = thrift
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
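A sketch of the same source in secure mode (the principal, keytab path and realm are placeholders):

a1.sources.r1.kerberos = true
a1.sources.r1.agent-principal = flume/thrifthost.example.com@EXAMPLE.COM
a1.sources.r1.agent-keytab = /path/to/keytabs/flume.keytab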
Exec Source
Exec source runs a given Unix command on start-up and expects that process to continuously produce data on standard out (stderr is simply discarded, unless the property logStdErr is set to true). If the process exits for any reason, the source also exits and will produce no further data. This means configurations such as cat [named pipe] or tail -F [file] are going to produce the desired results, whereas date probably will not - the former two commands produce streams of data whereas the latter produces a single event and exits.
Required properties are in bold.
Property Name    Default      Description
channels         –
type             –            The component type name, needs to be exec
command          –            The command to execute
shell            –            A shell invocation used to run the command, e.g. /bin/sh -c. Required only for commands relying on shell features like wildcards, back ticks, pipes etc.
restartThrottle  10000        Amount of time (in millis) to wait before attempting a restart
restart          false        Whether the executed cmd should be restarted if it dies
logStdErr        false        Whether the command's stderr should be logged
batchSize        20           The max number of lines to read and send to the channel at a time
batchTimeout     3000         Amount of time (in milliseconds) to wait, if the buffer size was not reached, before data is pushed downstream
selector.type    replicating  replicating or multiplexing
selector.*                    Depends on the selector.type value
interceptors     –            Space-separated list of interceptors
interceptors.*
Warning
The problem with ExecSource and other asynchronous sources is that the source cannot guarantee that if there is a failure to put the event into the Channel the client knows about it. In such cases, the data will be lost. For instance, one of the most commonly requested features is the tail -F [file]-like use case where an application writes to a log file on disk and Flume tails the file, sending each line as an event. While this is possible, there's an obvious problem; what happens if the channel fills up and Flume can't send an event? Flume has no way of indicating to the application writing the log file that it needs to retain the log or that the event hasn't been sent, for some reason. If this doesn't make sense, you need only know this: your application can never guarantee data has been received when using a unidirectional asynchronous interface such as ExecSource. As an extension of this warning, there is zero guarantee of event delivery when using this source. For stronger reliability guarantees, consider the Spooling Directory Source, Taildir Source or direct integration with Flume via the SDK.
Example for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1
The ‘shell’ config is used to invoke the ‘command’ through a command shell
(such as Bash or Powershell). The ‘command’ is passed as an argument to
‘shell’ for execution. This allows the ‘command’ to use features from the shell
such as wildcards, back ticks, pipes, loops, conditionals etc. In the absence of
the ‘shell’ config, the ‘command’ will be invoked directly. Common values for
‘shell’ : ‘/bin/sh -c’, ‘/bin/ksh -c’, ‘cmd /c’, ‘powershell -Command’, etc.
a1.sources.tailsource-1.type = exec
a1.sources.tailsource-1.shell = /bin/bash -c
a1.sources.tailsource-1.command = for i in /path/*.txt; do cat $i; done
JMS Source
JMS Source reads messages from a JMS destination such as a queue or topic.
Being a JMS application it should work with any JMS provider, but it has only been tested with ActiveMQ. The JMS source provides configurable batch size, message selector, user/pass, and message-to-flume-event converter. Note that the vendor-provided JMS jars should be included in the Flume classpath using the plugins.d directory (preferred), the --classpath option on the command line, or via the FLUME_CLASSPATH variable in flume-env.sh.
Required properties are in bold.
Property Name              Default  Description
channels                   –
type                       –        The component type name, needs to be jms
initialContextFactory      –        Initial Context Factory, e.g: org.apache.activemq.jndi.ActiveMQInitialContextFactory
connectionFactory          –        The JNDI name the connection factory should appear as
providerURL                –        The JMS provider URL
destinationName            –        Destination name
destinationType            –        Destination type (queue or topic)
messageSelector            –        Message selector to use when creating the consumer
userName                   –        Username for the destination/provider
passwordFile               –        File containing the password for the destination/provider
batchSize                  100      Number of messages to consume in one batch
converter.type             DEFAULT  Class to use to convert messages to flume events. See below.
converter.*                –        Converter properties.
converter.charset          UTF-8    Default converter only. Charset to use when converting JMS TextMessages to byte arrays.
createDurableSubscription  false    Whether to create a durable subscription. Durable subscription can only be used with destinationType topic. If true, "clientId" and "durableSubscriptionName" have to be specified.
clientId                   –        JMS client identifier set on Connection right after it is created. Required for durable subscriptions.
durableSubscriptionName    –        Name used to identify the durable subscription. Required for durable subscriptions.
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = jms
a1.sources.r1.channels = c1
a1.sources.r1.initialContextFactory =
org.apache.activemq.jndi.ActiveMQInitialContextFactory
a1.sources.r1.connectionFactory = GenericConnectionFactory
a1.sources.r1.providerURL = tcp://mqserver:61616
a1.sources.r1.destinationName = BUSINESS_DATA
a1.sources.r1.destinationType = QUEUE
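For a durable subscription on a topic, a sketch might add the following (the destination, clientId and subscription name are placeholders):

a1.sources.r1.destinationName = BUSINESS_EVENTS
a1.sources.r1.destinationType = TOPIC
a1.sources.r1.createDurableSubscription = true
a1.sources.r1.clientId = flume-jms-client-1
a1.sources.r1.durableSubscriptionName = flumeDurableSub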
Spooling Directory Source
This source lets you ingest data by placing files to be ingested into a "spooling" directory on disk. This source will watch the specified directory for new files, and will parse events out of new files as they appear. The event parsing logic is pluggable. After a given file has been fully read into the channel, completion is indicated by renaming the file (the default), by deleting it, or by recording it in the trackerDir, depending on configuration.
Unlike the Exec source, this source is reliable and will not miss data, even if
Flume is restarted or killed. In exchange for this reliability, only immutable,
uniquely-named files must be dropped into the spooling directory. Flume tries
to detect these problem conditions and will fail loudly if they are violated:
1. If a file is written to after being placed into the spooling directory, Flume
will print an error to its log file and stop processing.
2. If a file name is reused at a later time, Flume will print an error to its log
file and stop processing.
To avoid the above issues, it may be useful to add a unique identifier (such as
a timestamp) to log file names when they are moved into the spooling
directory.
Despite the reliability guarantees of this source, there are still cases in which
events may be duplicated if certain downstream failures occur. This is
consistent with the guarantees offered by other Flume components.
Property Name             Default      Description
channels                  –
type                      –            The component type name, needs to be spooldir.
spoolDir                  –            The directory from which to read files.
fileSuffix                .COMPLETED   Suffix to append to completely ingested files.
deletePolicy              never        When to delete completed files: never or immediate.
fileHeader                false        Whether to add a header storing the absolute path filename.
fileHeaderKey             file         Header key to use when appending absolute path filename to event header.
basenameHeader            false        Whether to add a header storing the basename of the file.
basenameHeaderKey         basename     Header key to use when appending basename of file to event header.
includePattern            ^.*$         Regular expression specifying which files to include. It can be used together with ignorePattern. If a file matches both ignorePattern and includePattern regex, the file is ignored.
ignorePattern             ^$           Regular expression specifying which files to ignore (skip). It can be used together with includePattern. If a file matches both ignorePattern and includePattern regex, the file is ignored.
trackerDir                .flumespool  Directory to store metadata related to processing of files. If this path is not an absolute path, then it is interpreted as relative to the spoolDir.
trackingPolicy            rename       The tracking policy defines how file processing is tracked. It can be "rename" or "tracker_dir". This parameter is only effective if the deletePolicy is "never". "rename" - After processing, files get renamed according to the fileSuffix parameter. "tracker_dir" - Files are not renamed but a new empty file is created in the trackerDir. The new tracker file name is derived from the ingested one plus the fileSuffix.
consumeOrder              oldest       The order in which files in the spooling directory will be consumed: oldest, youngest or random. In case of oldest and youngest, the last modified time of the files will be used to compare the files. In case of a tie, the file with the smallest lexicographical order will be consumed first. In case of random, any file will be picked randomly. When using oldest and youngest, the whole directory will be scanned to pick the oldest/youngest file, which might be slow if there are a large number of files, while using random may cause old files to be consumed very late if new files keep coming in the spooling directory.
pollDelay                 500          Delay (in milliseconds) used when polling for new files.
recursiveDirectorySearch  false        Whether to monitor sub directories for new files to read.
maxBackoff                4000         The maximum time (in millis) to wait between consecutive attempts to write to the channel(s) if the channel is full. The source will start at a low backoff and increase it exponentially each time the channel throws a ChannelException, up to the value specified by this parameter.
batchSize                 100          Granularity at which to batch transfer to the channel.
inputCharset              UTF-8        Character set used by deserializers that treat the input file as text.
decodeErrorPolicy         FAIL         What to do when we see a non-decodable character in the input file. FAIL: Throw an exception and fail to parse the file. REPLACE: Replace the unparseable character with the "replacement character" char, typically Unicode U+FFFD. IGNORE: Drop the unparseable character sequence.
deserializer              LINE         Specify the deserializer used to parse the file into events. Defaults to parsing each line as an event. The class specified must implement EventDeserializer.Builder.
deserializer.*                         Varies per event deserializer.
bufferMaxLines            –            (Obsolete) This option is now ignored.
bufferMaxLineLength       5000         (Deprecated) Maximum length of a line in the commit buffer. Use deserializer.maxLineLength instead.
selector.type             replicating  replicating or multiplexing
selector.*                             Depends on the selector.type value
interceptors              –            Space-separated list of interceptors
interceptors.*
Example for an agent named a1:
a1.channels = ch-1
a1.sources = src-1
a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
a1.sources.src-1.fileHeader = true
Event Deserializers
The following event deserializers ship with Flume.
LINE
This deserializer generates one event per line of text input.
AVRO
This deserializer is able to read an Avro container file, and it generates one
event per Avro record in the file. Each event is annotated with a header that
indicates the schema used. The body of the event is the binary Avro record
data, not including the schema or the rest of the container file elements.
Note that if the spool directory source must retry putting one of these events
onto a channel (for example, because the channel is full), then it will reset and
retry from the most recent Avro container file sync point. To reduce potential
event duplication in such a failure scenario, write sync markers more
frequently in your Avro input files.
BlobDeserializer
This deserializer reads a Binary Large Object (BLOB) per event, typically one
BLOB per file. For example a PDF or JPG file. Note that this approach is not
suitable for very large objects because the entire BLOB is buffered in RAM.
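A sketch of selecting this deserializer on a spooling directory source; the builder class name below is the one shipped in the morphline solr sink module, and the length limit is illustrative:

a1.sources.src-1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
a1.sources.src-1.deserializer.maxBlobLength = 100000000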
Taildir Source
Note
This source is provided as a preview feature. It does not work on Windows.
Watch the specified files, and tail them in near real-time once new lines are detected appended to each file. If the new lines are still being written, this source will retry reading them while waiting for the write to complete.
This source is reliable and will not miss data even when the tailed files rotate. It periodically writes the last read position of each file to the given position file in JSON format. If Flume is stopped or down for some reason, it can restart tailing from the position written in the existing position file.
In other use cases, this source can also start tailing from an arbitrary position for each file using the given position file. When there is no position file on the specified path, it will start tailing from the first line of each file by default.
Files will be consumed in order of their modification time. The file with the oldest modification time will be consumed first.
This source does not rename or delete or do any modifications to the file being tailed. Currently this source does not support tailing binary files. It reads text files line by line.
Property Name                        Default                         Description
channels                             –
type                                 –                               The component type name, needs to be TAILDIR.
filegroups                           –                               Space-separated list of file groups. Each file group indicates a set of files to be tailed.
filegroups.<filegroupName>           –                               Absolute path of the file group. Regular expression (and not file system patterns) can be used for filename only.
positionFile                         ~/.flume/taildir_position.json  File in JSON format to record the inode, the absolute path and the last position of each tailing file.
headers.<filegroupName>.<headerKey>  –                               Header value which is set with the header key. Multiple headers can be specified for one file group.
byteOffsetHeader                     false                           Whether to add the byte offset of a tailed line to a header called 'byteoffset'.
skipToEnd                            false                           Whether to skip the position to EOF in the case of files not written on the position file.
idleTimeout                          120000                          Time (ms) to close inactive files. If the closed file has new lines appended to it, this source will automatically re-open it.
writePosInterval                     3000                            Interval time (ms) to write the last position of each file on the position file.
batchSize                            100                             Max number of lines to read and send to the channel at a time. Using the default is usually fine.
maxBatchCount                        Long.MAX_VALUE                  Controls the number of batches being read consecutively from the same file. If the source is tailing multiple files and one of them is written at a fast rate, it can prevent other files from being processed, because the busy file would be read in an endless loop. In this case lower this value.
backoffSleepIncrement                1000                            The increment for time delay before reattempting to poll for new data, when the last attempt did not find any new data.
maxBackoffSleep                      5000                            The max time delay between each reattempt to poll for new data, when the last attempt did not find any new data.
cachePatternMatching                 true                            Listing directories and applying the filename regex pattern may be time consuming for directories containing thousands of files. Caching the list of matching files can improve performance. The order in which files are consumed will also be cached. Requires that the file system keeps track of modification times with at least a 1-second granularity.
fileHeader                           false                           Whether to add a header storing the absolute path filename.
fileHeaderKey                        file                            Header key to use when appending absolute path filename to event header.
Example for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /var/log/test1/example.log
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.filegroups.f2 = /var/log/test2/.*log.*
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.headers.f2.headerKey2 = value2-2
a1.sources.r1.fileHeader = true
a1.sources.r1.maxBatchCount = 1000
Twitter 1% firehose Source (experimental)
Warning
This source is highly experimental and may change between minor versions of Flume. Use at your own risk.
Experimental source that connects via the Streaming API to the 1% sample twitter firehose, continuously downloads tweets, converts them to Avro format and sends Avro events to a downstream Flume sink. Requires the consumer and access tokens and secrets of a Twitter developer account. Required properties are in bold.
Property Name           Default  Description
channels                –
type                    –        The component type name, needs to be org.apache.flume.source.twitter.TwitterSource
consumerKey             –        OAuth consumer key
consumerSecret          –        OAuth consumer secret
accessToken             –        OAuth access token
accessTokenSecret       –        OAuth token secret
maxBatchSize            1000     Maximum number of twitter messages to put in a single batch
maxBatchDurationMillis  1000     Maximum number of milliseconds to wait before closing a batch
Example for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = org.apache.flume.source.twitter.TwitterSource
a1.sources.r1.channels = c1
a1.sources.r1.consumerKey = YOUR_TWITTER_CONSUMER_KEY
a1.sources.r1.consumerSecret = YOUR_TWITTER_CONSUMER_SECRET
a1.sources.r1.accessToken = YOUR_TWITTER_ACCESS_TOKEN
a1.sources.r1.accessTokenSecret = YOUR_TWITTER_ACCESS_TOKEN_SECRET
a1.sources.r1.maxBatchSize = 10
a1.sources.r1.maxBatchDurationMillis = 200
Kafka Source
Kafka Source is an Apache Kafka consumer that reads messages from Kafka
topics. If you have multiple Kafka sources running, you can configure them
with the same Consumer Group so each will read a unique set of partitions for
the topics. This currently supports Kafka server releases 0.10.1.0 or higher.
Testing was done up to 2.0.1, which was the highest available version at the time of the release.
Note
The Kafka Source overrides two Kafka consumer parameters: auto.commit.enable is set to "false" by the source, and every batch is committed. The Kafka source guarantees an at-least-once strategy of message retrieval. The duplicates can be present when the source starts. The Kafka Source also provides defaults for the key.deserializer (org.apache.kafka.common.serialization.StringSerializer) and value.deserializer (org.apache.kafka.common.serialization.ByteArraySerializer). Modification of these parameters is not recommended.
Deprecated Properties
Property Name            Default  Description
topic                    –        Use kafka.topics
groupId                  flume    Use kafka.consumer.group.id
zookeeperConnect         –        Is no longer supported by the kafka consumer client since 0.9.x. Use kafka.bootstrap.servers to establish connection with the kafka cluster.
migrateZookeeperOffsets  true     When no Kafka stored offset is found, look up the offsets in Zookeeper and commit them to Kafka. This should be true to support seamless Kafka client migration from older versions of Flume. Once migrated this can be set to false, though that should generally not be required. If no Zookeeper offset is found, the Kafka configuration kafka.consumer.auto.offset.reset defines how offsets are handled. Check Kafka documentation for details.
Example for topic subscription by comma-separated topic list.
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.channels = channel1
tier1.sources.source1.batchSize = 5000
tier1.sources.source1.batchDurationMillis = 2000
tier1.sources.source1.kafka.bootstrap.servers = localhost:9092
tier1.sources.source1.kafka.topics = test1, test2
tier1.sources.source1.kafka.consumer.group.id = custom.g.id
Example for topic subscription by regex:
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.channels = channel1
tier1.sources.source1.kafka.bootstrap.servers = localhost:9092
tier1.sources.source1.kafka.topics.regex = ^topic[0-9]$
# the default kafka.consumer.group.id=flume is used
Warning
There is a performance degradation when SSL is enabled, the magnitude of which depends on the CPU type and the JVM implementation. Reference: Kafka security overview and the jira for tracking this issue: KAFKA-2561.
Example configuration with server side authentication and data encryption:
a1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.source1.kafka.bootstrap.servers = kafka-1:9093,kafka-2:9093,kafka-3:9093
a1.sources.source1.kafka.topics = mytopic
a1.sources.source1.kafka.consumer.group.id = flume-consumer
a1.sources.source1.kafka.consumer.security.protocol = SSL
# optional, the global truststore can be used alternatively
a1.sources.source1.kafka.consumer.ssl.truststore.location = /path/to/truststore.jks
a1.sources.source1.kafka.consumer.ssl.truststore.password = <password to access the truststore>
Specifying the truststore is optional here; the global truststore can be used instead. For more details about the global SSL setup, see the SSL/TLS support section.
Note: By default the property ssl.endpoint.identification.algorithm is not defined, so hostname verification is not performed. In order to enable hostname verification, set the following property:
a1.sources.source1.kafka.consumer.ssl.endpoint.identification.algorithm = HTTPS
Once enabled, clients will verify the server’s fully qualified domain name
(FQDN) against one of the following two fields:
1. Common Name (CN) https://tools.ietf.org/html/rfc6125#section-2.3
2. Subject Alternative Name
(SAN) https://tools.ietf.org/html/rfc5280#section-4.2.1.6
If client side authentication is also required then additionally the following
needs to be added to Flume agent configuration or the global SSL setup can be
used (see SSL/TLS support section). Each Flume agent has to have its client
certificate which has to be trusted by Kafka brokers either individually or by
their signature chain. Common example is to sign each client certificate by a
single Root CA which in turn is trusted by Kafka brokers.
# optional, the global keystore can be used alternatively
a1.sources.source1.kafka.consumer.ssl.keystore.location = /path/to/client.keystore.jks
a1.sources.source1.kafka.consumer.ssl.keystore.password = <password to access the keystore>
# needed only if the key is protected with a password different from the keystore password
a1.sources.source1.kafka.consumer.ssl.key.password = <password to access the key>
For Kerberos, the JAAS file's location (and optionally the system-wide kerberos configuration) can be set in flume-env.sh:
JAVA_OPTS="$JAVA_OPTS -Djava.security.krb5.conf=/path/to/krb5.conf"
JAVA_OPTS="$JAVA_OPTS -Djava.security.auth.login.config=/path/to/flume_jaas.conf"
Example secure configuration using SASL_PLAINTEXT:
a1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.source1.kafka.bootstrap.servers = kafka-1:9093,kafka-2:9093,kafka-3:9093
a1.sources.source1.kafka.topics = mytopic
a1.sources.source1.kafka.consumer.group.id = flume-consumer
a1.sources.source1.kafka.consumer.security.protocol = SASL_PLAINTEXT
a1.sources.source1.kafka.consumer.sasl.mechanism = GSSAPI
a1.sources.source1.kafka.consumer.sasl.kerberos.service.name = kafka
Example secure configuration using SASL_SSL:
a1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.source1.kafka.bootstrap.servers = kafka-1:9093,kafka-2:9093,kafka-3:9093
a1.sources.source1.kafka.topics = mytopic
a1.sources.source1.kafka.consumer.group.id = flume-consumer
a1.sources.source1.kafka.consumer.security.protocol = SASL_SSL
a1.sources.source1.kafka.consumer.sasl.mechanism = GSSAPI
a1.sources.source1.kafka.consumer.sasl.kerberos.service.name = kafka
# optional, the global truststore can be used alternatively
a1.sources.source1.kafka.consumer.ssl.truststore.location = /path/to/truststore.jks
a1.sources.source1.kafka.consumer.ssl.truststore.password = <password to access the truststore>
Sample JAAS file. For reference of its content please see client config sections
of the desired authentication mechanism (GSSAPI/PLAIN) in Kafka
documentation of SASL configuration. Since the Kafka Source may also connect
to Zookeeper for offset migration, the “Client” section was also added to this
example. This won’t be needed unless you require offset migration, or you
require this section for other secure components. Also please make sure that
the operating system user of the Flume processes has read privileges on the
jaas and keytab files.
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
keyTab="/path/to/keytabs/flume.keytab"
principal="flume/flumehost1.example.com@YOURKERBEROSREALM";
};
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
keyTab="/path/to/keytabs/flume.keytab"
principal="flume/flumehost1.example.com@YOURKERBEROSREALM";
};
NetCat TCP Source
A netcat-like source that listens on a given port and turns each line of text into an event. Acts like nc -k -l [host] [port]. In other words, it opens a specified port and listens for data. The expectation is that the supplied data is newline separated text. Each line of text is turned into a Flume event and sent via the connected channel.
Required properties are in bold.
Property Name    Default      Description
channels         –
type             –            The component type name, needs to be netcat
bind             –            Host name or IP address to bind to
port             –            Port # to bind to
max-line-length  512          Max line length per event body (in bytes)
ack-every-event  true         Respond with an "OK" for every event received
selector.type    replicating  replicating or multiplexing
selector.*                    Depends on the selector.type value
interceptors     –            Space-separated list of interceptors
interceptors.*
Example for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1
NetCat UDP Source
As per the original NetCat (TCP) source, this source listens on a given port and turns each line of text into an event that is sent via the connected channel. Acts like nc -u -k -l [host] [port].
Required properties are in bold.
Property Name        Default      Description
channels             –
type                 –            The component type name, needs to be netcatudp
bind                 –            Host name or IP address to bind to
port                 –            Port # to bind to
remoteAddressHeader  –
selector.type        replicating  replicating or multiplexing
selector.*                        Depends on the selector.type value
interceptors         –            Space-separated list of interceptors
interceptors.*
Example for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = netcatudp
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1
Sequence Generator Source
A simple sequence generator that continuously generates events with a counter that starts from 0 and increments by 1; useful mainly for testing. Example for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = seq
a1.sources.r1.channels = c1
Syslog Sources
Reads syslog data and generates Flume events. The UDP source treats an entire message as a single event. The TCP sources create a new event for each string of characters separated by a newline ('\n').
Required properties are in bold.
Syslog TCP Source
For example, a syslog TCP source for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 5140
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1
Multiport Syslog TCP Source
This is a newer, faster, multi-port capable version of the Syslog TCP source.
Note that the ports configuration setting has replaced port. Multi-port
capability means that it can listen on many ports at once in an efficient
manner. This source uses the Apache Mina library to do that. Provides support
for RFC-3164 and many common RFC-5424 formatted messages. Also provides
the capability to configure the character set used on a per-port basis.
Property Name          Default          Description
channels               –
type                   –                The component type name, needs to be multiport_syslogtcp
host                   –                Host name or IP address to bind to.
ports                  –                Space-separated list (one or more) of ports to bind to.
eventSize              2500             Maximum size of a single event line, in bytes.
keepFields             none             Setting this to 'all' will preserve the Priority, Timestamp and Hostname in the body of the event. A space-separated list of fields to include is allowed as well. Currently, the following fields can be included: priority, version, timestamp, hostname. The values 'true' and 'false' have been deprecated in favor of 'all' and 'none'.
portHeader             –                If specified, the port number will be stored in the header of each event using the header name specified here. This allows for interceptors and channel selectors to customize routing logic based on the incoming port.
clientIPHeader         –                If specified, the IP address of the client will be stored in the header of each event using the header name specified here. This allows for interceptors and channel selectors to customize routing logic based on the IP address of the client. Do not use the standard Syslog header names here (like _host_) because the event header will be overridden in that case.
clientHostnameHeader   –                If specified, the host name of the client will be stored in the header of each event using the header name specified here. This allows for interceptors and channel selectors to customize routing logic based on the host name of the client. Retrieving the host name may involve a name service reverse lookup which may affect the performance. Do not use the standard Syslog header names here (like _host_) because the event header will be overridden in that case.
charset.default        UTF-8            Default character set used while parsing syslog events into strings.
charset.port.<port>    –                Character set is configurable on a per-port basis.
batchSize              100              Maximum number of events to attempt to process per request loop. Using the default is usually fine.
readBufferSize         1024             Size of the internal Mina read buffer. Provided for performance tuning. Using the default is usually fine.
numProcessors          (auto-detected)  Number of processors available on the system for use while processing messages. Default is to auto-detect # of CPUs using the Java Runtime API. Mina will spawn 2 request-processing threads per detected CPU, which is often reasonable.
selector.type          replicating      replicating, multiplexing, or custom
selector.*             –                Depends on the selector.type value
interceptors           –                Space-separated list of interceptors.
interceptors.*
ssl                    false            Set this to true to enable SSL encryption. If SSL is enabled, you must also specify a "keystore" and a "keystore-password", either through component level parameters (see below) or as global SSL parameters (see the SSL/TLS support section).
keystore               –                This is the path to a Java keystore file. If not specified here, then the global keystore will be used (if defined, otherwise configuration error).
keystore-password      –                The password for the Java keystore. If not specified here, then the global keystore password will be used (if defined, otherwise configuration error).
keystore-type          JKS              The type of the Java keystore. This can be "JKS" or "PKCS12". If not specified here, then the global keystore type will be used (if defined, otherwise the default is JKS).
exclude-protocols      SSLv3            Space-separated list of SSL/TLS protocols to exclude. SSLv3 will always be excluded in addition to the protocols specified.
include-protocols      –                Space-separated list of SSL/TLS protocols to include. The enabled protocols will be the included protocols without the excluded protocols. If included-protocols is empty, it includes every supported protocol.
exclude-cipher-suites  –                Space-separated list of cipher suites to exclude.
include-cipher-suites  –                Space-separated list of cipher suites to include. The enabled cipher suites will be the included cipher suites without the excluded cipher suites. If included-cipher-suites is empty, it includes every supported cipher suite.
For example, a multiport syslog TCP source for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = multiport_syslogtcp
a1.sources.r1.channels = c1
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.ports = 10001 10002 10003
a1.sources.r1.portHeader = port
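Character sets can also be set per port on the same source, e.g. (ports and charsets here are illustrative):

a1.sources.r1.charset.default = UTF-8
a1.sources.r1.charset.port.10001 = ISO-8859-1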
Syslog UDP Source
For example, a syslog UDP source for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = syslogudp
a1.sources.r1.port = 5140
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1
HTTP Source
A source which accepts Flume Events by HTTP POST and GET. GET should be used for experimentation only. HTTP requests are converted into flume events by a pluggable "handler" which must implement the HTTPSourceHandler interface. This handler takes a HttpServletRequest and returns a list of flume events. All events handled from one Http request are committed to the channel in one transaction, thus allowing for increased efficiency on channels like the file channel. If the handler throws an exception, this source will return an HTTP status of 400. If the channel is full, or the source is unable to append events to the channel, the source will return an HTTP 503 - Temporarily unavailable status.
All events sent in one post request are considered to be one batch and inserted
into the channel in one transaction.
This source is based on Jetty 9.4 and offers the ability to set additional Jetty-
specific parameters which will be passed directly to the Jetty components.
Deprecated Properties
Property Name     Default  Description
keystorePassword  –        Use keystore-password. Deprecated value will be overwritten with the new one.
excludeProtocols  SSLv3    Use exclude-protocols. Deprecated value will be overwritten with the new one.
enableSSL         false    Use ssl. Deprecated value will be overwritten with the new one.
N.B. Jetty-specific settings are set using the setter-methods on the objects listed above. For full details see the Javadoc for these classes (QueuedThreadPool, HttpConfiguration, SslContextFactory and ServerConnector).
When using Jetty-specific settings, the named properties above will take precedence (for example excludeProtocols will take precedence over SslContextFactory.ExcludeProtocols). All properties will be in initial lower case.
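A sketch of setting Jetty-specific parameters on an HTTP source; the property names follow the setter names of the Jetty classes above, and the values are illustrative:

a1.sources.r1.HttpConfiguration.sendServerVersion = false
a1.sources.r1.ServerConnector.idleTimeout = 300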
An example http source for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = http
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1
a1.sources.r1.handler = org.example.rest.RestHandler