
Apache Flume Sources

What are Apache Flume Sources?


Flume Sources are the input points into the Flume Agent. The purpose
of a Source is to receive data from an external client (e.g. application
logs or syslog) and store it into a Flume Channel. A Source can get an
instance of its own ChannelProcessor to process an Event. The
ChannelProcessor in turn can get an instance of its own
ChannelSelector (Replicating or Multiplexing), which is used to get
the Channels associated with the Source, as configured in the
Flume properties file. A Transaction can then be retrieved from
each associated Channel so that the Source can place Events into
the Channel reliably, within a Transaction.
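To make the Channel wiring concrete, below is a minimal configuration sketch of one Source fanned out to two Channels through a Replicating selector (the default), with a Multiplexing variant shown in comments. The agent, source, channel and header names (agents, sourceid, channel1, channel2, logtype) are illustrative placeholders, not fixed names.

agents.sources.sourceid.channels=channel1 channel2
# replicating is the default selector: every Event is written to all listed Channels
agents.sources.sourceid.selector.type=replicating

# a multiplexing selector would instead route Events by a header value:
# agents.sources.sourceid.selector.type=multiplexing
# agents.sources.sourceid.selector.header=logtype
# agents.sources.sourceid.selector.mapping.access=channel1
# agents.sources.sourceid.selector.mapping.error=channel2
# agents.sources.sourceid.selector.default=channel1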
Several Sources come with the Flume distribution, and more are
available as open source. However, you can also create your own
Source by extending
org.apache.flume.source.AbstractSource

Flume Sources

Understand the different types of Sources


TailSource
Exec Source
Spooling Directory Source
Syslog Sources:
Syslog UDP Source
Syslog TCP Source
Multiport Syslog TCP Source


Introduction to TailSource and why it is discontinued
TailSource is no longer part of Flume. Using TailSource, you could tail any
file on the system and create a Flume Event for each line.

In the case of Channels and Sinks, Events being added to and removed from
a Channel are part of a Transaction. However, when you tail a file,
there is no way for that read to be part of a Transaction.
If the Channel fails for any reason, there is no way to roll back
the tailed data and put it back.

Let's take an example. Suppose you are tailing the file

/user/hadoopexam/access.log

and log4j has been configured to rotate (rename) the file when it
reaches 1 MB in size, with the renaming done as below:
/user/hadoopexam/access.log1
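For reference, such a rotation could be configured with a log4j 1.x RollingFileAppender roughly as sketched below; the appender name is illustrative, and note that the stock RollingFileAppender actually renames backups with a dot (access.log.1) rather than access.log1.

log4j.rootLogger=INFO, access
# roll the file once it reaches 1 MB, keeping up to 2 renamed backups
log4j.appender.access=org.apache.log4j.RollingFileAppender
log4j.appender.access.File=/user/hadoopexam/access.log
log4j.appender.access.MaxFileSize=1MB
log4j.appender.access.MaxBackupIndex=2
log4j.appender.access.layout=org.apache.log4j.PatternLayout
log4j.appender.access.layout.ConversionPattern=%d %p %c - %m%n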


Now assume Flume was reading the file access.log just as it was renamed to
access.log1. Because Flume still holds the open file handle, it can keep
reading the renamed file. But assume that, in the meantime, the next
rotation also happens, producing:
/user/hadoopexam/access.log2

By the time Flume is done with access.log1 and goes back to reading
access.log, it is unaware that another file, access.log2, was created in
between, and those log lines would be missed by Apache Flume entirely.

So, as you might have noticed, there are chances of data loss when using
TailSource. These are the two reasons why TailSource was discontinued
after the Flume 0.9 release:

1. A tail cannot be part of a Transaction.

2. There is a possibility of data loss, as in the example above.


Apache Flume with the Exec Source

The Exec Source can be used to run a command outside of Flume. The output of that
command is then ingested into Flume as Events.
How do you use the Exec Source?
Ans: Set the agent's source type property to exec, as below.
agents.sources.sourceid.type=exec
Define the channels as below, so that all the Events are fed to a particular Channel:
agents.sources.sourceid.channels=channel1
You can also configure more than one Channel, with a space as the separator.
Now you have to specify another mandatory parameter, which is the command to be passed to the
operating system, as below:

agents.sources.sourceid.command=tail -F /user/hadoopexam/access.log
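Putting the pieces together, below is a minimal end-to-end sketch of the whole agent configuration. Only the source lines come from the steps above; the memory channel and logger sink are assumptions added so the example is complete and runnable.

# name the components of the agent called agents
agents.sources=sourceid
agents.channels=channel1
agents.sinks=sink1

# exec source: run the command and turn each line of its output into an Event
agents.sources.sourceid.type=exec
agents.sources.sourceid.command=tail -F /user/hadoopexam/access.log
agents.sources.sourceid.channels=channel1

# an in-memory channel buffering Events between source and sink
agents.channels.channel1.type=memory

# a logger sink, convenient for verifying that Events flow
agents.sinks.sink1.type=logger
agents.sinks.sink1.channel=channel1

You would then start this with the flume-ng agent command, passing the agent name (agents here) via --name and this properties file via --conf-file.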


Summary of the above configuration:

A single source is configured, named sourceid
The agent name is agents
An exec source, which will tail the access.log file
All the Events will be written to the channel1 Channel

Important: When you use a tail command with the exec source
type, Flume will fork a child process, which sometimes
does not shut down when the Flume agent shuts down and
restarts. That leaves an orphan tail -F process behind; even if
you delete the file, this tail process will keep the file handle
open indefinitely. Hence, you have to kill such a process manually
to reclaim the file's disk space.
Ad -> Hadoop Training for Java Developers in just $69/3500INR visit www.HadoopExam.com
(Learn BigData)Hadoop Certification 300+ practice Questions visit www.HadoopExam.com

Properties for the Exec Source


Key                Required   Type                  Default
type               Yes        String                exec
channels           Yes        String                Space-separated list of channels
command            Yes        String                -
restart            No         boolean               false
restartThrottle    No         long (milliseconds)   10000
logStdErr          No         boolean               false
batchSize          No         int                   20
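As a quick illustration of the optional properties in the table, the sketch below restarts the command when it exits, waits 10 seconds between restarts, copies the command's stderr into the agent's log, and commits Events to the Channel in batches of 100. The values are illustrative, not recommendations.

agents.sources.sourceid.type=exec
agents.sources.sourceid.channels=channel1
agents.sources.sourceid.command=tail -F /user/hadoopexam/access.log
# restart the command if it exits, after a 10-second pause
agents.sources.sourceid.restart=true
agents.sources.sourceid.restartThrottle=10000
# also log whatever the command writes to stderr
agents.sources.sourceid.logStdErr=true
# commit up to 100 Events to the Channel per transaction
agents.sources.sourceid.batchSize=100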


Another command example

agents.sources.sourceid.command=uptime

The uptime command on a Unix box prints how long the
machine has been up and exits immediately.
Hence, below is the configuration with which this
command will be executed periodically, every minute
(note that restartThrottle is in milliseconds):
agents.sources.sourceid.restart=true
agents.sources.sourceid.restartThrottle=60000

