
Hadoop Interview Questions – HDFS

What is BIG DATA?


Big Data is nothing but an assortment of such a huge and complex data that it becomes very tedious to capture,
store, process, retrieve and analyze it with the help of on-hand database management tools or traditional data
processing techniques.
Can you give some examples of Big Data?
There are many real life examples of Big Data! Facebook is generating 500+ terabytes of data per day, NYSE (New York Stock Exchange) generates about 1 terabyte of new trade data per day, and a jet airline collects 10 terabytes of sensor data for every 30 minutes of flying time. All these are day to day examples of Big Data.
Can you give a detailed overview of the Big Data being generated by Facebook?
As of December 31, 2012, there are 1.06 billion monthly active users on Facebook and 680 million mobile users. On an average, 3.2 billion likes and comments are posted every day on Facebook. 72% of the web audience is on Facebook. And why not? There are so many activities going on Facebook, from wall posts, sharing images and videos to writing comments and liking posts. In fact, Facebook started using Hadoop in mid-2009 and was one of the initial users of Hadoop.
According to IBM, what are the three characteristics of Big Data?
According to IBM, the three characteristics of Big Data are:
Volume: Facebook generating 500+ terabytes of data per day.
Velocity: Analyzing 2 million records each day to identify the reason for losses.
Variety: images, audio, video, sensor data, log files, etc.
How Big is 'Big Data'?
With time, data volume is growing exponentially. Earlier we used to talk about Megabytes or Gigabytes. But the time has arrived when we talk about data volume in terms of terabytes, petabytes and also zettabytes! Global data volume was around 1.8ZB in 2011 and is expected to be 7.9ZB in 2015. It is also known that the global information doubles every two years.
How is analysis of Big Data useful for organizations?
Effective analysis of Big Data provides a lot of business advantage, as organizations will learn which areas to focus on and which areas are less important. Big Data analysis provides some early key indicators that can prevent the company from a huge loss or help in grasping a great opportunity with open hands. A precise analysis of Big Data helps in decision making. For instance, nowadays people rely so much on Facebook and Twitter before buying any product or service. All thanks to the Big Data explosion.
Who are 'Data Scientists'?
Data scientists are soon replacing business analysts or data analysts. Data scientists are experts who find solutions to analyze data. Just as in web analysis, we have data scientists who have good business insight as to how to handle a business challenge. Sharp data scientists are not only involved in dealing with business problems, but also in choosing the relevant issues that can bring value-addition to the organization.
What is Hadoop?
Hadoop is a framework that allows for distributed processing of large data sets across clusters of commodity computers using a simple programming model.
Why the name 'Hadoop'?
Hadoop doesn't have any expanded form like 'OOPS' does. The charming yellow elephant you see is basically named after Doug's son's toy elephant!
Why do we need Hadoop?
Everyday a large amount of unstructured data is getting dumped into our machines. The major challenge is not to store large data sets in our systems but to retrieve and analyze the big data in the organizations, that too data present in different machines at different locations. In this situation a necessity for Hadoop arises. Hadoop has the ability to analyze the data present in different machines at different locations very quickly and in a very cost effective way. It uses the concept of MapReduce, which enables it to divide the query into small parts and process them in parallel. This is also known as parallel computing.
The link Why Hadoop gives you a detailed explanation about why Hadoop is gaining so much popularity.
What are some of the characteristics of the Hadoop framework?
The Hadoop framework is written in Java. It is designed to solve problems that involve analyzing large data (e.g. petabytes). The programming model is based on Google's MapReduce. The infrastructure is based on Google's BigTable and distributed file system (GFS). Hadoop handles large files/data throughput and supports data intensive distributed applications. Hadoop is scalable, as more nodes can be easily added to it.
Give a brief overview of Hadoop history.
In 2002, Doug Cutting created an open source, web crawler project.
In 2004, Google published the MapReduce and GFS papers.
In 2006, Doug Cutting developed the open source MapReduce and HDFS project.
In 2008, Yahoo ran a 4,000-node Hadoop cluster and Hadoop won the terabyte sort benchmark.
In 2009, Facebook launched SQL support for Hadoop.
Give examples of some companies that are using the Hadoop structure.
A lot of companies are using the Hadoop structure, such as Cloudera, EMC, MapR, Hortonworks, Amazon, Facebook, eBay, Twitter, Google and so on.
What is the basic difference between traditional RDBMS and Hadoop?
A traditional RDBMS is used for transactional systems to report and archive the data, whereas Hadoop is an approach to store a huge amount of data in the distributed file system and process it. An RDBMS will be useful when you want to seek one record from Big Data, whereas Hadoop will be useful when you want Big Data in one shot and perform analysis on that later.
What is structured and unstructured data?
Structured data is the data that is easily identifiable as it is organized in a structure. The most common form of structured data is a database where specific information is stored in tables, that is, rows and columns. Unstructured data refers to any data that cannot be identified easily. It could be in the form of images, videos, documents, email, logs and random text. It is not in the form of rows and columns.
What are the core components of Hadoop?
The core components of Hadoop are HDFS and MapReduce. HDFS is basically used to store large data sets and MapReduce is used to process such large data sets.
What is HDFS?
HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
What are the key features of HDFS?
HDFS is highly fault-tolerant, has high throughput, is suitable for applications with large data sets, provides streaming access to file system data and can be built out of commodity hardware.
What is Fault Tolerance?
Suppose you have a file stored in a system, and due to some technical problem that file gets destroyed. Then there is no chance of getting back the data present in that file. To avoid such situations, Hadoop has introduced the feature of fault tolerance in HDFS. In Hadoop, when we store a file, it automatically gets replicated at two other locations also. So even if one or two of the systems collapse, the file is still available on the third system.
Replication causes data redundancy, then why is it pursued in HDFS?
HDFS works with commodity hardware (systems with average configurations) that has high chances of crashing at any time. Thus, to make the entire system highly fault-tolerant, HDFS replicates and stores data in different places. Any data on HDFS gets stored at at least 3 different locations. So, even if one of them is corrupted and another is unavailable for some time for any reason, the data can still be accessed from the third one. Hence, there is no chance of losing the data. This replication factor helps us attain the fault tolerance feature of Hadoop.
Since the data is replicated thrice in HDFS, does it mean that any calculation done on
one node will also be replicated on the other two?
Since there are 3 nodes, when we send the MapReduce programs, calculations will be done only on the original data. The master node will know which node exactly has that particular data. In case one of the nodes is not responding, it is assumed to have failed. Only then will the required calculation be done on the second replica.
What is throughput? How does HDFS get a good throughput?
Throughput is the amount of work done in a unit of time. It describes how fast the data is getting accessed from the system and it is usually used to measure the performance of the system. In HDFS, when we want to perform a task or an action, the work is divided and shared among different systems. So all the systems will be executing the tasks assigned to them independently and in parallel. So the work will be completed in a very short period of time. In this way, HDFS gives good throughput. By reading data in parallel, we decrease the actual time to read data tremendously.
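The throughput gain from parallel reads can be sketched with a toy model (simple arithmetic, not HDFS code; the block sizes and timings below are made up for illustration):

```python
import math

def read_time(num_blocks: int, secs_per_block: float, readers: int) -> float:
    """Toy model: blocks are read in waves, `readers` blocks per wave."""
    waves = math.ceil(num_blocks / readers)
    return waves * secs_per_block

# A 640 MB file split into ten 64 MB blocks, 2 s to read one block:
sequential = read_time(10, 2.0, readers=1)   # one machine does all the work
parallel   = read_time(10, 2.0, readers=5)   # five datanodes read in parallel
print(sequential, parallel)  # 20.0 4.0
```

With five readers, the same ten blocks are consumed in a fifth of the time, which is the intuition behind HDFS's high throughput.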
What is streaming access?
As HDFS works on the principle of 'Write Once, Read Many', the feature of streaming access is extremely important in HDFS. HDFS focuses not so much on storing the data but on how to retrieve it at the fastest possible speed, especially while analyzing logs. In HDFS, reading the complete data is more important than the time taken to fetch a single record from the data.
What is commodity hardware? Does commodity hardware include RAM?
Commodity hardware is a non-expensive system which is not of high quality or high availability. Hadoop can be installed on any average commodity hardware. We don't need supercomputers or high-end hardware to work on Hadoop. Yes, commodity hardware includes RAM, because there will be some services running on RAM.
What is a Namenode?
Namenode is the master node on which the job tracker runs, and it holds the metadata. It maintains and manages the blocks which are present on the datanodes. It is a high-availability machine and the single point of failure in HDFS.
Is Namenode also a commodity?
No. Namenode can never be commodity hardware because the entire HDFS relies on it. It is the single point of failure in HDFS. Namenode has to be a high-availability machine.
What is metadata?
Metadata is the information about the data stored in datanodes, such as the location of the file, the size of the file and so on.
What is a Datanode?
Datanodes are the slaves which are deployed on each machine and provide the actual storage. These are responsible for serving read and write requests for the clients.
Why do we use HDFS for applications having large data sets and not when there are lots
of small files?
HDFS is more suitable for a large amount of data in a single file as compared to a small amount of data spread across multiple files. This is because Namenode is a very expensive high performance system, so it is not prudent to occupy space in the Namenode with the unnecessary amount of metadata that is generated for multiple small files. So, when there is a large amount of data in a single file, the Namenode will occupy less space. Hence, for optimized performance, HDFS supports large data sets instead of multiple small files.
What is a daemon?
A daemon is a process or service that runs in the background. In general, we use this word in the UNIX environment. The equivalent of a daemon in Windows is a 'service' and in DOS a 'TSR'.
What is a job tracker?
Job tracker is a daemon that runs on a namenode for submitting and tracking MapReduce jobs in Hadoop. It assigns the tasks to the different task trackers. In a Hadoop cluster, there will be only one job tracker but many task trackers. It is the single point of failure for Hadoop and the MapReduce service. If the job tracker goes down, all the running jobs are halted. It receives heartbeats from the task trackers, based on which the job tracker decides whether the assigned task is completed or not.
What is a task tracker?
Task tracker is also a daemon that runs on datanodes. Task trackers manage the execution of individual tasks on slave nodes. When a client submits a job, the job tracker will initialize the job, divide the work and assign it to different task trackers to perform MapReduce tasks. While performing this action, the task tracker will be simultaneously communicating with the job tracker by sending heartbeats. If the job tracker does not receive a heartbeat from a task tracker within the specified time, it will assume that the task tracker has crashed and assign that task to another task tracker in the cluster.
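The heartbeat-timeout logic can be sketched in a few lines. This is a toy model with made-up names (`ToyJobTracker`, `HEARTBEAT_TIMEOUT`), not the actual JobTracker implementation, and real Hadoop uses different default intervals:

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds; illustrative value only

class ToyJobTracker:
    def __init__(self):
        self.last_heartbeat = {}  # task tracker id -> time of last heartbeat

    def heartbeat(self, tracker_id, now=None):
        """Record a heartbeat from a task tracker."""
        self.last_heartbeat[tracker_id] = now if now is not None else time.time()

    def dead_trackers(self, now=None):
        """Trackers silent longer than the timeout are assumed to have crashed."""
        now = now if now is not None else time.time()
        return [t for t, last in self.last_heartbeat.items()
                if now - last > HEARTBEAT_TIMEOUT]

jt = ToyJobTracker()
jt.heartbeat("tracker-1", now=0.0)
jt.heartbeat("tracker-2", now=0.0)
jt.heartbeat("tracker-1", now=8.0)   # tracker-2 goes silent
print(jt.dead_trackers(now=12.0))    # ['tracker-2'] -> its tasks get reassigned
```

Any tracker appearing in that list would have its tasks handed to another task tracker in the cluster.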
Is the Namenode machine the same as the datanode machine in terms of hardware?
It depends upon the cluster you are trying to create. The Hadoop VM can be there on the same machine or on another machine. For instance, in a single node cluster, there is only one machine, whereas in the development or testing environment, Namenode and datanodes are on different machines.
What is a heartbeat in HDFS?
A heartbeat is a signal indicating that a node is alive. A datanode sends heartbeats to the Namenode and a task tracker sends its heartbeats to the job tracker. If the Namenode or job tracker does not receive a heartbeat, they will decide that there is some problem in the datanode, or that the task tracker is unable to perform the assigned task.
Are Namenode and job tracker on the same host?
No, in a practical environment, Namenode runs on a separate host and the job tracker on a separate host.
What is a 'block' in HDFS?
A 'block' is the minimum amount of data that can be read or written. In HDFS, the default block size is 64 MB, in contrast to the block size of 8192 bytes in Unix/Linux. Files in HDFS are broken down into block-sized chunks, which are stored as independent units. HDFS blocks are large compared to disk blocks, particularly to minimize the cost of seeks.
If a particular file is 50 MB, will the HDFS block still consume 64 MB as the default size?
No, not at all! 64 MB is just a unit where the data will be stored. In this particular situation, only 50 MB will be consumed by an HDFS block and 14 MB will be free to store something else. It is the MasterNode that does data allocation in an efficient manner.
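The block arithmetic behind this answer can be sketched in a few lines (a toy calculation, not HDFS code; 64 MB is the classic default, newer releases default to 128 MB):

```python
import math

BLOCK_SIZE_MB = 64  # classic HDFS default block size

def block_layout(file_size_mb: int) -> tuple:
    """Return (number of blocks, size of the last, partially filled block)."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    last = file_size_mb - (blocks - 1) * BLOCK_SIZE_MB
    return blocks, last

print(block_layout(50))    # (1, 50): one block holding only 50 MB on disk
print(block_layout(200))   # (4, 8): three full 64 MB blocks plus one 8 MB block
```

The last block of a file occupies only as much disk as its actual data, which is exactly the 50 MB case in the question.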
What are the benefits of block transfer?
A file can be larger than any single disk in the network. There's nothing that requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster. Making the unit of abstraction a block rather than a file simplifies the storage subsystem. Blocks provide fault tolerance and availability. To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client.
If we want to copy 10 blocks from one machine to another, but the other machine
can copy only 8.5 blocks, can the blocks be broken at the time of replication?
In HDFS, blocks cannot be broken down. Before copying the blocks from one machine to another, the master node will figure out what the actual amount of space required is, how many blocks are being used and how much space is available, and it will allocate the blocks accordingly.
How is indexing done in HDFS?
Hadoop has its own way of indexing. Depending upon the block size, once the data is stored, HDFS will keep on storing the last part of the data, which will say where the next part of the data will be. In fact, this is the basis of HDFS.
If a datanode is full, how is it identified?
When data is stored in a datanode, the metadata of that data will be stored in the Namenode. So the Namenode will identify if the datanode is full.
If datanodes increase, then do we need to upgrade Namenode?
While installing the Hadoop system, Namenode is determined based on the size of the clusters. Most of the time, we do not need to upgrade the Namenode because it does not store the actual data, but just the metadata, so such a requirement rarely arises.
Are job tracker and task trackers present in separate machines?
Yes, job tracker and task trackers are present in different machines. The reason is that the job tracker is a single point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.
When we send data to a node, do we allow it settling-in time before sending another piece of data to that node?
Yes, we do.
Does Hadoop always require digital data to process?
Yes. Hadoop always requires digital data to be processed.
On what basis will Namenode decide which datanode to write on?
As the Namenode has the metadata (information) related to all the datanodes, it knows which datanode is free.
Doesn't Google have its very own version of DFS?
Yes, Google owns a DFS known as the 'Google File System (GFS)', developed by Google Inc. for its own use.
Who is a 'user' in HDFS?
A user is someone like you or me, who has some query or who needs some kind of data.
Is the client the end user in HDFS?
No, a client is an application which runs on your machine and is used to interact with the Namenode (job tracker) or datanode (task tracker).
What is the communication channel between client and namenode/datanode?
The mode of communication is SSH.
What is a rack?
A rack is a storage area with all the datanodes put together. These datanodes can be physically located at different places. A rack is a physical collection of datanodes which are stored at a single location. There can be multiple racks in a single location.
On what basis will data be stored on a rack?
When the client is ready to load a file into the cluster, the content of the file will be divided into blocks. Now the client consults the Namenode and gets 3 datanodes for every block of the file, which indicates where each block should be stored. While placing the datanodes, the key rule followed is: "for every block of data, two copies will exist in one rack, and the third copy in a different rack". This rule is known as the "Replica Placement Policy".
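The rule can be sketched as a toy placement function; the node names are made up, and real HDFS also weighs network distance, free space and the writer's location, which this sketch ignores:

```python
def place_replicas(racks: dict) -> list:
    """Toy version of "two copies in one rack, third copy in a different rack".

    `racks` maps rack name -> list of datanodes in that rack.
    """
    rack_names = sorted(racks)
    first, second = rack_names[0], rack_names[1]
    return [racks[first][0],    # replica 1: a datanode in the first rack
            racks[first][1],    # replica 2: a second datanode, same rack
            racks[second][0]]   # replica 3: a datanode in a different rack

cluster = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas(cluster))  # ['dn1', 'dn2', 'dn3']
```

Losing all of rack1 still leaves a copy on rack2, which is the point of spreading the third replica across racks.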
Do we need to place the 2nd and 3rd replicas in rack 2 only?
Yes, this is to avoid datanode failure.
What if rack 2 and the datanode fail?
If both rack 2 and the datanode present in rack 1 fail, then there is no chance of getting the data. In order to avoid such situations, we need to replicate that data more times instead of replicating only thrice. This can be done by changing the value of the replication factor, which is set to 3 by default.
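The replication factor is controlled by the `dfs.replication` property in hdfs-site.xml (classic Hadoop 1.x property name; the value 4 below is only an example):

```xml
<!-- hdfs-site.xml: raise the default replication factor from 3 to 4 -->
<property>
  <name>dfs.replication</name>
  <value>4</value>
</property>
```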
What is a Secondary Namenode? Is it a substitute for the Namenode?
The Secondary Namenode constantly reads the data from the RAM of the Namenode and writes it into the hard disk or the file system. It is not a substitute for the Namenode, so if the Namenode fails, the entire Hadoop system goes down.
What is the difference between Gen1 and Gen2 Hadoop with regards to the
Namenode?
In Gen1 Hadoop, Namenode is the single point of failure. In Gen2 Hadoop, we have what is known as an Active and Passive Namenode structure. If the Active Namenode fails, the Passive Namenode takes over.
What is MapReduce?
MapReduce is the 'heart' of Hadoop and consists of two parts: 'map' and 'reduce'. Maps and reduces are programs for processing data. 'Map' processes the data first to give some intermediate output, which is further processed by 'Reduce' to generate the final output. Thus, MapReduce allows for distributed processing of the map and reduction operations.
Can you explain how 'map' and 'reduce' work?
Namenode takes the input, divides it into parts and assigns them to datanodes. These datanodes process the tasks assigned to them, make key-value pairs and return the intermediate output to the reducer. The reducer collects the key-value pairs from all the datanodes, combines them and generates the final output.
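The map → shuffle → reduce flow can be simulated with a toy word count in plain Python (this illustrates the model only; real Hadoop mappers and reducers are written against the Java API):

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit an intermediate (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    """Reduce: combine all the counts emitted for one key."""
    return word, sum(counts)

lines = ["big data big cluster", "big cluster"]

# Shuffle: group intermediate pairs by key before they reach the reducers.
groups = defaultdict(list)
for line in lines:
    for word, one in map_phase(line):
        groups[word].append(one)

result = dict(reduce_phase(w, c) for w, c in groups.items())
print(result)  # {'big': 3, 'data': 1, 'cluster': 2}
```

Each mapper emits key-value pairs independently, the framework groups them by key, and each reducer sees one key with the full list of its values.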
What is a 'key-value pair' in HDFS?
A key-value pair is the intermediate data generated by maps and sent to reduces for generating the final output.
What is the difference between the MapReduce engine and the HDFS cluster?
HDFS cluster is the name given to the whole configuration of master and slaves where data is stored. The MapReduce engine is the programming module which is used to retrieve and analyze data.
Is map like a pointer?
No, map is not like a pointer.
Do we require two servers for the Namenode and the datanodes?
Yes, we need two different servers for the Namenode and the datanodes. This is because the Namenode requires a highly configurable system, as it stores information about the location details of all the files stored in different datanodes, whereas the datanodes require a low configuration system.
Why is the number of splits equal to the number of maps?
The number of maps is equal to the number of input splits because we want the key and value pairs of all the input splits.
Is a job split into maps?
No, a job is not split into maps. A split is created for the file. The file is placed on datanodes in blocks. For each split, a map is needed.
Which are the two types of 'writes' in HDFS?
There are two types of writes in HDFS: posted and non-posted writes. A posted write is when we write it and forget about it, without worrying about the acknowledgement. It is similar to our traditional Indian post. In a non-posted write, we wait for the acknowledgement. It is similar to today's courier services. Naturally, a non-posted write is more expensive than a posted write, though both writes are asynchronous.
Why is 'reading' done in parallel and 'writing' not in HDFS?
Reading is done in parallel because by doing so we can access the data fast. But we do not perform the write operation in parallel. The reason is that if we perform the write operation in parallel, it might result in data inconsistency. For example, if you have a file and two nodes are trying to write data into the file in parallel, then the first node does not know what the second node has written and vice-versa. So this makes it ambiguous which data is to be stored and accessed.
Can Hadoop be compared to a NoSQL database like Cassandra?
Though NoSQL is the closest technology that can be compared to Hadoop, it has its own pros and cons. There is no DFS in NoSQL. Hadoop is not a database. It's a filesystem (HDFS) and a distributed programming framework (MapReduce).
How can I install the Cloudera VM in my system?
When you enrol for the Hadoop course at Edureka, you can download the Hadoop Installation steps.pdf file from our dropbox. This will be shared with you by an e-mail.
Hadoop Interview Questions – Setting Up Hadoop Cluster
Which are the three modes in which Hadoop can be run?
The three modes in which Hadoop can be run are:
1. Standalone (local) mode
2. Pseudo-distributed mode
3. Fully distributed mode
What are the features of Standalone (local) mode?
In stand-alone mode there are no daemons; everything runs on a single JVM. It has no DFS and utilizes the local file system. Stand-alone mode is suitable only for running MapReduce programs during development. It is one of the least used environments.
What are the features of Pseudo mode?
Pseudo mode is used both for development and in the QA environment. In Pseudo mode all the daemons run on the same machine.
Can we call VMs pseudos?
No, VMs are not pseudos, because a VM is something different and 'pseudo' is very specific to Hadoop.
What are the features of Fully Distributed mode?
Fully Distributed mode is used in the production environment, where we have 'n' number of machines forming a Hadoop cluster. Hadoop daemons run on a cluster of machines. There is one host on which Namenode is running, other hosts on which datanodes are running, and then there are machines on which the task trackers are running. We have separate masters and separate slaves in this distribution.
Does Hadoop follow the UNIX pattern?
Yes, Hadoop closely follows the UNIX pattern. Hadoop also has a 'conf' directory, as in the case of UNIX.
In which directory is Hadoop installed?
Cloudera and Apache have the same directory structure. Hadoop is installed in /usr/lib/hadoop-0.20/.
What are the port numbers of Namenode, job tracker and task tracker?
The web UI port number for the Namenode is 50070, for the job tracker 50030 and for the task tracker 50060.
What is the Hadoop-core configuration?
Hadoop core is configured by two xml files:
1. hadoop-default.xml, which was renamed to 2. hadoop-site.xml.
These files are written in xml format. We have certain properties in these xml files, which consist of a name and a value. But these files do not exist now.
What are the Hadoop configuration files at present?
There are 3 configuration files in Hadoop:
1. core-site.xml
2. hdfs-site.xml
3. mapred-site.xml
These files are located in the conf/ subdirectory.
How to exit the Vi editor?
To exit the Vi editor, press ESC, type :q and then press Enter.
What is a spill factor with respect to the RAM?
The spill factor is the size after which your files move to the temp file. The hadoop-temp directory is used for this.
Is fs.mapr.working.dir a single directory?
Yes, fs.mapr.working.dir is just one directory.
Which are the three main hdfs-site.xml properties?
The three main hdfs-site.xml properties are:
1. dfs.name.dir, which gives you the location where the metadata will be stored and where the DFS is located
– on disk or on the remote machine.
2. dfs.data.dir, which gives you the location where the data is going to be stored.
3. fs.checkpoint.dir, which is for the secondary Namenode.
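A minimal hdfs-site.xml fragment wiring up these three properties might look like this (the paths shown are illustrative, not defaults):

```xml
<!-- hdfs-site.xml: illustrative values only -->
<property>
  <name>dfs.name.dir</name>
  <value>/disk1/hdfs/name</value>            <!-- where the Namenode keeps metadata -->
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/disk1/hdfs/data</value>            <!-- where datanodes store blocks -->
</property>
<property>
  <name>fs.checkpoint.dir</name>
  <value>/disk1/hdfs/namesecondary</value>   <!-- secondary Namenode checkpoints -->
</property>
```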
How to come out of the insert mode?
To come out of the insert mode, press ESC, type :q (if you have not written anything) OR type :wq (if you have written anything in the file) and then press ENTER.
What is Cloudera and why is it used?
Cloudera is a distribution of Hadoop. It is also the user created on the VM by default. Cloudera packages the Apache Hadoop stack and is used for data processing.
What happens if you get a 'connection refused java exception' when you type
hadoop fsck /?
It could mean that the Namenode is not working on your VM.
We are using the Ubuntu operating system with Cloudera, but from where can we
download Hadoop, or does it come by default with Ubuntu?
This is a default configuration of Hadoop that you have to download from Cloudera or from Edureka's dropbox and then run it on your systems. You can also proceed with your own configuration, but you need a Linux box, be it Ubuntu or Red Hat. There are installation steps present at the Cloudera location or in Edureka's Drop box. You can go either way.
What does the 'jps' command do?
This command checks whether your Namenode, datanode, task tracker, job tracker, etc. are working or not.
How can I restart Namenode?
1. Click on stop-all.sh and then click on start-all.sh OR
2. Write sudo hdfs (press enter), su-hdfs (press enter), /etc/init.d/ha (press enter) and
then /etc/init.d/hadoop-0.20-namenode start (press enter).
What is the full form of fsck?
The full form of fsck is File System Check.
How can we check whether Namenode is working or not?
To check whether Namenode is working or not, use the command /etc/init.d/hadoop-0.20-namenode
status or, as simple as that, jps.
What does the command mapred.job.tracker do?
The mapred.job.tracker property lists out which of your nodes is acting as a job tracker.
What does /etc /init.d do?
/etc /init.d specifies where daemons (services) are placed, or lets you see the status of these daemons. It is
very LINUX specific, and has nothing to do with Hadoop.
How can we look for the Namenode in the browser?
If you have to look for Namenode in the browser, you don't have to give localhost:8021; the port number to look for Namenode in the browser is 50070.
How to change from SU to Cloudera?
To change from SU to Cloudera, just type exit.
Which files are used by the startup and shutdown commands?
Slaves and masters are used by the startup and the shutdown commands.
What do slaves consist of?
Slaves consist of a list of hosts, one per line, that host the datanode and task tracker servers.
What do masters consist of?
Masters contain a list of hosts, one per line, that are to host secondary namenode servers.
What does hadoop-env.sh do?
hadoop-env.sh provides the environment for Hadoop to run. JAVA_HOME is set over here.
Can we have multiple entries in the master files?
Yes, we can have multiple entries in the master files.
Where is the hadoop-env.sh file present?
The hadoop-env.sh file is present in the conf location.
In Hadoop_PID_DIR, what does PID stand for?
PID stands for 'Process ID'.
What does /var/hadoop/pids do?
It stores the PID.
What does the hadoop-metrics.properties file do?
hadoop-metrics.properties is used for 'Reporting' purposes. It controls the reporting for Hadoop. The default status is 'not to report'.
What are the network requirements for Hadoop?
The Hadoop core uses Shell (SSH) to launch the server processes on the slave nodes. It requires a password-less SSH connection between the master and all the slaves and the secondary machines.
Why do we need a password-less SSH in a Fully Distributed environment?
We need a password-less SSH in a Fully Distributed environment because when the cluster is LIVE and running in the Fully Distributed environment, the communication is too frequent. The job tracker should be able to send a task to a task tracker quickly.
Does this lead to security issues?
No, not at all. A Hadoop cluster is an isolated cluster, and generally it has nothing to do with the internet. It has a different kind of configuration. We needn't worry about that kind of security breach, for instance, someone hacking through the internet, and so on. Hadoop has a very secure way to connect to other machines to fetch and process data.
On which port does SSH work?
SSH works on Port No. 22, though it can be configured. 22 is the default port number.
Can you tell us more about SSH?
SSH is nothing but a secure shell communication. It is a kind of protocol that works on Port No. 22, and when you do an SSH, what you really require is a password.
Why is a password needed in SSH localhost?
A password is required in SSH for security, and in situations where password-less communication is not set up.
Do we need to give a password, even if the key is added in SSH?
Yes, a password is still required even if the key is added in SSH.
What if a Namenode has no data?
If a Namenode has no data, it is not a Namenode. Practically, a Namenode will have some data.
What happens to the job tracker when Namenode is down?
When Namenode is down, your cluster is OFF; this is because Namenode is the single point of failure in HDFS.
What happens to a Namenode when the job tracker is down?
When a job tracker is down, it will not be functional but Namenode will be present. So the cluster is accessible if Namenode is working, even if the job tracker is not working.
Can you give us some more details about SSH communication between masters
and the slaves?
SSH is a password-less secure communication where data packets are sent across to the slave. It has some format into which data is sent across. SSH is not only between masters and slaves but also between two hosts.
What is formatting of the DFS?
Just like we do in Windows, the DFS is formatted for proper structuring. It is not usually done, as it formats the Namenode too.
Does the HDFS client decide the input split, or the Namenode?
No, the client does not decide. It is already specified in one of the configurations through which the input split is already configured.
In Cloudera there is already a cluster, but if I want to form a cluster on Ubuntu,
can we do it?
Yes, you can go ahead with this! There are installation steps for creating a new cluster. You can uninstall your present cluster and install the new cluster.
Can we create a Hadoop cluster from scratch?
Yes, we can do that also, once we are familiar with the Hadoop environment.
Can we use Windows for Hadoop?
Actually, Red Hat Linux or Ubuntu are the best operating systems for Hadoop. Windows is not used frequently for installing Hadoop as there are many support problems attached with Windows. Thus, Windows is not a preferred environment for Hadoop.
MapReduce Interview Questions
What is MapReduce?
It is a framework or a programming model that is used for processing large data sets over clusters of
computers using distributed programming.
What are maps and reduces?
Maps and Reduces are two phases of solving a query in HDFS. Map is responsible for reading data
from the input location and, based on the input type, generating a key-value pair, that is, an
intermediate output on the local machine. The Reducer is responsible for processing
the intermediate output received from the mapper and generating the final output.
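The two phases can be sketched in plain Python with a word count, the classic MapReduce example. This is only an illustration of the map and reduce contracts; the real framework runs them in parallel across a cluster, and the function names here are made up, not Hadoop APIs.

```python
def map_phase(lines):
    """Map: read each input record and emit intermediate (key, value) pairs."""
    for line in lines:
        for word in line.split():
            yield (word, 1)  # intermediate output

def reduce_phase(pairs):
    """Reduce: aggregate the grouped intermediate output into the final output."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals

counts = reduce_phase(map_phase(["big data", "big clusters"]))
print(counts)  # {'big': 2, 'data': 1, 'clusters': 1}
```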
What are the four basic parameters of a mapper?
The four basic parameters of a mapper are LongWritable, Text, Text and IntWritable. The first two
represent the input parameters and the second two represent the intermediate output parameters.
What are the four basic parameters of a reducer?
The four basic parameters of a reducer are Text, IntWritable, Text and IntWritable. The first two represent
the intermediate output parameters and the second two represent the final output parameters.
What do the master class and the output class do?
The master class is defined to update the master or the job tracker, and the output class is defined to write data
onto the output location.
What is the input type/format in MapReduce by default?
By default, the input type in MapReduce is text.
Is it mandatory to set input and output type/format in MapReduce?
No, it is not mandatory to set the input and output type/format in MapReduce. By default, the cluster
takes the input and the output type as text.
What does the text input format do?
In text input format, each line creates a line object. The key is the byte offset of the line and the value
is the whole line of text. This is how the data gets processed by a mapper: the mapper receives the
key as a LongWritable parameter and the value as a Text parameter.
What does the JobConf class do?
MapReduce needs to logically separate different jobs running on the same cluster. The JobConf class
helps to do job-level settings, such as declaring a job in the real environment. It is recommended
that the job name be descriptive and represent the type of job being executed.
What does conf.setMapperClass do?
conf.setMapperClass sets the mapper class and everything related to the map job, such as reading the
data and generating a key-value pair out of the mapper.
What do sorting and shuffling do?
Sorting and shuffling are responsible for creating a unique key and a list of values. Bringing similar
keys to one location is known as sorting, and the process by which the intermediate output of the
mapper is sorted and sent across to the reducers is known as shuffling.
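What sort-and-shuffle produces can be sketched as follows: the mappers' intermediate (key, value) pairs are sorted by key and grouped, so each reducer sees one key together with the full list of its values. A minimal simulation, with made-up sample pairs:

```python
from itertools import groupby
from operator import itemgetter

# Intermediate output from the mappers, in arrival order (sample data).
intermediate = [("b", 1), ("a", 1), ("b", 1), ("a", 1), ("c", 1)]

# Sort by key, then group: each key ends up with a list of its values.
shuffled = {key: [v for _, v in group]
            for key, group in groupby(sorted(intermediate, key=itemgetter(0)),
                                      key=itemgetter(0))}
print(shuffled)  # {'a': [1, 1], 'b': [1, 1], 'c': [1]}
```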
What does a split do?
Before transferring the data from the hard disk location to the map method, there is a phase or method
called the Split Method. The split method pulls a block of data from HDFS to the framework. The Split
class does not write anything; it reads data from the block and passes it to the mapper. By default, the
split is taken care of by the framework. The split size is equal to the block size and is used to divide the
block into a bunch of splits.
How can we change the split size if our commodity hardware has less storage space?
If our commodity hardware has less storage space, we can change the split size by writing a
custom splitter. There is a feature of customization in Hadoop which can be called from the main
method.
What does a MapReduce partitioner do?
A MapReduce partitioner makes sure that all the values of a single key go to the same reducer,
thus allowing even distribution of the map output over the reducers. It redirects the mapper output to
the reducer by determining which reducer is responsible for a particular key.
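The default rule (Hadoop's HashPartitioner implements it in Java) derives the reducer index from the key's hash, which is why every value of one key lands on the same reducer. A one-line sketch:

```python
def partition(key, num_reducers):
    """Default hash-partitioner rule: reducer index = hash(key) mod reducer count."""
    return hash(key) % num_reducers

num_reducers = 4
# All values of one key are routed to the same partition:
assert partition("bangalore", num_reducers) == partition("bangalore", num_reducers)
```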
How is Hadoop different from other data processing tools?
In Hadoop, based upon your requirements, you can increase or decrease the number of mappers
without bothering about the volume of data to be processed. This is the beauty of
parallel processing, in contrast to the other data processing tools available.
Can we rename the output file?
Yes, we can rename the output file by implementing the multiple format output class.
Why can we not do aggregation (addition) in a mapper? Why do we require a reducer for that?
We cannot do aggregation (addition) in a mapper because sorting is not done in a mapper; sorting
happens only on the reducer side. Mapper method initialization depends upon each input split. While
doing aggregation, we would lose the value of the previous instance: for each row, a new mapper
gets initialized and the input split again gets divided into mappers, so we do not have a track
of the previous row value.
What is Streaming?
Streaming is a feature of the Hadoop framework that allows us to do programming using MapReduce
in any programming language that can accept standard input and produce standard output. It
could be Perl, Python or Ruby, and not necessarily Java. However, customization in MapReduce
can only be done using Java and not any other programming language.
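The whole contract Streaming requires of a language is reading records on standard input and writing tab-separated key-value pairs on standard output. A minimal Python mapper sketching that contract (the jar name in the comment is illustrative):

```python
import sys

def stream_map(in_lines, out=sys.stdout):
    """Streaming mapper: one tab-separated key-value pair per output line."""
    for line in in_lines:
        for word in line.split():
            out.write(f"{word}\t1\n")

# On a cluster this script would be wired in via the streaming jar, e.g.
#   hadoop jar hadoop-streaming.jar -mapper mapper.py ...   (illustrative)
# and stream_map(sys.stdin) would consume the records Hadoop pipes in.
```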
What is a Combiner?
A Combiner is a mini reducer that performs the local reduce task. It receives the input from the
mapper on a particular node and sends the output to the reducer. Combiners help in enhancing the
efficiency of MapReduce by reducing the quantum of data that is required to be sent to the reducers.
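The saving can be sketched as follows: the combiner applies the same aggregation as the reducer, but locally to one mapper's output, so fewer pairs cross the network. Sample pairs are made up:

```python
def combine(pairs):
    """Local reduce on one node's mapper output before it is sent across."""
    local = {}
    for key, value in pairs:
        local[key] = local.get(key, 0) + value
    return list(local.items())

mapper_output = [("big", 1), ("data", 1), ("big", 1), ("big", 1)]
to_network = combine(mapper_output)
# Four pairs shrink to two before being sent to the reducers:
print(to_network)  # [('big', 3), ('data', 1)]
```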
What is the diference between an HDFS Block and Input Split?
HDFS Block is the physical division of the data and Input Split is the logical division of the data.
What happens in a TextInputFormat?
In TextInputFormat, each line in the text file is a record. The key is the byte offset of the line and the value is
the content of the line. For instance, Key: LongWritable, Value: Text.
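The byte-offset keys can be illustrated with a small sketch that derives (offset, line) records the way TextInputFormat does (a simulation, not the Hadoop class itself):

```python
def text_records(data: bytes):
    """Yield (byte offset of line start, line text) records from raw file bytes."""
    offset = 0
    for line in data.splitlines(keepends=True):
        yield (offset, line.rstrip(b"\n").decode())
        offset += len(line)

records = list(text_records(b"hadoop\nhdfs\n"))
print(records)  # [(0, 'hadoop'), (7, 'hdfs')] -- "hadoop\n" is 7 bytes long
```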
What do you know about KeyValueTextInputFormat?
In KeyValueTextInputFormat, each line in the text file is a record. The first separator character divides
each line: everything before the separator is the key and everything after the separator is the value.
For instance, Key: Text, Value: Text.
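The split-on-first-separator rule can be sketched in one line (tab is the format's default separator; the sample record is made up):

```python
def kv_record(line, separator="\t"):
    """Split a record at the FIRST separator only: before = key, after = value."""
    key, _, value = line.partition(separator)
    return (key, value)

# Later separators stay inside the value:
assert kv_record("user42\tclicked\thome") == ("user42", "clicked\thome")
```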
What do you know about SequenceFileInputFormat?
SequenceFileInputFormat is an input format for reading sequence files. Key and value are user
defined. It is a specific compressed binary file format which is optimized for passing the data
between the output of one MapReduce job and the input of some other MapReduce job.
What do you know about NLineInputFormat?
NLineInputFormat splits N lines of input as one split, so each mapper receives a fixed number of lines.
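This n-lines-per-split behaviour (Hadoop's NLineInputFormat provides it) amounts to simple chunking, sketched here with sample records:

```python
def n_line_splits(lines, n):
    """Group the input into splits of exactly n lines (last split may be shorter)."""
    return [lines[i:i + n] for i in range(0, len(lines), n)]

splits = n_line_splits(["r1", "r2", "r3", "r4", "r5"], 2)
print(splits)  # [['r1', 'r2'], ['r3', 'r4'], ['r5']]
```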
Hadoop Interview Questions - PIG!
Can you give us some examples of how Hadoop is used in a real-time environment?
Let us assume that we have an exam consisting of 10 multiple-choice questions and 20 students appear for that
exam. Every student will attempt each question. For each question and each answer option, a key will be generated.
So we have a set of key-value pairs for all the questions and all the answer options for every student. Based on the
options that the students have selected, you have to analyze and find out how many students have answered
correctly. This isn't an easy task. Here Hadoop comes into the picture! Hadoop helps you in solving these problems
quickly and without much effort. You may also take the case of how many students have wrongly attempted a
particular question.
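The exam example above follows the usual map-reduce shape: emit (question, 1) for every correct answer in the map step, then sum per question in the reduce step. The answer key and submissions below are made-up sample data:

```python
answer_key = {"q1": "a", "q2": "c"}  # correct option per question (sample)
submissions = [("s1", "q1", "a"), ("s1", "q2", "b"),
               ("s2", "q1", "a"), ("s2", "q2", "c")]  # (student, question, option)

# Map step: one (question, 1) pair per correct answer.
mapped = [(q, 1) for _, q, opt in submissions if answer_key[q] == opt]

# Reduce step: count correct answers per question.
correct = {}
for q, one in mapped:
    correct[q] = correct.get(q, 0) + one
print(correct)  # {'q1': 2, 'q2': 1}
```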
What is BloomMapFile used for?
The BloomMapFile is a class that extends MapFile, so its functionality is similar to MapFile. BloomMapFile uses
dynamic Bloom filters to provide a quick membership test for the keys. It is used in the HBase table format.
What is PIG?
PIG is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis
programs, coupled with infrastructure for evaluating these programs. PIG's infrastructure layer consists of a compiler
that produces sequences of MapReduce programs.
What is the difference between logical and physical plans?
Pig undergoes some steps when a Pig Latin script is converted into MapReduce jobs. After performing the basic
parsing and semantic checking, it produces a logical plan. The logical plan describes the logical operators that have
to be executed by Pig during execution. After this, Pig produces a physical plan. The physical plan describes the
physical operators that are needed to execute the script.
Does 'ILLUSTRATE' run an MR job?
No, illustrate will not pull any MR job; it will pull the internal data. On the console, illustrate will not do any job. It just
shows the output of each stage and not the final output.
Is the keyword 'DEFINE' like a function name?
Yes, the keyword 'DEFINE' is like a function name. Once you have registered, you have to define it. Whatever logic
you have written in your Java program, you have an exported jar and also a jar registered by you. Now the compiler will
check the function in the exported jar. When the function is not present in the library, it looks into your jar.
Is the keyword 'FUNCTIONAL' a User Defined Function (UDF)?
No, the keyword FUNCTIONAL is not a User Defined Function (UDF). While using a UDF, we have to
override some functions. Certainly you have to do your job with the help of these functions only. But
the keyword FUNCTIONAL is a built-in function, i.e. a pre-defined function, therefore it does not work as
a UDF.
Why do we need MapReduce during Pig programming?
Pig is a high-level platform that makes many Hadoop data analysis issues easier to execute. The language we use
for this platform is Pig Latin. A program written in Pig Latin is like a query written in SQL, where we need an
execution engine to execute the query. So, when a program is written in Pig Latin, the Pig compiler will convert the
program into MapReduce jobs. Here, MapReduce acts as the execution engine.
Are there any problems which can only be solved by MapReduce and cannot be solved
by PIG? In which kind of scenarios will MR jobs be more useful than PIG?
Let us take a scenario where we want to count the population in two cities. I have a data set and a sensor list of
different cities. I want to count the population by using one MapReduce job for two cities. Let us assume that one is
Bangalore and the other is Noida. So I need to consider the key of Bangalore city similar to Noida, through which I can
bring the population data of these two cities to one reducer. The idea behind this is that somehow I have to instruct the
map-reduce program: whenever you find a city with the name "Bangalore" and a city with the name "Noida", you create the
alias name which will be the common name for these two cities, so that you create a common key for both the cities
and it gets passed to the same reducer. For this, we have to write a custom partitioner.
In MapReduce, when you create a 'key' for a city, you have to consider 'city' as the key. So, whenever the framework
comes across a different city, it considers it as a different key. Hence, we need to use a customized partitioner. There is
a provision in MapReduce only, where you can write your custom partitioner and mention: if city = bangalore or noida,
then pass a similar hashcode. However, we cannot create a custom partitioner in Pig. As Pig is not a framework, we
cannot direct the execution engine to customize the partitioner. In such scenarios, MapReduce works better than Pig.
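The custom-partitioner idea described above can be sketched as follows: Bangalore and Noida are aliased to one name before the hashcode is computed, so both cities land on the same reducer. The city names are the example's sample data and the hash is a deterministic stand-in for Java's hashCode:

```python
def city_partition(city, num_reducers):
    """Custom partitioner sketch: alias the two cities so they share a partition."""
    alias = "bangalore" if city in ("bangalore", "noida") else city
    return sum(alias.encode()) % num_reducers  # deterministic stand-in hashcode

# Both cities' population records now reach the same reducer:
assert city_partition("bangalore", 4) == city_partition("noida", 4)
```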
Does Pig give any warning when there is a type mismatch or a missing field?
No, Pig will not show any warning if there is no matching field or a mismatch. If you assume that Pig gives such a
warning, it is difficult to find it in the log file. If any mismatch is found, Pig assumes a null value.
What does co-group do in Pig?
Co-group joins the data sets by grouping one particular data set only. It groups the elements by their common field and
then returns a set of records containing two separate bags. The first bag consists of the records of the first data set
with the common data set and the second bag consists of the records of the second data set with the common data
set.
Can we say cogroup is a group of more than one data set?
Cogroup is a group of one data set. But in the case of more than one data set, cogroup will group all the data sets
and join them based on the common field. Hence, we can say that cogroup is a group of more than one data set
and a join of that data set as well.
What does FOREACH do?
FOREACH is used to apply transformations to the data and to generate new data items. The name itself indicates
that for each element of a data bag, the respective action will be performed.
Syntax: FOREACH bagname GENERATE expression1, expression2, .....
The meaning of this statement is that the expressions mentioned after GENERATE will be applied to the current
record of the data bag.
What is a bag?
A bag is one of the data models present in Pig. It is an unordered collection of tuples with possible duplicates. Bags
are used to store collections while grouping. The size of a bag is the size of the local disk; this means that the size of
the bag is limited. When the bag is full, Pig will spill the bag to the local disk and keep only some parts of the bag
in memory. There is no necessity that the complete bag should fit into memory. We represent bags with {}.