
Hadoop V1.1
Apache Hadoop HDFSMapReduce Apache Hadoop Hadoop-The Definitive Guide SE Hadoop zhangmengzhi2005@126.com 2013 08 14
Contents

3 RPC
4 HDFS
5 HDFS
6 DataNode
7 NameNode
8 Lease
9 Heartbeat
10 HDFS
11 MapReduce

2013-08-24 2013-09-26

Google described the core of its computing platform in 5 papers, covering Google Cluster, Chubby, GFS, MapReduce, and BigTable:

googlecluster-ieee.pdf
chubby-osdi06.pdf
gfs-sosp2003.pdf
MapReduce.pdf
bigtable-osdi06.pdf

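Google published these papers between 2003 and 2006: GFS at SOSP 2003, MapReduce at OSDI 2004, and BigTable at OSDI 2006.

SOSP and OSDI are both top-tier (class A) conferences in the operating systems field. SOSP is held in odd-numbered years and OSDI in even-numbered years.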

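The Apache community produced open-source counterparts, organized under Apache Hadoop:

Chubby --> ZooKeeper
GFS --> HDFS
BigTable --> HBase
MapReduce --> Hadoop MapReduce

Other projects build on Hadoop as well, such as Facebook's Hive and Yahoo's Pig. HDFS and MapReduce are the foundation of Hadoop, so this analysis concentrates on HDFS and MapReduce. The code analyzed is the release announced as "12 October, 2012: Release 1.0.4 available", and covers Hadoop core/mapred/tools/hdfs.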

Hadoop Hadoop

HDFS API Amazon S3 confconf fs fs

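The main Hadoop packages:

Package      Description
tools        Command-line tools such as DistCp and archive
mapreduce    Hadoop's Map/Reduce implementation
filecache    Local cache of HDFS files, used to speed up data access by Map/Reduce
fs           File system abstraction: one file access interface over several file system implementations
hdfs         HDFS, Hadoop's distributed file system implementation
ipc          A simple IPC implementation; relies on io for encoding and decoding
io           Encoding and decoding of data for transmission over the network
net          Wrappers for network functions such as DNS and socket
security     User and group information
conf         Configuration parameters
metrics      Collection of runtime statistics
util         Utility classes
record       Generates encode/decode functions from a DDL (data description language); C++ and Java are supported
http         Jetty-based HTTP Servlet for viewing file system status and logs in a browser
log          HTTP Servlet that provides HTTP access logs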

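Hadoop nodes communicate through remote procedure calls (RPC). RPC underlies both HDFS and MapReduce, and every RPC argument and result has to be serialized. Hadoop does not use Java serialization for this; the org.apache.hadoop.io package defines its own mechanism, the Writable interface, which is built on DataOutput and DataInput.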

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.hadoop.io;

import java.io.DataOutput;
import java.io.DataInput;
import java.io.IOException;

/**
 * A serializable object which implements a simple, efficient, serialization
 * protocol, based on {@link DataInput} and {@link DataOutput}.
 *
 * <p>Any <code>key</code> or <code>value</code> type in the Hadoop Map-Reduce
 * framework implements this interface.</p>
 *
 * <p>Implementations typically implement a static <code>read(DataInput)</code>
 * method which constructs a new instance, calls {@link #readFields(DataInput)}
 * and returns the instance.</p>
 *
 * <p>Example:</p>
 * <p><blockquote><pre>
 *     public class MyWritable implements Writable {
 *       // Some data
 *       private int counter;
 *       private long timestamp;
 *
 *       public void write(DataOutput out) throws IOException {
 *         out.writeInt(counter);
 *         out.writeLong(timestamp);
 *       }
 *
 *       public void readFields(DataInput in) throws IOException {
 *         counter = in.readInt();
 *         timestamp = in.readLong();
 *       }
 *
 *       public static MyWritable read(DataInput in) throws IOException {
 *         MyWritable w = new MyWritable();
 *         w.readFields(in);
 *         return w;
 *       }
 *     }
 * </pre></blockquote></p>
 */
public interface Writable {
  /**
   * Serialize the fields of this object to <code>out</code>.
   *
   * @param out <code>DataOuput</code> to serialize this object into.
   * @throws IOException
   */
  void write(DataOutput out) throws IOException;

  /**
   * Deserialize the fields of this object from <code>in</code>.
   *
   * <p>For efficiency, implementations should attempt to re-use storage in the
   * existing object where possible.</p>
   *
   * @param in <code>DataInput</code> to deseriablize this object from.
   * @throws IOException
   */
  void readFields(DataInput in) throws IOException;
}
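To make the write/readFields contract concrete, here is a minimal round trip through a byte array. It is a sketch only: SketchWritable is a local stand-in with the same two methods as org.apache.hadoop.io.Writable, so the example runs without Hadoop on the classpath, and MyWritableSketch and WritableRoundTrip are hypothetical names that mirror the MyWritable example in the javadoc.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

/** Local stand-in (assumption) with the same two methods as Hadoop's Writable. */
interface SketchWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

/** Same two fields as the MyWritable example in the javadoc. */
class MyWritableSketch implements SketchWritable {
    int counter;
    long timestamp;

    MyWritableSketch() { }

    MyWritableSketch(int counter, long timestamp) {
        this.counter = counter;
        this.timestamp = timestamp;
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);      // 4 bytes
        out.writeLong(timestamp);   // 8 bytes
    }

    public void readFields(DataInput in) throws IOException {
        // Fields must be read in the order they were written.
        counter = in.readInt();
        timestamp = in.readLong();
    }
}

class WritableRoundTrip {
    /** Serializes a value into a byte array through a DataOutputStream. */
    static byte[] serialize(SketchWritable w) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        w.write(out);
        out.flush();
        return bytes.toByteArray();
    }

    /** Creates an empty instance and fills it from the serialized bytes. */
    static MyWritableSketch deserialize(byte[] data) throws IOException {
        MyWritableSketch w = new MyWritableSketch();
        w.readFields(new DataInputStream(new ByteArrayInputStream(data)));
        return w;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = serialize(new MyWritableSketch(7, 42L));
        MyWritableSketch copy = deserialize(data);
        System.out.println(data.length + " bytes, counter=" + copy.counter
            + ", timestamp=" + copy.timestamp);
    }
}
```

The serialized form carries no type or field names, only the raw int and long, which is why the reader has to know the class in advance.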

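write and readFields are the only two methods a Writable has to implement. Many classes in org.apache.hadoop.io implement the interface.

WritableComparable extends both Writable and java.lang.Comparable. IntWritable, for example, implements WritableComparable. Comparison matters in MapReduce, where keys are sorted. Besides Java's Comparator, Hadoop provides an optimized extension, RawComparator: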
package org.apache.hadoop.io;

import java.util.Comparator;

import org.apache.hadoop.io.serializer.DeserializerComparator;

/**
 * <p>
 * A {@link Comparator} that operates directly on byte representations of
 * objects.
 * </p>
 * @param <T>
 * @see DeserializerComparator
 */
public interface RawComparator<T> extends Comparator<T> {

  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);

}
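The point of the byte-level compare() is that two serialized records can be ordered without creating objects for them. The sketch below shows the idea for 4-byte big-endian integers (the layout DataOutput.writeInt produces). RawIntComparator and RawIntComparatorSketch are hypothetical names, and the interface is redeclared locally so the example runs without Hadoop.

```java
import java.nio.ByteBuffer;
import java.util.Comparator;

/** Local stand-in (assumption) with the same shape as Hadoop's RawComparator. */
interface RawIntComparator extends Comparator<Integer> {
    int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}

class RawIntComparatorSketch implements RawIntComparator {

    /** Decodes a big-endian int starting at offset start. */
    static int readInt(byte[] b, int start) {
        return ((b[start] & 0xff) << 24)
             | ((b[start + 1] & 0xff) << 16)
             | ((b[start + 2] & 0xff) << 8)
             | (b[start + 3] & 0xff);
    }

    /** Compares two serialized ints directly from their bytes. */
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return Integer.compare(readInt(b1, s1), readInt(b2, s2));
    }

    /** Object-based comparison, required by Comparator. */
    public int compare(Integer a, Integer b) {
        return Integer.compare(a, b);
    }

    /** Helper: the 4-byte big-endian encoding of v. */
    static byte[] toBytes(int v) {
        return ByteBuffer.allocate(4).putInt(v).array();
    }

    public static void main(String[] args) {
        RawIntComparatorSketch cmp = new RawIntComparatorSketch();
        int result = cmp.compare(toBytes(3), 0, 4, toBytes(10), 0, 4);
        System.out.println(result < 0 ? "3 sorts before 10" : "unexpected order");
    }
}
```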

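WritableComparator is a general-purpose RawComparator implementation for WritableComparable classes. It does two things: it supplies a default implementation of the raw compare() that deserializes both objects and compares them, and it acts as a factory for RawComparator instances registered by Writable implementations. Hadoop ships a large set of Writable classes in org.apache.hadoop.io.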

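Writable wrappers exist for all the Java primitive types except short and char (both of which can be stored in an IntWritable). Each wrapper has get() and set() methods for reading and storing the wrapped value.

Java primitive    Writable implementation    Serialized size (bytes)
boolean           BooleanWritable            1
byte              ByteWritable               1
int               IntWritable                4
                  VIntWritable               1-5
float             FloatWritable              4
long              LongWritable               8
                  VLongWritable              1-9
double            DoubleWritable             8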

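IntWritable and LongWritable have variable-length counterparts, VIntWritable and VLongWritable. The variable-length encoding stores a small value in a single byte (WritableUtils.writeVLong uses one byte for values from -112 to 127); for anything else the first byte records the sign and the number of bytes that follow.

Text is a Writable for UTF-8 sequences. It can be thought of as the Writable equivalent of java.lang.String.

ObjectWritable is a general-purpose wrapper that Hadoop RPC uses to marshal and unmarshal method arguments and return values. It can hold Java primitives, String, Writable, null, and arrays of these types. When a Writable such as MyWritable travels through RPC it is wrapped in an ObjectWritable, which writes the class name ahead of the object's own fields so that the receiver knows which class to instantiate.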
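The 1-9 byte range of the variable-length encoding is easy to check by counting bytes. The sketch below is modeled on WritableUtils.writeVLong as I recall it from the Hadoop 1.x source; treat the details as an assumption rather than a verbatim copy. VLongSketch and encodedLength are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class VLongSketch {

    /** Variable-length encoding of a long, modeled on WritableUtils.writeVLong. */
    static void writeVLong(DataOutputStream out, long i) throws IOException {
        if (i >= -112 && i <= 127) {
            out.writeByte((byte) i);          // small values: one byte, the value itself
            return;
        }
        int len = -112;                        // marker range for positive values
        if (i < 0) {
            i ^= -1L;                          // one's complement makes the value non-negative
            len = -120;                        // marker range for negative values
        }
        long tmp = i;
        while (tmp != 0) {                     // count the bytes needed for the magnitude
            tmp = tmp >> 8;
            len--;
        }
        out.writeByte((byte) len);             // first byte: sign and byte count
        len = (len < -120) ? -(len + 120) : -(len + 112);
        for (int idx = len; idx != 0; idx--) { // payload bytes, most significant first
            int shiftBits = (idx - 1) * 8;
            long mask = 0xFFL << shiftBits;
            out.writeByte((byte) ((i & mask) >> shiftBits));
        }
    }

    /** Number of bytes the encoding of i occupies. */
    static int encodedLength(long i) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        writeVLong(new DataOutputStream(bytes), i);
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        long[] samples = {0L, 127L, 128L, -112L, -113L, Long.MAX_VALUE};
        for (long v : samples) {
            System.out.println(v + " -> " + encodedLength(v) + " byte(s)");
        }
    }
}
```

For data dominated by small numbers this saves space over the fixed 8 bytes of LongWritable, at the cost of one extra byte for the largest values.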

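When ObjectWritable reads a Writable back, it creates the instance through WritableFactories. A class such as MyWritable can therefore register a factory for itself with WritableFactories.setFactory. Writable types are also the key and value types of the MapReduce API.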

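Hadoop also has a pluggable serialization framework API. A serialization framework is represented by an implementation of Serialization; WritableSerialization, for example, is the Serialization implementation for Writable types. A Serialization defines a mapping from types to Serializer and Deserializer instances. The io.serializations property lists the Serialization classes Hadoop should use, and its default value includes org.apache.hadoop.io.serializer.WritableSerialization.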

RPC
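Hadoop RPC follows the client/server model.

Hadoop's IPC passes parameters and return values as Writable objects; as described above, these can be Java primitives, String, or Writable types. Unlike CORBA-style RPC, Hadoop's IPC has no IDL compiler and no generated stub or skeleton classes: protocols are plain Java interfaces, and failures reach the caller as IOException. The RPC implementation lives in org.apache.hadoop.ipc, mainly in the classes Client, Server, and RPC.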

org.apache.hadoop.ipc.Client

The IPC client is org.apache.hadoop.ipc.Client. Its main inner classes are Call, Connection, and ConnectionId.

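Call wraps one Invocation; it is a value object (VO). Connection is a Thread that manages the link to one server. ConnectionId identifies a connection. A Client may talk to several Servers (an HDFS client talks to the NameNode and to DataNodes), so the Client keeps one Connection per ConnectionId. The core of a ConnectionId is the server's InetSocketAddress (IP address plus port).

Each Client.Connection handles the RPCs sent over it. An RPC in flight is a Client.Call, identified by its id, and the Connection keeps its outstanding Calls in a hash table:

// calls
private Hashtable<Integer, Call> calls = new Hashtable<Integer, Call>();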

A new RPC is registered with its Connection through addCall. The parameter, wrapped as a Writable (an ObjectWritable for Java primitives and String), is written to the server over the Client.Connection's socket; writeHeader() sends the connection header first. The result is read back from the same socket and stored in the Call. The calling thread blocks on the Call object using Object's wait and notify until the result has arrived.

Client call()

/** Make a call, passing <code>param</code>, to the IPC server defined by
 * <code>remoteId</code>, returning the value.
 * Throws exceptions if there are network problems or if the remote code
 * threw an exception. */
public Writable call(Writable param, ConnectionId remoteId)
    throws InterruptedException, IOException {
  Call call = new Call(param);
  Connection connection = getConnection(remoteId, call);
  connection.sendParam(call);                 // send the parameter
  boolean interrupted = false;
  synchronized (call) {
    while (!call.done) {
      try {
        call.wait();                           // wait for the result
      } catch (InterruptedException ie) {
        // save the fact that we were interrupted
        interrupted = true;
      }
    }

    if (interrupted) {
      // set the interrupt flag now that we are done waiting
      Thread.currentThread().interrupt();
    }

    if (call.error != null) {
      if (call.error instanceof RemoteException) {
        call.error.fillInStackTrace();
        throw call.error;
      } else { // local exception
        // use the connection because it will reflect an ip change, unlike
        // the remoteId
        throw wrapException(connection.getRemoteAddress(), call.error);
      }
    } else {
      return call.value;
    }
  }
}
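The blocking pattern in call() can be reduced to a few lines: the caller waits on the call object, and the thread that receives the response stores the value and notifies. This is a sketch of the pattern only, not Hadoop's Client.Call; CallSketch, await, and callAndWait are hypothetical names.

```java
/** Minimal wait/notify hand-off between a caller and a receiver thread. */
class CallSketch {
    private Object value;
    private boolean done;

    /** Called by the receiver thread when the response has arrived. */
    synchronized void setValue(Object v) {
        value = v;
        done = true;
        notify();               // wake the caller blocked in await()
    }

    /** Called by the caller; blocks until setValue() has run. */
    synchronized Object await() throws InterruptedException {
        while (!done) {         // loop guards against spurious wakeups
            wait();
        }
        return value;
    }

    /** Simulates one RPC: a second thread delivers the response. */
    static Object callAndWait(final Object response) throws InterruptedException {
        final CallSketch call = new CallSketch();
        Thread receiver = new Thread(new Runnable() {
            public void run() {
                call.setValue(response);
            }
        });
        receiver.start();
        Object result = call.await();
        receiver.join();
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(callAndWait("pong"));
    }
}
```

Because the done flag is checked under the same lock that setValue() takes, the caller cannot miss the notification even if the response arrives before it starts waiting.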

The connection is set up along this call chain:

Client.getConnection() --> Client.Connection.setupIOstreams() --> Client.Connection.setupConnection()

setupConnection() opens the socket. Client.Connection.sendParam() then writes the call through ordinary java.io streams on that socket. Client.Connection.setupIOstreams() also starts the Connection thread, and Client.Connection.run() waits for responses:

public void run() {
  if (LOG.isDebugEnabled())
    LOG.debug(getName() + ": starting, having connections "
        + connections.size());

  while (waitForWork()) { // wait here for work - read or close connection
    receiveResponse();
  }

  close();

  if (LOG.isDebugEnabled())
    LOG.debug(getName() + ": stopped, remaining connections "
        + connections.size());
}

Client.Connection.receiveResponse()

/* Receive a response.
 * Because only one receiver, so no synchronization on in.
 */
private void receiveResponse() {
  if (shouldCloseConnection.get()) {
    return;
  }
  touch();

  try {
    int id = in.readInt();                    // try to read an id

    if (LOG.isDebugEnabled())
      LOG.debug(getName() + " got value #" + id);

    Call call = calls.get(id);

    int state = in.readInt();     // read call status
    if (state == Status.SUCCESS.state) {
      Writable value = ReflectionUtils.newInstance(valueClass, conf);
      value.readFields(in);                 // read value
      call.setValue(value);
      calls.remove(id);
    } else if (state == Status.ERROR.state) {
      call.setException(new RemoteException(WritableUtils.readString(in),
                                            WritableUtils.readString(in)));
      calls.remove(id);
    } else if (state == Status.FATAL.state) {
      // Close the connection
      markClosed(new RemoteException(WritableUtils.readString(in),
                                     WritableUtils.readString(in)));
    }
  } catch (IOException e) {
    markClosed(e);
  }
}

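Setting the value or the exception on the call marks it as done and wakes the client thread waiting in call().

org.apache.hadoop.ipc.Server

Its main inner classes are Call, Listener, Responder, Connection, and Handler:

Call: holds one request received from a client.
Listener: listens for incoming connections and hands each one to a Listener.Reader; the Reader threads read the requests.
Responder: sends RPC responses back to the clients.
Connection: the connection to one client.
Handler: takes calls from callQueue and processes them.

Server itself is abstract. The actual work of a call is done by a method that subclasses implement:

/** Called for each call. */
public abstract Writable call(Class<?> protocol,
                              Writable param, long receiveTime)
    throws IOException;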

Server.Call mirrors Client.Call. It carries the id and param sent by the client (the same id and param as in the Client.Call), plus the connection the call arrived on, a timestamp, and the response, which is the serialized Writable result. Server.Connection maintains the socket to one client. Hadoop's Server uses Java NIO, so it does not need a thread per socket: the Listener accepts connections, and the Server.Handler threads do the processing in their run methods.

A Handler passes each Server.Call to Server.call and hands the result to the Responder, which also uses NIO to write the response. Processing of a request starts in the Listener's run() method:

public void run() {
  LOG.info(getName() + ": starting");
  SERVER.set(Server.this);
  while (running) {
    SelectionKey key = null;
    try {
      selector.select();
      Iterator<SelectionKey> iter = selector.selectedKeys().iterator();
      while (iter.hasNext()) {
        key = iter.next();
        iter.remove();
        try {
          if (key.isValid()) {
            if (key.isAcceptable())
              doAccept(key);
          }
        } catch (IOException e) {
        }
        key = null;
      }
    } catch (OutOfMemoryError e) {
      // we can run out of memory if we have too many threads
      // log the event and sleep for a minute and give
      // some thread(s) a chance to finish
      LOG.warn("Out of Memory in server select", e);
      closeCurrentConnection(key, e);
      cleanupConnections(true);
      try { Thread.sleep(60000); } catch (Exception ie) {}
    } catch (Exception e) {
      closeCurrentConnection(key, e);
    }
    cleanupConnections(false);
  }
  LOG.info("Stopping " + this.getName());

  synchronized (this) {
    try {
      acceptChannel.close();
      selector.close();
    } catch (IOException e) { }

    selector= null;
    acceptChannel= null;

    // clean up all connections
    while (!connectionList.isEmpty()) {
      closeConnection(connectionList.remove(0));
    }
  }
}

Listener doAccept()

void doAccept(SelectionKey key) throws IOException, OutOfMemoryError {
  Connection c = null;
  ServerSocketChannel server = (ServerSocketChannel) key.channel();
  SocketChannel channel;
  while ((channel = server.accept()) != null) {
    channel.configureBlocking(false);
    channel.socket().setTcpNoDelay(tcpNoDelay);
    Reader reader = getReader();   // pick a reader from the readers array
    try {
      reader.startAdd();           // wake up readSelector and set adding to true
      SelectionKey readKey = reader.registerChannel(channel); // register the channel for reads
      c = new Connection(readKey, channel, System.currentTimeMillis()); // create the connection
      readKey.attach(c);           // attach the connection to readKey
      synchronized (connectionList) {
        connectionList.add(numConnections, c);
        numConnections++;
      }
      if (LOG.isDebugEnabled())
        LOG.debug("Server connection from " + c.toString() +
            "; # active connections: " + numConnections +
            "; # queued calls: " + callQueue.size());
    } finally {
      // set adding to false and notify() the reader, which the Listener
      // had put into wait()
      reader.finishAdd();
    }
  }
}

Once a reader has been woken up, it calls doRead() for each readable channel. doRead() calls Server.Connection.readAndProcess(), readAndProcess() calls Server.Connection.processOneRpc(), and that calls processData(). processData() builds a call from the request and puts it on callQueue. From there the rpc call is picked up by a Server.Handler thread, which processes it in run():

final Call call = callQueue.take(); // pop the queue; maybe blocked here

value = call(call.connection.protocol, call.param, call.timestamp);

synchronized (call.connection.responseQueue) {
  // setupResponse() needs to be sync'ed together with
  // responder.doResponse() since setupResponse may use
  // SASL to encrypt response data and SASL enforces
  // its own message ordering.
  setupResponse(buf, call,
                (error == null) ? Status.SUCCESS : Status.ERROR,
                value, errorClass, error);
  // Discard the large buf and reset it back to
  // smaller size to freeup heap
  if (buf.size() > maxRespSize) {
    LOG.warn("Large response size " + buf.size() + " for call " +
        call.toString());
    buf = new ByteArrayOutputStream(INITIAL_RESP_BUF_SIZE);
  }
  responder.doRespond(call);
}
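The hand-off between the readers and the Handler threads is a classic producer/consumer queue. The sketch below shows that structure with a BlockingQueue and a small pool of handler threads. It illustrates the pattern under simplified assumptions (calls are plain integers, and the "response" is a string); HandlerPoolSketch and process are hypothetical names, not Hadoop classes.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

class HandlerPoolSketch {

    /** Queues numCalls call ids and lets numHandlers threads answer them. */
    static Map<Integer, String> process(int numHandlers, int numCalls)
            throws InterruptedException {
        final BlockingQueue<Integer> callQueue = new LinkedBlockingQueue<Integer>();
        final Map<Integer, String> responses = new ConcurrentHashMap<Integer, String>();
        final CountDownLatch done = new CountDownLatch(numCalls);

        Thread[] handlers = new Thread[numHandlers];
        for (int h = 0; h < numHandlers; h++) {
            handlers[h] = new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            int callId = callQueue.take();   // blocks while the queue is empty
                            responses.put(callId, "result-" + callId);
                            done.countDown();
                        }
                    } catch (InterruptedException e) {
                        // interrupted: the handler shuts down
                    }
                }
            });
            handlers[h].setDaemon(true);
            handlers[h].start();
        }

        // The reader side: enqueue the calls as they "arrive".
        for (int id = 0; id < numCalls; id++) {
            callQueue.put(id);
        }

        done.await();                    // wait until every call has been answered
        for (Thread t : handlers) {
            t.interrupt();               // stop the handlers
        }
        return responses;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<Integer, String> responses = process(3, 10);
        System.out.println(responses.size() + " calls handled");
    }
}
```

Decoupling reading from processing this way lets a few NIO threads accept and parse requests while the slower, possibly blocking work runs on the handler threads.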

Server.Responder doRespond()

void doRespond(Call call) throws IOException {
  synchronized (call.connection.responseQueue) {
    call.connection.responseQueue.addLast(call);
    if (call.connection.responseQueue.size() == 1) {
      // inHandler is true: if the response cannot be written in full,
      // the channel is registered with writeSelector
      processResponse(call.connection.responseQueue, true);
    }
  }
}

That covers Client and Server. The remaining piece is RPC.java.

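Client and Server only move Writable objects. What is still missing is the part that stub and skeleton classes play in other RPC systems (in CORBA they are generated from IDL): turning a method call into a request and back. In Hadoop that is the job of org.apache.hadoop.ipc.RPC, whose main inner classes are Invocation, ClientCache, Invoker, and Server:

Invocation: a value object (VO) that wraps one method call. It holds methodName, parameterClasses, and parameters, and it is a Writable.
ClientCache: caches Client objects with the socket factory as hash key, in a HashMap<SocketFactory, Client>.
Invoker: the InvocationHandler of Java's dynamic proxy mechanism.
Server: a subclass of ipc.Server.

On the client side Hadoop RPC relies on Java's Dynamic Proxy. Invoker implements InvocationHandler, so every method called on the proxy ends up in Invoker's invoke method. invoke wraps the method and its arguments in an Invocation and sends it over the socket with a Client taken from ClientCache. On the server side, RPC.Server extends org.apache.hadoop.ipc.Server and executes the Invocation it receives.

RPC.Invoker invoke()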
public Object invoke(Object proxy, Method method, Object[] args)
    throws Throwable {
  final boolean logDebug = LOG.isDebugEnabled();
  long startTime = 0;
  if (logDebug) {
    startTime = System.currentTimeMillis();
  }

  ObjectWritable value = (ObjectWritable)
      client.call(new Invocation(method, args), remoteId);

  if (logDebug) {
    long callTime = System.currentTimeMillis() - startTime;
    LOG.debug("Call: " + method.getName() + " " + callTime);
  }
  return value.get();
}
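The dynamic proxy mechanism behind Invoker is part of the JDK, so it can be tried in isolation. In the sketch below the handler does not contact any server; it only shows that a call on the proxy arrives in invoke() with the method and its arguments, which is the information Hadoop packs into an Invocation. EchoProtocol, RecordingInvoker, and ProxySketch are hypothetical names.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.Arrays;

/** A hypothetical protocol interface; in Hadoop this would be an RPC protocol. */
interface EchoProtocol {
    String echo(String message);
}

/** Receives every call made on the proxy, like RPC.Invoker does. */
class RecordingInvoker implements InvocationHandler {
    public Object invoke(Object proxy, Method method, Object[] args) {
        // A real RPC layer would serialize the method name and arguments and
        // send them to a server. Here they are just turned into a string.
        return method.getName() + Arrays.toString(args);
    }
}

class ProxySketch {
    /** Builds a proxy object that implements EchoProtocol. */
    static EchoProtocol getProxy() {
        return (EchoProtocol) Proxy.newProxyInstance(
            EchoProtocol.class.getClassLoader(),
            new Class<?>[] { EchoProtocol.class },
            new RecordingInvoker());
    }

    public static void main(String[] args) {
        EchoProtocol proxy = getProxy();
        System.out.println(proxy.echo("hi"));
    }
}
```

The caller only sees an object that implements the protocol interface; no per-protocol stub class has to be written or generated.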

A typical invoke() implementation contains a line such as method.invoke(ac, arg);. With method.invoke(ac, arg); the target method runs locally, in the same JVM. Hadoop's invoke() has no such line. In its place it has:

ObjectWritable value = (ObjectWritable)
    client.call(new Invocation(method, args), remoteId);

The method and arguments are wrapped in an Invocation VO and sent to the server through Client's call() method. On the server side, RPC.Server extends org.apache.hadoop.ipc.Server and carries out the call. An RPC server is created through ipc.RPC's getServer() method:

/** Construct a server for a protocol implementation instance listening on a
 * port and address, with a secret manager. */
public static Server getServer(final Object instance, final String bindAddress, final int port,
                               final int numHandlers,
                               final boolean verbose, Configuration conf,
                               SecretManager<? extends TokenIdentifier> secretManager)
    throws IOException {
  return new Server(instance, conf, bindAddress, port, numHandlers, verbose, secretManager);
}

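getServer() is a factory method: the Server it creates is an RPC.Server, and RPC.Server is a subclass of ipc.Server. A client obtains a proxy for a Server with RPC.waitForProxy(); the process that provides the service creates its Server with RPC.getServer().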

RPC RPC IPC RPC Hadoop

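VersionedProtocol is the parent interface of all RPC protocol interfaces. It declares a single method, getProtocolVersion().

(1) HDFS protocols

ClientDatanodeProtocol: between a client and a datanode.
ClientProtocol: between a client and the Namenode.
DatanodeProtocol: between a Datanode and the Namenode, for example heartbeats and blockreport.
NamenodeProtocol: between the SecondaryNode and the Namenode.
InterDatanodeProtocol: between Datanodes, for operations on a block.

(2) Mapreduce protocols

InterTrackerProtocol: between a TaskTracker and the JobTracker; its role is similar to DatanodeProtocol.
JobSubmissionProtocol: between a JobClient and the JobTracker, for submitting a Job and for other Job-related operations.
TaskUmbilicalProtocol: between a Task (the mapreduce child process) and its TaskTracker.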

HDFS
HDFS NameNode DataNodeNameNode

DataNode namenode namenode DataNode DataNode DataNode DataNode ID HDFS DataNode DataNode NameNode NameNode NameNode NameNode DataNode NameNode HDFS


DataNode NameNode DataNode NameNode Heartbeat NameNode NameNode DataNode DataNode NameNode DataNode / DataNode / DataNode DataNode DataNode DataNode Hadoop HADOOP_HOME/conf/hdfs-site.xml
<property>
  <name>dfs.data.dir</name>
  <value>/usr/local/hadoop/data1,/usr/local/hadoop/data2</value>
</property>

With this setting, /usr/local/hadoop/data1 and /usr/local/hadoop/data2 each have the following layout:

drwxr-xr-x  6 hadoop hadoop 4096 4 26 15:11 .
drwxr-xr-x 24 hadoop hadoop 4096 4 19 14:26 ..
drwxrwxr-x  2 hadoop hadoop 4096 4 26 13:57 blocksBeingWritten
drwxrwxr-x  2 hadoop hadoop 4096 4 26 13:57 current
drwxrwxr-x  2 hadoop hadoop 4096 4  3 14:10 detach
-rw-rw-r--  1 hadoop hadoop  157 4  3 14:10 storage
drwxrwxr-x  2 hadoop hadoop 4096 4 26 13:56 tmp

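storage is a small file that records version information. in_use.lock appears while the DataNode is running and marks the directory as in use. blocksBeingWritten holds the block files that are currently being written; a block stays in blocksBeingWritten until the write completes.

current holds the completed blocks; everything in current is considered valid data. detach is used together with snapshot for copy-on-write. tmp holds temporary files that the DataNode creates while it works; a file in tmp is moved into current once it is complete. When a directory under current fills up, subdirectories subdir0 to subdir63 are created beneath it, so the number of files per directory stays bounded however many blocks HDFS stores. The content of current looks like this:

-rw-rw-r-- 1 hadoop hadoop   66 4 18 17:19 blk_8027040652559443757
-rw-rw-r-- 1 hadoop hadoop   11 4 18 17:19 blk_8027040652559443757_1055.meta
-rw-rw-r-- 1 hadoop hadoop  727 4 18 17:42 blk_-8559958631634410715
-rw-rw-r-- 1 hadoop hadoop   15 4 18 17:42 blk_-8559958631634410715_1071.meta
-rw-rw-r-- 1 hadoop hadoop 7416 6 14 15:36 dncp_block_verification.log.curr
-rw-rw-r-- 1 hadoop hadoop  155 6 14 15:36 VERSION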

8027040652559443757 and -8559958631634410715 are block IDs. 1055 and 1071 are the generation stamps of those blocks.
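The naming convention in the listing can be captured in a few helper methods. This is a sketch based only on the file names shown above; BlockFileNames and its methods are hypothetical helpers, not part of Hadoop.

```java
class BlockFileNames {
    private static final String PREFIX = "blk_";
    private static final String META_SUFFIX = ".meta";

    /** Name of the data file for a block ID, e.g. blk_8027040652559443757. */
    static String blockFileName(long blockId) {
        return PREFIX + blockId;
    }

    /** Name of the meta file: the block file name, the generation stamp, then .meta. */
    static String metaFileName(long blockId, long generationStamp) {
        return blockFileName(blockId) + "_" + generationStamp + META_SUFFIX;
    }

    /** Splits a meta file name into {blockId, generationStamp}. */
    static long[] parseMetaFileName(String name) {
        if (!name.startsWith(PREFIX) || !name.endsWith(META_SUFFIX)) {
            throw new IllegalArgumentException("Not a block meta file: " + name);
        }
        String body = name.substring(PREFIX.length(),
                                     name.length() - META_SUFFIX.length());
        // The block ID may be negative, so split at the last underscore.
        int sep = body.lastIndexOf('_');
        long blockId = Long.parseLong(body.substring(0, sep));
        long generationStamp = Long.parseLong(body.substring(sep + 1));
        return new long[] { blockId, generationStamp };
    }

    public static void main(String[] args) {
        long[] parts = parseMetaFileName("blk_-8559958631634410715_1071.meta");
        System.out.println("blockId=" + parts[0] + ", generationStamp=" + parts[1]);
    }
}
```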

current VERSION dncp_block_verification.log.curr DataNode DataNode HDFS


FORMAT("-format")
REGULAR("-regular")
UPGRADE("-upgrade")
ROLLBACK("-rollback")
FINALIZE("-finalize")
IMPORT("-importCheckpoint"): import a Checkpoint

Hadoop http://wiki.apache.org/hadoop/Hadoop_Upgrade upgrade rollback finalize importCheckpoint NameNode

//

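On the DataNode, upgrade/rollback/finalize are implemented in DataStorage. The DataNode compares the version in its VERSION file with the version reported by the NameNode to decide what is needed, and the operations themselves are a sequence of directory renames:

Upgrade: rename current to previous.tmp, build a new current as a snapshot of it (the block files in the new current are hard links to those in previous.tmp), write the new VERSION file into current, then rename previous.tmp to previous.
Rollback: rename current to removed.tmp, rename previous to current, then delete removed.tmp.
Finalize: rename previous to finalized.tmp, then delete finalized.tmp.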

HDFS

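The DataNode stores each block as an ordinary file whose name starts with blk_. Every block file has a meta file with the extension .meta; the meta file is named blockFileName_generationStamp.meta. In HDFS terms a storage is a directory that holds blocks. A Datanode can have several storages, and the Namenode tracks the Datanode's storage as a whole. Inside the Datanode, FSDataset manages the storages: it holds one FSVolume per storage directory. An FSVolume has a dataDir (the blocks and meta files), a tmpDir, and a detachDir (copy on write for blocks in snapshot); a block that has to be modified is first copied through detachDir.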

Storage extends StorageInfo, and the DataNode's DataStorage extends Storage.

StorageInfo has 3 fields:

public int   layoutVersion;  // Version read from the stored file.
public int   namespaceID;    // namespace id of the storage
public long  cTime;          // creation timestamp

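A Storage can manage several directories, because dfs.data.dir may list more than one. Each of them is represented by a StorageDirectory. StorageDirectory's analyzeStorage method inspects a directory and returns the state of that StorageDirectory:

NON_EXISTENT: the directory does not exist
NOT_FORMATTED: the directory has not been formatted
COMPLETE_UPGRADE: previous.tmp exists, current exists
RECOVER_UPGRADE: previous.tmp exists, current does not exist
COMPLETE_FINALIZE: finalized.tmp exists, current exists
COMPLETE_ROLLBACK: removed.tmp exists, current exists, previous does not exist
RECOVER_ROLLBACK: removed.tmp exists, current does not exist, previous exists
COMPLETE_CHECKPOINT: lastcheckpoint.tmp exists, current exists
RECOVER_CHECKPOINT: lastcheckpoint.tmp exists, current does not exist
NORMAL: the normal working state

A StorageDirectory may contain these subdirectories:

current
previous
previous.tmp
removed.tmp
finalized.tmp
lastcheckpoint.tmp (NameNode only)
previous.checkpoint (NameNode only)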

doRecover brings a directory back to a consistent state. The action depends on the state:

COMPLETE_UPGRADE: mv previous.tmp -> previous
RECOVER_UPGRADE: mv previous.tmp -> current
COMPLETE_FINALIZE: rm finalized.tmp
COMPLETE_ROLLBACK: rm removed.tmp
RECOVER_ROLLBACK: mv removed.tmp -> current
COMPLETE_CHECKPOINT: mv lastcheckpoint.tmp -> previous.checkpoint
RECOVER_CHECKPOINT: mv lastcheckpoint.tmp -> current
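The states and recovery actions listed above amount to a small decision table over which directories exist. The sketch below encodes that table. It is a simplification for illustration: it ignores NON_EXISTENT and the consistency checks a real implementation needs (for instance, more than one .tmp directory being present), and StorageRecoverySketch, analyze, and recoveryAction are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class StorageRecoverySketch {

    enum State {
        NOT_FORMATTED, NORMAL,
        COMPLETE_UPGRADE, RECOVER_UPGRADE, COMPLETE_FINALIZE,
        COMPLETE_ROLLBACK, RECOVER_ROLLBACK,
        COMPLETE_CHECKPOINT, RECOVER_CHECKPOINT
    }

    /** Derives the state from the names of the subdirectories that exist. */
    static State analyze(Set<String> dirs) {
        boolean hasCurrent = dirs.contains("current");
        if (dirs.contains("previous.tmp")) {
            return hasCurrent ? State.COMPLETE_UPGRADE : State.RECOVER_UPGRADE;
        }
        if (dirs.contains("finalized.tmp")) {
            return State.COMPLETE_FINALIZE;
        }
        if (dirs.contains("removed.tmp")) {
            return hasCurrent ? State.COMPLETE_ROLLBACK : State.RECOVER_ROLLBACK;
        }
        if (dirs.contains("lastcheckpoint.tmp")) {
            return hasCurrent ? State.COMPLETE_CHECKPOINT : State.RECOVER_CHECKPOINT;
        }
        return hasCurrent ? State.NORMAL : State.NOT_FORMATTED;
    }

    /** The recovery step for each state, as in the doRecover list. */
    static String recoveryAction(State state) {
        switch (state) {
            case COMPLETE_UPGRADE:    return "mv previous.tmp -> previous";
            case RECOVER_UPGRADE:     return "mv previous.tmp -> current";
            case COMPLETE_FINALIZE:   return "rm finalized.tmp";
            case COMPLETE_ROLLBACK:   return "rm removed.tmp";
            case RECOVER_ROLLBACK:    return "mv removed.tmp -> current";
            case COMPLETE_CHECKPOINT: return "mv lastcheckpoint.tmp -> previous.checkpoint";
            case RECOVER_CHECKPOINT:  return "mv lastcheckpoint.tmp -> current";
            default:                  return "none";
        }
    }

    /** Convenience helper for building a set of directory names. */
    static Set<String> dirs(String... names) {
        return new HashSet<String>(Arrays.asList(names));
    }

    public static void main(String[] args) {
        State state = analyze(dirs("previous.tmp"));
        System.out.println(state + ": " + recoveryAction(state));
    }
}
```

The COMPLETE_ states finish an operation that got past its point of no return, while the RECOVER_ states undo one that did not.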

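Take RECOVER_UPGRADE as an example. An upgrade runs in these steps:

1. current -> previous.tmp
2. rebuild current
3. previous.tmp -> previous

If previous.tmp exists and current does not, the upgrade failed between step 1 and step 2. Recovery only has to rename previous.tmp back to current.

StorageDirectory persists the StorageInfo fields in the VERSION file of the directory; StorageDirectory's read/write methods load and store it. The VERSION file of a DataNode looks like this:

#Fri Jun 14 15:36:32 CST 2013
namespaceID=1584403768
storageID=DS-1617068520-127.0.1.1-50010-1364969464023
cTime=0
storageType=DATA_NODE
layoutVersion=-32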

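StorageDirectory also manages the in_use.lock file: its lock and unlock methods take and release a lock on the StorageDirectory. Storage manages the set of StorageDirectory objects. DataStorage extends Storage for the DataNode and carries out upgrade/rollback/finalize through doUpgrade/doRollback/doFinalize; DataStorage's format creates a fresh DataNode storage. Storage, StorageDirectory, and DataStorage deal with the directories as a whole. The blocks inside a Storage are the business of FSDataset.

A Block corresponds to a block file and its meta file, for example blk_3148782637964391313 and blk_3148782637964391313_242812.meta. Here the blockId is 3148782637964391313 and the generation stamp is 242812; numBytes is the length of the Block. DatanodeBlockInfo records where a Block is stored: the FSVolume it belongs to, its file, and whether it has been detached. detach exists because of snapshot: the files in a snapshot are hard links to the files in current, so before a block in current is changed it has to be detached from the snapshot, otherwise the change would show up in the snapshot as well.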

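The copy-on-write step is done by DatanodeBlockInfo's detachBlock, which copies the Block through the detach directory and updates the DatanodeBlockInfo.

FSVolumeSet, FSVolume, and FSDir organize the DataNode's Storage. FSDir corresponds to a directory that holds Block files inside a Storage, FSVolume corresponds to one Storage, and FSVolumeSet is the set of FSVolumes owned by FSDataset. Because HDFS spreads blocks over subdirectories, FSDir forms a tree. FSDir's getBlockInfo collects the Blocks under an FSDir, and getVolumeMap fills the map from Block to DatanodeBlockInfo. When an FSVolume is created for a Storage, recoverDetachedBlocks restores any files left in detach by an interrupted detach operation. FSVolumeSet chooses the FSVolume that receives a new Block. FSDataset also uses ActiveFile: an ActiveFile records a file that is currently being written, together with the threads writing it.

FSDataset implements FSDatasetInterface. FSDatasetInterface is the interface through which the DataNode reaches its block storage, and FSDataset is its implementation. Its main fields are:

FSVolumeSet volumes;
private HashMap<Block,ActiveFile> ongoingCreates = new HashMap<Block,ActiveFile>();
private HashMap<Block,DatanodeBlockInfo> volumeMap = new HashMap<Block, DatanodeBlockInfo>();

volumes holds the Storages managed by FSDataset. ongoingCreates maps each Block that is being created to its ActiveFile; a Block stays in ongoingCreates until it is finalized. volumeMap maps every Block in FSDataset to its DatanodeBlockInfo.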


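The main methods of FSDataset:

public long getMetaDataLength(Block b) throws IOException;
Returns the length of the meta file of a block. The block is identified by its block ID.

public MetaDataInputStream getMetaDataInputStream(Block b) throws IOException;
Returns an input stream on the meta file of a block, identified by its block ID.

public boolean metaFileExists(Block b) throws IOException;
Checks whether the meta file of a block exists.

public long getLength(Block b) throws IOException;
Returns the length of a block.

public Block getStoredBlock(long blkid) throws IOException;
Returns the stored Block for a Block ID.

public InputStream getBlockInputStream(Block b) throws IOException;
public InputStream getBlockInputStream(Block b, long seekOffset) throws IOException;
Return an input stream on the data of a Block, optionally positioned at seekOffset.

public BlockInputStreams getTmpInputStreams(Block b, long blkoff, long ckoff) throws IOException;
Returns input streams on a Block whose files are still in tmp, that is, not yet moved to current.

public BlockWriteStreams writeToBlock(Block b, boolean isRecovery) throws IOException;
Creates a block and returns its BlockWriteStreams. isRecovery says whether this is the recovery of an earlier, failed write to the block. writeToBlock creates an ActiveFile, records it in ongoingCreates, and returns the BlockWriteStreams. The ActiveFile remembers the threads that write the file. For blk_3148782637964391313, when the DataNode is asked to write the Block with ID 3148782637964391313, it creates the file tmp/blk_3148782637964391313 and the meta file tmp/blk_3148782637964391313_XXXXXX.meta, where XXXXXX is the generation stamp. If isRecovery is true, the block may already have gone through finalizeBlock, in which case it is detached first; if it is still being written, writeToBlock interrupts the threads recorded in ongoingCreates and replaces the ActiveFile in ongoingCreates.

public void updateBlock(Block oldblock, Block newblock) throws IOException;
Updates a block. updateBlock calls tryUpdateBlock in a loop. tryUpdateBlock updates the block and volumeMap; if threads are still writing the block, tryUpdateBlock returns them instead, and updateBlock interrupts them and waits for them with join before trying again.

public void finalizeBlock(Block b) throws IOException;
Finalizes a block created by writeToBlock: the Block is moved from tmp to current. FSDataset's finalizeBlock removes the block from ongoingCreates and adds the block with its DatanodeBlockInfo to volumeMap. For blk_3148782637964391313, once the DataNode has finished writing the Block with ID 3148782637964391313, the DataNode moves tmp/blk_3148782637964391313 into a directory under current, say subdir12: tmp/blk_3148782637964391313 becomes current/subdir12/blk_3148782637964391313. The meta file moves to current/subdir12 as well.

public void unfinalizeBlock(Block b) throws IOException;
Discards a block that was created by writeToBlock but has not gone through finalizeBlock.

public boolean isValidBlock(Block b);
Checks whether a Block is valid.

public void invalidate(Block invalidBlks[]) throws IOException;
Invalidates the given blocks.

public void validateBlockMetadata(Block b) throws IOException;
Validates the metadata of a block.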

DataNode

HDFS
DataNode

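DataXceiverServer and DataXceiver implement the DataNode's streaming read/write service. Block data is not transferred through RPC; the DataNode uses a stream protocol of its own. DataXceiverServer accepts the connections, DataXceiver serves each of them, and DataXceiver relies on BlockSender and BlockReceiver to send and receive block data.

DataXceiverServer is simple. It creates the server socket, and the DataXceiverServer run method accepts connections in a loop. For every socket it accepts, DataXceiverServer creates a DataXceiver thread that serves that socket. A DataXceiver supports these operations:

OP_WRITE_BLOCK (80)
OP_READ_BLOCK (81)
OP_READ_METADATA (82)
OP_REPLACE_BLOCK (83)
OP_COPY_BLOCK (84)
OP_BLOCK_CHECKSUM (85)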
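One way to picture the operation codes is as an enum with a lookup by byte value followed by a dispatch. The sketch below uses the six codes listed above; the enum, the handler names returned by dispatch, and the class DataTransferOps are hypothetical, and Hadoop 1.x itself defines the codes as byte constants rather than an enum.

```java
class DataTransferOps {

    /** The operation codes from the list above. */
    enum Op {
        WRITE_BLOCK(80),
        READ_BLOCK(81),
        READ_METADATA(82),
        REPLACE_BLOCK(83),
        COPY_BLOCK(84),
        BLOCK_CHECKSUM(85);

        final byte code;

        Op(int code) {
            this.code = (byte) code;
        }

        /** Looks up the operation for a code read from the stream. */
        static Op fromCode(byte code) {
            for (Op op : values()) {
                if (op.code == code) {
                    return op;
                }
            }
            throw new IllegalArgumentException("Unknown opcode " + code);
        }
    }

    /** Returns the (hypothetical) handler name a server would dispatch to. */
    static String dispatch(byte opcode) {
        switch (Op.fromCode(opcode)) {
            case WRITE_BLOCK:    return "writeBlock";
            case READ_BLOCK:     return "readBlock";
            case READ_METADATA:  return "readMetadata";
            case REPLACE_BLOCK:  return "replaceBlock";
            case COPY_BLOCK:     return "copyBlock";
            case BLOCK_CHECKSUM: return "getBlockChecksum";
            default:             throw new IllegalStateException("unreachable");
        }
    }

    public static void main(String[] args) {
        System.out.println(dispatch((byte) 80));
    }
}
```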

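To see DataXceiver at work, follow a file upload:

$HADOOP_HOME/bin/hadoop fs -put <localsrc> <dst>

or

$HADOOP_HOME/bin/hadoop fs -copyFromLocal <localsrc> <dst>

Both lead to a write operation (OP_WRITE_BLOCK (80)). In outline: the client asks the namenode to create the file; the namenode records the new file in the hdfs namespace without touching any datanode; the client then copies the data with IOUtils.copyBytes(). The client cuts the data into packets and asks the namenode which datanodes should hold the blocks. The namenode answers with a list of datanodes for each block. The client sends the data to the first datanode only; with 3 replicas, that datanode forwards the data to the second datanode, and the second to the third. The ACK for each packet travels back from datanode to datanode until it reaches the client.

In more detail:

1. The client creates the file by calling create() on DistributedFileSystem.
2. DistributedFileSystem makes an RPC call to the namenode to create a new file in the namespace, with no blocks yet. The namenode runs its checks and, if they pass, records the new file. DistributedFileSystem returns an FSDataOutputStream to the client; it wraps a DFSOutputStream, which handles the communication with the datanodes and the namenode.
3. As the client writes, DFSOutputStream splits the data into packets and puts them on an internal queue (the data queue). DataStreamer consumes the data queue and asks the namenode to allocate new blocks by picking suitable datanodes to store the replicas.
4. These datanodes form a pipeline. DataStreamer streams the packets to datanode 1 of the pipeline, which stores each packet and forwards it to the next datanode.
5. DFSOutputStream also keeps a queue of packets waiting to be acknowledged (the ack queue). A packet leaves the ack queue only when every datanode in the pipeline has acknowledged it.
6. When the client has finished writing, it calls close(), which flushes the remaining packets to the datanode pipeline and waits for the acknowledgments.
7. The client then tells the namenode that the file is complete. The Namenode already knows which blocks make up the file.

On the hadoop client side the code path starts here: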
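The interplay of the data queue and the ack queue can be shown with two deques. The sketch below is a single-threaded simulation under simplified assumptions (packets are just sequence numbers, and acknowledgments arrive in order); PacketQueuesSketch and its methods are hypothetical names, not the DFSOutputStream implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class PacketQueuesSketch {
    /** Packets written by the client but not yet sent to the pipeline. */
    final Deque<Integer> dataQueue = new ArrayDeque<Integer>();
    /** Packets sent to the pipeline and waiting for acknowledgment. */
    final Deque<Integer> ackQueue = new ArrayDeque<Integer>();

    /** The writer side: a new packet is appended to the data queue. */
    void enqueue(int seqno) {
        dataQueue.addLast(seqno);
    }

    /**
     * The streamer side: takes the next packet off the data queue, "sends" it,
     * and parks it on the ack queue. Returns null if nothing is waiting.
     */
    Integer sendNext() {
        Integer packet = dataQueue.pollFirst();
        if (packet != null) {
            ackQueue.addLast(packet);
        }
        return packet;
    }

    /** The response side: an acknowledged packet is dropped from the ack queue. */
    void ack(int seqno) {
        Integer head = ackQueue.peekFirst();
        if (head == null || head != seqno) {
            throw new IllegalStateException("Unexpected ack for packet " + seqno);
        }
        ackQueue.removeFirst();
    }

    public static void main(String[] args) {
        PacketQueuesSketch queues = new PacketQueuesSketch();
        queues.enqueue(1);
        queues.enqueue(2);
        queues.sendNext();          // packet 1 is now in flight
        queues.ack(1);              // all datanodes confirmed packet 1
        System.out.println("data queue: " + queues.dataQueue
            + ", ack queue: " + queues.ackQueue);
    }
}
```

Keeping sent packets on the ack queue means the client still holds every packet that the pipeline has not confirmed.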
org.apache.hadoop.fs.FsShell:

public int run(String argv[]) throws Exception {
  if ("-put".equals(cmd) || "-copyFromLocal".equals(cmd)) {
    Path[] srcs = new Path[argv.length-2];
    for (int j=0 ; i < argv.length-1 ;)
      srcs[j++] = new Path(argv[i++]);
    copyFromLocal(srcs, argv[i++]);
  }
}

org.apache.hadoop.fs.FsShell:

void copyFromLocal(Path[] srcs, String dstf) throws IOException {
  Path dstPath = new Path(dstf);
  FileSystem dstFs = dstPath.getFileSystem(getConf());
  if (srcs.length == 1 && srcs[0].toString().equals("-"))
    copyFromStdin(dstPath, dstFs);
  else
    dstFs.copyFromLocalFile(false, false, srcs, dstPath);
}

org.apache.hadoop.fs.FileSystem:

public void copyFromLocalFile(boolean delSrc, boolean overwrite,
                              Path[] srcs, Path dst)
    throws IOException {
  Configuration conf = getConf();
  FileUtil.copy(getLocal(conf), srcs, this, dst, delSrc, overwrite, conf);
}

org.apache.hadoop.fs.FileUtil:

public static boolean copy(FileSystem srcFS, Path[] srcs,
                           FileSystem dstFS, Path dst,
                           boolean deleteSource,
                           boolean overwrite, Configuration conf)
    throws IOException {
  if (srcs.length == 1)
    return copy(srcFS, srcs[0], dstFS, dst, deleteSource, overwrite, conf);

  for (Path src : srcs) {
    try {
      if (!copy(srcFS, src, dstFS, dst, deleteSource, overwrite, conf))
        returnVal = false;
    } catch (IOException e) {
      gotException = true;
      exceptions.append(e.getMessage());
      exceptions.append("\n");
    }
  }
  return returnVal;
}

FsShell is the class behind the hadoop shell; its run() method dispatches the hadoop shell commands. For the shell commands -put and -copyFromLocal it calls copyFromLocal(), which calls the file system's copyFromLocalFile(), which calls FileUtil.copy(). The multi-source copy() finally calls the single-source copy():

org.apache.hadoop.fs.FileUtil:

public static boolean copy(FileSystem srcFS, Path src,
                           FileSystem dstFS, Path dst,
                           boolean deleteSource,
                           boolean overwrite,
                           Configuration conf) throws IOException {
  dst = checkDest(src.getName(), dstFS, dst, overwrite);

  if (srcFS.getFileStatus(src).isDir()) {
    checkDependencies(srcFS, src, dstFS, dst);
    if (!dstFS.mkdirs(dst)) {
      return false;
    }
    FileStatus contents[] = srcFS.listStatus(src);
    for (int i = 0; i < contents.length; i++) {
      copy(srcFS, contents[i].getPath(), dstFS,
           new Path(dst, contents[i].getPath().getName()),
           deleteSource, overwrite, conf);
    }
  } else if (srcFS.isFile(src)) {
    InputStream in=null;
    OutputStream out = null;
    try {
      in = srcFS.open(src);
      out = dstFS.create(dst, overwrite);
      IOUtils.copyBytes(in, out, conf, true);
    } catch (IOException e) {
      IOUtils.closeStream(out);
      IOUtils.closeStream(in);
      throw e;
    }
  } else {
    throw new IOException(src.toString() + ": No such file or directory");
  }
  if (deleteSource) {
    return srcFS.delete(src, true);
  } else {
    return true;
  }
}

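For a single file, copy() takes the configuration conf, opens the input stream in and the output stream out, and copies the data with IOUtils.copyBytes(). The output stream comes from dstFS.create(), which passes through FileSystem and ends in DistributedFileSystem:

FileSystem:

dstFS.create(dst, overwrite);

return create(f, overwrite, getConf().getInt("io.file.buffer.size", 4096), getDefaultReplication(), getDefaultBlockSize());

return create(f, overwrite, bufferSize, replication, blockSize, null);

return this.create(f, FsPermission.getDefault(), overwrite, bufferSize, replication, blockSize, progress);

// abstract method
public abstract FSDataOutputStream create(Path f, FsPermission permission, boolean overwrite, int bufferSize, short replication, long blockSize, Progressable progress) throws IOException;

DistributedFileSystem:

return new FSDataOutputStream(dfs.create(getPathName(f), permission, overwrite, true, replication, blockSize, progress, bufferSize), statistics);

Two points are worth noting. First, every create() overload in FileSystem returns an FSDataOutputStream, and the chain ends in the abstract create(), which also returns an FSDataOutputStream. Second, because the destination FS is HDFS, the implementation that runs is DistributedFileSystem's create(). The FSDataOutputStream it returns wraps the OutputStream produced by dfs.create(), which is a DFSOutputStream.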


DFSClient --> DFSOutputStream --> ClientProtocol(NameNode)

DFSOutputStream(String src, FsPermission masked, boolean overwrite,
    boolean createParent, short replication, long blockSize, Progressable progress,
    int buffersize, int bytesPerChecksum) throws IOException {
  this(src, blockSize, progress, bytesPerChecksum, replication);

  computePacketChunkSize(writePacketSize, bytesPerChecksum);

  try {
    namenode.create(
        src, masked, clientName, overwrite, createParent, replication, blockSize);
  } catch(RemoteException re) {
    throw re.unwrapRemoteException(AccessControlException.class,
                                   FileAlreadyExistsException.class,
                                   FileNotFoundException.class,
                                   NSQuotaExceededException.class,
                                   DSQuotaExceededException.class);
  }
  streamer.start();
}

The constructor is reached from this statement:

return new FSDataOutputStream(dfs.create(getPathName(f), permission, overwrite, true, replication, blockSize, progress, bufferSize), statistics);

dfs.create() DFSClient create() OutputStream DFSOutputStreamDFSOutputStream namenode streamer.start() pipeline DataStreamer data queue block 64M 64K packet 1000 packets/block DataStreamer namenode
org.apache.hadoop.hdfs.server.namenode. NameNode: public void create(String src, FsPermission masked, String clientName, boolean overwrite, boolean createParent, short replication, long blockSize ) throws IOException { String clientMachine = getClientMachine();
46

if (stateChangeLog.isDebugEnabled()) { stateChangeLog.debug("*DIR* NameNode.create: file " +src+" for "+clientName+" at "+clientMachine); } if (!checkPathLength(src)) { throw new IOException("create: Pathname too long. Limit " + MAX_PATH_LENGTH + " characters, " + MAX_PATH_DEPTH + " levels."); } namesystem.startFile(src, new PermissionStatus(UserGroupInformation.getCurrentUser().getShortUserName(), null, masked), clientName, clientMachine, overwrite, createParent, replication, blockSize); myMetrics.incrNumFilesCreated(); myMetrics.incrNumCreateFileOps(); } org.apache.hadoop.hdfs.server.namenode. FSNamesystem void startFile(String src, PermissionStatus permissions, String holder, String clientMachine, boolean overwrite, boolean createParent, short replication, long blockSize ) throws IOException { startFileInternal(src, permissions, holder, clientMachine, overwrite, false, createParent, replication, blockSize); getEditLog().logSync(); if (auditLog.isInfoEnabled() && isExternalInvocation()) { final HdfsFileStatus stat = dir.getFileInfo(src); logAuditEvent(UserGroupInformation.getCurrentUser(), Server.getRemoteIp(), "create", src, null, stat); } } org.apache.hadoop.hdfs.server.namenode. FSNamesystem private synchronized void startFileInternal(String src, PermissionStatus permissions, String holder, String clientMachine, boolean overwrite, boolean append, boolean createParent, short replication, long blockSize ) throws IOException {

DatanodeDescriptor clientNode = host2DataNodeMap.getDatanodeByHost(clientMachine); if (append) { // // Replace current node with a INodeUnderConstruction. // Recreate in-memory lease record. // INodeFile node = (INodeFile) myFile; INodeFileUnderConstruction cons = new INodeFileUnderConstruction( node.getLocalNameBytes(), node.getReplication(), node.getModificationTime(), node.getPreferredBlockSize(), node.getBlocks(), node.getPermissionStatus(), holder, clientMachine, clientNode); dir.replaceNode(src, node, cons); leaseManager.addLease(cons.clientName, src); } else { // Now we can add the name to the filesystem. This file has no // blocks associated with it. // checkFsObjectLimit(); // increment global generation stamp long genstamp = nextGenerationStamp(); INodeFileUnderConstruction newNode = dir.addFile(src, permissions, replication, blockSize, holder, clientMachine, clientNode, genstamp); if (newNode == null) { throw new IOException("DIR* NameSystem.startFile: " + "Unable to add file to namespace."); } leaseManager.addLease(newNode.clientName, src); if (NameNode.stateChangeLog.isDebugEnabled()) { NameNode.stateChangeLog.debug("DIR* NameSystem.startFile: " +"add "+src+" to namespace for "+holder); } }

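The namenode's create() ends up in FSNamesystem.startFileInternal(). If the call is an append, the existing INode is replaced with an under-construction INode. Otherwise a new under-construction file with no blocks is added to the namespace and the global generation stamp is incremented. In both cases a lease is recorded for the client. Back on the client, IOUtils.copyBytes() then writes the data, which the client splits into blocks: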

Call chain:

IOUtils
  IOUtils.copyBytes(in, out, conf, true);
  copyBytes(in, out, conf.getInt("io.file.buffer.size", 4096), close);
  copyBytes(in, out, buffSize);
  out.write(buf, 0, bytesRead);

FSOutputSummer
  for (int n=0;n<len;n+=write1(b, off+n, len-n))
  write1(b, off+n, len-n)
  writeChecksumChunk(b, off, length, false);

DFSClient.DFSOutputStream
  writeChunk(b, off, len, checksum);

public static void copyBytes(InputStream in, OutputStream out, int buffSize)
    throws IOException {
  PrintStream ps = out instanceof PrintStream ? (PrintStream)out : null;
  byte buf[] = new byte[buffSize];
  int bytesRead = in.read(buf);
  while (bytesRead >= 0) {
    out.write(buf, 0, bytesRead);
    if ((ps != null) && ps.checkError()) {
      throw new IOException("Unable to write to output stream.");
    }
    bytesRead = in.read(buf);
  }
}

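IOUtils.copyBytes() writes into FSOutputSummer, which computes a checksum for each chunk in writeChecksumChunk() and then hands the chunk and its checksum to writeChunk(), which is implemented by DFSClient.DFSOutputStream.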
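To make the copyBytes() to writeChunk() hand-off concrete, here is a minimal sketch of a chunking, checksumming writer. It is not the Hadoop class: ChunkSummerSketch and ChunkSink are hypothetical names, and java.util.zip.CRC32 stands in for whatever checksum the stream is configured with.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

/** Hypothetical stand-in for FSOutputSummer: cuts a byte stream into chunks and checksums each one. */
class ChunkSummerSketch {
    /** Receives one chunk and its checksum; plays the role of writeChunk(). */
    interface ChunkSink {
        void writeChunk(byte[] b, int off, int len, long checksum);
    }

    private final byte[] buf;      // holds at most one chunk (bytesPerChecksum bytes)
    private int count = 0;         // bytes currently buffered
    private final CRC32 crc = new CRC32();
    private final ChunkSink sink;

    ChunkSummerSketch(int bytesPerChecksum, ChunkSink sink) {
        this.buf = new byte[bytesPerChecksum];
        this.sink = sink;
    }

    /** Buffers data and emits a chunk every time the buffer fills up. */
    void write(byte[] b, int off, int len) {
        while (len > 0) {
            int n = Math.min(len, buf.length - count);
            System.arraycopy(b, off, buf, count, n);
            count += n;
            off += n;
            len -= n;
            if (count == buf.length) {
                flushChunk();
            }
        }
    }

    /** Emits the trailing partial chunk, if any. */
    void flush() {
        if (count > 0) {
            flushChunk();
        }
    }

    private void flushChunk() {
        crc.reset();
        crc.update(buf, 0, count);
        sink.writeChunk(buf, 0, count, crc.getValue());
        count = 0;
    }

    /** Demo helper: returns the chunk lengths produced for totalBytes of input. */
    static List<Integer> chunkLengths(int totalBytes, int bytesPerChecksum) {
        List<Integer> lengths = new ArrayList<>();
        ChunkSummerSketch s =
            new ChunkSummerSketch(bytesPerChecksum, (b, off, len, sum) -> lengths.add(len));
        s.write(new byte[totalBytes], 0, totalBytes);
        s.flush();
        return lengths;
    }
}
```

With 512 bytes per checksum, writing 1200 bytes yields two full chunks and one partial chunk, which is the same shape of input that writeChunk() receives.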


[Figure: class diagram relating OutputStream, FilterOutputStream, DataOutputStream, FSOutputSummer, DFSOutputStream (with its DataStreamer writing to the datanodes), FileSystem, DistributedFileSystem and its create(), which returns FSDataOutputStream(out).]

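DFSOutputStream extends FSOutputSummer and implements writeChunk(). DistributedFileSystem.create() wraps the DFSOutputStream in an FSDataOutputStream, so every write on the FSDataOutputStream ends up in DFSOutputStream.writeChunk(). The client cuts each block into packets and sends them through a pipeline of, say, 3 datanodes: datanode1, datanode2 and datanode3. The client sends packet1 to datanode1. datanode1 forwards packet1 to datanode2 while the client sends packet2 to datanode1. datanode2 then forwards packet1 to datanode3 while the client sends packet3 to datanode1 and datanode1 forwards packet2 to datanode2. When the last datanode, datanode3, has received packet1 it sends an ack to datanode2, datanode2 passes the ack to datanode1, and datanode1 passes it to the client, which then knows the packet has been written.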
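The direction of travel for one packet and its ack can be sketched as follows. This is a toy model, not Hadoop code: PipelineSketch is a hypothetical name, and it traces a single packet rather than the overlapping transfers described above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy trace of one packet moving down a datanode pipeline and its ack moving back up. */
class PipelineSketch {
    static List<String> sendPacket(int seqno, String[] pipeline) {
        List<String> events = new ArrayList<>();
        String from = "client";
        for (String dn : pipeline) {                      // data flows downstream
            events.add(from + " -> " + dn + " : packet" + seqno);
            from = dn;
        }
        for (int i = pipeline.length - 1; i >= 0; i--) {  // acks flow upstream
            String to = (i == 0) ? "client" : pipeline[i - 1];
            events.add(pipeline[i] + " -> " + to + " : ack" + seqno);
        }
        return events;
    }

    public static void main(String[] args) {
        String[] pipeline = { "datanode1", "datanode2", "datanode3" };
        for (String event : sendPacket(1, pipeline)) {
            System.out.println(event);
        }
    }
}
```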

org.apache.hadoop.hdfs.DFSClient.DFSOutputStream // @see FSOutputSummer#writeChunk() @Override protected synchronized void writeChunk(byte[] b, int offset, int len, byte[] checksum) throws IOException { checkOpen(); isClosed(); int cklen = checksum.length; int bytesPerChecksum = this.checksum.getBytesPerChecksum(); if (len > bytesPerChecksum) { throw new IOException("writeChunk() buffer size is " + len + " is larger than supported bytesPerChecksum " + bytesPerChecksum); } if (checksum.length != this.checksum.getChecksumSize()) { throw new IOException("writeChunk() checksum size is supposed to be " + this.checksum.getChecksumSize() + " but found to be " + checksum.length); } synchronized (dataQueue) { // If queue is full, then wait till we can create enough space while (!closed && dataQueue.size() + ackQueue.size() > maxPackets) { try { dataQueue.wait(); } catch (InterruptedException e) { } } isClosed(); if (currentPacket == null) { currentPacket = new Packet(packetSize, chunksPerPacket, bytesCurBlock); if (LOG.isDebugEnabled()) { LOG.debug("DFSClient writeChunk allocating new packet seqno=" + currentPacket.seqno + ", src=" + src + ", packetSize=" + packetSize + ", chunksPerPacket=" + chunksPerPacket + ", bytesCurBlock=" + bytesCurBlock);

} } currentPacket.writeChecksum(checksum, 0, cklen); currentPacket.writeData(b, offset, len); currentPacket.numChunks++; bytesCurBlock += len; // If packet is full, enqueue it for transmission // if (currentPacket.numChunks == currentPacket.maxChunks || bytesCurBlock == blockSize) { if (LOG.isDebugEnabled()) { LOG.debug("DFSClient writeChunk packet full seqno=" + currentPacket.seqno + ", src=" + src + ", bytesCurBlock=" + bytesCurBlock + ", blockSize=" + blockSize + ", appendChunk=" + appendChunk); } // // if we allocated a new packet because we encountered a block // boundary, reset bytesCurBlock. // if (bytesCurBlock == blockSize) { currentPacket.lastPacketInBlock = true; bytesCurBlock = 0; lastFlushOffset = 0; } enqueueCurrentPacket(); // If this was the first write after reopening a file, then the above // write filled up any partial chunk. Tell the summer to generate full // crc chunks from now on. if (appendChunk) { appendChunk = false; resetChecksumChunk(bytesPerChecksum); } int psize = Math.min((int)(blockSize-bytesCurBlock), writePacketSize); computePacketChunkSize(psize, bytesPerChecksum); } } //LOG.debug("DFSClient writeChunk done length " + len + // " checksum length " + cklen);

}

org.apache.hadoop.hdfs.DFSClient.DFSOutputStream
private synchronized void enqueueCurrentPacket() {
  synchronized (dataQueue) {
    if (currentPacket == null) return;
    dataQueue.addLast(currentPacket);
    dataQueue.notifyAll();
    lastQueuedSeqno = currentPacket.seqno;
    currentPacket = null;
  }
}

DFSOutputStream maintains two queues:
org.apache.hadoop.hdfs.DFSClient.DFSOutputStream
private LinkedList<Packet> dataQueue = new LinkedList<Packet>();
private LinkedList<Packet> ackQueue = new LinkedList<Packet>();

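writeChunk() first waits while the data queue and the ack queue together hold more than maxPackets packets. If currentPacket is null it allocates a new Packet, then writes the checksum and the data into the packet. When the packet is full, or the block boundary is reached, the packet is appended to the data queue. The data queue is consumed by the DataStreamer thread: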
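The interplay of the two queues can be modelled without threads. The sketch below is an assumption-laden simplification: WriteQueuesSketch and its method names are hypothetical, packets are reduced to sequence numbers, and the real code blocks with wait()/notifyAll() where this model just returns a boolean.

```java
import java.util.LinkedList;

/** Single-threaded model of the client's dataQueue/ackQueue hand-off. */
class WriteQueuesSketch {
    private final LinkedList<Long> dataQueue = new LinkedList<>(); // packets waiting to be sent
    private final LinkedList<Long> ackQueue = new LinkedList<>();  // packets sent, ack pending
    private final int maxPackets;

    WriteQueuesSketch(int maxPackets) {
        this.maxPackets = maxPackets;
    }

    /** Writer side: the real writer waits while this would be false. */
    boolean hasRoom() {
        return dataQueue.size() + ackQueue.size() <= maxPackets;
    }

    /** Writer side: queue a full packet for transmission. */
    void enqueue(long seqno) {
        dataQueue.addLast(seqno);
    }

    /** Streamer side: move the head of the data queue to the ack queue; -1 if nothing to send. */
    long sendNext() {
        if (dataQueue.isEmpty()) {
            return -1;
        }
        long seqno = dataQueue.removeFirst();
        ackQueue.addLast(seqno);
        return seqno;
    }

    /** Responder side: drop the head of the ack queue once its ack has arrived. */
    boolean ack(long seqno) {
        if (ackQueue.isEmpty() || ackQueue.getFirst() != seqno) {
            return false;
        }
        ackQueue.removeFirst();
        return true;
    }

    /** Error path: unacknowledged packets go back to the front of the data queue. */
    void requeueUnacked() {
        dataQueue.addAll(0, ackQueue);
        ackQueue.clear();
    }

    int pendingData() { return dataQueue.size(); }

    int pendingAcks() { return ackQueue.size(); }
}
```

A packet therefore lives in exactly one queue at a time, which is why the writer's capacity check adds both sizes together.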
org.apache.hadoop.hdfs.DFSClient.DFSOutputStream.DataStreamer public void run() { long lastPacket = 0; while (!closed && clientRunning) { // if the Responder encountered an error, shutdown Responder if (hasError && response != null) { try { response.close(); response.join(); response = null; } catch (InterruptedException e) { } } Packet one = null; synchronized (dataQueue) {

// process IO errors if any boolean doSleep = processDatanodeError(hasError, false); // wait for a packet to be sent. long now = System.currentTimeMillis(); while ((!closed && !hasError && clientRunning && dataQueue.size() == 0 && (blockStream == null || ( blockStream != null && now - lastPacket < timeoutValue/2))) || doSleep) { long timeout = timeoutValue/2 - (now-lastPacket); timeout = timeout <= 0 ? 1000 : timeout; try { dataQueue.wait(timeout); now = System.currentTimeMillis(); } catch (InterruptedException e) { } doSleep = false; } if (closed || hasError || !clientRunning) { continue; } try { // get packet to be sent. if (dataQueue.isEmpty()) { one = new Packet(); // heartbeat packet } else { one = dataQueue.getFirst(); // regular data packet } long offsetInBlock = one.offsetInBlock; // get new block from namenode. if (blockStream == null) { LOG.debug("Allocating new block"); nodes = nextBlockOutputStream(src); this.setName("DataStreamer for file " + src + " block " + block); response = new ResponseProcessor(nodes); response.start(); }

if (offsetInBlock >= blockSize) { throw new IOException("BlockSize " + blockSize + " is smaller than data size. " + " Offset of packet in block " + offsetInBlock + " Aborting file " + src); } ByteBuffer buf = one.getBuffer(); // move packet from dataQueue to ackQueue if (!one.isHeartbeatPacket()) { dataQueue.removeFirst(); dataQueue.notifyAll(); synchronized (ackQueue) { ackQueue.addLast(one); ackQueue.notifyAll(); } } // write out data to remote datanode blockStream.write(buf.array(), buf.position(), buf.remaining()); if (one.lastPacketInBlock) { blockStream.writeInt(0); // indicate end-of-block } blockStream.flush(); lastPacket = System.currentTimeMillis(); if (LOG.isDebugEnabled()) { LOG.debug("DataStreamer block " + block + " wrote packet seqno:" + one.seqno + " size:" + buf.remaining() + " offsetInBlock:" + one.offsetInBlock + " lastPacketInBlock:" + one.lastPacketInBlock); } } catch (Throwable e) { LOG.warn("DataStreamer Exception: " + StringUtils.stringifyException(e)); if (e instanceof IOException) { setLastException((IOException)e); } hasError = true;

} } if (closed || hasError || !clientRunning) { continue; } // Is this block full? if (one.lastPacketInBlock) { synchronized (ackQueue) { while (!hasError && ackQueue.size() != 0 && clientRunning) { try { ackQueue.wait(); // wait for acks to arrive from datanodes } catch (InterruptedException e) { } } } LOG.debug("Closing old block " + block); this.setName("DataStreamer for file " + src); response.close(); // ignore all errors in Response try { response.join(); response = null; } catch (InterruptedException e) { } if (closed || hasError || !clientRunning) { continue; } synchronized (dataQueue) { IOUtils.cleanup(LOG, blockStream, blockReplyStream); nodes = null; response = null; blockStream = null; blockReplyStream = null; } } if (progress != null) { progress.progress(); } // This is used by unit test to trigger race conditions. if (artificialSlowdown != 0 && clientRunning) { LOG.debug("Sleeping for artificial slowdown of " +

artificialSlowdown + "ms"); try { Thread.sleep(artificialSlowdown); } catch (InterruptedException e) {} } } }

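DataStreamer.run() waits for a packet to appear in the data queue, waking up with a timeout (1 s when the computed timeout is not positive). If the queue is still empty it sends a heartbeat packet. When no block stream is open it calls nextBlockOutputStream(), which asks the namenode for the datanodes and the block to write to: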


org.apache.hadoop.hdfs.DFSClient.DFSOutputStream private DatanodeInfo[] nextBlockOutputStream(String client) throws IOException { LocatedBlock lb = null; boolean retry = false; DatanodeInfo[] nodes; int count = conf.getInt("dfs.client.block.write.retries", 3); boolean success; do { hasError = false; lastException = null; errorIndex = 0; retry = false; nodes = null; success = false; long startTime = System.currentTimeMillis(); DatanodeInfo[] excluded = excludedNodes.toArray(new DatanodeInfo[0]); lb = locateFollowingBlock(startTime, excluded.length > 0 ? excluded : null); block = lb.getBlock(); accessToken = lb.getBlockToken(); nodes = lb.getLocations(); // // Connect to first DataNode in the list. // success = createBlockOutputStream(nodes, clientName, false); if (!success) { LOG.info("Abandoning block " + block); namenode.abandonBlock(block, src, clientName);

if (errorIndex < nodes.length) { LOG.info("Excluding datanode " + nodes[errorIndex]); excludedNodes.add(nodes[errorIndex]); } // Connection failed. Let's wait a little bit and retry retry = true; } } while (retry && --count >= 0); if (!success) { throw new IOException("Unable to create new block."); } return nodes; }

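nextBlockOutputStream() retries up to 3 times (dfs.client.block.write.retries) to obtain the datanodes and a block. locateFollowingBlock() gets them from the namenode, and createBlockOutputStream() connects to the first datanode. If the connection fails, the block is abandoned and the failing datanode is excluded from the next attempt.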
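The retry-and-exclude pattern in nextBlockOutputStream() can be sketched in isolation. Everything here is hypothetical scaffolding: the allocator function stands in for the namenode call, the connector function stands in for createBlockOutputStream() and returns the index of the bad node (or -1 on success), and datanodes are plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.function.ToIntFunction;

/** Sketch of "allocate, try to connect, exclude the bad node, retry". */
class BlockAllocRetrySketch {
    static List<String> allocate(Function<Set<String>, List<String>> allocator,
                                 ToIntFunction<List<String>> connector,
                                 int retries) {
        Set<String> excluded = new HashSet<>();
        int count = retries;
        do {
            List<String> nodes = allocator.apply(excluded);  // ask for a pipeline
            int bad = connector.applyAsInt(nodes);           // -1 means the pipeline is usable
            if (bad < 0) {
                return nodes;
            }
            if (bad < nodes.size()) {
                excluded.add(nodes.get(bad));                // do not offer this node again
            }
        } while (--count >= 0);
        throw new IllegalStateException("Unable to create new block.");
    }

    /** Demo: a 4-node cluster in which dn1 is unreachable. */
    static List<String> demo() {
        List<String> cluster = Arrays.asList("dn1", "dn2", "dn3", "dn4");
        return allocate(
            excluded -> {
                List<String> picked = new ArrayList<>();
                for (String dn : cluster) {
                    if (!excluded.contains(dn) && picked.size() < 3) {
                        picked.add(dn);
                    }
                }
                return picked;
            },
            nodes -> nodes.indexOf("dn1"),  // index of the failing node, or -1 if absent
            3);
    }
}
```

In the demo the first attempt fails on dn1, dn1 is excluded, and the second attempt succeeds with the remaining three nodes.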


org.apache.hadoop.hdfs.DFSClient.DFSOutputStream private LocatedBlock locateFollowingBlock(long start, DatanodeInfo[] excludedNodes ) throws IOException { int retries = conf.getInt("dfs.client.block.write.locateFollowingBlock.retries", 5); long sleeptime = 400; while (true) { long localstart = System.currentTimeMillis(); while (true) { try { if (serverSupportsHdfs630) { return namenode.addBlock(src, clientName, excludedNodes); } else { return namenode.addBlock(src, clientName); } } catch (RemoteException e) { } } }

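locateFollowingBlock() retries up to 5 times (dfs.client.block.write.locateFollowingBlock.retries) to obtain the datanodes and a block from the namenode.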



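How does the namenode choose the datanodes and the block? The client's addBlock() call is handled on the namenode by FSNamesystem.getAdditionalBlock(). There, DatanodeDescriptor targets[] holds the datanodes chosen for the block, INode[] pathINodes holds the INodes along the path, the last INode is pendingFile (the INode under construction), and newBlock is the newly allocated block. The result is returned as a LocatedBlock, which org.apache.hadoop.hdfs.DFSClient.DFSOutputStream.nextBlockOutputStream() receives as lb on the client. The client then calls createBlockOutputStream() in org.apache.hadoop.hdfs.DFSClient.DFSOutputStream to connect to the first datanode: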
org.apache.hadoop.hdfs.DFSClient.DFSOutputStream
// connects to the first datanode in the pipeline
// Returns true if success, otherwise return failure.
//
private boolean createBlockOutputStream(DatanodeInfo[] nodes, String client,
    boolean recoveryFlag) {
  short pipelineStatus = (short)DataTransferProtocol.OP_STATUS_SUCCESS;
  String firstBadLink = "";
  if (LOG.isDebugEnabled()) {
    for (int i = 0; i < nodes.length; i++) {
      LOG.debug("pipeline = " + nodes[i].getName());
    }
  }

  // persist blocks on namenode on next flush
  persistBlocks = true;

  boolean result = false;
  try {
    LOG.debug("Connecting to " + nodes[0].getName());
    InetSocketAddress target = NetUtils.createSocketAddr(nodes[0].getName());
    s = socketFactory.createSocket();
    timeoutValue = 3000 * nodes.length + socketTimeout;
    NetUtils.connect(s, target, timeoutValue);
    s.setSoTimeout(timeoutValue);
    s.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);
    LOG.debug("Send buf size " + s.getSendBufferSize());
    long writeTimeout = HdfsConstants.WRITE_TIMEOUT_EXTENSION * nodes.length +
                        datanodeWriteTimeout;

    //
    // Xmit header info to datanode
    //
    DataOutputStream out = new DataOutputStream(
        new BufferedOutputStream(NetUtils.getOutputStream(s, writeTimeout),
                                 DataNode.SMALL_BUFFER_SIZE));
    blockReplyStream = new DataInputStream(NetUtils.getInputStream(s));

    out.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );
    out.write( DataTransferProtocol.OP_WRITE_BLOCK );
    out.writeLong( block.getBlockId() );
    out.writeLong( block.getGenerationStamp() );
    out.writeInt( nodes.length );
    out.writeBoolean( recoveryFlag );   // recovery flag
    Text.writeString( out, client );
    out.writeBoolean(false);            // Not sending src node information
    out.writeInt( nodes.length - 1 );
    for (int i = 1; i < nodes.length; i++) {
      nodes[i].write(out);
    }
    accessToken.write(out);
    checksum.writeHeader( out );
    out.flush();

    // receive ack for connect
    pipelineStatus = blockReplyStream.readShort();
    firstBadLink = Text.readString(blockReplyStream);
    if (pipelineStatus != DataTransferProtocol.OP_STATUS_SUCCESS) {
      if (pipelineStatus == DataTransferProtocol.OP_STATUS_ERROR_ACCESS_TOKEN) {
        throw new InvalidBlockTokenException(
            "Got access token error for connect ack with firstBadLink as "
            + firstBadLink);
      } else {
        throw new IOException("Bad connect ack with firstBadLink as "
            + firstBadLink);
      }
    }

    blockStream = out;
    result = true;     // success
  } catch (IOException ie) {
  } finally {
  }
  return result;
}

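nodes[0] is the first datanode in the pipeline. The client sends it a header that contains the block ID, the generation stamp, the number of datanodes in the pipeline and the list of the remaining datanodes. The for loop starts at i = 1 because the first datanode does not need its own entry. Each datanode uses the list to connect to the next datanode in the pipeline.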
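The field order of that header can be illustrated with plain java.io streams. This is a sketch, not the wire format: the two constant values are assumptions for illustration, writeUTF() stands in for both Text.writeString() and DatanodeInfo.write(), and the access token and checksum header are left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Simplified model of the OP_WRITE_BLOCK header exchanged when the pipeline is set up. */
class WriteBlockHeaderSketch {
    static final short DATA_TRANSFER_VERSION = 17; // assumed value, for illustration only
    static final byte OP_WRITE_BLOCK = 80;         // assumed value, for illustration only

    /** Encodes the header the sender writes; nodes[0] is the receiver itself. */
    static byte[] encode(long blockId, long genStamp, String client,
                         String[] nodes, boolean recovery) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeShort(DATA_TRANSFER_VERSION);
            out.write(OP_WRITE_BLOCK);
            out.writeLong(blockId);
            out.writeLong(genStamp);
            out.writeInt(nodes.length);          // size of the whole pipeline
            out.writeBoolean(recovery);
            out.writeUTF(client);
            out.writeBoolean(false);             // no source-node information
            out.writeInt(nodes.length - 1);      // number of downstream targets
            for (int i = 1; i < nodes.length; i++) {
                out.writeUTF(nodes[i]);          // skip nodes[0], the receiver
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the header in the same order and returns the downstream targets. */
    static String[] decodeTargets(byte[] header) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(header));
            in.readShort();    // version
            in.readByte();     // opcode
            in.readLong();     // block id
            in.readLong();     // generation stamp
            in.readInt();      // pipeline size
            in.readBoolean();  // recovery flag
            in.readUTF();      // client name
            in.readBoolean();  // has source node
            String[] targets = new String[in.readInt()];
            for (int i = 0; i < targets.length; i++) {
                targets[i] = in.readUTF();
            }
            return targets;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A receiver that decodes two targets knows it must open a connection to the first of them and forward a header listing only the second.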
org.apache.hadoop.hdfs.DFSClient.DFSOutputStream.DataStreamer.run()

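Once the datanodes and the block have been obtained and the connection to the first datanode is open, the packet one is moved from the data queue to the ack queue. When its ack comes back OK the packet is removed from the ack queue. If a datanode fails, the packets in the ack queue are moved back to the data queue. blockStream.write() sends the packet to the datanode. If the packet is the last one in the block, an int 0 is written to mark the end of the block. Communication between client and datanode, and between datanode and datanode: because the client sent DataTransferProtocol.OP_WRITE_BLOCK, the datanode handles the request in DataXceiver's writeBlock():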
org.apache.hadoop.hdfs.server.datanode.DataXceiver private void writeBlock(DataInputStream in) throws IOException { DatanodeInfo srcDataNode = null; LOG.debug("writeBlock receive buf size " + s.getReceiveBufferSize() + " tcp no delay " + s.getTcpNoDelay()); // // Read in the header // Block block = new Block(in.readLong(), dataXceiverServer.estimateBlockSize, in.readLong());

LOG.info("Receiving block " + block + " src: " + remoteAddress + " dest: " + localAddress); int pipelineSize = in.readInt(); // num of datanodes in entire pipeline boolean isRecovery = in.readBoolean(); // is this part of recovery? String client = Text.readString(in); // working on behalf of this client boolean hasSrcDataNode = in.readBoolean(); // is src node info present if (hasSrcDataNode) { srcDataNode = new DatanodeInfo(); srcDataNode.readFields(in); } int numTargets = in.readInt(); if (numTargets < 0) { throw new IOException("Mislabelled incoming datastream."); } DatanodeInfo targets[] = new DatanodeInfo[numTargets]; for (int i = 0; i < targets.length; i++) { DatanodeInfo tmp = new DatanodeInfo(); tmp.readFields(in); targets[i] = tmp; } Token<BlockTokenIdentifier> accessToken = new Token<BlockTokenIdentifier>(); accessToken.readFields(in); DataOutputStream replyOut = null; // stream to prev target replyOut = new DataOutputStream( NetUtils.getOutputStream(s, datanode.socketWriteTimeout)); if (datanode.isBlockTokenEnabled) { try { datanode.blockTokenSecretManager.checkAccess(accessToken, null, block, BlockTokenSecretManager.AccessMode.WRITE); } catch (InvalidToken e) { try { if (client.length() != 0) { replyOut.writeShort((short)DataTransferProtocol.OP_STATUS_ERROR_ACCESS_TOKEN); Text.writeString(replyOut, datanode.dnRegistration.getName()); replyOut.flush(); } throw new IOException("Access token verification failed, for client " + remoteAddress + " for OP_WRITE_BLOCK for block " + block); } finally { IOUtils.closeStream(replyOut); } }

  }

  DataOutputStream mirrorOut = null;  // stream to next target
  DataInputStream mirrorIn = null;    // reply from next target
  Socket mirrorSock = null;           // socket to next target
  BlockReceiver blockReceiver = null; // responsible for data handling
  String mirrorNode = null;           // the name:port of next target
  String firstBadLink = "";           // first datanode that failed in connection setup
  short mirrorInStatus = (short)DataTransferProtocol.OP_STATUS_SUCCESS;
  try {
    // open a block receiver and check if the block does not exist
    blockReceiver = new BlockReceiver(block, in,
        s.getRemoteSocketAddress().toString(),
        s.getLocalSocketAddress().toString(),
        isRecovery, client, srcDataNode, datanode);

    //
    // Open network conn to backup machine, if
    // appropriate
    //
    if (targets.length > 0) {
      InetSocketAddress mirrorTarget = null;
      // Connect to backup machine
      mirrorNode = targets[0].getName();
      mirrorTarget = NetUtils.createSocketAddr(mirrorNode);
      mirrorSock = datanode.newSocket();
      try {
        int timeoutValue = datanode.socketTimeout +
                           (HdfsConstants.READ_TIMEOUT_EXTENSION * numTargets);
        int writeTimeout = datanode.socketWriteTimeout +
                           (HdfsConstants.WRITE_TIMEOUT_EXTENSION * numTargets);
        NetUtils.connect(mirrorSock, mirrorTarget, timeoutValue);
        mirrorSock.setSoTimeout(timeoutValue);
        mirrorSock.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);
        mirrorOut = new DataOutputStream(
            new BufferedOutputStream(
                NetUtils.getOutputStream(mirrorSock, writeTimeout),
                SMALL_BUFFER_SIZE));
        mirrorIn = new DataInputStream(NetUtils.getInputStream(mirrorSock));

        // Write header: Copied from DFSClient.java!
        mirrorOut.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );

        mirrorOut.write( DataTransferProtocol.OP_WRITE_BLOCK );
        mirrorOut.writeLong( block.getBlockId() );
        mirrorOut.writeLong( block.getGenerationStamp() );
        mirrorOut.writeInt( pipelineSize );
        mirrorOut.writeBoolean( isRecovery );
        Text.writeString( mirrorOut, client );
        mirrorOut.writeBoolean(hasSrcDataNode);
        if (hasSrcDataNode) { // pass src node information
          srcDataNode.write(mirrorOut);
        }
        mirrorOut.writeInt( targets.length - 1 );
        for ( int i = 1; i < targets.length; i++ ) {
          targets[i].write( mirrorOut );
        }
        accessToken.write(mirrorOut);
        blockReceiver.writeChecksumHeader(mirrorOut);
        mirrorOut.flush();

        // read connect ack (only for clients, not for replication req)
        if (client.length() != 0) {
          mirrorInStatus = mirrorIn.readShort();
          firstBadLink = Text.readString(mirrorIn);
          if (LOG.isDebugEnabled() ||
              mirrorInStatus != DataTransferProtocol.OP_STATUS_SUCCESS) {
            LOG.info("Datanode " + targets.length +
                     " got response for connect ack " +
                     " from downstream datanode with firstbadlink as " +
                     firstBadLink);
          }
        }
      } catch (IOException e) {
      }
    }

    // send connect ack back to source (only for clients)
    if (client.length() != 0) {
      if (LOG.isDebugEnabled() ||
          mirrorInStatus != DataTransferProtocol.OP_STATUS_SUCCESS) {
        LOG.info("Datanode " + targets.length +
                 " forwarding connect ack to upstream firstbadlink is " +
                 firstBadLink);

} replyOut.writeShort(mirrorInStatus); Text.writeString(replyOut, firstBadLink); replyOut.flush(); } // receive the block and mirror to the next target String mirrorAddr = (mirrorSock == null) ? null : mirrorNode; blockReceiver.receiveBlock(mirrorOut, mirrorIn, replyOut, mirrorAddr, null, targets.length); // if this write is for a replication request (and not // from a client), then confirm block. For client-writes, // the block is finalized in the PacketResponder. if (client.length() == 0) { datanode.notifyNamenodeReceivedBlock(block, DataNode.EMPTY_DEL_HINT); LOG.info("Received block " + block + " src: " + remoteAddress + " dest: " + localAddress + " of size " + block.getNumBytes()); } if (datanode.blockScanner != null) { datanode.blockScanner.addBlock(block); } } catch (IOException ioe) { } finally { } }

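The datanode reads the header, including the list of downstream datanodes, into DatanodeInfo targets[]. replyOut is the stream back to the sender, which is either the client or the previous datanode. A BlockReceiver is created to handle the data: DataInputStream in reads from the upstream node, DataOutputStream mirrorOut writes to the next DataNode, and OutputStream out writes to the local disk. If targets.length > 0, this datanode is not the last one in the pipeline, so it connects to the next datanode and forwards the header with the target list shortened by 1 datanode. Once the pipeline of datanodes is set up, receiveBlock() receives the block data and mirrors it to the next datanode: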


org.apache.hadoop.hdfs.server.datanode.BlockReceiver void receiveBlock( DataOutputStream mirrOut, // output to next datanode DataInputStream mirrIn, // input from next datanode DataOutputStream replyOut, // output to previous datanode String mirrAddr, BlockTransferThrottler throttlerArg, int numTargets) throws IOException { mirrorOut = mirrOut; mirrorAddr = mirrAddr; throttler = throttlerArg; try { // write data chunk header if (!finalized) { BlockMetadataHeader.writeHeader(checksumOut, checksum); } if (clientName.length() > 0) { responder = new Daemon(datanode.threadGroup, new PacketResponder(this, block, mirrIn, replyOut, numTargets, Thread.currentThread())); responder.start(); // start thread to processes reponses } /* * Receive until packet length is zero. */ while (receivePacket() > 0) {} // flush the mirror out if (mirrorOut != null) { try { mirrorOut.writeInt(0); // mark the end of the block mirrorOut.flush(); } catch (IOException e) { handleMirrorOutError(e); } }

// wait for all outstanding packet responses. And then // indicate responder to gracefully shutdown. if (responder != null) { ((PacketResponder)responder.getRunnable()).close(); } // if this write is for a replication request (and not // from a client), then finalize block. For client-writes, // the block is finalized in the PacketResponder. if (clientName.length() == 0) { // close the block/crc files close(); // Finalize the block. Does this fsync()? block.setNumBytes(offsetInBlock); datanode.data.finalizeBlock(block); datanode.myMetrics.incrBlocksWritten(); } } catch (IOException ioe) { } finally { } } org.apache.hadoop.hdfs.server.datanode.BlockReceiver private int receivePacket() throws IOException { int payloadLen = readNextPacket(); if (payloadLen <= 0) { return payloadLen; } buf.mark(); //read the header buf.getInt(); // packet length offsetInBlock = buf.getLong(); // get offset of packet in block long seqno = buf.getLong(); // get seqno boolean lastPacketInBlock = (buf.get() != 0); int endOfHeader = buf.position();

buf.reset(); if (LOG.isDebugEnabled()){ LOG.debug("Receiving one packet for block " + block + " of length " + payloadLen + " seqno " + seqno + " offsetInBlock " + offsetInBlock + " lastPacketInBlock " + lastPacketInBlock); } setBlockPosition(offsetInBlock); // First write the packet to the mirror: if (mirrorOut != null && !mirrorError) { try { mirrorOut.write(buf.array(), buf.position(), buf.remaining()); mirrorOut.flush(); } catch (IOException e) { handleMirrorOutError(e); } } buf.position(endOfHeader); int len = buf.getInt(); if (len < 0) { throw new IOException("Got wrong length during writeBlock(" + block + ") from " + inAddr + " at offset " + offsetInBlock + ": " + len); } if (len == 0) { LOG.debug("Receiving empty packet for block " + block); } else { offsetInBlock += len; int checksumLen = ((len + bytesPerChecksum - 1)/bytesPerChecksum)* checksumSize; if ( buf.remaining() != (checksumLen + len)) { throw new IOException("Data remaining in packet does not match " + "sum of checksumLen and dataLen"); } int checksumOff = buf.position();

int dataOff = checksumOff + checksumLen; byte pktBuf[] = buf.array(); buf.position(buf.limit()); // move to the end of the data. /* skip verifying checksum iff this is not the last one in the * pipeline and clientName is non-null. i.e. Checksum is verified * on all the datanodes when the data is being written by a * datanode rather than a client. Whe client is writing the data, * protocol includes acks and only the last datanode needs to verify * checksum. */ if (mirrorOut == null || clientName.length() == 0) { verifyChunks(pktBuf, dataOff, len, pktBuf, checksumOff); } try { if (!finalized) { //finally write to the disk : out.write(pktBuf, dataOff, len); // If this is a partial chunk, then verify that this is the only // chunk in the packet. Calculate new crc for this chunk. if (partialCrc != null) { if (len > bytesPerChecksum) { throw new IOException("Got wrong length during writeBlock(" + block + ") from " + inAddr + " " + "A packet can have only one partial chunk."+ " len = " + len + " bytesPerChecksum " + bytesPerChecksum); } partialCrc.update(pktBuf, dataOff, len); byte[] buf = FSOutputSummer.convertToByteStream(partialCrc, checksumSize); checksumOut.write(buf); LOG.debug("Writing out partial crc for data len " + len); partialCrc = null; } else { checksumOut.write(pktBuf, checksumOff, checksumLen); } datanode.myMetrics.incrBytesWritten(len); /// flush entire packet before sending ack flush();


// update length only after flush to disk datanode.data.setVisibleLength(block, offsetInBlock); } } catch (IOException iex) { datanode.checkDiskError(iex); throw iex; } } // put in queue for pending acks if (responder != null) { ((PacketResponder)responder.getRunnable()).enqueue(seqno, lastPacketInBlock); } if (throttler != null) { // throttle I/O throttler.throttle(payloadLen); } return payloadLen; }
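The header fields that receivePacket() pulls out of the buffer, and the checksum-length formula it uses, can be tried out in isolation. This is a sketch under assumptions: PacketHeaderSketch is a hypothetical class, the packet-length value is written but not interpreted, and the payload (checksums and data) is omitted.

```java
import java.nio.ByteBuffer;

/** Sketch of the packet header layout read by receivePacket(); payload bytes are omitted. */
class PacketHeaderSketch {
    /** Checksum bytes needed for len data bytes: one checksum per chunk, partial chunks included. */
    static int checksumLen(int len, int bytesPerChecksum, int checksumSize) {
        return ((len + bytesPerChecksum - 1) / bytesPerChecksum) * checksumSize;
    }

    /** Builds a header in the order: packet length, offset in block, seqno, last flag, data length. */
    static ByteBuffer header(int packetLen, long offsetInBlock, long seqno,
                             boolean lastPacketInBlock, int dataLen) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 8 + 8 + 1 + 4);
        buf.putInt(packetLen)
           .putLong(offsetInBlock)
           .putLong(seqno)
           .put((byte) (lastPacketInBlock ? 1 : 0))
           .putInt(dataLen);
        buf.flip();
        return buf;
    }

    /** Returns {offsetInBlock, seqno, lastPacketInBlock as 0/1, dataLen}. */
    static long[] parse(ByteBuffer buf) {
        buf.getInt();                        // packet length, not used here
        long offsetInBlock = buf.getLong();
        long seqno = buf.getLong();
        boolean last = buf.get() != 0;
        int len = buf.getInt();
        return new long[] { offsetInBlock, seqno, last ? 1 : 0, len };
    }
}
```

For example, 1000 data bytes with 512 bytes per checksum and 4-byte checksums need 8 checksum bytes, because the trailing partial chunk still gets its own checksum.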

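receiveBlock() calls receivePacket() in a loop until the packet length is 0. For writes that come from a client it also starts a PacketResponder, which keeps a queue of pending acks, waits for the ack from the downstream datanode and then sends an ack to the upstream datanode or to the client. receivePacket() reads one packet, forwards the packet to the next datanode, writes the data and the checksums to disk, and queues the packet for acknowledgement. On the client, org.apache.hadoop.hdfs.DFSClient.DFSOutputStream.ResponseProcessor.run() reads each ack and removes the corresponding packet from the ack queue.

OP_READ_BLOCK (81)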


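The client opens a file by calling open() on the FileSystem object, which for HDFS is a DistributedFileSystem. DistributedFileSystem calls the namenode over RPC to get the locations of the blocks, and for each block the namenode returns the addresses of the datanodes that hold a copy. DistributedFileSystem returns an FSDataInputStream to the client. The FSDataInputStream wraps a DFSInputStream, which manages the datanode and namenode I/O. When the client calls read(), DFSInputStream connects to a datanode that holds the first block, and data is streamed from that datanode as the client calls read() repeatedly. At the end of a block, DFSInputStream closes the connection to that datanode and picks a datanode for the next block. When the client has finished reading, it calls close() on the FSDataInputStream.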


DataNode
DataNode

public class DataNode extends Configured implements InterDatanodeProtocol, ClientDatanodeProtocol, FSConstants, Runnable, DataNodeMXBean

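DataNode implements two communication interfaces: ClientDatanodeProtocol, used by a Client to talk to a DataNode, and InterDatanodeProtocol, used between DataNodes. ipcServer is the DataNode's IPC server, which serves ClientDatanodeProtocol and InterDatanodeProtocol requests. When the DataNode class is loaded, a static block registers its configuration resources: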
static{ Configuration.addDefaultResource("hdfs-default.xml"); Configuration.addDefaultResource("hdfs-site.xml"); }

DataNode loads hdfs-default.xml and hdfs-site.xml. Values set in hdfs-site.xml override those in hdfs-default.xml. hdfs-default.xml is kept under src/hdfs.
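The "site overrides default" behaviour can be illustrated with java.util.Properties. This is only an analogy: Properties stands in for Hadoop's Configuration class, LayeredConfigSketch is a hypothetical name, and the keys and values are illustrative.

```java
import java.util.Properties;

/** Sketch of two configuration layers where the site layer wins over the default layer. */
class LayeredConfigSketch {
    /** Lookups hit the site values first and fall back to the defaults. */
    static Properties layered(Properties defaults, Properties site) {
        Properties merged = new Properties(defaults);
        merged.putAll(site);
        return merged;
    }

    static String demo() {
        Properties defaults = new Properties();           // plays the role of hdfs-default.xml
        defaults.setProperty("dfs.replication", "3");
        defaults.setProperty("dfs.block.size", "67108864");

        Properties site = new Properties();               // plays the role of hdfs-site.xml
        site.setProperty("dfs.replication", "2");

        Properties conf = layered(defaults, site);
        return conf.getProperty("dfs.replication") + "," + conf.getProperty("dfs.block.size");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 2,67108864
    }
}
```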


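The datanode starts from main, in the following steps:
1. main calls secureMain, which calls createDataNode to create and start the datanode and then waits for the datanode to finish.
2. createDataNode calls instantiateDataNode to build the datanode object, then runDatanodeDaemon to start the datanode thread.
3. instantiateDataNode reads ${dfs.network.script} and ${dfs.data.dir} from the configuration for the datanode and calls makeInstance.
4. makeInstance checks the data directories and creates the DataNode object.
5. The DataNode constructor calls startDataNode. If startup fails, it shuts the datanode down.
6. startDataNode does the following:
   - gets the namenode address and connects to the namenode
   - sets up the datanode registration as machineName:port and gets the version and id from the namenode
   - registers the DataNode with JMX (Java Management Extensions)
   - opens the datanode server socket ss on port 50010 and creates the DataXceiverServer on ss
   - sets up the DataBlockScanner and the FSDataset
   - starts the datanode infoServer at http://0.0.0.0:50075 (if https is enabled, the https port is 50475)
   - adds the DataBlockScanner Servlet to infoServer at http://0.0.0.0:50075/blockScannerReport
   - starts the ipc RPC server on port 50020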

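In more detail: main calls secureMain, and secureMain calls createDataNode to create the DataNode. createDataNode calls instantiateDataNode to initialize the DataNode and then runDatanodeDaemon. runDatanodeDaemon registers the DataNode with the NameNode and, if that succeeds, starts the DataNode thread. instantiateDataNode reads the DataNode's storage directories from the configuration and calls makeInstance. makeInstance checks the directories and then runs new DataNode(conf, dirs). The constructor calls startDataNode, which initializes the DataNode: it sets up the socket address of the NameNode, calls DatanodeProtocol.versionRequest to obtain the NamespaceInfo, builds the FSDataset over the storage data, and creates the DataXceiverServer, the DataBlockScanner, the HttpServer and the ipcServer. After the DataNode has registered with the NameNode, DataXceiverServer and ipcServer are started, and the DataNode enters its run method.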

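run repeatedly calls startDistributedUpgradeIfNeeded()/offerService(). offerService is the main service loop: it sends heartbeats to the NameNode, reports the Blocks the DataNode has received, sends the full Block report, and executes the commands returned by the NameNode.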

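Every heartBeatInterval the DataNode calls sendHeartbeat. Newly received Blocks are tracked in two lists, receivedBlockList and delHints. receivedBlockList holds the Blocks this DataNode has received, and delHints holds the matching deletion hints. A hint is set when DataXceiver's replaceBlock calls:
datanode.notifyNamenodeReceivedBlock(block, sourceID)

This tells the DataNode that the Block came from sourceID, so that the copy of the Block on sourceID can be deleted. The received Blocks and their hints are passed to NameNode.blockReceived. In addition, every blockReportInterval the DataNode sends a full Block report to the NameNode. The NameNode answers with commands for the DataNode:
DNA_TRANSFER: transfer Blocks to other DataNodes
DNA_INVALIDATE: invalidate Blocks
DNA_SHUTDOWN: shut down the DataNode
DNA_REGISTER: the DataNode registers again
DNA_FINALIZE: finalize an upgrade
DNA_RECOVERBLOCK: recover Blocks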
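The timing logic of such a service loop can be sketched as a single "tick" function. This is a hypothetical model, not the DataNode code: OfferServiceSketch and its fields are invented for illustration, the interval values (3 s and 1 hour) are assumptions, and RPC calls are reduced to strings.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of one iteration of a heartbeat/block-report loop, driven by an explicit clock. */
class OfferServiceSketch {
    long heartBeatInterval = 3_000L;        // assumed value, in milliseconds
    long blockReportInterval = 3_600_000L;  // assumed value, in milliseconds
    long lastHeartbeat = 0;
    long lastBlockReport = 0;
    final List<String> receivedBlockList = new ArrayList<>();

    /** Returns the calls that would be made to the NameNode at time now. */
    List<String> tick(long now) {
        List<String> calls = new ArrayList<>();
        if (now - lastHeartbeat > heartBeatInterval) {
            lastHeartbeat = now;
            calls.add("sendHeartbeat");
        }
        if (!receivedBlockList.isEmpty()) {   // report newly received blocks promptly
            calls.add("blockReceived(" + receivedBlockList.size() + ")");
            receivedBlockList.clear();
        }
        if (now - lastBlockReport > blockReportInterval) {
            lastBlockReport = now;
            calls.add("blockReport");
        }
        return calls;
    }
}
```

The point of the model is that heartbeats, incremental block notifications and full block reports run on three independent triggers inside the same loop.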

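For DNA_TRANSFER the DataNode calls transferBlocks. For each Block, transferBlocks starts a DataTransfer thread. DataTransfer connects to the target DataNode and sends the Block with OP_WRITE_BLOCK. DNA_RECOVERBLOCK is sent by the NameNode during lease recovery and is handled by the DataNode.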

Further reading on the DataNode classes:
FSDataset: http://caibinbupt.iteye.com/blog/284365
DataXceiverServer, DataXceiver: http://caibinbupt.iteye.com/blog/284979
DataXceiver: http://caibinbupt.iteye.com/blog/284979 http://caibinbupt.iteye.com/blog/286533
BlockReceiver: http://caibinbupt.iteye.com/blog/286259
BlockSender:
DataBlockScanner: http://caibinbupt.iteye.com/blog/286650

NameNode
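HDFS consists of the NameNode and the DataNodes. The NameNode holds the file system metadata, a role comparable to that of the inode in a UNIX file system. The SecondaryNameNode assists the NameNode in HDFS.

The NameNode maintains two mappings:
file => Blocks
Block => DataNode list

The file => Block mapping is stored on the NameNode. The Block => DataNode mapping is not stored persistently: it is rebuilt after the NameNode starts, from the Blocks that each DataNode reports to the NameNode.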
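The two mappings can be sketched with ordinary collections. This is a toy model with hypothetical names: file paths and datanode names are strings, blocks are reduced to numeric ids, and a block report simply adds entries to the second map.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of the two mappings: file => blocks and block => datanodes. */
class NamespaceMapsSketch {
    final Map<String, List<Long>> fileToBlocks = new HashMap<>();     // kept by the NameNode
    final Map<Long, Set<String>> blockToDatanodes = new HashMap<>();  // rebuilt from reports

    /** A datanode reports the blocks it holds; this is how the second map gets filled. */
    void blockReport(String datanode, long... blockIds) {
        for (long id : blockIds) {
            blockToDatanodes.computeIfAbsent(id, k -> new TreeSet<>()).add(datanode);
        }
    }

    /** All datanodes that hold at least one block of the file. */
    Set<String> locate(String file) {
        Set<String> result = new TreeSet<>();
        for (long id : fileToBlocks.getOrDefault(file, Collections.emptyList())) {
            result.addAll(blockToDatanodes.getOrDefault(id, Collections.emptySet()));
        }
        return result;
    }
}
```

Until the datanodes have reported, locate() returns an empty set even for a known file, which mirrors why the second mapping has to be rebuilt after every start.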



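Apart from InterDatanodeProtocol and ClientDatanodeProtocol, the protocols are implemented by the NameNode.

ClientProtocol is the interface through which clients ask the NameNode to operate on HDFS. Like GFS, HDFS does not offer a POSIX interface. Clients use org.apache.hadoop.fs.FileSystem to access HDFS.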



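DatanodeProtocol is used by a DataNode to talk to the NameNode. A DataNode calls register to register itself. sendHeartbeat/blockReport/blockReceived are called from the DataNode's offerService loop. errorReport reports a Block error to the NameNode and is called by BlockReceiver and DataBlockScanner. nextGenerationStamp and commitBlockSynchronization are used in lease handling, which is covered in the Lease chapter. NamenodeProtocol is used by the secondary NameNode to talk to the NameNode.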


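The namenode is started with the command: bin/hadoop namenode
According to the bin/hadoop script, the java class that runs is org.apache.hadoop.hdfs.server.namenode.NameNode. Its startup call chain is:

main --> createNameNode --> NameNode --> initialize

The main fields of the NameNode class are: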
public class NameNode implements ClientProtocol, DatanodeProtocol, NamenodeProtocol, FSConstants, RefreshAuthorizationPolicyProtocol,

RefreshUserMappingsProtocol { // static{ Configuration.addDefaultResource("hdfs-default.xml"); Configuration.addDefaultResource("hdfs-site.xml"); } public static final int DEFAULT_PORT = 8020; // public static final Log LOG = LogFactory.getLog(NameNode.class.getName()); public static final Log stateChangeLog = LogFactory.getLog("org.apache.hadoop.hdfs.StateChange"); public FSNamesystem namesystem; // TODO: This should private. Use getNamesystem() instead. // Datanode /** RPC server */ private Server server; /** RPC server for HDFS Services communication. BackupNode, Datanodes and all other services should be connecting to this server if it is configured. Clients should only go to NameNode#server */ private Server serviceRpcServer; /** RPC server address */ private InetSocketAddress serverAddress = null; /** RPC server for DN address */ protected InetSocketAddress serviceRPCAddress = null; /** httpServer */ private HttpServer httpServer; /** HTTP server address */ private InetSocketAddress httpAddress = null; private Thread emptier; /** only used for testing purposes */ private boolean stopRequested = false; /** Is service level authorization enabled? */ private boolean serviceAuthEnabled = false; static NameNodeInstrumentation myMetrics; //

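FSNamesystem, in package org.apache.hadoop.hdfs.server.namenode, does the bookkeeping for the Namenode.

HttpServer, in package org.apache.hadoop.http, wraps Jetty and gives the Namenode an HTTP interface through which the state of the Namenode can be viewed.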


// public static NameNode createNameNode(String argv[], Configuration conf) throws IOException { ... StartupOption startOpt = parseArguments(argv); ... switch (startOpt) { case FORMAT: // namenode namenode boolean aborted = format(conf, true); System.exit(aborted ? 1 : 0); case FINALIZE: // hadoop aborted = finalize(conf, true); System.exit(aborted ? 1 : 0); default: } ... // NameNode initialize NameNode namenode = new NameNode(conf); return namenode; } private void initialize(Configuration conf) throws IOException { ... // fsimage edits log this.namesystem = new FSNamesystem(this, conf); .... // RPCServer rpc 10 8020 this.server = RPC.getServer(this, socAddr.getHostName(), socAddr.getPort(), handlerCount, false, conf, namesystem .getDelegationTokenSecretManager()); startHttpServer(conf);// http http://namenode:50070 hdfs .... this.server.start(); // RPC server .... // fs.trash.interval 60 startTrashEmptier(conf); }

public static void main(String argv[]) throws Exception {
  try {
    ...
    NameNode namenode = createNameNode(argv, null);
    if (namenode != null)
      namenode.join();
  }
  ...
}
}

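org.apache.hadoop.hdfs.server.namenode.FSNamesystem does the Namenode's actual bookkeeping; NameNode mostly delegates to FSNamesystem. FSNamesystem tracks several tables:
filename => blocks (persisted through FSImage)
the set of valid blocks
block => DataNode list
DataNode => blocks
an LRU cache of DataNodes ordered by heartbeat
The main FSNamesystem members are: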
public class FSNamesystem implements FSConstants, FSNamesystemMBean, NameNodeMXBean, MetricsSource {
  // Directory tree.
  public FSDirectory dir;
  // BlocksMap: maps a Block to its inode and to the Datanodes that hold it.
  final BlocksMap blocksMap = new BlocksMap(DEFAULT_INITIAL_MAP_CAPACITY, DEFAULT_MAP_LOAD_FACTOR);
  // Corrupt replicas.
  public CorruptReplicasMap corruptReplicas = new CorruptReplicasMap();
  // Datanode map.
  NavigableMap<String, DatanodeDescriptor> datanodeMap = new TreeMap<String, DatanodeDescriptor>();
  // Subset of datanodeMap: the DatanodeDescriptors checked by HeartbeatMonitor.
  ArrayList<DatanodeDescriptor> heartbeats = new ArrayList<DatanodeDescriptor>();
  // Blocks that need replication.
  private UnderReplicatedBlocks neededReplications = new UnderReplicatedBlocks();
  // Blocks whose replication is in progress.
  private PendingReplicationBlocks pendingReplications;
  // Lease manager.
  public LeaseManager leaseManager = new LeaseManager(this);

  Daemon hbthread = null;          // HeartbeatMonitor thread: runs FSNamesystem heartbeatCheck on Datanodes
  public Daemon lmthread = null;   // LeaseMonitor thread
  Daemon smmthread = null;         // SafeModeMonitor thread (threshold check)
  public Daemon replthread = null; // Replication thread
  private ReplicationMonitor replmon = null; // Replication metrics

  // Datanode host -> DatanodeDescriptor
  private Host2NodesMap host2DataNodeMap = new Host2NodesMap();
  // Network topology: data centers and racks.
  NetworkTopology clusterMap = new NetworkTopology();
  // DNS-name/IP-address -> RackID
  private DNSToSwitchMapping dnsToSwitchMapping;
  // Chooses the target Datanodes for replicas.
  ReplicationTargetChooser replicator;
  // Reads the lists of Datanodes allowed to, or excluded from, connecting to the Namenode.
  private HostsFileReader hostsReader;
}

FSNamesystem
private void initialize(NameNode nn, Configuration conf) throws IOException {
  this.systemStart = now();
  setConfigurationParameters(conf);
  dtSecretManager = createDelegationTokenSecretManager(conf);
  this.nameNodeAddress = nn.getNameNodeAddress();
  this.registerMBean(conf); // register the MBean for the FSNamesystemStatus
  this.dir = new FSDirectory(this, conf);
  StartupOption startOpt = NameNode.getStartupOption(conf);
  // Load fsimage and edits.
  this.dir.loadFSImage(getNamespaceDirs(conf), getNamespaceEditsDirs(conf), startOpt);
  long timeTakenToLoadFSImage = now() - systemStart;
  LOG.info("Finished loading FSImage in " + timeTakenToLoadFSImage + " msecs");
  NameNode.getNameNodeMetrics().setFsImageLoadTime(timeTakenToLoadFSImage);
  this.safeMode = new SafeModeInfo(conf);
  setBlockTotal();
  pendingReplications = new PendingReplicationBlocks(
      conf.getInt("dfs.replication.pending.timeout.sec", -1) * 1000L);
  if (isAccessTokenEnabled) {
    accessTokenHandler = new BlockTokenSecretManager(true,
        accessKeyUpdateInterval, accessTokenLifetime);
  }
  this.hbthread = new Daemon(new HeartbeatMonitor());     // Datanode heartbeat monitor
  this.lmthread = new Daemon(leaseManager.new Monitor()); // lease monitor
  this.replmon = new ReplicationMonitor();
  this.replthread = new Daemon(replmon);                  // replication monitor
  hbthread.start();
  lmthread.start();
  replthread.start();
  // Include and exclude lists of datanodes.
  this.hostsReader = new HostsFileReader(conf.get("dfs.hosts", ""),
      conf.get("dfs.hosts.exclude", ""));
  // Decommission monitor.
  this.dnthread = new Daemon(new DecommissionManager(this).new Monitor(
      conf.getInt("dfs.namenode.decommission.interval", 30),
      conf.getInt("dfs.namenode.decommission.nodes.per.interval", 5)));
  dnthread.start();
  this.dnsToSwitchMapping = ReflectionUtils.newInstance(
      conf.getClass("topology.node.switch.mapping.impl", ScriptBasedMapping.class,
          DNSToSwitchMapping.class), conf);
  /* If the dns to switch mapping supports cache, resolve network
   * locations of those hosts in the include list,
   * and store the mapping in the cache; so future calls to resolve
   * will be fast.
   */
  if (dnsToSwitchMapping instanceof CachedDNSToSwitchMapping) {
    dnsToSwitchMapping.resolve(new ArrayList<String>(hostsReader.getHosts()));
  }
  InetSocketAddress socAddr = NameNode.getAddress(conf);
  this.nameNodeHostName = socAddr.getHostName();
  registerWith(DefaultMetricsSystem.INSTANCE);
}

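FSDirectory
FSNamesystem delegates the HDFS directory tree to FSDirectory, which stores the file/block mapping as a tree of INode objects. INode is the common base class and holds the fields shared by files and directories. INodeDirectory extends INode and represents a directory. INodeFile extends INode and represents a file, including the blocks stored on the Datanodes. INodeFileUnderConstruction extends INodeFile and represents a file that a client is still writing; when the write completes, the INodeFileUnderConstruction is converted back to an INodeFile. FSDirectory holds the filename->blockset mapping in memory and persists it through its FSImage member, fsImage.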
class FSDirectory implements FSConstants, Closeable {
  final INodeDirectoryWithQuota rootDir; // root INodeDirectory of the hdfs tree
  FSImage fsImage;                       // the FSImage that persists the tree
}

FSDirectory(FSNamesystem ns, Configuration conf) {
  this(new FSImage(), ns, conf);
  ...
}

FSDirectory(FSImage fsImage, FSNamesystem ns, Configuration conf) {
  rootDir = new INodeDirectoryWithQuota(INodeDirectory.ROOT_NAME,
      ns.createFsOwnerPermissions(new FsPermission((short)0755)),
      Integer.MAX_VALUE, -1);
  this.fsImage = fsImage;
  ....
  namesystem = ns;
  ....
}

// FSNamesystem calls loadFSImage on its FSDirectory member dir to load fsimage and edits.

void loadFSImage(Collection<File> dataDirs, Collection<File> editsDirs,
    StartupOption startOpt) throws IOException {
  // format before starting up if requested
  if (startOpt == StartupOption.FORMAT) {
    // Point FSImage at its storage directories (${dfs.name.dir}, /tmp/hadoop/dfs/name).
    fsImage.setStorageDirectories(dataDirs, editsDirs);
    fsImage.format();
    startOpt = StartupOption.REGULAR;
  }
  try {
    if (fsImage.recoverTransitionRead(dataDirs, editsDirs, startOpt)) {
      // Save the namespace back to ${dfs.name.dir}.
      fsImage.saveNamespace(true);
    }
    FSEditLog editLog = fsImage.getEditLog();
    assert editLog != null : "editLog must be initialized";
    if (!editLog.isOpen())
      editLog.open();
    fsImage.setCheckpointDirectories(null, null);
  }
  ...
}

loadFSImage FSImage FSImage EditLog FSImage EditLog EditLog FSImage FSImage EditLog FSImage namenode namenode hdfs rpc namenode namenode FSNamesystem namesystem namesystem


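The chain is therefore namesystem -> FSDirectory dir -> FSImage fsImage -> EditLog. fsImage holds the checkpoint of the hdfs namespace, and EditLog records the changes made since that checkpoint. The Secondary Namenode periodically merges the namenode's EditLog into fsimage, which yields a new fsimage and a fresh EditLog.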
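The relationship between the image and the edit log can be illustrated with a minimal sketch. It is not Hadoop code: the `Edit` type and its two operations are hypothetical stand-ins for the real edit-log records, and the "image" is simply a list of paths. It only shows the idea that the in-memory namespace is the checkpoint plus the edits replayed in order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class CheckpointSketch {
    /** One logged namespace change (hypothetical: only add and delete). */
    static final class Edit {
        final boolean add;
        final String path;
        Edit(boolean add, String path) { this.add = add; this.path = path; }
    }

    /** Rebuilds the namespace: start from the image, then replay the edits in order. */
    static TreeSet<String> load(List<String> image, List<Edit> edits) {
        TreeSet<String> namespace = new TreeSet<>(image);
        for (Edit e : edits) {
            if (e.add) {
                namespace.add(e.path);
            } else {
                namespace.remove(e.path);
            }
        }
        return namespace;
    }

    public static void main(String[] args) {
        List<String> image = Arrays.asList("/a", "/b");
        List<Edit> edits = new ArrayList<>();
        edits.add(new Edit(true, "/c"));
        edits.add(new Edit(false, "/a"));
        // Saving this result as a new image and truncating the log is a checkpoint.
        System.out.println(load(image, edits));
    }
}
```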

INode*
NameNode inode inode INode* INode*

INode INodeDirectory

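INodeFile, INodeDirectoryWithQuota, and INodeFileUnderConstruction complete the hierarchy.
INode holds the attributes common to every HDFS namespace entry: name, modificationTime, accessTime, parent, and permission. HDFS permissions follow the UNIX/Linux model (user, group, and permission bits), and INode packs all three into a single long. Besides its get/set accessors, INode declares collectSubtreeBlocksAndClear, which collects the Blocks under an INode, and computeContentSummary.
INodeDirectory represents an HDFS directory. Its key member is private List<INode> children; and most of its methods get or set child entries.
INodeDirectoryWithQuota extends INodeDirectory and adds quota limits on the NameSpace below the directory.
INodeFile represents an HDFS file. Its key member is protected BlockInfo blocks[] = null; where BlockInfo extends Block.
INodeFileUnderConstruction adds clientName, clientMachine, clientNode (the DataNode the client runs on, if any), and targets.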
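The shape of this hierarchy can be sketched in a few lines of plain Java. The class names below (`Node`, `Dir`, `FileNode`) are hypothetical simplifications, not the Hadoop classes: block lists are reduced to block IDs and child lookup is a linear scan, which is an assumption made for brevity.

```java
import java.util.ArrayList;
import java.util.List;

public class INodeSketch {
    /** Common base: every namespace entry has a name and a parent. */
    static abstract class Node {
        final String name;
        Dir parent;
        Node(String name) { this.name = name; }
        abstract boolean isDirectory();
    }

    /** A directory owns a list of children. */
    static final class Dir extends Node {
        final List<Node> children = new ArrayList<>();
        Dir(String name) { super(name); }
        boolean isDirectory() { return true; }
        <T extends Node> T add(T child) {
            child.parent = this;
            children.add(child);
            return child;
        }
        Node child(String childName) {
            for (Node c : children) {
                if (c.name.equals(childName)) return c;
            }
            return null;
        }
    }

    /** A file owns an ordered list of block IDs. */
    static final class FileNode extends Node {
        final List<Long> blockIds = new ArrayList<>();
        FileNode(String name) { super(name); }
        boolean isDirectory() { return false; }
    }

    /** Walks the tree one path component at a time; returns null if the path is absent. */
    static Node resolve(Dir root, String path) {
        Node cur = root;
        for (String part : path.split("/")) {
            if (part.isEmpty()) continue;
            if (!(cur instanceof Dir)) return null;
            cur = ((Dir) cur).child(part);
            if (cur == null) return null;
        }
        return cur;
    }

    public static void main(String[] args) {
        Dir root = new Dir("");
        Dir user = root.add(new Dir("user"));
        FileNode f = user.add(new FileNode("a.txt"));
        f.blockIds.add(1L);
        System.out.println(resolve(root, "/user/a.txt") == f);
    }
}
```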


Lease
Lease Lease

LeaseNameNode LeaseLease 3 NameNode code

LeaseManager Lease LeaseManager Monitor Lease holder lastUpdate paths LeaseManager Lease LeaseManager addLease Lease renewLease remove add LeaseManager Monitor Lease Lease FSNamesystem internalReleaseLease LeaseManager


Hadoop UNIX FsAction org.apache.hadoop.fs.permission FsAction FsPermission / applyUMask FsPermission PermissionStatus FsPermission INode PermissionStatus long SerialNumberManager PermissionStatus SerialNumberManager FSImage SerialNumberManager

SerialNumberManager INode long FsPermissionMODE USERGROUP PermissionChecker Lease Management hadoop lease GFS hadoop lease client GFS lease client datanode hadoop -- append client write client append lease wirte lease management 1 createwritecomplete lease lease 2 lease lease

lease management
create ClientProtocol create INode client ClientProtocol INode completed

client lease client lease lease writer client lease lease client create Namenode create
public void create(String src, FsPermission masked, String clientName,
    boolean overwrite, boolean createParent, short replication,
    long blockSize) throws IOException {
  String clientMachine = getClientMachine();
  if (stateChangeLog.isDebugEnabled()) {
    stateChangeLog.debug("*DIR* NameNode.create: file " + src
        + " for " + clientName + " at " + clientMachine);
  }
  if (!checkPathLength(src)) {
    throw new IOException("create: Pathname too long. Limit "
        + MAX_PATH_LENGTH + " characters, " + MAX_PATH_DEPTH + " levels.");
  }
  namesystem.startFile(src,
      new PermissionStatus(UserGroupInformation.getCurrentUser().getShortUserName(),
          null, masked),
      clientName, clientMachine, overwrite, createParent, replication, blockSize);
  myMetrics.incrNumFilesCreated();
  myMetrics.incrNumCreateFileOps();
}

FSNamesystem.startFile calls startFileInternal, which handles both append and create:


private synchronized void startFileInternal(String src,
    PermissionStatus permissions, String holder, String clientMachine,
    boolean overwrite, boolean append, boolean createParent,
    short replication, long blockSize) throws IOException {
  if (NameNode.stateChangeLog.isDebugEnabled()) {
    NameNode.stateChangeLog.debug("DIR* NameSystem.startFile: src=" + src
        + ", holder=" + holder + ", clientMachine=" + clientMachine
        + ", createParent=" + createParent + ", replication=" + replication
        + ", overwrite=" + overwrite + ", append=" + append);
  }
  if (isInSafeMode())
    throw new SafeModeException("Cannot create file" + src, safeMode);
  if (!DFSUtil.isValidName(src)) {
    throw new IOException("Invalid file name: " + src);
  }
  // Verify that the destination does not exist as a directory already.
  boolean pathExists = dir.exists(src);
  if (pathExists && dir.isDir(src)) {
    throw new IOException("Cannot create file " + src
        + "; already exists as a directory.");
  }
  if (isPermissionEnabled) {
    if (append || (overwrite && pathExists)) {
      checkPathAccess(src, FsAction.WRITE);
    } else {
      checkAncestorAccess(src, FsAction.WRITE);
    }
  }
  if (!createParent) {
    verifyParentDir(src);
  }
  try {
    INode myFile = dir.getFileINode(src);
    recoverLeaseInternal(myFile, src, holder, clientMachine, false);
    try {
      verifyReplication(src, replication, clientMachine);
    } catch (IOException e) {
      throw new IOException("failed to create " + e.getMessage());
    }
    if (append) {
      if (myFile == null) {
        throw new FileNotFoundException("failed to append to non-existent file "
            + src + " on client " + clientMachine);
      } else if (myFile.isDirectory()) {
        throw new IOException("failed to append to directory " + src
            + " on client " + clientMachine);
      }
    } else if (!dir.isValidToCreate(src)) {
      if (overwrite) {
        delete(src, true);
      } else {
        throw new IOException("failed to create file " + src
            + " on client " + clientMachine
            + " either because the filename is invalid or the file exists");
      }
    }
    DatanodeDescriptor clientNode = host2DataNodeMap.getDatanodeByHost(clientMachine);
    if (append) {
      //
      // Replace current node with a INodeUnderConstruction.
      // Recreate in-memory lease record.
      //
      INodeFile node = (INodeFile) myFile;
      INodeFileUnderConstruction cons = new INodeFileUnderConstruction(
          node.getLocalNameBytes(), node.getReplication(),
          node.getModificationTime(), node.getPreferredBlockSize(),
          node.getBlocks(), node.getPermissionStatus(),
          holder, clientMachine, clientNode);
      dir.replaceNode(src, node, cons);
      leaseManager.addLease(cons.clientName, src);
    } else {
      // Now we can add the name to the filesystem. This file has no
      // blocks associated with it.
      //
      checkFsObjectLimit();
      // increment global generation stamp
      long genstamp = nextGenerationStamp();
      INodeFileUnderConstruction newNode = dir.addFile(src, permissions,
          replication, blockSize, holder, clientMachine, clientNode, genstamp);
      if (newNode == null) {
        throw new IOException("DIR* NameSystem.startFile: "
            + "Unable to add file to namespace.");
      }
      leaseManager.addLease(newNode.clientName, src);
      if (NameNode.stateChangeLog.isDebugEnabled()) {
        NameNode.stateChangeLog.debug("DIR* NameSystem.startFile: "
            + "add " + src + " to namespace for " + holder);
      }
    }
  } catch (IOException ie) {
    NameNode.stateChangeLog.warn("DIR* NameSystem.startFile: " + ie.getMessage());
    throw ie;
  }
}

Once newNode has been created, leaseManager.addLease(newNode.clientName, src) registers the lease:


/**
 * Adds (or re-adds) the lease for the specified file.
 */
synchronized Lease addLease(String holder, String src) {
  Lease lease = getLease(holder);
  if (lease == null) {
    lease = new Lease(holder);
    leases.put(holder, lease);
    sortedLeases.add(lease);
  } else {
    renewLease(lease);
  }
  sortedLeasesByPath.put(src, lease);
  lease.paths.add(src);
  return lease;
}
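The bookkeeping in addLease can be sketched outside Hadoop as follows. This is an illustrative model, not the LeaseManager source: time is passed in explicitly, the hard limit is a constructor parameter, and the hypothetical `expire` method simply drops an expired lease instead of starting the recovery that the real monitor triggers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class LeaseSketch {
    static final class Lease implements Comparable<Lease> {
        final String holder;
        long lastUpdate;
        final TreeSet<String> paths = new TreeSet<>();
        Lease(String holder, long now) { this.holder = holder; this.lastUpdate = now; }
        /** Oldest lease first; the holder name breaks ties. */
        public int compareTo(Lease o) {
            int c = Long.compare(lastUpdate, o.lastUpdate);
            return c != 0 ? c : holder.compareTo(o.holder);
        }
    }

    private final Map<String, Lease> leases = new HashMap<>();           // holder -> lease
    private final TreeSet<Lease> sortedLeases = new TreeSet<>();         // ordered by age
    private final TreeMap<String, Lease> leasesByPath = new TreeMap<>(); // path -> lease
    private final long hardLimitMs;

    LeaseSketch(long hardLimitMs) { this.hardLimitMs = hardLimitMs; }

    /** Adds a path to the holder's lease, creating or renewing the lease as needed. */
    Lease addLease(String holder, String src, long now) {
        Lease lease = leases.get(holder);
        if (lease == null) {
            lease = new Lease(holder, now);
            leases.put(holder, lease);
            sortedLeases.add(lease);
        } else {
            renew(lease, now);
        }
        leasesByPath.put(src, lease);
        lease.paths.add(src);
        return lease;
    }

    /** The sort key changes, so the lease is removed and re-inserted. */
    void renew(Lease lease, long now) {
        sortedLeases.remove(lease);
        lease.lastUpdate = now;
        sortedLeases.add(lease);
    }

    /** Drops every lease older than the hard limit, oldest first; returns their holders. */
    List<String> expire(long now) {
        List<String> expired = new ArrayList<>();
        while (!sortedLeases.isEmpty()) {
            Lease oldest = sortedLeases.first();
            if (now - oldest.lastUpdate <= hardLimitMs) break;
            sortedLeases.remove(oldest);
            leases.remove(oldest.holder);
            for (String p : oldest.paths) leasesByPath.remove(p);
            expired.add(oldest.holder);
        }
        return expired;
    }

    public static void main(String[] args) {
        LeaseSketch m = new LeaseSketch(1000);
        m.addLease("client-1", "/a", 0);
        m.addLease("client-2", "/b", 500);
        m.addLease("client-1", "/c", 900); // renews client-1
        System.out.println(m.expire(1600));
    }
}
```

Keeping the leases in an age-ordered set is what lets the monitor stop at the first lease that has not expired.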

lease lease lease lease lease management write client lease completed create
private void finalizeINodeFileUnderConstruction(String src,
    INodeFileUnderConstruction pendingFile) throws IOException {
  NameNode.stateChangeLog.info("Removing lease on file " + src
      + " from client " + pendingFile.clientName);
  leaseManager.removeLease(pendingFile.clientName, src);
  // The file is no longer pending.
  // Create permanent INode, update blockmap
  INodeFile newFile = pendingFile.convertToInodeFile();
  dir.replaceNode(src, pendingFile, newFile);
  // close file and persist block allocations for this file
  dir.closeFile(src, newFile);
  checkReplicationFactor(newFile);
}

client lease pendingFile INodeFile create lease client complete INodeFile

lease management
FSNamesystem.initialize starts the lease monitor thread:
  this.lmthread = new Daemon(leaseManager.new Monitor());
  lmthread.start();
/******************************************************
 * Monitor checks for leases that have expired,
 * and disposes of them.
 ******************************************************/
class Monitor implements Runnable {
  final String name = getClass().getSimpleName();

  /** Check leases periodically. */
  public void run() {
    for (; fsnamesystem.isRunning(); ) {
      synchronized (fsnamesystem) {
        checkLeases();
      }
      try {
        Thread.sleep(2000);
      } catch (InterruptedException ie) {
        if (LOG.isDebugEnabled()) {
          LOG.debug(name + " is interrupted", ie);
        }
      }
    }
  }
}

/** Check the leases beginning from the oldest. */
synchronized void checkLeases() {
  for (; sortedLeases.size() > 0; ) {
    final Lease oldest = sortedLeases.first();
    if (!oldest.expiredHardLimit()) {
      return;
    }
    LOG.info("Lease " + oldest + " has expired hard limit");
    final List<String> removing = new ArrayList<String>();
    // need to create a copy of the oldest lease paths, because
    // internalReleaseLease() removes paths corresponding to empty files,
    // i.e. it needs to modify the collection being iterated over
    // causing ConcurrentModificationException
    String[] leasePaths = new String[oldest.getPaths().size()];
    oldest.getPaths().toArray(leasePaths);
    for (String p : leasePaths) {
      try {
        fsnamesystem.internalReleaseLeaseOne(oldest, p);
      } catch (IOException e) {
        LOG.error("Cannot release the path " + p + " in the lease " + oldest, e);
        removing.add(p);
      }
    }
    for (String p : removing) {
      removeLease(oldest, p);
    }
  }
}

fsnamesystem.internalReleaseLease(oldest, p) releases the lease on one path; removeLease(oldest, p) then removes from the lease any path whose release failed.
LeaseManager keeps three structures:
  private SortedMap<String, Lease> leases = new TreeMap<String, Lease>();             // holder->lease map
  private SortedSet<Lease> sortedLeases = new TreeSet<Lease>();                       // leases in sorted order
  private SortedMap<String, Lease> sortedLeasesByPath = new TreeMap<String, Lease>(); // paths->leases map
addLease puts a new lease into leases and sortedLeases, renews the lease if the client already holds one, and records the path in sortedLeasesByPath.

Lease recovery follows this algorithm:
 * Lease Recovery Algorithm
 * 1) Namenode retrieves lease information
 * 2) For each file f in the lease, consider the last block b of f
 * 2.1) Get the datanodes which contains b
 * 2.2) Assign one of the datanodes as the primary datanode p
 * 2.3) p obtains a new generation stamp from the namenode
 * 2.4) p get the block info from each datanode
 * 2.5) p computes the minimum block length
 * 2.6) p updates the datanodes, which have a valid generation stamp,
 *      with the new generation stamp and the minimum block length
 * 2.7) p acknowledges the namenode the update results
 * 2.8) Namenode updates the BlockInfo
 * 2.9) Namenode removes f from the lease
 *      and removes the lease once all files have been removed
 * 2.10) Namenode commit changes to edit log
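Steps 2.4 to 2.6 of the algorithm can be sketched as below. The `Replica` type is hypothetical, and the validity rule used here (a replica is valid when its generation stamp is not older than the stamp recorded for the block) is an assumption made for the illustration rather than something the listing above states.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BlockRecoverySketch {
    /** One datanode's copy of the last block (hypothetical type). */
    static final class Replica {
        final String datanode;
        long genStamp;
        long length;
        Replica(String datanode, long genStamp, long length) {
            this.datanode = datanode;
            this.genStamp = genStamp;
            this.length = length;
        }
    }

    /**
     * Keeps the replicas with a valid generation stamp, truncates them all to
     * the minimum length among them, and stamps them with the new generation.
     */
    static List<Replica> recover(List<Replica> replicas, long recordedStamp, long newStamp) {
        List<Replica> valid = new ArrayList<>();
        long minLength = Long.MAX_VALUE;
        for (Replica r : replicas) {
            if (r.genStamp >= recordedStamp) { // assumed validity rule
                valid.add(r);
                minLength = Math.min(minLength, r.length);
            }
        }
        for (Replica r : valid) {
            r.genStamp = newStamp;
            r.length = minLength;
        }
        return valid;
    }

    public static void main(String[] args) {
        List<Replica> replicas = Arrays.asList(
            new Replica("dn1", 5, 100),
            new Replica("dn2", 5, 80),
            new Replica("dn3", 4, 120)); // stale generation stamp, left out
        for (Replica r : recover(replicas, 5, 6)) {
            System.out.println(r.datanode + " gen=" + r.genStamp + " len=" + r.length);
        }
    }
}
```

Truncating to the minimum length guarantees that every surviving replica holds only bytes that all valid replicas agree on.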

Heartbeat
Hadoop RPC

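1. A Hadoop cluster is master/slave: the masters are the Namenode and the Jobtracker, and the slaves are the Datanode and the Tasktracker.
2. When a master starts, it opens an ipc server and waits for slave heartbeats.
3. When a slave starts, it connects to the master and sends it a heartbeat every 3 seconds, reporting its own status; the master uses the heartbeat's return value to pass commands back to the slave. The heartbeat.recheck.interval property controls how often the master rechecks for slaves whose heartbeats have stopped.
4. Communication between namenode and datanode, and between jobtracker and tasktracker, goes through heartbeats.
The Datanode sends its heartbeats to the Namenode from Datanode.offerService: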
/**
 * Main loop for the DataNode. Runs until shutdown,
 * forever calling remote NameNode functions.
 */
public void offerService() throws Exception {
  LOG.info("using BLOCKREPORT_INTERVAL of " + blockReportInterval + "msec"
      + " Initial delay: " + initialBlockReportDelay + "msec");
  //
  // Now loop for a long time....
  //
  while (shouldRun) {
    try {
      long startTime = now();
      //
      // Every so often, send heartbeat or block-report
      //
      if (startTime - lastHeartbeat > heartBeatInterval) {
        //
        // All heartbeat messages include following info:
        // -- Datanode name
        // -- data transfer port
        // -- Total capacity
        // -- Bytes remaining
        //
        lastHeartbeat = startTime;
        DatanodeCommand[] cmds = namenode.sendHeartbeat(dnRegistration,
            data.getCapacity(), data.getDfsUsed(), data.getRemaining(),
            xmitsInProgress.get(), getXceiverCount());
        myMetrics.addHeartBeat(now() - startTime);
        //LOG.info("Just sent heartbeat, with name " + localName);
        if (!processCommand(cmds))
          continue;
      }
    }
  } // while (shouldRun)
} // offerService
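The master side of this exchange, deciding when a slave is dead, can be sketched as follows. The class is illustrative only. The expiry formula (twice the recheck interval plus ten heartbeat intervals) is stated here as an assumption about Hadoop 1.x behaviour, and the 5-minute and 3-second figures used in `main` are assumed defaults.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class HeartbeatSketch {
    /** Assumed expiry rule: 2 * recheck interval + 10 * heartbeat interval. */
    static long expireIntervalMs(long recheckMs, long heartbeatMs) {
        return 2 * recheckMs + 10 * heartbeatMs;
    }

    private final Map<String, Long> lastHeartbeat = new TreeMap<>();
    private final long expireMs;

    HeartbeatSketch(long expireMs) { this.expireMs = expireMs; }

    /** Called for every heartbeat the master receives. */
    void heartbeat(String node, long now) {
        lastHeartbeat.put(node, now);
    }

    /** Called by the monitor thread: nodes silent for longer than the expiry interval. */
    List<String> deadNodes(long now) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (now - e.getValue() > expireMs) dead.add(e.getKey());
        }
        return dead;
    }

    public static void main(String[] args) {
        long expire = expireIntervalMs(5 * 60 * 1000L, 3 * 1000L);
        HeartbeatSketch monitor = new HeartbeatSketch(expire);
        monitor.heartbeat("dn1", 0);
        monitor.heartbeat("dn2", 0);
        monitor.heartbeat("dn2", 600_000); // dn2 keeps reporting, dn1 goes silent
        System.out.println(monitor.deadNodes(700_000));
    }
}
```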

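In Hadoop the Datanode and the Namenode are 2 separate processes running in 2 JVMs, so how can the Datanode call a method on namenode? The Datanode's namenode field is declared as public DatanodeProtocol namenode = null; and NameNode is declared as: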
public class NameNode implements ClientProtocol, DatanodeProtocol, NamenodeProtocol, FSConstants, RefreshAuthorizationPolicyProtocol, RefreshUserMappingsProtocol

NameNode implements DatanodeProtocol, and through Hadoop RPC the Datanode invokes the Namenode's sendHeartbeat(). The DataNode obtains its NameNode proxy in DataNode.startDataNode:
// connect to name node this.namenode = (DatanodeProtocol) RPC.waitForProxy(DatanodeProtocol.class, DatanodeProtocol.versionID, nameNodeAddr, conf);

namenode Namenode Datanode Namenode


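The RPC path of a Datanode-to-Namenode heartbeat is:
1) the datanode obtains a namenode proxy;
2) the datanode calls sendHeartbeat on the namenode proxy;
3) the proxy wraps the method (and its arguments) in an Invocation and passes it to client.call;
4) client.call wraps the Invocation in a Call;
5) client.call sends the Call to the namenode;
6) the namenode processes the Call and returns the DatanodeCommand[] produced by sendHeartbeat.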
/**
 * Data node notify the name node that it is alive
 * Return an array of block-oriented commands for the datanode to execute.
 * This will be either a transfer or a delete operation.
 */
public DatanodeCommand[] sendHeartbeat(DatanodeRegistration nodeReg,
    long capacity, long dfsUsed, long remaining,
    int xmitsInProgress, int xceiverCount) throws IOException {
  verifyRequest(nodeReg);
  return namesystem.handleHeartbeat(nodeReg, capacity, dfsUsed, remaining,
      xceiverCount, xmitsInProgress);
}

DataNode NameNode DatanodeRegistration DatanodeCommand DatanodeCommand


DatanodeProtocol DatanodeCommand
/**
 * Determines actions that data node should perform
 * when receiving a datanode command.
 */
final static int DNA_UNKNOWN = 0;                  // unknown action
final static int DNA_TRANSFER = 1;                 // transfer blocks to another datanode
final static int DNA_INVALIDATE = 2;               // invalidate blocks
final static int DNA_SHUTDOWN = 3;                 // shutdown node
final static int DNA_REGISTER = 4;                 // re-register
final static int DNA_FINALIZE = 5;                 // finalize previous upgrade
final static int DNA_RECOVERBLOCK = 6;             // request a block recovery
final static int DNA_ACCESSKEYUPDATE = 7;          // update access key
final static int DNA_BALANCERBANDWIDTHUPDATE = 8;  // update balancer bandwidth

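FSNamesystem.handleHeartbeat proceeds as follows:
1. getDatanode looks up the DatanodeDescriptor nodeinfo. If the NameNode does not recognize the StorageID, handleHeartbeat returns DatanodeCommand.REGISTER so that the DataNode registers again.
2. If the node should be shut down (isDecommissioned), it throws DisallowedDatanodeException.
3. If nodeinfo is null or not alive, it returns DatanodeCommand.REGISTER so that the DataNode registers again.
4. It updates capacityTotal, capacityUsed, capacityRemaining, and totalLoad.
5. It assembles the DatanodeCommand array to return.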


/**
 * The given node has reported in. This method should:
 * 1) Record the heartbeat, so the datanode isn't timed out
 * 2) Adjust usage stats for future block allocation
 *
 * If a substantial amount of time passed since the last datanode
 * heartbeat then request an immediate block report.
 *
 * @return an array of datanode commands
 * @throws IOException
 */
DatanodeCommand[] handleHeartbeat(DatanodeRegistration nodeReg,
    long capacity, long dfsUsed, long remaining,
    int xceiverCount, int xmitsInProgress) throws IOException {
  DatanodeCommand cmd = null;
  synchronized (heartbeats) {
    synchronized (datanodeMap) {
      DatanodeDescriptor nodeinfo = null;
      try {
        nodeinfo = getDatanode(nodeReg);
      } catch (UnregisteredDatanodeException e) {
        return new DatanodeCommand[]{DatanodeCommand.REGISTER};
      }
      // Check if this datanode should actually be shutdown instead.
      if (nodeinfo != null && shouldNodeShutdown(nodeinfo)) {
        setDatanodeDead(nodeinfo);
        throw new DisallowedDatanodeException(nodeinfo);
      }
      if (nodeinfo == null || !nodeinfo.isAlive) {
        return new DatanodeCommand[]{DatanodeCommand.REGISTER};
      }
      updateStats(nodeinfo, false);
      nodeinfo.updateHeartbeat(capacity, dfsUsed, remaining, xceiverCount);
      updateStats(nodeinfo, true);
      //check lease recovery
      cmd = nodeinfo.getLeaseRecoveryCommand(Integer.MAX_VALUE);
      if (cmd != null) {
        return new DatanodeCommand[] {cmd};
      }
      ArrayList<DatanodeCommand> cmds = new ArrayList<DatanodeCommand>();
      //check pending replication
      cmd = nodeinfo.getReplicationCommand(maxReplicationStreams - xmitsInProgress);
      if (cmd != null) {
        cmds.add(cmd);
      }
      //check block invalidation
      cmd = nodeinfo.getInvalidateBlocks(blockInvalidateLimit);
      if (cmd != null) {
        cmds.add(cmd);
      }
      // check access key update
      if (isAccessTokenEnabled && nodeinfo.needKeyUpdate) {
        cmds.add(new KeyUpdateCommand(accessTokenHandler.exportKeys()));
        nodeinfo.needKeyUpdate = false;
      }
      // check for balancer bandwidth update
      if (nodeinfo.getBalancerBandwidth() > 0) {
        cmds.add(new BalancerBandwidthCommand(nodeinfo.getBalancerBandwidth()));
        // set back to 0 to indicate that datanode has been sent the new value
        nodeinfo.setBalancerBandwidth(0);
      }
      if (!cmds.isEmpty()) {
        return cmds.toArray(new DatanodeCommand[cmds.size()]);
      }
    }
  }
  //check distributed upgrade
  cmd = getDistributedUpgradeCommand();
  if (cmd != null) {
    return new DatanodeCommand[] {cmd};
  }
  return null;
}
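The order in which handleHeartbeat picks its reply can be condensed into a small decision function. The sketch below is a simplification with hypothetical names: each pending condition is reduced to a boolean, only four of the actions are modelled, and the empty list stands for the case where the real method falls through to the upgrade check.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class HeartbeatReplySketch {
    enum Action { REGISTER, RECOVERBLOCK, TRANSFER, INVALIDATE }

    /**
     * Mirrors the priority order of the listing above:
     * unknown or dead node -> REGISTER only;
     * pending lease recovery -> RECOVERBLOCK only;
     * otherwise replication and invalidation work are batched together.
     */
    static List<Action> reply(boolean registered, boolean alive,
                              boolean pendingRecovery,
                              boolean pendingReplication,
                              boolean pendingInvalidation) {
        if (!registered || !alive) {
            return Collections.singletonList(Action.REGISTER);
        }
        if (pendingRecovery) {
            return Collections.singletonList(Action.RECOVERBLOCK);
        }
        List<Action> cmds = new ArrayList<>();
        if (pendingReplication) cmds.add(Action.TRANSFER);
        if (pendingInvalidation) cmds.add(Action.INVALIDATE);
        return cmds;
    }

    public static void main(String[] args) {
        System.out.println(reply(true, true, false, true, true));
        System.out.println(reply(false, true, true, true, true));
    }
}
```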

10

HDFS
HDFS HDFS

HDFS
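// Copy a local file to HDFS through the FileSystem API.
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class FileCopyWithProgress {
  public static void main(String[] args) throws Exception {
    String localSrc = args[0];
    String dst = args[1];
    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
    Configuration conf = new Configuration();
    // Get the FileSystem instance for HDFS.
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    OutputStream out = fs.create(new Path(dst), new Progressable() {
      // Progress callback.
      public void progress() {
        System.out.print(".");
      }
    });
    IOUtils.copyBytes(in, out, 4096, true);
  }
}

hadoop FileCopyWithProgress input/1.txt hdfs://localhost/user/hadoop/1.txt

// Read a file through the FileSystem API.
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    // Get the FileSystem instance for HDFS.
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}

hadoop FileSystemCat hdfs://localhost/user/tom/quangle.txt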

HDFS FileSystem FileSystem

FileSystem

FileSystem CACHE: cache cache statisticsTable: key: CACHE statistics: deleteOnExit: Java clientFinalizer: FileSystem FileSystem getFileBlockLocations: exists: isFile: getContentSummary: listStatus: globStatus: Linux

getHomeDirectory: get/set*etWorkingDirectory: copyFromLocalFile: copyToLocalFile: moveFromLocalFile: moveToLocalFile: getFileStatus: setPermission: setOwner: setTimes: getAllStatistics: getStatistics: get: URI FileSystem createFileSystem createFileSystem: URI scheme scheme FileSystem Hadoop HDFS
FileSystem fs = FileSystem.get(URI.create(dst), conf);

uri

FileSystem.Cache
HashMapMap

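Key is the cache key under which a FileSystem instance is stored. A Key has three parts:
scheme: the URI scheme; for the URI http://server/index.html the scheme is http.
authority: the URI authority; in the example above the authority is server.
ugi: the user information, an org.apache.hadoop.security.UserGroupInformation.
Two Keys are equal when scheme, authority, and ugi are all equal.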
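A cache keyed this way can be sketched with standard Java collections. This is a simplified model rather than the FileSystem.Cache source: the user is a plain String standing in for the UserGroupInformation, the cached value is an opaque Object, and lowercasing the scheme and authority is an assumption about how keys are normalized.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class FsCacheSketch {
    /** Cache key: scheme + authority + user; the path plays no part. */
    static final class Key {
        final String scheme;
        final String authority;
        final String user;

        Key(URI uri, String user) {
            this.scheme = uri.getScheme() == null ? "" : uri.getScheme().toLowerCase();
            this.authority = uri.getAuthority() == null ? "" : uri.getAuthority().toLowerCase();
            this.user = user;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return scheme.equals(k.scheme) && authority.equals(k.authority)
                && Objects.equals(user, k.user);
        }

        @Override
        public int hashCode() {
            return Objects.hash(scheme, authority, user);
        }
    }

    private final Map<Key, Object> cache = new HashMap<>();

    /** Returns the cached instance for the key, creating it on first use. */
    synchronized Object get(URI uri, String user) {
        return cache.computeIfAbsent(new Key(uri, user), k -> new Object());
    }

    public static void main(String[] args) {
        FsCacheSketch c = new FsCacheSketch();
        Object a = c.get(URI.create("hdfs://namenode:8020/user/a"), "tom");
        Object b = c.get(URI.create("hdfs://namenode:8020/user/b"), "tom");
        System.out.println(a == b);
    }
}
```

Because the path is not part of the key, every file on the same cluster opened by the same user shares one client instance.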

FileSystem.Statistics
scheme: HDFS hdfs bytesRead: AtomicLong bytesWritten: AtomicLong
private final String scheme;
private AtomicLong bytesRead = new AtomicLong();
private AtomicLong bytesWritten = new AtomicLong();
private AtomicInteger readOps = new AtomicInteger();
private AtomicInteger largeReadOps = new AtomicInteger();
private AtomicInteger writeOps = new AtomicInteger();

Path
/ URI

BlockLocation
HDFS Block
private String[] hosts;          // hostnames of datanodes
private String[] names;          // hostname:portNumber of datanodes
private String[] topologyPaths;  // full path name in network topology
private long offset;             // offset of the block in the file
private long length;             // length of the block

FileStatus
/ UnixLinux
private Path path;
private long length;
private boolean isdir;
private short block_replication;
private long blocksize;
private long modification_time;
private long access_time;
private FsPermission permission;
private String owner;
private String group;

FsPermission
/ POSIX

FSDataOutputStream
DataOutputStream Syncable sync FSDataOutputStream FileSystem

FSDataInputStream
DataInputStream Seekable PositionReadable seek FSDataInputStream FileSystem FileSystem FileSystem FileSystem.get
/** Returns the FileSystem for this URI's scheme and authority. The scheme
 * of the URI determines a configuration property name,
 * <tt>fs.<i>scheme</i>.class</tt> whose value names the FileSystem class.
 * The entire URI is passed to the FileSystem instance's initialize method.
 */
public static FileSystem get(URI uri, Configuration conf) throws IOException {
  String scheme = uri.getScheme();
  String authority = uri.getAuthority();
  if (scheme == null) {                       // no scheme: use default FS
    return get(conf);
  }
  if (authority == null) {                    // no authority
    URI defaultUri = getDefaultUri(conf);
    if (scheme.equals(defaultUri.getScheme()) // if scheme matches default
        && defaultUri.getAuthority() != null) { // & default has authority
      return get(defaultUri, conf);           // return default
    }
  }
  String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme);
  if (conf.getBoolean(disableCacheName, false)) {
    return createFileSystem(uri, conf);
  }
  return CACHE.get(uri, conf);
}

FileSystem

private static FileSystem createFileSystem(URI uri, Configuration conf)
    throws IOException {
  Class<?> clazz = conf.getClass("fs." + uri.getScheme() + ".impl", null);
  LOG.debug("Creating filesystem for " + uri);
  if (clazz == null) {
    throw new IOException("No FileSystem for scheme: " + uri.getScheme());
  }
  FileSystem fs = (FileSystem) ReflectionUtils.newInstance(clazz, conf);
  fs.initialize(uri, conf);
  return fs;
}
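The same scheme-to-class lookup can be shown without Hadoop on the classpath. In the sketch below the `Fs` interface, the `LocalFs` class, and the plain `Map` used as configuration are all hypothetical stand-ins; only the `fs.<scheme>.impl` naming convention is taken from the listing above.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class FsFactorySketch {
    /** Hypothetical stand-in for the FileSystem base class. */
    interface Fs {
        String name();
    }

    /** Hypothetical implementation registered for the "file" scheme. */
    public static final class LocalFs implements Fs {
        public LocalFs() { }
        public String name() { return "local"; }
    }

    /** Looks up "fs.<scheme>.impl" in the configuration and instantiates it by reflection. */
    static Fs create(URI uri, Map<String, String> conf) {
        String className = conf.get("fs." + uri.getScheme() + ".impl");
        if (className == null) {
            throw new IllegalArgumentException("No FileSystem for scheme: " + uri.getScheme());
        }
        try {
            return (Fs) Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("fs.file.impl", LocalFs.class.getName());
        System.out.println(create(URI.create("file:///tmp/x"), conf).name());
    }
}
```

Resolving the class from configuration is what allows a new file system to be plugged in without changing client code.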

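createFileSystem chooses the FileSystem class from the URI Scheme. For the hdfs scheme, the property fs.hdfs.impl names org.apache.hadoop.hdfs.DistributedFileSystem, and an instance is created through Java reflection. DistributedFileSystem (DFS) is the FileSystem implementation for HDFS. It delegates its work to a DFSClient, so user code sees only the Hadoop FileSystem interface while DistributedFileSystem and DFSClient carry out the HDFS operations.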


HDFS org.apache.hadoop.fs DistributedFileSystem DFSClien DFSClient hdfs-default.xml hdfs-site.xml namenode HDFS uri namenode checkPath schemeport authority

makeQualified getPathName DFSClient Hadoop ClientProtocol NameNode Socket DataNode /Hadoop DistributedFileSystem DFSClient DFSClient DFSClient

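DFSClient members:
MAX_BLOCK_ACQUIRE_FAILURES: 3.
TCP_WINDOW_SIZE: the TCP window size, 128KB. It is used by seek: when the target lies ahead of the current position by no more than TCP_WINDOW_SIZE, the data is probably already in the TCP buffer, so seek skips forward by reading.
rpcNamenode: the RPC proxy to the namenode.
namenode: a wrapper around rpcNamenode that adds Retry behaviour according to a RetryPolicy.
leasechecker: the lease renewal thread.
defaultBlockSize: the default block size, 64MB.
defaultReplication: the default replication factor, 3.
socketTimeout: the socket timeout, 60 seconds.
datanodeWriteTimeout: the datanode write timeout, 480 seconds.
writePacketSize: the size of a write packet, 64KB.
maxBlockAcquireFailures: the maximum number of failures when acquiring a block from the datanodes, 3.

DFSClient reads these configuration properties:
dfs.socket.timeout
dfs.datanode.socket.write.timeout
dfs.write.packet.size: the packet size.
dfs.client.max.block.acquire.failures
mapred.task.id: the ID of the map or reduce task. If it is set, clientName is DFSClient_ followed by the task ID; otherwise clientName is DFSClient_ followed by a random number.
dfs.block.size
dfs.replication

Most DFSClient methods are RPC calls to the namenode. The others include:
checkOpen: checks clientRunning.
getBlockLocations: asks the namenode for the LocatedBlocks of a file; each LocatedBlocks entry names the datanodes holding a block, and the result is converted to a BlockLocation array.
getFileChecksum: obtains the checksum of a file. The client asks a datanode for the checksum of each block, the datanode returns an MD5 of the block's checksums, and the client combines the per-block MD5 values into an MD5 for the file.
bestNode: picks a datanode that is not in deadNodes.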

11

MapReduce
MapReduce Google 1TB

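A job is split into Map tasks and Reduce tasks. Running a MapReduce job involves four independent entities:
1. Client: submits the MapReduce job.
2. JobTracker: coordinates the run of the job.
3. TaskTracker: runs the Map and Reduce tasks the job has been split into.
4. Shared FileSystem (such as HDFS): used to share job files between the other entities.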

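1. Job submission. Job.waitForCompletion(true) submits the job. The submission asks the jobtracker for a new job ID (step 2), computes the InputSplits, and copies the resources needed to run the job (the job JAR, the configuration, and the computed splits) to a directory named after the job ID in the jobtracker's file system. The JAR is copied with a high replication factor, controlled by mapred.submit.replication (default 10) (step 3). The jobtracker is then told that the job is ready to run (step 4).
2. Job initialization. The JobTracker places the job in an internal queue, from which the job scheduler picks it up and initializes it (step 5). To build the list of tasks, the scheduler retrieves the InputSplits computed for the Job (step 6) and creates one map task per InputSplit.
3. Task assignment. Each TaskTracker periodically sends a heartbeat to the JobTracker. The heartbeat tells the jobtracker that the tasktracker is alive and also serves as the message channel between them (step 7): when a tasktracker reports that it is ready for a new task, the jobtracker assigns it one in the heartbeat's return value. The job to take the task from is chosen by the scheduler; Hadoop MapReduce defaults to a FIFO scheduler, and the Fair Scheduler and Capacity Scheduler are also available. Each tasktracker has a fixed number of slots for map tasks and for reduce tasks.
4. Task execution. The tasktracker copies the job JAR from the shared file system to its local file system (step 8), creates a local working directory for the task, unpacks the JAR into it, and creates a TaskRunner. The TaskRunner launches a new child JVM to run each task (step 9).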


Hadoop wordcount
hadoop jar hadoop-examples-1.0.4.jar wordcount /usr/input /usr/output

JobTracker Map M1 M2 M3 Reduce R1 R2 Map Reduce TaskTracker TaskTracker Java HDFS InputFormat ASCII JDBC InputFormat InputSplit splite1 splite5 InputFormat RecordReader <k,v><k,v> map map context.collect OutputCollector. collect context Mapper Partitioner Mapper Combiner Mapper

<k,v> list key list Combiner Partitioner M1 Combiner Partitioner Map Reduce 3 Shuffle sort reduce Hadoop MapReduce Map key Reducer Mapper key key Reducer HTTP Mapper key <key,value> Reduce Shuffle sort <key, (list of values)> Reducer. reduce OutputFormat DFS
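The data flow described above (map, partition, shuffle and sort, reduce) can be imitated in memory for a word count. The sketch below is not the Hadoop API: every split is a single line of text, the partition function is modelled on a hash partitioner, the combiner is omitted, and all class and method names are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class MiniMapReduce {
    /** Map phase: emit (word, 1) for every token of the input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return out;
    }

    /** Partition phase: equal keys always go to the same reducer. */
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Reduce phase: sum all values collected for one key. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    /** Runs every phase and returns one sorted output per reducer. */
    static List<TreeMap<String, Integer>> run(List<String> splits, int numReducers) {
        // Shuffle and sort: each reducer gets a key-sorted table of key -> list of values.
        List<TreeMap<String, List<Integer>>> shuffled = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) shuffled.add(new TreeMap<>());
        for (String split : splits) {
            for (Map.Entry<String, Integer> kv : map(split)) {
                shuffled.get(partition(kv.getKey(), numReducers))
                        .computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                        .add(kv.getValue());
            }
        }
        List<TreeMap<String, Integer>> results = new ArrayList<>();
        for (TreeMap<String, List<Integer>> table : shuffled) {
            TreeMap<String, Integer> out = new TreeMap<>();
            for (Map.Entry<String, List<Integer>> e : table.entrySet()) {
                out.put(e.getKey(), reduce(e.getKey(), e.getValue()));
            }
            results.add(out);
        }
        return results;
    }

    public static void main(String[] args) {
        List<TreeMap<String, Integer>> parts = run(Arrays.asList("a b a", "b c"), 2);
        for (int i = 0; i < parts.size(); i++) {
            System.out.println("reducer " + i + ": " + parts.get(i));
        }
    }
}
```

Each reducer's output is sorted by key, but no ordering holds across reducers, which matches the one-output-file-per-reducer layout of a real job.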

12

1. jobtracker ID2.

MapReduce 3.4. ID jobtracker 5. jobtracker jobtracker Hadoop wordcount


hadoop jar hadoop-examples-1.0.4.jar wordcount /usr/input /usr/output

The command is handled by the $HADOOP_HOME/bin/hadoop script:
elif [ "$COMMAND" = "jar" ] ; then
  CLASS=org.apache.hadoop.util.RunJar
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"

# run it
exec "$JAVA" -Dproc_$COMMAND $JAVA_HEAP_MAX $HADOOP_OPTS -classpath "$CLASSPATH" $CLASS "$@"

org.apache.hadoop.util.RunJar main RunJar jar hadoop-examples-1.0.4.jar WordCountWordCount


public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    /**
     * Called once per input record. The value is one line of text; it is
     * split with a StringTokenizer and (word, 1) is written to the context
     * for every token.
     */
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    /**
     * Called once per key with all the values the map (and combiner) output
     * holds for that key; writes (key, sum) to the context.
     */
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    // Configuration holds the map/reduce job settings.
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");       // create the job
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);   // Mapper for the job
    job.setCombinerClass(IntSumReducer.class);   // Combiner for the job
    job.setReducerClass(IntSumReducer.class);    // Reducer for the job
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output paths of the map-reduce job.
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1); // run the job
  }
}

main calls waitForCompletion(true) on org.apache.hadoop.mapreduce.Job:


public class Job extends JobContext {
  public static enum JobState {DEFINE, RUNNING}; // job states
  private JobState state = JobState.DEFINE;      // a new job starts in DEFINE
  private JobClient jobClient;

  /**
   * Submit the job to the cluster and wait for it to finish.
   * @param verbose print the progress to the user
   * @return true if the job succeeded
   * @throws IOException thrown if the communication with the
   *         <code>JobTracker</code> is lost
   */
  public boolean waitForCompletion(boolean verbose)
      throws IOException, InterruptedException, ClassNotFoundException {
    if (state == JobState.DEFINE) { // submit only while the job is still in DEFINE
      submit();
    }
    if (verbose) {
      jobClient.monitorAndPrintJob(conf, info); // print progress while waiting
    } else {
      info.waitForCompletion();                 // wait silently
    }
    return isSuccessful();
  }
}

waitForCompletion(true) submits the MapReduce job by calling submit():

/**
 * Submit the job to the cluster and return immediately.
 * @throws IOException
 */
public void submit() throws IOException, InterruptedException,
                            ClassNotFoundException {
  ensureState(JobState.DEFINE);   // the job must still be in the DEFINE state
  setUseNewAPI();                 // decide whether the new or the old API is in use

  // Connect to the JobTracker and submit the job
  connect();                                  // create the JobClient that talks to the JobTracker
  info = jobClient.submitJobInternal(conf);   // submit the job
  super.setJobID(info.getID());               // record the job ID
  state = JobState.RUNNING;                   // the job is now RUNNING
}
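The state guard in submit() can be looked at in isolation. The following is a minimal sketch, not Hadoop code: the class name JobStateSketch and the submissions counter are hypothetical, and the counter only stands in for the call to jobClient.submitJobInternal(conf). What it illustrates is that waitForCompletion() submits at most once, and that a second explicit submit() is rejected because the job has left the DEFINE state.

```java
/** Hypothetical stand-in for Job; only the state handling is modelled. */
final class JobStateSketch {
    enum JobState { DEFINE, RUNNING }

    private JobState state = JobState.DEFINE;
    private int submissions = 0;

    /** Fails unless the job is in the expected state. */
    private void ensureState(JobState expected) {
        if (state != expected) {
            throw new IllegalStateException(
                "Job in state " + state + " instead of " + expected);
        }
    }

    void submit() {
        ensureState(JobState.DEFINE);
        submissions++;               // placeholder for the real submission call
        state = JobState.RUNNING;
    }

    void waitForCompletion() {
        if (state == JobState.DEFINE) {
            submit();                // submit only if not yet submitted
        }
        // The real method would now wait for the job to finish.
    }

    JobState state() { return state; }

    int submissions() { return submissions; }

    public static void main(String[] args) {
        JobStateSketch job = new JobStateSketch();
        job.waitForCompletion();
        job.waitForCompletion();     // second call does not resubmit
        System.out.println(job.state() + ", submissions=" + job.submissions());
    }
}
```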

connect() creates a JobClient. JobClient has a member private JobSubmissionProtocol jobSubmitClient;. Just as a DataNode communicates with the namenode over RPC, the JobClient communicates with the JobTracker through the RPC proxy jobSubmitClient. The call chain is: connect() --> (JobContext member: UserGroupInformation) ugi.doAs() --> PrivilegedExceptionAction.run() --> new JobClient((JobConf) getConfiguration()) --> JobClient.init(JobConf conf) --> this.jobSubmitClient = createRPCProxy(JobTracker.getAddress(conf), conf); --> (JobSubmissionProtocol) RPC.getProxy();
Job.connect()
  |-->UserGroupInformation.doAs()
        |-->PrivilegedExceptionAction<Object>.run()
              |-->JobClient.JobClient(JobConf conf)
                    |-->JobClient.setConf(Configuration conf)
                    |-->JobClient.init(JobConf conf)
                          |-->JobClient.createRPCProxy(InetSocketAddress addr, Configuration conf)
                                |-->(JobSubmissionProtocol) RPC.getProxy()

submit() then calls jobClient.submitJobInternal(conf):
/** * Internal method for submitting jobs to the system. * @param job the configuration to submit * @return a proxy object for the running job * @throws FileNotFoundException * @throws ClassNotFoundException * @throws InterruptedException * @throws IOException */ public RunningJob submitJobInternal(final JobConf job ) throws FileNotFoundException, ClassNotFoundException, InterruptedException, IOException { /* * configure the command line options correctly on the submitting dfs */ return ugi.doAs(new PrivilegedExceptionAction<RunningJob>() { public RunningJob run() throws FileNotFoundException, ClassNotFoundException, InterruptedException, IOException{ JobConf jobCopy = job; Path jobStagingArea = JobSubmissionFiles.getStagingDir(JobClient.this, jobCopy); // jobtracker ID JobID jobId = jobSubmitClient.getNewJobId(); Path submitJobDir = new Path(jobStagingArea, jobId.toString()); jobCopy.set("mapreduce.job.dir", submitJobDir.toString());

JobStatus status = null; try { populateTokenCache(jobCopy, jobCopy.getCredentials()); copyAndConfigureFiles(jobCopy, submitJobDir); // get delegation token for the dir TokenCache.obtainTokensForNamenodes(jobCopy.getCredentials(), new Path [] {submitJobDir}, jobCopy); Path submitJobFile = JobSubmissionFiles.getJobConfPath(submitJobDir); int reduces = jobCopy.getNumReduceTasks(); InetAddress ip = InetAddress.getLocalHost(); if (ip != null) { job.setJobSubmitHostAddress(ip.getHostAddress()); job.setJobSubmitHostName(ip.getHostName()); } JobContext context = new JobContext(jobCopy, jobId); // Check the output specification // // MapReduce if (reduces == 0 ? jobCopy.getUseNewMapper() : jobCopy.getUseNewReducer()) { org.apache.hadoop.mapreduce.OutputFormat<?,?> output = ReflectionUtils.newInstance(context.getOutputFormatClass(), jobCopy); output.checkOutputSpecs(context); } else { jobCopy.getOutputFormat().checkOutputSpecs(fs, jobCopy); } jobCopy = (JobConf)context.getConfiguration(); // Create the splits for the job FileSystem fs = submitJobDir.getFileSystem(jobCopy); LOG.debug("Creating splits at " + fs.makeQualified(submitJobDir)); int maps = writeSplits(context, submitJobDir); jobCopy.setNumMapTasks(maps); // write "queue admins of the queue to which job is being submitted" // to job file. String queue = jobCopy.getQueueName();

AccessControlList acl = jobSubmitClient.getQueueAdmins(queue); jobCopy.set(QueueManager.toFullPropertyName(queue, QueueACL.ADMINISTER_JOBS.getAclName()), acl.getACLString()); // Write job file to JobTracker's fs // ID jobtracker FSDataOutputStream out = FileSystem.create(fs, submitJobFile, new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION)); try { jobCopy.writeXml(out); } finally { out.close(); } // // Now, actually submit the job (using the submit name) // jobtracker printTokens(jobId, jobCopy.getCredentials()); status = jobSubmitClient.submitJob( jobId, submitJobDir.toString(), jobCopy.getCredentials()); JobProfile prof = jobSubmitClient.getJobProfile(jobId); if (status != null && prof != null) { return new NetworkedJob(status, prof, jobSubmitClient); } else { throw new IOException("Could not launch job"); } } finally { if (status == null) { LOG.info("Cleaning up the staging area " + submitJobDir); if (fs != null && submitJobDir != null) fs.delete(submitJobDir, true); } } } }); }

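submitJobInternal() first asks the jobtracker for a new job ID by calling getNewJobId() on the JobTracker. It then checks the output specification of the job by calling checkOutputSpecs() on the OutputFormat.

Depending on whether the job uses the new or the old API, the OutputFormat is an org.apache.hadoop.mapreduce.OutputFormat or an org.apache.hadoop.mapred.OutputFormat.

After computing the input splits, it writes the job file into a directory named after the job ID on the jobtracker's FileSystem, and finally tells the jobtracker that the job is ready to run: jobSubmitClient.submitJob(jobId, submitJobDir.toString(), jobCopy.getCredentials());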


/** * JobTracker.submitJob() kicks off a new job. * * Create a 'JobInProgress' object, which contains both JobProfile * and JobStatus. Those two sub-objects are sometimes shipped outside

* of the JobTracker. But JobInProgress adds info that's useful for * the JobTracker alone. */ public JobStatus submitJob(JobID jobId, String jobSubmitDir, Credentials ts) throws IOException { JobInfo jobInfo = null; UserGroupInformation ugi = UserGroupInformation.getCurrentUser(); synchronized (this) { if (jobs.containsKey(jobId)) { // job already running, don't start twice return jobs.get(jobId).getStatus(); } jobInfo = new JobInfo(jobId, new Text(ugi.getShortUserName()), new Path(jobSubmitDir)); } // Create the JobInProgress, do not lock the JobTracker since // we are about to copy job.xml from HDFS JobInProgress job = null; try { job = new JobInProgress(this, this.conf, jobInfo, 0, ts); } catch (Exception e) { throw new IOException(e); } synchronized (this) { // check if queue is RUNNING String queue = job.getProfile().getQueueName(); if (!queueManager.isRunning(queue)) { throw new IOException("Queue \"" + queue + "\" is not running"); } try { aclsManager.checkAccess(job, ugi, Operation.SUBMIT_JOB); } catch (IOException ioe) { LOG.warn("Access denied for user " + job.getJobConf().getUser() + ". Ignoring job " + jobId, ioe); job.fail(); throw ioe; } // Check the job if it cannot run in the cluster because of invalid memory // requirements. try { checkMemoryRequirements(job);

} catch (IOException ioe) { throw ioe; } boolean recovered = true; // TODO: Once the Job recovery code is there, // (MAPREDUCE-873) we // must pass the "recovered" flag accurately. // This is handled in the trunk/0.22 if (!recovered) { // Store the information in a file so that the job can be recovered // later (if at all) Path jobDir = getSystemDirectoryForJob(jobId); FileSystem.mkdirs(fs, jobDir, new FsPermission(SYSTEM_DIR_PERMISSION)); FSDataOutputStream out = fs.create(getSystemFileForJob(jobId)); jobInfo.write(out); out.close(); } // Submit the job JobStatus status; try { status = addJob(jobId, job); } catch (IOException ioe) { LOG.info("Job " + jobId + " submission failed!", ioe); status = job.getStatus(); status.setFailureInfo(StringUtils.stringifyException(ioe)); failJob(job); throw ioe; } return status; } }

This completes the submission of the job to the jobtracker.

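13 JobTracker job initialization

When the JobTracker receives a job, it puts it in an internal queue, from which the job scheduler picks it up and initializes it.

Initializing a Job means reading its InputSplit information and creating one map task per InputSplit. Before looking at that, consider how the JobTracker itself is started. The JobTracker is started by start-mapred.sh (start-all.sh calls start-mapred.sh), which starts the MapReduce daemons:
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start jobtracker
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start tasktracker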

hadoop-daemon.sh in turn runs the HADOOP_HOME/bin/hadoop script:
(start)
    mkdir -p "$HADOOP_PID_DIR"
    if [ -f $pid ]; then
      if kill -0 `cat $pid` > /dev/null 2>&1; then
        echo $command running as process `cat $pid`.  Stop it first.
        exit 1
      fi
    fi

    if [ "$HADOOP_MASTER" != "" ]; then
      echo rsync from $HADOOP_MASTER
      rsync -a -e ssh --delete --exclude=.svn --exclude='logs/*' --exclude='contrib/hod/logs/*' $HADOOP_MASTER/ "$HADOOP_HOME"
    fi

    hadoop_rotate_log $log
    echo starting $command, logging to $log
    cd "$HADOOP_PREFIX"
    nohup nice -n $HADOOP_NICENESS "$HADOOP_PREFIX"/bin/hadoop --config $HADOOP_CONF_DIR $command "$@" > "$log" 2>&1 < /dev/null &
    echo $! > $pid
    sleep 1; head "$log"
    ;;

The relevant part of HADOOP_HOME/bin/hadoop is:


elif [ "$COMMAND" = "jobtracker" ] ; then
  CLASS=org.apache.hadoop.mapred.JobTracker
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_JOBTRACKER_OPTS"

# run it
exec "$JAVA" -Dproc_$COMMAND $JAVA_HEAP_MAX $HADOOP_OPTS -classpath "$CLASSPATH" $CLASS "$@"

The main() method of org.apache.hadoop.mapred.JobTracker calls startTracker() to create the JobTracker, then offerService() to start the MapReduce services:


JobTracker tracker = startTracker(new JobConf());
tracker.offerService();

startTracker() constructs the JobTracker and registers it with the task scheduler as its TaskTrackerManager:
result = new JobTracker(conf, identifier);
result.taskScheduler.setTaskTrackerManager(result);

The JobTracker constructor, JobTracker(final JobConf conf, String identifier, Clock clock, QueueManager qm) throws IOException, InterruptedException, creates the scheduler:
// Create the scheduler
Class<? extends TaskScheduler> schedulerClass
  = conf.getClass("mapred.jobtracker.taskScheduler",
                  JobQueueTaskScheduler.class, TaskScheduler.class);
taskScheduler = (TaskScheduler) ReflectionUtils.newInstance(schedulerClass, conf);
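The pattern used here, a class name read from configuration with a default and instantiated by reflection, can be sketched with the JDK alone. Everything in this example is hypothetical: Scheduler, FifoScheduler, CapacityScheduler and the key scheduler.class merely stand in for TaskScheduler, JobQueueTaskScheduler, CapacityTaskScheduler and mapred.jobtracker.taskScheduler, and java.util.Properties stands in for JobConf.

```java
import java.util.Properties;

final class SchedulerLoaderSketch {
    /** Hypothetical stand-in for TaskScheduler. */
    interface Scheduler { String name(); }

    static final class FifoScheduler implements Scheduler {
        @Override public String name() { return "fifo"; }
    }

    static final class CapacityScheduler implements Scheduler {
        @Override public String name() { return "capacity"; }
    }

    /** Instantiates the class named under key, or defaultClass if the key is unset. */
    static Scheduler load(Properties conf, String key,
                          Class<? extends Scheduler> defaultClass) {
        String className = conf.getProperty(key);
        try {
            Class<? extends Scheduler> cls = (className == null)
                ? defaultClass
                : Class.forName(className).asSubclass(Scheduler.class);
            return cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate scheduler " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // No key set: the default class is used.
        System.out.println(load(conf, "scheduler.class", FifoScheduler.class).name());
        // Key set: the configured class replaces the default.
        conf.setProperty("scheduler.class", CapacityScheduler.class.getName());
        System.out.println(load(conf, "scheduler.class", FifoScheduler.class).name());
    }
}
```

Because the class is resolved at startup from a string, swapping the scheduler requires only a configuration change and a restart, not a rebuild.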

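The scheduler is the Hadoop component that decides which job's tasks run next. Hadoop's default is the FIFO Scheduler, which schedules the tasks of jobs in the order in which the jobs were submitted. Two alternatives are the Capacity Scheduler and the Fair Scheduler. To replace the FIFO Scheduler with the Capacity Scheduler or the Fair Scheduler, name the scheduler class in mapred-site.xml, using the same key that the constructor reads:
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>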

offerService() starts the MapReduce services:
/** * Run forever */ public void offerService() throws InterruptedException, IOException { // Prepare for recovery. This is done irrespective of the status of restart // flag. while (true) { try { recoveryManager.updateRestartCount(); break; } catch (IOException ioe) { LOG.warn("Failed to initialize recovery manager. ", ioe); // wait for some time Thread.sleep(FS_ACCESS_RETRY_PERIOD); LOG.warn("Retrying..."); } } taskScheduler.start(); // Start the recovery after starting the scheduler try { recoveryManager.recover();

} catch (Throwable t) { LOG.warn("Recovery manager crashed! Ignoring.", t); } // refresh the node list as the recovery manager might have added // disallowed trackers refreshHosts(); this.expireTrackersThread = new Thread(this.expireTrackers, "expireTrackers"); this.expireTrackersThread.start(); this.retireJobsThread = new Thread(this.retireJobs, "retireJobs"); this.retireJobsThread.start(); expireLaunchingTaskThread.start(); if (completedJobStatusStore.isActive()) { completedJobsStoreThread = new Thread(completedJobStatusStore, "completedjobsStore-housekeeper"); completedJobsStoreThread.start(); } // start the inter-tracker server once the jt is ready this.interTrackerServer.start(); synchronized (this) { state = State.RUNNING; } LOG.info("Starting RUNNING"); this.interTrackerServer.join(); LOG.info("Stopped interTrackerServer"); }

taskScheduler.start(); is, by default, the start() method of org.apache.hadoop.mapred.JobQueueTaskScheduler. It registers two Listeners with the taskTrackerManager (that is, the JobTracker): EagerTaskInitializationListener and JobQueueJobInProgressListener.


@Override
public synchronized void start() throws IOException {
  super.start();
  taskTrackerManager.addJobInProgressListener(jobQueueJobInProgressListener);
  eagerTaskInitializationListener.setTaskTrackerManager(taskTrackerManager);
  eagerTaskInitializationListener.start();
  taskTrackerManager.addJobInProgressListener(
      eagerTaskInitializationListener);
}

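Recall that submitJob() on the jobtracker ends with status = addJob(jobId, job);. addJob() stores the job and notifies every registered listener: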


synchronized (jobs) {
  synchronized (taskScheduler) {
    jobs.put(job.getProfile().getJobID(), job);
    for (JobInProgressListener listener : jobInProgressListeners) {
      listener.jobAdded(job);
    }
  }
}
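The listener mechanism can be sketched without any Hadoop classes. In this hypothetical example, JobListener plays the role of JobInProgressListener, and QueueListener and InitListener play the roles of JobQueueJobInProgressListener and EagerTaskInitializationListener. Job IDs are plain strings, and both listeners simply record jobs in arrival order, which is a simplification of what the real listeners do.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class JobListenerSketch {
    /** Hypothetical stand-in for JobInProgressListener. */
    interface JobListener { void jobAdded(String jobId); }

    /** Keeps the jobs the scheduler will choose from. */
    static final class QueueListener implements JobListener {
        final List<String> queue = new ArrayList<>();
        @Override public void jobAdded(String jobId) { queue.add(jobId); }
    }

    /** Keeps the jobs that still have to be initialized. */
    static final class InitListener implements JobListener {
        final List<String> initQueue = new ArrayList<>();
        @Override public void jobAdded(String jobId) { initQueue.add(jobId); }
    }

    private final Set<String> jobs = new LinkedHashSet<>();
    private final List<JobListener> listeners = new ArrayList<>();

    void addListener(JobListener listener) { listeners.add(listener); }

    /** Records the job, then notifies every registered listener. */
    synchronized void addJob(String jobId) {
        jobs.add(jobId);
        for (JobListener listener : listeners) {
            listener.jobAdded(jobId);
        }
    }

    int jobCount() { return jobs.size(); }

    public static void main(String[] args) {
        JobListenerSketch tracker = new JobListenerSketch();
        QueueListener queue = new QueueListener();
        InitListener init = new InitListener();
        tracker.addListener(queue);
        tracker.addListener(init);
        tracker.addJob("job_1");
        tracker.addJob("job_2");
        System.out.println(queue.queue + " " + init.initQueue);
    }
}
```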

Consider EagerTaskInitializationListener first. Its jobAdded() method adds the job to the list private List<JobInProgress> jobInitQueue = new ArrayList<JobInProgress>();. The start() method of EagerTaskInitializationListener starts a JobInitManager thread:
/////////////////////////////////////////////////////////////////
//  Used to init new jobs that have just been created
/////////////////////////////////////////////////////////////////
class JobInitManager implements Runnable {
  public void run() {
    JobInProgress job = null;
    while (true) {
      try {
        synchronized (jobInitQueue) {
          while (jobInitQueue.isEmpty()) {
            jobInitQueue.wait();
          }
          job = jobInitQueue.remove(0);
        }
        threadPool.execute(new InitJob(job));
      } catch (InterruptedException t) {
        LOG.info("JobInitManagerThread interrupted.");
        break;
      }
    }
    LOG.info("Shutting down thread pool");
    threadPool.shutdownNow();
  }
}

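JobInitManager takes a job off jobInitQueue and runs an InitJob for it on the thread pool. InitJob calls initJob(job) on the TaskTrackerManager. The TaskTrackerManager is the JobTracker, and the JobTracker initializes the job by calling job.initTasks(); on the job (a JobInProgress):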
/** * Construct the splits, etc. This is invoked from an async * thread so that split-computation doesn't block anyone. */ public synchronized void initTasks() throws IOException, KillInterruptedException, UnknownHostException { if (tasksInited || isComplete()) { return; } synchronized(jobInitKillStatus){ if(jobInitKillStatus.killed || jobInitKillStatus.initStarted) { return; } jobInitKillStatus.initStarted = true; }


LOG.info("Initializing " + jobId); final long startTimeFinal = this.startTime; // log job info as the user running the job try { userUGI.doAs(new PrivilegedExceptionAction<Object>() { @Override public Object run() throws Exception { JobHistory.JobInfo.logSubmitted(getJobID(), conf, jobFile, startTimeFinal, hasRestarted()); return null; } }); } catch(InterruptedException ie) { throw new IOException(ie); } // log the job priority setPriority(this.priority); // // generate security keys needed by Tasks // generateAndStoreTokens(); // // read input splits and create a map per a split // TaskSplitMetaInfo[] splits = createSplits(jobId); if (numMapTasks != splits.length) { throw new IOException("Number of maps in JobConf doesn't match number of " + "recieved splits for job " + jobId + "! " + "numMapTasks=" + numMapTasks + ", #splits=" + splits.length); } numMapTasks = splits.length; // Sanity check the locations so we don't create/initialize unnecessary tasks for (TaskSplitMetaInfo split : splits) { NetUtils.verifyHostnames(split.getLocations()); } jobtracker.getInstrumentation().addWaitingMaps(getJobID(), numMapTasks); jobtracker.getInstrumentation().addWaitingReduces(getJobID(), numReduceTasks); this.queueMetrics.addWaitingMaps(getJobID(), numMapTasks); this.queueMetrics.addWaitingReduces(getJobID(), numReduceTasks);

maps = new TaskInProgress[numMapTasks]; for(int i=0; i < numMapTasks; ++i) { inputLength += splits[i].getInputDataLength(); maps[i] = new TaskInProgress(jobId, jobFile, splits[i], jobtracker, conf, this, i, numSlotsPerMap); } LOG.info("Input size for job " + jobId + " = " + inputLength + ". Number of splits = " + splits.length); // Set localityWaitFactor before creating cache localityWaitFactor = conf.getFloat(LOCALITY_WAIT_FACTOR, DEFAULT_LOCALITY_WAIT_FACTOR); if (numMapTasks > 0) { nonRunningMapCache = createCache(splits, maxLevel); } // set the launch time this.launchTime = jobtracker.getClock().getTime(); // // Create reduce tasks // this.reduces = new TaskInProgress[numReduceTasks]; for (int i = 0; i < numReduceTasks; i++) { reduces[i] = new TaskInProgress(jobId, jobFile, numMapTasks, i, jobtracker, conf, this, numSlotsPerReduce); nonRunningReduces.add(reduces[i]); } // Calculate the minimum number of maps to be complete before // we should start scheduling reduces completedMapsForReduceSlowstart = (int)Math.ceil( (conf.getFloat("mapred.reduce.slowstart.completed.maps", DEFAULT_COMPLETED_MAPS_PERCENT_FOR_REDUCE_SLOWSTART) * numMapTasks)); // ... use the same for estimating the total output of all maps resourceEstimator.setThreshhold(completedMapsForReduceSlowstart);


// create cleanup two cleanup tips, one map and one reduce. cleanup = new TaskInProgress[2]; // cleanup map tip. This map doesn't use any splits. Just assign an empty // split. TaskSplitMetaInfo emptySplit = JobSplit.EMPTY_TASK_SPLIT; cleanup[0] = new TaskInProgress(jobId, jobFile, emptySplit, jobtracker, conf, this, numMapTasks, 1); cleanup[0].setJobCleanupTask(); // cleanup reduce tip. cleanup[1] = new TaskInProgress(jobId, jobFile, numMapTasks, numReduceTasks, jobtracker, conf, this, 1); cleanup[1].setJobCleanupTask(); // create two setup tips, one map and one reduce. setup = new TaskInProgress[2]; // setup map tip. This map doesn't use any split. Just assign an empty // split. setup[0] = new TaskInProgress(jobId, jobFile, emptySplit, jobtracker, conf, this, numMapTasks + 1, 1); setup[0].setJobSetupTask(); // setup reduce tip. setup[1] = new TaskInProgress(jobId, jobFile, numMapTasks, numReduceTasks + 1, jobtracker, conf, this, 1); setup[1].setJobSetupTask(); synchronized(jobInitKillStatus){ jobInitKillStatus.initDone = true; // set this before the throw to make sure cleanup works properly tasksInited = true; if(jobInitKillStatus.killed) { throw new KillInterruptedException("Job " + jobId + " killed in init"); } } JobHistory.JobInfo.logInited(profile.getJobID(), this.launchTime, numMapTasks, numReduceTasks); // Log the number of map and reduce tasks

LOG.info("Job " + jobId + " initialized successfully with " + numMapTasks + " map tasks and " + numReduceTasks + " reduce tasks."); }

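initTasks() reads the InputSplit information and creates one map task per InputSplit; each map task is a TaskInProgress stored in maps[]. It likewise creates the reduce tasks, each a TaskInProgress stored in reduces[]. The second listener, JobQueueJobInProgressListener, keeps the job in its queue private Map<JobSchedulingInfo, JobInProgress> jobQueue;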
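One piece of arithmetic in initTasks() is worth a worked example: the number of maps that must complete before reduces are scheduled, computed as the ceiling of the slowstart fraction times the number of map tasks. The sketch below isolates that expression. The class and method names are hypothetical, and the fractions in main() are illustrative values rather than Hadoop defaults.

```java
final class ReduceSlowstartSketch {
    /** Same expression as in initTasks(): ceil(fraction * numMapTasks). */
    static int completedMapsForReduceSlowstart(float fraction, int numMapTasks) {
        return (int) Math.ceil(fraction * numMapTasks);
    }

    public static void main(String[] args) {
        // Half of 10 maps must finish before reduces are scheduled.
        System.out.println(completedMapsForReduceSlowstart(0.5f, 10));
        // 0.25 * 10 = 2.5, which is rounded up.
        System.out.println(completedMapsForReduceSlowstart(0.25f, 10));
        // A fraction of 0 lets reduces be scheduled immediately.
        System.out.println(completedMapsForReduceSlowstart(0.0f, 10));
    }
}
```

Rounding up rather than down means that any non-zero fraction delays reduces until at least one map has completed.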

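14 Task assignment

Task assignment between the TaskTracker and the JobTracker is carried out through the heartbeat mechanism.

The TaskTracker periodically asks the JobTracker whether there is a task for it to run; if there is, the JobTracker assigns it one. Before it can choose a task for a TaskTracker, the JobTracker must choose a Job, and then a task of that Job. A task is either a map task or a reduce task. For a reduce task, the JobTracker simply takes the next reduce task that has not yet run. Each TaskTracker has a fixed number of slots for map tasks and for reduce tasks. The heartbeat also tells the JobTracker that the TaskTracker is alive, and the JobTracker uses the return value of the heartbeat to send its instructions to the TaskTracker.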

The TaskTracker is implemented by org.apache.hadoop.mapred.TaskTracker. Its main() method creates a TaskTracker and runs it:


TaskTracker tt = new TaskTracker(conf);
tt.run();

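The TaskTracker's initialize() method sets up an InterTrackerProtocol RPC proxy to the JobTracker. The TaskTracker then enters offerService(), where at a regular interval (given as 10, presumably seconds) it contacts the JobTracker through transmitHeartBeat() and receives a HeartbeatResponse. HeartbeatResponse.getActions() returns the instructions from the JobTracker as an array of TaskTrackerAction. A LaunchTaskAction is passed to addToTaskQueue(); kill instructions are put in tasksToCleanup, where taskCleanupThread handles each as a KillJobAction or a KillTaskAction. The heartbeat itself is sent in transmitHeartBeat():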
// Check if the last heartbeat got through...
// if so then build the heartbeat information for the JobTracker;
// else resend the previous status information.

// Check if we should ask for a new Task

// add node health information

// Xmit the heartbeat
HeartbeatResponse heartbeatResponse = jobClient.heartbeat(status,
                                                          justStarted,
                                                          justInited,
                                                          askForNewTask,
                                                          heartbeatResponseId);

// The heartbeat got through successfully!

// Clear transient status information which should only
// be sent once to the JobTracker

// Force a rebuild of 'status' on the next iteration

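Before sending the heartbeat, transmitHeartBeat() builds a TaskTrackerStatus that describes the current state of the TaskTracker.

If the TaskTracker has room to run another Task, the askForNewTask argument of heartbeat() is set to true. The TaskTracker then calls the JobTracker's heartbeat() over IPC; the return value of heartbeat() carries the TaskTrackerAction array. The JobTracker's heartbeat() is: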
/** * The periodic heartbeat mechanism between the {@link TaskTracker} and * the {@link JobTracker}. * * The {@link JobTracker} processes the status information sent by the * {@link TaskTracker} and responds with instructions to start/stop * tasks or jobs, and also 'reset' instructions during contingencies. */ public synchronized HeartbeatResponse heartbeat(TaskTrackerStatus status, boolean restarted, boolean initialContact, boolean acceptNewTasks, short responseId) throws IOException { if (LOG.isDebugEnabled()) { LOG.debug("Got heartbeat from: " + status.getTrackerName() + " (restarted: " + restarted + " initialContact: " + initialContact + " acceptNewTasks: " + acceptNewTasks + ")" + " with responseId: " + responseId); } // Make sure heartbeat is from a tasktracker allowed by the jobtracker. if (!acceptTaskTracker(status)) { throw new DisallowedTaskTrackerException(status); } // First check if the last heartbeat response got through String trackerName = status.getTrackerName(); long now = clock.getTime(); if (restarted) { faultyTrackers.markTrackerHealthy(status.getHost()); } else { faultyTrackers.checkTrackerFaultTimeout(status.getHost(), now); }

HeartbeatResponse prevHeartbeatResponse = trackerToHeartbeatResponseMap.get(trackerName); boolean addRestartInfo = false; if (initialContact != true) { // If this isn't the 'initial contact' from the tasktracker, // there is something seriously wrong if the JobTracker has // no record of the 'previous heartbeat'; if so, ask the // tasktracker to re-initialize itself. if (prevHeartbeatResponse == null) { // This is the first heartbeat from the old tracker to the newly // started JobTracker if (hasRestarted()) { addRestartInfo = true; // inform the recovery manager about this tracker joining back recoveryManager.unMarkTracker(trackerName); } else { // Jobtracker might have restarted but no recovery is needed // otherwise this code should not be reached LOG.warn("Serious problem, cannot find record of 'previous' " + "heartbeat for '" + trackerName + "'; reinitializing the tasktracker"); return new HeartbeatResponse(responseId, new TaskTrackerAction[] {new ReinitTrackerAction()}); } } else { // It is completely safe to not process a 'duplicate' heartbeat from a // {@link TaskTracker} since it resends the heartbeat when rpcs are // lost see {@link TaskTracker.transmitHeartbeat()}; // acknowledge it by re-sending the previous response to let the // {@link TaskTracker} go forward. if (prevHeartbeatResponse.getResponseId() != responseId) { LOG.info("Ignoring 'duplicate' heartbeat from '" + trackerName + "'; resending the previous 'lost' response"); return prevHeartbeatResponse; } } } // Process this heartbeat short newResponseId = (short)(responseId + 1); status.setLastSeen(now);

if (!processHeartbeat(status, initialContact, now)) { if (prevHeartbeatResponse != null) { trackerToHeartbeatResponseMap.remove(trackerName); } return new HeartbeatResponse(newResponseId, new TaskTrackerAction[] {new ReinitTrackerAction()}); } // Initialize the response to be sent for the heartbeat HeartbeatResponse response = new HeartbeatResponse(newResponseId, null); List<TaskTrackerAction> actions = new ArrayList<TaskTrackerAction>(); boolean isBlacklisted = faultyTrackers.isBlacklisted(status.getHost()); // Check for new tasks to be executed on the tasktracker if (recoveryManager.shouldSchedule() && acceptNewTasks && !isBlacklisted) { TaskTrackerStatus taskTrackerStatus = getTaskTrackerStatus(trackerName); if (taskTrackerStatus == null) { LOG.warn("Unknown task tracker polling; ignoring: " + trackerName); } else { //setup cleanup task List<Task> tasks = getSetupAndCleanupTasks(taskTrackerStatus); if (tasks == null ) { // tasks = taskScheduler.assignTasks(taskTrackers.get(trackerName)); } if (tasks != null) { for (Task task : tasks) { // actions TaskTracker expireLaunchingTasks.addNewTask(task.getTaskID()); if(LOG.isDebugEnabled()) { LOG.debug(trackerName + " -> LaunchTask: " + task.getTaskID()); } actions.add(new LaunchTaskAction(task)); } } } } // Check for tasks to be killed List<TaskTrackerAction> killTasksList = getTasksToKill(trackerName); if (killTasksList != null) { actions.addAll(killTasksList); } // Check for jobs to be killed/cleanedup

List<TaskTrackerAction> killJobsList = getJobsForCleanup(trackerName); if (killJobsList != null) { actions.addAll(killJobsList); } // Check for tasks whose outputs can be saved List<TaskTrackerAction> commitTasksList = getTasksToSave(status); if (commitTasksList != null) { actions.addAll(commitTasksList); } // calculate next heartbeat interval and put in heartbeat response int nextInterval = getNextHeartbeatInterval(); response.setHeartbeatInterval(nextInterval); response.setActions( actions.toArray(new TaskTrackerAction[actions.size()])); // check if the restart info is req if (addRestartInfo) { response.setRecoveredJobs(recoveryManager.getJobsToRecover()); } // Update the trackerToHeartbeatResponseMap trackerToHeartbeatResponseMap.put(trackerName, response); // Done processing the hearbeat, now remove 'marked' tasks removeMarkedTasks(trackerName); return response; }
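The responseId handling in heartbeat() is what makes a resent heartbeat harmless. The sketch below reduces it to the essentials under stated assumptions: HeartbeatSketch and Response are hypothetical classes, the action is a plain string instead of a TaskTrackerAction array, and the JobTracker-restart branch (hasRestarted()) is left out.

```java
import java.util.HashMap;
import java.util.Map;

final class HeartbeatSketch {
    /** Hypothetical stand-in for HeartbeatResponse. */
    static final class Response {
        final short responseId;
        final String action;
        Response(short responseId, String action) {
            this.responseId = responseId;
            this.action = action;
        }
    }

    // Last response sent to each tracker, keyed by tracker name.
    private final Map<String, Response> lastResponse = new HashMap<>();

    synchronized Response heartbeat(String tracker, boolean initialContact, short responseId) {
        Response prev = lastResponse.get(tracker);
        if (!initialContact) {
            if (prev == null) {
                // No record of this tracker: ask it to re-initialize.
                return new Response(responseId, "REINIT");
            }
            if (prev.responseId != responseId) {
                // The tracker never saw our last reply: resend it unchanged.
                return prev;
            }
        }
        // Normal case: process the heartbeat and advance the response id.
        Response next = new Response((short) (responseId + 1), "OK");
        lastResponse.put(tracker, next);
        return next;
    }

    public static void main(String[] args) {
        HeartbeatSketch jobTracker = new HeartbeatSketch();
        Response first = jobTracker.heartbeat("tt1", true, (short) 0);
        Response second = jobTracker.heartbeat("tt1", false, first.responseId);
        // Simulate a lost reply: the tracker repeats the id it last received.
        Response replay = jobTracker.heartbeat("tt1", false, first.responseId);
        System.out.println(second.responseId + " resent=" + (replay == second));
    }
}
```

Returning the stored response instead of processing the heartbeat again keeps a retried RPC from being handed a second batch of tasks.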

The tasks themselves are chosen by the scheduler. For the default JobQueueTaskScheduler this is assignTasks():
@Override public synchronized List<Task> assignTasks(TaskTracker taskTracker) throws IOException { TaskTrackerStatus taskTrackerStatus = taskTracker.getStatus(); ClusterStatus clusterStatus = taskTrackerManager.getClusterStatus(); final int numTaskTrackers = clusterStatus.getTaskTrackers(); final int clusterMapCapacity = clusterStatus.getMaxMapTasks(); final int clusterReduceCapacity = clusterStatus.getMaxReduceTasks(); Collection<JobInProgress> jobQueue = jobQueueJobInProgressListener.getJobQueue();

// // Get map + reduce counts for the current tracker. // final int trackerMapCapacity = taskTrackerStatus.getMaxMapSlots(); final int trackerReduceCapacity = taskTrackerStatus.getMaxReduceSlots(); final int trackerRunningMaps = taskTrackerStatus.countMapTasks(); final int trackerRunningReduces = taskTrackerStatus.countReduceTasks(); // Assigned tasks List<Task> assignedTasks = new ArrayList<Task>(); // // Compute (running + pending) map and reduce task numbers across pool // int remainingReduceLoad = 0; int remainingMapLoad = 0; synchronized (jobQueue) { for (JobInProgress job : jobQueue) { if (job.getStatus().getRunState() == JobStatus.RUNNING) { remainingMapLoad += (job.desiredMaps() - job.finishedMaps()); if (job.scheduleReduces()) { remainingReduceLoad += (job.desiredReduces() - job.finishedReduces()); } } } } // Compute the 'load factor' for maps and reduces double mapLoadFactor = 0.0; if (clusterMapCapacity > 0) { mapLoadFactor = (double)remainingMapLoad / clusterMapCapacity; } double reduceLoadFactor = 0.0; if (clusterReduceCapacity > 0) { reduceLoadFactor = (double)remainingReduceLoad / clusterReduceCapacity; } // // In the below steps, we allocate first map tasks (if appropriate), // and then reduce tasks if appropriate. We go through all jobs // in order of job arrival; jobs only get serviced if their // predecessors are serviced, too. //

// // We assign tasks to the current taskTracker if the given machine // has a workload that's less than the maximum load of that kind of // task. // However, if the cluster is close to getting loaded i.e. we don't // have enough _padding_ for speculative executions etc., we only // schedule the "highest priority" task i.e. the task from the job // with the highest priority. // final int trackerCurrentMapCapacity = Math.min((int)Math.ceil(mapLoadFactor * trackerMapCapacity), trackerMapCapacity); int availableMapSlots = trackerCurrentMapCapacity - trackerRunningMaps; boolean exceededMapPadding = false; if (availableMapSlots > 0) { exceededMapPadding = exceededPadding(true, clusterStatus, trackerMapCapacity); } int numLocalMaps = 0; int numNonLocalMaps = 0; scheduleMaps: for (int i=0; i < availableMapSlots; ++i) { synchronized (jobQueue) { for (JobInProgress job : jobQueue) { if (job.getStatus().getRunState() != JobStatus.RUNNING) { continue; } Task t = null; // Try to schedule a node-local or rack-local Map task t= job.obtainNewNodeOrRackLocalMapTask(taskTrackerStatus, numTaskTrackers, taskTrackerManager.getNumberOfUniqueHosts()); if (t != null) { assignedTasks.add(t); ++numLocalMaps; // Don't assign map tasks to the hilt! // Leave some free slots in the cluster for future task-failures, // speculative tasks etc. beyond the highest priority job

if (exceededMapPadding) { break scheduleMaps; } // Try all jobs again for the next Map task break; } // Try to schedule a node-local or rack-local Map task t= job.obtainNewNonLocalMapTask(taskTrackerStatus, numTaskTrackers, taskTrackerManager.getNumberOfUniqueHosts()); if (t != null) { assignedTasks.add(t); ++numNonLocalMaps; // We assign at most 1 off-switch or speculative task // This is to prevent TaskTrackers from stealing local-tasks // from other TaskTrackers. break scheduleMaps; } } } } int assignedMaps = assignedTasks.size(); // // Same thing, but for reduce tasks // However we _never_ assign more than 1 reduce task per heartbeat // final int trackerCurrentReduceCapacity = Math.min((int)Math.ceil(reduceLoadFactor * trackerReduceCapacity), trackerReduceCapacity); final int availableReduceSlots = Math.min((trackerCurrentReduceCapacity - trackerRunningReduces), 1); boolean exceededReducePadding = false; if (availableReduceSlots > 0) { exceededReducePadding = exceededPadding(false, clusterStatus, trackerReduceCapacity); synchronized (jobQueue) { for (JobInProgress job : jobQueue) { if (job.getStatus().getRunState() != JobStatus.RUNNING || job.numReduceTasks == 0) {

continue; } Task t = job.obtainNewReduceTask(taskTrackerStatus, numTaskTrackers, taskTrackerManager.getNumberOfUniqueHosts() ); if (t != null) { assignedTasks.add(t); break; } // Don't assign reduce tasks to the hilt! // Leave some free slots in the cluster for future task-failures, // speculative tasks etc. beyond the highest priority job if (exceededReducePadding) { break; } } } } if (LOG.isDebugEnabled()) { LOG.debug("Task assignments for " + taskTrackerStatus.getTrackerName() + " --> " + "[" + mapLoadFactor + ", " + trackerMapCapacity + ", " + trackerCurrentMapCapacity + ", " + trackerRunningMaps + "] -> [" + (trackerCurrentMapCapacity - trackerRunningMaps) + ", " + assignedMaps + " (" + numLocalMaps + ", " + numNonLocalMaps + ")] [" + reduceLoadFactor + ", " + trackerReduceCapacity + ", " + trackerCurrentReduceCapacity + "," + trackerRunningReduces + "] -> [" + (trackerCurrentReduceCapacity - trackerRunningReduces) + ", " + (assignedTasks.size()-assignedMaps) + "]"); } return assignedTasks; }
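The slot arithmetic at the heart of assignTasks() is easier to follow with numbers. The sketch below copies only the map-side formula: the load factor is the remaining map work divided by the cluster's map capacity, and a tracker is filled only up to that fraction of its own slots. MapSlotSketch and its method are hypothetical names, and the figures in main() are made up for illustration.

```java
final class MapSlotSketch {
    /**
     * Free map slots the scheduler will consider on one tracker.
     * A result of zero or less means no map task is assigned.
     */
    static int availableMapSlots(int remainingMapLoad, int clusterMapCapacity,
                                 int trackerMapCapacity, int trackerRunningMaps) {
        double mapLoadFactor = 0.0;
        if (clusterMapCapacity > 0) {
            mapLoadFactor = (double) remainingMapLoad / clusterMapCapacity;
        }
        // Cap the tracker at its share of the load, never above its real capacity.
        int trackerCurrentMapCapacity =
            Math.min((int) Math.ceil(mapLoadFactor * trackerMapCapacity),
                     trackerMapCapacity);
        return trackerCurrentMapCapacity - trackerRunningMaps;
    }

    public static void main(String[] args) {
        // Lightly loaded cluster: 10 maps left, 40 slots, tracker has 4 slots, none busy.
        System.out.println(availableMapSlots(10, 40, 4, 0));
        // Overloaded cluster: 100 maps left, so the tracker may use all 4 slots; 1 is busy.
        System.out.println(availableMapSlots(100, 40, 4, 1));
    }
}
```

Scaling each tracker by the cluster-wide load factor spreads a small job across many trackers instead of filling the first tracker that sends a heartbeat.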

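JobInProgress.obtainNewMapTask() assigns a map task. It calls findNewMapTask(), which uses the Node of the TaskTracker to pick a TaskInProgress from nonRunningMapCache. JobInProgress.obtainNewReduceTask() assigns a reduce task. It calls findNewReduceTask(), which picks a TaskInProgress from nonRunningReduces. Back on the TaskTracker, in offerService():

When the heartbeat response that the TaskTracker receives from the JobTracker contains a LaunchTaskAction, the TaskTracker calls addToTaskQueue(). A map task is added to mapLauncher (of type TaskLauncher); a reduce task is added to reduceLauncher (also of type TaskLauncher).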
// In offerService():
// Send the heartbeat and process the jobtracker's directives
HeartbeatResponse heartbeatResponse = transmitHeartBeat(now);
TaskTrackerAction[] actions = heartbeatResponse.getActions();
if (actions != null) {
  for (TaskTrackerAction action : actions) {
    if (action instanceof LaunchTaskAction) {
      addToTaskQueue((LaunchTaskAction) action);
    } else if (action instanceof CommitTaskAction) {
      CommitTaskAction commitAction = (CommitTaskAction) action;
      if (!commitResponses.contains(commitAction.getTaskID())) {
        LOG.info("Received commit task action for " +
                 commitAction.getTaskID());
        commitResponses.add(commitAction.getTaskID());
      }
    } else {
      tasksToCleanup.put(action);
    }
  }
}

// Called from offerService():
private void addToTaskQueue(LaunchTaskAction action) {
  if (action.getTask().isMapTask()) {
    mapLauncher.addToTaskQueue(action);
  } else {
    reduceLauncher.addToTaskQueue(action);
  }
}
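The dispatch logic can be exercised on its own with stand-in types. In this hypothetical sketch, LaunchTask, CommitTask and KillTask replace LaunchTaskAction, CommitTaskAction and the kill actions, and ordinary collections replace the launcher queues; none of these classes come from Hadoop.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

final class ActionDispatchSketch {
    /** Hypothetical stand-in for TaskTrackerAction. */
    interface Action {}

    static final class LaunchTask implements Action {
        final String taskId;
        final boolean mapTask;
        LaunchTask(String taskId, boolean mapTask) {
            this.taskId = taskId;
            this.mapTask = mapTask;
        }
    }

    static final class CommitTask implements Action {
        final String taskId;
        CommitTask(String taskId) { this.taskId = taskId; }
    }

    static final class KillTask implements Action {
        final String taskId;
        KillTask(String taskId) { this.taskId = taskId; }
    }

    final Queue<LaunchTask> mapLauncher = new ArrayDeque<>();
    final Queue<LaunchTask> reduceLauncher = new ArrayDeque<>();
    final Set<String> commitResponses = new LinkedHashSet<>();
    final List<Action> tasksToCleanup = new ArrayList<>();

    /** Routes each action the way offerService() does. */
    void dispatch(Action[] actions) {
        if (actions == null) {
            return;
        }
        for (Action action : actions) {
            if (action instanceof LaunchTask) {
                LaunchTask launch = (LaunchTask) action;
                (launch.mapTask ? mapLauncher : reduceLauncher).add(launch);
            } else if (action instanceof CommitTask) {
                commitResponses.add(((CommitTask) action).taskId); // set ignores repeats
            } else {
                tasksToCleanup.add(action);
            }
        }
    }

    public static void main(String[] args) {
        ActionDispatchSketch tracker = new ActionDispatchSketch();
        tracker.dispatch(new Action[] {
            new LaunchTask("m_0", true), new LaunchTask("r_0", false),
            new CommitTask("m_1"), new KillTask("m_2")
        });
        System.out.println(tracker.mapLauncher.size() + " map, "
            + tracker.reduceLauncher.size() + " reduce, "
            + tracker.tasksToCleanup.size() + " to clean up");
    }
}
```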


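15 Task execution

Once a TaskTracker has been assigned a task, the next step is to run it. First the TaskTracker localizes the job jar by copying it to the TaskTracker's local filesystem.

The TaskTracker also copies the files the job needs from the distributed cache, creates a local working directory for the task, and unpacks the jar into it. It then creates a TaskRunner to run the task. The TaskRunner launches a new JVM for the task, so the task runs in a child JVM separate from the TaskTracker. In TaskTracker.offerService(), when the heartbeat response from the JobTracker contains a LaunchTaskAction, addToTaskQueue() adds a map task to mapLauncher (of type TaskLauncher) and a reduce task to reduceLauncher (also of type TaskLauncher). The run() method of TaskLauncher takes a TaskInProgress off the queue and calls startNewTask(TaskInProgress tip) to start the task. startNewTask() first calls localizeJob(TaskInProgress tip),

and then launchTaskForJob(TaskInProgress tip, JobConf jobConf, RunningJob rjob) to start the Task: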


RunningJob rjob = localizeJob(tip);
tip.getTask().setJobFile(rjob.getLocalizedJobConf().toString());
// Localization is done. Neither rjob.jobConf nor rjob.ugi can be null
launchTaskForJob(tip, new JobConf(rjob.getJobConf()), rjob);

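localizeJob() initializes the job's working directory (workDir), copies the job jar from HDFS to the local filesystem, and unpacks it with RunJar.unJar(). It also obtains a RunningJob by calling addTaskToJob(), which keeps track of the jobs in runningJobs. addTaskToJob()

adds the task to the tasks of its runningJob; if no runningJob exists for the job yet, it creates one and adds it to runningJobs. launchTaskForJob() then starts the Task. launchTaskForJob() calls TaskInProgress.launchTask(RunningJob rjob):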
/**
 * Kick off the task execution
 */
public synchronized void launchTask(RunningJob rjob) throws IOException {
  if (this.taskStatus.getRunState() == TaskStatus.State.UNASSIGNED ||
      this.taskStatus.getRunState() == TaskStatus.State.FAILED_UNCLEAN ||
      this.taskStatus.getRunState() == TaskStatus.State.KILLED_UNCLEAN) {
    localizeTask(task);
    if (this.taskStatus.getRunState() == TaskStatus.State.UNASSIGNED) {
      this.taskStatus.setRunState(TaskStatus.State.RUNNING);
    }
    setTaskRunner(task.createRunner(TaskTracker.this, this, rjob));
    this.runner.start();
    long now = System.currentTimeMillis();
    this.taskStatus.setStartTime(now);
    this.lastProgressReport = now;
  } else {
    LOG.info("Not launching task: " + task.getTaskID() +
             " since it's state is " + this.taskStatus.getRunState());
  }
}

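launchTask() first calls localizeTask() to update the jobConf for this Task. It then calls the Task's createRunner() method to create a TaskRunner and calls start() on it, which eventually runs the Task in a new java process. task.createRunner() is polymorphic: Task has two subclasses, MapTask and ReduceTask, for Map and Reduce work respectively, and each creates its own kind of TaskRunner. A MapTask creates a MapTaskRunner and a ReduceTask creates a ReduceTaskRunner.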
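The way createRunner() picks the runner type can be shown with a small sketch. All class names below (SketchTask, SketchRunner, and so on) are hypothetical miniatures, not the Hadoop classes, and the runner is a plain Runnable executed inline rather than a thread that starts a child JVM.

```java
// Hypothetical miniature of the Task / TaskRunner hierarchy (not the Hadoop classes).
abstract class SketchRunner implements Runnable {
    boolean started = false;

    abstract String kind();

    public void run() {
        started = true; // the real runner would set up and launch a child JVM here
    }
}

final class SketchMapRunner extends SketchRunner {
    String kind() { return "map"; }
}

final class SketchReduceRunner extends SketchRunner {
    String kind() { return "reduce"; }
}

abstract class SketchTask {
    // Each task type decides which runner executes it.
    abstract SketchRunner createRunner();
}

final class SketchMapTask extends SketchTask {
    SketchRunner createRunner() { return new SketchMapRunner(); }
}

final class SketchReduceTask extends SketchTask {
    SketchRunner createRunner() { return new SketchReduceRunner(); }
}

final class RunnerFactorySketch {
    // The caller never tests the task type; the override picks the runner.
    static String launch(SketchTask task) {
        SketchRunner runner = task.createRunner();
        runner.run();
        return runner.kind();
    }

    public static void main(String[] args) {
        System.out.println(launch(new SketchMapTask()));
        System.out.println(launch(new SketchReduceTask()));
    }
}
```

Because the choice lives in the task subclass, code such as launchTask() needs no instanceof checks to start the right kind of runner.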


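TaskRunner is a thread, so TaskRunner.start() ends up in TaskRunner's run() method. run() prepares the environment of the child java process: the working directory (workDir) and the CLASSPATH, which includes the job jar. It then hands the task to JvmManager, which the TaskTracker uses to manage the JVMs that run Tasks; each such JVM is represented by a JvmRunner. JvmManager's launchJvm() distinguishes map from reduce tasks and delegates to the corresponding JvmManagerForType, whose JvmRunners it manages. JvmManagerForType's reapJvm() decides whether the task can reuse an idle JVM of the same Job held by the JvmManagerForType or whether a new JVM must be started with spawnNewJvm(). spawnNewJvm() creates and starts a JvmRunner, whose run() method calls runChild(). runChild() launches the child JVM through the TaskController, which has two implementations, DefaultTaskController and LinuxTaskController: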
public void runChild(JvmEnv env) throws IOException, InterruptedException{
  int exitCode = 0;
  try {
    env.vargs.add(Integer.toString(jvmId.getId()));
    TaskRunner runner = jvmToRunningTask.get(jvmId);
    if (runner != null) {
      Task task = runner.getTask();
      //Launch the task controller to run task JVM
      String user = task.getUser();
      TaskAttemptID taskAttemptId = task.getTaskID();
      String taskAttemptIdStr = task.isTaskCleanupTask() ?
          (taskAttemptId.toString() + TaskTracker.TASK_CLEANUP_SUFFIX) :
          taskAttemptId.toString();
      exitCode = tracker.getTaskController().launchTask(user,
          jvmId.jobId.toString(), taskAttemptIdStr, env.setup,
          env.vargs, env.workDir, env.stdout.toString(),
          env.stderr.toString());
    }
  }
  // ...

Both TaskController implementations, DefaultTaskController and LinuxTaskController, provide launchTask(). The DefaultTaskController version is:


/**
 * Create all of the directories for the task and launches the child jvm.
 * @param user the user name
 * @param attemptId the attempt id
 * @throws IOException
 */
@Override
public int launchTask(String user, String jobId, String attemptId,
                      List<String> setup, List<String> jvmArguments,
                      File currentWorkDirectory, String stdout,
                      String stderr) throws IOException {
  ShellCommandExecutor shExec = null;
  try {
    FileSystem localFs = FileSystem.getLocal(getConf());
    //create the attempt dirs
    new Localizer(localFs,
        getConf().getStrings(JobConf.MAPRED_LOCAL_DIR_PROPERTY)).
        initializeAttemptDirs(user, jobId, attemptId);
    // create the working-directory of the task
    if (!currentWorkDirectory.mkdir()) {
      throw new IOException("Mkdirs failed to create "
          + currentWorkDirectory.toString());
    }
    //mkdir the loglocation
    String logLocation = TaskLog.getAttemptDir(jobId, attemptId).toString();
    if (!localFs.mkdirs(new Path(logLocation))) {
      throw new IOException("Mkdirs failed to create " + logLocation);
    }
    //read the configuration for the job
    FileSystem rawFs = FileSystem.getLocal(getConf()).getRaw();
    long logSize = 0; //TODO MAPREDUCE-1100
    // get the JVM command line.
    String cmdLine = TaskLog.buildCommandLine(setup, jvmArguments,
        new File(stdout), new File(stderr), logSize, true);
    // write the command to a file in the
    // task specific cache directory
    // TODO copy to user dir
    Path p = new Path(allocator.getLocalPathForWrite(
        TaskTracker.getPrivateDirTaskScriptLocation(user, jobId, attemptId),
        getConf()), COMMAND_FILE);
    String commandFile = writeCommand(cmdLine, rawFs, p);
    rawFs.setPermission(p, TaskController.TASK_LAUNCH_SCRIPT_PERMISSION);
    shExec = new ShellCommandExecutor(new String[]{
        "bash", "-c", commandFile}, currentWorkDirectory);
    shExec.execute();
  } catch (Exception e) {
    if (shExec == null) {
      return -1;
    }
    int exitCode = shExec.getExitCode();
    LOG.warn("Exit code from task is : " + exitCode);
    LOG.info("Output from DefaultTaskController's launchTask follows:");
    logOutput(shExec.getOutput());
    return exitCode;
  }
  return 0;
}

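launchTask() writes the JVM command line to a script and runs it through the Shell. This starts a new JVM whose entry point is Child.main(). Both map tasks and reduce tasks run inside Child. Child's main method calls getTask(jvmId) to obtain the Task to run and then calls the Task's run() method. The MapReduce computation proceeds as Map -->Shuffle-->Reduce.

We start with MapTask's run(). The code of run() is: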
public void run(final JobConf job, final TaskUmbilicalProtocol umbilical)
    throws IOException, ClassNotFoundException, InterruptedException {
  this.umbilical = umbilical;
  // start thread that will handle communication with parent
  TaskReporter reporter = new TaskReporter(getProgress(), umbilical,
      jvmContext);
  reporter.startCommunicationThread();
  boolean useNewApi = job.getUseNewMapper();
  initialize(job, getJobID(), reporter, useNewApi);
  // check if it is a cleanupJobTask
  if (jobCleanup) {
    runJobCleanupTask(umbilical, reporter);
    return;
  }
  if (jobSetup) {
    runJobSetupTask(umbilical, reporter);
    return;
  }
  if (taskCleanup) {
    runTaskCleanupTask(umbilical, reporter);
    return;
  }
  if (useNewApi) {
    runNewMapper(job, splitMetaInfo, umbilical, reporter);
  } else {
    runOldMapper(job, splitMetaInfo, umbilical, reporter);
  }
  done(umbilical, reporter);
}

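run() first starts a TaskReporter, which reports status to the parent process. It then checks whether the task is an auxiliary task and, if so, runs runJobCleanupTask(), runJobSetupTask(), or runTaskCleanupTask() and returns. Otherwise it runs the Mapper. MapReduce has an old and a new API, and MapTask has to support both: it runs the Mapper through runNewMapper() for the new API and through runOldMapper() for the old one. We look at runOldMapper():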
private <INKEY,INVALUE,OUTKEY,OUTVALUE>
void runOldMapper(final JobConf job,
                  final TaskSplitIndex splitIndex,
                  final TaskUmbilicalProtocol umbilical,
                  TaskReporter reporter
                  ) throws IOException, InterruptedException,
                           ClassNotFoundException {
  InputSplit inputSplit = getSplitDetails(
      new Path(splitIndex.getSplitLocation()),
      splitIndex.getStartOffset());
  updateJobWithSplit(job, inputSplit);
  reporter.setInputSplit(inputSplit);
  RecordReader<INKEY,INVALUE> in = isSkipping() ?
      new SkippingRecordReader<INKEY,INVALUE>(inputSplit, umbilical, reporter) :
      new TrackedRecordReader<INKEY,INVALUE>(inputSplit, job, reporter);
  job.setBoolean("mapred.skip.on", isSkipping());

  int numReduceTasks = conf.getNumReduceTasks();
  LOG.info("numReduceTasks: " + numReduceTasks);
  MapOutputCollector collector = null;
  if (numReduceTasks > 0) {
    collector = new MapOutputBuffer(umbilical, job, reporter);
  } else {
    collector = new DirectMapOutputCollector(umbilical, job, reporter);
  }
  MapRunnable<INKEY,INVALUE,OUTKEY,OUTVALUE> runner =
      ReflectionUtils.newInstance(job.getMapRunnerClass(), job);
  try {
    runner.run(in, new OldOutputCollector(collector, conf), reporter);
    collector.flush();
  } finally {
    //close
    in.close(); // close input
    collector.close();
  }
}

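runOldMapper() first prepares the Mapper's input: it reads the InputSplit and wraps it in a RecordReader that supplies the records for map(). It then prepares the Mapper's output, a MapOutputCollector. If the job has no Reducer it uses a DirectMapOutputCollector, otherwise a MapOutputBuffer. Finally it calls the run() method of MapRunner: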
public void run(RecordReader<K1, V1> input, OutputCollector<K2, V2> output,
                Reporter reporter)
    throws IOException {
  try {
    // allocate key & value instances that are re-used for all entries
    K1 key = input.createKey();
    V1 value = input.createValue();
    while (input.next(key, value)) {
      // map pair to output
      mapper.map(key, value, output, reporter);
      if (incrProcCount) {
        reporter.incrCounter(SkipBadRecords.COUNTER_GROUP,
            SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS, 1);
      }
    }
  } finally {
    mapper.close();
  }
}
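The record loop of MapRunner.run() can be reproduced in isolation. The sketch below is not Hadoop code: LineReader, LineMapper, and LineCollector are hypothetical, single-type stand-ins for RecordReader, Mapper, and OutputCollector. It keeps the one detail that matters in the loop, namely that the same key and value holders are refilled for every record.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical single-type versions of RecordReader, OutputCollector and Mapper.
interface LineReader {
    boolean next(long[] offset, StringBuilder line); // refills the reused holders
}

interface LineCollector {
    void collect(String key, int value);
}

interface LineMapper {
    void map(long offset, CharSequence line, LineCollector out);
}

final class MapLoopSketch {
    // Same shape as MapRunner.run(): allocate the holders once, then loop.
    static void run(LineReader input, LineMapper mapper, LineCollector output) {
        long[] key = new long[1];
        StringBuilder value = new StringBuilder();
        while (input.next(key, value)) {
            mapper.map(key[0], value, output);
        }
    }

    // A reader over an in-memory "split"; the key is the line's character offset.
    static LineReader readerOver(final String[] lines) {
        return new LineReader() {
            private int index = 0;
            private long position = 0;

            public boolean next(long[] offset, StringBuilder line) {
                if (index >= lines.length) {
                    return false;
                }
                offset[0] = position;
                line.setLength(0);
                line.append(lines[index]);
                position += lines[index].length() + 1;
                index++;
                return true;
            }
        };
    }

    // Word-count style mapper: emits (word, 1) for every word of a line.
    static List<String> demo() {
        final List<String> emitted = new ArrayList<>();
        LineMapper wordMapper = (offset, line, out) -> {
            for (String word : line.toString().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.collect(word, 1);
                }
            }
        };
        run(readerOver(new String[] {"a b", "b c"}), wordMapper,
            (k, v) -> emitted.add(k + "=" + v));
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Since the holders are reused, a mapper that wants to keep a key or value beyond the current call has to copy it, which is why the sketch converts the line to a String before splitting.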

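MapRunner's run() creates one key object and one value object, reads the <key, value> pairs of the InputSplit one at a time, and passes each pair to the Mapper's map() method. map() emits its output kv pairs through the OutputCollector.

The OutputCollector is where the kv pairs are collected, spilled, and combined: everything that happens to the kv pairs emitted by map(), including spill and combine, happens behind this interface. MapOutputCollector has two implementations, MapOutputBuffer and DirectMapOutputCollector. DirectMapOutputCollector is used when the job has no Reduce phase; the Mapper's output is written out directly and nothing has to be prepared for reduce. Otherwise MapOutputBuffer is used. MapOutputBuffer buffers the map output in memory. The collect() method of MapOutputBuffer is: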
public synchronized void collect(K key, V value, int partition
                                 ) throws IOException {
  reporter.progress();
  if (key.getClass() != keyClass) {
    throw new IOException("Type mismatch in key from map: expected "
                          + keyClass.getName() + ", recieved "
                          + key.getClass().getName());
  }
  if (value.getClass() != valClass) {
    throw new IOException("Type mismatch in value from map: expected "
                          + valClass.getName() + ", recieved "
                          + value.getClass().getName());
  }
  final int kvnext = (kvindex + 1) % kvoffsets.length;
  spillLock.lock();
  try {
    boolean kvfull;
    do {
      if (sortSpillException != null) {
        throw (IOException)new IOException("Spill failed"
            ).initCause(sortSpillException);
      }
      // sufficient acct space
      kvfull = kvnext == kvstart;
      final boolean kvsoftlimit = ((kvnext > kvend)
          ? kvnext - kvend > softRecordLimit
          : kvend - kvnext <= kvoffsets.length - softRecordLimit);
      if (kvstart == kvend && kvsoftlimit) {
        LOG.info("Spilling map output: record full = " + kvsoftlimit);
        startSpill();
      }
      if (kvfull) {
        try {
          while (kvstart != kvend) {
            reporter.progress();
            spillDone.await();
          }
        } catch (InterruptedException e) {
          throw (IOException)new IOException(
              "Collector interrupted while waiting for the writer"
              ).initCause(e);
        }
      }
    } while (kvfull);
  } finally {
    spillLock.unlock();
  }
  try {
    // serialize key bytes into buffer
    int keystart = bufindex;
    keySerializer.serialize(key);
    if (bufindex < keystart) {
      // wrapped the key; reset required
      bb.reset();
      keystart = 0;
    }
    // serialize value bytes into buffer
    final int valstart = bufindex;
    valSerializer.serialize(value);
    int valend = bb.markRecord();
    if (partition < 0 || partition >= partitions) {
      throw new IOException("Illegal partition for " + key + " (" +
          partition + ")");
    }
    mapOutputRecordCounter.increment(1);
    mapOutputByteCounter.increment(valend >= keystart
        ? valend - keystart
        : (bufvoid - keystart) + valend);

    // update accounting info
    int ind = kvindex * ACCTSIZE;
    kvoffsets[kvindex] = ind;
    kvindices[ind + PARTITION] = partition;
    kvindices[ind + KEYSTART] = keystart;
    kvindices[ind + VALSTART] = valstart;
    kvindex = kvnext;
  } catch (MapBufferTooSmallException e) {
    LOG.info("Record too large for in-memory buffer: " + e.getMessage());
    spillSingleRecord(key, value, partition);
    mapOutputRecordCounter.increment(1);
    return;
  }
}
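The accounting in collect() rests on a circular array with a soft limit: records are appended at one index, spilled from another, and a spill starts before the array is completely full. The sketch below illustrates that idea only. SpillRingSketch and its fields are hypothetical, it stores plain ints instead of serialized records, and it spills synchronously, whereas MapOutputBuffer hands the spill to a background thread and also tracks a separate byte buffer.

```java
import java.util.ArrayList;
import java.util.List;

final class SpillRingSketch {
    private final int[] slots;
    private final int softLimit; // used slots at which a spill is triggered
    private int start = 0;       // oldest record not yet spilled
    private int index = 0;       // next free slot
    final List<List<Integer>> spilled = new ArrayList<>();

    SpillRingSketch(int capacity, int softLimit) {
        this.slots = new int[capacity];
        this.softLimit = softLimit;
    }

    // Number of occupied slots, taking wrap-around into account.
    int used() {
        return (index - start + slots.length) % slots.length;
    }

    void collect(int record) {
        int next = (index + 1) % slots.length;
        if (next == start) {
            // One slot always stays empty so that "full" and "empty" differ.
            throw new IllegalStateException("buffer full");
        }
        slots[index] = record;
        index = next;
        if (used() >= softLimit) {
            spill(); // soft limit reached: write out what has accumulated
        }
    }

    // Drain everything between start and index into one spill run.
    void spill() {
        List<Integer> run = new ArrayList<>();
        while (start != index) {
            run.add(slots[start]);
            start = (start + 1) % slots.length;
        }
        if (!run.isEmpty()) {
            spilled.add(run);
        }
    }

    public static void main(String[] args) {
        SpillRingSketch ring = new SpillRingSketch(10, 8);
        for (int i = 0; i < 20; i++) {
            ring.collect(i);
        }
        ring.spill(); // final flush of the records below the soft limit
        System.out.println(ring.spilled);
    }
}
```

Triggering the spill below full capacity is what lets the real collector keep accepting records while earlier ones are being written out.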

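Map output is first written to an in-memory buffer, 100M by default (io.sort.mb). When the buffer is 80% full (io.sort.spill.percent), a spill starts: spillThread writes the buffered data to disk. A spill does the following:
1. The spill calls sortAndSpill(), which sorts the records by partition and then by key, using QuickSort by default.
2. If a combiner is configured, CombinerRunner runs combine on the sorted records of each partition before they are written.
Each time the buffer reaches the spill threshold, sortAndSpill() runs again. Its code is: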


private void sortAndSpill() throws IOException, ClassNotFoundException,
                                   InterruptedException {
  //approximate the length of the output file to be the length of the
  //buffer + header lengths for the partitions
  long size = (bufend >= bufstart
      ? bufend - bufstart
      : (bufvoid - bufend) + bufstart) +
      partitions * APPROX_HEADER_LENGTH;
  FSDataOutputStream out = null;
  try {
    // create spill file
    final SpillRecord spillRec = new SpillRecord(partitions);
    final Path filename =
        mapOutputFile.getSpillFileForWrite(numSpills, size);
    out = rfs.create(filename);
    final int endPosition = (kvend > kvstart)
        ? kvend
        : kvoffsets.length + kvend;
    sorter.sort(MapOutputBuffer.this, kvstart, endPosition, reporter);
    int spindex = kvstart;
    IndexRecord rec = new IndexRecord();
    InMemValBytes value = new InMemValBytes();
    for (int i = 0; i < partitions; ++i) {
      IFile.Writer<K, V> writer = null;
      try {
        long segmentStart = out.getPos();
        writer = new Writer<K, V>(job, out, keyClass, valClass, codec,
                                  spilledRecordsCounter);
        if (combinerRunner == null) {
          // spill directly
          DataInputBuffer key = new DataInputBuffer();
          while (spindex < endPosition &&
              kvindices[kvoffsets[spindex % kvoffsets.length]
                        + PARTITION] == i) {
            final int kvoff = kvoffsets[spindex % kvoffsets.length];
            getVBytesForOffset(kvoff, value);
            key.reset(kvbuffer, kvindices[kvoff + KEYSTART],
                      (kvindices[kvoff + VALSTART] -
                       kvindices[kvoff + KEYSTART]));
            writer.append(key, value);
            ++spindex;
          }
        } else {
          int spstart = spindex;
          while (spindex < endPosition &&
              kvindices[kvoffsets[spindex % kvoffsets.length]
                        + PARTITION] == i) {
            ++spindex;
          }
          // Note: we would like to avoid the combiner if we've fewer
          // than some threshold of records for a partition
          if (spstart != spindex) {
            combineCollector.setWriter(writer);
            RawKeyValueIterator kvIter =
                new MRResultIterator(spstart, spindex);
            combinerRunner.combine(kvIter, combineCollector);
          }
        }
        // close the writer
        writer.close();
        // record offsets
        rec.startOffset = segmentStart;
        rec.rawLength = writer.getRawLength();
        rec.partLength = writer.getCompressedLength();
        spillRec.putIndex(rec, i);
        writer = null;
      } finally {
        if (null != writer) writer.close();
      }
    }

    if (totalIndexCacheMemory >= INDEX_CACHE_MEMORY_LIMIT) {
      // create spill index file
      Path indexFilename =
          mapOutputFile.getSpillIndexFileForWrite(numSpills, partitions
              * MAP_OUTPUT_INDEX_RECORD_LENGTH);
      spillRec.writeToFile(indexFilename, job);
    } else {
      indexCacheList.add(spillRec);
      totalIndexCacheMemory +=
          spillRec.size() * MAP_OUTPUT_INDEX_RECORD_LENGTH;
    }
    LOG.info("Finished spill " + numSpills);
    ++numSpills;
  } finally {
    if (out != null) out.close();
  }
}
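Two ideas in sortAndSpill() are easy to miss in the full listing: the sort reorders an array of record indices rather than the records themselves, and the sorted run is then written partition by partition, optionally through a combiner. The sketch below shows both on in-memory arrays. SortSpillSketch is a hypothetical name, the combiner is assumed to be a simple sum, and Arrays.sort stands in for the sorter (it is not QuickSort for object arrays).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class SortSpillSketch {
    // Returns, for each partition, its "key=sum" entries in key order.
    static List<List<String>> sortAndCombine(final String[] keys,
                                             final int[] values,
                                             final int[] partitions,
                                             int numPartitions) {
        // Sort positions, not records: the record arrays stay where they are.
        Integer[] order = new Integer[keys.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                if (partitions[a] != partitions[b]) {
                    return Integer.compare(partitions[a], partitions[b]);
                }
                return keys[a].compareTo(keys[b]);
            }
        });

        List<List<String>> segments = new ArrayList<>();
        int pos = 0;
        for (int p = 0; p < numPartitions; p++) {
            List<String> segment = new ArrayList<>();
            while (pos < order.length && partitions[order[pos]] == p) {
                // Assumed combiner: sum the values of one key within the partition.
                String key = keys[order[pos]];
                int sum = 0;
                while (pos < order.length
                        && partitions[order[pos]] == p
                        && keys[order[pos]].equals(key)) {
                    sum += values[order[pos]];
                    pos++;
                }
                segment.add(key + "=" + sum);
            }
            segments.add(segment); // an empty partition still gets a segment
        }
        return segments;
    }

    public static void main(String[] args) {
        System.out.println(sortAndCombine(
            new String[] {"b", "a", "b", "c", "a"},
            new int[] {1, 1, 1, 1, 1},
            new int[] {1, 0, 1, 0, 0},
            2));
    }
}
```

Writing one segment per partition, even an empty one, matches the way the listing records an index entry for every partition so that each reducer can locate its part of the spill file.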

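When map() has consumed all of its input, MapOutputBuffer's flush() is called. flush() calls sortAndSpill() one last time for the data still in the buffer and then calls mergeParts(), which merges all spill files into a single map output file; if a combiner is configured, it can be applied again during this merge. The map output can also be compressed by setting mapred.compress.map.output to true.

That completes the Map side; next comes Reduce. ReduceTask.run() starts like MapTask's run(): it calls initialize() and handles the auxiliary tasks through runJobCleanupTask(), runJobSetupTask(), and runTaskCleanupTask(). The reduce work itself has three phases, Copy, Sort, and Reduce: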
public void run(JobConf job, final TaskUmbilicalProtocol umbilical)
    throws IOException, InterruptedException, ClassNotFoundException {
  this.umbilical = umbilical;
  job.setBoolean("mapred.skip.on", isSkipping());
  if (isMapOrReduce()) {
    copyPhase = getProgress().addPhase("copy");
    sortPhase = getProgress().addPhase("sort");
    reducePhase = getProgress().addPhase("reduce");
  }
  // start thread that will handle communication with parent
  TaskReporter reporter = new TaskReporter(getProgress(), umbilical,
      jvmContext);
  reporter.startCommunicationThread();
  boolean useNewApi = job.getUseNewReducer();
  initialize(job, getJobID(), reporter, useNewApi);
  // check if it is a cleanupJobTask
  if (jobCleanup) {
    runJobCleanupTask(umbilical, reporter);
    return;
  }
  if (jobSetup) {
    runJobSetupTask(umbilical, reporter);
    return;
  }
  if (taskCleanup) {
    runTaskCleanupTask(umbilical, reporter);
    return;
  }
  // Initialize the codec
  codec = initCodec();
  boolean isLocal = "local".equals(job.get("mapred.job.tracker", "local"));
  if (!isLocal) {
    reduceCopier = new ReduceCopier(umbilical, job, reporter);
    if (!reduceCopier.fetchOutputs()) {
      if (reduceCopier.mergeThrowable instanceof FSError) {
        throw (FSError)reduceCopier.mergeThrowable;
      }
      throw new IOException("Task: " + getTaskID() +
          " - The reduce copier failed", reduceCopier.mergeThrowable);
    }
  }
  copyPhase.complete(); // copy is already complete
  setPhase(TaskStatus.Phase.SORT);
  statusUpdate(umbilical);
  final FileSystem rfs = FileSystem.getLocal(job).getRaw();
  RawKeyValueIterator rIter = isLocal
      ? Merger.merge(job, rfs, job.getMapOutputKeyClass(),
            job.getMapOutputValueClass(), codec, getMapFiles(rfs, true),
            !conf.getKeepFailedTaskFiles(), job.getInt("io.sort.factor", 100),
            new Path(getTaskID().toString()), job.getOutputKeyComparator(),
            reporter, spilledRecordsCounter, null)
      : reduceCopier.createKVIterator(job, rfs, reporter);
  // free up the data structures
  mapOutputFilesOnDisk.clear();
  sortPhase.complete(); // sort is complete
  setPhase(TaskStatus.Phase.REDUCE);
  statusUpdate(umbilical);
  Class keyClass = job.getMapOutputKeyClass();
  Class valueClass = job.getMapOutputValueClass();
  RawComparator comparator = job.getOutputValueGroupingComparator();
  if (useNewApi) {
    runNewReducer(job, umbilical, reporter, rIter, comparator,
                  keyClass, valueClass);
  } else {
    runOldReducer(job, umbilical, reporter, rIter, comparator,
                  keyClass, valueClass);
  }
  done(umbilical, reporter);
}
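Both the map-side merge of spill files and the Merger.merge() call above turn several sorted runs into one sorted stream. A common way to do this is a k-way merge driven by a priority queue, sketched below. MergeSketch is a hypothetical class over in-memory lists of strings, not Hadoop's Merger; in particular it merges all runs in one pass, while the listing passes io.sort.factor to limit how many segments are merged at a time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

final class MergeSketch {
    // The current smallest unread key of one run, plus the rest of that run.
    private static final class Head implements Comparable<Head> {
        final String key;
        final Iterator<String> rest;

        Head(String key, Iterator<String> rest) {
            this.key = key;
            this.rest = rest;
        }

        public int compareTo(Head other) {
            return key.compareTo(other.key);
        }
    }

    // Merge runs that are each already sorted into one sorted list.
    static List<String> merge(List<List<String>> sortedRuns) {
        PriorityQueue<Head> heap = new PriorityQueue<>();
        for (List<String> run : sortedRuns) {
            Iterator<String> it = run.iterator();
            if (it.hasNext()) {
                heap.add(new Head(it.next(), it));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head smallest = heap.poll();
            merged.add(smallest.key);
            if (smallest.rest.hasNext()) {
                // Replace the consumed key with the next key of the same run.
                heap.add(new Head(smallest.rest.next(), smallest.rest));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(merge(Arrays.asList(
            Arrays.asList("a", "d"),
            Arrays.asList("b", "c"),
            Arrays.asList("e"))));
    }
}
```

The heap holds one entry per run, so the memory needed grows with the number of runs rather than with the amount of data, which is what makes merging on-disk files practical.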

The Reduce phase itself is carried out by runOldReducer() when the old API is used. The code of runOldReducer() is:


private <INKEY,INVALUE,OUTKEY,OUTVALUE>
void runOldReducer(JobConf job,
                   TaskUmbilicalProtocol umbilical,
                   final TaskReporter reporter,
                   RawKeyValueIterator rIter,
                   RawComparator<INKEY> comparator,
                   Class<INKEY> keyClass,
                   Class<INVALUE> valueClass) throws IOException {
  Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE> reducer =
      ReflectionUtils.newInstance(job.getReducerClass(), job);
  // make output collector
  String finalName = getOutputName(getPartition());
  final RecordWriter<OUTKEY, OUTVALUE> out =
      new OldTrackingRecordWriter<OUTKEY, OUTVALUE>(
          reduceOutputCounter, job, reporter, finalName);
  OutputCollector<OUTKEY,OUTVALUE> collector =
      new OutputCollector<OUTKEY,OUTVALUE>() {
        public void collect(OUTKEY key, OUTVALUE value)
            throws IOException {
          out.write(key, value);
          // indicate that progress update needs to be sent
          reporter.progress();
        }
      };
  // apply reduce function
  try {
    //increment processed counter only if skipping feature is enabled
    boolean incrProcCount = SkipBadRecords.getReducerMaxSkipGroups(job)>0 &&
        SkipBadRecords.getAutoIncrReducerProcCount(job);
    ReduceValuesIterator<INKEY,INVALUE> values = isSkipping() ?
        new SkippingReduceValuesIterator<INKEY,INVALUE>(rIter,
            comparator, keyClass, valueClass,
            job, reporter, umbilical) :
        new ReduceValuesIterator<INKEY,INVALUE>(rIter,
            job.getOutputValueGroupingComparator(), keyClass, valueClass,
            job, reporter);
    values.informReduceProgress();
    while (values.more()) {
      reduceInputKeyCounter.increment(1);
      reducer.reduce(values.getKey(), values, collector, reporter);
      if (incrProcCount) {
        reporter.incrCounter(SkipBadRecords.COUNTER_GROUP,
            SkipBadRecords.COUNTER_REDUCE_PROCESSED_GROUPS, 1);
      }
      values.nextKey();
      values.informReduceProgress();
    }
    //Clean up: repeated in catch block below
    reducer.close();
    out.close(reporter);
    //End of clean up.
  } catch (IOException ioe) {
    try {
      reducer.close();
    } catch (IOException ignored) {}
    try {
      out.close(reporter);
    } catch (IOException ignored) {}
    throw ioe;
  }
}
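The while (values.more()) loop above hands the reducer one key together with all values that share it, relying on the input being sorted by key. The following sketch reproduces that grouping on plain arrays. ReduceLoopSketch and SumReducer are hypothetical names, and each group is materialized as a list, whereas ReduceValuesIterator streams the values and decides group boundaries with the grouping comparator.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ReduceLoopSketch {
    // Hypothetical reducer interface: one call per key with all of its values.
    interface SumReducer {
        int reduce(String key, List<Integer> values);
    }

    // Example reducer that adds up the values of a key.
    static final SumReducer SUM = (key, values) -> {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    };

    // sortedKeys must already be sorted; equal keys are therefore adjacent.
    static Map<String, Integer> run(String[] sortedKeys, int[] values,
                                    SumReducer reducer) {
        Map<String, Integer> output = new LinkedHashMap<>();
        int pos = 0;
        while (pos < sortedKeys.length) { // plays the role of values.more()
            String key = sortedKeys[pos];
            List<Integer> group = new ArrayList<>();
            while (pos < sortedKeys.length && sortedKeys[pos].equals(key)) {
                group.add(values[pos]);
                pos++; // advancing past the group plays the role of nextKey()
            }
            output.put(key, reducer.reduce(key, group));
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(new String[] {"a", "a", "b"},
                               new int[] {1, 2, 5}, SUM));
    }
}
```

The grouping needs only a single forward pass because the preceding sort has already placed equal keys next to each other.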

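runOldReducer() first prepares an OutputCollector. Unlike the one in MapTask, this OutputCollector simply wraps a RecordWriter: collect() calls write() on the RecordWriter, which writes the output to HDFS. ReduceTask then uses the KeyClass, ValueClass, and KeyComparator to build the Iterator the Reducer needs and calls the Reducer's reduce() method once for each key.

To summarize, a MapReduce computation runs as Map -->Shuffle-->Reduce. Map and Reduce run the user's functions. The Shuffle is not a separate stage in the code: part of it runs on the Map side (sort, spill, and merge) and part of it on the Reduce side (copy and sort).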


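When the JobTracker is notified that the last task of a job has completed, it marks the job as successful. The JobClient, which polls the JobTracker for the job's status, then learns that the job has finished and returns from runJob.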

16


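1. Progress: a job and each of its tasks have a status, which includes the progress of the map and reduce work. For a map task, progress is the proportion of the input that has been processed. For a reduce task, progress is estimated from the proportion of the reduce input processed across the phases of the reduce.
2. Status updates: when a task has made progress, it reports to its TaskTracker; this is checked every 3 seconds. The TaskTracker sends a heartbeat every 5 seconds, and the heartbeat carries the status of all tasks on the TaskTracker to the JobTracker. The interval between a TaskTracker and the JobTracker is at least 5 seconds. From these updates the JobTracker builds a global view of all running jobs and their tasks.
3. The client obtains the latest status by polling the JobTracker.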

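Failures:
In practice user code has bugs, processes crash, and machines fail, and hadoop is able to handle such failures.
1. A map or reduce task fails: when the user code throws an exception, the child JVM reports the error to the TaskTracker before it exits. The TaskTracker marks the task attempt as failed and frees the slot for another task.
2. A JVM Bug makes the child jvm exit abruptly: the TaskTracker notices that the process has exited and marks the attempt as failed. A hanging task is handled by a timeout: a task that has reported no progress for 10 minutes is marked as failed. Setting the timeout to 0 disables it, so a hanging task is never marked as failed. When the JobTracker is notified that a task attempt has failed, it reschedules the task. The JobTracker tries not to reschedule it on the TaskTracker where it failed before. A task that has failed 4 times is not retried, and the job fails.
3. A TaskTracker fails: a TaskTracker that has crashed or runs very slowly stops sending heartbeats, and the JobTracker removes that TaskTracker from the pool it schedules tasks on. The JobTracker also reruns the map tasks that had completed on that TaskTracker for jobs that are still running. A TaskTracker on which many tasks fail can be blacklisted by the JobTracker even if it has not itself failed.
4. The JobTracker fails: JobTracker failure is the most serious kind. Hadoop has no mechanism to deal with it, because the JobTracker is a single point of failure. When the JobTracker fails, the running jobs fail and have to be resubmitted after the JobTracker has been restarted.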
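The retry rule for failed task attempts can be sketched as follows, assuming a limit of four attempts per task (the default of mapred.map.max.attempts and mapred.reduce.max.attempts in Hadoop 1.x). AttemptTrackerSketch is a hypothetical class, not part of the JobTracker; it only counts failures per task and says whether another attempt is allowed.

```java
import java.util.HashMap;
import java.util.Map;

final class AttemptTrackerSketch {
    private final int maxAttempts;
    private final Map<String, Integer> failures = new HashMap<>();

    AttemptTrackerSketch(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    // Record one failed attempt of a task.
    // Returns true if the task may be rescheduled, false if the limit is reached.
    boolean recordFailure(String taskId) {
        int count = failures.merge(taskId, 1, Integer::sum);
        return count < maxAttempts;
    }

    public static void main(String[] args) {
        AttemptTrackerSketch tracker = new AttemptTrackerSketch(4);
        for (int attempt = 1; attempt <= 4; attempt++) {
            boolean retry = tracker.recordFailure("task_m_000007");
            System.out.println("failure " + attempt + ": retry=" + retry);
        }
    }
}
```

Counting per task rather than per job means one persistently failing task exhausts its own attempts without being masked by other tasks that succeed.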
