Apache Hadoop: HDFS and MapReduce
Reference: Hadoop: The Definitive Guide SE
zhangmengzhi2005@126.com
2013-08-14
Hadoop V1.1

Contents
1
2
3 RPC
4 HDFS
5 HDFS
6 DataNode
7 NameNode
8 Lease
9 Heartbeat
10 HDFS
11 MapReduce
12
13
14
15
16
2013-08-24 2013-09-26
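Google's core competitive technology is its computing platform, which Google described in five papers covering Google Cluster, Chubby, GFS, MapReduce and BigTable.

Apache soon offered comparable open-source systems, which now belong to the Apache Hadoop project:
Chubby --> ZooKeeper
GFS --> HDFS
BigTable --> HBase
MapReduce --> Hadoop MapReduce

Other open-source projects build on the same ideas, for example Facebook's Hive and Yahoo's Pig. HDFS is the distributed file system underneath all of them, and HDFS and MapReduce belong to the same Hadoop project, so they are analyzed together here. From the Hadoop release notes: "12 October, 2012: Release 1.0.4 available". The Hadoop source is organized into core, mapred, tools and hdfs.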
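The top-level Hadoop packages are:

Package     Description
tools       Command-line tools such as DistCp and archive
mapreduce   Hadoop's Map/Reduce implementation
filecache   Local cache of HDFS files, used to speed up Map/Reduce data access
fs          File system abstraction
hdfs        HDFS, Hadoop's distributed file system implementation
ipc         A simple IPC implementation that relies on the encoding and decoding provided by io
io          Encoding and decoding of data for transmission over the network
net         Wrappers for network functions such as DNS and socket
record      Generates encoding and decoding functions from a DDL; C++ and Java are supported
http        Jetty-based HTTP Servlet
log         HTTP Servlet for HTTP access logs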
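Hadoop nodes communicate through remote procedure calls (RPC). An RPC layer serializes each message into a byte stream on the sending side and deserializes it on the receiving side. Both HDFS and MapReduce depend on this. Hadoop does not use Java serialization; it defines its own mechanism in org.apache.hadoop.io. The core of that mechanism is the Writable interface, which has two methods: one writes the object's state to a DataOutput, the other reads it back from a DataInput.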
/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.hadoop.io;

import java.io.DataOutput;
import java.io.DataInput;
import java.io.IOException;

/**
 * A serializable object which implements a simple, efficient, serialization
 * protocol, based on {@link DataInput} and {@link DataOutput}.
 *
 * <p>Any <code>key</code> or <code>value</code> type in the Hadoop Map-Reduce
 * framework implements this interface.</p>
 *
 * <p>Implementations typically implement a static <code>read(DataInput)</code>
 * method which constructs a new instance, calls {@link #readFields(DataInput)}
 * and returns the instance.</p>
 *
 * <p>Example:</p>
 * <p><blockquote><pre>
 *     public class MyWritable implements Writable {
 *       // Some data
 *       private int counter;
 *       private long timestamp;
 *
 *       public void write(DataOutput out) throws IOException {
 *         out.writeInt(counter);
 *         out.writeLong(timestamp);
 *       }
 *
 *       public void readFields(DataInput in) throws IOException {
 *         counter = in.readInt();
 *         timestamp = in.readLong();
 *       }
 *
 *       public static MyWritable read(DataInput in) throws IOException {
 *         MyWritable w = new MyWritable();
 *         w.readFields(in);
 *         return w;
 *       }
 *     }
 * </pre></blockquote></p>
 */
public interface Writable {
  /**
   * Serialize the fields of this object to <code>out</code>.
   *
   * @param out <code>DataOuput</code> to serialize this object into.
   * @throws IOException
   */
  void write(DataOutput out) throws IOException;

  /**
   * Deserialize the fields of this object from <code>in</code>.
   *
   * <p>For efficiency, implementations should attempt to re-use storage in the
   * existing object where possible.</p>
   *
   * @param in <code>DataInput</code> to deseriablize this object from.
   * @throws IOException
   */
  void readFields(DataInput in) throws IOException;
}
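The following is a minimal sketch of a serialization round trip in the style of the MyWritable example from the Javadoc above. It uses only the JDK: SimpleWritable is a hypothetical local stand-in for org.apache.hadoop.io.Writable, and WritableRoundTrip, serialize and deserialize are illustrative names, not Hadoop API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class WritableRoundTrip {
    /** Hypothetical stand-in with the same two methods as Hadoop's Writable. */
    interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** Same fields as the MyWritable example in the Writable Javadoc. */
    static class MyWritable implements SimpleWritable {
        int counter;
        long timestamp;

        public void write(DataOutput out) throws IOException {
            out.writeInt(counter);      // 4 bytes, big-endian
            out.writeLong(timestamp);   // 8 bytes, big-endian
        }

        public void readFields(DataInput in) throws IOException {
            counter = in.readInt();     // fields are read in the order written
            timestamp = in.readLong();
        }
    }

    /** Serializes any SimpleWritable into a byte array. */
    static byte[] serialize(SimpleWritable w) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            w.write(out);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Rebuilds a MyWritable from bytes produced by serialize(). */
    static MyWritable deserialize(byte[] data) {
        try {
            MyWritable w = new MyWritable();
            w.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            return w;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        MyWritable original = new MyWritable();
        original.counter = 42;
        original.timestamp = 1376438400000L;
        byte[] data = serialize(original);
        MyWritable copy = deserialize(data);
        System.out.println(data.length + " bytes, counter=" + copy.counter
            + ", timestamp=" + copy.timestamp);
    }
}
```

The serialized form carries no field names or type tags, only the raw values, which is why the reader must already know the class and the field order.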
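WritableComparable extends both Writable and java.lang.Comparable. IntWritable, for example, implements WritableComparable. Comparing keys is central to MapReduce, so Hadoop also provides an optimized extension of Java's Comparator, the RawComparator interface: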
package org.apache.hadoop.io;

import java.util.Comparator;

import org.apache.hadoop.io.serializer.DeserializerComparator;

/**
 * <p>
 * A {@link Comparator} that operates directly on byte representations of
 * objects.
 * </p>
 * @param <T>
 * @see DeserializerComparator
 */
public interface RawComparator<T> extends Comparator<T> {

  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);

}
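To illustrate what "operates directly on byte representations" means, here is a small JDK-only sketch that compares two serialized 4-byte big-endian integers without creating any objects. RawIntCompare, compareRaw, readInt and encode are hypothetical names chosen for this example; Hadoop's own comparators are not reproduced here.

```java
import java.nio.ByteBuffer;

class RawIntCompare {
    /** Decodes a big-endian int starting at offset s, as DataOutput.writeInt lays it out. */
    static int readInt(byte[] b, int s) {
        return ((b[s] & 0xff) << 24)
             | ((b[s + 1] & 0xff) << 16)
             | ((b[s + 2] & 0xff) << 8)
             |  (b[s + 3] & 0xff);
    }

    /** Compares two serialized ints in place; same argument layout as RawComparator.compare. */
    static int compareRaw(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return Integer.compare(readInt(b1, s1), readInt(b2, s2));
    }

    /** Helper for the demo: serializes an int as 4 big-endian bytes. */
    static byte[] encode(int v) {
        return ByteBuffer.allocate(4).putInt(v).array();
    }

    public static void main(String[] args) {
        byte[] a = encode(3);
        byte[] b = encode(10);
        System.out.println(compareRaw(a, 0, 4, b, 0, 4) < 0);
    }
}
```

Skipping deserialization matters most during sorting, where the same records are compared many times.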
Java primitive   Writable implementation   Serialized size (bytes)
boolean          BooleanWritable           1
byte             ByteWritable              1
int              IntWritable               4
                 VIntWritable              1-5
float            FloatWritable             4
long             LongWritable              8
                 VLongWritable             1-9
double           DoubleWritable            8
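The 1-5 and 1-9 byte sizes come from a variable-length encoding. The sketch below is modeled on the scheme used by Hadoop's WritableUtils.writeVLong as I understand it: values from -112 to 127 take one byte, and anything else gets a length-and-sign header byte followed by the magnitude bytes. Treat it as an illustration under that assumption, not as the library code. VLongSketch, encode and decode are hypothetical names.

```java
import java.io.ByteArrayOutputStream;

class VLongSketch {
    /** Encodes a long in 1 to 9 bytes. */
    static byte[] encode(long i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (i >= -112 && i <= 127) {
            out.write((int) i);              // small values: the byte is the value itself
            return out.toByteArray();
        }
        int len = -112;                      // header base for positive values
        if (i < 0) {
            i ^= -1L;                        // one's complement makes the value non-negative
            len = -120;                      // header base for negative values
        }
        for (long tmp = i; tmp != 0; tmp >>= 8) {
            len--;                           // one step per magnitude byte
        }
        out.write(len);                      // header byte: sign plus byte count
        len = (len < -120) ? -(len + 120) : -(len + 112);
        for (int idx = len; idx != 0; idx--) {
            int shift = (idx - 1) * 8;
            out.write((int) ((i >> shift) & 0xFF));  // most significant byte first
        }
        return out.toByteArray();
    }

    /** Decodes a value produced by encode(). */
    static long decode(byte[] b) {
        byte first = b[0];
        if (first >= -112) {
            return first;                    // single-byte form
        }
        boolean negative = first < -120;
        int n = negative ? -(first + 120) : -(first + 112);
        long v = 0;
        for (int k = 1; k <= n; k++) {
            v = (v << 8) | (b[k] & 0xFF);
        }
        return negative ? ~v : v;
    }

    public static void main(String[] args) {
        long[] samples = {0, 127, 128, -112, -113, 300, Long.MAX_VALUE};
        for (long s : samples) {
            System.out.println(s + " -> " + encode(s).length + " byte(s)");
        }
    }
}
```

The variable-length form saves space when most values are small, at the cost of records that no longer have a fixed width.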
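IntWritable and LongWritable use fixed-length encodings; VIntWritable and VLongWritable use variable-length encodings. A value small enough to fit (between -112 and 127 in WritableUtils; some editions of the book say -127 to 127) is written as a single byte. Otherwise the first byte records the sign and the number of bytes that follow.

Text is a Writable for UTF-8 sequences and can be thought of as the Writable equivalent of java.lang.String.

ObjectWritable is a general-purpose wrapper that Hadoop RPC uses to marshal method arguments and return values. It can carry Java primitives, String, arrays, null and Writable types. When an RPC parameter or return value is a custom type such as MyWritable, ObjectWritable writes the class name first and then the object itself.

When ObjectWritable reads an object back, it has to create an instance of the named class. For Writable types it does this through WritableFactories, so a class such as MyWritable can register its own factory with WritableFactories.setFactory. MapReduce uses Writable types for its keys and values throughout its API.

Hadoop also has a pluggable serialization framework API. A serialization framework is represented by an implementation of Serialization; WritableSerialization is the Serialization implementation for Writable types. A Serialization defines a mapping from types to Serializer and Deserializer instances. The io.serializations property holds the list of Serialization classes Hadoop uses, which by default includes org.apache.hadoop.io.serializer.WritableSerialization.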
RPC
/ Hadoop RPC
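Hadoop's IPC marshals everything as Writable, so parameters and return values must be Java primitives, String, Writable implementations, or arrays of these. Unlike CORBA it needs no IDL, and it does not generate stub or skeleton classes. Interface methods should declare only IOException. Because this is RPC, package org.apache.hadoop.ipc contains a Client class and a Server class. Server is abstract; class RPC wraps Server and uses reflection to expose an object's methods as an RPC server. The rest of this chapter walks through org.apache.hadoop.ipc.

The main inner classes of Client are Call, Connection and ConnectionId.

Call: wraps one Invocation as a value object (VO).
Connection: a thread that handles one connection to a server.
ConnectionId: identifies a connection. Client and Server are connected through a Connection. In HDFS, a NameNode or DataNode may act as a Client that talks to several servers, so a Client holds several Connections, each identified by a ConnectionId. A ConnectionId contains an InetSocketAddress (IP address plus port, or host name plus port).

A Client.Connection can carry many RPC calls. To tell apart the calls on one Connection, every call has a unique id, held in Client.Call. The Connection keeps all of its outstanding Calls in a hash table: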
// calls
private Hashtable<Integer, Call> calls = new Hashtable<Integer, Call>();
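An RPC call is registered on its Connection with addCall. So that Java primitives, String, Writable implementations and arrays of them can all be carried, the data held by a Call is usually packed as an ObjectWritable. A Client.Connection connects to the server over a socket, verifies the client/server version (Client.Connection.writeHeader()), and then sends requests and receives responses as Writable objects. Each Client.Connection runs a thread that keeps reading the socket, unpacks each result, finds the matching Call, sets its value and signals that the result has arrived. Call uses Object's wait and notify to turn the asynchronous message exchange into a synchronous RPC call. One Client may own several Client.Connection objects.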
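The interplay between the calls table and wait/notify can be sketched with plain JDK classes. This is a simplified illustration, not the Client code: CallTableSketch, addCall, receiveResponse and demo are names chosen here, and the "network" is just a second thread that delivers the response.

```java
import java.util.Hashtable;

class CallTableSketch {
    /** One outstanding call: an id, a result slot and a completion flag. */
    static class Call {
        final int id;
        Object value;
        boolean done;

        Call(int id) { this.id = id; }

        /** Called by the receiver thread when the response arrives. */
        synchronized void setValue(Object v) {
            value = v;
            done = true;
            notify();                       // wake the caller blocked in await()
        }

        /** Called by the caller thread; blocks until the response is set. */
        synchronized Object await() throws InterruptedException {
            while (!done) {
                wait();
            }
            return value;
        }
    }

    // Outstanding calls on this connection, keyed by call id.
    private final Hashtable<Integer, Call> calls = new Hashtable<Integer, Call>();
    private int counter = 0;

    /** Registers a new call and returns it. */
    synchronized Call addCall() {
        Call c = new Call(counter++);
        calls.put(c.id, c);
        return c;
    }

    /** Receiver side: match the response to its call by id and complete it. */
    void receiveResponse(int id, Object value) {
        Call c = calls.remove(id);
        if (c != null) {
            c.setValue(value);
        }
    }

    /** Issues one call and lets a second thread play the role of the socket reader. */
    static Object demo(final String payload) {
        final CallTableSketch connection = new CallTableSketch();
        final Call call = connection.addCall();
        Thread receiver = new Thread(
            () -> connection.receiveResponse(call.id, payload.toUpperCase()));
        receiver.start();
        try {
            return call.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(demo("pong"));
    }
}
```

Because responses are matched by id rather than by order, several callers can share one connection and each wakes only when its own result arrives.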
Client call()
/** Make a call, passing <code>param</code>, to the IPC server defined by
 * <code>remoteId</code>, returning the value.
 * Throws exceptions if there are network problems or if the remote code
 * threw an exception. */
public Writable call(Writable param, ConnectionId remoteId)
    throws InterruptedException, IOException {
  Call call = new Call(param);
  Connection connection = getConnection(remoteId, call);
  connection.sendParam(call);                 // send the parameter
  boolean interrupted = false;
  synchronized (call) {
    while (!call.done) {
      try {
        call.wait();                           // wait for the result
      } catch (InterruptedException ie) {
        // save the fact that we were interrupted
        interrupted = true;
      }
    }

    if (interrupted) {
      // set the interrupt flag now that we are done waiting
      Thread.currentThread().interrupt();
    }

    if (call.error != null) {
      if (call.error instanceof RemoteException) {
        call.error.fillInStackTrace();
        throw call.error;
      } else { // local exception
        // use the connection because it will reflect an ip change, unlike
        // the remoteId
        throw wrapException(connection.getRemoteAddress(), call.error);
      }
    } else {
      return call.value;
    }
  }
}
Client.getConnection() --> Client.Connection.setupIOstreams() --> Client.Connection.setupConnection()
Client.Connection.receiveResponse()

/* Receive a response.
 * Because only one receiver, so no synchronization on in.
 */
private void receiveResponse() {
  if (shouldCloseConnection.get()) {
    return;
  }
  touch();

  try {
    int id = in.readInt();                    // try to read an id

    if (LOG.isDebugEnabled())
      LOG.debug(getName() + " got value #" + id);

    Call call = calls.get(id);

    int state = in.readInt();                 // read call status
    if (state == Status.SUCCESS.state) {
      Writable value = ReflectionUtils.newInstance(valueClass, conf);
      value.readFields(in);                   // read value
      call.setValue(value);
      calls.remove(id);
    } else if (state == Status.ERROR.state) {
      call.setException(new RemoteException(WritableUtils.readString(in),
                                            WritableUtils.readString(in)));
      calls.remove(id);
    } else if (state == Status.FATAL.state) {
      // Close the connection
      markClosed(new RemoteException(WritableUtils.readString(in),
                                     WritableUtils.readString(in)));
    }
  } catch (IOException e) {
    markClosed(e);
  }
}
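The main inner classes of Server are Call, Listener, Responder, Connection and Handler.

Call: holds one request sent by a client.
Listener: the listening thread. Once the Listener has accepted a connection, a Listener.Reader thread reads the requests that arrive on it.
Responder: sends RPC responses back to the clients.
Connection: the server-side state of one client connection.
Handler: request-processing threads. Each one loops, taking a call from callQueue and processing it.

Server is abstract. Its single abstract method is: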
/** Called for each call. */
public abstract Writable call(Class<?> protocol, Writable param, long receiveTime)
    throws IOException;
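Concrete subclasses of Server implement call to do the actual work. Server.Call resembles Client.Call: it represents one request, and its id and param mean the same as in Client.Call. The difference lies in three further fields. connection is the connection the Call came from; when processing finishes, the result goes back to the client over the same connection. timestamp is the arrival time of the request. response is the result of processing, which is the serialized form of either a Writable or an exception.

Server.Connection maintains the socket connection from one client. It handles version checking, reads requests and passes them to the request-processing threads, then receives the results and sends them back to the client.

Hadoop's Server uses Java NIO, so it does not need one thread per socket to read data. Accepting new connections is handled by the Listener thread.

There are usually several request-processing threads, all instances of Server.Handler. Their run method loops: take a Server.Call, invoke Server.call, collect and serialize the result, and hand it to the Responder. Finished requests need their results written back; with NIO a single thread is enough for that, and the logic lives in Responder.

The Server side starts processing a call in the Listener. The Listener's run() method: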
public void run() {
  LOG.info(getName() + ": starting");
  SERVER.set(Server.this);
  while (running) {
    SelectionKey key = null;
    try {
      selector.select();
      Iterator<SelectionKey> iter = selector.selectedKeys().iterator();
      while (iter.hasNext()) {
        key = iter.next();
        iter.remove();
        try {
          if (key.isValid()) {
            if (key.isAcceptable())
              doAccept(key);
          }
        } catch (IOException e) {
        }
        key = null;
      }
    } catch (OutOfMemoryError e) {
      // we can run out of memory if we have too many threads
      // log the event and sleep for a minute and give
      // some thread(s) a chance to finish
      LOG.warn("Out of Memory in server select", e);
      closeCurrentConnection(key, e);
      cleanupConnections(true);
      try { Thread.sleep(60000); } catch (Exception ie) {}
    } catch (Exception e) {
      closeCurrentConnection(key, e);
    }
    cleanupConnections(false);
  }
  LOG.info("Stopping " + this.getName());

  synchronized (this) {
    try {
      acceptChannel.close();
      selector.close();
    } catch (IOException e) { }

    selector = null;
    acceptChannel = null;

    // clean up all connections
    while (!connectionList.isEmpty()) {
      closeConnection(connectionList.remove(0));
    }
  }
}
Listener.doAccept()

void doAccept(SelectionKey key) throws IOException, OutOfMemoryError {
  Connection c = null;
  ServerSocketChannel server = (ServerSocketChannel) key.channel();
  SocketChannel channel;
  while ((channel = server.accept()) != null) {
    channel.configureBlocking(false);
    channel.socket().setTcpNoDelay(tcpNoDelay);
    Reader reader = getReader();   // pick one reader from the readers array
    try {
      reader.startAdd();           // mark the reader's readSelector as "adding"
      SelectionKey readKey = reader.registerChannel(channel);  // register the channel for reads
      c = new Connection(readKey, channel, System.currentTimeMillis());  // create the connection
      readKey.attach(c);           // attach the connection to readKey
      synchronized (connectionList) {
        connectionList.add(numConnections, c);
        numConnections++;
      }
      if (LOG.isDebugEnabled())
        LOG.debug("Server connection from " + c.toString() +
            "; # active connections: " + numConnections +
            "; # queued calls: " + callQueue.size());
    } finally {
      // clear the "adding" flag and notify() the reader, which waits while the Listener is adding
      reader.finishAdd();
    }
  }
}
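Once the reader has been woken up, it carries on and calls doRead(). doRead() calls Server.Connection.readAndProcess(). readAndProcess() reads the data and calls Server.Connection.processOneRpc(), which in turn calls processData(). processData() builds a call object and adds it to callQueue. At this point the rpc call is queued on the Server; a Handler thread then takes the call and processes it.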
Handler.run() (excerpt):

final Call call = callQueue.take(); // pop the queue; maybe blocked here
value = call(call.connection.protocol, call.param, call.timestamp);
synchronized (call.connection.responseQueue) {
  // setupResponse() needs to be sync'ed together with
  // responder.doResponse() since setupResponse may use
  // SASL to encrypt response data and SASL enforces
  // its own message ordering.
  setupResponse(buf, call,
                (error == null) ? Status.SUCCESS : Status.ERROR,
                value, errorClass, error);
  // Discard the large buf and reset it back to
  // smaller size to freeup heap
  if (buf.size() > maxRespSize) {
    LOG.warn("Large response size " + buf.size() + " for call " +
        call.toString());
    buf = new ByteArrayOutputStream(INITIAL_RESP_BUF_SIZE);
  }
  responder.doRespond(call);
}
Server.Responder.doRespond()

void doRespond(Call call) throws IOException {
  synchronized (call.connection.responseQueue) {
    call.connection.responseQueue.addLast(call);
    if (call.connection.responseQueue.size() == 1) {
      // only response queued: try to write it directly; the rest goes through writeSelector
      processResponse(call.connection.responseQueue, true);
    }
  }
}

With Client and Server covered, the remaining piece is RPC.java.
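RPC.java contains these inner classes:

ClientCache: caches client objects, using the socket factory as the hash key. The storage is a HashMap<SocketFactory, Client>.
Invoker: the invocation handler of the dynamic proxy; it implements InvocationHandler.
Server: the concrete implementation of ipc.Server.
Invocation: holds methodName, parameterClasses and parameters, and implements Writable.

Hadoop RPC is built on Java dynamic proxies. Invoker implements the invoke method of InvocationHandler. When a method is called on the proxy, Invoker packs the call into an Invocation and sends it to the server through a Client over the socket. RPC.Server extends org.apache.hadoop.ipc.Server; it receives the Invocation and uses Java reflection to call the target method. In short, RPC combines socket communication with a Dynamic Proxy, with Invocation as the value object (VO), ClientCache for the clients, Server on the receiving side and Invoker on the calling side. The invoke() method of RPC.Invoker: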
public Object invoke(Object proxy, Method method, Object[] args)
    throws Throwable {
  final boolean logDebug = LOG.isDebugEnabled();
  long startTime = 0;
  if (logDebug) {
    startTime = System.currentTimeMillis();
  }

  ObjectWritable value = (ObjectWritable)
      client.call(new Invocation(method, args), remoteId);
  if (logDebug) {
    long callTime = System.currentTimeMillis() - startTime;
    LOG.debug("Call: " + method.getName() + " " + callTime);
  }
  return value.get();
}
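The dynamic-proxy pattern used by Invoker can be shown in a few lines of plain Java. In this sketch the network hop is replaced by a direct reflective call, so it demonstrates only the proxy and Invocation idea. EchoProtocol, EchoServer, dispatch and getProxy are hypothetical names, not part of Hadoop.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

class RpcProxySketch {
    /** Hypothetical protocol interface shared by client and server. */
    interface EchoProtocol {
        String echo(String message);
        int add(int a, int b);
    }

    /** Server-side implementation of the protocol. */
    static class EchoServer implements EchoProtocol {
        public String echo(String message) { return "echo: " + message; }
        public int add(int a, int b) { return a + b; }
    }

    /** Value object describing one call: method name, parameter types, arguments. */
    static final class Invocation {
        final String methodName;
        final Class<?>[] parameterClasses;
        final Object[] parameters;

        Invocation(Method method, Object[] args) {
            this.methodName = method.getName();
            this.parameterClasses = method.getParameterTypes();
            this.parameters = args;
        }
    }

    /** Stand-in for the server: find the method by reflection and call it. */
    static Object dispatch(EchoProtocol instance, Invocation inv) throws Exception {
        Method m = EchoProtocol.class.getMethod(inv.methodName, inv.parameterClasses);
        return m.invoke(instance, inv.parameters);
    }

    /** Client side: every call on the returned proxy becomes an Invocation. */
    static EchoProtocol getProxy(final EchoProtocol server) {
        InvocationHandler invoker = new InvocationHandler() {
            public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
                // A real RPC layer would serialize the Invocation and send it over a socket here.
                return dispatch(server, new Invocation(method, args));
            }
        };
        return (EchoProtocol) Proxy.newProxyInstance(
            EchoProtocol.class.getClassLoader(),
            new Class<?>[] { EchoProtocol.class },
            invoker);
    }

    public static void main(String[] args) {
        EchoProtocol proxy = getProxy(new EchoServer());
        System.out.println(proxy.echo("hi"));
        System.out.println(proxy.add(2, 3));
    }
}
```

The caller only ever sees the protocol interface, which is why no stub classes have to be generated ahead of time.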
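On the server side, getServer() creates a Server object. The object is an RPC.Server, a subclass of ipc.Server. On the client side, RPC.waitForProxy() obtains the proxy for a remote Server; the Server itself is created with RPC.getServer().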
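VersionedProtocol is the parent interface of all RPC protocol interfaces. It declares a single method, getProtocolVersion().

(1) HDFS protocols
ClientDatanodeProtocol: between a client and a datanode.
ClientProtocol: between a client and the Namenode.
DatanodeProtocol: between a Datanode and the Namenode, for example heartbeats and blockreport.
NamenodeProtocol: between the SecondaryNode and the Namenode.
InterDatanodeProtocol: between Datanodes, for block operations.

(2) MapReduce protocols
InterTrackerProtocol: between a TaskTracker and the JobTracker; its role is similar to DatanodeProtocol.
JobSubmissionProtocol: between the JobClient and the JobTracker, used to submit a Job and to query or control Jobs.
TaskUmbilicalProtocol: between a Task (a map or reduce child process) and its TaskTracker.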
HDFS
HDFS NameNode DataNodeNameNode
DataNode namenode namenode DataNode DataNode DataNode DataNode ID HDFS DataNode DataNode NameNode NameNode NameNode NameNode DataNode NameNode HDFS
DataNode NameNode DataNode NameNode Heartbeat NameNode NameNode DataNode DataNode NameNode DataNode / DataNode / DataNode DataNode DataNode DataNode Hadoop HADOOP_HOME/conf/hdfs-site.xml
<property>
  <name>dfs.data.dir</name>
  <value>/usr/local/hadoop/data1,/usr/local/hadoop/data2</value>
</property>
/usr/local/hadoop/data1 /usr/local/hadoop/data2
drwxr-xr-x  6 hadoop hadoop 4096 4 26 15:11 .
drwxr-xr-x 24 hadoop hadoop 4096 4 19 14:26 ..
drwxrwxr-x  2 hadoop hadoop 4096 4 26 13:57 blocksBeingWritten
drwxrwxr-x  2 hadoop hadoop 4096 4 26 13:57 current
drwxrwxr-x  2 hadoop hadoop 4096 4  3 14:10 detach
-rw-rw-r--  1 hadoop hadoop  157 4  3 14:10 storage
drwxrwxr-x  2 hadoop hadoop 4096 4 26 13:56 tmp
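current holds the blocks that are currently valid. detach holds snapshot data. tmp holds temporary blocks needed by some operations: when the DataNode creates a block it first writes it under tmp and moves it to current on success. Inside current there are block files and block metadata files, plus subdirectories named subdir0 to subdir63 that again contain block files and metadata. HDFS limits the number of block files per directory and creates subdirectories once the limit is reached. A listing of current: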
-rw-rw-r-- 1 hadoop hadoop   66 4 18 17:19 blk_8027040652559443757
-rw-rw-r-- 1 hadoop hadoop   11 4 18 17:19 blk_8027040652559443757_1055.meta
-rw-rw-r-- 1 hadoop hadoop  727 4 18 17:42 blk_-8559958631634410715
-rw-rw-r-- 1 hadoop hadoop   15 4 18 17:42 blk_-8559958631634410715_1071.meta
-rw-rw-r-- 1 hadoop hadoop 7416 6 14 15:36 dncp_block_verification.log.curr
-rw-rw-r-- 1 hadoop hadoop  155 6 14 15:36 VERSION
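Here 8027040652559443757 and -8559958631634410715 are block IDs, and 1055 and 1071 are the version numbers (generation stamps) in the names of the matching .meta files.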
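The naming convention seen in the listing (blk_<id> for the data file, blk_<id>_<stamp>.meta for its metadata file) can be captured with a small parser. This is an illustrative sketch based only on the file names shown above; BlockFileNames and its methods are hypothetical and do not correspond to DataNode code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BlockFileNames {
    // Block ids may be negative, so the id group allows a leading minus sign.
    private static final Pattern META = Pattern.compile("blk_(-?\\d+)_(\\d+)\\.meta");

    /** Returns {blockId, generationStamp} parsed from a meta file name. */
    static long[] parseMeta(String fileName) {
        Matcher m = META.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a block meta file: " + fileName);
        }
        return new long[] { Long.parseLong(m.group(1)), Long.parseLong(m.group(2)) };
    }

    /** Name of the data file for a block id. */
    static String blockFileName(long blockId) {
        return "blk_" + blockId;
    }

    /** Name of the metadata file for a block id and generation stamp. */
    static String metaFileName(long blockId, long generationStamp) {
        return blockFileName(blockId) + "_" + generationStamp + ".meta";
    }

    public static void main(String[] args) {
        long[] parsed = parseMeta("blk_-8559958631634410715_1071.meta");
        System.out.println("id=" + parsed[0] + " stamp=" + parsed[1]);
        System.out.println(metaFileName(8027040652559443757L, 1055));
    }
}
```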
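Hadoop supports upgrade, rollback and finalize on the DataNode. DataStorage carries these operations out with the help of the VERSION file. Commands from the NameNode reach a DataNode in the reply to its Heartbeat.

Upgrade:
1. Rename current to previous.tmp.
2. Build a snapshot: create a new current whose files are hard links to the files in previous.tmp.
3. Write a new VERSION file into current.
4. Rename previous.tmp to previous.

Rollback:
1. Rename current to removed.tmp.
2. Rename previous to current.
3. Delete removed.tmp.

Finalize:
1. Rename previous to finalized.tmp.
2. Delete finalized.tmp.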
HDFS
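Once previous has been deleted, the DataNode can no longer roll back. The DataNode's storage-related classes build on Hadoop's Storage class.

On a DataNode each block is stored as a file whose name starts with blk_, and each block has a .meta file (the meta file) named blockFileName_generationStamp.meta. In HDFS a storage is one configured data directory; a Datanode can have several storages, while a Namenode storage holds metadata only. The Datanode manages its block files through FSDataset. An FSDataset consists of FSVolume objects, and each FSVolume corresponds to one storage. An FSVolume has these directories: dataDir (blocks and their meta files), tmpDir (temporary files) and detachDir (copy on write for blocks in snapshot). When a block that belongs to a snapshot has to be modified, it is first copied through detachDir.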
StorageInfo has 3 fields:

public int layoutVersion;   // Version read from the stored file.
public int namespaceID;     // namespace id of the storage
public long cTime;          // creation timestamp
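The storage state is derived from which directories exist:

COMPLETE_ROLLBACK: removed.tmp exists, current exists, previous does not.
RECOVER_ROLLBACK: removed.tmp exists, current does not, previous exists.
COMPLETE_CHECKPOINT: lastcheckpoint.tmp exists, current exists.
RECOVER_CHECKPOINT: lastcheckpoint.tmp exists, current does not.
NORMAL: none of the temporary directories exist.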
A StorageDirectory may contain these directories:

current
previous
previous.tmp
removed.tmp
finalized.tmp
lastcheckpoint.tmp (NameNode only)
previous.checkpoint (NameNode only)
doRecover performs the following action for each state:

COMPLETE_UPGRADE: mv previous.tmp -> previous
RECOVER_UPGRADE: mv previous.tmp -> current
COMPLETE_FINALIZE: rm finalized.tmp
COMPLETE_ROLLBACK: rm removed.tmp
RECOVER_ROLLBACK: mv removed.tmp -> current
COMPLETE_CHECKPOINT: mv lastcheckpoint.tmp -> previous.checkpoint
RECOVER_CHECKPOINT: mv lastcheckpoint.tmp -> current
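Take RECOVER_UPGRADE as an example. An upgrade runs in three steps:

1. current -> previous.tmp
2. Rebuild current
3. previous.tmp -> previous

If the upgrade fails while previous.tmp exists and current has not been completely rebuilt, recovery moves previous.tmp back to current. If previous.tmp and a complete current both exist, recovery finishes the upgrade by renaming previous.tmp to previous.

StorageDirectory also manages the StorageInfo. It stores the StorageInfo in the VERSION file and provides read/write methods to load and save it. A DataNode VERSION file looks like this: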
#Fri Jun 14 15:36:32 CST 2013
namespaceID=1584403768
storageID=DS-1617068520-127.0.1.1-50010-1364969464023
cTime=0
storageType=DATA_NODE
layoutVersion=-32
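The VERSION file has the shape of a Java properties file (a dated comment line followed by key=value pairs), so it can be read with java.util.Properties. The sketch below parses the sample shown above; VersionFileSketch, SAMPLE and parse are hypothetical names, and the way the fields are mapped back onto StorageInfo is an assumption made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class VersionFileSketch {
    /** The DataNode VERSION sample from the text. */
    static final String SAMPLE =
          "#Fri Jun 14 15:36:32 CST 2013\n"
        + "namespaceID=1584403768\n"
        + "storageID=DS-1617068520-127.0.1.1-50010-1364969464023\n"
        + "cTime=0\n"
        + "storageType=DATA_NODE\n"
        + "layoutVersion=-32\n";

    /** Parses VERSION file content into a Properties object. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));   // lines starting with '#' are comments
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = parse(SAMPLE);
        // The three StorageInfo fields, converted to their declared types.
        int layoutVersion = Integer.parseInt(props.getProperty("layoutVersion"));
        int namespaceID = Integer.parseInt(props.getProperty("namespaceID"));
        long cTime = Long.parseLong(props.getProperty("cTime"));
        System.out.println(layoutVersion + " " + namespaceID + " " + cTime);
    }
}
```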
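StorageDirectory also handles the lock file in_use.lock: it prevents two processes from using the same StorageDirectory, through the lock and unlock methods.

With StorageDirectory understood, Storage is simple: it is essentially a set of operations over a list of StorageDirectory objects. DataStorage is the subclass of Storage used by the DataNode. The DataNode upgrade/rollback/finalize procedures described earlier are the doUpgrade/doRollback/doFinalize methods of DataStorage. DataStorage also provides format, which creates the Storage on a DataNode.

None of Storage, StorageDirectory or DataStorage deals with blocks. Everything block-related is handled by FSDataset and its helper classes.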
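Block represents one data block. Given the data file blk_3148782637964391313 and the metadata file blk_3148782637964391313_242812.meta, the Block has three fields: blockId is 3148782637964391313, the generation stamp is 242812, and numBytes is the size of the Block.

DatanodeBlockInfo holds the file-system information of a Block: the volume (FSVolume) it lives on, its file, and its detach state. The detach state needs some explanation. An upgrade creates a snapshot whose files are hard links to the block files and meta files in current, so both names point at the same content. If a file in current were changed without a detach, the change would also show up in the snapshot. To detach, the file is copied in a temporary directory and the copy is renamed over the file in current, after which current and the snapshot no longer share content. This technique is known as copy-on-write. The detachBlock method of DatanodeBlockInfo performs the detach for a Block's data file and meta file.

Next come FSVolumeSet, FSVolume and FSDir. A DataNode can be configured with several Storages, and because HDFS limits the number of Blocks per directory, one Storage contains several directories. Accordingly, in FSDataset an FSVolume corresponds to one Storage and an FSDir to one directory. All FSVolumes are managed by FSVolumeSet, so FSDataset can manage all of its storage space through a single FSVolumeSet object.

An FSDir corresponds to one HDFS data directory holding block files and their meta files. When a Block is added and the directory is full, FSDir creates subdirectories. FSDir's getBlockInfo method scans all Blocks under the directory, and getVolumeMap builds the mapping from Block to DatanodeBlockInfo.

An FSVolume corresponds to one Storage. Block files, detach files and temporary files are all managed through FSVolume, because moving a file within one storage does not require copying the data. FSVolume has a recoverDetachedBlocks method that recovers detach files: as with Storage state, the system may crash while a detach copy is in progress, so detach operations need recovery. FSVolume also tracks the remaining capacity of its file system, which is used to decide on which FSVolume a new Block is placed. FSVolumeSet simply manages all FSVolumes.

In HDFS, writing a chunk puts a file into an active state, for which FSDataset introduces the class ActiveFile. An ActiveFile holds a file and the threads operating on it; there may be more than one thread. The ActiveFile constructor adds the current thread automatically.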
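The main FSDataset methods are described below.

writeToBlock: returns the output streams for a block. BlockWriteStreams contains both the data output stream and the metadata (checksum) output stream. The isRecovery parameter says whether this write recovers from an earlier failed write. In the normal case, if the block is already a valid block or another thread is already writing it, writeToBlock throws an exception. Otherwise it creates the temporary data file and temporary meta file, records an ActiveFile in ongoingCreates and returns the BlockWriteStreams. A new ActiveFile records the current thread in its threads. Taking blk_3148782637964391313 as an example: when the DataNode opens a write stream for Block ID 3148782637964391313, it creates tmp/blk_3148782637964391313 as the temporary data file; the meta file is tmp/blk_3148782637964391313_XXXXXX.meta, where XXXXXX is the version number. When isRecovery is true the flow is more involved. If the failed write was caused by a lost acknowledgement after the commit (see finalizeBlock), a detached copy is created first. writeToBlock then checks whether any thread is still writing the file; if so, it stops that thread with interrupt and removes the entry from ongoingCreates. It then creates or reuses the temporary files, and continues as in the normal case by recording a new ActiveFile in ongoingCreates.

public void updateBlock(Block oldblock, Block newblock) throws IOException;
Updates a block. updateBlock loops until no thread is writing the block. On each pass it calls the internal method tryUpdateBlock. If tryUpdateBlock finds no active writer, it updates the information for the block, including the meta file and the in-memory map volumeMap. If tryUpdateBlock finds threads still attached to the block, updateBlock tries to stop them and waits on join.

public void finalizeBlock(Block b) throws IOException;
Commits (finalizes) a block opened with writeToBlock. The write completed without error, so the Block can move from tmp to current. In FSDataset, finalizeBlock removes the block from ongoingCreates and puts the block's DatanodeBlockInfo into volumeMap. Taking blk_3148782637964391313 again: when the DataNode commits Block ID 3148782637964391313, it moves tmp/blk_3148782637964391313 into a directory under current, for example subdir12, so tmp/blk_3148782637964391313 becomes current/subdir12/blk_3148782637964391313. The meta file also ends up in current/subdir12.

public void unfinalizeBlock(Block b) throws IOException;
Cancels a block opened with writeToBlock; the opposite of finalizeBlock.

public boolean isValidBlock(Block b);
Reports whether the Block is valid.

public void invalidate(Block invalidBlks[]) throws IOException;
Invalidates the given blocks.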
DataNode
HDFS
DataNode
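DataXceiverServer and DataXceiver implement block reads and writes on the DataNode. Block data is transferred over a streaming interface, not over RPC; RPC is used for control messages. The DataNode starts a DataXceiverServer, which hands each connection to a DataXceiver. A DataXceiver uses BlockSender to send block data and BlockReceiver to receive it.

DataXceiverServer listens on a socket. In its run method, each time it accepts a socket it creates a DataXceiver for that socket. The DataXceiver reads the operation code of the request and dispatches on it. The operation codes are: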
OP_WRITE_BLOCK (80)
OP_READ_BLOCK (81)
OP_READ_METADATA (82)
OP_REPLACE_BLOCK (83)
OP_COPY_BLOCK (84)
OP_BLOCK_CHECKSUM (85)
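A dispatcher keyed on these operation codes could be organized as below. This is a hypothetical sketch that uses only the six codes listed above; OpCodeSketch, Op and fromCode are illustrative names, and the real DataXceiver is structured differently.

```java
class OpCodeSketch {
    /** The six operation codes from the list above. */
    enum Op {
        WRITE_BLOCK(80),
        READ_BLOCK(81),
        READ_METADATA(82),
        REPLACE_BLOCK(83),
        COPY_BLOCK(84),
        BLOCK_CHECKSUM(85);

        final int code;

        Op(int code) { this.code = code; }

        /** Maps the code read from the stream to an operation. */
        static Op fromCode(int code) {
            for (Op op : values()) {
                if (op.code == code) {
                    return op;
                }
            }
            throw new IllegalArgumentException("Unknown opcode " + code);
        }
    }

    /** Describes which direction block data flows for an operation. */
    static String describe(Op op) {
        switch (op) {
            case WRITE_BLOCK: return "receive block data";
            case READ_BLOCK:  return "send block data";
            default:          return "other block operation";
        }
    }

    public static void main(String[] args) {
        Op op = Op.fromCode(80);
        System.out.println(op + ": " + describe(op));
    }
}
```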
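A file is uploaded with:

$HADOOP_HOME/bin/hadoop fs -copyFromLocal <localsrc> <dst>

On the datanode side this results in OP_WRITE_BLOCK (80) requests. In outline:

1. The client asks the namenode to create the file (the same entry point also serves hadoop append).
2. The namenode creates the file entry; at this point the file has no block.
3. The client calls IOUtils.copyBytes() to write the data to hdfs, and the data is cut into packets.
4. The client asks the namenode for the datanodes that will store the blocks, and the namenode returns the datanodes for each of the blocks.
5. The client sends the packets to the first datanode. With 3 replicas, each datanode forwards the data to the next datanode.

In more detail: the client calls create() on DistributedFileSystem. DistributedFileSystem makes an RPC call to the namenode to create the new file in the namespace. After the namenode's checks pass, DistributedFileSystem returns an FSDataOutputStream, which wraps a DFSOutputStream. As the client writes, DFSOutputStream splits the data into packets and places them on an internal queue (the data queue). The DataStreamer consumes the data queue; it asks the namenode to allocate new blocks on a list of suitable datanodes, which form a pipeline. The DataStreamer streams each packet to datanode 1 in the pipeline, which stores it and forwards it to the next datanode.

DFSOutputStream also maintains a queue of packets waiting for acknowledgement (the ack queue). A packet is removed from it only after the datanodes in the pipeline have acknowledged it. When the client has finished writing it calls close(), which flushes the remaining packets to the datanode pipeline, waits for the acknowledgements and tells the namenode that the file is complete.

On the hadoop client side the code path is: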
org.apache.hadoop.fs.FsShell:
public int run(String argv[]) throws Exception {
  if ("-put".equals(cmd) || "-copyFromLocal".equals(cmd)) {
    Path[] srcs = new Path[argv.length-2];
    for (int j=0 ; i < argv.length-1 ;)
      srcs[j++] = new Path(argv[i++]);
    copyFromLocal(srcs, argv[i++]);
  }
}

org.apache.hadoop.fs.FsShell:
void copyFromLocal(Path[] srcs, String dstf) throws IOException {
  Path dstPath = new Path(dstf);
  FileSystem dstFs = dstPath.getFileSystem(getConf());
  if (srcs.length == 1 && srcs[0].toString().equals("-"))
    copyFromStdin(dstPath, dstFs);
  else
    dstFs.copyFromLocalFile(false, false, srcs, dstPath);
}

org.apache.hadoop.fs.FileSystem:
public void copyFromLocalFile(boolean delSrc, boolean overwrite,
                              Path[] srcs, Path dst)
    throws IOException {
  Configuration conf = getConf();
  FileUtil.copy(getLocal(conf), srcs, this, dst, delSrc, overwrite, conf);
}

org.apache.hadoop.fs.FileUtil:
public static boolean copy(FileSystem srcFS, Path[] srcs,
                           FileSystem dstFS, Path dst,
                           boolean deleteSource,
                           boolean overwrite, Configuration conf)
    throws IOException {
  if (srcs.length == 1)
    return copy(srcFS, srcs[0], dstFS, dst, deleteSource, overwrite, conf);
  for (Path src : srcs) {
    try {
      if (!copy(srcFS, src, dstFS, dst, deleteSource, overwrite, conf))
        returnVal = false;
    } catch (IOException e) {
      gotException = true;
      exceptions.append(e.getMessage());
      exceptions.append("\n");
    }
  }
  return returnVal;
}
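FsShell is the entry point of the hadoop command-line shell. Its run() method parses the shell command. For -put and -copyFromLocal it calls copyFromLocal(), which calls copyFromLocalFile() on the destination file system, which in turn calls FileUtil.copy(). For each source path the work ends up in the single-source copy():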
org.apache.hadoop.fs.FileUtil:
public static boolean copy(FileSystem srcFS, Path src,
                           FileSystem dstFS, Path dst,
                           boolean deleteSource,
                           boolean overwrite,
                           Configuration conf) throws IOException {
  dst = checkDest(src.getName(), dstFS, dst, overwrite);

  if (srcFS.getFileStatus(src).isDir()) {
    checkDependencies(srcFS, src, dstFS, dst);
    if (!dstFS.mkdirs(dst)) {
      return false;
    }
    FileStatus contents[] = srcFS.listStatus(src);
    for (int i = 0; i < contents.length; i++) {
      copy(srcFS, contents[i].getPath(), dstFS,
           new Path(dst, contents[i].getPath().getName()),
           deleteSource, overwrite, conf);
    }
  } else if (srcFS.isFile(src)) {
    InputStream in = null;
    OutputStream out = null;
    try {
      in = srcFS.open(src);
      out = dstFS.create(dst, overwrite);
      IOUtils.copyBytes(in, out, conf, true);
    } catch (IOException e) {
      IOUtils.closeStream(out);
      IOUtils.closeStream(in);
      throw e;
    }
  } else {
    throw new IOException(src.toString() + ": No such file or directory");
  }
  if (deleteSource) {
    return srcFS.delete(src, true);
  } else {
    return true;
  }
}
The call dstFS.create(dst, overwrite); resolves as follows.

(1) FileSystem:
return this.create(f, FsPermission.getDefault(), overwrite, bufferSize, replication, blockSize, progress);

(2) DistributedFileSystem:
return new FSDataOutputStream(dfs.create(getPathName(f), permission, overwrite, true, replication, blockSize, progress, bufferSize), statistics);

The abstract method declared in FileSystem:
public abstract FSDataOutputStream create(Path f, FsPermission permission, boolean overwrite, int bufferSize, short replication, long blockSize, Progressable progress) throws IOException;

(1) The create() overloads in FileSystem all return an FSDataOutputStream and end in the abstract create(), which has to produce that FSDataOutputStream.
(2) The concrete FS for HDFS is DistributedFileSystem. Its create() builds the FSDataOutputStream around the OutputStream returned by dfs.create(), which is a DFSOutputStream.
DFSClient
DFSOutputStream
ClientProtocol(NameNode)
DFSOutputStream(String src, FsPermission masked, boolean overwrite,
    boolean createParent, short replication, long blockSize,
    Progressable progress, int buffersize, int bytesPerChecksum)
    throws IOException {
  this(src, blockSize, progress, bytesPerChecksum, replication);

  computePacketChunkSize(writePacketSize, bytesPerChecksum);

  try {
    namenode.create(
        src, masked, clientName, overwrite, createParent, replication, blockSize);
  } catch(RemoteException re) {
    throw re.unwrapRemoteException(AccessControlException.class,
                                   FileAlreadyExistsException.class,
                                   FileNotFoundException.class,
                                   NSQuotaExceededException.class,
                                   DSQuotaExceededException.class);
  }
  streamer.start();
}
return new FSDataOutputStream (dfs.create(getPathName(f), permission, overwrite, true, replication, blockSize, progress, bufferSize), statistics);
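dfs.create() is the create() method of DFSClient. It returns an OutputStream, which is in fact a DFSOutputStream. The DFSOutputStream constructor does two things: it calls create() on the namenode over RPC, and it calls streamer.start() to start the pipeline thread, DataStreamer. The DataStreamer takes packets from the data queue. With a block of 64M and a packet of 64K, a block corresponds to roughly 1000 packets. The DataStreamer also asks the namenode for new blocks as needed. The namenode side of create() is: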
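The "roughly 1000 packets per block" figure follows from dividing the block size by the packet size. The sketch below does that arithmetic with the 64M and 64K values quoted above. It deliberately ignores packet headers and checksums, which occupy part of each packet in practice, so the real count per block would be somewhat higher. PacketMath and packetsPerBlock are hypothetical names.

```java
class PacketMath {
    /** Number of packets needed to carry blockSize bytes, rounding up. */
    static long packetsPerBlock(long blockSize, long packetSize) {
        return (blockSize + packetSize - 1) / packetSize;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;  // 64M block, as quoted in the text
        long packetSize = 64L * 1024;        // 64K packet, as quoted in the text
        System.out.println(packetsPerBlock(blockSize, packetSize));
    }
}
```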
org.apache.hadoop.hdfs.server.namenode.NameNode:
public void create(String src,
                   FsPermission masked,
                   String clientName,
                   boolean overwrite,
                   boolean createParent,
                   short replication,
                   long blockSize
                   ) throws IOException {
  String clientMachine = getClientMachine();
  if (stateChangeLog.isDebugEnabled()) {
    stateChangeLog.debug("*DIR* NameNode.create: file "
                         +src+" for "+clientName+" at "+clientMachine);
  }
  if (!checkPathLength(src)) {
    throw new IOException("create: Pathname too long.  Limit "
        + MAX_PATH_LENGTH + " characters, " + MAX_PATH_DEPTH + " levels.");
  }
  namesystem.startFile(src,
      new PermissionStatus(UserGroupInformation.getCurrentUser().getShortUserName(),
          null, masked),
      clientName, clientMachine, overwrite, createParent, replication, blockSize);
  myMetrics.incrNumFilesCreated();
  myMetrics.incrNumCreateFileOps();
}

org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
void startFile(String src, PermissionStatus permissions,
               String holder, String clientMachine,
               boolean overwrite, boolean createParent, short replication, long blockSize
               ) throws IOException {
  startFileInternal(src, permissions, holder, clientMachine, overwrite, false,
                    createParent, replication, blockSize);
  getEditLog().logSync();
  if (auditLog.isInfoEnabled() && isExternalInvocation()) {
    final HdfsFileStatus stat = dir.getFileInfo(src);
    logAuditEvent(UserGroupInformation.getCurrentUser(),
                  Server.getRemoteIp(),
                  "create", src, null, stat);
  }
}

org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
private synchronized void startFileInternal(String src,
                                            PermissionStatus permissions,
                                            String holder,
                                            String clientMachine,
                                            boolean overwrite,
                                            boolean append,
                                            boolean createParent,
                                            short replication,
                                            long blockSize
                                            ) throws IOException {
  DatanodeDescriptor clientNode =
      host2DataNodeMap.getDatanodeByHost(clientMachine);

  if (append) {
    //
    // Replace current node with a INodeUnderConstruction.
    // Recreate in-memory lease record.
    //
    INodeFile node = (INodeFile) myFile;
    INodeFileUnderConstruction cons = new INodeFileUnderConstruction(
                                    node.getLocalNameBytes(),
                                    node.getReplication(),
                                    node.getModificationTime(),
                                    node.getPreferredBlockSize(),
                                    node.getBlocks(),
                                    node.getPermissionStatus(),
                                    holder,
                                    clientMachine,
                                    clientNode);
    dir.replaceNode(src, node, cons);
    leaseManager.addLease(cons.clientName, src);

  } else {
    // Now we can add the name to the filesystem. This file has no
    // blocks associated with it.
    //
    checkFsObjectLimit();

    // increment global generation stamp
    long genstamp = nextGenerationStamp();
    INodeFileUnderConstruction newNode = dir.addFile(src, permissions,
        replication, blockSize, holder, clientMachine, clientNode, genstamp);
    if (newNode == null) {
      throw new IOException("DIR* NameSystem.startFile: " +
                            "Unable to add file to namespace.");
    }
    leaseManager.addLease(newNode.clientName, src);
    if (NameNode.stateChangeLog.isDebugEnabled()) {
      NameNode.stateChangeLog.debug("DIR* NameSystem.startFile: "
                                 +"add "+src+" to namespace for "+holder);
    }
  }
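On the namenode, create() ends in FSNamesystem.startFileInternal(). When the request is not a hadoop append, it adds an INode to the namespace as a node under construction with no blocks, increments the generation stamp and records a lease for the client. The client then calls IOUtils.copyBytes() to write the data, block by block: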
Call chain from IOUtils through FSOutputSummer to DFSClient.DFSOutputStream:

IOUtils:
  copyBytes(in, out, conf.getInt("io.file.buffer.size", 4096), close);
  copyBytes(in, out, buffSize);
  out.write(buf, 0, bytesRead);
FSOutputSummer:
  for (int n=0;n<len;n+=write1(b, off+n, len-n))
  write1(b, off+n, len-n)
  writeChecksumChunk(b, off, length, false);
DFSClient.DFSOutputStream:
  writeChunk(b, off, len, checksum);

public static void copyBytes(InputStream in, OutputStream out, int buffSize)
    throws IOException {
  PrintStream ps = out instanceof PrintStream ? (PrintStream)out : null;
  byte buf[] = new byte[buffSize];
  int bytesRead = in.read(buf);
  while (bytesRead >= 0) {
    out.write(buf, 0, bytesRead);
    if ((ps != null) && ps.checkError()) {
      throw new IOException("Unable to write to output stream.");
    }
    bytesRead = in.read(buf);
  }
}
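[Figure: output stream class hierarchy. OutputStream <-- FilterOutputStream <-- DataOutputStream <-- FSDataOutputStream, and OutputStream <-- FSOutputSummer <-- DFSOutputStream. FileSystem and DistributedFileSystem create() return FSDataOutputStream(out), where out is the DFSOutputStream; the DataStreamer sends its data to the datanode.]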
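DFSOutputStream extends FSOutputSummer and overrides writeChunk(). The create() of DistributedFileSystem wraps the DFSOutputStream in an FSDataOutputStream, so writes by the client end up in DFSOutputStream.writeChunk().

The client cuts each block into packets. Assume 3 replicas on datanode1, datanode2 and datanode3. The client sends packet1 to datanode1. While datanode1 stores packet1 and forwards it to datanode2, the client sends packet2 to datanode1. datanode2 stores packet1 and forwards it to datanode3, and the client sends packet3 to datanode1 while datanode1 forwards packet2 to datanode2. The last datanode, datanode3, stores packet1 and sends an ack to datanode2; datanode2 sends the ack to datanode1, which sends it to the client. Only then does the client treat the packet as written.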
org.apache.hadoop.hdfs.DFSClient.DFSOutputStream // @see FSOutputSummer#writeChunk() @Override protected synchronized void writeChunk(byte[] b, int offset, int len, byte[] checksum) throws IOException { checkOpen(); isClosed(); int cklen = checksum.length; int bytesPerChecksum = this.checksum.getBytesPerChecksum(); if (len > bytesPerChecksum) { throw new IOException("writeChunk() buffer size is " + len + " is larger than supported bytesPerChecksum " + bytesPerChecksum); } if (checksum.length != this.checksum.getChecksumSize()) { throw new IOException("writeChunk() checksum size is supposed to be " + this.checksum.getChecksumSize() + " but found to be " + checksum.length); } synchronized (dataQueue) { // If queue is full, then wait till we can create enough space while (!closed && dataQueue.size() + ackQueue.size() > maxPackets) { try { dataQueue.wait(); } catch (InterruptedException e) { } } isClosed(); if (currentPacket == null) { currentPacket = new Packet(packetSize, chunksPerPacket, bytesCurBlock); if (LOG.isDebugEnabled()) { LOG.debug("DFSClient writeChunk allocating new packet seqno=" + currentPacket.seqno + ", src=" + src + ", packetSize=" + packetSize + ", chunksPerPacket=" + chunksPerPacket + ", bytesCurBlock=" + bytesCurBlock);
} } currentPacket.writeChecksum(checksum, 0, cklen); currentPacket.writeData(b, offset, len); currentPacket.numChunks++; bytesCurBlock += len; // If packet is full, enqueue it for transmission // if (currentPacket.numChunks == currentPacket.maxChunks || bytesCurBlock == blockSize) { if (LOG.isDebugEnabled()) { LOG.debug("DFSClient writeChunk packet full seqno=" + currentPacket.seqno + ", src=" + src + ", bytesCurBlock=" + bytesCurBlock + ", blockSize=" + blockSize + ", appendChunk=" + appendChunk); } // // if we allocated a new packet because we encountered a block // boundary, reset bytesCurBlock. // if (bytesCurBlock == blockSize) { currentPacket.lastPacketInBlock = true; bytesCurBlock = 0; lastFlushOffset = 0; } enqueueCurrentPacket(); // If this was the first write after reopening a file, then the above // write filled up any partial chunk. Tell the summer to generate full // crc chunks from now on. if (appendChunk) { appendChunk = false; resetChecksumChunk(bytesPerChecksum); } int psize = Math.min((int)(blockSize-bytesCurBlock), writePacketSize); computePacketChunkSize(psize, bytesPerChecksum); } } //LOG.debug("DFSClient writeChunk done length " + len + // " checksum length " + cklen);
  }

org.apache.hadoop.hdfs.DFSClient.DFSOutputStream

  private synchronized void enqueueCurrentPacket() {
    synchronized (dataQueue) {
      if (currentPacket == null) return;
      dataQueue.addLast(currentPacket);
      dataQueue.notifyAll();
      lastQueuedSeqno = currentPacket.seqno;
      currentPacket = null;
    }
  }
DFSOutputStream keeps two packet queues:

org.apache.hadoop.hdfs.DFSClient.DFSOutputStream

  private LinkedList<Packet> dataQueue = new LinkedList<Packet>();
  private LinkedList<Packet> ackQueue = new LinkedList<Packet>();
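writeChunk() first waits until the data queue and the ack queue together hold no more than maxPackets packets. If currentPacket is null it allocates a new Packet, then writes the checksum and the data of the chunk into that packet. Once the packet is full, or the end of the block is reached, the packet is appended to the data queue. Packets in the data queue are consumed by the DataStreamer thread.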
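The movement of packets between the two queues can be sketched in isolation. The following is a simplified, single-threaded model and not the DFSClient code: the class PacketQueues and its method names are hypothetical, packets are reduced to sequence numbers, and all locking and waiting is left out.

```java
import java.util.LinkedList;

// Hypothetical, simplified model of DFSOutputStream's dataQueue and ackQueue.
class PacketQueues {
    private final LinkedList<Long> dataQueue = new LinkedList<Long>();
    private final LinkedList<Long> ackQueue = new LinkedList<Long>();

    // writeChunk() side: a full packet is appended to the data queue.
    void enqueue(long seqno) {
        dataQueue.addLast(seqno);
    }

    // DataStreamer side: move the head packet to the ack queue and "send" it.
    long sendNext() {
        long seqno = dataQueue.removeFirst();
        ackQueue.addLast(seqno);
        return seqno;
    }

    // ResponseProcessor side: an ack removes the oldest unacknowledged packet.
    void ack(long seqno) {
        if (ackQueue.isEmpty() || ackQueue.getFirst() != seqno) {
            throw new IllegalStateException("unexpected ack for seqno " + seqno);
        }
        ackQueue.removeFirst();
    }

    // Error path: unacknowledged packets return to the front of the data queue.
    void requeueUnacked() {
        dataQueue.addAll(0, ackQueue);
        ackQueue.clear();
    }

    int pending() { return dataQueue.size(); }
    int unacked() { return ackQueue.size(); }
    long nextToSend() { return dataQueue.getFirst(); }
}
```

A packet therefore lives in exactly one queue at a time, which is why the capacity check in writeChunk() adds the two sizes together.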
org.apache.hadoop.hdfs.DFSClient.DFSOutputStream.DataStreamer public void run() { long lastPacket = 0; while (!closed && clientRunning) { // if the Responder encountered an error, shutdown Responder if (hasError && response != null) { try { response.close(); response.join(); response = null; } catch (InterruptedException e) { } } Packet one = null; synchronized (dataQueue) {
// process IO errors if any boolean doSleep = processDatanodeError(hasError, false); // wait for a packet to be sent. long now = System.currentTimeMillis(); while ((!closed && !hasError && clientRunning && dataQueue.size() == 0 && (blockStream == null || ( blockStream != null && now - lastPacket < timeoutValue/2))) || doSleep) { long timeout = timeoutValue/2 - (now-lastPacket); timeout = timeout <= 0 ? 1000 : timeout; try { dataQueue.wait(timeout); now = System.currentTimeMillis(); } catch (InterruptedException e) { } doSleep = false; } if (closed || hasError || !clientRunning) { continue; } try { // get packet to be sent. if (dataQueue.isEmpty()) { one = new Packet(); // heartbeat packet } else { one = dataQueue.getFirst(); // regular data packet } long offsetInBlock = one.offsetInBlock; // get new block from namenode. if (blockStream == null) { LOG.debug("Allocating new block"); nodes = nextBlockOutputStream(src); this.setName("DataStreamer for file " + src + " block " + block); response = new ResponseProcessor(nodes); response.start(); }
if (offsetInBlock >= blockSize) { throw new IOException("BlockSize " + blockSize + " is smaller than data size. " + " Offset of packet in block " + offsetInBlock + " Aborting file " + src); } ByteBuffer buf = one.getBuffer(); // move packet from dataQueue to ackQueue if (!one.isHeartbeatPacket()) { dataQueue.removeFirst(); dataQueue.notifyAll(); synchronized (ackQueue) { ackQueue.addLast(one); ackQueue.notifyAll(); } } // write out data to remote datanode blockStream.write(buf.array(), buf.position(), buf.remaining()); if (one.lastPacketInBlock) { blockStream.writeInt(0); // indicate end-of-block } blockStream.flush(); lastPacket = System.currentTimeMillis(); if (LOG.isDebugEnabled()) { LOG.debug("DataStreamer block " + block + " wrote packet seqno:" + one.seqno + " size:" + buf.remaining() + " offsetInBlock:" + one.offsetInBlock + " lastPacketInBlock:" + one.lastPacketInBlock); } } catch (Throwable e) { LOG.warn("DataStreamer Exception: " + StringUtils.stringifyException(e)); if (e instanceof IOException) { setLastException((IOException)e); } hasError = true;
} } if (closed || hasError || !clientRunning) { continue; } // Is this block full? if (one.lastPacketInBlock) { synchronized (ackQueue) { while (!hasError && ackQueue.size() != 0 && clientRunning) { try { ackQueue.wait(); // wait for acks to arrive from datanodes } catch (InterruptedException e) { } } } LOG.debug("Closing old block " + block); this.setName("DataStreamer for file " + src); response.close(); // ignore all errors in Response try { response.join(); response = null; } catch (InterruptedException e) { } if (closed || hasError || !clientRunning) { continue; } synchronized (dataQueue) { IOUtils.cleanup(LOG, blockStream, blockReplyStream); nodes = null; response = null; blockStream = null; blockReplyStream = null; } } if (progress != null) { progress.progress(); } // This is used by unit test to trigger race conditions. if (artificialSlowdown != 0 && clientRunning) { LOG.debug("Sleeping for artificial slowdown of " +
if (errorIndex < nodes.length) { LOG.info("Excluding datanode " + nodes[errorIndex]); excludedNodes.add(nodes[errorIndex]); } // Connection failed. Let's wait a little bit and retry retry = true; } } while (retry && --count >= 0); if (!success) { throw new IOException("Unable to create new block."); } return nodes; }
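The namenode decides which datanodes store each block. The client asks the namenode for a new block by calling addBlock(), which the namenode serves in FSNamesystem.getAdditionalBlock(). There, DatanodeDescriptor targets[] holds the datanodes chosen for the block, INode[] pathINodes holds the INodes along the file's path, and pendingFile is the last of these, the INode of the file under construction. newBlock is the newly allocated block, and it is returned to the client wrapped in a LocatedBlock.

In org.apache.hadoop.hdfs.DFSClient.DFSOutputStream.nextBlockOutputStream() the returned value lb tells the client which block to write and which datanodes to write it to. org.apache.hadoop.hdfs.DFSClient.DFSOutputStream.createBlockOutputStream() then connects the client to the first datanode.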
org.apache.hadoop.hdfs.DFSClient.DFSOutputStream // connects to the first datanode in the pipeline // Returns true if success, otherwise return failure. // private boolean createBlockOutputStream(DatanodeInfo[] nodes, String client, boolean recoveryFlag) { short pipelineStatus = (short)DataTransferProtocol.OP_STATUS_SUCCESS; String firstBadLink = ""; if (LOG.isDebugEnabled()) { for (int i = 0; i < nodes.length; i++) { LOG.debug("pipeline = " + nodes[i].getName()); } } // persist blocks on namenode on next flush persistBlocks = true; boolean result = false; try { LOG.debug("Connecting to " + nodes[0].getName()); InetSocketAddress target = NetUtils.createSocketAddr(nodes[0].getName()); s = socketFactory.createSocket(); timeoutValue = 3000 * nodes.length + socketTimeout; NetUtils.connect(s, target, timeoutValue); s.setSoTimeout(timeoutValue); s.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);
      LOG.debug("Send buf size " + s.getSendBufferSize());
      long writeTimeout = HdfsConstants.WRITE_TIMEOUT_EXTENSION * nodes.length +
                          datanodeWriteTimeout;

      //
      // Xmit header info to datanode
      //
      DataOutputStream out = new DataOutputStream(
          new BufferedOutputStream(NetUtils.getOutputStream(s, writeTimeout),
                                   DataNode.SMALL_BUFFER_SIZE));
      blockReplyStream = new DataInputStream(NetUtils.getInputStream(s));

      out.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );
      out.write( DataTransferProtocol.OP_WRITE_BLOCK );
      out.writeLong( block.getBlockId() );
      out.writeLong( block.getGenerationStamp() );
      out.writeInt( nodes.length );
      out.writeBoolean( recoveryFlag );       // recovery flag
      Text.writeString( out, client );
      out.writeBoolean(false); // Not sending src node information
      out.writeInt( nodes.length - 1 );
      for (int i = 1; i < nodes.length; i++) {
        nodes[i].write(out);
      }
      accessToken.write(out);
      checksum.writeHeader( out );
      out.flush();

      // receive ack for connect
      pipelineStatus = blockReplyStream.readShort();
      firstBadLink = Text.readString(blockReplyStream);
      if (pipelineStatus != DataTransferProtocol.OP_STATUS_SUCCESS) {
        if (pipelineStatus == DataTransferProtocol.OP_STATUS_ERROR_ACCESS_TOKEN) {
          throw new InvalidBlockTokenException(
              "Got access token error for connect ack with firstBadLink as "
                  + firstBadLink);
        } else {
          throw new IOException("Bad connect ack with firstBadLink as "
              + firstBadLink);
        }
      }
      blockStream = out;
      result = true;     // success
    } catch (IOException ie) {
    } finally {
    }
    return result;
  }
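The client connects only to nodes[0], the first datanode in the pipeline. The header it sends carries the block ID and generation stamp, the size of the pipeline, and the remaining datanodes. The for loop starts at i = 1, so the first datanode receives exactly the list of downstream datanodes that follow it in the pipeline.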
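The way the target list shrinks by one entry at every hop can be illustrated with a reduced header. This is a sketch under stated assumptions, not the real wire format: the class WriteBlockHeaderSketch is hypothetical, the VERSION and OP_WRITE_BLOCK values are placeholders, datanodes are plain strings written with writeUTF instead of DatanodeInfo.write(), and the recovery flag, client name, access token and checksum header are omitted.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical, reduced model of the OP_WRITE_BLOCK header exchange.
class WriteBlockHeaderSketch {
    // Placeholder values for illustration only.
    static final short VERSION = 1;
    static final byte OP_WRITE_BLOCK = 80;

    // Sender side: nodes[0] is the receiver, nodes[1..] are its downstream targets.
    static byte[] encode(long blockId, long genStamp, String[] nodes) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeShort(VERSION);
            out.write(OP_WRITE_BLOCK);
            out.writeLong(blockId);
            out.writeLong(genStamp);
            out.writeInt(nodes.length);        // size of the whole pipeline
            out.writeInt(nodes.length - 1);    // number of downstream targets
            for (int i = 1; i < nodes.length; i++) {
                out.writeUTF(nodes[i]);        // stand-in for DatanodeInfo.write(out)
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Receiver side: read the header and return the downstream targets.
    static String[] decodeTargets(byte[] header) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(header));
            in.readShort();                    // version
            in.readByte();                     // opcode
            in.readLong();                     // block id
            in.readLong();                     // generation stamp
            in.readInt();                      // pipeline size
            int numTargets = in.readInt();
            if (numTargets < 0) {
                throw new IOException("Mislabelled incoming datastream.");
            }
            String[] targets = new String[numTargets];
            for (int i = 0; i < numTargets; i++) {
                targets[i] = in.readUTF();
            }
            return targets;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Encoding a header for three nodes and decoding it yields two targets. This sketch models forwarding by passing those two targets to encode again, so the next header carries a single target, and the last datanode receives an empty list.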
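Back in org.apache.hadoop.hdfs.DFSClient.DFSOutputStream.DataStreamer.run():

Once the datanodes for the block are known and the connection to the first datanode is open, the packet one is moved from the data queue to the ack queue. It stays in the ack queue until an OK ack for it arrives. If a datanode fails, the packets in the ack queue are put back at the front of the data queue so that they are sent again. blockStream.write() sends the packet to the datanode. If the packet is the last one of the block, an int 0 is written after it to mark the end of the block.

Transfers from client to datanode and from datanode to datanode both use DataTransferProtocol.OP_WRITE_BLOCK. On the datanode the request is handled by DataXceiver.writeBlock().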
org.apache.hadoop.hdfs.server.datanode.DataXceiver private void writeBlock(DataInputStream in) throws IOException { DatanodeInfo srcDataNode = null; LOG.debug("writeBlock receive buf size " + s.getReceiveBufferSize() + " tcp no delay " + s.getTcpNoDelay()); // // Read in the header // Block block = new Block(in.readLong(), dataXceiverServer.estimateBlockSize, in.readLong());
LOG.info("Receiving block " + block + " src: " + remoteAddress + " dest: " + localAddress); int pipelineSize = in.readInt(); // num of datanodes in entire pipeline boolean isRecovery = in.readBoolean(); // is this part of recovery? String client = Text.readString(in); // working on behalf of this client boolean hasSrcDataNode = in.readBoolean(); // is src node info present if (hasSrcDataNode) { srcDataNode = new DatanodeInfo(); srcDataNode.readFields(in); } int numTargets = in.readInt(); if (numTargets < 0) { throw new IOException("Mislabelled incoming datastream."); } DatanodeInfo targets[] = new DatanodeInfo[numTargets]; for (int i = 0; i < targets.length; i++) { DatanodeInfo tmp = new DatanodeInfo(); tmp.readFields(in); targets[i] = tmp; } Token<BlockTokenIdentifier> accessToken = new Token<BlockTokenIdentifier>(); accessToken.readFields(in); DataOutputStream replyOut = null; // stream to prev target replyOut = new DataOutputStream( NetUtils.getOutputStream(s, datanode.socketWriteTimeout)); if (datanode.isBlockTokenEnabled) { try { datanode.blockTokenSecretManager.checkAccess(accessToken, null, block, BlockTokenSecretManager.AccessMode.WRITE); } catch (InvalidToken e) { try { if (client.length() != 0) { replyOut.writeShort((short)DataTransferProtocol.OP_STATUS_ERROR_ACCESS_TOKEN); Text.writeString(replyOut, datanode.dnRegistration.getName()); replyOut.flush(); } throw new IOException("Access token verification failed, for client " + remoteAddress + " for OP_WRITE_BLOCK for block " + block); } finally { IOUtils.closeStream(replyOut); } }
    }

    DataOutputStream mirrorOut = null;  // stream to next target
    DataInputStream mirrorIn = null;    // reply from next target
    Socket mirrorSock = null;           // socket to next target
    BlockReceiver blockReceiver = null; // responsible for data handling
    String mirrorNode = null;           // the name:port of next target
    String firstBadLink = "";           // first datanode that failed in connection setup
    short mirrorInStatus = (short)DataTransferProtocol.OP_STATUS_SUCCESS;
    try {
      // open a block receiver and check if the block does not exist
      blockReceiver = new BlockReceiver(block, in,
          s.getRemoteSocketAddress().toString(),
          s.getLocalSocketAddress().toString(),
          isRecovery, client, srcDataNode, datanode);

      //
      // Open network conn to backup machine, if
      // appropriate
      //
      if (targets.length > 0) {
        InetSocketAddress mirrorTarget = null;
        // Connect to backup machine
        mirrorNode = targets[0].getName();
        mirrorTarget = NetUtils.createSocketAddr(mirrorNode);
        mirrorSock = datanode.newSocket();
        try {
          int timeoutValue = datanode.socketTimeout +
                             (HdfsConstants.READ_TIMEOUT_EXTENSION * numTargets);
          int writeTimeout = datanode.socketWriteTimeout +
                             (HdfsConstants.WRITE_TIMEOUT_EXTENSION * numTargets);
          NetUtils.connect(mirrorSock, mirrorTarget, timeoutValue);
          mirrorSock.setSoTimeout(timeoutValue);
          mirrorSock.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);
          mirrorOut = new DataOutputStream(
              new BufferedOutputStream(
                  NetUtils.getOutputStream(mirrorSock, writeTimeout),
                  SMALL_BUFFER_SIZE));
          mirrorIn = new DataInputStream(NetUtils.getInputStream(mirrorSock));

          // Write header: Copied from DFSClient.java!
          mirrorOut.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );
          mirrorOut.write( DataTransferProtocol.OP_WRITE_BLOCK );
          mirrorOut.writeLong( block.getBlockId() );
          mirrorOut.writeLong( block.getGenerationStamp() );
          mirrorOut.writeInt( pipelineSize );
          mirrorOut.writeBoolean( isRecovery );
          Text.writeString( mirrorOut, client );
          mirrorOut.writeBoolean(hasSrcDataNode);
          if (hasSrcDataNode) { // pass src node information
            srcDataNode.write(mirrorOut);
          }
          mirrorOut.writeInt( targets.length - 1 );
          for ( int i = 1; i < targets.length; i++ ) {
            targets[i].write( mirrorOut );
          }
          accessToken.write(mirrorOut);
          blockReceiver.writeChecksumHeader(mirrorOut);
          mirrorOut.flush();

          // read connect ack (only for clients, not for replication req)
          if (client.length() != 0) {
            mirrorInStatus = mirrorIn.readShort();
            firstBadLink = Text.readString(mirrorIn);
            if (LOG.isDebugEnabled() ||
                mirrorInStatus != DataTransferProtocol.OP_STATUS_SUCCESS) {
              LOG.info("Datanode " + targets.length +
                       " got response for connect ack " +
                       " from downstream datanode with firstbadlink as " +
                       firstBadLink);
            }
          }
        } catch (IOException e) {
        }
      }

      // send connect ack back to source (only for clients)
      if (client.length() != 0) {
        if (LOG.isDebugEnabled() ||
            mirrorInStatus != DataTransferProtocol.OP_STATUS_SUCCESS) {
          LOG.info("Datanode " + targets.length +
                   " forwarding connect ack to upstream firstbadlink is " +
                   firstBadLink);
} replyOut.writeShort(mirrorInStatus); Text.writeString(replyOut, firstBadLink); replyOut.flush(); } // receive the block and mirror to the next target String mirrorAddr = (mirrorSock == null) ? null : mirrorNode; blockReceiver.receiveBlock(mirrorOut, mirrorIn, replyOut, mirrorAddr, null, targets.length); // if this write is for a replication request (and not // from a client), then confirm block. For client-writes, // the block is finalized in the PacketResponder. if (client.length() == 0) { datanode.notifyNamenodeReceivedBlock(block, DataNode.EMPTY_DEL_HINT); LOG.info("Received block " + block + " src: " + remoteAddress + " dest: " + localAddress + " of size " + block.getNumBytes()); } if (datanode.blockScanner != null) { datanode.blockScanner.addBlock(block); } } catch (IOException ioe) { } finally { } }
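The datanode reads the list of downstream datanodes from the header into DatanodeInfo targets[]. replyOut is the stream on which it answers the client or the upstream datanode. It then creates a BlockReceiver, which works with three streams: DataInputStream in reads from the upstream node (the client or the previous DataNode), DataOutputStream mirrorOut writes to the next DataNode, and OutputStream out writes to the local disk. If targets.length > 0, this datanode is not the last one in the pipeline and it opens a connection to the next datanode.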
// wait for all outstanding packet responses. And then // indicate responder to gracefully shutdown. if (responder != null) { ((PacketResponder)responder.getRunnable()).close(); } // if this write is for a replication request (and not // from a client), then finalize block. For client-writes, // the block is finalized in the PacketResponder. if (clientName.length() == 0) { // close the block/crc files close(); // Finalize the block. Does this fsync()? block.setNumBytes(offsetInBlock); datanode.data.finalizeBlock(block); datanode.myMetrics.incrBlocksWritten(); } } catch (IOException ioe) { } finally { } } org.apache.hadoop.hdfs.server.datanode.BlockReceiver private int receivePacket() throws IOException { int payloadLen = readNextPacket(); if (payloadLen <= 0) { return payloadLen; } buf.mark(); //read the header buf.getInt(); // packet length offsetInBlock = buf.getLong(); // get offset of packet in block long seqno = buf.getLong(); // get seqno boolean lastPacketInBlock = (buf.get() != 0); int endOfHeader = buf.position();
buf.reset(); if (LOG.isDebugEnabled()){ LOG.debug("Receiving one packet for block " + block + " of length " + payloadLen + " seqno " + seqno + " offsetInBlock " + offsetInBlock + " lastPacketInBlock " + lastPacketInBlock); } setBlockPosition(offsetInBlock); // First write the packet to the mirror: if (mirrorOut != null && !mirrorError) { try { mirrorOut.write(buf.array(), buf.position(), buf.remaining()); mirrorOut.flush(); } catch (IOException e) { handleMirrorOutError(e); } } buf.position(endOfHeader); int len = buf.getInt(); if (len < 0) { throw new IOException("Got wrong length during writeBlock(" + block + ") from " + inAddr + " at offset " + offsetInBlock + ": " + len); } if (len == 0) { LOG.debug("Receiving empty packet for block " + block); } else { offsetInBlock += len; int checksumLen = ((len + bytesPerChecksum - 1)/bytesPerChecksum)* checksumSize; if ( buf.remaining() != (checksumLen + len)) { throw new IOException("Data remaining in packet does not match " + "sum of checksumLen and dataLen"); } int checksumOff = buf.position();
int dataOff = checksumOff + checksumLen; byte pktBuf[] = buf.array(); buf.position(buf.limit()); // move to the end of the data. /* skip verifying checksum iff this is not the last one in the * pipeline and clientName is non-null. i.e. Checksum is verified * on all the datanodes when the data is being written by a * datanode rather than a client. Whe client is writing the data, * protocol includes acks and only the last datanode needs to verify * checksum. */ if (mirrorOut == null || clientName.length() == 0) { verifyChunks(pktBuf, dataOff, len, pktBuf, checksumOff); } try { if (!finalized) { //finally write to the disk : out.write(pktBuf, dataOff, len); // If this is a partial chunk, then verify that this is the only // chunk in the packet. Calculate new crc for this chunk. if (partialCrc != null) { if (len > bytesPerChecksum) { throw new IOException("Got wrong length during writeBlock(" + block + ") from " + inAddr + " " + "A packet can have only one partial chunk."+ " len = " + len + " bytesPerChecksum " + bytesPerChecksum); } partialCrc.update(pktBuf, dataOff, len); byte[] buf = FSOutputSummer.convertToByteStream(partialCrc, checksumSize); checksumOut.write(buf); LOG.debug("Writing out partial crc for data len " + len); partialCrc = null; } else { checksumOut.write(pktBuf, checksumOff, checksumLen); } datanode.myMetrics.incrBytesWritten(len); /// flush entire packet before sending ack flush();
// update length only after flush to disk datanode.data.setVisibleLength(block, offsetInBlock); } } catch (IOException iex) { datanode.checkDiskError(iex); throw iex; } } // put in queue for pending acks if (responder != null) { ((PacketResponder)responder.getRunnable()).enqueue(seqno, lastPacketInBlock); } if (throttler != null) { // throttle I/O throttler.throttle(payloadLen); } return payloadLen; }
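The checksumLen arithmetic in receivePacket() is a ceiling division: every chunk of up to bytesPerChecksum data bytes, including a trailing partial chunk, carries one checksum of checksumSize bytes. A minimal sketch follows; the class ChecksumLength is hypothetical, and the values 512 and 4 used in the note after it are example values, not read from any configuration.

```java
// Hypothetical helper mirroring the checksumLen arithmetic in receivePacket().
class ChecksumLength {
    // Number of checksum bytes that accompany len data bytes.
    static int checksumLen(int len, int bytesPerChecksum, int checksumSize) {
        // Ceiling division: a partial last chunk still needs its own checksum.
        int chunks = (len + bytesPerChecksum - 1) / bytesPerChecksum;
        return chunks * checksumSize;
    }
}
```

With 512-byte chunks and 4-byte checksums, 512 data bytes need 4 checksum bytes, while 513 data bytes need 8. The receiver uses this value to check that buf.remaining() equals checksumLen + len before touching the data.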
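receiveBlock() calls receivePacket() in a loop until the returned length is no longer greater than 0, which marks the end of the block. receivePacket() first forwards the packet to the next datanode and then writes the packet's data and checksums to the local disk. For a write that comes from a client, each received packet is also put into the PacketResponder queue: the responder waits for the ack from the downstream datanode and then sends an ack upstream, so that the ack finally reaches the client. On the client, acks are processed in org.apache.hadoop.hdfs.DFSClient.DFSOutputStream.ResponseProcessor.run(), which removes the acknowledged packet from the ack queue.

OP_READ_BLOCK (81)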
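The client opens a file by calling open() on the FileSystem object, which for HDFS is a DistributedFileSystem. DistributedFileSystem calls the namenode over RPC to find the locations of the first blocks of the file; for each block the namenode returns the addresses of the datanodes that hold a copy. DistributedFileSystem returns an FSDataInputStream to the client. The FSDataInputStream wraps a DFSInputStream, which manages the I/O with the datanodes and the namenode.

The client then calls read() on the stream. DFSInputStream connects to the closest datanode that holds the first block, and data is streamed from that datanode to the client as it calls read() repeatedly. When the end of a block is reached, DFSInputStream closes the connection to that datanode and finds the best datanode for the next block. When the client has finished reading, it calls close() on the FSDataInputStream.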
DataNode
DataNode
public class DataNode extends Configured implements InterDatanodeProtocol, ClientDatanodeProtocol, FSConstants, Runnable, DataNodeMXBean
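The DataNode class implements two protocols: ClientDatanodeProtocol, which is called by clients, and InterDatanodeProtocol, which is called by other DataNodes. Its ipcServer member is the IPC server through which the DataNode serves both ClientDatanodeProtocol and InterDatanodeProtocol requests. When the DataNode class is loaded, its static initializer registers the default configuration resources: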
  static {
    Configuration.addDefaultResource("hdfs-default.xml");
    Configuration.addDefaultResource("hdfs-site.xml");
  }
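Startup begins in main, which calls secureMain; secureMain calls createDataNode to create and start the DataNode. createDataNode calls instantiateDataNode to build the DataNode object and then runDatanodeDaemon. runDatanodeDaemon registers the DataNode with the NameNode and starts the DataNode thread.

instantiateDataNode determines the DataNode's storage directories and calls makeInstance. makeInstance checks the directories and then calls new DataNode(conf, dirs);. The constructor calls startDataNode, which initializes the DataNode: it resolves the NameNode's socket address and connects to the NameNode, calls DatanodeProtocol.versionRequest to obtain the NamespaceInfo, sets up storage and the FSDataset held in data, and creates the DataXceiverServer, the DataBlockScanner, the HttpServer and the ipcServer. In the DataNode's run method, the DataXceiverServer and the ipcServer are started, after which the DataNode enters its service loop.

run() loops over startDistributedUpgradeIfNeeded()/offerService(). offerService is the DataNode's main loop: it sends heartbeats to the NameNode, tells the NameNode about the Blocks the DataNode has just received, and periodically reports all of the DataNode's Blocks to the NameNode.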
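Every heartBeatInterval the DataNode calls sendHeartbeat. Newly received Blocks are kept in receivedBlockList, with a matching entry in delHints: receivedBlockList records the Blocks this DataNode has received, and delHints records a hint about which replica may be deleted. For example, DataXceiver.replaceBlock calls:

datanode.notifyNamenodeReceivedBlock(block, sourceID)

This tells the DataNode that it received the Block from sourceID, so the copy of the Block on sourceID can be deleted once this Block has been stored. The DataNode reports the received Blocks to the NameNode through NameNode.blockReceived. In addition, every blockReportInterval the DataNode sends a report of all its Blocks to the NameNode. The NameNode replies with commands for the DataNode: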
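DNA_TRANSFER: copy Blocks to other DataNodes
DNA_INVALIDATE: delete Blocks
DNA_SHUTDOWN: shut down the DataNode
DNA_REGISTER: the DataNode registers again
DNA_FINALIZE: finalize an upgrade
DNA_RECOVERBLOCK: recover Blocks

For DNA_TRANSFER the DataNode calls transferBlocks. For each Block, transferBlocks starts a DataTransfer thread, and DataTransfer sends the Block to the target DataNode with OP_WRITE_BLOCK. Block recovery is requested by the NameNode as part of lease handling and is carried out by the DataNode.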
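The periodic work of offerService described above can be modeled as a function of the current time. This is a sketch under stated assumptions: the class OfferServiceSketch is hypothetical, time is passed in explicitly instead of being read from the clock, and the RPC calls are replaced by strings naming the call that would be made.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, single-threaded model of the periodic work in offerService().
class OfferServiceSketch {
    private final long heartBeatInterval;    // ms between heartbeats
    private final long blockReportInterval;  // ms between full block reports
    private long lastHeartbeat = 0;
    private long lastBlockReport = 0;
    private final List<String> receivedBlockList = new ArrayList<String>();

    OfferServiceSketch(long heartBeatInterval, long blockReportInterval) {
        this.heartBeatInterval = heartBeatInterval;
        this.blockReportInterval = blockReportInterval;
    }

    // Called when a block has been stored locally.
    void notifyReceivedBlock(String block) {
        receivedBlockList.add(block);
    }

    // One pass of the loop at time 'now'; returns the calls it would make.
    List<String> tick(long now) {
        List<String> calls = new ArrayList<String>();
        if (now - lastHeartbeat > heartBeatInterval) {
            lastHeartbeat = now;
            calls.add("sendHeartbeat");
        }
        if (!receivedBlockList.isEmpty()) {
            calls.add("blockReceived(" + receivedBlockList.size() + ")");
            receivedBlockList.clear();
        }
        if (now - lastBlockReport > blockReportInterval) {
            lastBlockReport = now;
            calls.add("blockReport");
        }
        return calls;
    }
}
```

Received blocks are reported on the next pass regardless of the heartbeat timer, whereas the full block report is much rarer; the intervals passed to the constructor are the caller's choice.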
FSDataset: http://caibinbupt.iteye.com/blog/284365
DataXceiverServer: http://caibinbupt.iteye.com/blog/284979
DataXceiver: http://caibinbupt.iteye.com/blog/284979 http://caibinbupt.iteye.com/blog/286533
BlockReceiver: http://caibinbupt.iteye.com/blog/286259
BlockSender:
DataBlockScanner: http://caibinbupt.iteye.com/blog/286650
NameNode
HDFS NameNode DataNode NameNode inode
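DatanodeProtocol is the interface through which a DataNode talks to the NameNode. Its main methods are:
register: called when a DataNode registers with the NameNode.
sendHeartbeat/blockReport/blockReceived: called from the DataNode's offerService loop.
errorReport: reports an error to the NameNode, for example a Block problem found by BlockReceiver or DataBlockScanner.
nextGenerationStamp: asks the NameNode for a new generation stamp.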
RefreshUserMappingsProtocol { // static{ Configuration.addDefaultResource("hdfs-default.xml"); Configuration.addDefaultResource("hdfs-site.xml"); } public static final int DEFAULT_PORT = 8020; // public static final Log LOG = LogFactory.getLog(NameNode.class.getName()); public static final Log stateChangeLog = LogFactory.getLog("org.apache.hadoop.hdfs.StateChange"); public FSNamesystem namesystem; // TODO: This should private. Use getNamesystem() instead. // Datanode /** RPC server */ private Server server; /** RPC server for HDFS Services communication. BackupNode, Datanodes and all other services should be connecting to this server if it is configured. Clients should only go to NameNode#server */ private Server serviceRpcServer; /** RPC server address */ private InetSocketAddress serverAddress = null; /** RPC server for DN address */ protected InetSocketAddress serviceRPCAddress = null; /** httpServer */ private HttpServer httpServer; /** HTTP server address */ private InetSocketAddress httpAddress = null; private Thread emptier; /** only used for testing purposes */ private boolean stopRequested = false; /** Is service level authorization enabled? */ private boolean serviceAuthEnabled = false; static NameNodeInstrumentation myMetrics; //
  public static void main(String argv[]) throws Exception {
    try {
      ...
      NameNode namenode = createNameNode(argv, null);
      if (namenode != null)
        namenode.join();
    } ...
  }
}
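org.apache.hadoop.hdfs.server.namenode.FSNamesystem does the actual bookkeeping for the Namenode; NameNode delegates most of its work to its FSNamesystem. FSNamesystem keeps several important tables:
file name => list of blocks (persisted through FSImage and the edit log)
the set of all valid blocks
block => list of DataNodes (kept in memory only, rebuilt from the reports of the DataNodes)
DataNode => list of blocks
an LRU cache of the DataNodes that sent a heartbeat recently

The main members of FSNamesystem are: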
public class FSNamesystem implements FSConstants, FSNamesystemMBean,
    NameNodeMXBean, MetricsSource {
  // the namespace directory tree
  public FSDirectory dir;
  // BlocksMap: maps a Block to its inode and to the Datanodes that hold it
  final BlocksMap blocksMap = new BlocksMap(DEFAULT_INITIAL_MAP_CAPACITY,
                                            DEFAULT_MAP_LOAD_FACTOR);
  // corrupt replicas
  public CorruptReplicasMap corruptReplicas = new CorruptReplicasMap();
  // all known datanodes
  NavigableMap<String, DatanodeDescriptor> datanodeMap =
      new TreeMap<String, DatanodeDescriptor>();
  // subset of datanodeMap: the DatanodeDescriptors checked by HeartbeatMonitor
  ArrayList<DatanodeDescriptor> heartbeats = new ArrayList<DatanodeDescriptor>();
  // blocks that need to be replicated
  private UnderReplicatedBlocks neededReplications = new UnderReplicatedBlocks();
  // blocks that are currently being replicated
  private PendingReplicationBlocks pendingReplications;
  // lease management
  public LeaseManager leaseManager = new LeaseManager(this);

  Daemon hbthread = null;          // calls FSNamesystem heartbeatCheck to monitor Datanodes
  public Daemon lmthread = null;   // LeaseMonitor thread
  Daemon smmthread = null;         // safe mode monitor, checks the threshold
  public Daemon replthread = null; // replication thread: schedules work for Datanodes
  private ReplicationMonitor replmon = null; // Replication metrics

  // host of a Datanode -> DatanodeDescriptor
  private Host2NodesMap host2DataNodeMap = new Host2NodesMap();
  // network topology: Data Center, Rack and node
  NetworkTopology clusterMap = new NetworkTopology();
  // DNS-name/IP-address -> RackID
  private DNSToSwitchMapping dnsToSwitchMapping;
  // chooses the target Datanodes for replicas
  ReplicationTargetChooser replicator;
  // (listing continues: which Datanodes may connect to the Namenode)
FSNamesystem
private void initialize(NameNode nn, Configuration conf) throws IOException { this.systemStart = now(); setConfigurationParameters(conf); dtSecretManager = createDelegationTokenSecretManager(conf); this.nameNodeAddress = nn.getNameNodeAddress(); this.registerMBean(conf); // register the MBean for the FSNamesystemStutus this.dir = new FSDirectory(this, conf); StartupOption startOpt = NameNode.getStartupOption(conf); // fsimage edits this.dir.loadFSImage(getNamespaceDirs(conf), getNamespaceEditsDirs(conf), startOpt); long timeTakenToLoadFSImage = now() - systemStart; LOG.info("Finished loading FSImage in " + timeTakenToLoadFSImage + " msecs"); NameNode.getNameNodeMetrics().setFsImageLoadTime(timeTakenToLoadFSImage); this.safeMode = new SafeModeInfo(conf); setBlockTotal(); pendingReplications = new PendingReplicationBlocks( conf.getInt("dfs.replication.pending.timeout.sec", -1) * 1000L); if (isAccessTokenEnabled) { accessTokenHandler = new BlockTokenSecretManager(true, accessKeyUpdateInterval, accessTokenLifetime); } this.hbthread = new Daemon(new HeartbeatMonitor());// Datanode this.lmthread = new Daemon(leaseManager.new Monitor());// this.replmon = new ReplicationMonitor(); this.replthread = new Daemon(replmon); // hbthread.start(); lmthread.start(); replthread.start(); // datanode this.hostsReader = new HostsFileReader(conf.get("dfs.hosts",""), conf.get("dfs.hosts.exclude","")); //, this.dnthread = new Daemon(new DecommissionManager(this).new Monitor( conf.getInt("dfs.namenode.decommission.interval", 30),
conf.getInt("dfs.namenode.decommission.nodes.per.interval", 5))); dnthread.start(); this.dnsToSwitchMapping = ReflectionUtils.newInstance( conf.getClass("topology.node.switch.mapping.impl", ScriptBasedMapping.class, DNSToSwitchMapping.class), conf); /* If the dns to swith mapping supports cache, resolve network * locations of those hosts in the include list, * and store the mapping in the cache; so future calls to resolve * will be fast. */ if (dnsToSwitchMapping instanceof CachedDNSToSwitchMapping) { dnsToSwitchMapping.resolve(new ArrayList<String>(hostsReader.getHosts())); } InetSocketAddress socAddr = NameNode.getAddress(conf); this.nameNodeHostName = socAddr.getHostName(); registerWith(DefaultMetricsSystem.INSTANCE); }
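FSDirectory

FSNamesystem uses FSDirectory to hold the whole HDFS directory tree. The tree is built from INode objects, which describe files and directories in much the same way as an inode does in a UNIX file system. INodeDirectory represents a directory and holds its child INodes; INodeFile represents a file and holds the list of the file's blocks, which are stored on the Datanodes. INodeFileUnderConstruction represents a file that is still being written.

While a file is being written it is an INodeFileUnderConstruction; when the client completes the file, it is converted into an ordinary INodeFile. FSDirectory keeps the filename->blockset mapping current and persists it through its FSImage member, fsImage.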
class FSDirectory implements FSConstants, Closeable {
  final INodeDirectoryWithQuota rootDir; // root INodeDirectory of the hdfs namespace
  FSImage fsImage;                       // the FSImage that persists the namespace
}
  FSDirectory(FSNamesystem ns, Configuration conf) {
    this(new FSImage(), ns, conf);
    ...
  }

  FSDirectory(FSImage fsImage, FSNamesystem ns, Configuration conf) {
    rootDir = new INodeDirectoryWithQuota(INodeDirectory.ROOT_NAME,
        ns.createFsOwnerPermissions(new FsPermission((short)0755)),
        Integer.MAX_VALUE, -1);
    this.fsImage = fsImage;
    ....
    namesystem = ns;
    ....
  }

FSNamesystem calls loadFSImage on its FSDirectory dir to load the fsimage and edits files:
  void loadFSImage(Collection<File> dataDirs, Collection<File> editsDirs,
                   StartupOption startOpt) throws IOException {
    // format before starting up if requested
    if (startOpt == StartupOption.FORMAT) {
      // set the FSImage storage directories (${dfs.name.dir}, /tmp/hadoop/dfs/name)
      fsImage.setStorageDirectories(dataDirs, editsDirs);
      fsImage.format(); // format the FSImage
      startOpt = StartupOption.REGULAR;
    }
    try {
      if (fsImage.recoverTransitionRead(dataDirs, editsDirs, startOpt)) {
        // save the namespace to the storage directories (${dfs.name.dir})
        fsImage.saveNamespace(true);
      }
      FSEditLog editLog = fsImage.getEditLog();
      assert editLog != null : "editLog must be initialized";
      if (!editLog.isOpen())
        editLog.open();
      fsImage.setCheckpointDirectories(null, null);
    }
    ...
  }
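loadFSImage loads the FSImage and the EditLog. The FSImage is a checkpoint of the namespace, and the EditLog records the changes made after that checkpoint; applying the EditLog to the FSImage yields the current namespace. Once the FSImage and EditLog are loaded the namenode can serve hdfs requests over rpc, and the namenode passes those requests on to its FSNamesystem member, namesystem.

In summary: namesystem holds an FSDirectory dir, and dir holds an FSImage fsImage. fsImage persists the hdfs namespace together with the EditLog. The Secondary Namenode periodically merges the namenode's EditLog into the fsimage, which produces a new fsimage and a shorter EditLog.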
INode*
The NameNode represents the files and directories of the file system as inodes. They are implemented by the INode* classes.
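[Figure: class diagram of INode, INodeDirectory, INodeFile, INodeDirectoryWithQuota and INodeFileUnderConstruction]

INode is an abstract class with two subclasses, INodeDirectory for directories and INodeFile for files. INodeDirectoryWithQuota is a directory with a quota, and INodeFileUnderConstruction is a file that HDFS is still constructing.

The members of INode are: name, the name of the directory/file; modificationTime and accessTime, the times of the last modification and the last access; parent, the parent directory; and permission, the access permission. HDFS uses an access control scheme similar to UNIX/Linux. The system keeps a group table and a user table like those of UNIX and assigns an ID to every group and user; permission is stored in INode as a long that holds the group and user as well as the permission bits. INode has many get and set methods. Of the remaining methods, collectSubtreeBlocksAndClear collects the Blocks of all descendants of an INode, and computeContentSummary recursively computes summary information for an INode.

INodeDirectory is the abstraction of a directory managed by HDFS. Its most important member is:
private List<INode> children;
which holds all directories/files in this directory. INodeDirectory also has many simple get and set methods. INodeDirectoryWithQuota extends INodeDirectory by limiting what the INodeDirectory may use, both in the NameSpace and on disk.

INodeFile is a file in HDFS. Its most important member is:
protected BlockInfo blocks[] = null;
the list of Blocks of the file; BlockInfo extends the Block class.

INodeFileUnderConstruction holds information about a file under construction: clientName, the name of the client that holds the lease; clientMachine, the machine of the client that is constructing the file; clientNode, which is set if the request comes from a DataNode; and targets, the nodes that take part in constructing the file.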
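The shape of the INode hierarchy described above can be sketched with three small classes. These are hypothetical stand-ins (SketchINode, SketchINodeFile, SketchINodeDirectory), not the HDFS classes: permissions, times and quotas are left out, blocks are plain long IDs, and resolve() and countFiles() are illustrative helpers rather than methods taken from the source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of INode: a name and a link to the parent directory.
abstract class SketchINode {
    final String name;
    SketchINodeDirectory parent;

    SketchINode(String name) { this.name = name; }

    // Number of files in the subtree rooted at this node.
    abstract int countFiles();
}

// Hypothetical miniature of INodeFile: a file is a list of blocks.
class SketchINodeFile extends SketchINode {
    final long[] blockIds;

    SketchINodeFile(String name, long[] blockIds) {
        super(name);
        this.blockIds = blockIds;
    }

    int countFiles() { return 1; }
}

// Hypothetical miniature of INodeDirectory: a directory is a list of children.
class SketchINodeDirectory extends SketchINode {
    private final List<SketchINode> children = new ArrayList<SketchINode>();

    SketchINodeDirectory(String name) { super(name); }

    <T extends SketchINode> T addChild(T child) {
        child.parent = this;
        children.add(child);
        return child;
    }

    SketchINode getChild(String childName) {
        for (SketchINode child : children) {
            if (child.name.equals(childName)) return child;
        }
        return null;
    }

    // Resolve a path such as "/user/a.txt", treating this directory as the root.
    SketchINode resolve(String path) {
        SketchINode current = this;
        for (String part : path.split("/")) {
            if (part.isEmpty()) continue;
            if (!(current instanceof SketchINodeDirectory)) return null;
            current = ((SketchINodeDirectory) current).getChild(part);
            if (current == null) return null;
        }
        return current;
    }

    int countFiles() {
        int total = 0;
        for (SketchINode child : children) total += child.countFiles();
        return total;
    }
}
```

The recursive countFiles() follows the same pattern as the subtree methods mentioned above: a directory delegates to its children, and a file answers for itself.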
Lease
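A Lease works like a lock on a file that is being written.

LeaseManager manages all Leases. Lease and Monitor are inner classes of LeaseManager. A Lease holds three pieces of information: holder, the client that holds the Lease; lastUpdate, the time of the last renewal; and paths, the files covered by the Lease. LeaseManager looks up Leases by holder and by path, and offers methods such as addLease to add a Lease, renewLease to renew one, and methods to remove one. LeaseManager's Monitor thread periodically checks whether a Lease has expired; for an expired Lease it calls FSNamesystem's internalReleaseLease, which releases the Lease through the LeaseManager.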
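The bookkeeping that LeaseManager performs can be sketched with a map from holder to lease. This is a minimal sketch under stated assumptions: the class LeaseSketch and its methods are hypothetical, time is passed in explicitly, there is a single expiry limit supplied by the caller, and expired() only lists expired holders instead of releasing their files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical miniature of LeaseManager.
class LeaseSketch {
    // One lease per holder, covering all files that holder is writing.
    static final class Lease {
        final String holder;
        long lastUpdate;
        final TreeSet<String> paths = new TreeSet<String>();

        Lease(String holder, long now) {
            this.holder = holder;
            this.lastUpdate = now;
        }
    }

    private final long limitMs; // expiry limit chosen by the caller
    private final Map<String, Lease> leases = new TreeMap<String, Lease>();

    LeaseSketch(long limitMs) { this.limitMs = limitMs; }

    // Add a path to the holder's lease, creating or renewing the lease.
    void addLease(String holder, String src, long now) {
        Lease lease = leases.get(holder);
        if (lease == null) {
            lease = new Lease(holder, now);
            leases.put(holder, lease);
        }
        lease.lastUpdate = now;
        lease.paths.add(src);
    }

    void renewLease(String holder, long now) {
        Lease lease = leases.get(holder);
        if (lease != null) lease.lastUpdate = now;
    }

    // Remove one path; a lease without paths is dropped.
    void removeLease(String holder, String src) {
        Lease lease = leases.get(holder);
        if (lease == null) return;
        lease.paths.remove(src);
        if (lease.paths.isEmpty()) leases.remove(holder);
    }

    // Monitor pass: holders whose lease has not been renewed within the limit.
    List<String> expired(long now) {
        List<String> result = new ArrayList<String>();
        for (Lease lease : leases.values()) {
            if (now - lease.lastUpdate > limitMs) result.add(lease.holder);
        }
        return result;
    }
}
```

Keying the lease by holder rather than by file means that a single renewal from a client keeps all of that client's open files alive.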
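Permission management in Hadoop is similar to that of UNIX. The classes are in org.apache.hadoop.fs.permission. FsAction describes the actions allowed on a file or path. FsPermission records the FsAction for the user, the group and others of a file/path, and its applyUMask method applies a umask to an FsPermission. PermissionStatus describes the permission of a file/path: the user name, the group name and the FsPermission. An INode does not store the PermissionStatus as an object but packs it into a long with the help of SerialNumberManager, which maps the user and group names of a PermissionStatus to numbers. SerialNumberManager is not persisted in the FSImage; its mappings are rebuilt as the FSImage is loaded.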
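The effect of applying a umask can be shown on the nine permission bits directly. This is a sketch, not the FsPermission class: UmaskSketch and its methods are hypothetical, and a permission is reduced to a short holding the usual octal mode.

```java
// Hypothetical illustration of umask handling on a nine-bit permission mode.
class UmaskSketch {
    // Every bit that is set in the umask is cleared in the resulting mode.
    static short applyUMask(short mode, short umask) {
        return (short) (mode & ~umask & 0777);
    }

    // Render the user, group and other bits in the usual rwx notation.
    static String toSymbolic(short mode) {
        String[] actions = {"---", "--x", "-w-", "-wx", "r--", "r-x", "rw-", "rwx"};
        return actions[(mode >> 6) & 7] + actions[(mode >> 3) & 7] + actions[mode & 7];
    }
}
```

For example, a requested mode of 0777 with a umask of 0022 gives 0755, which toSymbolic renders as rwxr-xr-x.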
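With SerialNumberManager, the long in an INode is split into three fields: MODE for the FsPermission, and USER and GROUP for the serial numbers of the user and group names. PermissionChecker performs the actual permission checks.

Lease Management

The lease in hadoop plays a different role from the lease in GFS. In hadoop the lease is held by the client that writes a file, whether the client creates and writes a new file or appends to an existing one, and the lease ensures that a file has only one writer at a time. Lease management covers two parts:
1. Handling the lease during the normal life cycle of a file: create, write, complete.
2. Handling a lease that expires.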
lease management
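To create a file, the client calls create of ClientProtocol, and the namenode adds an INode for the file that is marked as under construction. When the client has finished writing, it calls complete of ClientProtocol, and the INode becomes a completed file.

When the file is created, the client is given the lease for it, and the client must renew the lease while it is writing. As long as one client holds the lease as the writer, no other client can obtain a lease on the same file. The lease is released when the client completes the file. The Namenode implementation of create is: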
public void create(String src, FsPermission masked, String clientName, boolean overwrite, boolean createParent, short replication, long blockSize ) throws IOException { String clientMachine = getClientMachine(); if (stateChangeLog.isDebugEnabled()) { stateChangeLog.debug("*DIR* NameNode.create: file " +src+" for "+clientName+" at "+clientMachine); } if (!checkPathLength(src)) { throw new IOException("create: Pathname too long. Limit " + MAX_PATH_LENGTH + " characters, " + MAX_PATH_DEPTH + " levels."); } namesystem.startFile(src, new PermissionStatus(UserGroupInformation.getCurrentUser().getShortUserName(), null, masked), clientName, clientMachine, overwrite, createParent, replication, blockSize); myMetrics.incrNumFilesCreated(); myMetrics.incrNumCreateFileOps(); }
boolean append, boolean createParent, short replication, long blockSize ) throws IOException { if (NameNode.stateChangeLog.isDebugEnabled()) { NameNode.stateChangeLog.debug("DIR* NameSystem.startFile: src=" + src + ", holder=" + holder + ", clientMachine=" + clientMachine + ", createParent=" + createParent + ", replication=" + replication + ", overwrite=" + overwrite + ", append=" + append); } if (isInSafeMode()) throw new SafeModeException("Cannot create file" + src, safeMode); if (!DFSUtil.isValidName(src)) { throw new IOException("Invalid file name: " + src); } // Verify that the destination does not exist as a directory already. boolean pathExists = dir.exists(src); if (pathExists && dir.isDir(src)) { throw new IOException("Cannot create file "+ src + "; already exists as a directory."); } if (isPermissionEnabled) { if (append || (overwrite && pathExists)) { checkPathAccess(src, FsAction.WRITE); } else { checkAncestorAccess(src, FsAction.WRITE); } } if (!createParent) { verifyParentDir(src); } try { INode myFile = dir.getFileINode(src); recoverLeaseInternal(myFile, src, holder, clientMachine, false);
try { verifyReplication(src, replication, clientMachine); } catch(IOException e) { throw new IOException("failed to create "+e.getMessage()); } if (append) { if (myFile == null) { throw new FileNotFoundException("failed to append to non-existent file " + src + " on client " + clientMachine); } else if (myFile.isDirectory()) { throw new IOException("failed to append to directory " + src +" on client " + clientMachine); } } else if (!dir.isValidToCreate(src)) { if (overwrite) { delete(src, true); } else { throw new IOException("failed to create file " + src +" on client " + clientMachine +" either because the filename is invalid or the file exists"); } } DatanodeDescriptor clientNode = host2DataNodeMap.getDatanodeByHost(clientMachine); if (append) { // // Replace current node with a INodeUnderConstruction. // Recreate in-memory lease record. // INodeFile node = (INodeFile) myFile; INodeFileUnderConstruction cons = new INodeFileUnderConstruction( node.getLocalNameBytes(), node.getReplication(), node.getModificationTime(), node.getPreferredBlockSize(), node.getBlocks(), node.getPermissionStatus(), holder, clientMachine, clientNode); dir.replaceNode(src, node, cons);
leaseManager.addLease(cons.clientName, src); } else { // Now we can add the name to the filesystem. This file has no // blocks associated with it. // checkFsObjectLimit(); // increment global generation stamp long genstamp = nextGenerationStamp(); INodeFileUnderConstruction newNode = dir.addFile(src, permissions, replication, blockSize, holder, clientMachine, clientNode, genstamp); if (newNode == null) { throw new IOException("DIR* NameSystem.startFile: " + "Unable to add file to namespace."); } leaseManager.addLease(newNode.clientName, src); if (NameNode.stateChangeLog.isDebugEnabled()) { NameNode.stateChangeLog.debug("DIR* NameSystem.startFile: " +"add "+src+" to namespace for "+holder); } } } catch (IOException ie) { NameNode.stateChangeLog.warn("DIR* NameSystem.startFile: " +ie.getMessage()); throw ie; } }
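startFile therefore records a lease for the client that writes the file, and lease management keeps it valid for the duration of the write. When the file is completed, the lease created by create is removed again: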
private void finalizeINodeFileUnderConstruction(String src,
    INodeFileUnderConstruction pendingFile) throws IOException {
  NameNode.stateChangeLog.info("Removing lease on file " + src
      + " from client " + pendingFile.clientName);
  leaseManager.removeLease(pendingFile.clientName, src);

  // The file is no longer pending.
  // Create permanent INode, update blockmap
  INodeFile newFile = pendingFile.convertToInodeFile();
  dir.replaceNode(src, pendingFile, newFile);

  // close file and persist block allocations for this file
  dir.closeFile(src, newFile);

  checkReplicationFactor(newFile);
}
lease management
FSNamesystem starts the lease monitor thread in initialize:
this.lmthread = new Daemon(leaseManager.new Monitor());
lmthread.start();
/******************************************************
 * Monitor checks for leases that have expired,
 * and disposes of them.
 ******************************************************/
class Monitor implements Runnable {
  final String name = getClass().getSimpleName();

  /** Check leases periodically. */
  public void run() {
    for(; fsnamesystem.isRunning(); ) {
      synchronized(fsnamesystem) {
        checkLeases();
      }

      try {
        Thread.sleep(2000);
      } catch(InterruptedException ie) {
        if (LOG.isDebugEnabled()) {
          LOG.debug(name + " is interrupted", ie);
        }
      }
    }
  }
}
/** Check the leases beginning from the oldest. */
synchronized void checkLeases() {
  for(; sortedLeases.size() > 0; ) {
    final Lease oldest = sortedLeases.first();
    if (!oldest.expiredHardLimit()) {
      return; // the oldest lease has not expired, so no later one has either
    }

    LOG.info("Lease " + oldest + " has expired hard limit");

    final List<String> removing = new ArrayList<String>();
    // need to create a copy of the oldest lease paths, because
    // internalReleaseLease() removes paths corresponding to empty files,
    // i.e. it needs to modify the collection being iterated over
    // causing ConcurrentModificationException
    String[] leasePaths = new String[oldest.getPaths().size()];
    oldest.getPaths().toArray(leasePaths);
    for(String p : leasePaths) {
      try {
        fsnamesystem.internalReleaseLeaseOne(oldest, p);
      } catch (IOException e) {
        LOG.error("Cannot release the path " + p + " in the lease " + oldest, e);
        removing.add(p);
      }
    }

    for(String p : removing) {
      removeLease(oldest, p);
    }
  }
}
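The expiry loop above relies on the leases being sorted by age, so it can stop at the first lease that is still valid. The following is a minimal, self-contained sketch of that idea only. The class LeaseExpirySketch, its Lease record and the 60 second limit in main are hypothetical and are not the Hadoop LeaseManager; the real class also tracks the paths of each lease.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical stand-in for LeaseManager: leases ordered by last renewal time. */
class LeaseExpirySketch {
    /** Simplified lease record (assumption: one record per holder, no paths). */
    static final class Lease {
        final String holder;
        final long lastUpdate;

        Lease(String holder, long lastUpdate) {
            this.holder = holder;
            this.lastUpdate = lastUpdate;
        }
    }

    // Oldest lease first; ties are broken by holder so distinct leases never collide.
    private final TreeSet<Lease> sortedLeases = new TreeSet<>(
        Comparator.comparingLong((Lease l) -> l.lastUpdate).thenComparing(l -> l.holder));
    private final long hardLimitMs;

    LeaseExpirySketch(long hardLimitMs) {
        this.hardLimitMs = hardLimitMs;
    }

    synchronized void addLease(String holder, long now) {
        sortedLeases.add(new Lease(holder, now));
    }

    /** Removes and returns the holders whose lease is past the hard limit, oldest first. */
    synchronized List<String> checkLeases(long now) {
        List<String> expired = new ArrayList<>();
        while (!sortedLeases.isEmpty()) {
            Lease oldest = sortedLeases.first();
            if (now - oldest.lastUpdate <= hardLimitMs) {
                break; // the set is sorted, so every remaining lease is still valid
            }
            sortedLeases.pollFirst();
            expired.add(oldest.holder);
        }
        return expired;
    }

    synchronized int size() {
        return sortedLeases.size();
    }

    public static void main(String[] args) {
        LeaseExpirySketch manager = new LeaseExpirySketch(60_000L); // arbitrary limit
        manager.addLease("DFSClient_1", 0L);
        manager.addLease("DFSClient_2", 50_000L);
        // At t = 70 s only the first lease is older than the 60 s limit.
        System.out.println(manager.checkLeases(70_000L));
    }
}
```

Passing the current time as a parameter keeps the sketch deterministic; the monitor thread shown above would call the check every 2 seconds with the wall-clock time.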
For every path of an expired lease, checkLeases calls fsnamesystem.internalReleaseLeaseOne(oldest, p) to release the lease on that path. Paths that could not be released are dropped from the lease afterwards with removeLease(oldest, p).

LeaseManager keeps three collections:
private SortedMap<String, Lease> leases = new TreeMap<String, Lease>();             // holder -> lease map
private SortedSet<Lease> sortedLeases = new TreeSet<Lease>();                       // leases in sorted order
private SortedMap<String, Lease> sortedLeasesByPath = new TreeMap<String, Lease>(); // paths -> leases map
addLease looks up the lease of the client in leases; if there is none it creates one and adds it to leases and sortedLeases, otherwise it renews the existing lease. In both cases the path is recorded in sortedLeasesByPath.

Lease recovery is described in the source as follows:
 * Lease Recovery Algorithm
 * 1) Namenode retrieves lease information
 * 2) For each file f in the lease, consider the last block b of f
 * 2.1) Get the datanodes which contain b
 * 2.2) Assign one of the datanodes as the primary datanode p
 * 2.3) p obtains a new generation stamp from the namenode
 * 2.4) p gets the block info from each datanode
 * 2.5) p computes the minimum block length
 * 2.6) p updates the datanodes, which have a valid generation stamp,
 *      with the new generation stamp and the minimum block length
 * 2.7) p acknowledges the namenode the update results
 * 2.8) Namenode updates the BlockInfo
 * 2.9) Namenode removes f from the lease
 *      and removes the lease once all files have been removed
 * 2.10) Namenode commits changes to edit log
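Steps 2.4 to 2.6 of the algorithm can be illustrated with a small sketch. It assumes that a replica counts as valid when its generation stamp is not older than the stamp the primary datanode expects; that reading, the class BlockRecoverySketch and its Replica and SyncPlan types are illustrative assumptions, not Hadoop classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of how the primary datanode p agrees on a block length. */
class BlockRecoverySketch {
    /** One replica of block b as reported by a datanode (step 2.4). */
    static final class Replica {
        final String datanode;
        final long length;
        final long generationStamp;

        Replica(String datanode, long length, long generationStamp) {
            this.datanode = datanode;
            this.length = length;
            this.generationStamp = generationStamp;
        }
    }

    /** Result of steps 2.5 and 2.6: the agreed length and the datanodes to update. */
    static final class SyncPlan {
        final long newLength;
        final List<String> targets;

        SyncPlan(long newLength, List<String> targets) {
            this.newLength = newLength;
            this.targets = targets;
        }
    }

    static SyncPlan plan(List<Replica> replicas, long expectedStamp) {
        long minLength = Long.MAX_VALUE;
        List<String> targets = new ArrayList<>();
        for (Replica r : replicas) {
            // Assumption: a stamp older than the expected one marks a stale replica.
            if (r.generationStamp >= expectedStamp) {
                targets.add(r.datanode);
                minLength = Math.min(minLength, r.length);
            }
        }
        if (targets.isEmpty()) {
            throw new IllegalStateException("no replica with a valid generation stamp");
        }
        return new SyncPlan(minLength, targets);
    }

    public static void main(String[] args) {
        SyncPlan plan = plan(Arrays.asList(
            new Replica("dn1", 100L, 5L),
            new Replica("dn2", 80L, 5L),
            new Replica("dn3", 60L, 3L)), 5L);
        System.out.println(plan.newLength + " bytes on " + plan.targets);
    }
}
```

Taking the minimum length over the valid replicas means every surviving replica can be truncated to a length that all of them actually hold.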
Heartbeat
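The heartbeat mechanism is built on Hadoop RPC.

1. A hadoop cluster is organized as master/slave: the masters are the Namenode and the Jobtracker, the slaves are the Datanodes and the Tasktrackers.
2. When a master starts, it opens an ipc server and waits for the heartbeats of the slaves.
3. When a slave starts, it connects to the master and then sends it a heartbeat every 3 seconds (the related setting is heartbeat.recheck.interval). The heartbeat reports the slave's state to the master, and the master uses the return value of the heartbeat to pass commands back to the slave.
4. Both the namenode/datanode communication and the jobtracker/tasktracker communication are carried by heartbeats.

The Datanode sends its heartbeats to the Namenode from the Datanode's offerService method: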
/**
 * Main loop for the DataNode. Runs until shutdown,
 * forever calling remote NameNode functions.
 */
public void offerService() throws Exception {
  LOG.info("using BLOCKREPORT_INTERVAL of " + blockReportInterval + "msec"
      + " Initial delay: " + initialBlockReportDelay + "msec");

  //
  // Now loop for a long time....
  //
  while (shouldRun) {
    try {
      long startTime = now();

      //
      // Every so often, send heartbeat or block-report
      //
      if (startTime - lastHeartbeat > heartBeatInterval) {
        //
        // All heartbeat messages include following info:
        // -- Datanode name
        // -- data transfer port
        // -- Total capacity
        // -- Bytes remaining
        //
        lastHeartbeat = startTime;
        DatanodeCommand[] cmds = namenode.sendHeartbeat(dnRegistration,
            data.getCapacity(),
            data.getDfsUsed(),
            data.getRemaining(),
            xmitsInProgress.get(),
            getXceiverCount());
        myMetrics.addHeartBeat(now() - startTime);
        //LOG.info("Just sent heartbeat, with name " + localName);
        if (!processCommand(cmds))
          continue;
      }
      // ... (the rest of the loop body and the catch blocks are not shown)
    }
  } // while (shouldRun)
} // offerService
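In Hadoop, Datanode and Namenode normally run on 2 different machines and in 2 different JVMs, so the Datanode cannot call the namenode object directly. In DataNode, namenode is declared as
public DatanodeProtocol namenode = null;
while NameNode is declared as

public class NameNode implements ClientProtocol, DatanodeProtocol,
    NamenodeProtocol, FSConstants,
    RefreshAuthorizationPolicyProtocol,
    RefreshUserMappingsProtocol

The namenode field is therefore a DatanodeProtocol proxy: through Hadoop RPC, a call to sendHeartbeat() made in the Datanode is executed by the Namenode. The DataNode obtains the proxy when it connects to the NameNode in startDataNode: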
// connect to name node
this.namenode = (DatanodeProtocol)
    RPC.waitForProxy(DatanodeProtocol.class,
                     DatanodeProtocol.versionID,
                     nameNodeAddr,
                     conf);
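Over RPC, a heartbeat from the Datanode to the Namenode proceeds as follows:
1) The datanode obtains a proxy for the namenode.
2) The datanode calls sendHeartbeat on the namenode proxy.
3) The proxy wraps the call (method name and arguments) in an Invocation and passes it to client.call.
4) client.call wraps the Invocation in a Call.
5) client.call sends the Call to the namenode.
6) The namenode receives the Call, processes it on the real namenode object, and returns the result, a DatanodeCommand[].

The Namenode side of sendHeartbeat is: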
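The proxy steps listed above can be imitated in a single JVM with java.lang.reflect.Proxy. This is only a sketch under stated assumptions: HeartbeatProtocol, Invocation and Server below are hypothetical, heavily reduced stand-ins, and the network transfer between client and server is replaced by a direct method call.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RpcProxySketch {
    /** Hypothetical, reduced stand-in for DatanodeProtocol. */
    interface HeartbeatProtocol {
        String[] sendHeartbeat(String datanode, long capacity, long remaining);
    }

    /** What the proxy hands to the transport: method name, parameter types, arguments. */
    static final class Invocation {
        final String methodName;
        final Class<?>[] parameterTypes;
        final Object[] args;

        Invocation(Method method, Object[] args) {
            this.methodName = method.getName();
            this.parameterTypes = method.getParameterTypes();
            this.args = args;
        }
    }

    /** Plays the namenode side: resolves the method and runs it on the real object. */
    static final class Server {
        private final HeartbeatProtocol instance;
        final List<String> received = new ArrayList<>();

        Server(HeartbeatProtocol instance) {
            this.instance = instance;
        }

        Object call(Invocation inv) throws Exception {
            Method m = HeartbeatProtocol.class.getMethod(inv.methodName, inv.parameterTypes);
            received.add(inv.methodName);
            return m.invoke(instance, inv.args);
        }
    }

    /** Plays the datanode side: every call on the proxy becomes an Invocation. */
    static HeartbeatProtocol getProxy(Server server) {
        InvocationHandler handler =
            (proxy, method, args) -> server.call(new Invocation(method, args));
        return (HeartbeatProtocol) Proxy.newProxyInstance(
            HeartbeatProtocol.class.getClassLoader(),
            new Class<?>[] {HeartbeatProtocol.class},
            handler);
    }

    public static void main(String[] args) {
        Server namenode = new Server((dn, capacity, remaining) -> new String[] {"ack:" + dn});
        HeartbeatProtocol proxy = getProxy(namenode);
        System.out.println(Arrays.toString(proxy.sendHeartbeat("dn1", 100L, 40L)));
    }
}
```

The caller only sees the interface, which is why the DataNode code can hold the remote namenode in a plain DatanodeProtocol field.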
/**
 * Data node notify the name node that it is alive
 * Return an array of block-oriented commands for the datanode to execute.
 * This will be either a transfer or a delete operation.
 */
public DatanodeCommand[] sendHeartbeat(DatanodeRegistration nodeReg,
                                       long capacity,
                                       long dfsUsed,
                                       long remaining,
                                       int xmitsInProgress,
                                       int xceiverCount) throws IOException {
  verifyRequest(nodeReg);
  return namesystem.handleHeartbeat(nodeReg, capacity, dfsUsed, remaining,
      xceiverCount, xmitsInProgress);
}
DatanodeProtocol defines the action codes that a DatanodeCommand can carry:
/**
 * Determines actions that data node should perform
 * when receiving a datanode command.
 */
final static int DNA_UNKNOWN = 0;                 // unknown action
final static int DNA_TRANSFER = 1;                // transfer blocks to another datanode
final static int DNA_INVALIDATE = 2;              // invalidate blocks
final static int DNA_SHUTDOWN = 3;                // shutdown node
final static int DNA_REGISTER = 4;                // re-register
final static int DNA_FINALIZE = 5;                // finalize previous upgrade
final static int DNA_RECOVERBLOCK = 6;            // request a block recovery
final static int DNA_ACCESSKEYUPDATE = 7;         // update access key
final static int DNA_BALANCERBANDWIDTHUPDATE = 8; // update balancer bandwidth
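FSNamesystem.handleHeartbeat proceeds as follows:
1. It calls getDatanode to obtain the DatanodeDescriptor nodeinfo. If nodeinfo is null, the NameNode does not know the StorageID and returns DatanodeCommand.REGISTER, which tells the DataNode to register again.
2. It checks isDecommissioned and throws DisallowedDatanodeException for a node that is no longer allowed.
The rest of the method updates the statistics and collects the commands to return: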
      updateStats(nodeinfo, false);
      nodeinfo.updateHeartbeat(capacity, dfsUsed, remaining, xceiverCount);
      updateStats(nodeinfo, true);

      // check lease recovery
      cmd = nodeinfo.getLeaseRecoveryCommand(Integer.MAX_VALUE);
      if (cmd != null) {
        return new DatanodeCommand[] {cmd};
      }

      ArrayList<DatanodeCommand> cmds = new ArrayList<DatanodeCommand>();
      // check pending replication
      cmd = nodeinfo.getReplicationCommand(
          maxReplicationStreams - xmitsInProgress);
      if (cmd != null) {
        cmds.add(cmd);
      }
      // check block invalidation
      cmd = nodeinfo.getInvalidateBlocks(blockInvalidateLimit);
      if (cmd != null) {
        cmds.add(cmd);
      }
      // check access key update
      if (isAccessTokenEnabled && nodeinfo.needKeyUpdate) {
        cmds.add(new KeyUpdateCommand(accessTokenHandler.exportKeys()));
        nodeinfo.needKeyUpdate = false;
      }
      // check for balancer bandwidth update
      if (nodeinfo.getBalancerBandwidth() > 0) {
        cmds.add(new BalancerBandwidthCommand(nodeinfo.getBalancerBandwidth()));
        // set back to 0 to indicate that datanode has been sent the new value
        nodeinfo.setBalancerBandwidth(0);
      }
      if (!cmds.isEmpty()) {
        return cmds.toArray(new DatanodeCommand[cmds.size()]);
      }
    }
  }

  // check distributed upgrade
  cmd = getDistributedUpgradeCommand();
  if (cmd != null) {
    return new DatanodeCommand[] {cmd};
  }
  return null;
}
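The order of the checks in the excerpt above determines what a heartbeat reply contains: a lease recovery command is returned alone, otherwise the pending block commands are batched, and the upgrade command is only considered when nothing else is pending. The sketch below reproduces just that precedence with strings in place of DatanodeCommand objects. HeartbeatReplySketch is a hypothetical class; it omits the key update and balancer bandwidth commands, and it returns an empty list where the original returns null.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class HeartbeatReplySketch {
    /** Each argument is a pending command for one datanode; null means nothing pending. */
    static List<String> reply(String leaseRecovery, String replication,
                              String invalidation, String upgrade) {
        if (leaseRecovery != null) {
            // Lease recovery is returned on its own, before anything else is looked at.
            return Collections.singletonList(leaseRecovery);
        }
        List<String> cmds = new ArrayList<>();
        if (replication != null) {
            cmds.add(replication);
        }
        if (invalidation != null) {
            cmds.add(invalidation);
        }
        if (!cmds.isEmpty()) {
            return cmds;
        }
        if (upgrade != null) {
            return Collections.singletonList(upgrade);
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        System.out.println(reply(null, "transfer", "invalidate", "upgrade"));
    }
}
```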
10
HDFS
HDFS HDFS
HDFS
// FileSystem public class FileCopyWithProgress { public static void main(String[] args) throws Exception { String localSrc = args[0]; String dst = args[1]; InputStream in = new BufferedInputStream(new FileInputStream(localSrc)); Configuration conf = new Configuration(); // FileSystem HDFS FileSystem fs = FileSystem.get(URI.create(dst), conf); OutputStream out = fs.create(new Path(dst), new Progressable() { //MapReduce public void progress() { System.out.print("."); } }); IOUtils.copyBytes(in, out, 4096, true); } } hadoop FileCopyWithProgress input/1.txt hdfs://localhost/user/hadoop/1.txt // FileSystem API public class FileSystemCat { public static void main(String[] args) throws Exception { String uri = args[0]; Configuration conf = new Configuration(); // FileSystem HDFS FileSystem fs = FileSystem.get(URI.create(uri), conf); InputStream in = null; try { in = fs.open(new Path(uri));
FileSystem
FileSystem CACHE: cache cache statisticsTable: key: CACHE statistics: deleteOnExit: Java clientFinalizer: FileSystem FileSystem getFileBlockLocations: exists: isFile: getContentSummary: listStatus: globStatus: Linux
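Further FileSystem methods are getHomeDirectory, get/setWorkingDirectory, copyFromLocalFile, copyToLocalFile, moveFromLocalFile, moveToLocalFile, getFileStatus, setPermission, setOwner, setTimes, getAllStatistics and getStatistics.
get: returns the FileSystem for a URI and relies on createFileSystem to create it.
createFileSystem: selects the FileSystem implementation from the scheme of the URI. Each scheme maps to one FileSystem class; Hadoop provides the implementation for HDFS.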
FileSystem fs = FileSystem.get(URI.create(dst), conf);
uri
FileSystem.Cache
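The cache is a HashMap that maps a Key to a FileSystem.
The Key identifies a FileSystem. Its fields are:
scheme: the scheme of the URI; for the URI http://server/index.html the scheme is http
authority: the authority of the URI; in the example above the authority is server
ugi: the user information (UserGroupInformation)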
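A cache keyed this way can be sketched in a few lines. FsCacheSketch below is a hypothetical, generic stand-in rather than FileSystem.Cache: the user is a plain string instead of a UserGroupInformation, the cached value is any object produced by a factory, and lower-casing scheme and authority is a choice made for this sketch.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;

/** Hypothetical cache: one instance per (scheme, authority, user). */
class FsCacheSketch<T> {
    static final class Key {
        final String scheme;
        final String authority;
        final String user;

        Key(URI uri, String user) {
            // The path is deliberately not part of the key.
            this.scheme = uri.getScheme() == null ? "" : uri.getScheme().toLowerCase();
            this.authority = uri.getAuthority() == null ? "" : uri.getAuthority().toLowerCase();
            this.user = user;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key other = (Key) o;
            return scheme.equals(other.scheme)
                && authority.equals(other.authority)
                && Objects.equals(user, other.user);
        }

        @Override
        public int hashCode() {
            return Objects.hash(scheme, authority, user);
        }
    }

    private final Map<Key, T> map = new HashMap<>();

    /** Returns the cached instance for the key, creating it with the factory on first use. */
    synchronized T get(URI uri, String user, Function<URI, T> factory) {
        return map.computeIfAbsent(new Key(uri, user), k -> factory.apply(uri));
    }

    public static void main(String[] args) {
        FsCacheSketch<Object> cache = new FsCacheSketch<>();
        Object a = cache.get(URI.create("hdfs://localhost/user/hadoop/1.txt"), "hadoop", u -> new Object());
        Object b = cache.get(URI.create("hdfs://localhost/tmp"), "hadoop", u -> new Object());
        System.out.println(a == b); // same scheme, authority and user
    }
}
```

Because the path is not part of the key, all paths on the same cluster opened by the same user share one instance.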
FileSystem.Statistics
scheme: the scheme of the file system; for HDFS it is hdfs
bytesRead: number of bytes read, an AtomicLong
bytesWritten: number of bytes written, an AtomicLong

private final String scheme;
private AtomicLong bytesRead = new AtomicLong();
private AtomicLong bytesWritten = new AtomicLong();
private AtomicInteger readOps = new AtomicInteger();
private AtomicInteger largeReadOps = new AtomicInteger();
private AtomicInteger writeOps = new AtomicInteger();
Path
/ URI
BlockLocation
HDFS Block
private String[] hosts;         // hostnames of datanodes
private String[] names;         // hostname:portNumber of datanodes
private String[] topologyPaths; // full path name in network topology
private long offset;            // offset of the block in the file
private long length;            // length of the block
FileStatus
/ UnixLinux
private Path path;
private long length;
private boolean isdir;
private short block_replication;
private long blocksize;
private long modification_time;
private long access_time;
private FsPermission permission;
private String owner;
private String group;
FsPermission
/ POSIX
FSDataOutputStream
DataOutputStream Syncable sync FSDataOutputStream FileSystem
FSDataInputStream
DataInputStream Seekable PositionReadable seek FSDataInputStream FileSystem FileSystem FileSystem FileSystem.get
/** Returns the FileSystem for this URI's scheme and authority. The scheme
 * of the URI determines a configuration property name,
 * <tt>fs.<i>scheme</i>.class</tt> whose value names the FileSystem class.
 * The entire URI is passed to the FileSystem instance's initialize method.
 */
public static FileSystem get(URI uri, Configuration conf) throws IOException {
  String scheme = uri.getScheme();
  String authority = uri.getAuthority();

  if (scheme == null) {                          // no scheme: use default FS
    return get(conf);
  }

  if (authority == null) {                       // no authority
    URI defaultUri = getDefaultUri(conf);
    if (scheme.equals(defaultUri.getScheme())    // if scheme matches default
        && defaultUri.getAuthority() != null) {  // & default has authority
      return get(defaultUri, conf);              // return default
    }
  }

  String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme);
  if (conf.getBoolean(disableCacheName, false)) {
    return createFileSystem(uri, conf);
  }

  return CACHE.get(uri, conf);
}
FileSystem
private static FileSystem createFileSystem(URI uri, Configuration conf)
    throws IOException {
  Class<?> clazz = conf.getClass("fs." + uri.getScheme() + ".impl", null);
  LOG.debug("Creating filesystem for " + uri);
  if (clazz == null) {
    throw new IOException("No FileSystem for scheme: " + uri.getScheme());
  }
  FileSystem fs = (FileSystem)ReflectionUtils.newInstance(clazz, conf);
  fs.initialize(uri, conf);
  return fs;
}
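The pattern in createFileSystem, looking up a class name under fs.<scheme>.impl and instantiating it by reflection, can be tried without Hadoop. In the sketch below a plain Map stands in for Configuration, and SimpleFs and FakeHdfs are hypothetical types invented for the example; unchecked exceptions replace the IOException of the original.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

class FsFactorySketch {
    /** Hypothetical, minimal stand-in for the FileSystem base class. */
    interface SimpleFs {
        void initialize(URI uri);
        String describe();
    }

    /** Hypothetical implementation registered for the hdfs scheme in this sketch. */
    static class FakeHdfs implements SimpleFs {
        private URI uri;

        public FakeHdfs() {
            // reflection needs a no-argument constructor
        }

        @Override
        public void initialize(URI uri) {
            this.uri = uri;
        }

        @Override
        public String describe() {
            return "FakeHdfs@" + uri.getAuthority();
        }
    }

    /** Looks up "fs.<scheme>.impl", instantiates the class reflectively and initializes it. */
    static SimpleFs create(URI uri, Map<String, String> conf) {
        String className = conf.get("fs." + uri.getScheme() + ".impl");
        if (className == null) {
            throw new IllegalArgumentException("No FileSystem for scheme: " + uri.getScheme());
        }
        try {
            Class<?> clazz = Class.forName(className);
            SimpleFs fs = (SimpleFs) clazz.getDeclaredConstructor().newInstance();
            fs.initialize(uri);
            return fs;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("fs.hdfs.impl", FakeHdfs.class.getName());
        System.out.println(create(URI.create("hdfs://localhost/user/hadoop"), conf).describe());
    }
}
```

Keeping the class name in configuration is what lets a new scheme be supported without changing the code that calls get.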
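The FileSystem implementation is chosen by Scheme. For the hdfs scheme, the FileSystem property fs.hdfs.impl is org.apache.hadoop.hdfs.DistributedFileSystem, and the class is instantiated through JAVA reflection. DistributedFileSystem (DFS) is the FileSystem implementation for HDFS. DistributedFileSystem delegates its work to a DFSClient: DFSClient is the part that actually talks to the Hadoop cluster, while DistributedFileSystem adapts it to the Hadoop FileSystem interface.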
HDFS org.apache.hadoop.fs DistributedFileSystem DFSClien DFSClient hdfs-default.xml hdfs-site.xml namenode HDFS uri namenode checkPath schemeport authority
makeQualified getPathName DFSClient Hadoop ClientProtocol NameNode Socket DataNode /Hadoop DistributedFileSystem DFSClient DFSClient DFSClient
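DFSClient constants and fields:
MAX_BLOCK_ACQUIRE_FAILURES: 3
TCP_WINDOW_SIZE: the TCP window size, 128KB; seek uses TCP_WINDOW_SIZE to decide whether it can stay on the current TCP connection
rpcNamenode: the RPC proxy to the namenode
namenode: rpcNamenode wrapped with Retry and a RetryPolicy
leasechecker: the lease checker of the client
defaultBlockSize: 64MB
defaultReplication: 3
socketTimeout: socket timeout, 60 seconds
datanodeWriteTimeout: datanode write timeout, 480 seconds
writePacketSize: packet size, 64KB
maxBlockAcquireFailures: 3, the number of failed attempts to acquire a block from a datanode that is tolerated

Configuration properties read by DFSClient:
dfs.socket.timeout: sets socketTimeout
dfs.datanode.socket.write.timeout: sets datanodeWriteTimeout
dfs.write.packet.size: sets the packet size
dfs.client.max.block.acquire.failures: sets maxBlockAcquireFailures
mapred.task.id: the ID of the map or reduce task; if it is set, clientName is DFSClient_ followed by the task ID, otherwise clientName is DFSClient_ followed by a random number
dfs.block.size: sets defaultBlockSize
dfs.replication: sets defaultReplication

DFSClient methods (most of them are RPC calls to the namenode):
checkOpen: checks clientRunning;
getBlockLocations: obtains LocatedBlocks from the namenode; the LocatedBlocks name the datanodes of each block and are converted to BlockLocation objects;
getFileChecksum: obtains the checksum of a file; the block checksums are fetched from the datanodes as MD5 values and combined into an MD5 for the file;
bestNode: picks a datanode that is not in deadNodes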
11
MapReduce
MapReduce Google 1TB
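A job is split into Map tasks and Reduce tasks. Four independent entities take part in a MapReduce job run:
1. Client: submits the MapReduce job.
2. JobTracker: coordinates the run of the job.
3. TaskTracker: runs the Map and Reduce tasks that the job was split into.
4. Shared FileSystem (normally HDFS): used to share the job files between the other entities.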
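1. Job submission. Job's waitForCompletion(true) submits the job. The client asks the jobtracker for a new job ID (step 2), computes the InputSplits, and copies the resources needed to run the job (the job JAR, the configuration and the computed splits) into a directory named after the job ID in the jobtracker's filesystem. The job JAR is copied with a high replication factor, controlled by mapred.submit.replication (default 10) (step 3). The client then tells the jobtracker that the job is ready to run (step 4).
2. Job initialization. The JobTracker puts the job into a queue, from which the job scheduler picks it up and initializes it (step 5). To build the task list of the Job, the scheduler retrieves the InputSplits (step 6) and creates one map task for each InputSplit.
3. Task assignment. Each TaskTracker periodically sends a heartbeat to the JobTracker. The heartbeat tells the jobtracker that the tasktracker is alive and whether it is ready for a new task (step 7); if it is, the jobtracker assigns it a task through the return value of the heartbeat. Which job the task is taken from is decided by the scheduler: Hadoop MapReduce schedules FIFO by default, with the Fair Scheduler and the Capacity Scheduler as alternatives. The jobtracker hands out map and reduce tasks according to the free slots of the tasktracker.
4. Task execution. The tasktracker first copies the job JAR from the shared filesystem to the tasktracker's local filesystem (step 8). The tasktracker then creates a TaskRunner, and the TaskRunner launches a child JVM to run the task (step 9).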
Hadoop wordcount
hadoop jar hadoop-examples-1.0.4.jar wordcount /usr/input /usr/output
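The JobTracker splits the job into Map tasks (M1, M2, M3) and Reduce tasks (R1, R2). Map and Reduce tasks are executed by TaskTrackers, each task in a Java process started by its TaskTracker.
The input usually resides on HDFS and is read through an InputFormat, which also exists for other sources (ASCII files, JDBC and so on). The InputFormat divides the input into InputSplits (splite1 to splite5) and supplies a RecordReader that turns each split into <k,v> pairs, which are passed to map one <k,v> at a time.
map emits its results with context.collect (OutputCollector.collect in the old API); the context hides from the Mapper what happens next. On the Mapper side the output passes through a Partitioner and, if one is configured, a Combiner.
The Combiner merges the <k,v> pairs of one Mapper that share a key into a list per key, and the Partitioner decides which Reduce task receives each key, so that in M1 the output leaves the Mapper through Combiner and Partitioner.
Between Map and Reduce there are 3 steps: Shuffle, sort and reduce. In Hadoop MapReduce the Map output is sorted by key, and all values of one key go to the same Reducer. Each Reducer fetches its share of the output of every Mapper over HTTP (Shuffle), merges the <key,value> pairs of the Mappers into <key, (list of values)> (sort), and passes each of them to Reducer.reduce. The results are written through an OutputFormat, usually to the DFS.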
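The data flow just described (map, combine, partition, shuffle and sort, reduce) can be followed end to end in an in-memory word count. The sketch is an illustration under simplifying assumptions: WordCountFlowSketch is a hypothetical class, each input string plays the role of one InputSplit, and a sorted map per reducer replaces the HTTP shuffle. The partition function follows the usual hash-modulo rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountFlowSketch {
    /** Hash partitioning: non-negative key hash modulo the number of reducers. */
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Runs map, combine, shuffle/sort and reduce; returns one result map per reducer. */
    static List<Map<String, Integer>> run(List<String> splits, int numReducers) {
        // Shuffle target: for each reducer, key -> list of values, sorted by key.
        List<TreeMap<String, List<Integer>>> shuffled = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            shuffled.add(new TreeMap<>());
        }
        for (String split : splits) {
            // Map emits <word, 1>; the combiner sums them per word within this split.
            Map<String, Integer> combined = new TreeMap<>();
            for (String word : split.split("\\s+")) {
                if (!word.isEmpty()) {
                    combined.merge(word, 1, Integer::sum);
                }
            }
            // The partitioner picks the reducer that receives each key.
            for (Map.Entry<String, Integer> e : combined.entrySet()) {
                shuffled.get(partition(e.getKey(), numReducers))
                    .computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                    .add(e.getValue());
            }
        }
        // Reduce: each reducer sums the list of values of every key it received.
        List<Map<String, Integer>> output = new ArrayList<>();
        for (TreeMap<String, List<Integer>> part : shuffled) {
            Map<String, Integer> result = new TreeMap<>();
            for (Map.Entry<String, List<Integer>> e : part.entrySet()) {
                int sum = 0;
                for (int v : e.getValue()) {
                    sum += v;
                }
                result.put(e.getKey(), sum);
            }
            output.add(result);
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b c"), 2));
    }
}
```

Since the combiner and the reducer both only add numbers, the same summing logic can serve in both roles, which is what the wordcount example does when it sets IntSumReducer as Combiner and as Reducer.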
12
1. jobtracker ID2.
HADOOP_HOME/BIN/hadoop
elif [ "$COMMAND" = "jar" ] ; then
  CLASS=org.apache.hadoop.util.RunJar
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"

# run it
exec "$JAVA" -Dproc_$COMMAND $JAVA_HEAP_MAX $HADOOP_OPTS -classpath "$CLASSPATH" $CLASS "$@"
} public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> { private IntWritable result = new IntWritable(); /** * Reducer reduce * void reduce(Text key, Iterable<IntWritable> values, Context context) * k/v map context,(combiner), context */ public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result); } } public static void main(String[] args) throws Exception { //Configurationmap/reduce j hadoop map-reduce Configuration conf = new Configuration(); String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs(); if (otherArgs.length != 2) { System.err.println("Usage: wordcount <in> <out>"); System.exit(2); } Job job = new Job(conf, "word count");// job job.setJarByClass(WordCount.class); job.setMapperClass(TokenizerMapper.class); // job Mapper job.setCombinerClass(IntSumReducer.class); // job Combiner job.setReducerClass(IntSumReducer.class); // job Reduce job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); // map-reduce OutputFormat /** * InputFormat map-reduce job * setInputPaths(): map-reduce job * setInputPath() map-reduce job
*/ FileInputFormat.addInputPath(job, new Path(otherArgs[0])); FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); // job job.runJob(conf); } }
waitForCompletion(true) submits the MapReduce job by calling submit():
/**
 * Submit the job to the cluster and return immediately.
 * @throws IOException
 */
public void submit() throws IOException, InterruptedException,
                            ClassNotFoundException {
  ensureState(JobState.DEFINE);   // the job must still be in the DEFINE state
  setUseNewAPI();

  // Connect to the JobTracker and submit the job
  connect();                      // creates the JobClient for the JobTracker
  info = jobClient.submitJobInternal(conf);
  super.setJobID(info.getID());   // record the job ID
  state = JobState.RUNNING;       // the Job is now RUNNING
}
connect() creates the JobClient. JobClient has the field private JobSubmissionProtocol jobSubmitClient;, which is an RPC proxy to the JobTracker, in the same way as the DataNode holds a proxy to the namenode. The JobClient obtains jobSubmitClient through RPC along this call chain:
connect() --> ugi.doAs() (ugi is the UserGroupInformation of JobContext) --> run() of the PrivilegedExceptionAction --> new JobClient((JobConf) getConfiguration()) --> JobClient.init(JobConf conf) --> this.jobSubmitClient = createRPCProxy(JobTracker.getAddress(conf), conf); --> (JobSubmissionProtocol) RPC.getProxy();
Job.connect()
  |--> UserGroupInformation.doAs()
         |--> PrivilegedExceptionAction<Object>.run()
                |--> JobClient.JobClient(JobConf conf)
                       |--> JobClient.setConf(Configuration conf)
                       |--> JobClient.init(JobConf conf)
                              |--> JobClient.createRPCProxy(InetSocketAddress addr, Configuration conf)
                                     |--> (JobSubmissionProtocol) RPC.getProxy()
jobClient.submitJobInternal(conf);
/** * Internal method for submitting jobs to the system. * @param job the configuration to submit * @return a proxy object for the running job * @throws FileNotFoundException * @throws ClassNotFoundException * @throws InterruptedException * @throws IOException */ public RunningJob submitJobInternal(final JobConf job ) throws FileNotFoundException, ClassNotFoundException, InterruptedException, IOException { /* * configure the command line options correctly on the submitting dfs */ return ugi.doAs(new PrivilegedExceptionAction<RunningJob>() { public RunningJob run() throws FileNotFoundException, ClassNotFoundException, InterruptedException, IOException{ JobConf jobCopy = job; Path jobStagingArea = JobSubmissionFiles.getStagingDir(JobClient.this, jobCopy); // jobtracker ID JobID jobId = jobSubmitClient.getNewJobId(); Path submitJobDir = new Path(jobStagingArea, jobId.toString()); jobCopy.set("mapreduce.job.dir", submitJobDir.toString());
JobStatus status = null; try { populateTokenCache(jobCopy, jobCopy.getCredentials()); copyAndConfigureFiles(jobCopy, submitJobDir); // get delegation token for the dir TokenCache.obtainTokensForNamenodes(jobCopy.getCredentials(), new Path [] {submitJobDir}, jobCopy); Path submitJobFile = JobSubmissionFiles.getJobConfPath(submitJobDir); int reduces = jobCopy.getNumReduceTasks(); InetAddress ip = InetAddress.getLocalHost(); if (ip != null) { job.setJobSubmitHostAddress(ip.getHostAddress()); job.setJobSubmitHostName(ip.getHostName()); } JobContext context = new JobContext(jobCopy, jobId); // Check the output specification // // MapReduce if (reduces == 0 ? jobCopy.getUseNewMapper() : jobCopy.getUseNewReducer()) { org.apache.hadoop.mapreduce.OutputFormat<?,?> output = ReflectionUtils.newInstance(context.getOutputFormatClass(), jobCopy); output.checkOutputSpecs(context); } else { jobCopy.getOutputFormat().checkOutputSpecs(fs, jobCopy); } jobCopy = (JobConf)context.getConfiguration(); // Create the splits for the job FileSystem fs = submitJobDir.getFileSystem(jobCopy); LOG.debug("Creating splits at " + fs.makeQualified(submitJobDir)); int maps = writeSplits(context, submitJobDir); jobCopy.setNumMapTasks(maps); // write "queue admins of the queue to which job is being submitted" // to job file. String queue = jobCopy.getQueueName();
AccessControlList acl = jobSubmitClient.getQueueAdmins(queue); jobCopy.set(QueueManager.toFullPropertyName(queue, QueueACL.ADMINISTER_JOBS.getAclName()), acl.getACLString()); // Write job file to JobTracker's fs // ID jobtracker FSDataOutputStream out = FileSystem.create(fs, submitJobFile, new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION)); try { jobCopy.writeXml(out); } finally { out.close(); } // // Now, actually submit the job (using the submit name) // jobtracker printTokens(jobId, jobCopy.getCredentials()); status = jobSubmitClient.submitJob( jobId, submitJobDir.toString(), jobCopy.getCredentials()); JobProfile prof = jobSubmitClient.getJobProfile(jobId); if (status != null && prof != null) { return new NetworkedJob(status, prof, jobSubmitClient); } else { throw new IOException("Could not launch job"); } } finally { if (status == null) { LOG.info("Cleaning up the staging area " + submitJobDir); if (fs != null && submitJobDir != null) fs.delete(submitJobDir, true); } } } }); }
org.apache.hadoop.mapreduce.OutputFormat / org.apache.hadoop.mapred.OutputFormat
* of the JobTracker. But JobInProgress adds info that's useful for * the JobTracker alone. */ public JobStatus submitJob(JobID jobId, String jobSubmitDir, Credentials ts) throws IOException { JobInfo jobInfo = null; UserGroupInformation ugi = UserGroupInformation.getCurrentUser(); synchronized (this) { if (jobs.containsKey(jobId)) { // job already running, don't start twice return jobs.get(jobId).getStatus(); } jobInfo = new JobInfo(jobId, new Text(ugi.getShortUserName()), new Path(jobSubmitDir)); } // Create the JobInProgress, do not lock the JobTracker since // we are about to copy job.xml from HDFS JobInProgress job = null; try { job = new JobInProgress(this, this.conf, jobInfo, 0, ts); } catch (Exception e) { throw new IOException(e); } synchronized (this) { // check if queue is RUNNING String queue = job.getProfile().getQueueName(); if (!queueManager.isRunning(queue)) { throw new IOException("Queue \"" + queue + "\" is not running"); } try { aclsManager.checkAccess(job, ugi, Operation.SUBMIT_JOB); } catch (IOException ioe) { LOG.warn("Access denied for user " + job.getJobConf().getUser() + ". Ignoring job " + jobId, ioe); job.fail(); throw ioe; } // Check the job if it cannot run in the cluster because of invalid memory // requirements. try { checkMemoryRequirements(job);
} catch (IOException ioe) { throw ioe; } boolean recovered = true; // TODO: Once the Job recovery code is there, // (MAPREDUCE-873) we // must pass the "recovered" flag accurately. // This is handled in the trunk/0.22 if (!recovered) { // Store the information in a file so that the job can be recovered // later (if at all) Path jobDir = getSystemDirectoryForJob(jobId); FileSystem.mkdirs(fs, jobDir, new FsPermission(SYSTEM_DIR_PERMISSION)); FSDataOutputStream out = fs.create(getSystemFileForJob(jobId)); jobInfo.write(out); out.close(); } // Submit the job JobStatus status; try { status = addJob(jobId, job); } catch (IOException ioe) { LOG.info("Job " + jobId + " submission failed!", ioe); status = job.getStatus(); status.setFailureInfo(StringUtils.stringifyException(ioe)); failJob(job); throw ioe; } return status; } }
jobtracker
13
JobTracker job
scheduler
Job InputSplit InputSplit map JobTracker JobTracker start-mapred.sh start-all.sh start-mapred.sh MapReduce
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start jobtracker "$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start tasktracker
hadoop-daemon.sh HADOOP_HOME/BIN/hadoop
  (start)

    mkdir -p "$HADOOP_PID_DIR"

    if [ -f $pid ]; then
      if kill -0 `cat $pid` > /dev/null 2>&1; then
        echo $command running as process `cat $pid`. Stop it first.
        exit 1
      fi
    fi

    if [ "$HADOOP_MASTER" != "" ]; then
      echo rsync from $HADOOP_MASTER
      rsync -a -e ssh --delete --exclude=.svn --exclude='logs/*' --exclude='contrib/hod/logs/*' $HADOOP_MASTER/ "$HADOOP_HOME"
    fi

    hadoop_rotate_log $log
    echo starting $command, logging to $log
    cd "$HADOOP_PREFIX"
    nohup nice -n $HADOOP_NICENESS "$HADOOP_PREFIX"/bin/hadoop --config $HADOOP_CONF_DIR $command "$@" > "$log" 2>&1 < /dev/null &
    echo $! > $pid
    sleep 1; head "$log"
    ;;
HADOOP_HOME/BIN/hadoop
elif [ "$COMMAND" = "jobtracker" ] ; then
  CLASS=org.apache.hadoop.mapred.JobTracker
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_JOBTRACKER_OPTS"

# run it
exec "$JAVA" -Dproc_$COMMAND $JAVA_HEAP_MAX $HADOOP_OPTS -classpath "$CLASSPATH" $CLASS "$@"
startTracker creates the JobTracker and passes it to the scheduler as its TaskTrackerManager:

result = new JobTracker(conf, identifier);
result.taskScheduler.setTaskTrackerManager(result);

The scheduler itself is created in the JobTracker constructor, JobTracker(final JobConf conf, String identifier, Clock clock, QueueManager qm) throws IOException, InterruptedException:

// Create the scheduler
Class<? extends TaskScheduler> schedulerClass
    = conf.getClass("mapred.jobtracker.taskScheduler",
        JobQueueTaskScheduler.class, TaskScheduler.class);
taskScheduler = (TaskScheduler) ReflectionUtils.newInstance(schedulerClass, conf);
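Hadoop ships with three schedulers. The FIFO Scheduler is the Hadoop default: jobs are queued in the order in which they were submitted, and the tasks of a job are scheduled first in, first out. The alternatives are the Capacity Scheduler and the Fair Scheduler. To replace the FIFO Scheduler with the Capacity Scheduler or the Fair Scheduler, set the scheduler class in mapred-site.xml: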
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>
offerService MapReduce
/** * Run forever */ public void offerService() throws InterruptedException, IOException { // Prepare for recovery. This is done irrespective of the status of restart // flag. while (true) { try { recoveryManager.updateRestartCount(); break; } catch (IOException ioe) { LOG.warn("Failed to initialize recovery manager. ", ioe); // wait for some time Thread.sleep(FS_ACCESS_RETRY_PERIOD); LOG.warn("Retrying..."); } } taskScheduler.start(); // Start the recovery after starting the scheduler try { recoveryManager.recover();
} catch (Throwable t) { LOG.warn("Recovery manager crashed! Ignoring.", t); } // refresh the node list as the recovery manager might have added // disallowed trackers refreshHosts(); this.expireTrackersThread = new Thread(this.expireTrackers, "expireTrackers"); this.expireTrackersThread.start(); this.retireJobsThread = new Thread(this.retireJobs, "retireJobs"); this.retireJobsThread.start(); expireLaunchingTaskThread.start(); if (completedJobStatusStore.isActive()) { completedJobsStoreThread = new Thread(completedJobStatusStore, "completedjobsStore-housekeeper"); completedJobsStoreThread.start(); } // start the inter-tracker server once the jt is ready this.interTrackerServer.start(); synchronized (this) { state = State.RUNNING; } LOG.info("Starting RUNNING"); this.interTrackerServer.join(); LOG.info("Stopped interTrackerServer"); }
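Job initialization is driven by the EagerTaskInitializationListener. When a job is submitted, the jobAdded method of the EagerTaskInitializationListener puts the job into the queue
private List<JobInProgress> jobInitQueue = new ArrayList<JobInProgress>();
The start method of the EagerTaskInitializationListener starts the JobInitManager thread: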
/////////////////////////////////////////////////////////////////
// Used to init new jobs that have just been created
/////////////////////////////////////////////////////////////////
class JobInitManager implements Runnable {

  public void run() {
    JobInProgress job = null;
    while (true) {
      try {
        synchronized (jobInitQueue) {
          while (jobInitQueue.isEmpty()) {
            jobInitQueue.wait();
          }
          job = jobInitQueue.remove(0);
        }
        threadPool.execute(new InitJob(job));
      } catch (InterruptedException t) {
        LOG.info("JobInitManagerThread interrupted.");
        break;
      }
    }
    LOG.info("Shutting down thread pool");
    threadPool.shutdownNow();
  }
}
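JobInitManager takes jobs from jobInitQueue and runs an InitJob for each job in the thread pool. InitJob calls initJob on the TaskTrackerManager for the job. The TaskTrackerManager is the JobTracker, and JobTracker.initJob initializes the job by calling job.initTasks(); on the job (a JobInProgress):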
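The hand-off between jobAdded and the JobInitManager thread is a classic wait/notify queue in front of a thread pool. The sketch below shows the pattern in isolation. JobInitQueueSketch is hypothetical: jobs are plain strings, initializing a job only records its ID, and daemon threads are used so that the example cannot keep a JVM alive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class JobInitQueueSketch {
    private final List<String> jobInitQueue = new ArrayList<>();
    private final List<String> initialized = new CopyOnWriteArrayList<>();
    private final ExecutorService threadPool = Executors.newFixedThreadPool(2, r -> {
        Thread t = new Thread(r, "jobInit");
        t.setDaemon(true);
        return t;
    });
    private final Thread manager = new Thread(this::runManager, "jobInitManager");

    void start() {
        manager.setDaemon(true);
        manager.start();
    }

    void stop() {
        manager.interrupt();
    }

    /** Listener side: queue the job and wake up the manager thread. */
    void jobAdded(String jobId) {
        synchronized (jobInitQueue) {
            jobInitQueue.add(jobId);
            jobInitQueue.notifyAll();
        }
    }

    /** Manager side: block until a job is queued, then initialize it in the pool. */
    private void runManager() {
        while (true) {
            try {
                final String job;
                synchronized (jobInitQueue) {
                    while (jobInitQueue.isEmpty()) {
                        jobInitQueue.wait();
                    }
                    job = jobInitQueue.remove(0);
                }
                threadPool.execute(() -> initialized.add(job)); // stands in for initTasks()
            } catch (InterruptedException e) {
                break;
            }
        }
        threadPool.shutdown();
    }

    /** Polls until n jobs are initialized or the timeout expires. */
    boolean awaitInitialized(int n, long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (initialized.size() < n && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return initialized.size() >= n;
    }

    List<String> initialized() {
        return initialized;
    }

    public static void main(String[] args) {
        JobInitQueueSketch queue = new JobInitQueueSketch();
        queue.start();
        queue.jobAdded("job_1");
        queue.jobAdded("job_2");
        System.out.println(queue.awaitInitialized(2, 5000L) + " " + queue.initialized());
        queue.stop();
    }
}
```

Doing the initialization in the pool rather than in the manager thread keeps one slow job from holding up the jobs queued behind it.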
/** * Construct the splits, etc. This is invoked from an async * thread so that split-computation doesn't block anyone. */ public synchronized void initTasks() throws IOException, KillInterruptedException, UnknownHostException { if (tasksInited || isComplete()) { return; } synchronized(jobInitKillStatus){ if(jobInitKillStatus.killed || jobInitKillStatus.initStarted) { return; } jobInitKillStatus.initStarted = true; }
LOG.info("Initializing " + jobId); final long startTimeFinal = this.startTime; // log job info as the user running the job try { userUGI.doAs(new PrivilegedExceptionAction<Object>() { @Override public Object run() throws Exception { JobHistory.JobInfo.logSubmitted(getJobID(), conf, jobFile, startTimeFinal, hasRestarted()); return null; } }); } catch(InterruptedException ie) { throw new IOException(ie); } // log the job priority setPriority(this.priority); // // generate security keys needed by Tasks // generateAndStoreTokens(); // // read input splits and create a map per a split // TaskSplitMetaInfo[] splits = createSplits(jobId); if (numMapTasks != splits.length) { throw new IOException("Number of maps in JobConf doesn't match number of " + "recieved splits for job " + jobId + "! " + "numMapTasks=" + numMapTasks + ", #splits=" + splits.length); } numMapTasks = splits.length; // Sanity check the locations so we don't create/initialize unnecessary tasks for (TaskSplitMetaInfo split : splits) { NetUtils.verifyHostnames(split.getLocations()); } jobtracker.getInstrumentation().addWaitingMaps(getJobID(), numMapTasks); jobtracker.getInstrumentation().addWaitingReduces(getJobID(), numReduceTasks); this.queueMetrics.addWaitingMaps(getJobID(), numMapTasks); this.queueMetrics.addWaitingReduces(getJobID(), numReduceTasks);
maps = new TaskInProgress[numMapTasks]; for(int i=0; i < numMapTasks; ++i) { inputLength += splits[i].getInputDataLength(); maps[i] = new TaskInProgress(jobId, jobFile, splits[i], jobtracker, conf, this, i, numSlotsPerMap); } LOG.info("Input size for job " + jobId + " = " + inputLength + ". Number of splits = " + splits.length); // Set localityWaitFactor before creating cache localityWaitFactor = conf.getFloat(LOCALITY_WAIT_FACTOR, DEFAULT_LOCALITY_WAIT_FACTOR); if (numMapTasks > 0) { nonRunningMapCache = createCache(splits, maxLevel); } // set the launch time this.launchTime = jobtracker.getClock().getTime(); // // Create reduce tasks // this.reduces = new TaskInProgress[numReduceTasks]; for (int i = 0; i < numReduceTasks; i++) { reduces[i] = new TaskInProgress(jobId, jobFile, numMapTasks, i, jobtracker, conf, this, numSlotsPerReduce); nonRunningReduces.add(reduces[i]); } // Calculate the minimum number of maps to be complete before // we should start scheduling reduces completedMapsForReduceSlowstart = (int)Math.ceil( (conf.getFloat("mapred.reduce.slowstart.completed.maps", DEFAULT_COMPLETED_MAPS_PERCENT_FOR_REDUCE_SLOWSTART) * numMapTasks)); // ... use the same for estimating the total output of all maps resourceEstimator.setThreshhold(completedMapsForReduceSlowstart);
140
// create cleanup two cleanup tips, one map and one reduce. cleanup = new TaskInProgress[2]; // cleanup map tip. This map doesn't use any splits. Just assign an empty // split. TaskSplitMetaInfo emptySplit = JobSplit.EMPTY_TASK_SPLIT; cleanup[0] = new TaskInProgress(jobId, jobFile, emptySplit, jobtracker, conf, this, numMapTasks, 1); cleanup[0].setJobCleanupTask(); // cleanup reduce tip. cleanup[1] = new TaskInProgress(jobId, jobFile, numMapTasks, numReduceTasks, jobtracker, conf, this, 1); cleanup[1].setJobCleanupTask(); // create two setup tips, one map and one reduce. setup = new TaskInProgress[2]; // setup map tip. This map doesn't use any split. Just assign an empty // split. setup[0] = new TaskInProgress(jobId, jobFile, emptySplit, jobtracker, conf, this, numMapTasks + 1, 1); setup[0].setJobSetupTask(); // setup reduce tip. setup[1] = new TaskInProgress(jobId, jobFile, numMapTasks, numReduceTasks + 1, jobtracker, conf, this, 1); setup[1].setJobSetupTask(); synchronized(jobInitKillStatus){ jobInitKillStatus.initDone = true; // set this before the throw to make sure cleanup works properly tasksInited = true; if(jobInitKillStatus.killed) { throw new KillInterruptedException("Job " + jobId + " killed in init"); } } JobHistory.JobInfo.logInited(profile.getJobID(), this.launchTime, numMapTasks, numReduceTasks); // Log the number of map and reduce tasks
141
LOG.info("Job " + jobId + " initialized successfully with " + numMapTasks + " map tasks and " + numReduceTasks + " reduce tasks."); }
initTasks() reads the job's InputSplits and creates one map TaskInProgress per InputSplit, stored in maps[]. It also creates one TaskInProgress per reduce, stored in reduces[]. JobQueueJobInProgressListener keeps the queue of jobs in:

private Map<JobSchedulingInfo, JobInProgress> jobQueue;
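The slow-start threshold that initTasks() computes can be illustrated in isolation. The sketch below repeats only the arithmetic `(int) Math.ceil(fraction * numMapTasks)`; the class name, the helper `canScheduleReduces` and the sample values are assumptions made for illustration and are not part of Hadoop.

```java
// Minimal sketch of the reduce slow-start threshold from initTasks().
// SlowstartSketch and canScheduleReduces are hypothetical names.
final class SlowstartSketch {

    /** Same arithmetic as initTasks(): ceil(fraction * numMapTasks). */
    static int completedMapsForReduceSlowstart(float fraction, int numMapTasks) {
        return (int) Math.ceil(fraction * numMapTasks);
    }

    /** Simplified check (assumption): reduces may be scheduled once enough maps finished. */
    static boolean canScheduleReduces(int finishedMaps, float fraction, int numMapTasks) {
        return finishedMaps >= completedMapsForReduceSlowstart(fraction, numMapTasks);
    }

    public static void main(String[] args) {
        // Hypothetical job: 10 maps, reduces allowed after 25% of the maps complete.
        System.out.println(completedMapsForReduceSlowstart(0.25f, 10)); // 3
        System.out.println(canScheduleReduces(2, 0.25f, 10));           // false
        System.out.println(canScheduleReduces(3, 0.25f, 10));           // true
    }
}
```

Because of the ceiling, a fraction of 0.25 over 10 maps requires 3 completed maps, not 2.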
14
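A TaskTracker obtains its tasks from the JobTracker: the JobTracker assigns a task only when a TaskTracker asks for one. Before the JobTracker can pick a task for a TaskTracker, it has to choose a Job and then a task from that Job. Tasks are either map tasks or reduce tasks, and the JobTracker handles them differently: it considers map tasks first and reduce tasks afterwards. Each TaskTracker has a fixed number of slots for map tasks and for reduce tasks. Requests and assignments travel on the heartbeat that the TaskTracker sends to the JobTracker. The TaskTracker is started with: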
tt.run();
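The TaskTracker first calls initialize(), which creates the InterTrackerProtocol proxy it uses to talk to the JobTracker. The TaskTracker then enters offerService(), which loops (with an interval of 10 seconds in this description) and contacts the JobTracker through transmitHeartBeat(). The call returns a HeartbeatResponse, and HeartbeatResponse.getActions() yields the TaskTrackerActions issued by the JobTracker. A LaunchTaskAction is passed to addToTaskQueue. Other actions, such as KillJobAction and KillTaskAction, are put into tasksToCleanup and processed by taskCleanupThread. The heartbeat itself is sent in transmitHeartBeat():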
// Check if the last heartbeat got through...
// if so then build the heartbeat information for the JobTracker;
// else resend the previous status information.

// Check if we should ask for a new Task

// add node health information

// Xmit the heartbeat
HeartbeatResponse heartbeatResponse = jobClient.heartbeat(status,
    justStarted, justInited, askForNewTask, heartbeatResponseId);

// The heartbeat got through successfully!

// Clear transient status information which should only
// be sent once to the JobTracker

// Force a rebuild of 'status' on the next iteration
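If the TaskTracker can run another Task, it calls heartbeat() with askForNewTask set to true. The call goes over IPC to the JobTracker's heartbeat(), and heartbeat() returns the TaskTrackerActions for this TaskTracker. The JobTracker's heartbeat():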
/**
 * The periodic heartbeat mechanism between the {@link TaskTracker} and
 * the {@link JobTracker}.
 *
 * The {@link JobTracker} processes the status information sent by the
 * {@link TaskTracker} and responds with instructions to start/stop
 * tasks or jobs, and also 'reset' instructions during contingencies.
 */
public synchronized HeartbeatResponse heartbeat(TaskTrackerStatus status,
                                                boolean restarted,
                                                boolean initialContact,
                                                boolean acceptNewTasks,
                                                short responseId)
  throws IOException {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Got heartbeat from: " + status.getTrackerName() +
              " (restarted: " + restarted +
              " initialContact: " + initialContact +
              " acceptNewTasks: " + acceptNewTasks + ")" +
              " with responseId: " + responseId);
  }

  // Make sure heartbeat is from a tasktracker allowed by the jobtracker.
  if (!acceptTaskTracker(status)) {
    throw new DisallowedTaskTrackerException(status);
  }

  // First check if the last heartbeat response got through
  String trackerName = status.getTrackerName();
  long now = clock.getTime();
  if (restarted) {
    faultyTrackers.markTrackerHealthy(status.getHost());
  } else {
    faultyTrackers.checkTrackerFaultTimeout(status.getHost(), now);
  }

  HeartbeatResponse prevHeartbeatResponse =
    trackerToHeartbeatResponseMap.get(trackerName);
  boolean addRestartInfo = false;

  if (initialContact != true) {
    // If this isn't the 'initial contact' from the tasktracker,
    // there is something seriously wrong if the JobTracker has
    // no record of the 'previous heartbeat'; if so, ask the
    // tasktracker to re-initialize itself.
    if (prevHeartbeatResponse == null) {
      // This is the first heartbeat from the old tracker to the newly
      // started JobTracker
      if (hasRestarted()) {
        addRestartInfo = true;
        // inform the recovery manager about this tracker joining back
        recoveryManager.unMarkTracker(trackerName);
      } else {
        // Jobtracker might have restarted but no recovery is needed
        // otherwise this code should not be reached
        LOG.warn("Serious problem, cannot find record of 'previous' " +
                 "heartbeat for '" + trackerName +
                 "'; reinitializing the tasktracker");
        return new HeartbeatResponse(responseId,
            new TaskTrackerAction[] {new ReinitTrackerAction()});
      }
    } else {
      // It is completely safe to not process a 'duplicate' heartbeat from a
      // {@link TaskTracker} since it resends the heartbeat when rpcs are
      // lost see {@link TaskTracker.transmitHeartbeat()};
      // acknowledge it by re-sending the previous response to let the
      // {@link TaskTracker} go forward.
      if (prevHeartbeatResponse.getResponseId() != responseId) {
        LOG.info("Ignoring 'duplicate' heartbeat from '" +
            trackerName + "'; resending the previous 'lost' response");
        return prevHeartbeatResponse;
      }
    }
  }

  // Process this heartbeat
  short newResponseId = (short)(responseId + 1);
  status.setLastSeen(now);
  if (!processHeartbeat(status, initialContact, now)) {
    if (prevHeartbeatResponse != null) {
      trackerToHeartbeatResponseMap.remove(trackerName);
    }
    return new HeartbeatResponse(newResponseId,
        new TaskTrackerAction[] {new ReinitTrackerAction()});
  }

  // Initialize the response to be sent for the heartbeat
  HeartbeatResponse response = new HeartbeatResponse(newResponseId, null);
  List<TaskTrackerAction> actions = new ArrayList<TaskTrackerAction>();
  boolean isBlacklisted = faultyTrackers.isBlacklisted(status.getHost());

  // Check for new tasks to be executed on the tasktracker
  if (recoveryManager.shouldSchedule() && acceptNewTasks && !isBlacklisted) {
    TaskTrackerStatus taskTrackerStatus = getTaskTrackerStatus(trackerName);
    if (taskTrackerStatus == null) {
      LOG.warn("Unknown task tracker polling; ignoring: " + trackerName);
    } else {
      // setup and cleanup tasks take precedence
      List<Task> tasks = getSetupAndCleanupTasks(taskTrackerStatus);
      if (tasks == null ) {
        // no setup/cleanup task is pending: ask the scheduler for tasks
        tasks = taskScheduler.assignTasks(taskTrackers.get(trackerName));
      }
      if (tasks != null) {
        for (Task task : tasks) {
          // add the task to actions, which is returned to the TaskTracker
          expireLaunchingTasks.addNewTask(task.getTaskID());
          if(LOG.isDebugEnabled()) {
            LOG.debug(trackerName + " -> LaunchTask: " + task.getTaskID());
          }
          actions.add(new LaunchTaskAction(task));
        }
      }
    }
  }

  // Check for tasks to be killed
  List<TaskTrackerAction> killTasksList = getTasksToKill(trackerName);
  if (killTasksList != null) {
    actions.addAll(killTasksList);
  }

  // Check for jobs to be killed/cleanedup
  List<TaskTrackerAction> killJobsList = getJobsForCleanup(trackerName);
  if (killJobsList != null) {
    actions.addAll(killJobsList);
  }

  // Check for tasks whose outputs can be saved
  List<TaskTrackerAction> commitTasksList = getTasksToSave(status);
  if (commitTasksList != null) {
    actions.addAll(commitTasksList);
  }

  // calculate next heartbeat interval and put in heartbeat response
  int nextInterval = getNextHeartbeatInterval();
  response.setHeartbeatInterval(nextInterval);
  response.setActions(
      actions.toArray(new TaskTrackerAction[actions.size()]));

  // check if the restart info is req
  if (addRestartInfo) {
    response.setRecoveredJobs(recoveryManager.getJobsToRecover());
  }

  // Update the trackerToHeartbeatResponseMap
  trackerToHeartbeatResponseMap.put(trackerName, response);

  // Done processing the hearbeat, now remove 'marked' tasks
  removeMarkedTasks(trackerName);

  return response;
}
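The duplicate-heartbeat check in heartbeat() relies on the response id that the TaskTracker echoes back. The following is a minimal sketch of that idea only, assuming a hypothetical class `HeartbeatIdSketch` that tracks one id per tracker; it leaves out the restart and re-initialization branches and all task scheduling.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the responseId handshake in heartbeat().
final class HeartbeatIdSketch {
    // Last response id sent to each tracker (stands in for trackerToHeartbeatResponseMap).
    private final Map<String, Short> lastResponseId = new HashMap<>();

    /** Returns the response id the tracker must echo in its next heartbeat. */
    short heartbeat(String tracker, short responseId, boolean initialContact) {
        Short prev = lastResponseId.get(tracker);
        if (!initialContact && prev != null && prev.shortValue() != responseId) {
            // The tracker did not see our last response: treat this heartbeat as a
            // duplicate and resend the previous response instead of processing it.
            return prev;
        }
        short next = (short) (responseId + 1);
        lastResponseId.put(tracker, next);
        return next;
    }

    public static void main(String[] args) {
        HeartbeatIdSketch jt = new HeartbeatIdSketch();
        short r1 = jt.heartbeat("tt1", (short) 0, true);  // first contact
        short r2 = jt.heartbeat("tt1", r1, false);        // normal heartbeat
        short dup = jt.heartbeat("tt1", r1, false);       // stale id: response r2 was lost
        System.out.println(r1 + " " + r2 + " " + dup);
    }
}
```

A heartbeat that carries a stale id changes no state, which is why resending a lost response is safe in this model.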
Task assignment is done by the scheduler. The assignTasks() method of JobQueueTaskScheduler:
@Override
public synchronized List<Task> assignTasks(TaskTracker taskTracker)
    throws IOException {
  TaskTrackerStatus taskTrackerStatus = taskTracker.getStatus();
  ClusterStatus clusterStatus = taskTrackerManager.getClusterStatus();
  final int numTaskTrackers = clusterStatus.getTaskTrackers();
  final int clusterMapCapacity = clusterStatus.getMaxMapTasks();
  final int clusterReduceCapacity = clusterStatus.getMaxReduceTasks();

  Collection<JobInProgress> jobQueue =
    jobQueueJobInProgressListener.getJobQueue();

  //
  // Get map + reduce counts for the current tracker.
  //
  final int trackerMapCapacity = taskTrackerStatus.getMaxMapSlots();
  final int trackerReduceCapacity = taskTrackerStatus.getMaxReduceSlots();
  final int trackerRunningMaps = taskTrackerStatus.countMapTasks();
  final int trackerRunningReduces = taskTrackerStatus.countReduceTasks();

  // Assigned tasks
  List<Task> assignedTasks = new ArrayList<Task>();

  //
  // Compute (running + pending) map and reduce task numbers across pool
  //
  int remainingReduceLoad = 0;
  int remainingMapLoad = 0;
  synchronized (jobQueue) {
    for (JobInProgress job : jobQueue) {
      if (job.getStatus().getRunState() == JobStatus.RUNNING) {
        remainingMapLoad += (job.desiredMaps() - job.finishedMaps());
        if (job.scheduleReduces()) {
          remainingReduceLoad +=
            (job.desiredReduces() - job.finishedReduces());
        }
      }
    }
  }

  // Compute the 'load factor' for maps and reduces
  double mapLoadFactor = 0.0;
  if (clusterMapCapacity > 0) {
    mapLoadFactor = (double)remainingMapLoad / clusterMapCapacity;
  }
  double reduceLoadFactor = 0.0;
  if (clusterReduceCapacity > 0) {
    reduceLoadFactor = (double)remainingReduceLoad / clusterReduceCapacity;
  }

  //
  // In the below steps, we allocate first map tasks (if appropriate),
  // and then reduce tasks if appropriate. We go through all jobs
  // in order of job arrival; jobs only get serviced if their
  // predecessors are serviced, too.
  //

  //
  // We assign tasks to the current taskTracker if the given machine
  // has a workload that's less than the maximum load of that kind of
  // task.
  // However, if the cluster is close to getting loaded i.e. we don't
  // have enough _padding_ for speculative executions etc., we only
  // schedule the "highest priority" task i.e. the task from the job
  // with the highest priority.
  //
  final int trackerCurrentMapCapacity =
    Math.min((int)Math.ceil(mapLoadFactor * trackerMapCapacity),
             trackerMapCapacity);
  int availableMapSlots = trackerCurrentMapCapacity - trackerRunningMaps;
  boolean exceededMapPadding = false;
  if (availableMapSlots > 0) {
    exceededMapPadding =
      exceededPadding(true, clusterStatus, trackerMapCapacity);
  }

  int numLocalMaps = 0;
  int numNonLocalMaps = 0;
  scheduleMaps:
  for (int i=0; i < availableMapSlots; ++i) {
    synchronized (jobQueue) {
      for (JobInProgress job : jobQueue) {
        if (job.getStatus().getRunState() != JobStatus.RUNNING) {
          continue;
        }

        Task t = null;

        // Try to schedule a node-local or rack-local Map task
        t = job.obtainNewNodeOrRackLocalMapTask(taskTrackerStatus,
              numTaskTrackers, taskTrackerManager.getNumberOfUniqueHosts());
        if (t != null) {
          assignedTasks.add(t);
          ++numLocalMaps;

          // Don't assign map tasks to the hilt!
          // Leave some free slots in the cluster for future task-failures,
          // speculative tasks etc. beyond the highest priority job
          if (exceededMapPadding) {
            break scheduleMaps;
          }

          // Try all jobs again for the next Map task
          break;
        }

        // Try to schedule a non-local (off-switch) Map task
        t = job.obtainNewNonLocalMapTask(taskTrackerStatus, numTaskTrackers,
              taskTrackerManager.getNumberOfUniqueHosts());
        if (t != null) {
          assignedTasks.add(t);
          ++numNonLocalMaps;

          // We assign at most 1 off-switch or speculative task
          // This is to prevent TaskTrackers from stealing local-tasks
          // from other TaskTrackers.
          break scheduleMaps;
        }
      }
    }
  }
  int assignedMaps = assignedTasks.size();

  //
  // Same thing, but for reduce tasks
  // However we _never_ assign more than 1 reduce task per heartbeat
  //
  final int trackerCurrentReduceCapacity =
    Math.min((int)Math.ceil(reduceLoadFactor * trackerReduceCapacity),
             trackerReduceCapacity);
  final int availableReduceSlots =
    Math.min((trackerCurrentReduceCapacity - trackerRunningReduces), 1);
  boolean exceededReducePadding = false;
  if (availableReduceSlots > 0) {
    exceededReducePadding = exceededPadding(false, clusterStatus,
                                            trackerReduceCapacity);
    synchronized (jobQueue) {
      for (JobInProgress job : jobQueue) {
        if (job.getStatus().getRunState() != JobStatus.RUNNING ||
            job.numReduceTasks == 0) {
          continue;
        }

        Task t =
          job.obtainNewReduceTask(taskTrackerStatus, numTaskTrackers,
                                  taskTrackerManager.getNumberOfUniqueHosts()
                                  );
        if (t != null) {
          assignedTasks.add(t);
          break;
        }

        // Don't assign reduce tasks to the hilt!
        // Leave some free slots in the cluster for future task-failures,
        // speculative tasks etc. beyond the highest priority job
        if (exceededReducePadding) {
          break;
        }
      }
    }
  }

  if (LOG.isDebugEnabled()) {
    LOG.debug("Task assignments for " + taskTrackerStatus.getTrackerName() + " --> " +
              "[" + mapLoadFactor + ", " + trackerMapCapacity + ", " +
              trackerCurrentMapCapacity + ", " + trackerRunningMaps + "] -> [" +
              (trackerCurrentMapCapacity - trackerRunningMaps) + ", " +
              assignedMaps + " (" + numLocalMaps + ", " + numNonLocalMaps +
              ")] [" + reduceLoadFactor + ", " + trackerReduceCapacity + ", " +
              trackerCurrentReduceCapacity + "," + trackerRunningReduces +
              "] -> [" + (trackerCurrentReduceCapacity - trackerRunningReduces) +
              ", " + (assignedTasks.size()-assignedMaps) + "]");
  }

  return assignedTasks;
}
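The slot arithmetic at the heart of assignTasks() is easier to follow with numbers. The sketch below reproduces only the load-factor and available-slot computation for map tasks; the class `SlotMathSketch` and the sample cluster sizes are assumptions for illustration, and padding, locality and the job queue are left out.

```java
// Hypothetical helper that repeats the map-slot arithmetic of assignTasks().
final class SlotMathSketch {

    /**
     * Number of map slots this tracker may still fill in one heartbeat:
     * min(ceil(loadFactor * trackerCapacity), trackerCapacity) - running.
     */
    static int availableMapSlots(int remainingMapLoad, int clusterMapCapacity,
                                 int trackerMapCapacity, int trackerRunningMaps) {
        double mapLoadFactor = 0.0;
        if (clusterMapCapacity > 0) {
            mapLoadFactor = (double) remainingMapLoad / clusterMapCapacity;
        }
        int trackerCurrentMapCapacity =
            Math.min((int) Math.ceil(mapLoadFactor * trackerMapCapacity),
                     trackerMapCapacity);
        return trackerCurrentMapCapacity - trackerRunningMaps;
    }

    public static void main(String[] args) {
        // Lightly loaded cluster: 10 pending maps over 40 slots, tracker has 4 slots.
        System.out.println(availableMapSlots(10, 40, 4, 0));  // 1
        // Overloaded cluster: the tracker may use all 4 slots, 1 is already busy.
        System.out.println(availableMapSlots(100, 40, 4, 1)); // 3
    }
}
```

With a load factor below 1 the tracker is deliberately not filled, which spreads a small job across the cluster instead of packing it onto the first trackers that report in.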
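JobInProgress.obtainNewMapTask assigns a map task: it calls findNewMapTask, which uses the Node of the TaskTracker to look up a TaskInProgress in nonRunningMapCache. JobInProgress.obtainNewReduceTask assigns a reduce task: it calls findNewReduceTask, which takes a TaskInProgress from nonRunningReduces. Control then returns to the TaskTracker's offerService().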
When the TaskTracker receives the JobTracker's heartbeat response, it passes each LaunchTaskAction to addToTaskQueue. A map task is queued on mapLauncher (of type TaskLauncher), a reduce task on reduceLauncher (of type TaskLauncher).
// From offerService():
// Send the heartbeat and process the jobtracker's directives
HeartbeatResponse heartbeatResponse = transmitHeartBeat(now);

TaskTrackerAction[] actions = heartbeatResponse.getActions();
if (actions != null){
  for(TaskTrackerAction action: actions) {
    if (action instanceof LaunchTaskAction) {
      addToTaskQueue((LaunchTaskAction)action);
    } else if (action instanceof CommitTaskAction) {
      CommitTaskAction commitAction = (CommitTaskAction)action;
      if (!commitResponses.contains(commitAction.getTaskID())) {
        LOG.info("Received commit task action for " +
                 commitAction.getTaskID());
        commitResponses.add(commitAction.getTaskID());
      }
    } else {
      tasksToCleanup.put(action);
    }
  }
}

// Called from offerService():
private void addToTaskQueue(LaunchTaskAction action) {
  if (action.getTask().isMapTask()) {
    mapLauncher.addToTaskQueue(action);
  } else {
    reduceLauncher.addToTaskQueue(action);
  }
}
15
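Before it runs a task, the TaskTracker localizes it: the files the job needs, including the distributed cache and the job jar, are copied to the local file system. A TaskRunner then runs the task; the TaskRunner starts a new JVM, and the task runs in this child JVM rather than inside the TaskTracker.

In TaskTracker.offerService(), once the TaskTracker has the JobTracker's heartbeat response, each LaunchTaskAction is passed to addToTaskQueue: a map task is queued on mapLauncher (of type TaskLauncher), a reduce task on reduceLauncher (of type TaskLauncher). TaskLauncher is a thread; its run method takes a TaskInProgress from the queue and calls startNewTask(TaskInProgress tip) to start the task, which in turn calls localizeJob(TaskInProgress tip).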
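localizeJob() initializes the job's working directory (workDir), copies the job jar from HDFS to the local file system and unpacks it with RunJar.unJar(). It creates a RunningJob and calls addTaskToJob() to register it in runningJobs: addTaskToJob adds the task to the runningJob's tasks and adds the runningJob to runningJobs. After localization, launchTaskForJob() starts the Task; launchTaskForJob() calls TaskInProgress.launchTask(RunningJob rjob):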
/**
 * Kick off the task execution
 */
public synchronized void launchTask(RunningJob rjob) throws IOException {
  if (this.taskStatus.getRunState() == TaskStatus.State.UNASSIGNED ||
      this.taskStatus.getRunState() == TaskStatus.State.FAILED_UNCLEAN ||
      this.taskStatus.getRunState() == TaskStatus.State.KILLED_UNCLEAN) {
    localizeTask(task);
    if (this.taskStatus.getRunState() == TaskStatus.State.UNASSIGNED) {
      this.taskStatus.setRunState(TaskStatus.State.RUNNING);
    }
    setTaskRunner(task.createRunner(TaskTracker.this, this, rjob));
    this.runner.start();
    long now = System.currentTimeMillis();
    this.taskStatus.setStartTime(now);
    this.lastProgressReport = now;
  } else {
    LOG.info("Not launching task: " + task.getTaskID() +
        " since it's state is " + this.taskStatus.getRunState());
  }
}
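localizeTask() updates the jobConf for the Task. createRunner() then creates a TaskRunner, and start() launches it, so that the Task runs in its own java process. task.createRunner() creates the runner that matches the type of the Task (MapTask or ReduceTask, for Map and Reduce respectively): for a MapTask the TaskRunner is a MapTaskRunner, for a ReduceTask it is a ReduceTaskRunner.

TaskRunner.start() executes TaskRunner's run() method, which prepares what the java process needs: the working directory (workDir), the CLASSPATH, the job jar and so on. The JVMs in which a TaskTracker runs its Tasks are managed by JvmManager, each through a JvmRunner. JvmManager's launchJvm distinguishes map from reduce and passes the request to the corresponding JvmManagerForType. JvmManagerForType's reapJvm() decides which JVM to use: JvmManagerForType can reuse an idle JVM of the same Job, or call spawnNewJvm. spawnNewJvm creates a JvmRunner, whose run method calls runChild. runChild delegates to the TaskController, which is either DefaultTaskController or LinuxTaskController: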
public void runChild(JvmEnv env) throws IOException, InterruptedException{
  int exitCode = 0;
  try {
    env.vargs.add(Integer.toString(jvmId.getId()));
    TaskRunner runner = jvmToRunningTask.get(jvmId);
    if (runner != null) {
      Task task = runner.getTask();
      //Launch the task controller to run task JVM
      String user = task.getUser();
      TaskAttemptID taskAttemptId = task.getTaskID();
      String taskAttemptIdStr = task.isTaskCleanupTask() ?
          (taskAttemptId.toString() + TaskTracker.TASK_CLEANUP_SUFFIX) :
          taskAttemptId.toString();
      exitCode = tracker.getTaskController().launchTask(user,

          + logLocation);
    }
    //read the configuration for the job
    FileSystem rawFs = FileSystem.getLocal(getConf()).getRaw();
    long logSize = 0; //TODO MAPREDUCE-1100
    // get the JVM command line.
    String cmdLine =
      TaskLog.buildCommandLine(setup, jvmArguments,
          new File(stdout), new File(stderr), logSize, true);

    // write the command to a file in the
    // task specific cache directory
    // TODO copy to user dir
    Path p = new Path(allocator.getLocalPathForWrite(
        TaskTracker.getPrivateDirTaskScriptLocation(user, jobId, attemptId),
        getConf()), COMMAND_FILE);
    String commandFile = writeCommand(cmdLine, rawFs, p);
    rawFs.setPermission(p, TaskController.TASK_LAUNCH_SCRIPT_PERMISSION);
    shExec = new ShellCommandExecutor(new String[]{
        "bash", "-c", commandFile},
        currentWorkDirectory);
    shExec.execute();
  } catch (Exception e) {
    if (shExec == null) {
      return -1;
    }
    int exitCode = shExec.getExitCode();
    LOG.warn("Exit code from task is : " + exitCode);
    LOG.info("Output from DefaultTaskController's launchTask follows:");
    logOutput(shExec.getOutput());
    return exitCode;
  }
  return 0;
}
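launchTask() uses a Shell command to start a new JVM that executes Child.main(). Both map tasks and reduce tasks run inside Child: Child's main function obtains the task and calls its run(). The run() method of MapTask: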
public void run(final JobConf job, final TaskUmbilicalProtocol umbilical)
  throws IOException, ClassNotFoundException, InterruptedException {
  this.umbilical = umbilical;

  // start thread that will handle communication with parent
  TaskReporter reporter = new TaskReporter(getProgress(), umbilical,
      jvmContext);
  reporter.startCommunicationThread();
  boolean useNewApi = job.getUseNewMapper();
  initialize(job, getJobID(), reporter, useNewApi);

  // check if it is a cleanupJobTask
  if (jobCleanup) {
    runJobCleanupTask(umbilical, reporter);
    return;
  }
  if (jobSetup) {
    runJobSetupTask(umbilical, reporter);
    return;
  }
  if (taskCleanup) {
    runTaskCleanupTask(umbilical, reporter);
    return;
  }

  if (useNewApi) {
    runNewMapper(job, splitMetaInfo, umbilical, reporter);
  } else {
    runOldMapper(job, splitMetaInfo, umbilical, reporter);
  }
  done(umbilical, reporter);
}
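run() starts a TaskReporter to report progress and handles the special cases runJobCleanupTask, runJobSetupTask and runTaskCleanupTask. Otherwise it runs the Mapper. MapReduce has an old and a new API, and MapTask supports both: depending on the API that the Mapper uses, MapTask calls runNewMapper or runOldMapper. Taking runOldMapper as the example: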
private <INKEY,INVALUE,OUTKEY,OUTVALUE>
void runOldMapper(final JobConf job,
                  final TaskSplitIndex splitIndex,
                  final TaskUmbilicalProtocol umbilical,
                  TaskReporter reporter
                  ) throws IOException, InterruptedException,
                           ClassNotFoundException {
  InputSplit inputSplit = getSplitDetails(new Path(splitIndex.getSplitLocation()),
      splitIndex.getStartOffset());

  updateJobWithSplit(job, inputSplit);
  reporter.setInputSplit(inputSplit);

  RecordReader<INKEY,INVALUE> in = isSkipping() ?
      new SkippingRecordReader<INKEY,INVALUE>(inputSplit, umbilical, reporter) :
      new TrackedRecordReader<INKEY,INVALUE>(inputSplit, job, reporter);
  job.setBoolean("mapred.skip.on", isSkipping());

  int numReduceTasks = conf.getNumReduceTasks();
  LOG.info("numReduceTasks: " + numReduceTasks);
  MapOutputCollector collector = null;
  if (numReduceTasks > 0) {
    collector = new MapOutputBuffer(umbilical, job, reporter);
  } else {
    collector = new DirectMapOutputCollector(umbilical, job, reporter);
  }
  MapRunnable<INKEY,INVALUE,OUTKEY,OUTVALUE> runner =
    ReflectionUtils.newInstance(job.getMapRunnerClass(), job);

  try {
    runner.run(in, new OldOutputCollector(collector, conf), reporter);
    collector.flush();
  } finally {
    //close
    in.close();                               // close input
    collector.close();
  }
}
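runOldMapper() obtains the InputSplit for the Mapper, creates the RecordReader that supplies the records to map, and selects the Mapper's MapOutputCollector: a DirectMapOutputCollector if the job has no Reducer, a MapOutputBuffer otherwise. It then calls the run() method of MapRunner: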
public void run(RecordReader<K1, V1> input, OutputCollector<K2, V2> output,
                Reporter reporter)
  throws IOException {
  try {
    // allocate key & value instances that are re-used for all entries
    K1 key = input.createKey();
    V1 value = input.createValue();

    while (input.next(key, value)) {
      // map pair to output
      mapper.map(key, value, output, reporter);
      if(incrProcCount) {
        reporter.incrCounter(SkipBadRecords.COUNTER_GROUP,
            SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS, 1);
      }
    }
  } finally {
    mapper.close();
  }
}
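MapRunner reads each key/value (kv) pair and passes it to map; collecting the output, spilling it and combining it are left to the OutputCollector that map writes to. On the map side this is a MapOutputCollector, either a MapOutputBuffer or a DirectMapOutputCollector. DirectMapOutputCollector is used when the job has no Reduce phase: the Mapper's output is written out directly and is not prepared for reduce. Otherwise MapOutputBuffer is used, which buffers the map output in memory. The collect() method of MapOutputBuffer: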
public synchronized void collect(K key, V value, int partition
                                 ) throws IOException {
  reporter.progress();
  if (key.getClass() != keyClass) {
    throw new IOException("Type mismatch in key from map: expected "
                          + keyClass.getName() + ", recieved "
                          + key.getClass().getName());
  }
  if (value.getClass() != valClass) {
    throw new IOException("Type mismatch in value from map: expected "
                          + valClass.getName() + ", recieved "
                          + value.getClass().getName());
  }
  final int kvnext = (kvindex + 1) % kvoffsets.length;
  spillLock.lock();
  try {
    boolean kvfull;
    do {
      if (sortSpillException != null) {
        throw (IOException)new IOException("Spill failed"
            ).initCause(sortSpillException);
      }
      // sufficient acct space
      kvfull = kvnext == kvstart;
      final boolean kvsoftlimit = ((kvnext > kvend)
          ? kvnext - kvend > softRecordLimit
          : kvend - kvnext <= kvoffsets.length - softRecordLimit);
      if (kvstart == kvend && kvsoftlimit) {
        LOG.info("Spilling map output: record full = " + kvsoftlimit);
        startSpill();
      }
      if (kvfull) {
        try {
          while (kvstart != kvend) {
            reporter.progress();
            spillDone.await();
          }
        } catch (InterruptedException e) {
          throw (IOException)new IOException(
              "Collector interrupted while waiting for the writer"
              ).initCause(e);
        }
      }
    } while (kvfull);
  } finally {
    spillLock.unlock();
  }

  try {
    // serialize key bytes into buffer
    int keystart = bufindex;
    keySerializer.serialize(key);
    if (bufindex < keystart) {
      // wrapped the key; reset required
      bb.reset();
      keystart = 0;
    }
    // serialize value bytes into buffer
    final int valstart = bufindex;
    valSerializer.serialize(value);
    int valend = bb.markRecord();

    if (partition < 0 || partition >= partitions) {
      throw new IOException("Illegal partition for " + key + " (" +
          partition + ")");
    }

    mapOutputRecordCounter.increment(1);
    mapOutputByteCounter.increment(valend >= keystart
        ? valend - keystart
        : (bufvoid - keystart) + valend);

    // update accounting info
    int ind = kvindex * ACCTSIZE;
    kvoffsets[kvindex] = ind;
    kvindices[ind + PARTITION] = partition;
    kvindices[ind + KEYSTART] = keystart;
    kvindices[ind + VALSTART] = valstart;
    kvindex = kvnext;
  } catch (MapBufferTooSmallException e) {
    LOG.info("Record too large for in-memory buffer: " + e.getMessage());
    spillSingleRecord(key, value, partition);
    mapOutputRecordCounter.increment(1);
    return;
  }
}
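The map output is first written to an in-memory buffer, 100 MB by default (io.sort.mb). When the buffer is 80% full (io.sort.spill.percent), a spill starts and spillThread writes the buffered data to disk. A spill has two steps:

1. sortAndSpill sorts the records by partition and, within a partition, by key, using QuickSort.
2. If a combiner is configured, CombinerRunner runs combine on the sorted records before they are written, so that less data leaves the buffer.

The spill itself is carried out by sortAndSpill.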
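As a rough illustration of the two settings just mentioned, the sketch below converts a buffer size in MB and a spill fraction into a byte threshold. It is an approximation under stated assumptions: the real MapOutputBuffer also reserves part of the buffer for record accounting, which is ignored here, and the class name `SpillThresholdSketch` is hypothetical.

```java
// Hypothetical helper: approximate byte threshold at which a spill starts.
final class SpillThresholdSketch {

    /** sortMb mirrors io.sort.mb, spillPercent mirrors io.sort.spill.percent. */
    static long softLimitBytes(int sortMb, double spillPercent) {
        long bufferBytes = (long) sortMb << 20;          // MB to bytes
        return Math.round(bufferBytes * spillPercent);   // fraction of the buffer
    }

    public static void main(String[] args) {
        // Values quoted in the text: 100 MB buffer, spill at 80%.
        System.out.println(softLimitBytes(100, 0.8)); // 83886080 bytes, i.e. 80 MB
    }
}
```

Raising the fraction delays the first spill but leaves less room for map output that arrives while the spill thread is still writing.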
private void sortAndSpill() throws IOException, ClassNotFoundException,
                                   InterruptedException {
  //approximate the length of the output file to be the length of the
  //buffer + header lengths for the partitions
  long size = (bufend >= bufstart
      ? bufend - bufstart
      : (bufvoid - bufend) + bufstart) +
              partitions * APPROX_HEADER_LENGTH;
  FSDataOutputStream out = null;
  try {
    // create spill file
    final SpillRecord spillRec = new SpillRecord(partitions);
    final Path filename =
        mapOutputFile.getSpillFileForWrite(numSpills, size);
    out = rfs.create(filename);

    final int endPosition = (kvend > kvstart)
      ? kvend
      : kvoffsets.length + kvend;
    sorter.sort(MapOutputBuffer.this, kvstart, endPosition, reporter);
    int spindex = kvstart;
    IndexRecord rec = new IndexRecord();
    InMemValBytes value = new InMemValBytes();
    for (int i = 0; i < partitions; ++i) {
      IFile.Writer<K, V> writer = null;
      try {
        long segmentStart = out.getPos();
        writer = new Writer<K, V>(job, out, keyClass, valClass, codec,
                                  spilledRecordsCounter);
        if (combinerRunner == null) {
          // spill directly
          DataInputBuffer key = new DataInputBuffer();
          while (spindex < endPosition &&
              kvindices[kvoffsets[spindex % kvoffsets.length]
                        + PARTITION] == i) {
            final int kvoff = kvoffsets[spindex % kvoffsets.length];
            getVBytesForOffset(kvoff, value);
            key.reset(kvbuffer, kvindices[kvoff + KEYSTART],
                      (kvindices[kvoff + VALSTART] -
                       kvindices[kvoff + KEYSTART]));
            writer.append(key, value);
            ++spindex;
          }
        } else {
          int spstart = spindex;
          while (spindex < endPosition &&
              kvindices[kvoffsets[spindex % kvoffsets.length]
                        + PARTITION] == i) {
            ++spindex;
          }
          // Note: we would like to avoid the combiner if we've fewer
          // than some threshold of records for a partition
          if (spstart != spindex) {
            combineCollector.setWriter(writer);
            RawKeyValueIterator kvIter =
              new MRResultIterator(spstart, spindex);
            combinerRunner.combine(kvIter, combineCollector);
          }
        }

        // close the writer
        writer.close();

        // record offsets
        rec.startOffset = segmentStart;
        rec.rawLength = writer.getRawLength();
        rec.partLength = writer.getCompressedLength();
        spillRec.putIndex(rec, i);

        writer = null;
      } finally {
        if (null != writer) writer.close();
      }
    }

    if (totalIndexCacheMemory >= INDEX_CACHE_MEMORY_LIMIT) {
      // create spill index file
      Path indexFilename =
          mapOutputFile.getSpillIndexFileForWrite(numSpills, partitions
              * MAP_OUTPUT_INDEX_RECORD_LENGTH);
      spillRec.writeToFile(indexFilename, job);
    } else {
      indexCacheList.add(spillRec);
      totalIndexCacheMemory +=
        spillRec.size() * MAP_OUTPUT_INDEX_RECORD_LENGTH;
    }
    LOG.info("Finished spill " + numSpills);
    ++numSpills;
  } finally {
    if (out != null) out.close();
  }
}
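The ordering that sortAndSpill() relies on, partition first and key second, can be shown on plain objects. The sketch below is not how MapOutputBuffer sorts (it sorts offsets into a byte buffer); it only demonstrates the resulting order, with a hypothetical `Rec` type and string keys chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of the spill order: by partition, then by key.
final class SpillSortSketch {

    static final class Rec {
        final int partition;
        final String key;
        Rec(int partition, String key) {
            this.partition = partition;
            this.key = key;
        }
    }

    /** Returns a copy of the records sorted by partition and, within a partition, by key. */
    static List<Rec> sortForSpill(List<Rec> recs) {
        List<Rec> sorted = new ArrayList<>(recs);
        sorted.sort(Comparator.comparingInt((Rec r) -> r.partition)
                              .thenComparing(r -> r.key));
        return sorted;
    }

    public static void main(String[] args) {
        List<Rec> recs = new ArrayList<>();
        recs.add(new Rec(1, "b"));
        recs.add(new Rec(0, "z"));
        recs.add(new Rec(1, "a"));
        recs.add(new Rec(0, "c"));
        for (Rec r : sortForSpill(recs)) {
            System.out.println(r.partition + " " + r.key);
        }
    }
}
```

Because all records of one partition end up contiguous, the spill loop can write one segment per partition in a single pass, which is what the `for (int i = 0; i < partitions; ++i)` loop does.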
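When the map finishes, MapOutputBuffer's flush calls sortAndSpill to write out what is still in the buffer, and mergeParts then merges the spill files into the final map output. Two measures reduce the amount of map output that is written and transferred:

1. running a combiner;
2. compressing the map output by setting mapred.compress.map.out to true.

This completes the Map side. The Reduce side starts in ReduceTask.run. Like MapTask, it calls initialize() and handles runJobCleanupTask(), runJobSetupTask() and runTaskCleanupTask(). The reduce work itself has three phases: Copy, Sort and Reduce.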
public void run(JobConf job, final TaskUmbilicalProtocol umbilical)
  throws IOException, InterruptedException, ClassNotFoundException {
  this.umbilical = umbilical;
  job.setBoolean("mapred.skip.on", isSkipping());

  if (isMapOrReduce()) {
    copyPhase = getProgress().addPhase("copy");
    sortPhase = getProgress().addPhase("sort");
    reducePhase = getProgress().addPhase("reduce");
  }
  // start thread that will handle communication with parent
  TaskReporter reporter = new TaskReporter(getProgress(), umbilical,
      jvmContext);
  reporter.startCommunicationThread();
  boolean useNewApi = job.getUseNewReducer();
  initialize(job, getJobID(), reporter, useNewApi);

  // check if it is a cleanupJobTask
  if (jobCleanup) {
    runJobCleanupTask(umbilical, reporter);
    return;
  }
  if (jobSetup) {
    runJobSetupTask(umbilical, reporter);
    return;
  }
  if (taskCleanup) {
    runTaskCleanupTask(umbilical, reporter);
    return;
  }

  // Initialize the codec
  codec = initCodec();

  boolean isLocal = "local".equals(job.get("mapred.job.tracker", "local"));
  if (!isLocal) {
    reduceCopier = new ReduceCopier(umbilical, job, reporter);
    if (!reduceCopier.fetchOutputs()) {
      if(reduceCopier.mergeThrowable instanceof FSError) {
        throw (FSError)reduceCopier.mergeThrowable;
      }
      throw new IOException("Task: " + getTaskID() +
          " - The reduce copier failed", reduceCopier.mergeThrowable);
    }
  }
  copyPhase.complete();                         // copy is already complete
  setPhase(TaskStatus.Phase.SORT);
  statusUpdate(umbilical);

  final FileSystem rfs = FileSystem.getLocal(job).getRaw();
  RawKeyValueIterator rIter = isLocal
    ? Merger.merge(job, rfs, job.getMapOutputKeyClass(),
        job.getMapOutputValueClass(), codec, getMapFiles(rfs, true),
        !conf.getKeepFailedTaskFiles(), job.getInt("io.sort.factor", 100),
        new Path(getTaskID().toString()), job.getOutputKeyComparator(),
        reporter, spilledRecordsCounter, null)
    : reduceCopier.createKVIterator(job, rfs, reporter);

  // free up the data structures
  mapOutputFilesOnDisk.clear();

  sortPhase.complete();                         // sort is complete
  setPhase(TaskStatus.Phase.REDUCE);
  statusUpdate(umbilical);
  Class keyClass = job.getMapOutputKeyClass();
  Class valueClass = job.getMapOutputValueClass();
  RawComparator comparator = job.getOutputValueGroupingComparator();

  if (useNewApi) {
    runNewReducer(job, umbilical, reporter, rIter, comparator,
                  keyClass, valueClass);
  } else {
    runOldReducer(job, umbilical, reporter, rIter, comparator,
                  keyClass, valueClass);
  }
  done(umbilical, reporter);
}
      OldTrackingRecordWriter<OUTKEY, OUTVALUE>(
          reduceOutputCounter, job, reporter, finalName);

  OutputCollector<OUTKEY,OUTVALUE> collector =
    new OutputCollector<OUTKEY,OUTVALUE>() {
      public void collect(OUTKEY key, OUTVALUE value)
        throws IOException {
        out.write(key, value);
        // indicate that progress update needs to be sent
        reporter.progress();
      }
    };

  // apply reduce function
  try {
    //increment processed counter only if skipping feature is enabled
    boolean incrProcCount = SkipBadRecords.getReducerMaxSkipGroups(job)>0 &&
      SkipBadRecords.getAutoIncrReducerProcCount(job);

    ReduceValuesIterator<INKEY,INVALUE> values = isSkipping() ?
        new SkippingReduceValuesIterator<INKEY,INVALUE>(rIter,
            comparator, keyClass, valueClass,
            job, reporter, umbilical) :
        new ReduceValuesIterator<INKEY,INVALUE>(rIter,
            job.getOutputValueGroupingComparator(), keyClass, valueClass,
            job, reporter);
    values.informReduceProgress();
    while (values.more()) {
      reduceInputKeyCounter.increment(1);
      reducer.reduce(values.getKey(), values, collector, reporter);
      if(incrProcCount) {
        reporter.incrCounter(SkipBadRecords.COUNTER_GROUP,
            SkipBadRecords.COUNTER_REDUCE_PROCESSED_GROUPS, 1);
      }
      values.nextKey();
      values.informReduceProgress();
    }

    //Clean up: repeated in catch block below
    reducer.close();
    out.close(reporter);
    //End of clean up.
  } catch (IOException ioe) {
    try {
      reducer.close();
    } catch (IOException ignored) {}

    try {
      out.close(reporter);
    } catch (IOException ignored) {}

    throw ioe;
  }
}
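runOldReducer() also creates an OutputCollector, but unlike the OutputCollector in MapTask this one wraps a RecordWriter: collect calls write on the RecordWriter, which writes the result to HDFS. ReduceTask uses the KeyClass, the ValueClass and the KeyComparator to build the Iterator over values that is handed to the Reducer's reduce.

Overall, a MapReduce job runs as Map --> Shuffle --> Reduce. Map and Reduce are supplied by the user; Shuffle is the stage that carries the data from Map to Reduce, and it is a key part of how a job executes.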
16
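1. Progress: for a map task, progress is the proportion of the input that has been processed. For a reduce task it is less direct, but the system can still estimate the proportion of the reduce input that has been processed.
2. Status updates: a task reports its progress to its TaskTracker; when the progress has changed, the report is made every 3 seconds. The TaskTracker sends a heartbeat to the JobTracker every 5 seconds, and the status of the tasks on that TaskTracker travels to the JobTracker with it. The 5 seconds are a minimum. The JobTracker combines these updates into a global view of the running jobs.
3. The client obtains the latest status by querying the JobTracker.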
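Failures:

In practice user code has bugs, processes crash and machines fail. Hadoop handles the following failures.

1. map or reduce task failure: when user code throws an exception, the child JVM reports the error to its parent TaskTracker before it exits, and the error is written to the user log. The TaskTracker marks the task attempt as failed and frees the slot for another task.
2. Sudden exit caused by a JVM bug: the TaskTracker notices that the process has exited and marks the attempt as failed. Hanging tasks are handled by a timeout: a task that reports no progress for 10 minutes is marked as failed. Setting the timeout to 0 disables it, so that a long-running task is never marked as failed. When the JobTracker is told that a task attempt has failed, it reschedules the task and tries to avoid the TaskTracker on which the task failed before. A task that fails 4 times is not retried again.
3. TaskTracker failure: a TaskTracker that crashes or runs very slowly stops sending heartbeats, and the JobTracker removes it from the set of TaskTrackers it schedules on. The JobTracker also reruns the map tasks that had completed on that TaskTracker. A TaskTracker that has not failed can still be blacklisted by the JobTracker if too many tasks fail on it.
4. JobTracker failure: this is the most serious failure. Hadoop has no mechanism for handling a JobTracker failure; the JobTracker is a single point of failure. A later version may run several JobTrackers, only one of which is the primary JobTracker.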
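The retry rule for failed task attempts can be sketched as a bounded loop. This is a toy model and not Hadoop's scheduler: `AttemptRetrySketch` and `runWithRetries` are hypothetical names, the attempt is represented by a predicate, and the limit of 4 is simply the value quoted in the text.

```java
import java.util.function.IntPredicate;

// Hypothetical model: retry a task until one attempt succeeds or the limit is reached.
final class AttemptRetrySketch {

    /** Returns the 1-based number of the successful attempt, or -1 if every attempt failed. */
    static int runWithRetries(int maxAttempts, IntPredicate attemptSucceeds) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (attemptSucceeds.test(attempt)) {
                return attempt;   // this attempt succeeded, the task is done
            }
            // otherwise the attempt is marked failed and the task is rescheduled
        }
        return -1;                // limit reached: the task is not retried again
    }

    public static void main(String[] args) {
        // A task whose first two attempts fail and whose third succeeds.
        System.out.println(runWithRetries(4, attempt -> attempt == 3));
        // A task that always fails gives up after 4 attempts.
        System.out.println(runWithRetries(4, attempt -> false));
    }
}
```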