
University of Science and Technology of China

A dissertation for master's degree

Research and Implementation on Technologies of Metadata Management in Distributed File System

Author's Name: Youle Feng
Speciality: Network Communication System & Control
Supervisor: Prof. Ming Zhu
Finished time: May 10th, 2010


LandFile

863
2008AA01A317 LandFile

Abstract

By connecting many computers together, a distributed file system can provide
storage service with a uniform interface, large capacity, high performance,
high availability, and high scalability, which meets the needs of large-scale
applications. Distributed file systems have long been a difficult and popular
research topic in the storage field. To address the different access
characteristics of file data and metadata, recent distributed file systems are
divided into two parts: the data storage system and the metadata management
system. To handle increasing metadata requests while keeping metadata
operations coherent and reliable, the design and implementation of the metadata
management system is critical. This dissertation designs and implements the
metadata management system of the distributed file system LandFile, and studies
the key problems of metadata management in distributed file systems and their
solutions. It focuses on the following aspects:
The energy-saving load balancing strategy of the metadata server cluster: in
large-scale applications, energy cost has become an increasingly important
problem. Based on dynamic metadata management, we propose an energy-saving load
balancing strategy. When the overall load is low, the workload of two metadata
servers is merged onto one and the other is turned off, which reduces overall
energy consumption. The sleeping metadata server is added back into the cluster
when the overall load increases. The carefully designed dynamic metadata
management ensures a smooth transition when a server is added or removed.
Experiments show that energy cost is dramatically reduced in energy-saving mode
while performance is not affected.
The coherence of metadata management: metadata may have multiple replicas in
the system, and a few kinds of metadata operations may involve two metadata
servers. We designed a primary-node-based cache structure and implemented the
update strategy for multiple replicas. We also implemented a two-phase commit
protocol to ensure the coherence of distributed metadata operations.
High-reliability metadata management: we studied node management, failure
detection, and failure recovery methods for metadata servers. With a node
management strategy based on region autonomy and log-based failure recovery, a
failed node can be replaced and recovered in a short time, and the system
service remains available during the failure.


Using the load balancing, coherence, and reliability strategies, we designed
and implemented the metadata management system, which includes the metadata
operation server module and the journal module; we also refined the load
balancing module.
The metadata management system studied here is an important component of the
863 project "The Development of the New Generation Collaborative Supporting
Environment of Business Operation, Management and Control".


Keywords: distributed file system, metadata management, load balancing,
energy saving, cache coherence, log-based recovery


VOD     Video On Demand
DAS     Direct-Attached Storage
NFS     Network File System
VFS     Virtual File System
OSD     Object-based Storage Devices
MDS     Metadata Server
BS      Binding Server
MAID    Massive Arrays of Idle Disks



Abstract
1
1.1
1.2
1.2.1 NFS
1.2.2 Coda
1.2.3 xFS
1.2.4 GPFS
1.2.5 Lustre
1.2.6 Google File System
1.2.7 Dawning Cluster File System (DCFS2)
1.2.8 (BWFS)
1.3
1.4
2
2.1
2.2
2.2.1
2.2.2
2.2.3
2.3
2.3.1 MDS
2.3.2
2.3.3
2.4
2.5
2.6
3
3.1
3.2
3.2.1
3.2.2 MDS
3.2.3 MDS
3.3
3.4
3.4.1
3.4.2 MDS
3.4.3
3.4.4 MDS
3.5
3.6
3.6.1
3.6.2
3.7
4
4.1
4.1.1
4.1.2
4.1.3
4.2
4.2.1
4.2.2
4.3
5 LandFile
5.1 LandFile
5.1.1 LandFile
5.1.2 LandFile
5.1.3 LandFile
5.2 /
5.3
5.4
5.4.1
5.4.2
5.4.3
5.5
6
6.1
6.2

1.1 NFS
1.2 NFS
1.3 xFS
1.4 Lustre
1.5 Google File System
1.6 DCFS2
1.7
2.1
2.2
3.1 web
3.2
3.3
3.4
4.1
5.1 LandFile
5.2 LandFile
5.3 LandFile
5.4
5.5
5.6
5.7
5.8

1
1.1

30% 114%

VOD
(I/O )
Direct-Attached Storage,
DAS

863

LandFile

1.2

AFSCodaLocusSprite File
SystemSun NFS

Internet

NFS3 NFS4FrangipaniPetal

Global File SystemPVFSxFSGPFSGeneral Parallel File SystemStorage


TankSliceLustreGoogle File System
P2P P2P
P2P

OceanStore

D2K-COSMOSD3K-COSMOSDCFS
DCFS2

P2P
Granary MDNMedia Distribution Network

1.2.1 NFS
[1]

[2]

[3]

[4]

Network File SystemNFS Sun

Microsystems 1985 Solaris NFSv4


Sun Microsystems NFS NFS UNIX/Linux
Windows
,

1.1 NFS

1.1 Unix Virtual File SystemVFS

VFS

NFSv3 stateless

cache NFSv4
statefullease

1.2 NFS

NFS 1.2
windows cifs

NFS

1.2.2 Coda
Coda[7] 1987 CMU AFS-2
Coda

Coda

Coda

Coda
Coda
Resolution

1.2.3 xFS
xFS[8]
xFS
invalidation-based write back
aggressive client caching

xFS (Serverless) 1.3

1.3 xFS

1.2.4 GPFS
General Parallel File System[9] GPFS IBM
shared-disk
iSCSI GPFS serverless

extensible hashing
GPFS

GPFS

1.2.5 Lustre
Lustre[10] Cluster File System
POSIX UNIX
1.4
Object-based Storage
DevicesOSDLustre
Lustre
OSD

1.4 Lustre

Lustre Lustre

1.2.6 Google File System


Google File System[11]
Google

PC Google
GmailVideo Blog Google File System
1.5 Google File System master

chunkserver Google File System


master chunkserver
Master
master
leaseorphaned chunkschunkserver
Master HeartBeat chunkserver
chunkservers client master
metadata chunkserver
chunkserver
master master
Google File System
master

1.5 Google File System

1.2.7 Dawning Cluster File System (DCFS2)

1.6 DCFS2

DCFS2[14]

IP-SAN

DCFS2 1.6

1.2.8 (BWFS)

1.7

[12]

IP-SAN NAS

1.7

1.3

LandFile

MDS MDS
MDS

MDS
MDS
MDS

1.4

10

50 80[15]

2.1
/

NFS

2.1
I/O


2.1

PB TB

2.2

LandFile


,
id lookup create unlink
stat

POSIX UNIX

2.2

2.2.1
MDS
MDS MDS

hash

NFSAFS
Coda
MDS

rename MDS

MDS MDS

MDS MDS

hash hash
hashLustrezFSIntermezzo hash
hash

MDS hash
hash

rename

/ MDS
MDS
hash
hash

hash DOIDFH

[16]

(Directory Object IDentifier and Filename Hashing)


Hash id hash
MDS

MDS MDS

2.2.2

MDS MDS
MDS
MDS

[12]

BS MDS MDS BS
MDS BS MDS

MDS

[13] UCSC , CEPH


MDS
MDS

MDS
MDS
MDS

MDS

MDS

2.2.3
LandFile MDS
MDS
MDS MDS

MDS

2.3

MDS MDS

MDS

MDS

2.3.1 MDS

MDS
MDS MDS


MDS
MDS MDS

MDS

MDS

2.3.2

hash
MDS

MDS

Coda

MDS
CEPH
CEPH

MDS

2.3.3
MDS

MDS MDS
MDS
MDS MDS

MDS

MDS
MDS MDS
MDS MDS
MDS
MDS

CPU

MDS MDs

2.4

MDS

MDS
MDS
MDS


2.5

2.6


3.1

IDC
300 1 0.64
1659
100
50
50
0.64 100 55
15%20%
IT
Google
Google

INTELIBM

SNIA

Massive Arrays of Idle Disks


MAID
MAID

MAID
MAID
Power OFF
MAID
Nearline Storage

3.2
3.2.1

web

MDS MDS

3.2.2 MDS

3.1 web

MDS

[Figure 3.1: web; y-axis 0 to 2000, x-axis 1 to 23]

3.2

3.2
CPU 5%
CPU 95%


3.2.3 MDS
MDS
1

MDS

MDS

MDS

MDS

MDS
MDS MDS
MDS
MDS MDS

3.3
MDS MDS

MDS

MDS L
CPU

CPU
CPU
i lcpu lmem lbw CPU
1
Li [0,1] Li
MDS Li 0 MDS Li 1 MDS

$$L_i = \alpha \cdot l_{cpu} + \beta \cdot l_{mem} + \gamma \cdot l_{bw} \tag{3.1}$$

p
readstat
mkdir, write

t
f n

$$p_{new} = p_{old} \cdot f(t) + 1$$

$$p_{ancestor\_new} = p_{ancestor\_old} \cdot f(t) + \frac{1}{2^{n}} \tag{3.2}$$
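
The popularity update can be sketched in code. This is a minimal illustration, not LandFile's implementation: each access multiplies the old counter by a decay factor f(t) and adds 1 to the accessed entry, while the ancestor n levels up receives 1/2^n. The exponential half-life form of `decay`, the `Popularity` struct, and the function names are assumptions introduced here; the text only names f(t).

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical decay function f(t): exponential decay with a half-life.
// The thesis only names f(t); this particular form is an assumption.
inline double decay(double elapsed_sec, double half_life_sec) {
    assert(elapsed_sec >= 0.0 && half_life_sec > 0.0);
    return std::pow(0.5, elapsed_sec / half_life_sec);
}

// Popularity counter of one directory entry (illustrative type).
struct Popularity {
    double value;        // current popularity p
    double last_update;  // time of the last update, in seconds
};

// Apply p_new = p_old * f(t) + increment, with t = time since the last update.
inline void hit(Popularity& p, double now, double increment, double half_life_sec) {
    p.value = p.value * decay(now - p.last_update, half_life_sec) + increment;
    p.last_update = now;
}

// Record one access on path.back() and propagate it to the ancestors.
// path[0] is the root. The accessed entry gets +1; the ancestor n levels
// above it gets +1/2^n.
inline void record_access(const std::vector<Popularity*>& path, double now,
                          double half_life_sec) {
    double increment = 1.0;
    for (auto it = path.rbegin(); it != path.rend(); ++it) {
        hit(**it, now, increment, half_life_sec);
        increment /= 2.0;
    }
}
```

With a half-life of 10 seconds, two accesses to the same file 10 seconds apart leave the file at 1 * 0.5 + 1 = 1.5, so old traffic fades instead of accumulating forever.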

MDS PMDS

i MDS Pi 3.2
t p
$$P_{i\_new} = P_{i\_old} \cdot f(t) + p \tag{3.2}$$

MDS sMDS

MDS

MDS MDS
MDS i MDS
Ci MDS Pi MDS Li Ci
MDS Li 1

$$C_i = \frac{P_i}{L_i} \tag{3.3}$$

MDS O MDS
MDS MDS
MDS MDS MDS
MDS 3.3

3.4 3.5 Ci MDS Li 1


Ci MDS Pi
MDS MDS

C L
i

i 1
n

C L P
i 1
n

3.4

i 1

nMDS

Ci
i 1

i 1
n

Ci

nMDS

3.5

i 1

cost MDS

MDS MDS
MDS
MDS MDS
MDS MDS
MDS MDS MDS

3.4
MDS
MDS
MDS

MDS
MDS

3.4.1
MDS
MDS MDS
MDS MDS

MDS
MDS MDS

MDS

3.4.2 MDS
MDS MDS MDS
MDS MDS
MDS MDS
MDS
MDS 3.5
MDS Pi MDS
Pi
j MDS O j 3.6 Pi

Ci C j O j
MDS MDS MDS
MDS
$$O_j = \frac{\sum_{i=1}^{n} P_i}{\left(\sum_{i=1}^{n} C_i\right) - C_j} \tag{3.6}$$

($n$: number of MDS)
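
As a rough sketch of how formula 3.6 might be evaluated, the code below computes the capacity C_i = P_i / L_i of each MDS and the load level O_j = sum(P_i) / (sum(C_i) - C_j) the remaining servers would reach if MDS j were removed. The type `MdsStat`, the function names, and the `can_sleep` threshold check are hypothetical; they are not taken from LandFile.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-MDS statistics; names are illustrative, not LandFile identifiers.
struct MdsStat {
    double popularity;  // P_i: popularity currently served by this MDS
    double load;        // L_i: normalized load in (0, 1]
};

// C_i = P_i / L_i: the popularity this MDS could serve at full load.
inline double capacity(const MdsStat& m) {
    assert(m.load > 0.0 && m.load <= 1.0);
    return m.popularity / m.load;
}

// O_j = sum(P_i) / (sum(C_i) - C_j): the load level of the remaining
// servers if MDS j were put to sleep and its work spread evenly.
inline double ideal_load_without(const std::vector<MdsStat>& cluster, std::size_t j) {
    assert(j < cluster.size());
    double total_popularity = 0.0;
    double total_capacity = 0.0;
    for (const MdsStat& m : cluster) {
        total_popularity += m.popularity;
        total_capacity += capacity(m);
    }
    const double remaining = total_capacity - capacity(cluster[j]);
    assert(remaining > 0.0);  // at least one other MDS must stay awake
    return total_popularity / remaining;
}

// Hypothetical policy check: MDS j may sleep only if the resulting load
// level stays at or below a configured ceiling o_max.
inline bool can_sleep(const std::vector<MdsStat>& cluster, std::size_t j, double o_max) {
    return ideal_load_without(cluster, j) <= o_max;
}
```

For three servers with (P, L) = (100, 0.25), (50, 0.25), (30, 0.125), the capacities are 400, 200, and 240. Removing the third gives O = 180 / 600 = 0.3, while removing the first gives 180 / 440, about 0.41.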

3.6 MDS
MDS
MDS
MDS MDS
B
__

B
i 1

| Li L |
__

| Li O |
nMDS
O
i 1
n

3.7

MDS B 0
MDS MDS
B 3.8


3
n

B n 2 Ci , nMDS

3.8

i 1

j MDS k MDS B jk
3.9 O '
MDS MDS 3.5
MDS O ' 3.9

| Li O ' |
B jk

O'
i 1,i j , k
n

| L O'|
i

O'
i 1,i j
n

| L O'|
i

O'
i 1,i j
n

| Lk

C j Lj

C j Lj

O'|

| Lk

| Lk

Ck
O'
C j Lj
Ck

Ck
O'

O'|

| Lk O ' |
O'

3.9

O ' | | Lk O ' |
O'

(nMDS)

T jk 3.10 MDS T jk MDS


$$T_{jk} = \left| L_k + \frac{C_j L_j}{C_k} - O' \right| - \left| L_k - O' \right| \tag{3.10}$$
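
One possible reading of formula 3.10 is sketched below: T_jk is the change in MDS k's distance from the target load O' when the whole workload of MDS j (popularity C_j * L_j) is merged onto it, and the receiver is the k with the smallest T_jk. The operators in this reading are reconstructed from a damaged formula, so treat it as an assumption. The `Mds` struct, the function names, and the guard that skips receivers whose merged load would exceed 1 are also illustrative additions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative per-MDS record.
struct Mds {
    double capacity;  // C: popularity served at full load
    double load;      // L: normalized load in [0, 1]
};

// Load of k after absorbing all of j's work: L_k + C_j * L_j / C_k.
inline double merged_load(const Mds& j, const Mds& k) {
    assert(k.capacity > 0.0);
    return k.load + j.capacity * j.load / k.capacity;
}

// T_jk = |L_k + C_j L_j / C_k - O'| - |L_k - O'| (assumed reading of 3.10).
inline double imbalance_delta(const Mds& j, const Mds& k, double o_prime) {
    return std::fabs(merged_load(j, k) - o_prime) - std::fabs(k.load - o_prime);
}

// Returns the index of the receiver with the smallest T_jk, or -1 if no
// other MDS can absorb j without exceeding full load (assumed guard).
inline int pick_receiver(const std::vector<Mds>& cluster, std::size_t j, double o_prime) {
    assert(j < cluster.size());
    int best = -1;
    double best_t = 0.0;
    for (std::size_t k = 0; k < cluster.size(); ++k) {
        if (k == j) continue;
        if (merged_load(cluster[j], cluster[k]) > 1.0) continue;
        const double t = imbalance_delta(cluster[j], cluster[k], o_prime);
        if (best < 0 || t < best_t) {
            best = static_cast<int>(k);
            best_t = t;
        }
    }
    return best;
}
```

A negative T_jk means the merge moves the receiver closer to the target load, which is why the minimum is preferred.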

T jk MDS MDS
MDS MDS
MDS
T jk MDS
x MDS MDS T jk
T jk MDS MDS x
MDS m
cn MDS Omax
m 3.12 m MDS
MDS
m 1

m 1

i 1

i 1

i 1

Omax * ci Pi Omax * ci

x nm

x 3.13

3.12
3.13

MDS 3.11
MDS MDS MDS

3.4.3
MDS MDS
MDS MDS T1MDS
MDS MDS
MDS T2MDS
t MDS
MDS MDS
MDS MDS MDS
MDS MDS

MDS
1

a) MDS MDS
SendNode
MDS SendNode
b) MDS m MDS

x
c) MDS x MDS
d) SendNode O '
e) MDS T jk T jk MDS
ReceiveNode
f)
2

SendNode

a) SendNode
b) SendNode MDS
SendNode SendNode
SendNode
SendNode
ReceiveNodeMDS
MDS


c) SendNode

MDS
d)
e) ReceiveNode

SendNode ReceiveNode
ReceiveNode MDS

f)

ReceiveNode MDS SendNode

g) SendNode
ReceiveNode
h) ReceiveNode
i)

ReceiveNode

j)

ReceiveNode

a) SendNode

b) SendNode MDS
MDS
SendNode
SendNode
ReceiveNode
ReceiveNode

c)
d)
e) MDS
f)


3.4.4 MDS
MDS MDS MDS
MDS MDS MDS
MDS MDS

MDS MDS
MDS WakeupNode
WakeupNode MDS

MDS MDS SendNode


SendNode LSendNode PSendNode MDS O
SendNode CWakeupNode P
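$$\Delta P_{SendNode} = P_{SendNode} \cdot \frac{L_{SendNode} - O}{L_{SendNode}} \tag{3.12}$$

$$\Delta P_{WakeupNode} = C_{WakeupNode} \cdot O \tag{3.13}$$

$$\Delta P = \min(\Delta P_{SendNode}, \Delta P_{WakeupNode}) \tag{3.14}$$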
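
The three formulas can be combined into one small helper. This is a sketch under assumptions: SendNode sheds the popularity above the target load O, P_SendNode * (L_SendNode - O) / L_SendNode; WakeupNode can take at most C_WakeupNode * O; the amount migrated is the smaller of the two. The function name and the early return when SendNode is already at or below O are not from the text.

```cpp
#include <algorithm>
#include <cassert>

// Popularity to move from SendNode to a freshly woken MDS.
//   p_send   : P_SendNode, popularity currently served by SendNode
//   l_send   : L_SendNode, current load of SendNode, in (0, 1]
//   c_wakeup : C_WakeupNode, capacity of the woken MDS
//   o        : O, target load level of the cluster
inline double popularity_to_migrate(double p_send, double l_send,
                                    double c_wakeup, double o) {
    assert(l_send > 0.0 && o >= 0.0 && c_wakeup >= 0.0);
    if (l_send <= o) {
        return 0.0;  // assumption: nothing to shed at or below the target
    }
    const double surplus = p_send * (l_send - o) / l_send;  // formula 3.12
    const double room = c_wakeup * o;                       // formula 3.13
    return std::min(surplus, room);                         // formula 3.14
}
```

For example, with P_SendNode = 300, L_SendNode = 0.75, O = 0.5 and C_WakeupNode = 150, the surplus is 100 but the woken node only has room for 75, so 75 is migrated.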

3.12 SendNode PSendNode 3.13


WakeupNode PWakeupNode
P

SendNode P
WakeUpNode

3.5

2.2.2

MDS MDS

MDS MDS

MDS
MDS MDS
MDS MDS
MDS

MDS

MDS
MDS MDS
MDS

MDS
MDS MDS
MDS MDS

MDS MDS
MDS
MDS
MDS

MDS


MDS

MDS MDS

3.6

MDS0MDS1MDS2MDS3MDS4 MDS5 MDS0~2 PC


25W 150W 200WMDS3~4
32W 200W 260W MDS3~4

MDS
MDS

3.6.1

10
3

24

0000~23:59
4

MDS

10
3

24

0000~23:59
4

MDS


3.6.2

[Figure 3.3: y-axis in W (0 to 1000), x-axis in h (1 to 23)]

[Figure 3.4: y-axis in percent (0% to 120%), x-axis in h]

3.3
5

3.4 2 7


3.7


MDS

4.1
4.1.1

MDS
MDS

MDS MDS

MDS
MDS
MDS inode
MDS inode MDS inode
MDS MDS

MDS
MDS MDS
MDS


MDS MDS
k

k MDS MDS MDS

MDS

4.1.2
MDS
MDS
MDS MDS
MDS
MDS MDS
MDS

LOCK_SYNC

LOCK_LOCK

LOCK_ GLOCKR LOCK_SYNC LOCK_LOCK

SYNC

GLOCKR

LOCK

4.1

1
LOCK_ GLOCKR

2
LOCK_LOCK ACK
LOCK_LOCK

3
LOCK_LOCK LOCK_SYNC
LOCK_LOCK LOCK_SYNC

4.1.3
MDS MDS
linkunlink rename MDS

1.

2.

3. 1

4.

4.2

4.2.1

MDS
MDS

MDS MDS MDS



MDS
MDS MDS

MDS MDS
MDS MDS
MDS
MDS

4.2.2
MDS
MDS MDS

MDS
MDS
MDS MDS
MDS

MDS
MDS
MDS
MDS

MDS


MDS MDS

MDS

renamelinkunlink

MDS

MDS
2

MDS MDS

MDS MDS
MDS MDS
MDS MDS
MDS

MDS

MDS
4

MDS MDS

MDS MDS MDS

MDS
MDS MDS


MDS MDSMDS
MDS MDS
MDS

4.3

MDS


5 LandFile

LandFile

5.1 LandFile
5.1.1 LandFile
LandFile

GUI API

5.1 LandFile


5.1.2 LandFile
5.2

5.2 LandFile

5.1.3 LandFile
5.3
/
9

/

5.3 LandFile

5.2 /
/
/

LandFile [17]
/

MDS
MDS


MDS MDS
3

MDS

5.3

MDS
MDS 5.4

1 /usr/test/ 5.4 MDS1


/usr/test/ 0
5.4 MDS0
1 /usr/test//usr/test/
/usr/test/
1 /usr /tmp
1 /usr/test/file0

MDS MDS
MDS MDS MDS
MDS


5.4

5.4 5.5 5.6


5.4
5.5 5.6
/usr/test/dir2

(0,1)

(0,1)

/usr

(0,1)

/usr

(0,1)

/usr/test

(1,0)

/usr/test/dir2

(1)

5.5

idid

5.6

0 /usr/test/

/usr/test/ 1
1
2

1 /usr/test/

/usr/test/ dir2

/usr/test/dir2/
4


5.4
5.7

5.7

5.4.1

5.8


MDS

5.8

5.4.2

EntrySubtreeMap
struct ESubtreeMap {
    bufferlist data;                    // serialized metadata of the subtrees
    map<dirId, list<dirId> > subtrees;  // subtree root -> bounds of that subtree
};
utimechmodmknodopencmkdir
MDS linkunlinkrename
MDS

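struct EentryUpdate {
    bufferlist data;        // serialized metadata changes of the operation
    string type;            // operation type, e.g. mkdir, link
    bufferlist client_map;  // client map, used by rename
    reqid_t reqid;          // request id
    bool had_helpers;       // whether other MDSs took part in the operation
};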
had_helper false
MDS link
unlink rename MDS MDS
MDS
EentryUpdate had_helper true
EntryCommittedEntryCommitted
id
struct EntryCommitted {
    reqid_t reqid;  // id of the committed request
};
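
To illustrate how the had_helpers flag and the EntryCommitted entry could work together during log replay, the sketch below scans a journal and reports the multi-MDS updates that have no matching commit record. It is a simplified assumption, not the LandFile recovery code: `LogEntry`, `EntryKind`, and `find_uncommitted` are hypothetical, and a plain 64-bit integer stands in for reqid_t.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Simplified stand-ins for the journal entry types described in the text.
enum class EntryKind { SubtreeMap, Update, Committed };

struct LogEntry {
    EntryKind kind;
    std::uint64_t reqid;  // stand-in for reqid_t; unused for SubtreeMap
    bool had_helpers;     // only meaningful for Update entries
};

// Replays the log in order and returns the request ids of updates that
// involved helper MDSs but were never followed by a Committed entry.
inline std::set<std::uint64_t> find_uncommitted(const std::vector<LogEntry>& log) {
    std::set<std::uint64_t> pending;
    for (const LogEntry& e : log) {
        switch (e.kind) {
        case EntryKind::SubtreeMap:
            break;  // carries no request id
        case EntryKind::Update:
            if (e.had_helpers) {
                pending.insert(e.reqid);  // needs a later Committed entry
            }
            break;
        case EntryKind::Committed:
            pending.erase(e.reqid);  // the distributed operation finished
            break;
        }
    }
    return pending;
}
```

Updates with had_helpers set to false touch only the local MDS, so the sketch treats them as complete as soon as they are logged.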
MDS
EntryHelperUpdate op

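struct EntryHelperUpdate {
    bufferlist commit;    // data applied if the operation commits
    bufferlist rollback;  // data applied if the operation is rolled back
    string type;          // operation type
    reqid_t reqid;        // request id
    __s32 mdsId;          // MDS id
    __u8 op;              // which step of the operation this entry records
    __u8 origop;          // original operation: link or rename
};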
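
Because EntryHelperUpdate carries both a commit and a rollback buffer, the helper side of a two-phase commit could look roughly like the sketch below. This is a generic illustration of the technique rather than LandFile's protocol: `HelperMds`, `PreparedOp`, and `Decision` are hypothetical names, strings stand in for bufferlist, and a 64-bit integer stands in for reqid_t.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

enum class Decision { Commit, Rollback };

// What a helper remembers between prepare and the coordinator's decision.
struct PreparedOp {
    std::string commit;    // stand-in for the serialized commit buffer
    std::string rollback;  // stand-in for the serialized rollback buffer
};

class HelperMds {
public:
    // Phase 1: record both outcomes and vote yes.
    // Returns false if this request id was already prepared.
    bool prepare(std::uint64_t reqid, std::string commit, std::string rollback) {
        return prepared_
            .emplace(reqid, PreparedOp{std::move(commit), std::move(rollback)})
            .second;
    }

    // Phase 2: apply the buffer selected by the coordinator's decision.
    // Returns false if the request id is unknown.
    bool finish(std::uint64_t reqid, Decision decision) {
        auto it = prepared_.find(reqid);
        if (it == prepared_.end()) {
            return false;
        }
        applied_.push_back(decision == Decision::Commit ? it->second.commit
                                                        : it->second.rollback);
        prepared_.erase(it);
        return true;
    }

    const std::vector<std::string>& applied() const { return applied_; }
    std::size_t in_doubt() const { return prepared_.size(); }

private:
    std::map<std::uint64_t, PreparedOp> prepared_;  // prepared, not yet decided
    std::vector<std::string> applied_;              // buffers applied so far
};
```

Keeping the rollback buffer alongside the commit buffer means a helper that restarts between the two phases can still undo its part once it learns the coordinator's decision.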

MDS EntryExport MDS


EntryImportStart MDS
EntryImportFinish
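struct EntryExport {
    bufferlist dir;     // serialized metadata of the exported directory
    dirId base;         // id of the exported subtree root
    set<dirId> bounds;  // bounds of the exported subtree
};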
MDS EntryExport


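struct EImportStart {
    dirId base;             // id of the imported subtree root
    list<dirId> bounds;     // bounds of the imported subtree
    bufferlist data;        // serialized metadata of the imported subtree
    bufferlist client_map;  // client map
};
struct EImportFinish {
protected:
    dirId base;    // id of the imported subtree root
    bool success;  // whether the import succeeded
};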
MDS EntryImportStart EntryImportFinish
EntryImportFinish

5.4.3


EntrySubtreeMap
EntrySubtreeMap

int recover(MDSCache *cache)

0 -1

5.5
LandFile


6
6.1

LandFile


6.2


[1] Sun Microsystems, Inc. NFS: Network File System Protocol Specification, March 1989.

[2] B. Callaghan, B. Pawlowski, P. Staubach, Sun Microsystems, Inc. NFS Version 3 Protocol Specification, June 1995.

[3] S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, Sun Microsystems, Inc.; C. Beame, Hummingbird Ltd.; M. Eisler, Zambeel, Inc.; D. Noveck, Network Appliance, Inc. NFS Version 4 Protocol Specification, December 2000.

[4] R. Sandberg. The Sun Network File System: Design, Implementation and Experience. In Proceedings of USENIX Summer Conference, summer 1987. University of California Press, pp. 300-313.

[5] . Coda : []. : . 2005.

[6] . : []. : . 2006.

[7] Braam, P. J. The Coda Distributed File System. Linux Journal, June 1998.

[8] Thomas E. Anderson, Michael D. Dahlin, Jeanna M. Neefe, David A. Patterson, Drew S. Roselli, and Randolph Y. Wang. Serverless Network File Systems. Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, December 1995: 109-126.

[9] Frank Schmuck and Roger Haskin. GPFS: A Shared-Disk File System for Large Computing Clusters. Proceedings of the Conference on File and Storage Technologies (FAST '02), January 2002: 231-244.

[10] Peter J. Braam, Rumi Zahir. Lustre Technical Project Summary, Version 2. July 29, 2001.

[11] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. Symposium on Operating Systems Principles (SOSP).

[12] . : []. : . 2003.

[13] Sage A. Weil. Ceph: Reliable, Scalable, and High-Performance Distributed Storage. Ph.D. thesis, University of California, Santa Cruz, December 2007.

[14] . : []. : . 2005.

[15] Drew Roselli, Jay Lorch, and Tom Anderson. A Comparison of File System Workloads. In Proceedings of the 2000 USENIX Annual Technical Conference, pages 41-54, San Diego, CA, June 2000. USENIX Association.

[16] Dan Feng, Juan Wang, Fang Wang, Peng Xia. DOIDFH: an Effective Distributed Metadata Management Scheme. The 5th International Conference on Computational Science and Applications, 2007, pages 245-252.

[17] . : []. : . 2009.


SA0710

2010 5


[1] , . CEPH . .

[1] , . (No:
200910236456.7).

2008 7 2010 5 863


2008AA01A317

