Youle Feng
Finished time:
___________
_______________
____
_______________
_______________
_______________
_______________
LandFile
863
2008AA01A317 LandFile
Abstract
By connecting many computers together, a distributed file system provides
storage service with a uniform interface, large capacity, high performance,
high availability, and high scalability, which meets the needs of large-scale applications.
Distributed file systems have long been a difficult and active research topic in the storage field.
Because file data and metadata have different access characteristics, recent distributed
file systems are divided into two parts: the data storage system and the metadata
management system. To handle a growing number of metadata requests while keeping
metadata operations coherent and reliable, the design and implementation of the metadata
management system is critical. This thesis designs and implements the
metadata management system of the distributed file system LandFile, and studies the key
problems of metadata management in distributed file systems and their solutions.
It focuses on the following aspects:
Energy-saving load balancing for the metadata server cluster: in
large-scale applications, energy cost has become an increasingly important problem.
Building on dynamic metadata management, we propose an energy-saving load
balancing strategy. When the overall load is low, the workload of two metadata servers
is merged onto one and the other metadata server is turned off, which reduces overall
energy consumption. The sleeping metadata server rejoins the cluster when
the overall load increases. The carefully designed dynamic metadata management
ensures a smooth transition when a server is added or removed. Experiments show
that energy cost is dramatically reduced in energy-saving mode while
performance is not affected.
Coherence of metadata management: metadata may have multiple replicas in
the system, and a few kinds of metadata operations involve two metadata servers. We
designed a primary-node-based cache structure and implemented the update strategy
for multiple replicas. We also implemented a two-phase commit protocol to ensure the
coherence of distributed metadata operations.
Highly reliable metadata management: we studied node management,
failure detection, and failure recovery for metadata servers. With a node
management strategy based on region autonomy and log-based failure recovery, a
failed node can be replaced and recovered in a short time, and the system service is still
of
the
New
Generation
Collaborative Supporting
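VOD     Video On Demand
DAS     Direct Access Storage
NFS     Network File System
VFS     Virtual File System
OSD     Object-based Storage Device
MDS     MetaData Server
BS      Binding Server
MAID    Massive Array of Idle Disks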
Abstract
1
1.1
1.2
1.2.1 NFS
1.2.2 Coda
1.2.3 xFS
1.2.4 GPFS
1.2.5 Lustre
1.2.6 Google File System
1.2.7 Dawning Cluster File System (DCFS2)
1.2.8 (BWFS)
1.3
1.4
2
2.1
2.2
2.2.1
2.2.2
2.2.3
2.3
2.3.1 MDS
2.3.2
2.3.3
2.4
2.5
2.6
3
3.1
3.2
3.2.1
3.2.2 MDS
3.2.3 MDS
3.3
3.4
3.4.1
3.4.2 MDS
3.4.3
3.4.4 MDS
3.5
3.6
3.6.1
3.6.2
3.7
4
4.1
4.1.1
4.1.2
4.1.3
4.2
4.2.1
4.2.2
4.3
5 LandFile
5.1 LandFile
5.1.1 LandFile
5.1.2 LandFile
5.1.3 LandFile
5.2 /
5.3
5.4
5.4.1
5.4.2
5.4.3
5.5
6
6.1
6.2
1
1.1
30% 114%
VOD
(I/O )
Direct Access Storage,
DAS
863
LandFile
1.2
AFSCodaLocusSprite File
SystemSun NFS
Internet
NFS3 NFS4FrangipaniPetal
1
OceanStore
D2K-COSMOSD3K-COSMOSDCFS
DCFS2
P2P
Granary MDNMedia Distribution Network
1.2.1 NFS
[1]
[2]
[3]
[4]
1.1 NFS
VFS
2
NFSv3 stateless
cache NFSv4
statefullease
1.2 NFS
NFS 1.2
windows cifs
NFS
1.2.2 Coda
Coda[7] 1987 CMU AFS-2
Coda
Coda
Coda
Coda
Coda
Resolution
1.2.3 xFS
xFS[8]
xFS
invalidation-based write back
aggressive client caching
1.3 xFS
1.2.4 GPFS
General Parallel File System[9] GPFS IBM
shared-disk
iSCSI GPFS serverless
extensible hashing
GPFS
GPFS
1.2.5 Lustre
Lustre[10] Cluster File System
POSIX UNIX
1.4
Object-based Storage
DevicesOSDLustre
Lustre
OSD
1.4 Lustre
Lustre Lustre
PC Google
GmailVideo Blog Google File System
1.5 Google File System master
5
1.6 DCFS2
DCFS2[14]
IP-SAN
DCFS2 1.6
1.2.8 (BWFS)
1.7
[12]
IP-SAN NAS
1.7
1.3
LandFile
MDS MDS
MDS
MDS
MDS
MDS
1.4
10
50 80[15]
2.1
/
NFS
2.1
I/O
10
2.1
PB TB
2.2
LandFile
,
id lookup create unlink
stat
POSIX UNIX
2.2
2.2.1
MDS
MDS MDS
hash
NFSAFS
Coda
MDS
rename MDS
MDS MDS
MDS MDS
hash hash
hashLustrezFSIntermezzo hash
hash
MDS hash
hash
rename
/ MDS
MDS
hash
hash
hash DOIDFH
[16]
MDS MDS
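To make the trade-off of hash-based partitioning concrete, the following is a minimal sketch rather than any cited system's implementation. It assumes that the full path name is hashed and reduced modulo the number of MDSs; the function name `mds_for_path` and the use of `std::hash` are illustrative assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical static placement: the MDS responsible for a path is
// derived only from the hash of the full path name.
inline std::size_t mds_for_path(const std::string& path, std::size_t mds_count) {
    assert(mds_count > 0);
    return std::hash<std::string>{}(path) % mds_count;
}
```

Because the owner is computed from the path itself, renaming a directory changes the hash key of every entry below it, so those entries may have to move to other MDSs. This is the rename cost usually attributed to path-hash schemes.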
2.2.2
MDS MDS
MDS
MDS
[12]
BS MDS MDS BS
MDS BS MDS
MDS
MDS
MDS
MDS
MDS
MDS
2.2.3
LandFile MDS
MDS
MDS MDS
MDS
2.3
MDS MDS
MDS
MDS
2.3.1 MDS
MDS
MDS MDS
MDS
MDS MDS
MDS
MDS
2.3.2
hash
MDS
MDS
Coda
MDS
CEPH
CEPH
MDS
2.3.3
MDS
MDS MDS
MDS
MDS MDS
MDS
MDS
MDS MDS
MDS MDS
MDS
MDS
CPU
MDS MDs
2.4
MDS
MDS
MDS
MDS
2.5
2.6
3.1
IDC
300 1 0.64
1659
100
50
50
0.64 100 55
15%20%
IT
Google
Google
INTELIBM
SNIA
MAID
MAID
Power OFF
MAID
Nearline Storage
3.2
3.2.1
web
MDS MDS
3.2.2 MDS
3.1 web
MDS
[Figure 3.1: web (chart; vertical axis 0 to 2000, horizontal axis 1 to 23)]
3.2
3.2
CPU 5%
CPU 95%
3.2.3 MDS
MDS
1
MDS
MDS
MDS
MDS
MDS
MDS MDS
MDS
MDS MDS
3.3
MDS MDS
MDS
MDS L
CPU
CPU
CPU
i lcpu lmem lbw CPU
1
Li [0,1] Li
MDS Li 0 MDS Li 1 MDS
3.1
p
readstat
mkdir, write
t
f n
$$p_{new} = p_{old} \cdot f(t) + 1$$
$$p_{ancestor\_new} = p_{ancestor\_old} \cdot f(t) + 1/2^{n}$$
3.2
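The popularity update can be sketched as follows. This is an illustration under stated assumptions, not the LandFile code: the decay function f(t) is assumed to be exponential with a configurable half-life, n is assumed to be the distance from the accessed node to the ancestor, and the names `Popularity`, `decay_factor`, `hit` and `record_access` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical decayed popularity counter for one directory or file.
struct Popularity {
    double value = 0.0;
    double last_update = 0.0;  // time of the last update, in seconds
};

// Assumed form of f(t): the fraction of the old value that survives
// after `elapsed` seconds, halving once per `half_life`.
inline double decay_factor(double elapsed, double half_life) {
    assert(elapsed >= 0.0 && half_life > 0.0);
    return std::pow(0.5, elapsed / half_life);
}

// Apply p_new = p_old * f(t) + increment.
inline void hit(Popularity& p, double now, double half_life, double increment) {
    p.value = p.value * decay_factor(now - p.last_update, half_life) + increment;
    p.last_update = now;
}

// Record one access. path[0] is the accessed node and path[n] is its
// n-th ancestor: the node gains 1, the n-th ancestor gains 1 / 2^n.
inline void record_access(const std::vector<Popularity*>& path,
                          double now, double half_life) {
    double increment = 1.0;
    for (Popularity* node : path) {
        hit(*node, now, half_life, increment);
        increment /= 2.0;
    }
}
```

Halving the increment at each level keeps a hot leaf from inflating the popularity of distant ancestors as much as that of its own parent.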
MDS PMDS
i MDS Pi 3.2
t p
$$P_{i\_new} = P_{i\_old} \cdot f(t) + p$$
3.2
MDS sMDS
MDS
MDS MDS
MDS i MDS
Ci MDS Pi MDS Li Ci
MDS Li 1
$$C_i = P_i / L_i$$
3.3
MDS O MDS
MDS MDS
MDS MDS MDS
MDS 3.3
C L
i
i 1
n
C L P
i 1
n
3.4
i 1
nMDS
Ci
i 1
i 1
n
Ci
nMDS
3.5
i 1
cost MDS
MDS MDS
MDS
MDS MDS
MDS MDS
MDS MDS MDS
3.4
MDS
MDS
MDS
MDS
MDS
3.4.1
MDS
MDS MDS
MDS MDS
MDS
MDS MDS
MDS
3.4.2 MDS
MDS MDS MDS
MDS MDS
MDS MDS
MDS
MDS 3.5
MDS Pi MDS
Pi
j MDS O j 3.6 Pi
Ci C j O j
MDS MDS MDS
MDS
$$O_j = \frac{\sum_{i=1}^{n} P_i}{\left(\sum_{i=1}^{n} C_i\right) - C_j} \qquad (3.6)$$
where $n$ is the number of MDSs.
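The shutdown test implied by Equations 3.3 and 3.6 can be sketched as follows, reading O_j as the load the remaining servers would carry if MDS j were removed. This reading, the threshold parameter `max_load`, and the names `MdsStats`, `capacity`, `load_without` and `can_power_off` are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-server statistics: popularity P_i and load L_i.
struct MdsStats {
    double popularity;  // P_i
    double load;        // L_i, expected in (0, 1]
};

// C_i = P_i / L_i: the popularity the server could carry at full load.
inline double capacity(const MdsStats& s) {
    assert(s.load > 0.0 && s.load <= 1.0);
    return s.popularity / s.load;
}

// O_j = sum(P_i) / (sum(C_i) - C_j): expected load of the cluster
// once server j has been switched off.
inline double load_without(const std::vector<MdsStats>& cluster, std::size_t j) {
    assert(cluster.size() > 1 && j < cluster.size());
    double total_p = 0.0;
    double total_c = 0.0;
    for (const MdsStats& s : cluster) {
        total_p += s.popularity;
        total_c += capacity(s);
    }
    return total_p / (total_c - capacity(cluster[j]));
}

// Hypothetical policy: server j may sleep only if the remaining
// servers stay at or below max_load.
inline bool can_power_off(const std::vector<MdsStats>& cluster,
                          std::size_t j, double max_load) {
    return load_without(cluster, j) <= max_load;
}
```

With three servers of capacity 400, 200 and 400 and a total popularity of 200, removing the second server leaves a capacity of 800 and an expected load of 0.25.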
3.6 MDS
MDS
MDS
MDS MDS
B
__
B
i 1
| Li L |
__
| Li O |
nMDS
O
i 1
n
3.7
MDS B 0
MDS MDS
B 3.8
3
n
B n 2 Ci , nMDS
3.8
i 1
j MDS k MDS B jk
3.9 O '
MDS MDS 3.5
MDS O ' 3.9
| Li O ' |
B jk
O'
i 1,i j , k
n
| L O'|
i
O'
i 1,i j
n
| L O'|
i
O'
i 1,i j
n
| Lk
C j Lj
C j Lj
O'|
| Lk
| Lk
Ck
O'
C j Lj
Ck
Ck
O'
O'|
| Lk O ' |
O'
3.9
O ' | | Lk O ' |
O'
(nMDS)
C j Lj
Ck
O ' | | Lk O ' |
3.10
T jk MDS MDS
MDS MDS
MDS
T jk MDS
x MDS MDS T jk
T jk MDS MDS x
MDS m
cn MDS Omax
m 3.12 m MDS
MDS
m 1
m 1
i 1
i 1
i 1
Omax * ci Pi Omax * ci
x nm
x 3.13
3.12
3.13
MDS 3.11
MDS MDS MDS
3.4.3
MDS MDS
MDS MDS T1MDS
MDS MDS
MDS T2MDS
t MDS
MDS MDS
MDS MDS MDS
MDS MDS
MDS
1
a) MDS MDS
SendNode
MDS SendNode
b) MDS m MDS
x
c) MDS x MDS
d) SendNode O '
e) MDS T jk T jk MDS
ReceiveNode
f)
2
SendNode
a) SendNode
b) SendNode MDS
SendNode SendNode
SendNode
SendNode
ReceiveNodeMDS
MDS
c) SendNode
MDS
d)
e) ReceiveNode
SendNode ReceiveNode
ReceiveNode MDS
f)
g) SendNode
ReceiveNode
h) ReceiveNode
i)
ReceiveNode
j)
ReceiveNode
a) SendNode
b) SendNode MDS
MDS
SendNode
SendNode
ReceiveNode
ReceiveNode
c)
d)
e) MDS
f)
3.4.4 MDS
MDS MDS MDS
MDS MDS MDS
MDS MDS
MDS MDS
MDS WakeupNode
WakeupNode MDS
LSendNode O
LSendNode
3.12
$$P_{WakeupNode} = C_{WakeupNode} \cdot O \qquad (3.13)$$
$$P = \min(P_{SendNode}, P_{WakeupNode}) \qquad (3.14)$$
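The amount of popularity moved to a newly woken server can be sketched as follows. The sketch assumes that P_SendNode, the amount the sender is able to give away (Equation 3.12), has already been computed and is passed in; the function names `wakeup_target` and `migration_amount` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// P_WakeupNode = C_WakeupNode * O: the popularity the woken server
// should hold when it runs at the cluster's ideal load O.
inline double wakeup_target(double capacity_wakeup, double ideal_load) {
    assert(capacity_wakeup >= 0.0 && ideal_load >= 0.0);
    return capacity_wakeup * ideal_load;
}

// P = min(P_SendNode, P_WakeupNode): never move more than the sender
// can give, and never more than the woken server should take.
inline double migration_amount(double p_send, double capacity_wakeup,
                               double ideal_load) {
    assert(p_send >= 0.0);
    return std::min(p_send, wakeup_target(capacity_wakeup, ideal_load));
}
```

Taking the minimum keeps a single transfer from overloading WakeupNode while still draining SendNode as far as the target allows.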
SendNode P
WakeUpNode
3.5
2.2.2
MDS MDS
MDS MDS
MDS
MDS MDS
MDS MDS
MDS
MDS
MDS
MDS MDS
MDS
MDS
MDS MDS
MDS MDS
MDS MDS
MDS
MDS
MDS
MDS
MDS
MDS MDS
3.6
MDS
MDS
3.6.1
10
3
24
0000~23:59
4
MDS
10
3
24
0000~23:59
4
MDS
3.6.2
[Figure 3.3: chart; vertical axis in W (0 to 1000), horizontal axis in hours (1 to 23)]
[Figure 3.4: chart; vertical axis 0.00% to 120.00%, horizontal axis in hours (10 to 22)]
3.3
5
3.4 2 7
3.7
MDS
4.1
4.1.1
MDS
MDS
MDS MDS
MDS
MDS
MDS inode
MDS inode MDS inode
MDS MDS
MDS
MDS MDS
MDS
MDS MDS
k
MDS
4.1.2
MDS
MDS
MDS MDS
MDS
MDS MDS
MDS
LOCK_SYNC
LOCK_LOCK
SYNC
GLOCKR
LOCK
4.1
1
LOCK_ GLOCKR
2
LOCK_LOCK ACK
LOCK_LOCK
3
LOCK_LOCK LOCK_SYNC
LOCK_LOCK LOCK_SYNC
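The three numbered transitions above can be sketched as a small state machine on the primary node. This is an interpretation, not the LandFile code: it assumes that LOCK_GLOCKR is a transient state in which the primary waits for every replica holder to acknowledge, and the class name `PrimaryLock` and its methods are hypothetical.

```cpp
#include <cassert>
#include <set>
#include <utility>

enum class LockState { LOCK_SYNC, LOCK_GLOCKR, LOCK_LOCK };

// Lock held by the primary node for one metadata item.
class PrimaryLock {
public:
    explicit PrimaryLock(std::set<int> replicas)
        : replicas_(std::move(replicas)) {}

    LockState state() const { return state_; }

    // Step 1: the primary wants to update, so it asks every replica
    // holder to stop reading and moves to LOCK_GLOCKR.
    void begin_update() {
        assert(state_ == LockState::LOCK_SYNC);
        pending_ = replicas_;
        state_ = pending_.empty() ? LockState::LOCK_LOCK
                                  : LockState::LOCK_GLOCKR;
    }

    // Step 2: once the last ACK arrives, the lock becomes LOCK_LOCK
    // and the primary may modify the metadata.
    void on_ack(int replica) {
        assert(state_ == LockState::LOCK_GLOCKR);
        pending_.erase(replica);
        if (pending_.empty()) state_ = LockState::LOCK_LOCK;
    }

    // Step 3: after the update is pushed to the replicas, the lock
    // returns to LOCK_SYNC and replicas are readable again.
    void finish_update() {
        assert(state_ == LockState::LOCK_LOCK);
        state_ = LockState::LOCK_SYNC;
    }

private:
    std::set<int> replicas_;  // ids of MDSs holding a replica
    std::set<int> pending_;   // replicas that have not yet acknowledged
    LockState state_ = LockState::LOCK_SYNC;
};
```

Funnelling every update through the primary means replicas never observe a half-applied change: they are either readable (LOCK_SYNC) or explicitly told to wait.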
4.1.3
MDS MDS
linkunlink rename MDS
1.
2.
3. 1
4.
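For operations such as link, unlink and rename that span two MDSs, the two-phase commit protocol named in the abstract can be sketched as follows. This is a generic textbook form under stated assumptions, not the LandFile implementation: the `Participant` type, its `can_prepare` flag and the in-memory `log` are hypothetical stand-ins for an MDS and its journal.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Vote { Commit, Abort };

// Hypothetical stand-in for one MDS taking part in the operation.
struct Participant {
    bool can_prepare = true;       // whether this MDS can apply the change
    std::vector<std::string> log;  // stand-in for the MDS journal

    Vote prepare(const std::string& op) {
        log.push_back("prepare " + op);
        return can_prepare ? Vote::Commit : Vote::Abort;
    }
    void commit(const std::string& op) { log.push_back("commit " + op); }
    void rollback(const std::string& op) { log.push_back("rollback " + op); }
};

// Returns true if the operation was committed on every participant.
inline bool two_phase_commit(const std::string& op,
                             const std::vector<Participant*>& participants) {
    assert(!participants.empty());
    // Phase 1: ask every participant to prepare; stop at the first abort.
    std::vector<Participant*> prepared;
    bool all_ok = true;
    for (Participant* p : participants) {
        if (p->prepare(op) == Vote::Commit) {
            prepared.push_back(p);
        } else {
            all_ok = false;
            break;
        }
    }
    // Phase 2: commit everywhere, or roll back those that had prepared.
    if (all_ok) {
        for (Participant* p : participants) p->commit(op);
    } else {
        for (Participant* p : prepared) p->rollback(op);
    }
    return all_ok;
}
```

The separate commit and rollback payloads carried by the EntryHelperUpdate log entry in Section 5.4.2 are consistent with a participant that must be able to finish either way after a restart.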
4.2
4.2.1
MDS
MDS
MDS
MDS MDS
MDS MDS
MDS MDS
MDS
MDS
4.2.2
MDS
MDS MDS
MDS
MDS
MDS MDS
MDS
MDS
MDS
MDS
MDS
MDS
MDS MDS
MDS
renamelinkunlink
MDS
MDS
2
MDS MDS
MDS MDS
MDS MDS
MDS MDS
MDS
MDS
MDS
4
MDS MDS
MDS
MDS MDS
MDS MDSMDS
MDS MDS
MDS
4.3
MDS
5 LandFile
LandFile
5.1 LandFile
5.1.1 LandFile
LandFile
GUI API
5.1 LandFile
5.1.2 LandFile
5.2
5.2 LandFile
5.1.3 LandFile
5.3
/
9
/
5.3 LandFile
5.2 /
/
/
LandFile [17]
/
MDS
MDS
MDS MDS
3
MDS
5.3
MDS
MDS 5.4
MDS MDS
MDS MDS MDS
MDS
5.4
(0,1)
(0,1)
/usr
(0,1)
/usr
(0,1)
/usr/test
(1,0)
/usr/test/dir2
(1)
5.5
idid
5.6
0 /usr/test/
/usr/test/ 1
1
2
1 /usr/test/
/usr/test/ dir2
/usr/test/dir2/
4
5.4
5.7
5.7
5.4.1
5.8
MDS
5.8
5.4.2
EntrySubtreeMap
struct ESubtreeMap {
    bufferlist data;                    // serialized metadata carried by this entry
    map<dirId, list<dirId> > subtrees;  // subtree root -> bounds of that subtree
};
utimechmodmknodopencmkdir
MDS linkunlinkrename
MDS
struct EentryUpdate {
    bufferlist data;        // serialized metadata changes
    string type;            // operation type, e.g. mkdir, link
    bufferlist client_map;  // client information, used by rename
    reqid_t reqid;          // request id
};
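struct EntryHelperUpdate {
    bufferlist commit;    // data applied when the operation commits
    bufferlist rollback;  // data applied when the operation rolls back
    string type;          // operation type
    reqid_t reqid;        // request id
    __s32 mdsId;          // MDS id
    __u8 op;              // operation code
    dirId base;           // directory id (type assumed to be dirId, as in EImportStart)
};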
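struct EImportStart {
    dirId base;             // id of the imported subtree root
    list<dirId> bounds;     // bounds of the imported subtree
    bufferlist data;        // serialized metadata of the subtree
    bufferlist client_map;  // client information
};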
struct EImportFinish {
protected:
    dirId base;    // id of the imported subtree root
    bool success;  // whether the import succeeded
};
MDS EntryImportStart EntryImportFinish
EntryImportFinish
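One plausible way to use the EntryImportStart / EntryImportFinish pair during log replay is sketched below. It is an assumption, not the documented LandFile behavior: an import counts as complete only if a matching finish entry with `success` set is found, and an import whose finish entry is missing (a crash in mid-migration) is treated as not having happened. The types `EntryKind` and `JournalEntry` and the function `replay_imports` are hypothetical simplifications that use plain integers as directory ids.

```cpp
#include <cassert>
#include <set>
#include <vector>

enum class EntryKind { ImportStart, ImportFinish };

// Simplified journal record: only the fields needed for this sketch.
struct JournalEntry {
    EntryKind kind;
    int base;      // id of the imported subtree root
    bool success;  // meaningful only for ImportFinish
};

// Replays the journal in order and returns the subtree roots whose
// import is confirmed by a successful finish entry.
inline std::set<int> replay_imports(const std::vector<JournalEntry>& journal) {
    std::set<int> pending;  // started but not yet finished
    std::set<int> owned;    // confirmed imports
    for (const JournalEntry& e : journal) {
        if (e.kind == EntryKind::ImportStart) {
            pending.insert(e.base);
        } else {
            // A finish entry is expected to follow its start entry.
            assert(pending.count(e.base) == 1);
            pending.erase(e.base);
            if (e.success) owned.insert(e.base);
        }
    }
    // Entries left in `pending` were interrupted before the finish
    // record was written; this sketch treats them as not imported.
    return owned;
}
```

Writing the start record before the data moves and the finish record after it gives recovery a clear point at which ownership of the subtree changed.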
5.4.3
EntrySubtreeMap
EntrySubtreeMap
0 -1
5.5
LandFile
6
6.1
LandFile
6.2
[1] Sun Microsystems, Inc. NFS: Network File System Protocol Specification, March 1989.
[2]
[3]
[4] R. Sandberg. The Sun Network File System: Design, Implementation and Experience. In
Proceedings of the USENIX Summer Conference, Summer 1987. University of California
Press, pp. 300-313.
[5]
. Coda : []. : .
2005.
[6]
. : []. : . 2006.
[7] Braam, P. J. The Coda Distributed File System. Linux Journal, June 1998.
[8]
[9] Frank Schmuck and Roger Haskin. GPFS: A Shared-Disk File System for Large
Computing Clusters. In Proceedings of the Conference on File and Storage Technologies
(FAST '02), January 2002, pp. 231-244.
[10] Peter J. Braam, Rumi Zahir. Lustre Technical Project Summary, Version 2. July 29, 2001.
[11]
[12]
. : []. :
. 2003.
[13] Sage A. Weil. Ceph: Reliable, Scalable, and High-Performance Distributed Storage. Ph.D.
thesis, University of California, Santa Cruz, December 2007.
[14]
. : []. :
. 2005.
[15] Drew Roselli, Jay Lorch, and Tom Anderson. A Comparison of File System Workloads. In
Proceedings of the 2000 USENIX Annual Technical Conference, pp. 41-54, San Diego,
CA, June 2000. USENIX Association.
[16] Dan Feng, Juan Wang, Fang Wang, Peng Xia. DOIDFH: an Effective Distributed Metadata
Management Scheme. The 5th International Conference on Computational Science and
Applications, 2007, pp. 245-252.
[17]
. : []. :
. 2009.
SA0710
2010 5
[1] , . CEPH . .
[1] , . (No:
200910236456.7).