
NCCS Lustre File System Overview

presented by
Sarp Oral, Ph.D.

NCCS Scaling Workshop
August 1st, 2007

Oak Ridge National Laboratory
U.S. Department of Energy
Outline

• What is Lustre

• Lustre Architecture

• NCCS Jaguar Lustre

• Other NCCS Lustre File Systems

• NCCS Lustre Efforts

• Lustre Centre of Excellence (LCE) at ORNL



What is Lustre

• Lustre
− POSIX compliant
− Parallel file system

• Lustre provides
− High scalability
− High performance
− Single global name space

• Lustre is a software-only architecture



Lustre Architecture

• Lustre consists of four major components


− MetaData Server (MDS)
− Object Storage Servers (OSSs)
− Object Storage Targets (OSTs)
− and of course “Clients”

• MDS
− Manages the name space, directory and file operations
− Stores file system metadata
− Extended attributes point to objects on OSTs

• OSS
− Manages the OSTs

• OST
− Manages underlying block devices
− Stores file data stripes



Lustre Architecture

[Diagram: clients send metadata operations (file creation, stats, recovery) to
the MDS and perform block I/O and file locking against multiple OSSs; each OSS
fronts several OSTs, which sit on block devices.]



Lustre Architecture

• All servers have a full-blown file system they operate on


− Today, ext3/ext4 (vastly improved by CFS)

• Today, only a single active MDS is supported


− The goal is to have many MDSs in the near future
− Current downside
• The whole file system is limited by that single MDS's performance
− Although not that bad, it can sometimes be a problem



Lustre Architecture

• Failover
− Active-passive pairs for MDS and OSS
− Works fine on all *NIX-based systems except Catamount
• Failover is not supported with current UNICOS
• Failover will be supported with the CNL

• Supports sparse files

• Supports up to 8TB partitions currently


− We are using 2TB partitions

• Unlike all other *NIX-based systems, on Catamount clients Lustre access is
achieved over liblustre
− Catamount clients are uninterruptible and I/O is not cached
− Liblustre is directly linked into the application



Lustre Architecture

• Striping is the key for achieving high scalability and performance

• File data is written to and read from multiple OSTs (see the sketch below)


− Provides higher aggregate R/W BW than a single server can
deliver

• Allows file sizes to be larger than a single OSS could handle

• Simple tips
− Over-striping might be bad
• Chunks written to each OST become too small
− Underutilizes OSTs and the network
− Under-striping might be bad
• Too much stress on each OST
− Contention
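
To make the striping discussion concrete, here is a minimal sketch (not a Lustre API; the function and variable names are illustrative) of how a file byte offset maps to an OST object under the round-robin layout described above, using a 1 MB stripe size and a stripe count of 4.

#include <stdio.h>
#include <stdint.h>

/* Illustrative only: map a file byte offset to a stripe (OST object) index
 * and an offset within that object, assuming the simple round-robin layout
 * described above. This is not a Lustre API. */
static void map_offset(uint64_t offset, uint64_t stripe_size,
                       unsigned stripe_count,
                       unsigned *stripe_idx, uint64_t *obj_offset)
{
    uint64_t chunk = offset / stripe_size;            /* which stripe-sized chunk */
    *stripe_idx = (unsigned)(chunk % stripe_count);   /* which OST object gets it */
    *obj_offset = (chunk / stripe_count) * stripe_size + offset % stripe_size;
}

int main(void)
{
    uint64_t stripe_size = 1 << 20;   /* 1 MB stripes (assumed) */
    unsigned stripe_count = 4;        /* striped over 4 OSTs (assumed) */
    unsigned idx;
    uint64_t obj_off;

    for (uint64_t off = 0; off < 8 * stripe_size; off += stripe_size) {
        map_offset(off, stripe_size, stripe_count, &idx, &obj_off);
        printf("file offset %10llu -> OST object %u, object offset %llu\n",
               (unsigned long long)off, idx, (unsigned long long)obj_off);
    }
    return 0;
}

With too few stripes, all of this traffic lands on a handful of OSTs (contention); with too many, each write becomes a set of small chunks that underutilize both the OSTs and the network.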



Lustre Architecture

• Stripe pattern can be changed by the user


− Before the file or directory is created
− Once created, the stripe pattern is fixed

− Command line
• “lfs setstripe” to set the stripe pattern
• “lfs getstripe” to query the stripe pattern

− Within the application (see the sketch below)
• Several low-level ioctl calls are available to set and query stripe
patterns and some other EAs
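
A minimal sketch of the programmatic route, assuming the liblustreapi helper llapi_file_create(); the header path, the exact prototype, and the file path used here may differ between Lustre releases, so treat this as illustrative rather than the definitive interface (the underlying ioctls can also be issued directly).

/* Hedged sketch: create a file with an explicit stripe pattern from within an
 * application, instead of running "lfs setstripe". Header name and prototype
 * follow common Lustre releases and may differ on a given installation. */
#include <stdio.h>
#include <lustre/lustreapi.h>   /* older releases: <lustre/liblustreapi.h> */

int main(void)
{
    const char *path = "/lustre/scratch/example_output";  /* illustrative path */
    unsigned long long stripe_size = 1ULL << 20;  /* 1 MB per stripe */
    int stripe_offset  = -1;  /* -1: let the MDS choose the starting OST */
    int stripe_count   = 4;   /* stripe the file over 4 OSTs */
    int stripe_pattern = 0;   /* 0: default (RAID-0 style) pattern */

    int rc = llapi_file_create(path, stripe_size, stripe_offset,
                               stripe_count, stripe_pattern);
    if (rc != 0) {
        fprintf(stderr, "llapi_file_create(%s) failed: %d\n", path, rc);
        return 1;
    }

    /* The file now exists with the requested layout; open() and write to it
     * as usual. Once created, the stripe pattern is fixed. */
    return 0;
}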



Lustre Architecture

[Diagram: the same files A, B, and C laid out across OST1, OST2, and OST3 as
single-striped, two-striped, and fully striped examples.]

• Stripe count (or width)
− # of OSTs the file has been striped over

• Stripe size
− Size of each stripe on an OST
• Normally the same on all OSTs for a given file



Lustre Architecture

• Everything is based on RPCs


− Control and data are requested and transferred over RPCs
− Sometimes messages are dropped or lost
• Timeouts
− If the error is caused by the client side
• The client will simply disconnect from that particular server
• It will keep retrying to reconnect
• Eviction
− If the error is caused by the server side
• The client will discover it has been evicted on its next request
• All of the client's buffer cache will be invalidated
• Dirty data will be lost



Lustre Architecture

• Architecture has changed with Lustre 1.4.6


− LNET and LNDs
− Independent network conduits have been introduced
− A single "network and recovery" layer connects the upper "Lustre file
system" layer with the lower "network conduits"
• TCP, Cray Portals, InfiniBand, Myricom, Elan

[Diagram: the Lustre File System layer sits on the Lustre Networking (LNET)
layer, which talks to per-network LNDs (TCP, Myricom, Elan), each of which is
built on the corresponding vendor library.]


Lustre Architecture

• POSIX compliant (in a cluster)


− Atomic ops
− Clients don’t see stale data or metadata
− Semantics are guaranteed by strict locking

− Exceptions
• Flock/lockf is still not supported

• Security
− Comparable to NFS today
− Kerberos capabilities on the way
− Encrypted Lustre file system is under development



NCCS Jaguar Lustre

• Cray XT4 (Catamount) Lustre, 3-D tori

− 3 Lustre file systems for production runs: 2×150 TB, 1×300 TB
• Uses 3 MDS service nodes
• 72 XT4 service nodes as OSSs
− 4 OSTs/OSS
− 2 OSTs for the 300 TB FS
− 1 OST for each remaining 150 TB FS
• 2 single-port 4 Gb FC HBAs per OSS
• 45 GB/s block I/O for the 300 TB FS

− DDN 9550s
• 18 racks/couplets
• Write-back cache is 1MB on each controller
• 36 TB per couplet w/ Fibre Channel drives
• Each LUN has a capacity of 2 TB and 4 KB block size



NCCS Jaguar Lustre

[Figure: Cray XT4 (Catamount) LUN configuration]



NCCS Jaguar Lustre

• Default stripe count/width on Jaguar (Catamount)
− 4

• Default stripe size on Jaguar (Catamount)
− 1 MB
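
To check what layout a given file actually received (these defaults or otherwise), below is a hedged sketch using the liblustreapi call llapi_file_get_stripe(); the header, the struct lov_user_md fields, LOV_MAX_STRIPE_COUNT, and the buffer sizing follow common Lustre releases and are assumptions that may need adjusting for a specific installation.

/* Hedged sketch: query the stripe layout of an existing Lustre file from C.
 * Header names, LOV_MAX_STRIPE_COUNT, and the lov_user_md fields follow
 * common Lustre releases and may differ on a given installation. */
#include <stdio.h>
#include <stdlib.h>
#include <lustre/lustreapi.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <lustre file>\n", argv[0]);
        return 1;
    }

    /* Allocate room for the header plus the per-OST object entries. */
    size_t len = sizeof(struct lov_user_md) +
                 LOV_MAX_STRIPE_COUNT * sizeof(struct lov_user_ost_data);
    struct lov_user_md *lum = calloc(1, len);
    if (lum == NULL)
        return 1;

    int rc = llapi_file_get_stripe(argv[1], lum);
    if (rc != 0) {
        fprintf(stderr, "llapi_file_get_stripe failed: %d\n", rc);
        free(lum);
        return 1;
    }

    printf("%s: stripe count %u, stripe size %u bytes\n",
           argv[1], (unsigned)lum->lmm_stripe_count,
           (unsigned)lum->lmm_stripe_size);
    free(lum);
    return 0;
}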



NCCS Jaguar Lustre

• Jaguar Compute Node Linux (CNL) Lustre


− Small
• Compared to the Catamount side

− Exact configuration details are to be determined

− Open issues
• A mechanism to transfer files between the Catamount side and
the CNL side
− Can be done by NCCS
− Can be done by users



Other NCCS Lustre File Systems

• End-to-end cluster (Ewok)


− Lustre 1.4.10.1 for production runs
− 1 MDS, 6 OSS, 2 OST/OSS, OFED 1.1 IB, 81 clients
− 20 TB, ~3-4 GB/s

• Viz cluster (Everest)


− Coming soon
− 1 MDS, 10 OSS, 2 OST/OSS, OFED 1.2
− ~A couple of tens of TB, ~4-5 GB/s



Other NCCS Lustre File Systems

• Center-wide Lustre cluster (Spider)


− To serve all NCCS resources
• Jaguar, Everest, and Ewok by the end of 2007
• Baker by the end of 2008
• And all new additions from that point on

− Phase 0
• Will be in production soon over Jaguar
• 20 OSS, 80 OSTs, 4 OST/OSS, 10GE & 4x SDR IB
• 10 couplets of DDN 8500s, FC 2 Gb direct links w/ failover
configured

− Phase 1: additional 20 GB/s by the end of 2007

− Phase 2: total 200 GB/s by the end of 2008



Other NCCS Lustre File Systems

• Spider provides
+ Ease of data transfer between clusters
+ Ability to analyze data offline
+ On-the-fly data analysis/visualization capability
+ Ease of diagnostics/decoupling
+ Lower acquisition/expansion cost



Other NCCS Lustre File Systems

• Lustre router nodes on Jaguar


− Route Lustre packets between
• TCP to/from Cray Portals
• IB to/from Cray Portals

− ~ 450 MB/s/XT4 SIO node over TCP/Cray Portals


− ~ 600-700 MB/s/XT4 SIO node over IB/Cray Portals



Other NCCS Lustre File Systems
[Diagram: center-wide connectivity between Jaguar (via its SIO router nodes,
Cray Portals/IB), the InfiniBand and TCP networks, Ewok, Everest, legacy
systems, and the Spider backend disks.]


NCCS Lustre Efforts

• Lustre tool development


− Parallel Lustre copy tool
− Portals I/O function shipping tool
− Text-based, top-like version of LLNL's LMT tool
− Web-based version of LLNL's LMT tool
• Lustre and HPSS integration
• Server-side client statistics
• File Joins
• High speed storage options for MDS
• Lustre 1.4/1.6 on Cray UNICOS with IB
• Lustre 1.4/1.6 on CNL with IB
• NCCS production Lustre FS (Jaguar, Ewok)
• Center-wide Lustre cluster (Spider)
• Jaguar performance tracking



Lustre Centre of Excellence (LCE) at ORNL

Lustre Centre of Excellence (LCE) established in December 2006.

• Create an on-site presence at ORNL (1st floor back hall)


− Two on-site staff, plus rotating additional staff
− Oleg Drokin and Wang Di

• Develop a risk mitigation Lustre package for ORNL


− A single lowest-risk, scalable implementation to 1 PF
− MPILND
− In out-years explore possible 1 TB/s solutions

• Train ORNL staff in Lustre Source


− Develop local expertise to reduce dependence on CFS and Cray
− Peter Braam gave a 3-day tutorial on Lustre Internals in January
− Sys admin training is being planned

• Assist Science teams in tuning their application I/O


− Focus on 2-3 key apps initially and document results
− Wang Di
− Started with S3D
− On-site Lustre workshops for application teams

