
Object Storage

Manas Minglani
Wednesday, 2015

The Storage Evolution: From Blocks, Files and Objects to Object Storage Systems
Structured Data vs Unstructured Data

Topics
Block-Based Data Access
File-Based Data Access
Object-Based Data Access
Object-Based Storage Devices (OSD)
Object Storage Systems
Object Storage Server (OSS)
Content Addressable Storage (CAS)
Content Aware Storage (CAS)

Data Access

Block

SCSI
SAS
iSCSI
SATA

Object

OSD

File

Local FS
Distributed FS
Global, Distributed & Parallel FS

The Block Paradigm

Block Paradigm (contd.)


It allows for convenient manipulation of data at the byte level, useful for random
I/O and cases where only small portions of data are required.
Serious overhead appears when a block-based storage system expands beyond hundreds
of terabytes or into multiple petabytes; durability issues also present themselves.
Solving the provisioning and management issues presented by the expansion of
storage at this scale is where object storage shines.
Object storage is resilient while mitigating costs. Objects remain protected by
storing multiple copies of data over a distributed system; if one or more nodes
fail, the data can still be made available, in most cases, without the application or
the end user ever being impacted.

Block Paradigm (contd.)


However, the degree to which low-level interaction can be performed is very
limited compared to Block/File storage.
Interaction can occur via a single API endpoint. This eliminates complex LUN
mapping, storage network topologies, etc. from the application infrastructure
design.

Local File Systems (One more level of Indirection)

Distributed File Systems


e.g., NAS with NFS

Inode Metadata

File Storage (contd.)

File systems allow concurrent access by smaller groups of users and enable read and
write operations, but they carry overhead to manage permissions and operations such
as file locking.
Much of the unstructured data generated today does not require concurrent access,
so the file system overhead is unnecessary and simply adds cost and complexity.
A hierarchical file structure consists of directories and a hierarchy of nested
folders, subfolders, and files; the content and data contained within a file is
nowhere near as important as its location in the directories. As a consequence,
each file carries only basic metadata, such as file name, date created, last
modified, file type, and the person who created the file.

The Old Block Paradigm

The New Object Paradigm: Flat Address Space

The New Object Paradigm (contd.)


WRITE 26,763 Bytes
QoS = High
Description = X-Ray
Retention = 50 years
Access Key = *&^%#

Object Storage Responsibilities:


Space Management
Access Control (Identity Mgmt)
QoS Management
Cache, Backup
Policy Migration, Retention

In a Flat Namespace
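To make the paradigm concrete, here is a minimal sketch of how such a write might look over an S3-style REST interface, assuming the boto3 library; the bucket name, object key, and metadata keys below are hypothetical, echoing the X-Ray example on this slide.

```python
# Minimal sketch: storing an X-ray image as an object whose policy metadata
# (QoS, description, retention) travels with the data itself.
# Assumes the boto3 S3 client; bucket, key, and metadata names are hypothetical.
import boto3

s3 = boto3.client("s3")

with open("xray-20150401.dcm", "rb") as f:
    s3.put_object(
        Bucket="medical-images",      # flat namespace: one container, no directory tree
        Key="xray-20150401.dcm",      # object name, not a path in a hierarchy
        Body=f,
        Metadata={                    # user-defined metadata stored with the object
            "qos": "high",
            "description": "X-Ray",
            "retention-years": "50",
        },
    )
```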

Benefits over File Storage


No directory tree
Object Storage uses a flat structure, storing objects in containers, rather than a nested tree
structure.
By eliminating the overhead of keeping track of large quantities of directory metadata, one
major performance bottleneck that typically appears once tens of millions of files are
present on a filesystem is removed.

Metadata lives with the object

Objects

Inodes vs Objects

Virtual View/Virtual File Systems

Benefits over Virtualization


Object Storage is one (or potentially a few, in the case of multi-region
deployments) giant volume. Virtually all storage management
overhead of Block and File Storage is eliminated as a result (i.e., no
more resizing or remapping volumes).

Comparison
We don't recommend you use object storage for transactional data,
especially because of the eventual consistency model outlined
previously. In addition, it's very important to recognize that object
storage was not created as a replacement for NAS file access and
sharing; it does not support the locking and sharing mechanisms
needed to maintain a single, accurately updated version of a file.
Good examples of block storage use cases are structured database
storage, random read/write loads, and virtual machine file system
(VMFS) volumes.

Object Types
The ANSI T10 SCSI OSD standard defines four different objects:
The root object -- The OSD itself
User objects -- Created by SCSI commands from the application or client
Collection objects -- A group of user objects, such as all .mp3 objects or all objects belonging to a project
Partition objects -- Containers for user objects and collections that share common security and space management
characteristics, such as quotas and keys
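As an illustration of the containment relationships only (not the standard's actual SCSI command set), the four object types could be modeled as below; all class and field names are hypothetical.

```python
# Conceptual sketch of the ANSI T10 OSD object hierarchy; names are hypothetical,
# not taken from the standard's command set.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class UserObject:
    object_id: int
    data: bytes = b""
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class CollectionObject:
    collection_id: int
    members: List[int] = field(default_factory=list)  # ids of user objects, e.g. all .mp3 objects


@dataclass
class PartitionObject:
    partition_id: int
    quota_bytes: int                                   # shared space-management characteristic
    security_key: bytes                                # shared security characteristic
    user_objects: Dict[int, UserObject] = field(default_factory=dict)
    collections: Dict[int, CollectionObject] = field(default_factory=dict)


@dataclass
class RootObject:                                      # the OSD itself
    partitions: Dict[int, PartitionObject] = field(default_factory=dict)
```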

Characteristics for Object Storage


Scalability: Many of the features of Object Storage seem inconvenient at very small scales, but as data
scale reaches hundreds of TB and moves into the PB range and beyond, these features become
invaluable, and allow continued horizontal scalability for virtually any quantity of data.

Durability: Due to the design of most Object Storage systems (3 file replicas is the most common
paradigm), durability levels at scale are extremely high compared to conventional storage solutions
(think 99.99999% to 99.999999999%, 7 to 11 nines). Object Storage systems have internal mechanisms
to verify file consistency, and handle failed drives, bit-rot, server and cabinet failures, etc. These
features allow the system to automatically replicate data as needed to retain the desired number of
replicas, which results in extremely high durability and availability of data.
Cost: Because many Object Storage platforms are designed to run on commodity hardware, even with
3x overhead, the price point is attractive when compared to block or file storage. At scale, costs of
pennies per gigabyte per month are typical. Think comparable or better durability than tape, nearly the
same cost as tape, and the convenience and performance of hot storage, plus all the benefits of a
cloud storage platform.

Other Characteristics for Object Storage


Redundancy: Object storage provides redundancy and high availability by storing copies of the same object on multiple
nodes. When an object is created, it is created on one node and subsequently copied to one or more additional nodes,
depending on the policies in place. The lack of in-place update support enables multi-copy object redundancy with very
little complexity. For traditional storage systems, keeping copied (replicated) files and blocks in sync across multiple
instances is a tremendous challenge; it is very complex and can only be done under a set of very strict restrictions, such
as within defined latency constraints.
Protocol Support: Traditional block- and file-based protocols work well within the data center where performance is
good and latency isn't an issue. But they're not appropriate for geographically distributed access and the cloud, where
latency is unpredictable. Furthermore, traditional file system protocols (CIFS and NFS) communicate on TCP ports that
are available on internal networks but rarely exposed to the Internet. Conversely, object storage is usually accessed
through a REST API over HTTP (OpenStack, Amazon S3), with simple commands: put, get, delete, list (see the sketch
after this list).
Application support and integration: Since the industry is still struggling to agree on standards, widespread object
storage integration is still sparse. Beyond custom application integration, some commercial applications, especially for
backup and archiving, have added object storage integration support, primarily linking to Amazon S3 cloud storage.
Cloud features: Multi-tenancy and the ability to securely segregate different users' data are must-haves for an object
storage product to be used beyond the enterprise. Security is more than encryption, and should include provisions to
control access to tenants, namespaces and objects.
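As noted under Protocol Support, the access model reduces to a handful of HTTP verbs. A minimal sketch using the Python requests library, assuming a Swift-style endpoint; the storage URL, auth token, container, and object names are hypothetical placeholders.

```python
# Minimal sketch of object access over a REST API (put, get, delete, list).
# Assumes a Swift-style endpoint; URL, token, container, and object names
# below are hypothetical placeholders.
import requests

STORAGE_URL = "https://storage.example.com/v1/AUTH_demo"
HEADERS = {"X-Auth-Token": "token-goes-here"}

# put: create or overwrite an object
with open("cat.jpg", "rb") as f:
    requests.put(f"{STORAGE_URL}/photos/cat.jpg", headers=HEADERS, data=f)

# get: retrieve the object
resp = requests.get(f"{STORAGE_URL}/photos/cat.jpg", headers=HEADERS)

# list: enumerate objects in the container (plain-text, one name per line)
listing = requests.get(f"{STORAGE_URL}/photos", headers=HEADERS).text.splitlines()

# delete: remove the object
requests.delete(f"{STORAGE_URL}/photos/cat.jpg", headers=HEADERS)
```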

The ANSI T10 object-based Storage Standard


In 2004, the ANSI (American National Standards Institute) T10 Standards body ratified an Object-based Storage
Device (OSD) command set for SCSI (Small Computer System Interface) storage devices that implements the OSD
interface.

New Paradigm
Evolving the storage interface. The figure indicates how the
OSD migrates (lower arrow) a portion of the file system into
the storage device while leaving the file system high-level
policy functions (e.g., user authentication) to the server
(upper arrow). A cryptographic mechanism (symbolized by
the keys in the figure and generated using private keys
stored in the OSD) is used to authorize client access.

Current storage devices already perform extensive processing, virtualizing block
storage addresses to remap unusable sectors and automatically recover from I/O
errors. However, block-based storage devices have no information about the
relationship between blocks, making it impossible for the device to intelligently
allocate its internal resources.

Rationale
An object can grow or shrink dynamically and is completely contained
and managed within a single OSD.
Objects are grouped into partitions that enable security, space
management, and quota management. Each partition represents a
security domain with its own set of keys.
The OSD object model is flexible, allowing higher-level software to
map each of its entities (e.g., file and table) to a single object,
multiple objects, or a partial object.

Basic Command Set

OSD Optimizations

One of the common usage patterns of an OSD involves an operation (e.g., WRITE)
on the data segment of an object and then accessing some attributes (e.g., a file
system maintained last-access-time attribute). To support this common access
pattern, the OSD standard allows all commands to include a SET_ATTR and/or
GET_ATTR operation. We call this piggybacking an attribute access on a
command. Piggybacking attribute updates can significantly reduce the number of
messages and hence the message processing on each OSD. Piggybacking can also
improve client latency by removing the additional network roundtrip that a
separate GET_ATTR or SET_ATTR command would create.
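A conceptual sketch of the piggybacking idea follows; it is not the T10 wire format, and the class and the send_command() call are invented purely to show why a single round trip suffices.

```python
# Conceptual sketch of piggybacking SET_ATTR/GET_ATTR on an OSD command.
# Not the T10 wire format; names and the transport call are hypothetical.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class OSDCommand:
    opcode: str                                                  # e.g. "WRITE"
    partition_id: int
    object_id: int
    data: Optional[bytes] = None
    set_attrs: Dict[str, bytes] = field(default_factory=dict)   # piggybacked SET_ATTR
    get_attrs: List[str] = field(default_factory=list)          # piggybacked GET_ATTR


# One message instead of three (WRITE + SET_ATTR + GET_ATTR):
cmd = OSDCommand(
    opcode="WRITE",
    partition_id=7,
    object_id=42,
    data=b"...file contents...",
    set_attrs={"last_access_time": b"2015-04-01T12:00:00Z"},
    get_attrs=["logical_length"],
)
# reply = osd.send_command(cmd)   # hypothetical transport call, single round trip
```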

Security

Object Storage Device Security Methods

Seagate Object Drive

Seagate designed and made a prototype of an object disk that could be a
continuation of its existing line of high-end hard disks.
The OSD code ran on one of the processors alongside other portions of the drive
firmware.
OSD extends earlier interfaces by providing two crucial pieces:
First, since space management is delegated to drives, drives now have basic information
such as which blocks are free and which are in use. This allows drives to optimize reliability
and service quality.
The second piece of information gained from the OSD relates to application requirements.
Through shared attributes, applications can relay preferences for QoS and reliability.

With traditional sector-based drives, disk access is based on fixed-length blocks
accessed by block addresses. As a result, any change in the sector size requires
widespread software modifications. With an OSD, the drive manages block size
and space management.

Seagate Object Drive (contd.)


The addition of space management is a natural extension of the functionality
provided by the drive. Current drives already contain relocation maps for defect
management, zoning, and other performance and reliability tracking. Allocation
maps extend this functionality by providing richer semantics to help minimize
seeks and to optimize read/write caching.
Exception handling and recovery are critical for the proper functioning of the
drive. As with a host-based file system, each drive must protect metadata. This
can be done by internally mirroring the metadata or by protecting it with more
powerful error-correction codes.
OSD drives can unobtrusively perform repairs or fence objects if they are not
repairable.
Achieving high data-transfer rates to a hard disk is difficult because of its limited
computational power, so specialized hardware was used wherever possible. The
OSD protocol has more complex commands than those used in standard SCSI.

Challenges for Object Stores

Eventual Consistency
Scale out by adding more x86 servers (nodes) to the Object Store
Due to the distributed nature of Object Storage, it is subject to Brewer's CAP theorem,
which states that it is impossible for a distributed system to simultaneously
provide consistency, availability, and partition tolerance.
Object stores can offer two of the three, but not all three.

Eventual Consistency - Eventual consistency means that if data objects are stored
and receive no new updates, then eventually all nodes with access to these data
objects will return the last updated value. Eventual consistency has been proposed
so Object Stores can offer acceptable performance.
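A minimal sketch of what this means for a client: a read issued right after an update may still return the old value, so the caller retries until the replicas converge. The read_object() helper is hypothetical, standing in for any object-store GET.

```python
# Minimal sketch of a client coping with eventual consistency: retry a read
# until the last written value becomes visible. read_object() is a hypothetical
# helper standing in for any object-store GET.
import time


def wait_for_update(read_object, name, expected, attempts=10, delay=0.5):
    """Poll until a read returns the last written value, or give up."""
    for _ in range(attempts):
        value = read_object(name)
        if value == expected:
            return value          # replicas have converged for this reader
        time.sleep(delay)         # a stale replica answered; try again shortly
    raise TimeoutError(f"{name} did not converge to the expected value")
```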

Latency and Performance

Virtual Machines, especially IOPS-devouring applications, require low storage latency
and high-performance storage. SAS disks, Fibre Channel, InfiniBand and All-Flash
Arrays were introduced to offer the necessary bandwidth and an acceptable
latency.
Object Storage is developed and optimized to contain a massive amount of data.
To maximize the amount of storage capacity per node in the Object Storage
Cluster, large SATA disks are selected as these provide the best price per GB. By
selecting these large disks, you can't achieve the IOPS and storage performance
needed by Virtual Machines. SANs have been fitted with fast, but small, SAS
drives. If you instead try to maximize the storage per node by selecting smaller,
more expensive SSD disks, the price per GB of stored data skyrockets.

Different Management Paradigms


Object Stores understand objects, while hypervisors understand Virtual
Machines. What is needed is a software layer that plugs into the hypervisor such
that the system administrator doesn't need to understand LUNs, RAID groups, etc.,
but can just manage Virtual Machines. This software layer has to translate a VM
paradigm into an Object Store paradigm.

Turning Object Storage into Virtual Machine Storage

Open vStorage takes this different approach and is designed from the ground up
with Virtual Machines and their performance requirements in mind. It uses a
well-considered architecture which allows Object Storage to be turned into block
storage for Virtual Machines and avoids pitfalls such as those seen with distributed
file systems linked to Object Storage.
Open vStorage creates a unified namespace for the Virtual Machines across
multiple Hosts.

OpenStack Swift
Data is distributed by making distributed copies or replicas to different nodes and
drives across the Swift cluster
The number of replicas is user-configurable
Access is provided via a RESTful API and an Amazon S3-compatible RESTful API
Swift is based on a ring architecture with three categories: account, container, and
object
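A minimal usage sketch, assuming the python-swiftclient package; the auth URL, credentials, container, and object names below are hypothetical placeholders.

```python
# Minimal sketch of talking to OpenStack Swift via python-swiftclient.
# Auth URL, credentials, container, and object names are placeholders.
from swiftclient.client import Connection

conn = Connection(authurl="https://swift.example.com/auth/v1.0",
                  user="demo:demo", key="secret")

conn.put_container("backups")                          # container in the flat namespace
with open("db-dump.sql.gz", "rb") as f:
    conn.put_object("backups", "db-dump.sql.gz",       # replicated per the ring's policy
                    contents=f)

headers, body = conn.get_object("backups", "db-dump.sql.gz")
```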

Ceph

Stores data on a single distributed computer cluster and provides interfaces for
object-, block-, and file-level storage
Does not have a single point of failure
Scalable to the exabyte level
Replicates data and makes it fault tolerant

Ceph employs four distinct kinds of daemons:

Cluster monitors
Metadata servers
Object Storage Devices (OSDs)
RESTful (Representational State Transfer) gateways
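For the object interface specifically, a minimal sketch using the librados Python bindings, assuming a reachable cluster with a default ceph.conf; the pool and object names are hypothetical.

```python
# Minimal sketch of reading/writing RADOS objects with the librados Python
# bindings; the pool name "demo-pool" and the object names are hypothetical.
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()

ioctx = cluster.open_ioctx("demo-pool")                  # I/O context bound to one pool
ioctx.write_full("sensor-0001", b"temperature=21.5")     # whole-object write
data = ioctx.read("sensor-0001")                         # read it back
ioctx.set_xattr("sensor-0001", "source", b"lab-3")       # metadata stored with the object

ioctx.close()
cluster.shutdown()
```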

Characteristics
API-level access vs. filesystem-level
Flat structure vs. hierarchical structure
Scalable metadata
Scalable platform
Durable data storage
Low-cost data storage

Advantages
Scalable capacity (many PB easily)
Scalable performance (environment-level performance scales in a
linear manner)
Durable
Low cost
Simplified management
Single Access Point
No volumes to manage/resize/etc.

Disadvantages
No random access to files
POSIX utilities do not work directly with object-storage (it is not a
filesystem)
Integration may require modification of application and workflow
logic (this could easily be seen when trying to use FIO with the
Kinetic Drives: because access goes through the drive's API, FIO had
to be altered to run benchmarks on the Kinetic Drive).
Typically, lower performance on a per-object basis than block storage

Use Cases
Currently the datasets best-suited for Object Storage are the following:
Unstructured data

Media (images, music, video)


Web Content
Documents
Backups/Archives

Archival and storage of structured and semi-structured data


Databases
Sensor data
Log files

Thanks
