
Apache HBase 0.98

Andrew Purtell
Committer, Apache HBase, Apache Software Foundation
Big Data US Research And Development, Intel

Tianyou Li

Who Am I?
Committer and PMC Member, Apache HBase project
Member of the Big Data Research And Development Group at Intel
Release manager for Apache HBase 0.98

What is Apache HBase?


A high performance horizontally scalable datastore engine for Big Data, suitable as the store of record for mission critical data

Apache Software Foundation community project
Open source
Free license

HBase and Big Data


1994-2006: Large Internet companies first encounter Big Data
(Today: 94% corporate data growth YoY)

HBase and Big Data


2006-today: The openness of the early leaders provides a blueprint for motivated and talented open source communities

                                Google      Apache, Yahoo, FB
Distributed filesystem          GFS         HDFS
Horizontally scalable database  BigTable    HBase
Parallel programming model      MapReduce   Hadoop
Distributed lock manager        Chubby      ZooKeeper

HBase and Big Data


Now: HBase is a foundation of Big Data use cases

HBase and Hadoop


(Diagram of the Hadoop stack around HBase; components and their roles:)
Sqoop: RDB Data Collector
Flume: Log Data Collector
Shark: Structured Query
R: Statistics
Giraph: Graph analysis framework
Mahout: Data mining
Pig: Data Manipulation
Hive: Structured Query
Oozie: Data Flow
HBase Coprocessors: Data execution engine
HBase: Distributed Database
Spark: Iterative In-Memory Computation
YARN (MRv2): Cluster Resource Manager / MapReduce
Zookeeper: Coordination
HDFS 2.0: Hadoop Distributed File System
Hadoop Common JNI
The Java Virtual Machine

The HBase Data Model


(Tablespaces)

Not a spreadsheet, think of a distributed sorted map
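
The "sorted map" idea can be sketched in a few lines of Python. This is a toy, single-process model and not the HBase client API: the class name `ToyTable` and its methods are hypothetical, and it assumes keys sort by row, then column, then newest timestamp first.

```python
from bisect import bisect_left, insort


class ToyTable:
    """Toy model: a table as a sorted map of (row, column, timestamp) -> value."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._values = {}  # key -> value

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)  # negated so newer versions sort first
        if key not in self._values:
            insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column):
        """Return the newest value stored for (row, column), or None."""
        i = bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._values[self._keys[i]]
        return None

    def scan(self, start_row, stop_row):
        """Yield (row, column, timestamp, value) for start_row <= row < stop_row."""
        i = bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < stop_row:
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._values[self._keys[i]]
            i += 1


table = ToyTable()
table.put("row2", "cf:a", 1, "x")
table.put("row1", "cf:a", 1, "old")
table.put("row1", "cf:a", 2, "new")
print(table.get("row1", "cf:a"))
print(list(table.scan("row1", "row2")))
```

Because the keys are kept in sorted order, a range scan is a seek followed by a sequential walk, which is the property the rest of the deck relies on.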

How HBase Achieves Scalability


(Diagram: tables, here Table A and Table B, are split into regions, and regions are assigned to RegionServers.)
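
A minimal sketch of how a row key could be mapped to a region and then to a server, assuming each region covers a contiguous, sorted range of row keys. The function names, split keys, and server names below are hypothetical and are not taken from HBase itself.

```python
from bisect import bisect_right


def locate_region(split_keys, row_key):
    """Return the index of the region holding row_key.

    split_keys is a sorted list of region boundaries. Region 0 holds keys
    below split_keys[0]; region i holds split_keys[i-1] <= key < split_keys[i].
    """
    return bisect_right(split_keys, row_key)


def server_for(row_key, split_keys, assignments):
    """Look up the server that the row's region is assigned to."""
    return assignments[locate_region(split_keys, row_key)]


split_keys = ["g", "p"]               # two splits give three regions
assignments = ["rs1", "rs2", "rs1"]   # region index -> RegionServer (made up)
print(server_for("apple", split_keys, assignments))
print(server_for("grape", split_keys, assignments))
```

Adding capacity in this model means splitting a range and reassigning regions; clients only need the boundary list to find the right server.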

HBase As Data Application Platform


Coprocessors
In-process system extension framework
Observers (like triggers)
Endpoints (like stored procedures)

System integrators can deploy application code that runs where the data resides
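
The trigger and stored-procedure analogy can be illustrated with a toy model. This is only a sketch of the idea, not the coprocessor API: `ToyRegion`, its attributes, and the `row_count` endpoint are all hypothetical names.

```python
class ToyRegion:
    """Toy region that runs application code next to the data it holds."""

    def __init__(self):
        self.rows = {}
        self.observers = []   # callables invoked on every put, like triggers
        self.endpoints = {}   # named callables, like stored procedures

    def put(self, row, value):
        # Each observer sees the write before it is applied and may rewrite it.
        for observer in self.observers:
            value = observer(row, value)
        self.rows[row] = value

    def exec_endpoint(self, name, *args):
        # The endpoint runs against local data; only its result goes back.
        return self.endpoints[name](self.rows, *args)


region = ToyRegion()
region.observers.append(lambda row, value: value.upper())
region.endpoints["row_count"] = lambda rows: len(rows)
region.put("r1", "a")
region.put("r2", "b")
print(region.rows, region.exec_endpoint("row_count"))
```

The point of the endpoint is that the count is computed where the rows live, so only a single number crosses the network.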

HBase Differentiators


                       RDBMS                            HBase
Data layout            Row oriented                     Column oriented
Transactions           Multi-row ACID                   Multi-row within region only
Query language         Native SQL                       No native query language
Security               AuthN and AuthZ (ACL)            AuthN and AuthZ (ACL, visibility labels new in 0.98)
Indexes                On arbitrary columns             Single row index only
Max data size          Terabytes                        Petabytes
R/W throughput limits  1000s of operations per second   Millions of operations per second

New In Apache HBase 0.98.0


New security features and improvements
Cell tags
HFile v3
Transparent server side encryption (HBASE-7544)
Per-cell ACLs (HBASE-7662)
Cell level visibility labels (HBASE-7663)
EXEC access permission checks for Endpoints (HBASE-6104)

New In Apache HBase 0.98.0


New features
Reverse scans (HBASE-4811)
MapReduce over snapshots (HBASE-8369)

Performance improvements
Improved WAL write threading model (HBASE-8755)
Stripe compactions (HBASE-7667)
REST streaming scans (HBASE-9343)

Cell Tags
All values written to HBase are stored into cells
Cells can now also carry one or more tags
Metadata, considered distinct from the key and the value
We use tags to implement per cell ACLs and visibility labels
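
As a rough picture of a cell that carries tags alongside its key and value, here is a small Python sketch. The classes and the tag type names `"acl"` and `"visibility"` are hypothetical illustrations, not the HBase types or their on-disk encoding.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Tag:
    tag_type: str    # hypothetical type name, e.g. "acl" or "visibility"
    payload: bytes   # opaque metadata, separate from the key and the value


@dataclass
class Cell:
    row: bytes
    column: bytes
    timestamp: int
    value: bytes
    tags: list = field(default_factory=list)

    def tags_of_type(self, tag_type):
        """Return the tags of one type, leaving key and value untouched."""
        return [t for t in self.tags if t.tag_type == tag_type]


cell = Cell(b"row1", b"cf:a", 1, b"hello",
            tags=[Tag("visibility", b"secret"), Tag("acl", b"alice:R")])
print(cell.tags_of_type("visibility"))
```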

HFile Version 3
New file format, supporting cell tags and block encryption
Enabled with a site configuration file change
hfile.format.version = 3

HFile v2 data is transparently migrated over time as new files are written by flushes and compactions

Transparent Encryption (HBASE-7544)


Built on a new cryptographic codec and key management framework inside HBase
Transparent encryption of HBase on disk data
Supports schema design that places sensitive information in only a subset of column families

Per-Cell ACLs (HBASE-7662)


Extends the existing HBase ACL model with support for persisting and checking per-cell ACL data in tags
Backwards compatible
We timestamp ACLs on a cell like any other HBase data for straightforward policy evolution

Visibility Labels (HBASE-7663)


Visibility expression support via new security coprocessor
Labels: arbitrary strings
Expressions: labels joined in boolean expressions
Operators: &, |, !, ( )

Examples:
secret
secret | topsecret
( secret | topsecret ) & !probationary
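
To make the expression semantics concrete, the following sketch evaluates an expression against a set of authorizations. It is a stand-alone illustration, not the server's implementation: the function names are hypothetical, labels are assumed to be alphanumeric, and the precedence (`!` binds tightest, then `&`, then `|`) is an assumption that only matters when parentheses are omitted.

```python
import re

_TOKEN = re.compile(r"\s*([&|!()]|[A-Za-z0-9_]+)")


def tokenize(expr):
    """Split an expression into labels and the operators & | ! ( )."""
    tokens, pos = [], 0
    expr = expr.strip()
    while pos < len(expr):
        match = _TOKEN.match(expr, pos)
        if not match:
            raise ValueError(f"unexpected character {expr[pos]!r} at {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(expr, authorizations):
    """Return True if the authorizations satisfy the visibility expression."""
    tokens = tokenize(expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        if tok is None:
            raise ValueError("unexpected end of expression")
        pos += 1
        return tok

    def parse_or():
        result = parse_and()
        while peek() == "|":
            take()
            rhs = parse_and()  # always parse the right side before combining
            result = result or rhs
        return result

    def parse_and():
        result = parse_not()
        while peek() == "&":
            take()
            rhs = parse_not()
            result = result and rhs
        return result

    def parse_not():
        if peek() == "!":
            take()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        tok = take()
        if tok == "(":
            result = parse_or()
            if take() != ")":
                raise ValueError("expected ')'")
            return result
        if tok in {"&", "|", "!", ")"}:
            raise ValueError(f"unexpected operator {tok!r}")
        return tok in authorizations  # a bare label: is the user authorized?

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return result


print(is_visible("( secret | topsecret ) & !probationary", {"secret"}))
```

A cell labeled `( secret | topsecret ) & !probationary` is therefore visible to a user holding `secret`, but not to one who also holds `probationary`.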

Visibility Labels (HBASE-7663)


New client APIs and new shell commands for label management, similar to those of Apache Accumulo, for easy migration
Users specify visibility expressions on cells
Users ask for authorizations on Gets and Scans
The server decides which authorizations are valid
Scan results are filtered according to the user's visibility

Endpoint EXEC Grants (HBASE-6104)


HBase ACLs grant a familiar set of privileges to users and groups:
(R)ead, (W)rite, E(X)ecute, (C)reate, (A)dmin

However, versions prior to 0.98.0 ignore E(X)ecute
Now access to coprocessor Endpoint invocations can be controlled on a global, per-table, or per-column family basis
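
One way to picture grants at three scopes is the toy check below. It assumes a privilege granted at a wider scope also applies at narrower ones; the data layout, the function name `is_allowed`, and the users and tables are hypothetical and do not reflect how HBase stores ACLs.

```python
def is_allowed(grants, user, action, table=None, family=None):
    """Check an action letter (R, W, X, C, A) at global, table, or family scope.

    grants maps (user, scope) to a set of action letters, where scope is
    () for global, (table,) for a table, or (table, family) for a family.
    """
    scopes = [()]
    if table is not None:
        scopes.append((table,))
        if family is not None:
            scopes.append((table, family))
    return any(action in grants.get((user, scope), set()) for scope in scopes)


grants = {
    ("alice", ()): set("RWXCA"),          # global grant
    ("bob", ("orders",)): set("RX"),      # per-table grant
    ("carol", ("orders", "cf1")): {"X"},  # per-column-family grant
}
print(is_allowed(grants, "bob", "X", table="orders"))
print(is_allowed(grants, "carol", "X", table="orders", family="cf2"))
```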

Reverse Scans (HBASE-4811)


A new scanner type that seeks to the end of a range and then steps backwards
No longer necessary to manually maintain reverse index tables for descending sorts
Exposed at the client with a new Scan option
Scan#setReversed(boolean reversed)

Performance is on par with normal (forward) scanning
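
The "seek to the end, then step backwards" behavior can be sketched over a sorted list of row keys. This is a toy model rather than the scanner itself; `reverse_scan` is a hypothetical name, and it assumes the start row is the larger key (inclusive) and the stop row the smaller one (exclusive).

```python
from bisect import bisect_right


def reverse_scan(sorted_rows, start_row, stop_row=None):
    """Yield row keys from start_row downward, stopping before stop_row."""
    # Seek to the last row that is <= start_row, then walk backwards.
    i = bisect_right(sorted_rows, start_row) - 1
    while i >= 0 and (stop_row is None or sorted_rows[i] > stop_row):
        yield sorted_rows[i]
        i -= 1


rows = ["a", "b", "c", "d"]
print(list(reverse_scan(rows, "c", "a")))
```

No second, reverse-ordered copy of the keys is needed, which is why a manually maintained reverse index table becomes unnecessary.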

MapReduce Over Snapshots (HBASE-8369)


Adds MapReduce utilities supporting jobs over snapshots of table data
Clients can skip the HBase API and read HFiles directly on disk from a table snapshot
Can increase throughput ~5x by skipping many system layers

Not recommended from a security perspective


Built in access control is completely bypassed

Improved WAL Write Throughput (HBASE-8755)


Introduces a new threading model for WAL writes that reduces lock contention
Provides better write throughput when under load, a ~15% improvement in write ops/sec at high write concurrency

Stripe Compactions (HBASE-7667)


Stripe compactions split the data inside the region by row key and create sub-ranges of data
Sub-ranges are compacted independently

Can reduce read latency variability and reduce compaction data volume (write amplification)
Some use cases can benefit, but the feature is complex to configure and tune; consult the documentation for details
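
The idea of compacting sub-ranges independently can be modeled in a few lines. This is a simplified sketch with hypothetical function names and made-up boundaries; real store files, versions, and the actual compaction policy are not represented.

```python
from bisect import bisect_right


def stripe_of(boundaries, row_key):
    """Return the index of the stripe (sub-range) that row_key falls into."""
    return bisect_right(boundaries, row_key)


def split_into_stripes(cells, boundaries):
    """Partition (row_key, value) pairs into one list per stripe."""
    stripes = [[] for _ in range(len(boundaries) + 1)]
    for row_key, value in cells:
        stripes[stripe_of(boundaries, row_key)].append((row_key, value))
    return stripes


def compact_stripe(stripe_files):
    """Merge the files of a single stripe; later files win on duplicate keys."""
    merged = {}
    for stripe_file in stripe_files:
        for row_key, value in stripe_file:
            merged[row_key] = value
    return sorted(merged.items())


boundaries = ["m"]  # one boundary gives two stripes within the region
older = split_into_stripes([("a", 1), ("z", 1)], boundaries)
newer = split_into_stripes([("a", 2), ("b", 2)], boundaries)
# Compact only stripe 0; the data in stripe 1 is not read or rewritten.
print(compact_stripe([older[0], newer[0]]))
```

Only the rows below the boundary are rewritten here, which is the intuition behind the reduced compaction volume.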

REST Streaming Scans (HBASE-9343)


Introduces a new scanning mode to the REST API for stateless scanning
The client manages paging and limits
Instead of forcing a batching up of results as they come back from the RegionServers into multiple HTTP transactions, the stateless scanner can stream all results back to the client over one HTTP connection

Upgrading to HBase 0.98.0


Direct upgrade possible from 0.94 to 0.98 using an offline data migration procedure
Upgrade from 0.96 to 0.98 is seamless
Wire compatibility
Mixed client-server and server-server operation with 0.96 possible as long as no 0.98 specific features are enabled

Binary API compatibility not guaranteed, some applications may need minor changes

Future of HBase 0.98.x Branch


Minor releases (0.98.1, 0.98.2, etc.) expected, these will contain:
Bug fixes
Performance improvements
Deprecations of some APIs for HBase 1.0
Tag compression in HFile
Performance improvements for encryption

End Questions?
