Apache HCatalog

What is it?
How does it work?
Interfaces
Architecture
Example

www.semtech-solutions.co.nz

info@semtech-solutions.co.nz

HCatalog What is it?

A set of interfaces to the Hive metastore
Shared schema and data types for Hadoop tools
REST interface for external data access
Assists interoperability between Pig, Hive and MapReduce
Table abstraction of data storage
Will provide data availability notifications
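
As a rough illustration of the REST interface mentioned above, the sketch below builds the kind of URL a client outside the cluster might request to describe a table. It assumes the WebHCat (Templeton) server that ships with HCatalog, its usual default port 50111 and its `templeton/v1/ddl` resource path. The host name, user name and the helper `webhcat_table_url` are hypothetical and are not taken from the slides.

```python
from urllib.parse import quote, urlencode


def webhcat_table_url(host, db, table, user, port=50111):
    """Build a 'describe table' URL for a WebHCat server.

    Assumptions: port 50111 is the WebHCat default and the server
    accepts a user.name query parameter; adjust both for a real cluster.
    """
    # Percent-encode the path segments so unusual names cannot break the URL.
    path = "/templeton/v1/ddl/database/{}/table/{}".format(
        quote(db, safe=""), quote(table, safe="")
    )
    query = urlencode({"user.name": user})
    return "http://{}:{}{}?{}".format(host, port, path, query)


if __name__ == "__main__":
    # Hypothetical host and user, shown only to illustrate the URL shape.
    print(webhcat_table_url("hcat.example.com", "default", "rawevents", "sally"))
```

An HTTP GET on such a URL would be expected to return the table description as JSON. The sketch only builds the string, so it can be tried without a running cluster.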


HCatalog How does it work?

Pig: HCatLoader + HCatStorer interface
MapReduce: HCatInputFormat + HCatOutputFormat interface
Hive: No interface necessary, direct access to metadata

Notifications when data is available


HCatalog Interfaces

Interface via: Pig, MapReduce, Hive, Streaming

Access data via: ORC file, RC file, Text file, Sequence file, Custom format





HCatalog Architecture


HCatalog Example
A data flow example from hive.apache.org

First, Joe in data acquisition uses distcp to get data onto the grid.

    hadoop distcp file:///file.dat hdfs://data/rawevents/20100819/data

    hcat "alter table rawevents add partition (ds='20100819') location 'hdfs://data/rawevents/20100819/data'"

Second, Sally in data processing uses Pig to cleanse and prepare the data. Without HCatalog, Sally must be manually informed by Joe when data is available, or poll on HDFS.

    A = load '/data/rawevents/20100819/data' as (alpha:int, beta:chararray, ...);
    B = filter A by bot_finder(zeta) == 0;
    ...
    store Z into 'data/processedevents/20100819/data';

With HCatalog, a JMS message is sent when the data is available, and the Pig job can then be started.

    A = load 'rawevents' using HCatLoader();
    B = filter A by date == '20100819' and bot_finder(zeta) == 0;
    ...
    store Z into 'processedevents' using HCatStorer('date=20100819');

Note that the Pig job refers to the data by table name (rawevents) rather than by location.

Now access the data via HiveQL:

    select advertiser_id, count(clicks)
    from processedevents
    where date = 20100819
    group by advertiser_id;
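
The hand-off between Joe and Sally can be pictured with a small toy model. This is only a sketch of the idea, not the HCatalog or JMS API: `ToyCatalog`, `subscribe`, `add_partition`, `location_of` and `run_demo` are all hypothetical names invented for illustration. It shows the two points made above. The consumer is notified when a partition is registered instead of polling HDFS, and it resolves the data location by table name through the catalog.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# (table name, partition spec such as "ds=20100819")
PartitionKey = Tuple[str, str]


@dataclass
class ToyCatalog:
    """In-memory stand-in for a metastore (hypothetical, not HCatalog's API)."""
    locations: Dict[PartitionKey, str] = field(default_factory=dict)
    listeners: List[Callable[[str, str], None]] = field(default_factory=list)

    def subscribe(self, listener: Callable[[str, str], None]) -> None:
        # Plays the role of listening on a notification topic.
        self.listeners.append(listener)

    def add_partition(self, table: str, spec: str, location: str) -> None:
        # Register the partition, then tell every subscriber it is available.
        self.locations[(table, spec)] = location
        for listener in self.listeners:
            listener(table, spec)

    def location_of(self, table: str, spec: str) -> str:
        # Consumers ask by table name; only the catalog knows the path.
        return self.locations[(table, spec)]


def run_demo() -> List[str]:
    catalog = ToyCatalog()
    started: List[str] = []

    def start_cleansing_job(table: str, spec: str) -> None:
        # Sally's job starts on notification and looks the location up by name.
        location = catalog.location_of(table, spec)
        started.append(f"cleanse {table} [{spec}] from {location}")

    catalog.subscribe(start_cleansing_job)
    # Joe registers the new partition; no manual message or polling is needed.
    catalog.add_partition(
        "rawevents", "ds=20100819", "hdfs://data/rawevents/20100819/data"
    )
    return started


if __name__ == "__main__":
    for line in run_demo():
        print(line)
```

In a real deployment the notification would arrive as a JMS message and the lookup would go through HCatLoader. The toy version only mirrors the control flow.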


Contact Us

Feel free to contact us at

www.semtech-solutions.co.nz
info@semtech-solutions.co.nz

We offer IT project consultancy
We are happy to hear about your problems
You can just pay for those hours that you need to solve your problems