(and friends)
Qualitative or quantitative attributes of a variable or set of variables. Data are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables. Data are often viewed as the lowest level of abstraction from which information and then knowledge are derived. Raw data, i.e., unprocessed data, refers to a collection of numbers, characters, images or other outputs from devices that collect information to convert physical quantities into symbols.
Arguably also, knowledge about the data, so that the tools can then make use of the data in a meaningful sense, to extract information from it.
Data is present, organized, recorded, and catalogued. Tools exist that are able to operate on the data.
Finding it
Evolve : Organization to support various data modeling concepts (table, partitions, columns, records)
Reading it
Each tool having its own storage space, its own private world
Evolve : Abstracting away storage mechanism and having tools sit on top of file formats and mechanisms, so now, suddenly, tools have interoperability.
Evolve : Having a storage abstraction that adapts to existing storage mechanisms in an easy to develop manner
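The step from private per-tool storage to a shared storage abstraction can be illustrated with a small sketch. It is not HCatalog code: `StorageDriver`, `CsvDriver`, `JsonLinesDriver`, and `Catalog` are hypothetical names chosen for this example, and the in-memory strings stand in for files on a real storage system.

```python
import csv
import io
import json
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Hypothetical interface: converts stored text to records and back."""

    @abstractmethod
    def read(self, raw):
        """Return a list of dict records parsed from the stored text."""

    @abstractmethod
    def write(self, records):
        """Return the stored text for a list of dict records."""


class CsvDriver(StorageDriver):
    def read(self, raw):
        return list(csv.DictReader(io.StringIO(raw)))

    def write(self, records):
        out = io.StringIO()
        fields = list(records[0]) if records else []
        writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)
        return out.getvalue()


class JsonLinesDriver(StorageDriver):
    def read(self, raw):
        return [json.loads(line) for line in raw.splitlines() if line.strip()]

    def write(self, records):
        return "".join(json.dumps(record) + "\n" for record in records)


class Catalog:
    """Maps a table name to its stored text and the driver that understands it."""

    def __init__(self):
        self._tables = {}

    def register(self, name, driver, raw=""):
        self._tables[name] = (driver, raw)

    def load(self, name):
        driver, raw = self._tables[name]
        return driver.read(raw)

    def store(self, name, records):
        driver, _ = self._tables[name]
        self._tables[name] = (driver, driver.write(records))


catalog = Catalog()
catalog.register("clicks_csv", CsvDriver(), "user,url\nalice,/home\n")
catalog.register("clicks_json", JsonLinesDriver())

# A "tool" copies rows between tables without knowing either format.
catalog.store("clicks_json", catalog.load("clicks_csv"))
print(catalog.load("clicks_json"))  # [{'user': 'alice', 'url': '/home'}]
```

The tool only names tables; adding a new storage format means writing one more driver, not changing every tool.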
Project owner - cares about amount of resources used, data portability, data connectors
Ops - needs to manage data storage and the cluster, and needs to control data expiry, replication, import and export.
Also, people who help the aforementioned people:
Tool Writer - wants abstractions to deal with variances, and wants to be able to store and retrieve relevant metadata and data, so they can focus on their users
Storage subsystem writer - wants standardization so that they can be used by other actors.
Interoperability, Convenience
Pig HCatalog
Hive
HDFS
HBase
MPP Store
Users can query data with Pig, Hive, or custom MapReduce jobs
Standard HDFS formats available Q1 2012
HBase data by early Q2 2012
HCatLoader
HCatInputFormat
HCatStorer
HCatOutputFormat
Hive MetaStore Client
Generated Thrift Client
CLI
Notification
Hive MetaStore
RDBMS
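The component list above ends with the metastore keeping its state in an RDBMS. As a rough illustration of that idea only, the sketch below stores table metadata in an in-memory SQLite database. The schema (`catalog_tables`, `catalog_columns`) and the functions `create_table` and `describe_table` are assumptions made for this example; they are not the Hive MetaStore schema or its client API.

```python
import sqlite3

# In-memory database standing in for the metastore's RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE catalog_tables (name TEXT PRIMARY KEY, location TEXT, format TEXT)"
)
conn.execute(
    "CREATE TABLE catalog_columns "
    "(table_name TEXT, position INTEGER, name TEXT, type TEXT)"
)


def create_table(name, location, fmt, columns):
    """Record where a table lives, how it is stored, and its ordered columns."""
    with conn:  # commit both inserts together, or roll both back
        conn.execute(
            "INSERT INTO catalog_tables VALUES (?, ?, ?)", (name, location, fmt)
        )
        conn.executemany(
            "INSERT INTO catalog_columns VALUES (?, ?, ?, ?)",
            [(name, i, col, typ) for i, (col, typ) in enumerate(columns)],
        )


def describe_table(name):
    """Return the metadata a tool needs before it can read the table."""
    row = conn.execute(
        "SELECT location, format FROM catalog_tables WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    cols = conn.execute(
        "SELECT name, type FROM catalog_columns "
        "WHERE table_name = ? ORDER BY position",
        (name,),
    ).fetchall()
    return {"location": row[0], "format": row[1], "columns": cols}


create_table("clicks", "/data/clicks", "rcfile", [("user", "string"), ("url", "string")])
print(describe_table("clicks"))
```

Any tool that can reach the catalog gets the same answer about location, format, and schema, which is what lets several tools share one table.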
Storage
HCatalog
Getting Involved
TODO
HCATALOG-8 : HCatalog needs a logo
HBase integration, trying to nail down a better table metaphor
Hive integration: interoperability between the notion of StorageDriver and StorageHandler, project dependency management
0.23 Work
HCATALOG-182 : Improve the "and friends" bit.
Templeton
A Webservices API for Hadoop
Insulation from interface changes release to release
Opens the door to languages other than Java
Thin clients through webservices vs forced fat-clients in gateway
Register table relationships for data (e.g., createTable, createDatabase)
Adjust tables (e.g., AlterTable)
Look at statistics (e.g., ShowTable)
MapReduce, Pig, Hive
Poll for job status
Notification URL when job completes (optional)
Stateless Server
Horizontally scale for load
Configurable for HA
Currently requires ZooKeeper to track job status info
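Polling for job status against a stateless server can be sketched as a simple client-side loop. This is an illustration under stated assumptions, not Templeton's API: `wait_for_job`, the `fetch_status` callback, and the state names in `TERMINAL_STATES` are all hypothetical. In a real client, `fetch_status` would issue an HTTP request to the web service; here it is injected so the example runs without a server.

```python
import time

# Hypothetical state names; a real service defines its own.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "KILLED"}


def wait_for_job(job_id, fetch_status, interval=1.0, max_polls=60, sleep=time.sleep):
    """Poll fetch_status(job_id) until the job reaches a terminal state.

    Raises TimeoutError if the job is still running after max_polls checks.
    """
    for attempt in range(max_polls):
        state = fetch_status(job_id)
        if state in TERMINAL_STATES:
            return state
        if attempt < max_polls - 1:
            sleep(interval)  # no need to wait after the final check
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")


# Stand-in for the web service: reports three states in order.
_fake_states = iter(["PREP", "RUNNING", "SUCCEEDED"])
result = wait_for_job(
    "job_0001",
    lambda job_id: next(_fake_states),
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(result)  # SUCCEEDED
```

Because the client holds the job id and the server keeps no per-client session, any server instance behind a load balancer can answer each poll. The optional notification URL avoids the loop entirely by having the server call back when the job completes.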
ANY QUESTIONS ?