INTERNAL
August 16, 2012
Copyright 2012 Tata Consultancy Services Limited
Contents
Introduction
Why Hive?
Configuring Hive
The Hive Shell
Hive Architecture
HiveQL
Data Types and Table types
Managed Table
External Table
Storage Formats
Queries
View
Hive Data Model
The Metastore
User Defined Functions
What Hive is not?
Introduction
Why Hive?
Configuring Hive
HiveQL is Hive's query language, a dialect of SQL. HiveQL is generally case insensitive (except for string comparisons).
Hive Architecture
[Figure: Hive architecture. Components shown: Command Line Interface, Web Interface, JDBC, ODBC, Thrift Server, Libraries, Driver (Compiler, Optimizer, Executor), HiveQL, Metastore, Hadoop (MapReduce + HDFS).]
UI - The user interface for users to submit queries and other operations to the system.
CLI - The command line interface to Hive (the shell). This is the default service.
HWI - The Hive Web Interface.
Driver - The component that receives the queries. It implements the notion of session handles and provides execute and fetch APIs modeled on JDBC/ODBC interfaces.
Metastore - The component that stores all the structure information of the various tables and partitions in the warehouse.
Compiler - The component that parses the query, does semantic analysis on the different query blocks and query expressions, and eventually generates an execution plan.
Execution Engine - The component that executes the execution plan created by the compiler.
Thrift Client - The Thrift client makes it easy to run Hive commands from a wide range of programming languages.
ODBC Driver - The Hive ODBC Driver allows applications that support the ODBC protocol to connect to Hive. The ODBC driver uses Thrift to communicate with the Hive server.
MapReduce - The Hadoop framework that runs the jobs generated from the execution plan.
HiveQL
HiveQL is Hive's SQL dialect.
It does not support the full set of SQL-92 language constructs.
The main differences between HiveQL and SQL are:
Feature             SQL                      HiveQL
Updates             UPDATE, INSERT, DELETE   No row-level updates
Indexes             Supported                Not supported
Functions
Views               Updatable                Read-only
Multitable inserts  Not supported            Supported
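To illustrate the multitable insert row in the comparison above, here is a minimal sketch. The tables source_table, target_a, and target_b and their columns are hypothetical and are assumed to exist already; the point is only that one scan of the source feeds several INSERT clauses.

```sql
-- Hypothetical tables: source_table(col1 INT, col2 STRING),
-- target_a(col1 INT), target_b(col2 STRING, cnt BIGINT).
-- The FROM clause comes first so that a single pass over source_table
-- can populate both targets.
FROM source_table
INSERT OVERWRITE TABLE target_a
  SELECT col1
  WHERE col1 > 0
INSERT OVERWRITE TABLE target_b
  SELECT col2, COUNT(1)
  GROUP BY col2;
```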
Managed Table
Managed Table - Hive moves the data into its warehouse directory.
hive> CREATE TABLE managed_table (dummy STRING);
hive> LOAD DATA INPATH '/user/txt' INTO TABLE managed_table;
When a managed table is dropped, both its data and its metadata are deleted.
External Table
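For contrast with a managed table, the following is a minimal sketch of an external table. The table name and the HDFS path are hypothetical, chosen only for illustration.

```sql
-- Hypothetical table name and HDFS location.
-- EXTERNAL tells Hive that it does not own the data under LOCATION.
CREATE EXTERNAL TABLE external_table (dummy STRING)
LOCATION '/user/hive_demo/external_table';

-- Dropping an external table deletes only the metadata;
-- the files under LOCATION are left untouched.
DROP TABLE external_table;
```

An external table is usually the safer choice when other tools also read the same files, since dropping the table in Hive cannot remove the data.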
Queries
Table Creation
hive> CREATE TABLE <table name>
(<column name> <data type>, ...)
ROW FORMAT DELIMITED FIELDS
TERMINATED BY '<character>';
Alter a Table
hive> ALTER TABLE <table name> ADD COLUMNS (<column name> <data type>);
Drop a Table
hive> DROP TABLE <table name>;
Queries (Contd..)
Describe table structure
hive> DESCRIBE <table name>;
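As a concrete instance of the create, alter, describe, and drop templates, here is a short sketch. The table employees and its columns are hypothetical and serve only to fill in the placeholders.

```sql
-- Hypothetical table and column names.
-- Create a table whose input files are tab-delimited text.
CREATE TABLE employees (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Add one more column; note the plural keyword COLUMNS.
ALTER TABLE employees ADD COLUMNS (dept STRING);

-- Show the column names and types.
DESCRIBE employees;

-- Remove the table (data and metadata, since it is a managed table).
DROP TABLE employees;
```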
Subquery
Hive supports subqueries only in the FROM clause.
The columns in the subquery select list are available in the outer query just like columns of a table.
Example
SELECT col
FROM (
  SELECT col1 + col2 AS col
  FROM table1
) table2;
Join in Hive
Hive supports only equality joins, outer joins, and left semi joins.
Hive does not support join conditions that are not equality conditions, because such conditions are very difficult to express as a MapReduce job.
More than two tables can be joined in Hive.
Example
hive> SELECT table1.*, table2.*
    > FROM table1 JOIN table2 ON (table1.col1 = table2.col1);
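The other join forms mentioned on this slide can be sketched in the same style. The table table3 and the column col2 are hypothetical additions; table1, table2, and col1 follow the example above.

```sql
-- Joining more than two tables (table3 and col2 are hypothetical).
SELECT table1.*, table2.*, table3.*
FROM table1
JOIN table2 ON (table1.col1 = table2.col1)
JOIN table3 ON (table2.col2 = table3.col2);

-- Outer join: rows of table1 without a match are kept,
-- with NULLs in the table2 columns.
SELECT table1.*, table2.*
FROM table1 LEFT OUTER JOIN table2 ON (table1.col1 = table2.col1);

-- Left semi join: keeps table1 rows that have at least one match.
-- Only columns from the left-hand table may appear in the SELECT list.
SELECT table1.*
FROM table1 LEFT SEMI JOIN table2 ON (table1.col1 = table2.col1);
```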
View
A view is a kind of virtual table that is defined by a SELECT statement.
Views can be used to present data to users differently from the way it is actually stored on disk.
Syntax
CREATE VIEW <ViewName>
AS
SELECT *
FROM <TableName>
WHERE <Condition>;
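A filled-in version of the view syntax might look like the sketch below. The table records, its column temperature, and the view name valid_records are hypothetical.

```sql
-- Hypothetical base table: records(year STRING, temperature INT).
-- The view hides rows that carry a sentinel "missing" value.
CREATE VIEW valid_records
AS
SELECT *
FROM records
WHERE temperature != 9999;

-- A view is queried like a table; its SELECT is evaluated at query time.
SELECT year, MAX(temperature)
FROM valid_records
GROUP BY year;

-- Dropping the view does not affect the underlying table.
DROP VIEW valid_records;
```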
Tables
Partitions - Each table can have one or more partition keys, which determine how the data is stored.
Buckets - Data in each partition may in turn be divided into buckets based on the hash of a column in the table. Each bucket is stored as a file in the partition directory.
Partitions
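A minimal sketch of a partitioned table follows. The table logs, its columns, the partition keys dt and country, and the input path are all hypothetical.

```sql
-- Hypothetical table; dt and country are partition keys, not data columns.
CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);

-- The target partition is named explicitly at load time.
-- The file lands in a subdirectory such as .../logs/dt=2012-08-16/country=IN/
LOAD DATA INPATH '/user/hive_demo/logs_file1'
INTO TABLE logs
PARTITION (dt = '2012-08-16', country = 'IN');

-- List the partitions Hive knows about.
SHOW PARTITIONS logs;

-- Filtering on a partition key lets Hive read only the matching directories.
SELECT ts, line
FROM logs
WHERE country = 'IN';
```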
Buckets
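A minimal sketch of a bucketed table follows. The tables users and bucketed_users and the bucket count of 4 are hypothetical, and the hive.enforce.bucketing setting is an assumption about the Hive release in use: older releases need it, while newer ones always enforce bucketing.

```sql
-- Hypothetical tables: users(id INT, name STRING) already exists.
-- Rows are assigned to one of 4 buckets by hashing the id column.
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;

-- Assumption: an older Hive release, where this must be set so that
-- the insert produces one file per bucket.
SET hive.enforce.bucketing = true;

-- Populate the bucketed table from the unbucketed one.
INSERT OVERWRITE TABLE bucketed_users
SELECT * FROM users;

-- Sample a single bucket, roughly a quarter of the rows.
SELECT *
FROM bucketed_users TABLESAMPLE (BUCKET 1 OUT OF 4 ON id);
```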
Buckets (Contd..)
The Metastore
Local metastore - It contains an embedded Derby database instance backed by the local disk. This does not support multiple sessions.
Remote metastore - It uses a standalone database. MySQL is a popular choice for the standalone metastore.
Steps
mysql> CREATE USER 'hiveuser'@'%' IDENTIFIED BY 'password';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE ON hive_db_metastore.* TO 'hiveuser'@'%';
mysql> REVOKE ALTER,CREATE ON hive_db_metastore.* FROM 'hiveuser'@'%';
What Hive is not?
Hive is not designed for online transaction processing and does not offer real-time queries or row-level updates.
Latency for Hive queries is generally very high (minutes), even when the data sets involved are very small (say, a few hundred megabytes).
References
https://cwiki.apache.org/Hive/tutorial.html
https://cwiki.apache.org/Hive/languagemanual-cli.html
Hadoop in Action
Thank You