
Hive

INTERNAL
August 16, 2012
Copyright 2012 Tata Consultancy Services Limited

Only for TCS Internal Training TCS NextGen Solutions, Kochi

Contents

Introduction
Why Hive?
Configuring Hive
The Hive Shell
Hive Architecture
HiveQL
Data Types and Table types
Managed Table
External Table
Storage Formats
Queries
View
Hive Data Model
The Metastore
User Defined Functions
What Hive is not?


Introduction

Hive is a data warehouse infrastructure built on top of Apache Hadoop.

Hive is designed to enable
Easy data summarization
Ad-hoc querying
Analysis of large volumes of data

Hive provides a simple query language called HiveQL.

HiveQL also allows traditional map/reduce programmers to plug in their custom mappers and reducers.


Why Hive?

Need a multi-petabyte warehouse

Files are insufficient data abstractions
Need tables, schemas, partitions, indices

Need for an open data format
RDBMS have a closed data format
Flexible schema


Configuring Hive

Download a release at ftp://ftp.nextgen.com

Unpack the tarball in a suitable place on your workstation

% tar xzf hive-x.y.z-dev.tar.gz

Put Hive on your path

% export HIVE_HOME=/home/EmpID/hive-x.y.z-dev
% export PATH=$PATH:$HIVE_HOME/bin

Type hive to launch the shell

% hive
hive>


The Hive Shell

The hive shell is the primary way that we will interact with Hive.

HiveQL is Hive's query language, a dialect of SQL.

HiveQL is generally case insensitive (except for string comparisons).

The hive shell can also be run in non-interactive mode. The -f option runs the commands in the specified script file.
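
Non-interactive use is easy to drive from another program. The sketch below, in Python, only builds the command line for the -f option; the helper name hive_script_command and the script path are hypothetical, and actually running the command assumes a hive executable is on the PATH.

```python
import shlex


def hive_script_command(script_path):
    """Build the argument list that runs a HiveQL script non-interactively."""
    # -f tells the hive shell to run the commands in the given script file.
    return ["hive", "-f", script_path]


cmd = hive_script_command("/home/EmpID/queries/report.q")  # hypothetical path
print(shlex.join(cmd))
# To execute it, pass the list to subprocess.run(cmd, check=True)
# on a machine where Hive is installed.
```

Building the command as a list rather than a single string avoids shell quoting problems when the script path contains spaces.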


Hive Architecture

[Figure: Hive architecture diagram showing the Command Line Interface, Web Interface, JDBC, ODBC and Thrift Server; the Driver (Compiler, Optimizer, Executor); the Metastore; and Hadoop (MapReduce + HDFS).]


Hive Architecture (Contd..)


UI
The user interface for users to submit queries and other operations to the system.

CLI
The command line interface to Hive (the shell). This is the default service.

HWI
The Hive web interface can be used as an alternative to the shell. It can be started using the following commands
% export ANT_LIB=/path/to/ant/lib
% hive --service hwi

Driver
The component which receives the queries.


Hive Architecture (Contd..)


Metastore
The component that stores all the structure information of the various tables and partitions in the warehouse.

Compiler
The component that parses the query, does semantic analysis on the different query blocks and query expressions, and eventually generates an execution plan.

Execution Engine
The component which executes the execution plan created by the compiler.

Thrift Client
The Thrift client makes it easy to run Hive commands from a wide range of programming languages.


Hive Architecture (Contd..)


JDBC Driver
Hive provides a Type 4 (pure Java) JDBC driver, defined in the class org.apache.hadoop.hive.jdbc.HiveDriver.
When configured with a JDBC URI of the form jdbc:hive://host:port/dbname, a Java application will connect to a Hive server running in a separate process at the given host and port.

ODBC Driver
The Hive ODBC driver allows applications that support the ODBC protocol to connect to Hive. The ODBC driver uses Thrift to communicate with the Hive server.

MapReduce
Hive internally runs the query as a MapReduce job.


HiveQL
HiveQL is Hive's SQL dialect.
It does not provide the full set of SQL-92 language constructs.
The main differences between SQL and HiveQL are

Feature             SQL                             HiveQL
Updates             INSERT, UPDATE and DELETE       INSERT OVERWRITE TABLE
Indexes             Supported                       Not supported
Functions           Hundreds of built-in functions  Dozens of built-in functions
Views               Updatable                       Read-only
Multitable inserts  Not supported                   Supported


Data Types and Table Types


Hive Data Types
Hive supports both primitive and complex data types.
Primitive Data Types
Signed Integer - TINYINT, SMALLINT, INT, BIGINT
Floating Point - FLOAT, DOUBLE
BOOLEAN
STRING
Complex Data Types
ARRAY, MAP and STRUCT
Hive Table Types
Hive Tables are of two types
Managed Tables
External Tables


Managed Table

Managed Table - Hive moves the data into its warehouse directory
hive> CREATE TABLE managed_table (dummy STRING);
hive> LOAD DATA INPATH '/user/txt' INTO TABLE managed_table;
When a managed table is dropped, the table, including its data and metadata, is deleted.


External Table

External Table - Hive refers to the data that is at an existing location outside the warehouse directory
Use the keyword EXTERNAL to specify an external table.
hive> CREATE EXTERNAL TABLE ext_table (dummy STRING) LOCATION '/user/tom/ext_table';
hive> LOAD DATA INPATH '/user/text' INTO TABLE ext_table;
When an external table is dropped, Hive leaves the data untouched and deletes only the metadata.
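
The difference between the two table types shows up when a table is dropped. The following Python sketch is a toy model of that behaviour only, not how Hive is implemented: the ToyHive class, its method names and the temporary directories are all hypothetical, and the local file system stands in for HDFS.

```python
import shutil
import tempfile
from pathlib import Path


class ToyHive:
    """Toy model of DROP TABLE semantics for managed and external tables."""

    def __init__(self, warehouse):
        self.warehouse = Path(warehouse)
        self.metadata = {}  # table name -> (data directory, is_external)

    def create_managed(self, name):
        # A managed table lives inside the warehouse directory.
        location = self.warehouse / name
        location.mkdir(parents=True)
        self.metadata[name] = (location, False)

    def create_external(self, name, location):
        # An external table only records where the existing data is.
        self.metadata[name] = (Path(location), True)

    def load(self, source, name):
        # Loading moves the source file into the table's directory.
        location, _ = self.metadata[name]
        shutil.move(str(source), str(location / Path(source).name))

    def drop(self, name):
        # Metadata always goes; data goes only for managed tables.
        location, is_external = self.metadata.pop(name)
        if not is_external:
            shutil.rmtree(location)


root = Path(tempfile.mkdtemp())
hive = ToyHive(root / "warehouse")

# Managed table: the loaded file moves into the warehouse directory.
source = root / "txt"
source.write_text("row1\n")
hive.create_managed("managed_table")
hive.load(source, "managed_table")
managed_file = root / "warehouse" / "managed_table" / "txt"

# External table: the data already sits outside the warehouse.
ext_dir = root / "ext_table"
ext_dir.mkdir()
(ext_dir / "data").write_text("row1\n")
hive.create_external("ext_table", ext_dir)

hive.drop("managed_table")
hive.drop("ext_table")
print(managed_file.exists())        # False: data deleted with the table
print((ext_dir / "data").exists())  # True: data left untouched
```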


Queries
Table Creation
hive> CREATE TABLE <table name>
(<column name> <data type>, ...)
ROW FORMAT DELIMITED FIELDS
TERMINATED BY '<character>';
Alter a Table
hive> ALTER TABLE <table name> ADD COLUMNS (<column name> <data type>);
Drop a Table
hive> DROP TABLE <table name>;


Queries (Contd..)
Describe table structure
hive> DESCRIBE <table name>;

To show all tables in the database
hive> SHOW TABLES;

To load data into Hive tables
hive> LOAD DATA INPATH '<file path>'
INTO TABLE <table name>;

To retrieve data from Hive tables
hive> SELECT * FROM <table name>;


Subquery
Hive supports subqueries only in the FROM clause.
The columns in the subquery select list are available in the outer query just like columns of a table.
Example
SELECT col
FROM (
  SELECT col1+col2 AS col
  FROM table1
) table2;
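
To see the subquery example produce rows without a Hadoop cluster, the same statement can be tried against an in-memory SQLite database from Python. This is only a stand-in: SQLite is not Hive, and the sample values in table1 are made up for illustration.

```python
import sqlite3

# SQLite stands in for Hive here; the rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1 INTEGER, col2 INTEGER)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", [(1, 2), (10, 20)])

# The outer query sees the subquery's select list (col) like a table column.
rows = conn.execute(
    """
    SELECT col
    FROM (
        SELECT col1 + col2 AS col
        FROM table1
    ) table2
    ORDER BY col
    """
).fetchall()
print(rows)  # [(3,), (30,)]
conn.close()
```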


Join in Hive
Hive supports only equality joins, outer joins, and left semi joins.
Hive does not support join conditions that are not equality conditions, as it is very difficult to express such conditions as a map/reduce job.
More than two tables can be joined in Hive.
Example
hive> SELECT table1.*, table2.*
    > FROM table1 JOIN table2 ON (table1.col1 = table2.col1);
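
One way to see why equality conditions fit map/reduce well is a reduce-side join: the map step emits every row under its join key, and the reduce step pairs up rows that landed under the same key. The Python sketch below illustrates that general technique on in-memory lists; the function name and sample rows are hypothetical, and it makes no claim about the plan Hive actually generates.

```python
from collections import defaultdict
from itertools import product


def reduce_side_join(table1, table2, key):
    """Join two lists of dict rows on equality of `key`, map/reduce style."""
    # Map phase: emit each row under its join key, kept apart by source table.
    groups = defaultdict(lambda: ([], []))
    for row in table1:
        groups[row[key]][0].append(row)
    for row in table2:
        groups[row[key]][1].append(row)
    # Reduce phase: rows sharing a key meet in one group and are paired up.
    joined = []
    for left_rows, right_rows in groups.values():
        for left, right in product(left_rows, right_rows):
            joined.append((left, right))
    return joined


table1 = [{"col1": 1, "name": "a"}, {"col1": 2, "name": "b"}]
table2 = [{"col1": 2, "qty": 5}, {"col1": 3, "qty": 7}]
result = reduce_side_join(table1, table2, "col1")
print(result)
# [({'col1': 2, 'name': 'b'}, {'col1': 2, 'qty': 5})]
```

Grouping by key works only because matching rows share exactly the same key value; a condition such as table1.col1 < table2.col1 gives no single key to group on.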


View
A view is a sort of virtual table that is defined by a SELECT statement
Views can be used to present data to users in a way that differs from how it is actually stored on disk.
Syntax
CREATE VIEW <ViewName>
AS
SELECT *
FROM <TableName>
WHERE <Condition>;
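
The "virtual table" idea can be tried out in Python with an in-memory SQLite database. SQLite is only a stand-in for Hive here, and the employees table, the kochi_employees view and their rows are hypothetical sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Anu", "Kochi"), ("Ravi", "Chennai")],
)

# The view stores only the SELECT statement, not a copy of the rows.
conn.execute(
    "CREATE VIEW kochi_employees AS SELECT * FROM employees WHERE city = 'Kochi'"
)

# A row added after the view was created still shows up through the view.
conn.execute("INSERT INTO employees VALUES ('Maya', 'Kochi')")
rows = conn.execute("SELECT name FROM kochi_employees ORDER BY name").fetchall()
print(rows)  # [('Anu',), ('Maya',)]
conn.close()
```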


Hive Data Model


Data in Hive is organized into

Tables
These are analogous to tables in relational databases. Tables can be filtered, projected, joined and unioned. Additionally, all the data of a table is stored in a directory in HDFS.

Partitions
Each table can have one or more partition keys which determine how the data is stored.

Buckets
Data in each partition may in turn be divided into buckets based on the hash of a column in the table. Each bucket is stored as a file in the partition directory.
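
The table / partition / bucket hierarchy can be pictured as a path. The Python sketch below composes such a path under several assumptions that are not from the slides: the warehouse root /user/hive/warehouse, the table name logs, the partition key dt, the bucket file name bucket_N, and a hash that is simply the non-negative integer value itself.

```python
from pathlib import PurePosixPath


def bucket_for(value, num_buckets):
    """Pick a bucket for a non-negative integer column value.

    Assumption for illustration: the hash of the value is the value itself.
    """
    return value % num_buckets


def storage_path(warehouse, table, partition, bucket):
    """Compose table directory / partition directories / bucket file."""
    path = PurePosixPath(warehouse) / table
    for key, value in partition.items():
        # One directory level per partition key, named key=value.
        path = path / f"{key}={value}"
    # The bucket file name is hypothetical; only its position matters here.
    return path / f"bucket_{bucket}"


bucket = bucket_for(10, 4)
path = storage_path("/user/hive/warehouse", "logs", {"dt": "2012-08-16"}, bucket)
print(path)  # /user/hive/warehouse/logs/dt=2012-08-16/bucket_2
```

Because the partition value is part of the directory name, a query that filters on the partition key only needs to read the matching directories.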


Partitions


Buckets


Buckets (Contd..)


The Metastore

There are three configurations

Embedded metastore
It contains an embedded Derby database instance backed by the local disk. This does not support multiple sessions.

Local metastore
It uses a standalone database. MySQL is a popular choice for the standalone metastore.

Remote metastore
One or more metastore servers run in separate processes to the Hive service.


Configuring Hive to have MySQL as Metastore DB
Get the MySQL JDBC connector JAR and copy it to the hive/lib directory.
Copy the hive-schema-0.7.0.mysql.sql file, found at hive-0.7.1cdh3u2/src/metastore/scripts/upgrade/mysql/hive-schema-0.7.0.mysql.sql, to the machine where the MySQL DB is installed and keep it in a directory for later use.
Connect to the DB with the ID and password
$> mysql -u username -p"password"
Create the database hive_db_metastore and load the schema into it
mysql> create database hive_db_metastore;
mysql> use hive_db_metastore;
mysql> SOURCE /home/Emp_Id/hive-schema-0.7.0.mysql.sql;


Configuring Hive to have MySQL as Metastore DB (Contd..)
You also need a MySQL user account for Hive to use to access the metastore.

Steps
mysql> CREATE USER 'hiveuser'@'%' IDENTIFIED BY 'password';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE ON hive_db_metastore.* TO 'hiveuser'@'%';
mysql> REVOKE ALTER,CREATE ON hive_db_metastore.* FROM 'hiveuser'@'%';


User Defined Functions


There are three types of UDF in Hive

UDF (User Defined Function) - Operates on a single row and produces a single row as output.

UDAF (User Defined Aggregate Function) - Works on multiple input rows and creates a single output row.

UDTF (User Defined Table Generating Function) - Operates on a single row and produces multiple rows as output.

A UDF must satisfy the following two properties

1. A UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF
2. A UDF must implement at least one evaluate() method


User Defined Functions (Contd..)


To use the UDF in Hive
hive> ADD JAR /path/to/hive-examples.jar;
hive> CREATE TEMPORARY FUNCTION strip AS 'com.hive.Strip';
hive> SELECT strip('banana', 'ab') FROM dummy;
nan
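
The row-count contract of the three function types (UDF, UDAF, UDTF) can be mimicked with plain Python functions. This is only an analogy: real Hive functions are Java classes, and the names maximum and explode below are illustrative choices rather than code from the slides. The strip function assumes the UDF removes any of the given characters from both ends of the string, which is consistent with the 'nan' result shown above.

```python
def strip(text, chars=None):
    """UDF-style: one input row in, one output value out."""
    # Removes any of the given characters from both ends of the string.
    return text.strip(chars)


def maximum(values):
    """UDAF-style: many input rows in, one output row out."""
    return max(values)


def explode(items):
    """UDTF-style: one input row in, several output rows out."""
    for item in items:
        yield item


print(strip("banana", "ab"))     # nan
print(maximum([3, 9, 4]))        # 9
print(list(explode([1, 2, 3])))  # [1, 2, 3]
```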


What Hive is not?

Hive is not designed for online transaction processing and does not offer real-time queries or row-level updates.

Latency for Hive queries is generally very high (minutes), even when the data sets involved are very small (say, a few hundred megabytes).




Thank You
