
Unit- IV Introduction to HIVE

What is HIVE?
Apache Hive is a data warehouse system built on top of Hadoop that allows you to query and
analyze large datasets using a SQL-like language (HiveQL).
Hive (Apache Hive) is an open-source data warehouse system offered on top of Hadoop. It is a
mechanism through which we can access the data stored in the Hadoop Distributed File System
(HDFS). It provides an interface similar to SQL, which enables you to create databases and
tables for storing data. In this way, you can make use of MapReduce without explicitly writing
MapReduce source code.
Hive's language, HiveQL, is also the primary data processing method for Treasure Data, a cloud
data platform that allows you to collect, store, and analyze data on the cloud. Treasure Data
manages its own Hadoop cluster, which accepts your queries and executes them using the Hadoop
MapReduce framework. Hive automatically translates the SQL-like HiveQL queries into MapReduce
jobs that are executed on Hadoop.
How Does Hive Work?

1. A user writes a query in HiveQL (which is similar to SQL).
2. Hive translates the query into a format that Hadoop understands (MapReduce jobs).
3. Hadoop processes the data stored in HDFS (Hadoop Distributed File System).
4. The results are generated and returned to the user.

What is Hive Used For?

 Hive is mainly used for analyzing and processing large datasets.
 It is not a traditional database and is not suitable for tasks that need real-time processing
(like online banking).
 It works best for batch processing, such as analyzing logs, records, and big data.

Features:
The following is a list of Apache Hive's main features:

1. Apache Hive is free to use. It is open source and freely accessible.
2. Large datasets kept in the Hadoop Distributed File System can be handled using Hive.
3. The data may be queried concurrently by several users.
4. Apache Hive meets Apache Hadoop's low-level interface requirements well.
5. Apache Hive partitions and buckets data at the table level to enhance speed.
6. Hive supports numerous file formats, such as TextFile, SequenceFile, RCFile, Avro, ORC, and
Parquet, as well as data compression.
7. HiveQL is essentially the same as SQL, so to use Hive we don't need to be proficient in any
programming language; simple SQL is all we need.
8. Hive has several built-in functions and supports external tables, which allow us to process
data in place without moving it into Hive's managed warehouse storage.

HIVE Architecture:

The architecture of Hive consists of various components. These components are described as
follows:

Hive Client

Hive allows writing applications in various languages, including Java, Python, and C++. It supports
different types of clients, such as:

o Thrift Server - A cross-language service provider platform that serves requests from all
programming languages that support Thrift.
o JDBC Driver - Used to establish a connection between Hive and Java applications. The
JDBC driver is provided by the class org.apache.hadoop.hive.jdbc.HiveDriver.
o ODBC Driver - Allows applications that support the ODBC protocol to connect to Hive.
Hive Services

The following are the services provided by Hive:-

HIVE CLI (Command Line Interface) - This is the most commonly used interface of Hive, usually
referred to simply as the Hive CLI. It is a shell where we can execute Hive queries and
commands.

HIVE Web Interface-This is a simple Graphical User Interface (GUI) used to connect to Hive. To
use this interface, you need to configure it during the Hive installation.
HIVE Metastore- It stores all the information related to the structure of the various tables and
partitions in the data warehouse. It also includes column and column type information and the
serializers and deserializers necessary to read and write data. It also contains information about
the corresponding HDFS files where your data is stored.

HIVE Server - This is an optional service, also referred to as the Apache Thrift Server. It lets
users submit Hive jobs from a remote client: it accepts requests from different clients and
forwards them to the Hive Driver.
Driver- Receives the submitted queries. This driver component creates a session handle for the
submitted query and then sends the query to the compiler to generate an execution plan.

Compiler-Parses the query, performs semantic analysis on different query blocks and query
expressions, and generates an execution plan.

Execution Engine - Executes the execution plan created by the compiler. The plan takes the form
of a Directed Acyclic Graph (DAG) of stages. The engine manages the dependencies between the
different stages of the plan and is responsible for executing these stages on the appropriate
system components.

HIVE Data Types:


HIVE supports two kinds of Data Types: Primitive type and complex type. Primitive data types
are built-in data types, which also act as basic structures for building more sophisticated data
types. Primitive data types are associated with columns of a table.
The Numeric data type in Hive is categorized into

 Integral data type


 Floating data type

Integral data type:

a. TINYINT: 1-byte integer


b. SMALLINT: 2-byte integer
c. INTEGER: 4-byte integer
d. BIGINT: 8-byte integer

In Hive, Integral literals are assumed to be INTEGER by default unless they cross the range of
INTEGER values.

Floating data type

a. FLOAT: It is a 4-byte single-precision floating-point number.


b. DOUBLE: It is an 8-byte double-precision floating-point number.
c. DOUBLE PRECISION: It is an alias for DOUBLE. It is only available starting with Hive 2.2.0
d. DECIMAL
It was introduced in Hive 0.11.0. It is based on Java’s BigDecimal. DECIMAL types support both
scientific and non-scientific notations.

In Hive 0.11.0 and 0.12, the precision of the DECIMAL type is fixed and limited to 38 digits.
As of Hive 0.13, user can specify the scale and precision during table creation using the syntax:

DECIMAL(precision, scale)
If precision is not specified, then by default, it is equal to 10.

If the scale is not specified, then by default, it is equal to 0.

DECIMAL provides more precise values and greater range than DOUBLE.

e. NUMERIC: Introduced in Hive 3.0.0. The NUMERIC data type is the same as the DECIMAL
type.
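
A minimal sketch of how these numeric types might appear in a table definition (the table and
column names are illustrative, not from the original notes):

Sql

-- Illustrative table using Hive numeric types
CREATE TABLE sales (
  item_id BIGINT,
  quantity INT,
  status TINYINT,
  unit_price DECIMAL(10, 2), -- precision 10, scale 2
  discount FLOAT,
  total DOUBLE
);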

DATE/TIME:

a. TIMESTAMP: Timestamps were introduced in Hive 0.8.0. It supports traditional UNIX


timestamp with the optional nanosecond precision. The supported Timestamps format is yyyy-
mm-dd hh:mm:ss[.f…] in the text files. If they are in any other format, declare them as the
appropriate type and use UDF(User Defined Function) to convert them to timestamps.

b. DATE: Dates were introduced in Hive 0.12.0. DATE value describes a particular
year/month/day in the form of YYYY-MM-DD.
For example: DATE '2020-02-04'

It does not have a time of day component. The range of value supported for the DATE type is
0000-01-01 to 9999-12-31.

c. INTERVAL: Hive interval data types are available starting with Hive version 1.2. Hive accepts
the interval syntax with unit specifications; we have to specify the units along with the
interval value.
For example, INTERVAL '1' DAY refers to an interval of one day.
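
A small sketch of how these date/time types might be used in a query, assuming Hive 1.2 or later
and a hypothetical orders table with a TIMESTAMP column order_ts:

Sql

-- Illustrative DATE/TIMESTAMP/INTERVAL usage
SELECT order_id,
  CAST(order_ts AS DATE) AS order_date, -- drop the time-of-day component
  order_ts + INTERVAL '7' DAY AS follow_up_ts -- interval arithmetic
FROM orders
WHERE CAST(order_ts AS DATE) >= DATE '2020-02-04';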

STRING:

a. STRING: In Hive, String literals are represented either with the single quotes(‘ ’) or with
double-quotes(“ ”).

b. VARCHAR: In Hive, VARCHAR columns hold variable-length strings, but we have to specify the
maximum number of characters allowed in the character string. If the string value assigned to a
VARCHAR is shorter than the maximum length, only the actual length is stored; if it is longer
than the maximum length, the string is silently truncated. The maximum length of a VARCHAR can
be between 1 and 65535.
c. CHAR: CHAR data types are fixed-length. The values shorter than the specified length are
padded with the spaces. Unlike VARCHAR, trailing spaces are not significant in CHAR types
during comparisons. The maximum length of CHAR is fixed at 255.

MISCELLANEOUS:

a. BOOLEAN: Boolean types in Hive store either true or false.


b. BINARY: BINARY type in Hive is an array of bytes. This is all about Hive Primitive Data Types.

Hive Complex Data Type

Complex Data Types are built on top of the Primitive Data Types.

The Hive Complex Data Types are categorized as:

Array:
The elements in an array have to be of the same data type. Elements can be accessed by using
the [n] notation where n represents the index of the array.
Map:
The elements in a map are accessed by using the ['element name'] notation.
Struct:

The elements within this type can be accessed by using the Dot (.) operator.
Union:
A UNION type in Hive is similar to the UNION in C. At any point in time, a UNION value holds
exactly one value from one of its specified data types.
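
A minimal sketch of a table that uses these complex types and of how the elements are accessed
(the table and column names are illustrative):

Sql

-- Illustrative table using complex data types
CREATE TABLE employee_details (
  name STRING,
  phone_numbers ARRAY<STRING>,
  skills MAP<STRING, INT>,
  address STRUCT<street: STRING, city: STRING, zip: STRING>
);

-- Accessing elements of the complex types
SELECT name,
  phone_numbers[0], -- array element by index
  skills['hive'], -- map value by key
  address.city -- struct field via the dot operator
FROM employee_details;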

HIVE File Format:

Hive supports multiple file formats for storing data in tables. The choice of file format impacts
storage, performance, and query efficiency.
In order of increasing storage efficiency: Text File < RCFile < Sequence File < Avro < Parquet < ORC.
Hive supports various file formats, each optimized for different use cases in big data processing.
Below is a detailed explanation of the commonly used file formats in Apache Hive.

Compression comparison (approximate size of the same dataset stored in each format):

Text File (original size): 585 GB
RCFile: 505 GB
Sequence File: 450 GB
Avro: 430 GB
Parquet: 221 GB
ORC: 131 GB

1. Text File Formats

These are simple formats that store data in human-readable text format.

a) CSV (Comma-Separated Values) – .csv

 Data is stored as plain text, with values separated by commas.


 Easy to read and write.
 Not optimized for performance as there is no indexing or compression.
 Example:

1,John,New York
2,Alice,London

b) TSV (Tab-Separated Values) – .tsv

 Similar to CSV, but values are separated by tab (\t) instead of commas.
 Example:

1 John New York
2 Alice London
c) TXT (Plain Text) – .txt

 A general text format where each line represents a record.


 Delimiters (comma, tab, pipe |, etc.) must be manually handled.
 Example (Pipe | delimited):

1|John|New York
2|Alice|London

2. Binary and Columnar Formats

These formats improve performance and storage efficiency.

a) Optimized Row Columnar (ORC) – .orc

 Highly optimized for Hive performance.


 Stores data in a columnar format, improving query performance.
 Supports compression, reducing storage space.
 Example Hive Table Definition:

Sql

CREATE TABLE employee_orc (
id INT,
name STRING,
city STRING
) STORED AS ORC;

b) Parquet Format – .parquet

 Columnar storage format like ORC, but more widely used outside Hive (e.g., Spark,
Presto).
 Good for read-heavy workloads.
 Example Hive Table Definition:

Sql

CREATE TABLE employee_parquet (
id INT,
name STRING,
city STRING
) STORED AS PARQUET;

c) Avro Format – .avro

 Binary format that supports schema evolution (changing schema without breaking
compatibility).
 Example Hive Table Definition:

Sql

CREATE TABLE employee_avro (
id INT,
name STRING,
city STRING
) STORED AS AVRO;

3. Hadoop-Compatible Formats

a) Sequence File – .seq

 Binary key-value format used in Hadoop.


 Efficient for MapReduce but not ideal for Hive queries.

b) Record Columnar File (RCFile) – .rc

 Older columnar format, replaced by ORC and Parquet.
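
For parity with the formats above, a minimal sketch of table definitions for these two formats
(the table names are illustrative):

Sql

-- Illustrative table stored as a SequenceFile
CREATE TABLE employee_seq (
  id INT,
  name STRING,
  city STRING
) STORED AS SEQUENCEFILE;

-- Illustrative table stored as an RCFile
CREATE TABLE employee_rc (
  id INT,
  name STRING,
  city STRING
) STORED AS RCFILE;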

HIVE Query Language:

Here’s a simple explanation of the Hive query execution process:

1. User writes a HiveQL Query – The user writes a SQL-like query in Hive.
2. Query is converted into a MapReduce job – Hive translates the query into a
MapReduce job.
3. Job is submitted to Hadoop – The job is sent to the Hadoop framework for processing.
4. Hadoop executes the job on HDFS data – Hadoop processes the data stored in HDFS
using the MapReduce framework.
5. Results are generated – The processed results are produced.
6. Results are returned to the user – The final output is displayed to the user.
This process allows Hive to handle large-scale data efficiently, though it may have some latency
due to MapReduce execution.

Hive Query Language (HQL) is a SQL-like language used in Apache Hive for querying and
analyzing large datasets stored in Hadoop Distributed File System (HDFS). It includes:
1. DDL (Data Definition Language) – For defining and managing databases and tables.
2. DML (Data Manipulation Language) – For inserting, updating, and querying data.
3. Partitioning & Bucketing – For performance optimization.

1. Hive DDL (Data Definition Language)

DDL commands are used for creating, altering, and dropping databases, tables, and partitions.

1.1 Database Operations
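
A minimal sketch of typical database commands (the database name is illustrative):

Sql

-- Create, list, switch to, and drop a database
CREATE DATABASE IF NOT EXISTS company_db;
SHOW DATABASES;
USE company_db;
DROP DATABASE IF EXISTS company_db CASCADE;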

1.2 Table Operations


Create a Managed Table

Managed Tables: Hive manages both metadata and data (dropping table deletes data).
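
A minimal sketch of a managed table definition (the table name, columns, and delimiter are
illustrative):

Sql

-- Managed (internal) table: Hive controls both the metadata and the data
CREATE TABLE employees (
  id INT,
  name STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;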

Create an External Table

External Tables: Hive manages only metadata; the data remains even if the table is dropped.
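
A minimal sketch of an external table, assuming the data already exists at a hypothetical HDFS
location:

Sql

-- External table: dropping it removes only the metadata, not the data
CREATE EXTERNAL TABLE employees_ext (
  id INT,
  name STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/employees'; -- hypothetical HDFS path
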
Show and Describe Tables
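
For example, using the hypothetical employees table from the sketch above:

Sql

SHOW TABLES;
DESCRIBE employees;
DESCRIBE FORMATTED employees;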

Modify Table Structure
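
A small sketch of common ALTER TABLE operations (the table and column names are illustrative):

Sql

ALTER TABLE logs ADD COLUMNS (level STRING);
ALTER TABLE logs CHANGE msg message STRING;
ALTER TABLE logs RENAME TO archived_logs;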

Drop a Table
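
For example, dropping the illustrative table from the previous sketch:

Sql

DROP TABLE IF EXISTS archived_logs; -- for a managed table this also deletes the data
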
1.3 Partitioning (Improves Query Speed)

Partitioning helps in managing large datasets efficiently by dividing them into smaller,
queryable chunks.

Querying specific partitions is faster than scanning the entire table.
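
A minimal sketch of a partitioned table and a query that scans only one partition (the table and
column names are illustrative):

Sql

-- Table partitioned by country; each country value gets its own directory in HDFS
CREATE TABLE sales_part (
  id INT,
  amount DOUBLE
)
PARTITIONED BY (country STRING);

-- Load rows into a specific partition from a hypothetical staging table
INSERT INTO TABLE sales_part PARTITION (country = 'IN')
SELECT id, amount FROM staging_sales WHERE country = 'IN';

-- Only the country='IN' partition is scanned
SELECT SUM(amount) FROM sales_part WHERE country = 'IN';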

1.4 Bucketing (Optimizes Storage and Querying)

Bucketing divides data into a fixed number of parts (buckets), based on the hash of a chosen
column, helping optimize query performance.

Used for evenly distributing data and improving joins.
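
A minimal sketch of a bucketed table (the table name, column names, and bucket count are
illustrative):

Sql

-- Rows are hashed on user_id into 8 buckets (files)
CREATE TABLE user_actions (
  user_id INT,
  action_type STRING
)
CLUSTERED BY (user_id) INTO 8 BUCKETS;

-- On Hive versions before 2.0, enforcing bucketed inserts may require:
SET hive.enforce.bucketing = true;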


2. Hive DML (Data Manipulation Language)

DML commands are used for inserting, updating, and querying data.

2.1 Loading Data into Hive
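
A minimal sketch of loading data, assuming a local CSV file and the hypothetical employees table
from the DDL sketches above:

Sql

-- Copy a local file into the table's storage
LOAD DATA LOCAL INPATH '/tmp/employees.csv' INTO TABLE employees;

-- Or insert the results of a query from a hypothetical staging table
INSERT INTO TABLE employees
SELECT id, name, salary FROM employees_staging;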

2.2 Querying Data
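
A minimal sketch of basic queries against the hypothetical employees table:

Sql

SELECT * FROM employees LIMIT 10;

SELECT name, salary
FROM employees
WHERE salary > 50000
ORDER BY salary DESC;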

2.3 Aggregations and Grouping
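
For example, a grouped aggregation, assuming the hypothetical employees table also has a
department column:

Sql

SELECT department,
  COUNT(*) AS num_employees,
  AVG(salary) AS avg_salary
FROM employees
GROUP BY department
HAVING COUNT(*) > 5;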


2.4 Joins in Hive
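
A minimal sketch of a join, assuming a hypothetical departments table and a dept_id column on
employees:

Sql

SELECT e.name, d.dept_name
FROM employees e
JOIN departments d
  ON e.dept_id = d.dept_id;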

3. Views in Hive

Views allow creating virtual tables for easy querying.
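
A minimal sketch of creating and querying a view (the names are illustrative):

Sql

CREATE VIEW high_paid_employees AS
SELECT name, salary
FROM employees
WHERE salary > 100000;

SELECT * FROM high_paid_employees;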


4. Performance Optimization in Hive

1. Partitioning – Speeds up queries by scanning only relevant partitions.


2. Bucketing – Helps with efficient joins and sampling.
3. ORC File Format – Improves query speed with columnar storage.

Summary: Hive Query Language (HQL)

HQL provides DDL commands for defining databases and tables, DML commands for loading and
querying data, and views for convenient querying, while partitioning, bucketing, and columnar
formats such as ORC are used to optimize performance.

RCFile Implementation:

RCFile (Record Columnar File) is a columnar storage format used in Apache Hive and Hadoop
to optimize query performance. It was introduced by Facebook to improve data storage and
retrieval efficiency compared to row-based storage formats like TextFile.

RCFile is implemented as a hybrid storage format, combining both row and columnar storage.
It stores data in row groups but organizes it column-wise within each row group. This structure
allows for better compression and query performance.

Key Components of RCFile Implementation

1. File Structure:
o RCFile consists of a header followed by multiple row groups.
o Each row group contains data stored in a columnar format.
2. Row Group:
o A row group contains a set number of rows (typically defined by a threshold).
o Data inside a row group is stored column-wise rather than row-wise.
o This allows better compression and selective column retrieval.
3. Compression & Storage:
o RCFile applies column-level compression, making it more storage-efficient.
o Since similar data types are stored together, compression algorithms like Gzip or
Snappy achieve higher efficiency.
4. Data Serialization & Deserialization:
o When writing data, RCFile serializes records row by row and organizes them into
column blocks.
o When reading, it decompresses only the required columns, reducing I/O
overhead and improving query speed.
5. RCFile Reader & Writer:
o RCFileWriter: Writes data into RCFile format by organizing it into row groups and
column blocks.
o RCFileReader: Reads data efficiently by loading only relevant columns, reducing
unnecessary disk I/O.

Advantages of RCFile

 Efficient Columnar Storage: Helps in better compression and query performance.


 Faster Query Execution: Since only the needed columns are read, it improves speed.
 Optimized for Hive: Works well with Hive’s execution engine.
 Better Compression: Columnar compression improves disk usage efficiency.

Limitations

 Not as advanced as ORC or Parquet: RCFile was later replaced by ORC (Optimized Row
Columnar) and Parquet, which provide better optimization.
 Lack of Schema Evolution Support: Unlike Parquet, RCFile does not handle schema
changes efficiently.

SERDE

What is Serializer/Deserializer? (SERDE)

A Serializer/Deserializer (SerDe) is a component in data processing systems that handles the
conversion of data formats between storage and application usage.

 Serialization: Converts structured data into a format suitable for storage or transmission.
 Deserialization: Converts stored or received data back into a structured format for
processing.

Why is SERDE Needed?

Hive supports different data storage formats like Text, ORC, Parquet, JSON, Avro, etc. Since
data can be structured in different ways, SERDE helps Hive understand how to interpret and
process the data correctly.
🔹 How Does SERDE Work?

When a table is created in Hive, a SERDE is specified to define how Hive should:

 Convert raw data from storage format to Hive table format (Deserialization)
 Convert Hive table format back to storage format (Serialization)
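
A minimal sketch of specifying a SerDe when creating a table, assuming the OpenCSVSerde class
that ships with Hive (the table name and property values are illustrative):

Sql

-- Table whose rows are parsed by a CSV SerDe
-- Note: OpenCSVSerde exposes all columns as STRING
CREATE TABLE employee_csv (
  id STRING,
  name STRING,
  city STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"'
)
STORED AS TEXTFILE;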

1. Deserialization Process (Reading Data into Hive)

When Hive reads data from storage (HDFS, S3, etc.), the Deserializer converts raw data into a
structured format.

Steps:

1. Hive Reads Raw Data from Storage


o The data can be stored in various formats such as Text, CSV, JSON, ORC,
Parquet, Avro.
o Hive fetches the raw bytes or text files.
2. Hive Identifies the Associated SerDe
o Based on the table definition (ROW FORMAT SERDE), Hive determines which
SerDe to use.
3. Deserializer Parses the Data
o The Deserializer processes the raw data according to its structure:
 For CSV → Splits data by commas.
 For JSON → Parses JSON keys and values.
 For ORC/Parquet → Reads binary-encoded columnar data.
4. Mapping Data to Table Schema
o After parsing, the Deserializer assigns values to respective columns in the Hive
table.
5. Hive Stores Data in Internal Format
o The structured data is now available in tabular format.
o Queries can now be executed on this structured data.

2. Serialization Process (Writing Data from Hive to Storage)

When Hive processes data and stores it back, the Serializer converts structured data into the
required storage format.

Steps:

1. Data Retrieved from Hive Table


o When a user inserts or updates data, Hive retrieves the structured row format.
2. Serializer Converts Data to Required Format
o The Serializer takes the structured data and formats it:
 For CSV → Joins values with commas.
 For JSON → Converts data to JSON string.
 For ORC/Parquet → Encodes data in binary format.
3. Data is Stored in Target Format
o The serialized data is written back to HDFS, S3, or another storage system.

User Defined Functions:

In Apache Hive, User-Defined Functions (UDFs) are custom functions created by users to
extend Hive's built-in capabilities. Hive provides a rich set of built-in functions, but when those
are not sufficient for specific use cases, users can create their own UDFs.

Types of User-Defined Functions in Hive

1. UDF (User-Defined Function)


o Used for row-wise transformations (returns a single value for each input row).
o Example: Custom string manipulation.
2. UDAF (User-Defined Aggregate Function)
o Used for aggregation (like SUM, AVG, COUNT but custom).
o Example: Finding a custom average or weighted sum.
3. UDTF (User-Defined Table-Generating Function)
o Used to return multiple rows for a single input row.
o Example: Splitting a column value into multiple rows.

How to Create a UDF in Hive?

1. Write a Java class that extends org.apache.hadoop.hive.ql.exec.UDF.


2. Implement the evaluate() method with custom logic.
3. Compile the Java class into a JAR file.
4. Add the JAR to Hive using ADD JAR command.
5. Create a temporary function in Hive using CREATE FUNCTION.
6. Use the function in Hive queries.

Example of a Simple Hive UDF

Suppose we have written a UDF named to_upper (implemented in a hypothetical Java class
com.example.ToUpperCaseUDF that extends UDF and returns its input string in upper case).

Steps to Use This UDF in Hive

1. Compile the Java file and create a JAR.


2. Add the JAR to Hive:

Sql

ADD JAR /path/to/your_udf.jar;

3. Create the function:


Sql

CREATE TEMPORARY FUNCTION to_upper AS 'com.example.ToUpperCaseUDF';


4. Use it in a Hive query:
Sql

SELECT to_upper(name) FROM employees;
