Unit IV
What is HIVE?
Apache Hive is a data warehouse system built on top of Hadoop that allows you to query and
analyze large datasets using a SQL-like language (HiveQL).
Hadoop offers an open-source data warehouse system called Hive (Apache Hive). Hive is a
mechanism through which we can access the data stored in the Hadoop Distributed File System
(HDFS). It provides an interface similar to SQL, which enables you to create databases and
tables for storing data. In this way, you can use the MapReduce model without explicitly
writing MapReduce source code.
Hive also supports a language called HiveQL, which is used as the primary data processing
method for Treasure Data. Treasure Data is a cloud data platform that allows you to collect,
store, and analyze data on the cloud. It manages its own Hadoop cluster, which accepts your
queries and executes them using the Hadoop MapReduce framework. Hive automatically
translates HiveQL's SQL-like queries into MapReduce jobs executed on Hadoop.
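For example, a simple aggregation like the one below (the customers table is illustrative) is compiled into one or more MapReduce jobs behind the scenes:
sql
SELECT city, COUNT(*) AS num_customers   -- illustrative table and columns
FROM customers
GROUP BY city;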
How Does Hive Work?
Hive translates SQL-like HiveQL queries into MapReduce jobs that run on Hadoop; the step-by-step flow is described under "How Hive Executes a Query" later in this unit.
Features:
The following is a list of Apache Hive's main features:
o It provides HiveQL, a familiar SQL-like query language, so users do not need to write MapReduce code directly.
o Databases and tables are created first, and data is then loaded into them.
o It supports multiple file formats, such as TextFile, SequenceFile, RCFile, ORC, Parquet, and Avro.
o It supports partitioning, bucketing, and user-defined functions (UDFs) to optimize and extend queries.
HIVE Architecture:
The architecture of Hive consists of various components. These components are described as
follows:
Hive Client
Hive allows writing applications in various languages, including Java, Python, and C++. It supports
different types of clients, such as:
o Thrift Server - It is a cross-language service provider platform that serves requests from
all those programming languages that support Thrift.
o JDBC Driver - It is used to establish a connection between hive and Java applications. The
JDBC Driver is present in the class org.apache.hadoop.hive.jdbc.HiveDriver.
o ODBC Driver - It allows the applications that support the ODBC protocol to connect to
Hive.
Hive Services
HIVE CLI (Command Line Interface) - This is the most commonly used interface of Hive, usually
referred to as the Hive CLI. It is a shell where we can
execute Hive queries and commands.
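A few commands as they might be run from the Hive CLI (the employees table is illustrative):
sql
SHOW DATABASES;                   -- list available databases
USE default;                      -- switch to the default database
SELECT COUNT(*) FROM employees;   -- run a query against an illustrative table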
HIVE Web Interface-This is a simple Graphical User Interface (GUI) used to connect to Hive. To
use this interface, you need to configure it during the Hive installation.
HIVE Metastore- It stores all the information related to the structure of the various tables and
partitions in the data warehouse. It also includes column and column type information and the
serializers and deserializers necessary to read and write data. It also contains information about
the corresponding HDFS files where your data is stored.
HIVE Server - This is an optional server. By using this server, users can submit their Hive jobs
from a remote client. It is also referred to as the Apache Thrift Server. It accepts requests from
different clients and forwards them to the Hive Driver.
Driver- Receives the submitted queries. This driver component creates a session handle for the
submitted query and then sends the query to the compiler to generate an execution plan.
Compiler-Parses the query, performs semantic analysis on different query blocks and query
expressions, and generates an execution plan.
Execution Engine - Executes the execution plan created by the compiler. The plan takes the form
of a Directed Acyclic Graph (DAG) to be executed in various stages. This engine manages the
dependencies between the different stages of the plan and is also responsible for executing
these stages on the appropriate system components.
HIVE Data Types:
Hive supports primitive data types (numeric, date/time, string, and miscellaneous) as well as complex data types.
NUMERIC:
The integral types are TINYINT, SMALLINT, INT, and BIGINT. In Hive, integral literals are assumed
to be INT by default; a literal that exceeds the range of INT is interpreted as a BIGINT.
d. DECIMAL: In Hive 0.11.0 and 0.12.0, the precision of the DECIMAL type was fixed and limited to 38 digits.
As of Hive 0.13.0, users can specify the scale and precision during table creation using the syntax:
DECIMAL(precision, scale)
If the precision is not specified, it defaults to 10, and the scale defaults to 0.
DECIMAL provides more precise values and greater range than DOUBLE.
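A minimal sketch of declaring a DECIMAL column with explicit precision and scale (the table and column names are illustrative):
sql
CREATE TABLE prices (
  item STRING,
  price DECIMAL(10, 2)   -- up to 10 digits in total, 2 after the decimal point
);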
e. NUMERIC: The NUMERIC data type was introduced in Hive 3.0.0 and is the same as the DECIMAL
type.
DATE/TIME:
a. TIMESTAMP: Timestamps were introduced in Hive 0.8.0. A TIMESTAMP value describes both a
date and a time of day, in the form YYYY-MM-DD HH:MM:SS, with optional fractional seconds up
to nanosecond precision.
b. DATE: Dates were introduced in Hive 0.12.0. A DATE value describes a particular
year/month/day in the form YYYY-MM-DD.
For example- DATE ‘2020-02-04’
It does not have a time of day component. The range of value supported for the DATE type is
0000-01-01 to 9999-12-31.
c. INTERVAL: Hive interval data types are available from Hive version 1.2 onwards. Hive accepts
interval syntax with unit specifications; we have to specify the units along with the interval
value.
For example, INTERVAL ‘1’ DAY refers to an interval of one day.
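A quick sketch of interval arithmetic, assuming Hive 1.2 or later:
sql
SELECT current_date + INTERVAL '7' DAY;   -- the date one week from today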
STRING:
a. STRING: In Hive, string literals are represented either with single quotes (‘ ’) or with
double quotes (“ ”).
b. VARCHAR: In Hive, VARCHAR values are of variable length, but we have to specify the
maximum number of characters allowed in the character string. If the string value assigned to
the VARCHAR is shorter than the maximum length, the remaining space is freed; if the string
value assigned is longer than the maximum length, the string is silently truncated.
The allowed length of a VARCHAR is between 1 and 65535.
c. CHAR: CHAR data types are fixed-length. Values shorter than the specified length are
padded with spaces. Unlike VARCHAR, trailing spaces are not significant in CHAR values
during comparisons. The maximum length of CHAR is fixed at 255.
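A minimal sketch showing the three string types side by side (the table and column names are illustrative):
sql
CREATE TABLE users (
  name STRING,            -- unbounded string
  country_code CHAR(2),   -- fixed length, space-padded
  email VARCHAR(100)      -- variable length, at most 100 characters
);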
MISCELLANEOUS:
The miscellaneous primitive types are BOOLEAN, which stores TRUE/FALSE values, and BINARY,
which stores an arbitrary sequence of bytes.
COMPLEX DATA TYPES:
Complex data types are built on top of the primitive data types.
Array:
The elements in an array have to be of the same data type. Elements can be accessed by using
the [n] notation where n represents the index of the array.
Map:
The elements in a map are accessed by using the [‘element name’] notation.
Struct:
The elements within this type can be accessed by using the Dot (.) operator.
Union:
The UNIONTYPE in Hive is similar to the union in C. At any point in time, a UNIONTYPE value
holds exactly one value belonging to one of its specified data types.
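A minimal sketch combining the four complex types (the table and field names are illustrative):
sql
CREATE TABLE employee_details (
  skills ARRAY<STRING>,                    -- accessed as skills[0]
  phones MAP<STRING, BIGINT>,              -- accessed as phones['home']
  address STRUCT<city:STRING, zip:INT>,    -- accessed as address.city
  misc UNIONTYPE<INT, STRING>              -- holds one of the two types at a time
);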
Hive supports multiple file formats for storing data in tables. The choice of file format impacts
storage, performance, and query efficiency.
In terms of storage and query efficiency, the common formats can be roughly ordered from least to most efficient as:
Text File < Sequence File < RCFile < Avro < Parquet < ORC
Hive supports various file formats, each optimized for different use cases in big data processing.
Below is a detailed explanation of the commonly used file formats in Apache Hive.
1. Text-Based Formats
These are simple formats that store data in human-readable text format.
a) CSV (Comma-Separated Values) – .csv
Values are separated by commas.
Example:
1,John,New York
2,Alice,London
b) TSV (Tab-Separated Values) – .tsv
Similar to CSV, but values are separated by tabs (\t) instead of commas.
Example:
1 John New York
2 Alice London
c) TXT (Plain Text) – .txt
Stores records as plain text, often with a custom delimiter (here, the pipe character |).
Example:
1|John|New York
2|Alice|London
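A minimal sketch of a Hive table stored in a text format, with the field delimiter declared explicitly (the table name and columns are illustrative):
sql
CREATE TABLE customers (
  id INT,
  name STRING,
  city STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;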
2. Columnar and Binary Formats
a) Parquet – .parquet
Columnar storage format like ORC, but more widely used outside Hive (e.g., Spark,
Presto).
Good for read-heavy workloads.
Example Hive Table Definition (a minimal sketch; the table name and columns are illustrative):
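sql
CREATE TABLE sales_parquet (
  id INT,
  amount DOUBLE
)
STORED AS PARQUET;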
b) Avro – .avro
Binary, row-based format that supports schema evolution (changing the schema without breaking
compatibility).
Example Hive Table Definition (a minimal sketch; the table name and columns are illustrative):
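sql
CREATE TABLE events_avro (
  id INT,
  payload STRING
)
STORED AS AVRO;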
3. Hadoop-Compatible Formats
a) SequenceFile – .seq
A Hadoop-native binary format that stores data as key-value pairs. It is splittable and supports
record- and block-level compression, which makes it well suited as an intermediate format in
MapReduce pipelines.
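A minimal sketch (the table name is illustrative):
sql
CREATE TABLE logs_seq (
  line STRING
)
STORED AS SEQUENCEFILE;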
How Hive Executes a Query
1. User writes a HiveQL query – The user writes a SQL-like query in Hive.
2. Query is converted into a MapReduce job – Hive translates the query into a
MapReduce job.
3. Job is submitted to Hadoop – The job is sent to the Hadoop framework for processing.
4. Hadoop executes the job on HDFS data – Hadoop processes the data stored in HDFS
using the MapReduce framework.
5. Results are generated – The processed results are produced.
6. Results are returned to the user – The final output is displayed to the user.
This process allows Hive to handle large-scale data efficiently, though it may have some latency
due to MapReduce execution.
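One way to observe this translation is Hive's EXPLAIN command, which prints the execution plan for a query without running it (the table name is illustrative):
sql
EXPLAIN
SELECT city, COUNT(*)
FROM customers
GROUP BY city;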
Hive Query Language (HQL) is a SQL-like language used in Apache Hive for querying and
analyzing large datasets stored in Hadoop Distributed File System (HDFS). It includes:
1. DDL (Data Definition Language) – For defining and managing databases and tables.
2. DML (Data Manipulation Language) – For inserting, updating, and querying data.
3. Partitioning & Bucketing – For performance optimization.
1. DDL Commands
DDL commands are used for creating, altering, and dropping databases, tables, and partitions.
Managed Tables: Hive manages both metadata and data (dropping table deletes data).
External Tables: Hive manages only metadata; the data remains even if the table is dropped.
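A minimal sketch contrasting the two kinds of tables (the table names and HDFS location are illustrative):
sql
-- managed table: DROP TABLE deletes both metadata and data
CREATE TABLE managed_emp (id INT, name STRING);

-- external table: DROP TABLE removes only the metadata
CREATE EXTERNAL TABLE external_emp (id INT, name STRING)
LOCATION '/data/external_emp';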
Show and Describe Tables
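For example (continuing with the illustrative tables above):
sql
SHOW TABLES;             -- list tables in the current database
DESCRIBE external_emp;   -- show the columns and their types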
Drop a Table
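For example:
sql
DROP TABLE managed_emp;   -- for a managed table, this also deletes the data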
1.3 Partitioning (Improves Query Speed)
Partitioning helps in managing large datasets efficiently by dividing them into smaller,
queryable chunks.
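A minimal sketch of a partitioned table (the names are illustrative); Hive stores each distinct sale_year value in its own HDFS subdirectory, so queries that filter on it scan less data:
sql
CREATE TABLE sales (
  id INT,
  amount DOUBLE
)
PARTITIONED BY (sale_year INT);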
1.4 Bucketing
Bucketing divides data into a fixed number of parts (buckets) based on the hash of a chosen
column, which helps optimize query performance, especially for joins and sampling.
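A minimal sketch (the names are illustrative); rows are assigned to one of four buckets by hashing the id column:
sql
CREATE TABLE users_bucketed (
  id INT,
  name STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS;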
2. DML Commands
DML commands are used for inserting, updating, and querying data.
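A sketch of common DML operations against the illustrative sales table above, assuming Hive 0.14 or later for INSERT ... VALUES:
sql
INSERT INTO TABLE sales PARTITION (sale_year = 2020)
VALUES (1, 250.0);

SELECT * FROM sales WHERE sale_year = 2020;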
3. Views in Hive
A view is a logical table defined by a stored query; it holds no data of its own and is evaluated
when it is queried.
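A minimal sketch (the names are illustrative):
sql
CREATE VIEW high_value_sales AS
SELECT * FROM sales
WHERE amount > 1000;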
RCFile (Record Columnar File) is a columnar storage format used in Apache Hive and Hadoop
to optimize query performance. It was introduced by Facebook to improve data storage and
retrieval efficiency compared to row-based storage formats like TextFile.
RCFile is implemented as a hybrid storage format, combining both row and columnar storage.
It stores data in row groups but organizes it column-wise within each row group. This structure
allows for better compression and query performance.
1. File Structure:
o RCFile consists of a header followed by multiple row groups.
o Each row group contains data stored in a columnar format.
2. Row Group:
o A row group contains a set number of rows (typically defined by a threshold).
o Data inside a row group is stored column-wise rather than row-wise.
o This allows better compression and selective column retrieval.
3. Compression & Storage:
o RCFile applies column-level compression, making it more storage-efficient.
o Since similar data types are stored together, compression algorithms like Gzip or
Snappy achieve higher efficiency.
4. Data Serialization & Deserialization:
o When writing data, RCFile serializes records row by row and organizes them into
column blocks.
o When reading, it decompresses only the required columns, reducing I/O
overhead and improving query speed.
5. RCFile Reader & Writer:
o RCFileWriter: Writes data into RCFile format by organizing it into row groups and
column blocks.
o RCFileReader: Reads data efficiently by loading only relevant columns, reducing
unnecessary disk I/O.
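Storing a Hive table as RCFile requires only the STORED AS clause (the table name and columns are illustrative):
sql
CREATE TABLE logs_rc (
  id INT,
  message STRING
)
STORED AS RCFILE;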
Advantages of RCFile
o Column-wise storage within row groups enables higher compression ratios.
o Queries read only the required columns, reducing disk I/O.
o It performs well for both data loading (row-wise writes) and analytical queries (column-wise reads).
Limitations
o Not as advanced as ORC or Parquet: RCFile was later replaced by ORC (Optimized Row
Columnar) and Parquet, which provide better optimization.
o Lack of schema evolution support: Unlike Parquet, RCFile does not handle schema
changes efficiently.
SERDE (Serializer/Deserializer)
Hive supports different data storage formats like Text, ORC, Parquet, JSON, Avro, etc. Since
data can be structured in different ways, SERDE helps Hive understand how to interpret and
process the data correctly.
How Does SERDE Work?
When a table is created in Hive, a SERDE is specified to define how Hive should:
Convert raw data from storage format to Hive table format (Deserialization)
Convert Hive table format back to storage format (Serialization)
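A minimal sketch of specifying a SerDe at table creation, using the JSON SerDe that ships with Hive's hcatalog module (the table and column names are illustrative):
sql
CREATE TABLE events_json (
  id INT,
  payload STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;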
When Hive reads data from storage (HDFS, S3, etc.), the Deserializer converts raw data into a
structured format.
Steps:
1. The InputFormat reads a record (for example, one line of a text file) from the underlying storage.
2. The Deserializer parses the record into Hive's internal row representation (one value per column).
3. The query engine then operates on these structured rows.
When Hive processes data and stores it back, the Serializer converts structured data into the
required storage format.
Steps:
1. The query engine produces result rows in Hive's internal representation.
2. The Serializer converts each row into the target storage format (text, ORC, Parquet, etc.).
3. The OutputFormat writes the serialized records back to storage.
In Apache Hive, User-Defined Functions (UDFs) are custom functions created by users to
extend Hive's built-in capabilities. Hive provides a rich set of built-in functions, but when those
are not sufficient for specific use cases, users can create their own UDFs.
Once a UDF is packaged into a JAR, it can be registered and then called like any built-in function. A minimal sketch of the registration and usage (the JAR path, Java class name, function name, and table are all illustrative):
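sql
ADD JAR /path/to/my_udfs.jar;                                          -- hypothetical JAR path
CREATE TEMPORARY FUNCTION to_upper AS 'com.example.hive.udf.ToUpper';  -- hypothetical UDF class
SELECT to_upper(name) FROM customers;                                  -- used like a built-in function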