
Unit- IV Introduction to HIVE

What is HIVE?
Apache Hive is a data warehouse system built on top of Hadoop that allows you to query and
analyze large datasets using a SQL-like language (HiveQL).
Hive (Apache Hive) is an open-source data warehouse system offered on top of Hadoop. It is a
mechanism through which we can access the data stored in the Hadoop Distributed File System
(HDFS). It provides an interface similar to SQL, which enables you to create databases and
tables for storing data. In this way, you can make use of MapReduce without explicitly writing
MapReduce source code.
Hive's language, HiveQL, is also the primary data processing method for Treasure Data, a cloud
data platform that allows you to collect, store, and analyze data on the cloud. Treasure Data
manages its own Hadoop cluster, which accepts your queries and executes them using the Hadoop
MapReduce framework. Hive automatically translates the SQL-like HiveQL queries into MapReduce
jobs that are executed on Hadoop.
How Does Hive Work?

1. A user writes a query in HiveQL (which is similar to SQL).
2. Hive translates the query into a format that Hadoop understands (MapReduce jobs).
3. Hadoop processes the data stored in HDFS (Hadoop Distributed File System).
4. The results are generated and returned to the user.

What is Hive Used For?

 Hive is mainly used for analyzing and processing large datasets.
 It is not a traditional database and is not suitable for tasks that need real-time processing
(like online banking).
 It works best for batch processing, such as analyzing logs, records, and big data.

Features:
The following is a list of Apache Hive's main features:

1. Apache Hive is free to use. It is open source and freely accessible.
2. Large datasets kept in the Hadoop Distributed File System can be handled using Hive.
3. The data may be queried concurrently by several users.
4. Apache Hive meets Apache Hadoop's low-level interface requirements well.
5. Apache Hive partitions and buckets data at the table level to enhance speed.
6. Hive supports numerous file formats, such as TextFile, SequenceFile, RCFile, Avro, ORC, and
Parquet, as well as data compression.
7. HiveQL is essentially the same as SQL, so to use Hive we don't need to be proficient in any
programming language; simple SQL is all we need.
8. Hive has several built-in functions and supports external tables, which allow us to process
data in place without moving it into Hive's managed warehouse storage.

HIVE Architecture:

The architecture of Hive consists of various components. These components are described as
follows:

Hive Client

Hive allows writing applications in various languages, including Java, Python, and C++. It supports
different types of clients, such as:

o Thrift Server - A cross-language service provider platform that serves requests from all
programming languages that support Thrift.
o JDBC Driver - Used to establish a connection between Hive and Java applications. The
JDBC driver is provided by the class org.apache.hadoop.hive.jdbc.HiveDriver.
o ODBC Driver - Allows applications that support the ODBC protocol to connect to Hive.
Hive Services

The following are the services provided by Hive:-

HIVE CLI (Command Line Interface) - This is the most commonly used interface of Hive, usually
referred to simply as the Hive CLI. It is a shell where we can execute Hive queries and
commands.

HIVE Web Interface-This is a simple Graphical User Interface (GUI) used to connect to Hive. To
use this interface, you need to configure it during the Hive installation.
HIVE Metastore- It stores all the information related to the structure of the various tables and
partitions in the data warehouse. It also includes column and column type information and the
serializers and deserializers necessary to read and write data. It also contains information about
the corresponding HDFS files where your data is stored.

HIVE Server - This is an optional service, also referred to as the Apache Thrift Server. It lets
users submit Hive jobs from a remote client: it accepts requests from different clients and
forwards them to the Hive Driver.
Driver- Receives the submitted queries. This driver component creates a session handle for the
submitted query and then sends the query to the compiler to generate an execution plan.

Compiler-Parses the query, performs semantic analysis on different query blocks and query
expressions, and generates an execution plan.

Execution Engine - Executes the execution plan created by the compiler. The plan takes the form
of a Directed Acyclic Graph (DAG) of stages. The engine manages the dependencies between the
different stages of the plan and is responsible for executing these stages on the appropriate
system components.

HIVE Data Types:


HIVE supports two kinds of Data Types: Primitive type and complex type. Primitive data types
are built-in data types, which also act as basic structures for building more sophisticated data
types. Primitive data types are associated with columns of a table.
The Numeric data type in Hive is categorized into

 Integral data type


 Floating data type

Integral data type:

a. TINYINT: 1-byte integer


b. SMALLINT: 2-byte integer
c. INTEGER: 4-byte integer
d. BIGINT: 8-byte integer

In Hive, Integral literals are assumed to be INTEGER by default unless they cross the range of
INTEGER values.

Floating data type

a. FLOAT: It is a 4-byte single-precision floating-point number.


b. DOUBLE: It is an 8-byte double-precision floating-point number.
c. DOUBLE PRECISION: It is an alias for DOUBLE. It is only available starting with Hive 2.2.0
d. DECIMAL
It was introduced in Hive 0.11.0. It is based on Java’s BigDecimal. DECIMAL types support both
scientific and non-scientific notations.

In Hive 0.11.0 and 0.12, the precision of the DECIMAL type is fixed and limited to 38 digits.
As of Hive 0.13, user can specify the scale and precision during table creation using the syntax:

DECIMAL(precision, scale)
If precision is not specified, then by default, it is equal to 10.

If the scale is not specified, then by default, it is equal to 0.

DECIMAL provides more precise values and greater range than DOUBLE.

e. NUMERIC: Introduced in Hive 3.0.0. The NUMERIC data type is the same as the DECIMAL
type.
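
A minimal sketch of how these numeric types might appear in a table definition (the table and
column names are illustrative, not from the original notes):

Sql

-- Illustrative table using Hive numeric types
CREATE TABLE sales (
  item_id BIGINT,
  quantity INT,
  status TINYINT,
  unit_price DECIMAL(10, 2), -- precision 10, scale 2
  discount FLOAT,
  total DOUBLE
);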

DATE/TIME:

a. TIMESTAMP: Timestamps were introduced in Hive 0.8.0. It supports traditional UNIX


timestamp with the optional nanosecond precision. The supported Timestamps format is yyyy-
mm-dd hh:mm:ss[.f…] in the text files. If they are in any other format, declare them as the
appropriate type and use UDF(User Defined Function) to convert them to timestamps.

b. DATE: Dates were introduced in Hive 0.12.0. DATE value describes a particular
year/month/day in the form of YYYY-MM-DD.
For example: DATE '2020-02-04'

It does not have a time of day component. The range of value supported for the DATE type is
0000-01-01 to 9999-12-31.

c. INTERVAL: Hive interval data types are available starting with Hive version 1.2. Hive accepts
the interval syntax with unit specifications; we have to specify the units along with the
interval value.
For example, INTERVAL '1' DAY refers to an interval of one day.
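
A small sketch of how these date/time types might be used in a query, assuming Hive 1.2 or later
and a hypothetical orders table with a TIMESTAMP column order_ts:

Sql

-- Illustrative DATE/TIMESTAMP/INTERVAL usage
SELECT order_id,
  CAST(order_ts AS DATE) AS order_date, -- drop the time-of-day component
  order_ts + INTERVAL '7' DAY AS follow_up_ts -- interval arithmetic
FROM orders
WHERE CAST(order_ts AS DATE) >= DATE '2020-02-04';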

STRING:

a. STRING: In Hive, String literals are represented either with the single quotes(‘ ’) or with
double-quotes(“ ”).

b. VARCHAR: In Hive, VARCHAR columns hold variable-length strings, but we have to specify the
maximum number of characters allowed in the character string. If the string value assigned to a
VARCHAR is shorter than the maximum length, only the actual length is stored; if it is longer
than the maximum length, the string is silently truncated. The maximum length of a VARCHAR can
be between 1 and 65535.
c. CHAR: CHAR data types are fixed-length. The values shorter than the specified length are
padded with the spaces. Unlike VARCHAR, trailing spaces are not significant in CHAR types
during comparisons. The maximum length of CHAR is fixed at 255.

MISCELLANEOUS:

a. BOOLEAN: Boolean types in Hive store either true or false.


b. BINARY: BINARY type in Hive is an array of bytes. This is all about Hive Primitive Data Types.

Hive Complex Data Type

Complex Data Types are built on top of the Primitive Data Types.

The Hive Complex Data Types are categorized as:

Array:
The elements in an array have to be of the same data type. Elements can be accessed by using
the [n] notation where n represents the index of the array.
Map:
The elements in a map are accessed by using the ['element name'] notation.
Struct:

The elements within this type can be accessed by using the Dot (.) operator.
Union:
A UNION type in Hive is similar to the UNION in C. At any point in time, a UNION value holds
exactly one value from one of its specified data types.
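
A minimal sketch of a table that uses these complex types and of how the elements are accessed
(the table and column names are illustrative):

Sql

-- Illustrative table using complex data types
CREATE TABLE employee_details (
  name STRING,
  phone_numbers ARRAY<STRING>,
  skills MAP<STRING, INT>,
  address STRUCT<street: STRING, city: STRING, zip: STRING>
);

-- Accessing elements of the complex types
SELECT name,
  phone_numbers[0], -- array element by index
  skills['hive'], -- map value by key
  address.city -- struct field via the dot operator
FROM employee_details;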

HIVE File Format:

Hive supports multiple file formats for storing data in tables. The choice of file format impacts
storage, performance, and query efficiency.
In order of increasing storage efficiency: Text File < RCFile < Sequence File < Avro < Parquet < ORC.
Hive supports various file formats, each optimized for different use cases in big data processing.
Below is a detailed explanation of the commonly used file formats in Apache Hive.

Compression comparison (approximate size of the same dataset stored in each format):

Text File (original size): 585 GB
RCFile: 505 GB
Sequence File: 450 GB
Avro: 430 GB
Parquet: 221 GB
ORC: 131 GB

1. Text File Formats

These are simple formats that store data in human-readable text format.

a) CSV (Comma-Separated Values) – .csv

 Data is stored as plain text, with values separated by commas.


 Easy to read and write.
 Not optimized for performance as there is no indexing or compression.
 Example:

1,John,New York
2,Alice,London

b) TSV (Tab-Separated Values) – .tsv

 Similar to CSV, but values are separated by tab (\t) instead of commas.
 Example:

1 John New York
2 Alice London
c) TXT (Plain Text) – .txt

 A general text format where each line represents a record.


 Delimiters (comma, tab, pipe |, etc.) must be manually handled.
 Example (Pipe | delimited):

1|John|New York
2|Alice|London

2. Binary and Columnar Formats

These formats improve performance and storage efficiency.

a) Optimized Row Columnar (ORC) – .orc

 Highly optimized for Hive performance.


 Stores data in a columnar format, improving query performance.
 Supports compression, reducing storage space.
 Example Hive Table Definition:

Sql

CREATE TABLE employee_orc (
id INT,
name STRING,
city STRING
) STORED AS ORC;

b) Parquet Format – .parquet

 Columnar storage format like ORC, but more widely used outside Hive (e.g., Spark,
Presto).
 Good for read-heavy workloads.
 Example Hive Table Definition:

Sql

CREATE TABLE employee_parquet (
id INT,
name STRING,
city STRING
) STORED AS PARQUET;

c) Avro Format – .avro

 Binary format that supports schema evolution (changing schema without breaking
compatibility).
 Example Hive Table Definition:

Sql

CREATE TABLE employee_avro (
id INT,
name STRING,
city STRING
) STORED AS AVRO;

3. Hadoop-Compatible Formats

a) Sequence File – .seq

 Binary key-value format used in Hadoop.


 Efficient for MapReduce but not ideal for Hive queries.

b) Record Columnar File (RCFile) – .rc

 Older columnar format, replaced by ORC and Parquet.
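
For parity with the formats above, a minimal sketch of table definitions for these two formats
(the table names are illustrative):

Sql

-- Illustrative table stored as a SequenceFile
CREATE TABLE employee_seq (
  id INT,
  name STRING,
  city STRING
) STORED AS SEQUENCEFILE;

-- Illustrative table stored as an RCFile
CREATE TABLE employee_rc (
  id INT,
  name STRING,
  city STRING
) STORED AS RCFILE;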

HIVE Query Language:

Here’s a simple explanation of the Hive query execution process:

1. User writes a HiveQL Query – The user writes a SQL-like query in Hive.
2. Query is converted into a MapReduce job – Hive translates the query into a
MapReduce job.
3. Job is submitted to Hadoop – The job is sent to the Hadoop framework for processing.
4. Hadoop executes the job on HDFS data – Hadoop processes the data stored in HDFS
using the MapReduce framework.
5. Results are generated – The processed results are produced.
6. Results are returned to the user – The final output is displayed to the user.
This process allows Hive to handle large-scale data efficiently, though it may have some latency
due to MapReduce execution.

Hive Query Language (HQL) is a SQL-like language used in Apache Hive for querying and
analyzing large datasets stored in Hadoop Distributed File System (HDFS). It includes:
1. DDL (Data Definition Language) – For defining and managing databases and tables.
2. DML (Data Manipulation Language) – For inserting, updating, and querying data.
3. Partitioning & Bucketing – For performance optimization.

1. Hive DDL (Data Definition Language)

DDL commands are used for creating, altering, and dropping databases, tables, and partitions.

1.1 Database Operations
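
A minimal sketch of typical database commands (the database name is illustrative):

Sql

-- Create, list, switch to, and drop a database
CREATE DATABASE IF NOT EXISTS company_db;
SHOW DATABASES;
USE company_db;
DROP DATABASE IF EXISTS company_db CASCADE;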

1.2 Table Operations


Create a Managed Table

Managed Tables: Hive manages both metadata and data (dropping table deletes data).
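
A minimal sketch of a managed table definition (the table name, columns, and delimiter are
illustrative):

Sql

-- Managed (internal) table: Hive controls both the metadata and the data
CREATE TABLE employees (
  id INT,
  name STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;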

Create an External Table

External Tables: Hive manages only metadata; the data remains even if the table is dropped.
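
A minimal sketch of an external table, assuming the data already exists at a hypothetical HDFS
location:

Sql

-- External table: dropping it removes only the metadata, not the data
CREATE EXTERNAL TABLE employees_ext (
  id INT,
  name STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/employees'; -- hypothetical HDFS path
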
Show and Describe Tables
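
For example, using the hypothetical employees table from the sketch above:

Sql

SHOW TABLES;
DESCRIBE employees;
DESCRIBE FORMATTED employees;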

Modify Table Structure
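
A small sketch of common ALTER TABLE operations (the table and column names are illustrative):

Sql

ALTER TABLE logs ADD COLUMNS (level STRING);
ALTER TABLE logs CHANGE msg message STRING;
ALTER TABLE logs RENAME TO archived_logs;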

Drop a Table
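
For example, dropping the illustrative table from the previous sketch:

Sql

DROP TABLE IF EXISTS archived_logs; -- for a managed table this also deletes the data
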
1.3 Partitioning (Improves Query Speed)

Partitioning helps in managing large datasets efficiently by dividing them into smaller,
queryable chunks.

Querying specific partitions is faster than scanning the entire table.
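
A minimal sketch of a partitioned table and a query that scans only one partition (the table and
column names are illustrative):

Sql

-- Table partitioned by country; each country value gets its own directory in HDFS
CREATE TABLE sales_part (
  id INT,
  amount DOUBLE
)
PARTITIONED BY (country STRING);

-- Load rows into a specific partition from a hypothetical staging table
INSERT INTO TABLE sales_part PARTITION (country = 'IN')
SELECT id, amount FROM staging_sales WHERE country = 'IN';

-- Only the country='IN' partition is scanned
SELECT SUM(amount) FROM sales_part WHERE country = 'IN';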

1.4 Bucketing (Optimizes Storage and Querying)

Bucketing divides data into a fixed number of parts (buckets), based on the hash of a chosen
column, helping optimize query performance.

Used for evenly distributing data and improving joins.
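
A minimal sketch of a bucketed table (the table name, column names, and bucket count are
illustrative):

Sql

-- Rows are hashed on user_id into 8 buckets (files)
CREATE TABLE user_actions (
  user_id INT,
  action_type STRING
)
CLUSTERED BY (user_id) INTO 8 BUCKETS;

-- On Hive versions before 2.0, enforcing bucketed inserts may require:
SET hive.enforce.bucketing = true;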


2. Hive DML (Data Manipulation Language)

DML commands are used for inserting, updating, and querying data.

2.1 Loading Data into Hive
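
A minimal sketch of loading data, assuming a local CSV file and the hypothetical employees table
from the DDL sketches above:

Sql

-- Copy a local file into the table's storage
LOAD DATA LOCAL INPATH '/tmp/employees.csv' INTO TABLE employees;

-- Or insert the results of a query from a hypothetical staging table
INSERT INTO TABLE employees
SELECT id, name, salary FROM employees_staging;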

2.2 Querying Data
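
A minimal sketch of basic queries against the hypothetical employees table:

Sql

SELECT * FROM employees LIMIT 10;

SELECT name, salary
FROM employees
WHERE salary > 50000
ORDER BY salary DESC;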

2.3 Aggregations and Grouping
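
For example, a grouped aggregation, assuming the hypothetical employees table also has a
department column:

Sql

SELECT department,
  COUNT(*) AS num_employees,
  AVG(salary) AS avg_salary
FROM employees
GROUP BY department
HAVING COUNT(*) > 5;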


2.4 Joins in Hive
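
A minimal sketch of a join, assuming a hypothetical departments table and a dept_id column on
employees:

Sql

SELECT e.name, d.dept_name
FROM employees e
JOIN departments d
  ON e.dept_id = d.dept_id;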

3. Views in Hive

Views allow creating virtual tables for easy querying.
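
A minimal sketch of creating and querying a view (the names are illustrative):

Sql

CREATE VIEW high_paid_employees AS
SELECT name, salary
FROM employees
WHERE salary > 100000;

SELECT * FROM high_paid_employees;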


4. Performance Optimization in Hive

1. Partitioning – Speeds up queries by scanning only relevant partitions.


2. Bucketing – Helps with efficient joins and sampling.
3. ORC File Format – Improves query speed with columnar storage.

Summary: Hive Query Language (HQL)

HQL provides DDL commands for defining databases and tables, DML commands for loading and
querying data, and views for convenient querying, while partitioning, bucketing, and columnar
formats such as ORC are used to optimize performance.

RCFile Implementation:

RCFile (Record Columnar File) is a columnar storage format used in Apache Hive and Hadoop
to optimize query performance. It was introduced by Facebook to improve data storage and
retrieval efficiency compared to row-based storage formats like TextFile.

RCFile is implemented as a hybrid storage format, combining both row and columnar storage.
It stores data in row groups but organizes it column-wise within each row group. This structure
allows for better compression and query performance.

Key Components of RCFile Implementation

1. File Structure:
o RCFile consists of a header followed by multiple row groups.
o Each row group contains data stored in a columnar format.
2. Row Group:
o A row group contains a set number of rows (typically defined by a threshold).
o Data inside a row group is stored column-wise rather than row-wise.
o This allows better compression and selective column retrieval.
3. Compression & Storage:
o RCFile applies column-level compression, making it more storage-efficient.
o Since similar data types are stored together, compression algorithms like Gzip or
Snappy achieve higher efficiency.
4. Data Serialization & Deserialization:
o When writing data, RCFile serializes records row by row and organizes them into
column blocks.
o When reading, it decompresses only the required columns, reducing I/O
overhead and improving query speed.
5. RCFile Reader & Writer:
o RCFileWriter: Writes data into RCFile format by organizing it into row groups and
column blocks.
o RCFileReader: Reads data efficiently by loading only relevant columns, reducing
unnecessary disk I/O.

Advantages of RCFile

 Efficient Columnar Storage: Helps in better compression and query performance.


 Faster Query Execution: Since only the needed columns are read, it improves speed.
 Optimized for Hive: Works well with Hive’s execution engine.
 Better Compression: Columnar compression improves disk usage efficiency.

Limitations

 Not as advanced as ORC or Parquet: RCFile was later replaced by ORC (Optimized Row
Columnar) and Parquet, which provide better optimization.
 Lack of Schema Evolution Support: Unlike Parquet, RCFile does not handle schema
changes efficiently.

SERDE

What is Serializer/Deserializer? (SERDE)

A Serializer/Deserializer (SerDe) is a component in data processing systems that handles the
conversion of data formats between storage and application usage.

 Serialization: Converts structured data into a format suitable for storage or transmission.
 Deserialization: Converts stored or received data back into a structured format for
processing.

Why is SERDE Needed?

Hive supports different data storage formats like Text, ORC, Parquet, JSON, Avro, etc. Since
data can be structured in different ways, SERDE helps Hive understand how to interpret and
process the data correctly.
🔹 How Does SERDE Work?

When a table is created in Hive, a SERDE is specified to define how Hive should:

 Convert raw data from storage format to Hive table format (Deserialization)
 Convert Hive table format back to storage format (Serialization)
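
A minimal sketch of specifying a SerDe when creating a table, assuming the OpenCSVSerde class
that ships with Hive (the table name and property values are illustrative):

Sql

-- Table whose rows are parsed by a CSV SerDe
-- Note: OpenCSVSerde exposes all columns as STRING
CREATE TABLE employee_csv (
  id STRING,
  name STRING,
  city STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"'
)
STORED AS TEXTFILE;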

1. Deserialization Process (Reading Data into Hive)

When Hive reads data from storage (HDFS, S3, etc.), the Deserializer converts raw data into a
structured format.

Steps:

1. Hive Reads Raw Data from Storage


o The data can be stored in various formats such as Text, CSV, JSON, ORC,
Parquet, Avro.
o Hive fetches the raw bytes or text files.
2. Hive Identifies the Associated SerDe
o Based on the table definition (ROW FORMAT SERDE), Hive determines which
SerDe to use.
3. Deserializer Parses the Data
o The Deserializer processes the raw data according to its structure:
 For CSV → Splits data by commas.
 For JSON → Parses JSON keys and values.
 For ORC/Parquet → Reads binary-encoded columnar data.
4. Mapping Data to Table Schema
o After parsing, the Deserializer assigns values to respective columns in the Hive
table.
5. Hive Stores Data in Internal Format
o The structured data is now available in tabular format.
o Queries can now be executed on this structured data.

2. Serialization Process (Writing Data from Hive to Storage)

When Hive processes data and stores it back, the Serializer converts structured data into the
required storage format.

Steps:

1. Data Retrieved from Hive Table


o When a user inserts or updates data, Hive retrieves the structured row format.
2. Serializer Converts Data to Required Format
o The Serializer takes the structured data and formats it:
 For CSV → Joins values with commas.
 For JSON → Converts data to JSON string.
 For ORC/Parquet → Encodes data in binary format.
3. Data is Stored in Target Format
o The serialized data is written back to HDFS, S3, or another storage system.

User Defined Functions:

In Apache Hive, User-Defined Functions (UDFs) are custom functions created by users to
extend Hive's built-in capabilities. Hive provides a rich set of built-in functions, but when those
are not sufficient for specific use cases, users can create their own UDFs.

Types of User-Defined Functions in Hive

1. UDF (User-Defined Function)


o Used for row-wise transformations (returns a single value for each input row).
o Example: Custom string manipulation.
2. UDAF (User-Defined Aggregate Function)
o Used for aggregation (like SUM, AVG, COUNT but custom).
o Example: Finding a custom average or weighted sum.
3. UDTF (User-Defined Table-Generating Function)
o Used to return multiple rows for a single input row.
o Example: Splitting a column value into multiple rows.

How to Create a UDF in Hive?

1. Write a Java class that extends org.apache.hadoop.hive.ql.exec.UDF.


2. Implement the evaluate() method with custom logic.
3. Compile the Java class into a JAR file.
4. Add the JAR to Hive using ADD JAR command.
5. Create a temporary function in Hive using CREATE FUNCTION.
6. Use the function in Hive queries.

Example of a Simple Hive UDF

Suppose we have written a UDF named to_upper (implemented in a hypothetical Java class
com.example.ToUpperCaseUDF that extends UDF and returns its input string in upper case).

Steps to Use This UDF in Hive

1. Compile the Java file and create a JAR.


2. Add the JAR to Hive:

Sql

ADD JAR /path/to/your_udf.jar;

3. Create the function:


Sql

CREATE TEMPORARY FUNCTION to_upper AS 'com.example.ToUpperCaseUDF';


4. Use it in a Hive query:
Sql

SELECT to_upper(name) FROM employees;
