
Hive

◼ A data warehouse infrastructure built on top of Hadoop for providing
data summarization, ad-hoc queries, and analysis.
▪ Mostly used in ETL applications
▪ Provides structure to unstructured and semi-structured data
▪ Access to different storage formats
▪ Query execution via MapReduce

◼ Key Building Principles:
▪ SQL is a familiar language
▪ Extensibility – Types, Functions, Formats, Scripts
▪ Performance
Hive
◼ Hive was originally developed at Facebook
◼ Applications
▪ Log processing
▪ Text mining
▪ Document indexing
▪ Customer-facing business intelligence (e.g., Google Analytics)
▪ Predictive modeling
▪ Hypothesis testing
Hive is not meant for

◼ An OLTP application
◼ Low latency Database access
◼ Transactional database (ACID)
◼ Row-level inserts, updates, or deletes (Hive 0.14 onwards supports updates and deletes)
OLTP vs OLAP
OLTP (On-line Transaction Processing)
- OLTP is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE).
- Very fast query processing
- Maintains data integrity in multi-access environments
- Effectiveness is measured by the number of transactions per second.
- An OLTP database contains detailed, current data.

OLAP (On-line Analytical Processing)
- OLAP is characterized by a relatively low volume of transactions.
- Queries are often very complex and involve aggregations.
- For OLAP systems, response time is the effectiveness measure.
- OLAP systems are widely used with data mining techniques.
- An OLAP database contains aggregated, historical data, stored in multi-dimensional schemas (usually a star schema).
Hive vs RDBMS
HiveQL

-- Create a table in Hive
CREATE TABLE docs (line STRING);

-- Load a data file into the Hive table
LOAD DATA INPATH '/user/DATA/docs.txt'
OVERWRITE INTO TABLE docs;

-- Create a new table with data from an existing table in Hive
CREATE TABLE word_counts AS
SELECT word, count(1) AS count
FROM
  (SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
Components in Hive

◼ Shell or CLI: The command-line interface to Hive
◼ Driver: Handles session, fetch, and execute
◼ HiveQL Compiler: Parses, plans, and optimizes queries
◼ Metadata Store: Stores table and database structure
◼ Execution engine: Runs a DAG of stages (MapReduce jobs, HDFS, or metadata operations)
Hive Clients
◼ Hive Thrift Client: Makes it easy to run Hive commands from a wide range of programming languages such as C++, Java, etc.
◼ Hive JDBC Driver: Enables Java applications to connect to Hive.
◼ ODBC Driver: Allows applications that support the ODBC protocol to connect to Hive.
◼ Hive Server: Clients and applications access the Hive service through the Hive server.
◼ It is started using the command: hive --service hiveserver
Metastore
◼ The metastore is the central repository of Hive metadata.
◼ Holds table definitions (column types, physical layout)
◼ Holds information for Partitioned data
◼ Can be stored in a Derby, MySQL, or PostgreSQL database.
◼ Three types of configuration
• Embedded - It allows only one hive session to be opened at a time.
• Local - Allows multiple hive sessions (multiple users) to be connected at a
time
• Remote - Allows multiple hive sessions (multiple users) to be connected at a
time with high security.
Embedded Metastore Configuration

◼ By default, Hive uses an embedded Derby database for the metastore service.
◼ Derby resides along with Hive in the same JVM instance.
◼ It allows only one Hive session to be opened at a time.
Local Metastore Configuration

◼ Allows multiple Hive sessions (multiple users) to be connected at a time
◼ The metastore service runs in the same process as the Hive service, but connects to a database running in a separate process
◼ Uses any JDBC-compliant standalone database
◼ MySQL is most popularly used
◼ Properties to be set in hive-site.xml
• javax.jdo.option.ConnectionURL = jdbc:mysql://host/dbname
• javax.jdo.option.ConnectionDriverName = com.mysql.jdbc.Driver
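As a rough sketch, the two properties above would appear in hive-site.xml as follows (the host name and database name are placeholders; user name and password properties are omitted):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://host/dbname</value>   <!-- placeholder host and database -->
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>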
Remote Metastore Configuration

◼ One or more metastore servers run in separate processes from the Hive service.
◼ Better manageability and security, since the database tier can be
completely firewalled off, and the clients no longer need the
database credentials.
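A minimal client-side sketch, assuming a metastore server running on a hypothetical host named metastorehost and listening on the default Thrift port 9083:

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastorehost:9083</value>   <!-- hypothetical host -->
</property>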
Hive CLI

◼ Getting started with the interactive shell
• $HIVE_HOME/bin/hive

• hive> SHOW DATABASES;
  hive> CREATE DATABASE training;
  hive> USE training;
  hive> set hive.cli.print.current.db=true;
  hive(training)>
How to execute a hive query
◼ Executing queries from the command line
• $HIVE_HOME/bin/hive -e "select * from employees"
◼ Executing queries from a file
• $HIVE_HOME/bin/hive -f /training/emp.hql
◼ Writing the result of a query to a file using silent mode
• $HIVE_HOME/bin/hive -S -e "select * from employees" > /home/<username>/res.txt
Tables - Description
◼ A Hive table is made of the table's metadata and the associated data
◼ Actual data is stored in HDFS; metadata is stored in the metastore (Derby, MySQL, PostgreSQL)
◼ Two types of tables: Managed (Internal) and External
◼ Physical Data Layout
◼ Database stored in a directory in HDFS
Example: /user/hive/warehouse/
◼ Tables stored under the HDFS directory
Example:/user/hive/warehouse/<db_name>/<table_name>
/user/hive/warehouse/sample.db/sampleTable
◼ Default database is “default”
Tables – DDL & Related Information
Create & Describe Table
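A minimal sketch of creating and describing tables; the employees table and its columns are illustrative, not taken from the original slides:

-- Create a managed (internal) table; Hive owns its data under the warehouse directory
CREATE TABLE employees (
  id INT,
  name STRING,
  salary DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- An external table points at data Hive does not own;
-- dropping it removes only the metadata, not the files
CREATE EXTERNAL TABLE ext_employees (
  id INT,
  name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/DATA/employees';

-- Show the columns and their types
DESCRIBE employees;

-- Show full metadata, including location and table type
DESCRIBE FORMATTED employees;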
Hive Partitions
◼ Partitions can be created on any fields.
◼ For each value of the partition, Hive creates a sub-directory.
◼ If we partition by date, all files belonging to a particular date will go into its
respective partition.
◼ Sub-partitions are also supported.
◼ Physical Layout example
• /user/hive/warehouse/sampleTable/dt=2012-05-18
• /user/hive/warehouse/sampleTable/dt=2012-05-19

• Here, dt is the partition column; 2012-05-18 and 2012-05-19 are its values.
• Each value corresponds to a sub-directory under the table directory (see the DDL sketch below).
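A minimal DDL sketch that would produce the layout above; the non-partition columns are illustrative:

CREATE TABLE sampleTable (
  id INT,
  value STRING)
PARTITIONED BY (dt STRING);

-- Loading data for a given date creates a sub-directory such as dt=2012-05-18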
Partitions
Partitions view and drop

• To view all partitions of a table:
SHOW PARTITIONS <Hive table name>;
Ex: SHOW PARTITIONS stations;

• To drop a partition:
ALTER TABLE stations
DROP PARTITION (year=2012);
Processing of hive table by creating Partition
Step 1 - Creation of table Allstates

Create table Allstates
(state String,
District String,
Enrolments String
)
row format delimited
fields terminated by ',' ;

Step 2 - Loading data into created table Allstates


Load data local inpath '/home/hduser/Desktop/AllStates.csv'
into table Allstates;
Step 3 - Creation of partition table
create table state_part
(District String,
Enrolments String
)
PARTITIONED BY(state String);

Step 4 - For dynamic partitioning we have to set this property (in the session or in hive-site.xml)

set hive.exec.dynamic.partition.mode=nonstrict;
• Step 5 - Loading data into the partition table

INSERT OVERWRITE TABLE state_part
PARTITION(state)
SELECT district, enrolments, state
from allstates;
• Step 6 - Actual processing and formation of the partition tables, based on state as the partition key
• Step 7 - There will be 38 partition outputs in HDFS storage, each sub-directory named after a state.
• Step 8 - Check the partitions created for the Hive internal table in the following location:
$ bin/hadoop fs -ls /user/hive/warehouse/state_part
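As a usage sketch, a query that filters on the partition column reads only that partition's sub-directory (the state value below is hypothetical):

SELECT district, enrolments
FROM state_part
WHERE state = 'Assam';   -- scans only /user/hive/warehouse/state_part/state=Assam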
Views
◼ A way of decomposing complex queries.
◼ Views are query-able only; updatable views are not supported.
◼ Since views are read-only, they may not be used as the target of LOAD/INSERT/ALTER.
◼ Querying a view launches MapReduce jobs.
◼ Materialized views are not supported.
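A minimal sketch of a view over the allstates table created earlier (the view name and query are illustrative):

-- Define a read-only view; no data is materialized
CREATE VIEW state_enrolments AS
SELECT state, count(1) AS num_districts
FROM allstates
GROUP BY state;

-- Querying the view runs the underlying query (as MapReduce jobs)
SELECT * FROM state_enrolments;

-- Dropping the view removes only its definition, not the allstates data
DROP VIEW state_enrolments;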
Buckets - Description

◼ Enables efficient query execution.
◼ The clause used for bucketing is CLUSTERED BY.
Buckets - DDL
Buckets - DML
Bucketing in Hive
• Bucketing in Hive segregates table data into multiple files or directories.
• It is used for efficient querying.
• The data present in partitions can be divided further into buckets.
• The division is performed based on a hash of particular columns selected in the table.
• Hive uses a hashing algorithm on the bucketing column(s) to read each record and place it into a bucket.
• In Hive, bucketing has to be enabled with:
• set hive.enforce.bucketing=true;
• Step 1) Creating a bucketed table as shown below.

CREATE TABLE SAMPLE_BUCKET(
firstName STRING,
job_id INT,
department STRING,
salary STRING,
country STRING)
CLUSTERED BY (country) INTO 4 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' ;
Step 2) Loading data into the table SAMPLE_BUCKET
• Assuming that an "employees" table has already been created in the Hive system.

• FROM EMPLOYEES
• INSERT OVERWRITE TABLE SAMPLE_BUCKET
• SELECT first_Name, job_id, department, salary, country ;

Step 3) Displaying the 4 buckets created in Step 1
• We can see that the data from the employees table is transferred into the 4 buckets created in Step 1.
• bin/hadoop fs -ls /<data file location in hdfs>
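One way bucketing aids querying is sampling; a hedged sketch using TABLESAMPLE on the bucketed table created above:

-- Read only the first of the 4 buckets (roughly a quarter of the rows)
SELECT firstName, country
FROM SAMPLE_BUCKET
TABLESAMPLE(BUCKET 1 OUT OF 4 ON country);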
Tables - DML: Loading data

◼ Row-level inserts or updates are not supported in a Hive table (prior to Hive 0.14).
◼ Populating a table means directly loading a file containing N records into the table.
◼ Hive does not do any transformation while loading data into tables.
◼ Load operations are currently pure copy/move operations.
Table DML
• Two ways to populate a hive table or modify the table’s data:
◼ By using load statements – to load data from files or directories
◼ By using insert overwrite table/ insert into statements - to load data
from a query
Loading
LOAD DATA LOCAL INPATH '/training/demo/data' OVERWRITE INTO TABLE demo;

◼ Use LOCAL when the file to be loaded resides in the local file system
and not HDFS
◼ Use LOCAL to copy a file to Hive table location
◼ Use OVERWRITE if data is not to be appended
Load into Partitioned data

• LOAD DATA LOCAL INPATH
'/training/hivetraining/datasets/WEATHER/temperature/999233-327.txt'
INTO TABLE temperature
PARTITION (stationno='999233-327');
Insert & Overwriting data
INSERT OVERWRITE TABLE demoid
select id from demo;

INSERT OVERWRITE TABLE demonames
PARTITION(place='US')
select firstname, lastname
from demo
where country='US';

INSERT OVERWRITE overwrites existing data.
Insert & Appending data
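A minimal sketch of appending rows with INSERT INTO, which adds to the existing data instead of replacing it (reusing the demo tables above):

INSERT INTO TABLE demoid
select id from demo;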
Multi-Table Inserts
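A minimal multi-table insert sketch: the source table is scanned once and the results are written to two tables (reusing the demo tables above):

FROM demo
INSERT OVERWRITE TABLE demoid
  SELECT id
INSERT OVERWRITE TABLE demonames PARTITION(place='US')
  SELECT firstname, lastname WHERE country='US';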
Writing into a directory
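A minimal sketch of writing query results into a directory (the output paths are placeholders):

-- Write the result files into an HDFS directory, replacing its contents
INSERT OVERWRITE DIRECTORY '/user/output/demo_export'
SELECT * FROM demo;

-- Use LOCAL to write to the local file system instead
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/demo_export'
SELECT * FROM demo;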
Partition Inserts

◼ Partition columns are specified in the DDL statement itself
◼ Partition column information is kept in the metastore
◼ Static partitions – the partition value is known at compile time
Dynamic Partition Inserts

• Dynamic partitions – the partition values are evaluated from the query at run-time (as an analogy, they are like variable-argument (varargs) methods)
SELECT
Group by In Select
Distinct
Select Limit
Order By vs Sort By
SELECT – UNION ALL
CTAS, UNION ALL

◼ In UNION ALL, the column names, data types, and number of columns in all the queries being united must match exactly.
◼ Subqueries in Hive must always be given an alias, e.g. 'temp' in the query sketched below.
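A minimal CTAS-with-UNION-ALL sketch illustrating the matching-columns rule and the required subquery alias (the table names are illustrative):

CREATE TABLE all_names AS
SELECT * FROM (
  SELECT firstname, lastname FROM demonames_us
  UNION ALL
  SELECT firstname, lastname FROM demonames_uk
) temp;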
CASE statement
◼ CASE statements are like IF-THEN-ELSE
◼ Example: categorize stations as either "Missing" or "Eastern Hemisphere" based on their longitude values
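A hedged sketch of that example; the station_id and longitude column names and the exact boundary condition are assumptions, since the original query is not shown:

SELECT station_id,
       CASE
         WHEN longitude IS NULL THEN 'Missing'
         ELSE 'Eastern Hemisphere'
       END AS category
FROM stations;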
• https://career.guru99.com/top-30-hive-interview-questions/
• https://letsfindcourse.com/hadoop-questions/pig-hadoop-mcq-questions
