
Hive

◼ A data warehouse infrastructure built on top of Hadoop for providing
data summarization, ad-hoc queries, and analysis.
▪ Mostly used in ETL applications
▪ Provides structure to unstructured and semi-structured data
▪ Access to different storage formats
▪ Query execution via MapReduce

◼ Key Building Principles:
▪ SQL is a familiar language
▪ Extensibility – Types, Functions, Formats, Scripts
▪ Performance
Hive
◼ Hive was originally developed at Facebook
◼ Applications
▪ Log processing
▪ Text mining
▪ Document indexing
▪ Customer-facing business intelligence (e.g., Google Analytics)
▪ Predictive modeling
▪ Hypothesis testing
Hive is not meant for

◼ An OLTP application
◼ Low latency Database access
◼ Transactional database (ACID)
◼ Row-level inserts, updates, or deletes (Hive 0.14 onwards supports updates and deletes)
OLTP vs OLAP
OLTP (On-line Transaction Processing)
- OLTP is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE).
- Very fast query processing
- Maintains data integrity in multi-access environments
- Effectiveness is measured by the number of transactions per second.
- An OLTP database contains detailed, current data.

OLAP (On-line Analytical Processing)
- OLAP is characterized by a relatively low volume of transactions.
- Queries are often very complex and involve aggregations.
- For OLAP systems, response time is the effectiveness measure.
- OLAP systems are widely used with data mining techniques.
- An OLAP database contains aggregated, historical data, stored in multi-dimensional schemas (usually a star schema).
Hive vs RDBMS
HiveQL

-- Create a table in Hive
CREATE TABLE docs (line STRING);

-- Load a data file into the Hive table
LOAD DATA INPATH '/user/DATA/docs.txt'
OVERWRITE INTO TABLE docs;

-- Create a new table with data from an existing table in Hive
CREATE TABLE word_counts AS
SELECT word, count(1) AS count
FROM
  (SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
Components in Hive

◼ Shell or CLI: The command-line interface to Hive
◼ Driver: Handles session, fetch, and execute
◼ HiveQL Compiler: Parses, plans, and optimizes queries
◼ Metadata Store: Stores table and database structure
◼ Execution engine: Runs a DAG of stages (MapReduce jobs, HDFS, or metadata operations)
Hive Clients
◼ Hive Thrift Client: Makes it easy to run Hive commands from a wide range of programming languages such as C++, Java, etc.
◼ Hive JDBC Driver: Enables Java applications to connect to Hive.
◼ ODBC Driver: Allows applications that support the ODBC protocol to connect to Hive.
◼ Hive Server: Clients and applications access the Hive service through the Hive server.
◼ It is started using the command: hive --service hiveserver
Metastore
◼ The metastore is the central repository of Hive metadata.
◼ Holds table definitions (column types, physical layout)
◼ Holds information for Partitioned data
◼ Can be stored in a Derby, MySQL, or PostgreSQL database.
◼ Three types of configuration
• Embedded - It allows only one hive session to be opened at a time.
• Local - Allows multiple hive sessions (multiple users) to be connected at a
time
• Remote - Allows multiple hive sessions (multiple users) to be connected at a
time with high security.
Embedded Metastore Configuration

◼ By default, Hive uses an embedded Derby database for the metastore service.
◼ Derby resides along with Hive in the same JVM instance.
◼ It allows only one Hive session to be opened at a time.
Local Metastore Configuration

◼ Allows multiple Hive sessions (multiple users) to be connected at a time
◼ The metastore service runs in the same process as the Hive service, but connects to a database running in a separate process
◼ Uses any JDBC-compliant standalone database
◼ MySQL is most popularly used
◼ Properties to be set in hive-site.xml
• javax.jdo.option.ConnectionURL = jdbc:mysql://host/dbname
• javax.jdo.option.ConnectionDriverName = com.mysql.jdbc.Driver
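As a rough sketch, the two properties above would appear in hive-site.xml as follows (the host name and database name are placeholders; user name and password properties are omitted):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://host/dbname</value>   <!-- placeholder host and database -->
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>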
Remote Metastore Configuration

◼ One or more metastore servers run in separate processes from the Hive service.
◼ Better manageability and security, since the database tier can be
completely firewalled off, and the clients no longer need the
database credentials.
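A minimal client-side sketch, assuming a metastore server running on a hypothetical host named metastorehost and listening on the default Thrift port 9083:

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastorehost:9083</value>   <!-- hypothetical host -->
</property>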
Hive CLI

◼ Getting started with the interactive shell
• $HIVE_HOME/bin/hive

• hive> SHOW DATABASES;
  hive> CREATE DATABASE training;
  hive> USE training;
  hive> set hive.cli.print.current.db=true;
  hive(training)>
How to execute a hive query
◼ Executing queries from the command line
• $HIVE_HOME/bin/hive -e "select * from employees"
◼ Executing queries from a file
• $HIVE_HOME/bin/hive -f /training/emp.hql
◼ Writing the result of a query to a file using silent mode
• $HIVE_HOME/bin/hive -S -e "select * from employees" > /home/<username>/res.txt
Tables - Description
◼ A Hive table is made of the table's metadata and the associated data
◼ Actual data is stored in HDFS; metadata is stored in the metastore (Derby, MySQL, PostgreSQL)
◼ Two types of tables: Managed (Internal) and External
◼ Physical Data Layout
◼ Database stored in a directory in HDFS
Example: /user/hive/warehouse/
◼ Tables stored under the HDFS directory
Example:/user/hive/warehouse/<db_name>/<table_name>
/user/hive/warehouse/sample.db/sampleTable
◼ Default database is “default”
Tables – DDL & Related Information
Create & Describe Table
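A minimal sketch of creating and describing tables; the employees table and its columns are illustrative, not taken from the original slides:

-- Create a managed (internal) table; Hive owns its data under the warehouse directory
CREATE TABLE employees (
  id INT,
  name STRING,
  salary DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- An external table points at data Hive does not own;
-- dropping it removes only the metadata, not the files
CREATE EXTERNAL TABLE ext_employees (
  id INT,
  name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/DATA/employees';

-- Show the columns and their types
DESCRIBE employees;

-- Show full metadata, including location and table type
DESCRIBE FORMATTED employees;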
Hive Partitions
◼ Partitions can be created on any fields.
◼ For each value of the partition, Hive creates a sub-directory.
◼ If we partition by date, all files belonging to a particular date will go into its
respective partition.
◼ Sub-partitions are also supported.
◼ Physical Layout example
• /user/hive/warehouse/sampleTable/dt=2012-05-18
• /user/hive/warehouse/sampleTable/dt=2012-05-19

• Here, dt is the partition column; 2012-05-18 and 2012-05-19 are its values.
• Each value corresponds to a sub-directory under the table directory (see the DDL sketch below).
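A minimal DDL sketch that would produce the layout above; the non-partition columns are illustrative:

CREATE TABLE sampleTable (
  id INT,
  value STRING)
PARTITIONED BY (dt STRING);

-- Loading data for a given date creates a sub-directory such as dt=2012-05-18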
Partitions
Partitions view and drop

• To view all partitions of a table:
SHOW PARTITIONS <Hive table name>;
Ex: SHOW PARTITIONS stations;

• To drop a partition:
ALTER TABLE stations
DROP PARTITION (year=2012);
Processing of hive table by creating Partition
Step 1 - Creation of table Allstates

Create table Allstates
(state String,
District String,
Enrolments String
)
row format delimited
fields terminated by ',' ;

Step 2 - Loading data into created table Allstates


Load data local inpath '/home/hduser/Desktop/AllStates.csv'
into table Allstates;
Step 3 - Creation of partition table
create table state_part
(District String,
Enrolments String
)
PARTITIONED BY(state String);

Step 4 - For dynamic partitioning we have to set this property (in the session or in hive-site.xml)

set hive.exec.dynamic.partition.mode=nonstrict;
• Step 5 - Loading data into the partition table

INSERT OVERWRITE TABLE state_part
PARTITION(state)
SELECT district, enrolments, state
from allstates;
• Step 6 - Actual processing and formation of the partition tables, based on state as the partition key
• Step 7 - There will be 38 partition outputs in HDFS storage, each sub-directory named after a state.
• Step 8 - Check the partitions created for the Hive internal table in the following location:
$ bin/hadoop fs -ls /user/hive/warehouse/state_part
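As a usage sketch, a query that filters on the partition column reads only that partition's sub-directory (the state value below is hypothetical):

SELECT district, enrolments
FROM state_part
WHERE state = 'Assam';   -- scans only /user/hive/warehouse/state_part/state=Assam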
Views
◼ A way of decomposing complex queries.
◼ Views are query-able only; updatable views are not supported.
◼ Since views are read-only, they may not be used as the target of LOAD/INSERT/ALTER.
◼ Querying a view launches MapReduce jobs.
◼ Materialized views are not supported.
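A minimal sketch of a view over the allstates table created earlier (the view name and query are illustrative):

-- Define a read-only view; no data is materialized
CREATE VIEW state_enrolments AS
SELECT state, count(1) AS num_districts
FROM allstates
GROUP BY state;

-- Querying the view runs the underlying query (as MapReduce jobs)
SELECT * FROM state_enrolments;

-- Dropping the view removes only its definition, not the allstates data
DROP VIEW state_enrolments;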
Buckets - Description

◼ Enables efficient query execution.
◼ The clause used for bucketing is CLUSTERED BY.
Buckets - DDL
Buckets - DML
Bucketing in Hive
• Bucketing in Hive segregates table data into multiple files or directories.
• It is used for efficient querying.
• The data present in partitions can be divided further into buckets.
• The division is performed based on a hash of particular columns selected in the table.
• Hive uses a hashing algorithm on the bucketing column(s) to read each record and place it into a bucket.
• In Hive, bucketing has to be enabled with:
• set hive.enforce.bucketing=true;
• Step 1) Creating a bucketed table as shown below.

CREATE TABLE SAMPLE_BUCKET(
firstName STRING,
job_id INT,
department STRING,
salary STRING,
country STRING)
CLUSTERED BY (country) INTO 4 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' ;
Step 2) Loading data into the table SAMPLE_BUCKET
• Assuming that an "employees" table has already been created in the Hive system.

• FROM EMPLOYEES
• INSERT OVERWRITE TABLE SAMPLE_BUCKET
• SELECT first_Name, job_id, department, salary, country ;

Step 3) Displaying the 4 buckets created in Step 1
• We can see that the data from the employees table is transferred into the 4 buckets created in Step 1.
• bin/hadoop fs -ls /<data file location in hdfs>
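One way bucketing aids querying is sampling; a hedged sketch using TABLESAMPLE on the bucketed table created above:

-- Read only the first of the 4 buckets (roughly a quarter of the rows)
SELECT firstName, country
FROM SAMPLE_BUCKET
TABLESAMPLE(BUCKET 1 OUT OF 4 ON country);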
Tables - DML: Loading data

◼ Row-level inserts or updates are not supported in a Hive table (prior to Hive 0.14).
◼ Populating a table means directly loading a file containing N records into the table.
◼ Hive does not do any transformation while loading data into tables.
◼ Load operations are currently pure copy/move operations.
Table DML
• Two ways to populate a hive table or modify the table’s data:
◼ By using load statements – to load data from files or directories
◼ By using insert overwrite table/ insert into statements - to load data
from a query
Loading
LOAD DATA LOCAL INPATH '/training/demo/data' OVERWRITE INTO TABLE demo;

◼ Use LOCAL when the file to be loaded resides in the local file system
and not HDFS
◼ Use LOCAL to copy a file to Hive table location
◼ Use OVERWRITE if data is not to be appended
Load into Partitioned data

• LOAD DATA LOCAL INPATH
'/training/hivetraining/datasets/WEATHER/temperature/999233-327.txt'
INTO TABLE temperature
PARTITION (stationno='999233-327');
Insert & Overwriting data
INSERT OVERWRITE TABLE demoid
select id from demo;

INSERT OVERWRITE TABLE demonames
PARTITION(place='US')
select firstname, lastname
from demo
where country='US';

INSERT OVERWRITE overwrites existing data.
Insert & Appending data
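A minimal sketch of appending rows with INSERT INTO, which adds to the existing data instead of replacing it (reusing the demo tables above):

INSERT INTO TABLE demoid
select id from demo;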
Multi-Table Inserts
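A minimal multi-table insert sketch: the source table is scanned once and the results are written to two tables (reusing the demo tables above):

FROM demo
INSERT OVERWRITE TABLE demoid
  SELECT id
INSERT OVERWRITE TABLE demonames PARTITION(place='US')
  SELECT firstname, lastname WHERE country='US';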
Writing into a directory
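A minimal sketch of writing query results into a directory (the output paths are placeholders):

-- Write the result files into an HDFS directory, replacing its contents
INSERT OVERWRITE DIRECTORY '/user/output/demo_export'
SELECT * FROM demo;

-- Use LOCAL to write to the local file system instead
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/demo_export'
SELECT * FROM demo;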
Partition Inserts

◼ Partition columns are specified in the DDL statement itself
◼ Partition column information is kept in the metastore
◼ Static partitions – the partition value is known at compile time
Dynamic Partition Inserts

• Dynamic partitions – the partition values are evaluated from the query at run-time (as an analogy, they are like variable-argument (varargs) methods)
SELECT
Group by In Select
Distinct
Select Limit
Order By vs Sort By
SELECT – UNION ALL
CTAS, UNION ALL

◼ In UNION ALL, the column names, data types, and number of columns in all the queries being united must match exactly.
◼ Subqueries in Hive must always be given an alias, e.g. 'temp' in the query sketched below.
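A minimal CTAS-with-UNION-ALL sketch illustrating the matching-columns rule and the required subquery alias (the table names are illustrative):

CREATE TABLE all_names AS
SELECT * FROM (
  SELECT firstname, lastname FROM demonames_us
  UNION ALL
  SELECT firstname, lastname FROM demonames_uk
) temp;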
CASE statement
◼ CASE statements are like IF-THEN-ELSE
◼ Example: categorize stations as either "Missing" or "Eastern Hemisphere" based on their longitude values
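A hedged sketch of that example; the station_id and longitude column names and the exact boundary condition are assumptions, since the original query is not shown:

SELECT station_id,
       CASE
         WHEN longitude IS NULL THEN 'Missing'
         ELSE 'Eastern Hemisphere'
       END AS category
FROM stations;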
• https://career.guru99.com/top-30-hive-interview-questions/
• https://letsfindcourse.com/hadoop-questions/pig-hadoop-mcq-questions
