Introduction to Pig, Hive, HBase, and ZooKeeper
Apache Pig
● A platform to create programs that run on top of Hadoop in order to analyze large sets of data
● Pig has two main components:
● Pig Latin - a high-level language for writing data analysis programs
● Pig Engine - the execution environment that runs Pig Latin programs
● Execution Types:
1. Local Mode: needs access only to a single machine. Pig runs in a single JVM and accesses the local
filesystem
2. Hadoop (MapReduce) Mode: needs access to a Hadoop cluster and an HDFS installation. Pig
translates queries into MapReduce jobs and runs them on the Hadoop cluster.
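The execution type is selected with the `-x` flag when launching Pig; a sketch (the script name is illustrative):

```
pig -x local myscript.pig      # local mode: single JVM, local filesystem
pig -x mapreduce myscript.pig  # MapReduce mode (the default): jobs run on the cluster
```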
WHAT IS PIG?
Framework for analyzing large unstructured and
semi-structured data on top of Hadoop
Pig Engine: the runtime environment where the
program is executed.
Pig Latin: a simple but powerful data-flow language,
similar to a scripting language.
1. SQL-like syntax
2. Provides common data operations (load,
filter, join, group, store)
Pig Latin - Features and Data Flow: Advantages over the
MapReduce Framework
Features:
1. Pig Latin provides various operators and gives developers the flexibility to develop their own
functions for processing, reading, and writing data
2. A Pig Latin script is made up of a series of operations, or transformations, that are applied to the
input data to produce output
Data Flow:
1. A LOAD statement to read data from the file system
2. A series of "transformation" statements to process the data
3. A DUMP statement to view the results, or a STORE statement to save them
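The three stages above can be sketched in Pig Latin (the file path, schema, and relation names are illustrative):

```
-- LOAD: read tab-separated records from the file system
records = LOAD 'input/sales.txt' AS (store:chararray, amount:int);

-- transformations: filter, group, and aggregate
big     = FILTER records BY amount > 100;
grouped = GROUP big BY store;
totals  = FOREACH grouped GENERATE group, SUM(big.amount);

-- DUMP to view the result, or STORE to save it
DUMP totals;
-- STORE totals INTO 'output/totals';
```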
Pig Architecture and Components
Components:
1. Parser
2. Compiler
3. Optimizer
4. Execution Engine
Execution Steps
1. Programmers write scripts in the Pig Latin language to analyze data.
2. The Pig Engine accepts the Pig Latin scripts as input and converts them into MapReduce jobs.
3. The MapReduce jobs are then run on the Hadoop cluster.
Limitations
1. Pig does not support random reads or queries on the order of tens of milliseconds.
2. Pig does not support random writes to update small portions of data; all writes are bulk, streaming
writes, just like in MapReduce.
3. Low-latency queries are not supported in Pig, making it unsuitable for OLAP and OLTP.
WHAT IS HIVE?
o Hive: a data warehousing system for storing and querying structured data on the Hadoop file system
o Developed by Facebook
o Provides easy querying by executing Hadoop MapReduce plans
o Provides an SQL-like query language called HiveQL (HQL)
o The Hive shell is the primary way that we will interact with Hive
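A minimal HiveQL sketch of the kind of query run from the Hive shell (the table and column names are illustrative):

```
-- define a table over structured data
CREATE TABLE page_views (user_id INT, url STRING, view_time TIMESTAMP);

-- an SQL-like query that Hive compiles into a MapReduce plan
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;
```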
Introduction to HIVE
● An alternative to raw MapReduce, where users have to understand advanced Java programming in
order to successfully query data
● An ETL and data warehousing tool on top of Hadoop
● Data summarization and analysis of structured data
● Organizes data by partitioning and bucketing
● HiveQL: used to query the data
HIVE DATA MODEL
• Tables: all the data is stored in a directory in HDFS
• Primitives: numeric, boolean, string and timestamps
• Complex: Arrays, maps and structs
• Partitions: divides a table into parts
• Queries that are restricted to a particular date or set of dates can run much more
efficiently because they only need to scan the files in the partitions that the query
pertains to
• Buckets: the data in each partition is divided into buckets
1. Enable more efficient queries
2. Makes sampling more efficient
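Partitions and buckets are declared when a table is created; a HiveQL sketch with illustrative table and column names:

```
-- one HDFS subdirectory per value of dt; 32 bucket files per partition,
-- assigned by a hash of user_id
CREATE TABLE logs (user_id INT, line STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

-- restricted to one partition, so only that directory is scanned
SELECT COUNT(*) FROM logs WHERE dt = '2023-01-01';
```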
Components in HIVE
1. Hadoop core components
2. Metastore
3. Driver
4. Hive Clients
MAJOR COMPONENTS OF HIVE
• UI: users submit queries and other operations to the system
• Driver: handles sessions, and provides execute and fetch APIs modeled on JDBC/ODBC interfaces
• Metastore: Stores all the structure information of the various tables and partitions in the
warehouse
• Compiler: Converts the HiveQL into a plan for execution
• Execution Engine: Manages dependencies between these different stages of the plan and
executes these stages on the appropriate system components
Components
Driver:
Receives the HiveQL statements, parses the query, and performs semantic analysis.
Acts as a controller, creating sessions and observing the progress and life cycle of the various actions.
JAR files that are part of the Hive package help convert these HiveQL queries into equivalent MapReduce
jobs.
Hive Clients:
The interface through which we submit Hive queries.
Examples: Hive CLI, Beeline
Hive vs Relational Database
● Hive offers some functionality that is not available
in relational databases.
● Relational databases enforce "schema on write";
Hive is "schema on read".
● No support for UPDATE or DELETE in Hive.
● No support for inserting single rows.
● Supports Partitioning and Bucketing.
PIG vs HIVE
PIG:
● Procedural data-flow language.
● Mainly used when there are many joins and filters.
● Operates on the client side of a cluster.
● Mainly used by researchers for programming.
● Can handle both structured and unstructured data.
● Cannot operate on a Thrift server.
● Uses Pig Latin for programming.
● No need to create tables.
HIVE:
● Declarative SQL-like language.
● Used when a limited number of joins are present.
● Operates on the server side of a cluster.
● Mainly used by data analysts for creating reports.
● Supports only structured data.
● Can operate on a Thrift server.
● Uses HQL, which goes beyond SQL.
● Tables must be created manually.
HIVE Pros and Cons:
Pros:
● Hive works extremely well with large data sets, and makes analysis over them easy.
● User-defined functions give users the flexibility to define frequently used operations as functions.
● The string functions available in Hive are extensively used for analysis.
● Partitioning increases query efficiency.
Cons:
● Joins (especially left and right joins) are very complex, space-consuming, and time-consuming;
improvement in this area would be of great help.
● Debugging can be messy, with ambiguous return codes, and large jobs can fail without much
explanation as to why.
● Slow, because it uses MapReduce.
PIG Pros and Cons:
Pros:
● It has many advanced features built in, such as joins, secondary sort, many optimizations, predicate
push-down, etc.
● Provides a decent abstraction over MapReduce jobs, allowing for faster results than creating your
own MR jobs.
● Can handle large and unstructured datasets.
Cons:
● Writing your own user-defined functions (UDFs) is a nice feature but can be painful to implement in
practice.
● May not fit every need, and a SQL-like abstraction may not be easy.
● Commands are not executed until you DUMP or STORE an intermediate or final result, which
lengthens the cycle of debugging and resolving issues.
HBase
● HBase is a distributed, column-oriented database built on top of HDFS. It is an open-source
project and is horizontally scalable.
● HBase has a data model similar to Google's Bigtable, designed to provide quick random
access to huge amounts of structured data.
● HBase is the part of the Hadoop ecosystem that provides real-time read/write access to data in the
Hadoop file system.
● HBase stores its data in HDFS.
Features of Hbase
● HBase is a sparse, multidimensional, sorted
map-based database, which supports multiple
versions of the same record.
● HBase provides atomic reads and writes.
● As a result, HBase provides consistent reads and
writes.
● HBase is linearly scalable.
● It has automatic failover support.
● It integrates with Hadoop, both as a source and a
destination.
● It has an easy-to-use Java API for clients.
● It provides data replication across clusters.
HBase is…
● A distributed, column-oriented database built on top of HDFS.
● A data model similar to Google's Bigtable, designed to provide quick random access to huge
amounts of data.
HBase is not…
● An SQL database.
● Relational.
● A system with joins, a fancy query language, or a sophisticated query engine.
HBase Features
Linear Scalability: Capable of storing hundreds of terabytes of data.
Automatic and configurable sharding of tables.
Automatic failover support.
Strictly consistent reads and writes.
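These random-access reads and writes can be tried from the HBase shell; a sketch with illustrative table, row, and column names:

```
create 'users', 'info'                    # table with one column family
put 'users', 'row1', 'info:name', 'Ada'   # random write to a single cell
get 'users', 'row1'                       # low-latency random read of one row
scan 'users'                              # sequential scan of the whole table
```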
HBase vs HDFS
Both are distributed systems that scale to hundreds or thousands of nodes.
HBase vs HDFS (Continued...)
HBase:
• A database built on top of HDFS.
• Provides fast lookups for large tables.
• Provides low-latency access to single rows from billions of records (random access).
• Internally uses hash tables and provides random access; it stores the data in indexed HDFS files
for faster lookups.
HDFS:
• Suitable for storing large files.
• Does not support fast individual record lookups.
• Provides high-latency batch processing.
• Provides only sequential access to data.
HBase vs HDFS vs Hive
Zookeeper
● Apache ZooKeeper is a software project of the Apache Software Foundation.
● It is essentially a distributed hierarchical key-value store, which is used to
provide a distributed configuration service, synchronization service, and naming
registry for large distributed systems.
● Examples include configuration information, hierarchical naming space, and so
on. Applications can leverage these to coordinate distributed processing across
large clusters.
● ZooKeeper was developed at Yahoo! Research and was a sub-project of Hadoop,
but is now a top-level Apache project in its own right.
Zookeeper
● ZooKeeper is a centralized service for maintaining configuration information, naming, providing
distributed synchronization, and group services.
● ZooKeeper provides an infrastructure for cross-node synchronization by maintaining status type
information in memory on ZooKeeper servers.
Components of ZooKeeper
● Client - a node in our distributed application cluster; accesses information from the server, and
interacts with the server to confirm that the connection is established.
● Server - a node in our ZooKeeper ensemble; provides all the services to clients, and sends
acknowledgements to inform clients that it is alive.
● Ensemble - a group of ZooKeeper servers. The minimum number of nodes required to form an
ensemble is 3.
● Leader - the server node that performs automatic recovery if any of the connected nodes fails.
Leaders are elected on service startup.
● Follower - a server node that follows the leader's instructions.
Znode(ZooKeeper Node)
Every znode in the ZooKeeper data model maintains a stat structure, which consists of:
● Version number - every time the data associated with the znode changes, its corresponding version
number is updated.
● Access Control List (ACL) - the authentication mechanism for accessing the znode, governing its
read/write operations.
● Timestamp - records the time of znode creation and modification.
● Data length - the amount of data stored in the znode; maximum 1 MB.
ZooKeeper
● The ZooKeeper Command Line Interface (CLI) is used to interact with the ZooKeeper ensemble for
development and debugging purposes.
● To perform ZooKeeper CLI operations, first start the server and then the client; the client can then
perform the following operations:
● Create a znode
○ Ephemeral znodes (flag -e): deleted once the session expires.
○ Sequential znodes (flag -s): given a unique sequence-numbered path.
● Get data
● Watch znode
● Set data
● Create children of a znode
● List children of a znode
● Check Status
● Delete a znode
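The operations above map directly onto CLI commands; a sample session (the paths and data values are illustrative):

```
create /app "v1"           # create a persistent znode
create -e /app/lock ""     # ephemeral: deleted when the session expires
create -s /app/task- ""    # sequential: a unique numeric suffix is appended
get /app                   # read the data and the stat structure
set /app "v2"              # update the data; bumps the version number
ls /app                    # list children of a znode
stat /app                  # check status
delete /app/lock           # delete a znode
```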
ZooKeeper
Using the ZooKeeper API, an application can connect, interact, manipulate data, coordinate, and finally
disconnect from a ZooKeeper ensemble.
● A rich set of features exposes all the functionality of the ZooKeeper ensemble in a simple and safe
manner.
● The ZooKeeper API provides a small set of methods to manipulate all the details of znodes in the
ensemble.
Steps followed to interact with ZooKeeper:
1. Connect to the ZooKeeper ensemble. The ensemble assigns a session ID to the client.
2. Send heartbeats to the server periodically; otherwise, the ensemble expires the session ID
and the client needs to reconnect.
3. Get / Set the znodes.
4. Disconnect once all the tasks are completed.
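A minimal sketch of these steps using the ZooKeeper Java client. This assumes the org.apache.zookeeper client library is on the classpath and a server is reachable at localhost:2181; the paths and data are illustrative:

```java
import org.apache.zookeeper.*;

public class ZkSketch {
    public static void main(String[] args) throws Exception {
        // 1. connect; the ensemble assigns a session ID, and the client
        //    library sends heartbeats for us (step 2) within the timeout
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});

        // 3. get/set znodes
        zk.create("/app", "v1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        byte[] data = zk.getData("/app", false, null);

        // 4. disconnect once all the tasks are completed
        zk.close();
    }
}
```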
Resources and Video Links
1. Apache PIG: http://pig.apache.org
2. Apache Hive: https://hive.apache.org/
3. Apache Zookeeper: https://zookeeper.apache.org/
4. HBase architecture: https://www.edureka.co/blog/hbase-architecture/
PIG- https://youtu.be/rxnXHlaSohM
HIVE- https://youtu.be/uY7Rr7ru9E4
HBase- https://youtu.be/kN01ELCAsn8
ZooKeeper- https://youtu.be/Kgf9EjTNucM
THANK YOU!!!
Questions???