Introduction to Pig, Hive, HBase, and ZooKeeper
Apache Pig
● A platform to create programs that run on top of Hadoop in order to analyze large sets of data
● Pig has two main components:
● Pig Latin - a high-level language for writing data analysis programs
● Pig Engine - the execution environment that runs Pig Latin programs
● Execution Types:
1. Local Mode: needs access only to a single machine. Pig runs in a single JVM and accesses the local
filesystem
2. Hadoop (MapReduce) Mode: needs access to a Hadoop cluster and an HDFS installation. Pig
translates queries into MapReduce jobs and runs them on the Hadoop cluster.
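The execution type is selected with the `-x` flag when launching Pig; a sketch (the script name is illustrative):

```
pig -x local myscript.pig      # local mode: single JVM, local filesystem
pig -x mapreduce myscript.pig  # MapReduce mode (the default): jobs run on the cluster
```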
WHAT IS PIG?
Framework for analyzing large unstructured and
semi-structured data on top of Hadoop
Pig Engine: the runtime environment where the
program is executed.
Pig Latin: a simple but powerful data-flow language,
similar to a scripting language.
1. SQL-like syntax
2. Provides common data operations (load,
filter, join, group, store)
Pig Latin - Features and Data Flow: Advantages over the
MapReduce Framework
Features:
1. Pig Latin provides various operators and gives developers the flexibility to develop their own
functions for processing, reading, and writing data
2. A Pig Latin script is made up of a series of operations, or transformations, that are applied to the
input data to produce output
Data Flow:
1. A LOAD statement to read data from the file system
2. A series of "transformation" statements to process the data
3. A DUMP statement to view the results, or a STORE statement to save them
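The three stages above can be sketched in Pig Latin (the file path, schema, and relation names are illustrative):

```
-- LOAD: read tab-separated records from the file system
records = LOAD 'input/sales.txt' AS (store:chararray, amount:int);

-- transformations: filter, group, and aggregate
big     = FILTER records BY amount > 100;
grouped = GROUP big BY store;
totals  = FOREACH grouped GENERATE group, SUM(big.amount);

-- DUMP to view the result, or STORE to save it
DUMP totals;
-- STORE totals INTO 'output/totals';
```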
Pig Architecture and Components
Components:
1. Parser
2. Compiler
3. Optimizer
4. Execution Engine
Execution Steps
1. Programmers write scripts in the Pig Latin language to analyze data.
2. The Pig Engine accepts the Pig Latin scripts as input and converts them into MapReduce jobs.
3. The MapReduce jobs are then run on the Hadoop cluster.
Limitations
1. Pig does not support random reads or queries on the order of tens of milliseconds.
2. Pig does not support random writes to update small portions of data; all writes are bulk, streaming
writes, just like in MapReduce.
3. Low-latency queries are not supported in Pig, making it unsuitable for OLAP and OLTP.
WHAT IS HIVE?
o Hive: a data warehousing system for storing and querying structured data on the Hadoop file system
o Developed by Facebook
o Provides easy querying by executing Hadoop MapReduce plans
o Provides an SQL-like query language called HiveQL (HQL)
o The Hive shell is the primary way that we will interact with Hive
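A minimal HiveQL sketch of the kind of query run from the Hive shell (the table and column names are illustrative):

```
-- define a table over structured data
CREATE TABLE page_views (user_id INT, url STRING, view_time TIMESTAMP);

-- an SQL-like query that Hive compiles into a MapReduce plan
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;
```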
Introduction to HIVE
● An alternative to raw MapReduce, where users have to understand advanced Java programming in
order to successfully query data
● An ETL and data warehousing tool on top of Hadoop
● Data summarization and analysis of structured data
● Organizes data by partitioning and bucketing
● HiveQL: used to query the data
HIVE DATA MODEL
• Tables: all the data is stored in a directory in HDFS
• Primitives: numeric, boolean, string and timestamps
• Complex: Arrays, maps and structs
• Partitions: divides a table into parts
• Queries that are restricted to a particular date or set of dates can run much more
efficiently because they only need to scan the files in the partitions that the query
pertains to
• Buckets: the data in each partition is divided into buckets
1. Enable more efficient queries
2. Makes sampling more efficient
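Partitions and buckets are declared when a table is created; a HiveQL sketch with illustrative table and column names:

```
-- one HDFS subdirectory per value of dt; 32 bucket files per partition,
-- assigned by a hash of user_id
CREATE TABLE logs (user_id INT, line STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

-- restricted to one partition, so only that directory is scanned
SELECT COUNT(*) FROM logs WHERE dt = '2023-01-01';
```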
Components in HIVE
1. Hadoop core components
2. Metastore
3. Driver
4. Hive Clients
MAJOR COMPONENTS OF HIVE
• UI: users submit queries and other operations to the system
• Driver: handles sessions, and provides execute and fetch APIs modeled on JDBC/ODBC interfaces
• Metastore: Stores all the structure information of the various tables and partitions in the
warehouse
• Compiler: Converts the HiveQL into a plan for execution
• Execution Engine: Manages dependencies between these different stages of the plan and
executes these stages on the appropriate system components
Components
Driver:
Receives the HiveQL statements, parses the query, and performs semantic analysis.
Acts as a controller, creating sessions and observing the progress and life cycle of the various actions.
JAR files that are part of the Hive package help convert these HiveQL queries into equivalent MapReduce
jobs.
Hive Clients:
The interface through which we submit Hive queries.
Examples: Hive CLI, Beeline
Hive vs Relational Database
● Hive offers some functionality that is not available
in relational databases.
● Relational databases enforce "schema on write";
Hive is "schema on read".
● No support for UPDATE or DELETE in Hive.
● No support for inserting single rows.
● Supports Partitioning and Bucketing.
PIG vs HIVE
PIG:
● Procedural data-flow language.
● Mainly used when there are many joins and filters.
● Operates on the client side of a cluster.
● Mainly used by researchers for programming.
● Can handle both structured and unstructured data.
● Cannot operate on a Thrift server.
● Uses Pig Latin for programming.
● No need to create tables.
HIVE:
● Declarative SQL-like language.
● Used when a limited number of joins are present.
● Operates on the server side of a cluster.
● Mainly used by data analysts for creating reports.
● Supports only structured data.
● Can operate on a Thrift server.
● Uses HQL, which goes beyond SQL.
● Tables must be created manually.
HIVE Pros and Cons:
Pros:
● Hive works extremely well with large data sets, and makes analysis over them easy.
● User-defined functions give users the flexibility to define frequently used operations as functions.
● The string functions available in Hive are extensively used for analysis.
● Partitioning increases query efficiency.
Cons:
● Joins (especially left and right joins) are very complex, space-consuming, and time-consuming;
improvement in this area would be of great help.
● Debugging can be messy, with ambiguous return codes, and large jobs can fail without much
explanation as to why.
● Slow, because it uses MapReduce.
PIG Pros and Cons:
Pros:
● It has many advanced features built in, such as joins, secondary sort, many optimizations, predicate
push-down, etc.
● Provides a decent abstraction over MapReduce jobs, allowing for faster results than creating your
own MR jobs.
● Can handle large and unstructured datasets.
Cons:
● Writing your own user-defined functions (UDFs) is a nice feature but can be painful to implement in
practice.
● May not fit every need, and a SQL-like abstraction may not be easy.
● Commands are not executed until you DUMP or STORE an intermediate or final result, which
lengthens the cycle of debugging and resolving issues.
HBase
● HBase is a distributed, column-oriented database built on top of HDFS. It is an open-source
project and is horizontally scalable.
● HBase has a data model similar to Google's Bigtable, designed to provide quick random
access to huge amounts of structured data.
● HBase is the part of the Hadoop ecosystem that provides real-time read/write access to data in the
Hadoop file system.
● HBase stores its data in HDFS.
Features of Hbase
● HBase is a sparse, multidimensional, sorted
map-based database, which supports multiple
versions of the same record.
● HBase provides atomic reads and writes.
● As a result, HBase provides consistent reads and
writes.
● HBase is linearly scalable.
● It has automatic failover support.
● It integrates with Hadoop, both as a source and a
destination.
● It has an easy-to-use Java API for clients.
● It provides data replication across clusters.
HBase is…
● A distributed, column-oriented database built on top of HDFS.
● A data model similar to Google's Bigtable, designed to provide quick random access to huge
amounts of data.
HBase is not…
● An SQL database.
● Relational.
● A system with joins, a fancy query language, or a sophisticated query engine.
HBase Features
Linear Scalability: Capable of storing hundreds of terabytes of data.
Automatic and configurable sharding of tables.
Automatic failover support.
Strictly consistent reads and writes.
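These random-access reads and writes can be tried from the HBase shell; a sketch with illustrative table, row, and column names:

```
create 'users', 'info'                    # table with one column family
put 'users', 'row1', 'info:name', 'Ada'   # random write to a single cell
get 'users', 'row1'                       # low-latency random read of one row
scan 'users'                              # sequential scan of the whole table
```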
HBase vs HDFS
Both are distributed systems that scale to hundreds or thousands of nodes.
HBase vs HDFS (Continued...)
HBase:
• A database built on top of HDFS.
• Provides fast lookups for large tables.
• Provides low-latency access to single rows from billions of records (random access).
• Internally uses hash tables and provides random access; it stores the data in indexed HDFS files
for faster lookups.
HDFS:
• Suitable for storing large files.
• Does not support fast individual record lookups.
• Provides high-latency batch processing.
• Provides only sequential access to data.
HBase vs HDFS vs Hive
Zookeeper
● Apache ZooKeeper is a software project of the Apache Software Foundation.
● It is essentially a distributed hierarchical key-value store, which is used to
provide a distributed configuration service, synchronization service, and naming
registry for large distributed systems.
● Examples include configuration information, hierarchical naming space, and so
on. Applications can leverage these to coordinate distributed processing across
large clusters.
● ZooKeeper was developed at Yahoo! Research and was a sub-project of Hadoop,
but is now a top-level Apache project in its own right.
Zookeeper
● ZooKeeper is a centralized service for maintaining configuration information, naming, providing
distributed synchronization, and group services.
● ZooKeeper provides an infrastructure for cross-node synchronization by maintaining status type
information in memory on ZooKeeper servers.
Components of ZooKeeper
● Client - a node in our distributed application cluster; accesses information from the server, and
interacts with the server to confirm that the connection is established.
● Server - a node in our ZooKeeper ensemble; provides all the services to clients, and sends
acknowledgements to inform clients that it is alive.
● Ensemble - a group of ZooKeeper servers. The minimum number of nodes required to form an
ensemble is 3.
● Leader - the server node that performs automatic recovery if any of the connected nodes fails.
Leaders are elected on service startup.
● Follower - a server node that follows the leader's instructions.
Znode(ZooKeeper Node)
Every znode in the ZooKeeper data model maintains a stat structure, which consists of:
● Version number - every time the data associated with the znode changes, its corresponding version
number is updated.
● Access Control List (ACL) - the authentication mechanism for accessing the znode, governing its
read/write operations.
● Timestamp - records the time of znode creation and modification.
● Data length - the amount of data stored in the znode; maximum 1 MB.
ZooKeeper
● The ZooKeeper Command Line Interface (CLI) is used to interact with the ZooKeeper ensemble for
development and debugging purposes.
● To perform ZooKeeper CLI operations, first start the server and then the client; the client can then
perform the following operations:
● Create a znode
○ Ephemeral znodes (flag -e): deleted once the session expires.
○ Sequential znodes (flag -s): given a unique sequence-numbered path.
● Get data
● Watch znode
● Set data
● Create children of a znode
● List children of a znode
● Check Status
● Delete a znode
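The operations above map directly onto CLI commands; a sample session (the paths and data values are illustrative):

```
create /app "v1"           # create a persistent znode
create -e /app/lock ""     # ephemeral: deleted when the session expires
create -s /app/task- ""    # sequential: a unique numeric suffix is appended
get /app                   # read the data and the stat structure
set /app "v2"              # update the data; bumps the version number
ls /app                    # list children of a znode
stat /app                  # check status
delete /app/lock           # delete a znode
```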
ZooKeeper
Using the ZooKeeper API, an application can connect, interact, manipulate data, coordinate, and finally
disconnect from a ZooKeeper ensemble.
● A rich set of features exposes all the functionality of the ZooKeeper ensemble in a simple and safe
manner.
● The ZooKeeper API provides a small set of methods to manipulate all the details of znodes in the
ensemble.
Steps followed to interact with ZooKeeper:
1. Connect to the ZooKeeper ensemble. The ensemble assigns a session ID to the client.
2. Send heartbeats to the server periodically; otherwise, the ensemble expires the session ID
and the client needs to reconnect.
3. Get / Set the znodes.
4. Disconnect once all the tasks are completed.
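A minimal sketch of these steps using the ZooKeeper Java client. This assumes the org.apache.zookeeper client library is on the classpath and a server is reachable at localhost:2181; the paths and data are illustrative:

```java
import org.apache.zookeeper.*;

public class ZkSketch {
    public static void main(String[] args) throws Exception {
        // 1. connect; the ensemble assigns a session ID, and the client
        //    library sends heartbeats for us (step 2) within the timeout
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});

        // 3. get/set znodes
        zk.create("/app", "v1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        byte[] data = zk.getData("/app", false, null);

        // 4. disconnect once all the tasks are completed
        zk.close();
    }
}
```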
Resources and Video Links
1. Apache PIG: http://pig.apache.org
2. Apache Hive: https://hive.apache.org/
3. Apache Zookeeper: https://zookeeper.apache.org/
4. HBase architecture: https://www.edureka.co/blog/hbase-architecture/
PIG- https://youtu.be/rxnXHlaSohM
HIVE- https://youtu.be/uY7Rr7ru9E4
HBase- https://youtu.be/kN01ELCAsn8
ZooKeeper- https://youtu.be/Kgf9EjTNucM
THANK YOU!!!
Questions???