Chapter 1 - Apache Pig
APACHE PIG
Module Contents
Big Data and its Challenges
Why Hadoop?
Hadoop and its Characteristics
Introduction to Hadoop
HDFS
Hadoop Distributed File System
A distributed, scalable, and portable file system written in Java for the Hadoop framework.
Provides high-throughput access to application data.
Runs on large clusters of commodity machines.
Is used to store large datasets.
Introduction to Hadoop
MapReduce
A distributed data processing model and execution environment that runs on large clusters of commodity machines.
Also called MR.
Programs are inherently parallel.
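The MapReduce model described above can be sketched in a few lines of plain Python. This is only a single-process illustration, not Hadoop code: the word-count task, the sample input, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are assumptions chosen for the example. It shows why such programs are inherently parallel: each map call and each reduce call depends only on its own input.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does
    between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce sequentially in one process.
    On a cluster, the map and reduce calls would run on many machines."""
    pairs = []
    for record in records:  # each record could be mapped independently
        pairs.extend(map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["pig runs on hadoop", "hadoop stores data in hdfs"]
    print(word_count(sample))
```

Because no map call reads another call's input, a framework is free to distribute the records across machines and merge the results in the shuffle step.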
Hadoop Ecosystem
What is Pig?
Internalizing Pig
Why Pig?
Equivalent Java MapReduce Code
Internalizing Pig
Ways to handle Pig
Grunt Mode
The interactive mode (shell) of Pig.
Very useful for syntax checking and ad-hoc data exploration.
Script Mode
Runs a set of instructions from a file.
Similar to a SQL script file.
Embedded Mode
Executes Pig programs from a Java program.
Suitable for creating Pig scripts on the fly.
Modes of Pig
Local
Needs access to only a single machine.
All files are installed and run using your local host and file system.
Is invoked by using the -x local flag.
pig -x local
MapReduce
The default mode.
Needs access to a Hadoop cluster and an HDFS installation.
Can also be invoked by using the -x mapreduce flag or just pig.
pig
pig -x mapreduce
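As a rough sketch of how the two execution modes map onto command lines, the Python helper below builds the argument list for a `pig` invocation. The helper name `pig_command` and the script name `wordcount.pig` are hypothetical; only the `pig` command and its `-x local` and `-x mapreduce` flags come from the text above. The helper only constructs the command and does not run Pig.

```python
import shlex

# Execution modes named in the slide above.
VALID_MODES = ("local", "mapreduce")


def pig_command(mode="mapreduce", script=None):
    """Build the argument list for a pig invocation in the given mode.

    mode   -- "local" or "mapreduce" (mapreduce is Pig's default mode)
    script -- optional path to a Pig script (hypothetical example name below)
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {VALID_MODES}")
    command = ["pig", "-x", mode]
    if script is not None:
        command.append(script)
    return command


if __name__ == "__main__":
    # Print the commands as they would be typed in a shell.
    print(shlex.join(pig_command("local")))
    print(shlex.join(pig_command("mapreduce", "wordcount.pig")))
```

Passing the resulting list to something like `subprocess.run` would launch Pig, assuming the `pig` executable is installed and on the `PATH`.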
Pig Components
Pig Programs Execution
Q&A