
Chapter 1 - Apache Pig

This document provides an introduction to Apache Pig, a platform for analyzing large datasets. It discusses how Pig works by taking Pig Latin scripts written by users and converting them into sequences of MapReduce jobs. Pig provides a high-level language and execution framework for writing data flows and performs parallelization behind the scenes in Hadoop. The document outlines Pig's architecture and components and how it fits within the Hadoop ecosystem.

Uploaded by Hai Do Viet

BIG DATA ANALYSIS

APACHE PIG

Le Thi Minh Chau


Faculty Of Information Technology
HCMC University Of Technology And Education
Module Contents
• Introduction to Big Data and Hadoop
• Introduction to Pig
• Hadoop Pig Architecture

Big Data and its Challenges
• Big data is a term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
• Systems and enterprises generate huge amounts of data, from terabytes up to petabytes of information.
• It is very difficult to manage data at this scale.

Why Hadoop?

Hadoop and its Characteristics
• Apache Hadoop is a framework that allows the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
• It is an open-source data management technology with scale-out storage and distributed processing.

Introduction to Hadoop
• HDFS
  - Hadoop Distributed File System.
  - A distributed, scalable, and portable file system written in Java for the Hadoop framework.
  - Provides high-throughput access to application data.
  - Runs on large clusters of commodity machines.
  - Is used to store large datasets.

Introduction to Hadoop
• MapReduce
  - Distributed data processing model and execution environment that runs on large clusters of commodity machines.
  - Also called MR.
  - Programs are inherently parallel.

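The MapReduce model described above can be sketched in a few lines of Python. This is a single-process illustration only, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample sentences are assumptions made for the example. It shows why MR programs are inherently parallel: each line is mapped independently, and each key is reduced independently.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # On a cluster each line could be mapped on a different machine;
    # here we simply loop over the lines in one process.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["pig runs on hadoop", "hadoop runs mapreduce"]))
```

On a real cluster the shuffle step is performed by the framework between the map and reduce phases; the programmer normally writes only the map and reduce functions.
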
Hadoop Ecosystem

What is Pig?
• Pig is an open-source data flow language and execution framework.
• Pig Latin is used to express queries and data manipulation operations in simple scripts.
• Pig converts the scripts into a sequence of underlying MapReduce jobs.

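To give a feel for what "data flow" means here, the following Python sketch mirrors a short Pig Latin script step by step. The Pig Latin statements in the comments are illustrative, and the relation names, field names, and sample records are hypothetical; they are not taken from the slides. Each statement takes a relation and produces a new one, which is the style Pig Latin scripts follow.

```python
from collections import defaultdict

# Hypothetical sample records: (year, temperature, quality), quality 1 = good.
RECORDS = [
    ("1950", 0, 1),
    ("1950", 22, 1),
    ("1950", -11, 0),
    ("1949", 111, 1),
    ("1949", 78, 1),
]


def max_temperature_by_year(records):
    # Pig Latin (illustrative): filtered = FILTER records BY quality == 1;
    filtered = [record for record in records if record[2] == 1]

    # Pig Latin (illustrative): grouped = GROUP filtered BY year;
    grouped = defaultdict(list)
    for year, temperature, _quality in filtered:
        grouped[year].append(temperature)

    # Pig Latin (illustrative):
    #   result = FOREACH grouped GENERATE group, MAX(filtered.temperature);
    return {year: max(temps) for year, temps in grouped.items()}


if __name__ == "__main__":
    print(max_temperature_by_year(RECORDS))
```

The Python version runs in one process. With Pig, the same three steps would be compiled into MapReduce work and executed on the cluster without the author writing any map or reduce function.
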
Internalizing Pig

Why Pig?

Equivalent Java MapReduce Code

Internalizing Pig

Ways to handle Pig
• Grunt Mode
  - The interactive mode of Pig.
  - Very useful for testing, syntax checking, and ad hoc data exploration.
• Script Mode
  - Runs a set of instructions from a file.
  - Similar to a SQL script file.
• Embedded Mode
  - Executes Pig programs from a Java program.
  - Suitable for creating Pig scripts on the fly.

Modes of Pig
• Local
  - Needs access to only a single machine.
  - All files are installed and run using your local host and file system.
  - Invoked with the -x local flag:
      pig -x local
• MapReduce
  - The default mode.
  - Needs access to a Hadoop cluster and an HDFS installation.
  - Invoked with the -x mapreduce flag, or with plain pig:
      pig
      pig -x mapreduce

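The combinations of execution mode and invocation style described above can be summarized with a small Python helper that builds the command line. This is a sketch under stated assumptions: `pig_command` is a hypothetical helper name, and `wordcount.pig` is a hypothetical script file. The helper only builds the argument list and does not launch Pig.

```python
def pig_command(mode="mapreduce", script=None):
    """Build the argument list for launching Pig.

    mode   -- "local" or "mapreduce", the two modes listed on the slide
    script -- path to a Pig Latin script (Script Mode); None gives the
              command that starts the interactive Grunt shell
    """
    if mode not in ("local", "mapreduce"):
        raise ValueError(f"unknown execution mode: {mode!r}")
    command = ["pig", "-x", mode]
    if script is not None:
        command.append(script)
    return command


if __name__ == "__main__":
    # Grunt shell against the local file system.
    print(" ".join(pig_command("local")))
    # Script Mode on the cluster (hypothetical script name).
    print(" ".join(pig_command("mapreduce", "wordcount.pig")))
```

On a machine where Pig is installed, the returned list could be passed to `subprocess.run` to start Pig.
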
Pig Components

Pig Programs Execution
• Pig is a wrapper on top of the MapReduce layer.
• It parses, optimizes, and converts the Pig script into a series of MapReduce jobs.

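The conversion step can be pictured with a toy planner in Python. This is a deliberately simplified sketch, not Pig's actual compiler: the `BLOCKING` set, the `plan_jobs` function, and the rule "one shuffle per job" are assumptions made for illustration, and parsing and optimization are left out entirely. The idea shown is that operators needing a shuffle mark the boundary between a map phase and a reduce phase, and a second such operator forces a new job.

```python
# Operators that need all records for a key brought together (a shuffle).
# This set is a simplifying assumption for the sketch.
BLOCKING = {"GROUP", "JOIN", "ORDER"}


def plan_jobs(operators):
    """Split a linear list of operator names into MapReduce-style jobs.

    Each job is a dict with a "map" list and a "reduce" list. A blocking
    operator opens the reduce phase of the current job; a second blocking
    operator cannot fit in the same job, so it starts a new one.
    """
    jobs = [{"map": [], "reduce": []}]
    for op in operators:
        current = jobs[-1]
        if op in BLOCKING:
            if current["reduce"]:
                # This job has already shuffled once: start another job.
                current = {"map": [], "reduce": []}
                jobs.append(current)
            current["reduce"].append(op)
        elif current["reduce"]:
            # Runs on shuffled data, so it stays in the reduce phase.
            current["reduce"].append(op)
        else:
            current["map"].append(op)
    return jobs


if __name__ == "__main__":
    script = ["LOAD", "FILTER", "GROUP", "FOREACH", "ORDER", "STORE"]
    for number, job in enumerate(plan_jobs(script), start=1):
        print(number, job)
```

For the sample operator list, the sketch produces two jobs, which matches the slide's point that one script becomes a series of MapReduce jobs. Pig's real planner is considerably more elaborate than this.
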
Q&A