Unit-V BIG DATA ANALYTICS
Introduction to Apache PIG
Pig represents Big Data as data flows. Pig is a high-level platform or tool which is used to
process large datasets. It provides a high level of abstraction for processing over
MapReduce. It provides a high-level scripting language, known as Pig Latin, which is used
to develop the data analysis code.
Evolution of Pig: In 2006, Apache Pig was developed by Yahoo's researchers. At
that time, the main idea behind developing Pig was to execute MapReduce jobs on extremely
large datasets. In the year 2007, it moved to the Apache Software Foundation (ASF), which
made it an open-source project. The first version (0.1) of Pig came in the year 2008. The
latest version of Apache Pig is 0.17, which came in the year 2017.
Programmers with SQL knowledge need less effort to learn Pig Latin because:
• It uses a query approach, which reduces the length of the code.
• Pig Latin is an SQL-like language.
• It provides many built-in operators.
• It provides nested data types (tuples, bags, maps).
Features of Apache Pig:
• For performing several operations, Apache Pig provides a rich set of operators such as
filtering, joining, sorting, and aggregation.
• Easy to learn, read and write. Especially for SQL programmers, Apache Pig is a boon.
• Apache Pig is extensible, so you can build your own processing logic as user-defined
functions (UDFs) written in Python, Java or other programming languages.
• Join operations are easy in Apache Pig.
• Fewer lines of code.
• Apache Pig allows splits in the pipeline.
• By integrating with other components of the Apache Hadoop ecosystem, such as
Apache Hive, Apache Spark, and Apache ZooKeeper, Apache Pig enables users to take
advantage of these components’ capabilities while transforming data.
• The data structure is multivalued, nested, and richer.
• Pig can handle the analysis of both structured and unstructured data.
Applications of Apache Pig:
• Pig scripting is used for exploring large datasets.
• It provides support for ad-hoc queries across large datasets.
• It is used in prototyping algorithms for processing large datasets.
• It is used to process time-sensitive data loads.
• It is used for collecting large amounts of data in the form of search logs and web crawls.
• It is used where analytical insights are needed through sampling.
Types of Data Models in Apache Pig: It consists of the following 4 types of data models:
• Atom: An atomic data value, stored as a string. The main advantage of this model is that
the value can be used both as a number and as a string.
• Tuple: An ordered set of fields.
• Bag: A collection of tuples.
• Map: A set of key/value pairs.
Apache Pig Execution Modes
We can run Apache Pig in two modes, namely, Local Mode and HDFS mode.
Local Mode
In this mode, all the files are installed and run from your local host and local file system.
There is no need for Hadoop or HDFS. This mode is generally used for testing purposes.
MapReduce Mode
MapReduce mode is where we load or process the data that exists in the Hadoop File System
(HDFS) using Apache Pig. In this mode, whenever we execute the Pig Latin statements to
process the data, a MapReduce job is invoked in the back-end to perform a particular
operation on the data that exists in the HDFS.
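For example, assuming Pig is installed and on the PATH, the execution mode can be chosen
with the -x flag when invoking Pig (a minimal sketch; these are the standard flag values):
$ pig -x local        # start the Grunt shell against the local file system
$ pig -x mapreduce    # start the Grunt shell against HDFS (the default mode)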
Apache Pig Execution Mechanisms
Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and
embedded mode.
• Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using
the Grunt shell. In this shell, you can enter the Pig Latin statements and get the output
(using Dump operator).
• Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig
Latin script in a single file with .pig extension.
• Embedded Mode (UDF) − Apache Pig provides the provision of defining our own
functions (User Defined Functions) in programming languages such as Java, and
using them in our script.
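For illustration of the first two mechanisms (the file names here are hypothetical),
interactive mode means typing statements at the Grunt prompt, while batch mode means saving
the same statements in a .pig file and passing it to Pig:
grunt> A = LOAD 'data.txt';
grunt> DUMP A;
$ pig -x mapreduce sample_script.pig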
Apache Pig Vs SQL Database
Listed below are the major differences between Apache Pig and SQL.
Pig
• Pig Latin is a procedural language.
• In Apache Pig, schema is optional. We can store data without designing a schema
(fields are then referenced positionally as $0, $1, etc.).
• The data model in Apache Pig is nested relational.
• Apache Pig provides limited opportunity for query optimization.
• An open-source, high-level data flow language with a multi-query approach.
• Suitable for complex as well as nested data structures.
• Handles semi-structured and structured data.
• Pig works on top of MapReduce.
• No strict requirement of a schema to store data.
SQL
• SQL is a declarative language.
• Schema is mandatory in SQL.
• The data model used in SQL is flat relational.
• There is more opportunity for query optimization in SQL.
• A general-purpose database language for analytical and transactional queries.
• A domain-specific language for relational database management systems.
• Not compatible with MapReduce programming.
• Strict use of schemas when storing data.
In addition to the above differences, Apache Pig Latin −
• Allows splits in the pipeline.
• Allows developers to store data anywhere in the pipeline.
• Declares execution plans.
• Provides operators to perform ETL (Extract, Transform, and Load) functions.
Grunt shell
Grunt is Pig's interactive shell. It enables users to enter Pig Latin interactively and provides a
shell for users to interact with HDFS. To enter Grunt, invoke Pig with no script or command
to run.
The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts. Apart from that, we can
invoke any shell commands using sh and fs.
sh Command
Using the sh command, we can invoke any shell command from the Grunt shell. However, we
cannot execute commands that are a part of the shell environment itself (e.g., cd).
Syntax
Given below is the syntax of sh command.
grunt> sh shell command parameters
Example
We can invoke the ls command of Linux shell from the Grunt shell using the sh option as
shown below. In this example, it lists out the files in the /pig/bin/ directory.
grunt> sh ls
pig
pig_1444799121955.log
pig.cmd
pig.py
fs Command
Using the fs command, we can invoke any FsShell commands from the Grunt shell.
Syntax
Given below is the syntax of fs command.
grunt> fs File System command parameters
Example
We can invoke the ls command of HDFS from the Grunt shell using fs command.
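For instance, listing the contents of the user's HDFS home directory from the Grunt shell
(the output will vary by cluster):
grunt> fs -ls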
Utility Commands
The Grunt shell provides a set of utility commands. These include utility commands such
as clear, help, history, quit, and set; and commands such as exec, kill, and run to control
Pig from the Grunt shell. Given below is the description of the utility commands provided by
the Grunt shell.
clear Command
The clear command is used to clear the screen of the Grunt shell.
Syntax
We can clear the screen of the grunt shell using the clear command as shown below.
grunt> clear
help Command
The help command gives a list of Pig commands or Pig properties.
history Command
This command displays a list of statements executed/used so far since the Grunt shell was
invoked.
set Command
The set command is used to show/assign values to keys used in Pig.
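For example (the values shown are illustrative; job.name and default_parallel are standard
Pig properties):
grunt> set job.name 'student analysis'
grunt> set default_parallel 10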
quit Command
We can quit from the Grunt shell using this command.
grunt> quit
exec Command
Using the exec command, we can execute Pig scripts from the Grunt shell.
Syntax
Given below is the syntax of the utility command exec.
grunt> exec [-param param_name = param_value] [-param_file file_name] [script]
run Command
We can run a Pig script from the Grunt shell using the run command.
Syntax
Given below is the syntax of the run command.
grunt> run [-param param_name = param_value] [-param_file file_name] script
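For example (the script path is hypothetical), both commands execute a script stored on
HDFS; the difference is that run executes the script in the current Grunt shell context, so
the relations it defines remain accessible afterwards, whereas exec runs it in a separate
context:
grunt> exec /scripts/sample_script.pig
grunt> run /scripts/sample_script.pig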
Pig Latin
Pig Latin is the language used to analyze data in Hadoop using Apache Pig. In this chapter,
we are going to discuss the basics of Pig Latin such as Pig Latin statements, data types,
general and relational operators, and Pig Latin UDF’s.
Pig Latin – Data Model
The data model of Pig is fully nested. A relation is the outermost structure of the Pig Latin
data model, and it is a bag, where −
• A bag is a collection of tuples.
• A tuple is an ordered set of fields.
• A field is a piece of data.
Pig Latin – Statements
While processing data using Pig Latin, statements are the basic constructs.
• These statements work with relations. They include expressions and schemas.
• Every statement ends with a semicolon (;).
• We will perform various operations using operators provided by Pig Latin, through
statements.
• Except for LOAD and STORE, while performing all other operations, Pig Latin
statements take a relation as input and produce another relation as output.
• As soon as you enter a Load statement in the Grunt shell, its semantic checking will
be carried out. To see the contents of the relation, you need to use the Dump operator.
Only after performing the dump operation will the MapReduce job for loading the data
be carried out.
Example
Given below is a Pig Latin statement, which loads data to Apache Pig.
grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
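The relation can then be inspected; Dump triggers the underlying MapReduce job and prints
the tuples on the console:
grunt> Dump Student_data;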
Pig Latin – Data types
The table below describes the Pig Latin data types.
1. int − Represents a signed 32-bit integer. Example: 8
2. long − Represents a signed 64-bit integer. Example: 5L
3. float − Represents a signed 32-bit floating point. Example: 5.5F
4. double − Represents a 64-bit floating point. Example: 10.5
5. chararray − Represents a character array (string) in Unicode UTF-8 format. Example: 'tutorials point'
6. bytearray − Represents a byte array (blob).
7. boolean − Represents a Boolean value. Example: true/false
8. datetime − Represents a date-time. Example: 1970-01-01T00:00:00.000+00:00
9. biginteger − Represents a Java BigInteger. Example: 60708090709
10. bigdecimal − Represents a Java BigDecimal. Example: 185.98376256272893883
Complex Types
11. tuple − A tuple is an ordered set of fields. Example: (raja, 30)
12. bag − A bag is a collection of tuples. Example: {(raju,30),(Mohammad,45)}
13. map − A map is a set of key-value pairs. Example: ['name'#'Raju', 'age'#30]
Null Values
Values for all the above data types can be NULL. Apache Pig treats null values in a similar
way as SQL does.
A null can be an unknown value or a non-existent value. It is used as a placeholder for
optional values. These nulls can occur naturally or can be the result of an operation.
Pig Latin – Arithmetic Operators
The following table describes the arithmetic operators of Pig Latin. Suppose a = 10 and b =
20.
+ (Addition) − Adds values on either side of the operator. Example: a + b will give 30
− (Subtraction) − Subtracts the right-hand operand from the left-hand operand. Example: a − b will give −10
* (Multiplication) − Multiplies values on either side of the operator. Example: a * b will give 200
/ (Division) − Divides the left-hand operand by the right-hand operand. Example: b / a will give 2
% (Modulus) − Divides the left-hand operand by the right-hand operand and returns the remainder. Example: b % a will give 0
?: (Bincond) − Evaluates the Boolean operators. It has three operands, as shown below:
variable x = (expression) ? value1 if true : value2 if false.
Example: b = (a == 1) ? 20 : 30; if a = 1 the value of b is 20; if a != 1 the value of b is 30.
CASE WHEN THEN ELSE END (Case) − The case operator is equivalent to the nested bincond operator.
Example: CASE f2 % 2 WHEN 0 THEN 'even' WHEN 1 THEN 'odd' END
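As a small sketch of these operators inside a Pig Latin statement (reusing the Student_data
relation loaded earlier; the derived field name parity is illustrative):
grunt> X = FOREACH Student_data GENERATE id, ((id % 2 == 0) ? 'even' : 'odd') AS parity;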
Pig Latin – Comparison Operators
The following table describes the comparison operators of Pig Latin.
== (Equal) − Checks if the values of two operands are equal or not; if yes, then the condition becomes true. Example: (a = b) is not true
!= (Not Equal) − Checks if the values of two operands are equal or not. If the values are not equal, then the condition becomes true. Example: (a != b) is true
> (Greater than) − Checks if the value of the left operand is greater than the value of the right operand. If yes, then the condition becomes true. Example: (a > b) is not true
< (Less than) − Checks if the value of the left operand is less than the value of the right operand. If yes, then the condition becomes true. Example: (a < b) is true
>= (Greater than or equal to) − Checks if the value of the left operand is greater than or equal to the value of the right operand. If yes, then the condition becomes true. Example: (a >= b) is not true
<= (Less than or equal to) − Checks if the value of the left operand is less than or equal to the value of the right operand. If yes, then the condition becomes true. Example: (a <= b) is true
matches (Pattern matching) − Checks whether the string on the left-hand side matches the constant on the right-hand side. Example: f1 matches '.*tutorial.*'
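A comparison operator is typically used inside a FILTER or bincond expression; for instance
(the city value is illustrative, reusing the Student_data relation):
grunt> chennai_students = FILTER Student_data BY city == 'Chennai';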
Pig Latin – Type Construction Operators
The following table describes the Type construction operators of Pig Latin.
() (Tuple constructor) − This operator is used to construct a tuple. Example: (Raju, 30)
{} (Bag constructor) − This operator is used to construct a bag. Example: {(Raju, 30), (Mohammad, 45)}
[] (Map constructor) − This operator is used to construct a map. Example: [name#Raja, age#30]
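For example, the tuple constructor can be used inside a FOREACH to nest fields of an
existing relation (again reusing the Student_data relation as a sketch):
grunt> names = FOREACH Student_data GENERATE id, (firstname, lastname);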
Pig Latin – Relational Operations
The following table describes the relational operators of Pig Latin.
Loading and Storing
LOAD − To load the data from the file system (local/HDFS) into a relation.
STORE − To save a relation to the file system (local/HDFS).
Filtering
FILTER − To remove unwanted rows from a relation.
DISTINCT − To remove duplicate rows from a relation.
FOREACH, GENERATE − To generate data transformations based on columns of data.
STREAM − To transform a relation using an external program.
Grouping and Joining
JOIN − To join two or more relations.
COGROUP − To group the data in two or more relations.
GROUP − To group the data in a single relation.
CROSS − To create the cross product of two or more relations.
Sorting
ORDER − To arrange a relation in a sorted order based on one or more fields (ascending or descending).
LIMIT − To get a limited number of tuples from a relation.
Combining and Splitting
UNION − To combine two or more relations into a single relation.
SPLIT − To split a single relation into two or more relations.
Diagnostic Operators
DUMP − To print the contents of a relation on the console.
DESCRIBE − To describe the schema of a relation.
EXPLAIN − To view the logical, physical, or MapReduce execution plans used to compute a relation.
ILLUSTRATE − To view the step-by-step execution of a series of statements.
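A short pipeline combining several of these operators (a sketch reusing the Student_data
relation loaded earlier):
grunt> by_city = GROUP Student_data BY city;
grunt> city_counts = FOREACH by_city GENERATE group, COUNT(Student_data);
grunt> DUMP city_counts;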
Apache Pig - User Defined Functions
In addition to the built-in functions, Apache Pig provides extensive support
for User Defined Functions (UDFs). Using these UDFs, we can define our own functions
and use them. UDF support is provided in six programming languages, namely, Java,
Jython, Python, JavaScript, Ruby and Groovy.
For writing UDFs, complete support is provided in Java and limited support is provided in
all the remaining languages. Using Java, we can write UDFs involving all parts of the
processing like data load/store, column transformation, and aggregation. Since Apache Pig
has been written in Java, UDFs written in Java work more efficiently than those written in
other languages.
In Apache Pig, we also have a Java repository for UDFs named Piggybank. Using
Piggybank, we can access Java UDFs written by other users, and contribute our own UDFs.
Types of UDFs in Java
While writing UDFs using Java, we can create and use the following three types of functions
−
• Filter Functions − The filter functions are used as conditions in filter statements.
These functions accept a Pig value as input and return a Boolean value.
• Eval Functions − The Eval functions are used in FOREACH-GENERATE
statements. These functions accept a Pig value as input and return a Pig result.
• Algebraic Functions − The Algebraic functions act on inner bags in a
FOREACH-GENERATE statement. These functions are used to perform full
MapReduce operations on an inner bag.
Writing UDFs using Java
To write a UDF using Java, we have to integrate the jar file Pig-0.15.0.jar. In this section, we
discuss how to write a sample UDF using Eclipse. Before proceeding further, make sure you
have installed Eclipse and Maven on your system.
Follow the steps given below to write a UDF function −
• Open Eclipse and create a new project (say myproject).
• Convert the newly created project into a Maven project.
• Copy the following content in the pom.xml. This file contains the Maven
dependencies for Apache Pig and Hadoop-core jar files.
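The pom.xml contents are not reproduced here. Once the project is set up, a filter UDF might
look like the following sketch (the class name and the even-id check are purely illustrative;
only the org.apache.pig.FilterFunc base class and its exec(Tuple) contract are assumed):

import java.io.IOException;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

// Illustrative filter UDF: keeps tuples whose first field is an even integer.
public class IsEvenId extends FilterFunc {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        // Reject null or empty input rather than failing the job.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return false;
        }
        int id = (Integer) input.get(0);
        return id % 2 == 0;
    }
}

After packaging the class into a jar, it would be registered with the REGISTER statement and
invoked by its fully qualified class name inside a FILTER statement in Pig Latin.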
Data Processing Operators :
Pig Latin, with its operators, is a high-level procedural language for querying large data sets
using Hadoop and the MapReduce platform.
A Pig Latin statement is an operator that takes a relation as input and produces another
relation as output.
These operators are the main tools Pig Latin provides to operate on the data.
They allow you to transform data by sorting, grouping, joining, projecting, and filtering.
The Apache Pig operators can be classified as :
Relational Operators :
Relational operators are the main tools Pig Latin provides to operate on the data.
Some of the Relational Operators are :
LOAD: The LOAD operator is used to load data from the file system or HDFS storage
into a Pig relation.
FOREACH: This operator generates data transformations based on columns of data. It is
used to add or remove fields from a relation.
FILTER: This operator selects tuples from a relation based on a condition.
JOIN: The JOIN operator is used to perform an inner, equijoin of two or more relations
based on common field values.
ORDER BY: Order By is used to sort a relation based on one or more fields in either
ascending or descending order using ASC and DESC keywords.
GROUP: The GROUP operator groups together the tuples with the same group key (key
field).
COGROUP: COGROUP is the same as the GROUP operator. For readability, programmers
usually use GROUP when only one relation is involved and COGROUP when multiple
relations are involved.
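As a sketch of JOIN and ORDER BY together (the second relation and its input file are
hypothetical; Student_data is the relation loaded earlier):
grunt> Marks_data = LOAD 'marks_data.txt' USING PigStorage(',') as (id:int, marks:int);
grunt> joined = JOIN Student_data BY id, Marks_data BY id;
grunt> ranked = ORDER joined BY marks DESC;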
Diagnostic Operator :
The load statement will simply load the data into the specified relation in Apache Pig.
To verify the execution of the Load statement, you have to use the Diagnostic Operators.
Some Diagnostic Operators are :
DUMP: The DUMP operator is used to run Pig Latin statements and display the results on
the screen.
DESCRIBE: Use the DESCRIBE operator to review the schema of a particular relation. The
DESCRIBE operator is best used for debugging a script.
ILLUSTRATE: ILLUSTRATE operator is used to review how data is transformed through
a sequence of Pig Latin statements. ILLUSTRATE command is your best friend when it
comes to debugging a script.
EXPLAIN: The EXPLAIN operator is used to display the logical, physical, and MapReduce
execution plans of a relation.
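For instance, applied to the relation loaded earlier (the exact output depends on the data and
cluster):
grunt> DESCRIBE Student_data;
grunt> EXPLAIN Student_data;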
Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides
on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
hive is a shell utility which can be used to run Hive queries in either interactive or batch
mode. HiveServer2 (introduced in Hive 0.11) has its own CLI called Beeline, which is a
JDBC client based on SQLLine.
Hive Command Line Options
To get help, run "hive -H" or "hive --help". Usage (as it is in Hive 0.9.0):
usage: hive
-d,--define <key=value> − Variable substitution to apply to Hive commands, e.g. -d A=B or --define A=B
-e <quoted-query-string> − SQL from the command line
-f <filename> − SQL from files
-H,--help − Print help information
-h <hostname> − Connect to Hive Server on a remote host
--hiveconf <property=value> − Use the given value for the given property
--hivevar <key=value> − Variable substitution to apply to Hive commands, e.g. --hivevar A=B
-i <filename> − Initialization SQL file
-p <port> − Connect to Hive Server on a port number
-S,--silent − Silent mode in interactive shell
-v,--verbose − Verbose mode (echo executed SQL to the console)
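For example (the table, query and file names are illustrative; the flags are the ones listed
above):
hive -e 'SELECT * FROM students LIMIT 10;'
hive -f report.hql --hiveconf mapred.reduce.tasks=4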
Hive Services :
The following are the services provided by Hive :
· Hive CLI: The Hive CLI (Command Line Interface) is a shell where we can execute
Hive queries and commands.
· Hive Web User Interface: The Hive Web UI is just an alternative of Hive CLI. It
provides a web-based GUI for executing Hive queries and commands.
· Hive metastore: It is a central repository that stores all the structure information of
various tables and partitions in the warehouse. It also includes metadata about columns and their
type information, the serializers and deserializers used to read and write data, and the
corresponding HDFS files where the data is stored.
· Hive Server: It is referred to as Apache Thrift Server. It accepts the request from
different clients and provides it to Hive Driver.
· Hive Driver: It receives queries from different sources like web UI, CLI, Thrift, and
JDBC/ODBC driver. It transfers the queries to the compiler.
· Hive Compiler: The purpose of the compiler is to parse the query and perform semantic
analysis on the different query blocks and expressions. It converts HiveQL statements into
MapReduce jobs.
· Hive Execution Engine: The optimizer generates the logical plan in the form of a DAG of
map-reduce tasks and HDFS tasks. In the end, the execution engine executes the incoming
tasks in the order of their dependencies.
MetaStore :
Hive metastore (HMS) is a service that stores Apache Hive and other metadata in a backend
RDBMS, such as MySQL or PostgreSQL.
Impala, Spark, Hive, and other services share the metastore.
The connections to and from HMS include HiveServer, Ranger, and the NameNode, which
represents HDFS.
Beeline, Hue, JDBC, and Impala shell clients make requests through thrift or JDBC to
HiveServer.
The HiveServer instance reads/writes data to HMS.
By default, redundant HMS operate in active/active mode.
The physical data resides in a backend RDBMS, one for HMS.
All connections are routed to a single RDBMS service at any given time.
HMS talks to the NameNode over thrift and functions as a client to HDFS.
HMS connects directly to Ranger and the NameNode (HDFS), and so does HiveServer.
One or more HMS instances on the backend can talk to other services, such as Ranger.
Comparison with Traditional Database :
• An RDBMS is used to maintain a database; Hive is used to maintain a data warehouse.
• An RDBMS uses SQL (Structured Query Language); Hive uses HQL (Hive Query Language).
• Schema is fixed in an RDBMS; in Hive the schema can vary.
• An RDBMS stores normalized data; Hive stores both normalized and de-normalized data.
• Tables in an RDBMS are sparse; tables in Hive are dense.
• An RDBMS doesn't support partitioning; Hive supports automatic partitioning.
• No partition method is used in an RDBMS; Hive uses the sharding method for partitioning.
HiveQL
HiveQL is a query language for Hive to analyze and process structured data in a Metastore.
It is a mixture of SQL-92, MySQL, and Oracle's SQL. It is very much similar to
SQL and highly scalable. It reuses familiar concepts from the relational database world,
such as tables, rows, columns and schema, to ease learning. Hive supports four file formats:
TEXTFILE, SEQUENCEFILE, ORC and RCFILE (Record Columnar File).
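A minimal HiveQL sketch (the table, columns, and file format are illustrative):
CREATE TABLE students (id INT, name STRING, city STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

SELECT city, COUNT(*) FROM students GROUP BY city;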
Hive internal tables vs external tables
There are two types of tables that you can create with Hive:
• Internal: Data is stored in the Hive data warehouse. The data warehouse is located
at /hive/warehouse/ on the default storage for the cluster.
Use internal tables when one of the following conditions applies:
o Data is temporary.
o You want Hive to manage the lifecycle of the table and data.
• External: Data is stored outside the data warehouse. The data can be stored on any
storage accessible by the cluster.
Use external tables when one of the following conditions applies (see the sketch after this list):
o The data is also used outside of Hive. For example, the data files are updated by
another process (that doesn't lock the files).
o Data needs to remain in the underlying location, even after dropping the table.
o We need a custom location, such as a non-default storage account.
o A program other than Hive manages the data format, location, and so on.
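For instance, an external table can point at data that already lives at a known location (the
path and columns here are illustrative):
CREATE EXTERNAL TABLE web_logs (line STRING)
LOCATION '/data/logs/';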
Querying data in Hive
Hive enables data summarization, querying, and analysis of data. Hive queries are written in
HiveQL, which is a query language similar to SQL. Hive allows you to project structure on
largely unstructured data. After you define the structure, you can use HiveQL to query the
data without knowledge of Java or MapReduce.
User-defined functions (UDF)
Hive can also be extended through user-defined functions (UDF). A UDF allows you to
implement functionality or logic that isn't easily modeled in HiveQL. For an example of
using UDFs with Hive, see the following documents:
• Use a Java user-defined function with Apache Hive
• Use a Python user-defined function with Apache Hive
• Use a C# user-defined function with Apache Hive
• How to add a custom Apache Hive user-defined function to HDInsight
• An example Apache Hive user-defined function to convert date/time formats to Hive
timestamp
What is HBase?
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is
an open-source project and is horizontally scalable.
HBase is a data model, similar to Google's Bigtable, designed to provide quick random
access to huge amounts of structured data. It leverages the fault tolerance provided by the
Hadoop File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access to data
in the Hadoop File System.
One can store the data in HDFS either directly or through HBase. Data consumer
reads/accesses the data in HDFS randomly using HBase. HBase sits on top of the Hadoop
File System and provides read and write access.
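As a sketch of this read/write access using the HBase shell (the table, column family, and
values are illustrative):
hbase> create 'students', 'info'
hbase> put 'students', 'row1', 'info:name', 'Raju'
hbase> get 'students', 'row1'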
• HDFS is a distributed file system suitable for storing large files; HBase is a database built on top of the HDFS.
• HDFS does not support fast individual record lookups; HBase provides fast lookups for larger tables.
• HDFS provides high latency batch processing; HBase provides low latency access to single rows from billions of records (random access).
• HDFS provides only sequential access of data; HBase internally uses hash tables and provides random access, and it stores the data in indexed HDFS files for faster lookups.
Features of HBase
• HBase is linearly scalable.
• It has automatic failure support.
• It provides consistent read and writes.
• It integrates with Hadoop, both as a source and a destination.
• It has an easy Java API for clients.
• It provides data replication across clusters.
Where to Use HBase
• Apache HBase is used to have random, real-time read/write access to Big Data.
• It hosts very large tables on top of clusters of commodity hardware.
• Apache HBase is a non-relational database modeled after Google's Bigtable. Bigtable
acts upon the Google File System; likewise, Apache HBase works on top of Hadoop and
HDFS.
Applications of HBase
• It is used whenever there is a need for write-heavy applications.
• HBase is used whenever we need to provide fast random access to available data.
• Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
HBase Vs. RDBMS
While comparing HBase with Traditional Relational databases, we have to take three key
areas into consideration. Those are data model, data storage, and data diversity.
• HBase is schema-less; an RDBMS has a fixed schema.
• HBase is a column-oriented datastore; an RDBMS is a row-oriented datastore.
• HBase is designed to store de-normalized data; an RDBMS is designed to store normalized data.
• HBase has wide and sparsely populated tables; an RDBMS contains thin tables.
• HBase supports automatic partitioning; an RDBMS has no built-in support for partitioning.
• HBase is well suited for OLAP systems; an RDBMS is well suited for OLTP systems.
• HBase reads only the relevant data from the database; an RDBMS retrieves one row at a time and hence could read unnecessary data if only some of the data in a row is required.
• Structured and semi-structured data can be stored and processed using HBase; structured data can be stored and processed using an RDBMS.
• HBase enables aggregation over many rows and columns; in an RDBMS, aggregation is an expensive operation.
Big SQL
Big SQL is a high performance massively parallel processing (MPP) SQL engine for Hadoop
that makes querying enterprise data from across the organization an easy and secure
experience. A Big SQL query can quickly access a variety of data sources including HDFS,
RDBMS, NoSQL databases, object stores, and WebHDFS by using a single database
connection or single query for best-in-class analytic capabilities.
What Big SQL looks like
Big SQL provides tools to help you manage your system and your databases, and you can use
popular analytic tools to visualize your data.
How Big SQL works
Big SQL's robust engine executes complex queries for relational data and Hadoop data. Big
SQL provides an advanced SQL compiler and a cost-based optimizer for efficient query
execution. Combining these with a massively parallel processing (MPP) engine helps distribute
query execution across nodes in a cluster.