Unit-V BIG DATA ANALYTICS
Introduction to Apache PIG
Pig represents Big Data as data flows. Pig is a high-level platform or tool which is used to
process large datasets. It provides a high level of abstraction for processing over
MapReduce. It provides a high-level scripting language, known as Pig Latin, which is used
to develop the data analysis code.
Evolution of Pig: In 2006, Apache Pig was developed by Yahoo's researchers. At
that time, the main idea behind developing Pig was to execute MapReduce jobs on extremely
large datasets. In the year 2007, it moved to the Apache Software Foundation (ASF), which
made it an open-source project. The first version (0.1) of Pig came in the year 2008. The
latest version of Apache Pig is 0.17, which came in the year 2017.
Programmers with SQL knowledge need less effort to learn Pig Latin because:
• It uses a query approach, which reduces the length of the code.
• Pig Latin is an SQL-like language.
• It provides many built-in operators.
• It provides nested data types (tuples, bags, maps).
Features of Apache Pig:
• For performing several operations, Apache Pig provides a rich set of operators such as
filtering, joining, sorting, and aggregation.
• Easy to learn, read and write. Especially for SQL programmers, Apache Pig is a boon.
• Apache Pig is extensible, so you can build your own processing logic as user-defined
functions (UDFs) written in Python, Java or other programming languages.
• Join operations are easy in Apache Pig.
• Fewer lines of code.
• Apache Pig allows splits in the pipeline.
• By integrating with other components of the Apache Hadoop ecosystem, such as
Apache Hive, Apache Spark, and Apache ZooKeeper, Apache Pig enables users to take
advantage of these components’ capabilities while transforming data.
• The data structure is multivalued, nested, and richer.
• Pig can handle the analysis of both structured and unstructured data.
Applications of Apache Pig:
• Pig scripting is used for exploring large datasets.
• It provides support for ad-hoc queries across large datasets.
• It is used in prototyping algorithms for processing large datasets.
• It is used to process time-sensitive data loads.
• It is used for collecting large amounts of data in the form of search logs and web crawls.
• It is used where analytical insights are needed through sampling.
Types of Data Models in Apache Pig: It consists of the following 4 types of data models:
• Atom: An atomic data value, stored as a string. The main advantage of this model is that
the value can be used both as a number and as a string.
• Tuple: An ordered set of fields.
• Bag: A collection of tuples.
• Map: A set of key/value pairs.
Apache Pig Execution Modes
We can run Apache Pig in two modes, namely, Local Mode and HDFS mode.
Local Mode
In this mode, all the files are installed and run from your local host and local file system.
There is no need for Hadoop or HDFS. This mode is generally used for testing purposes.
MapReduce Mode
MapReduce mode is where we load or process the data that exists in the Hadoop File System
(HDFS) using Apache Pig. In this mode, whenever we execute the Pig Latin statements to
process the data, a MapReduce job is invoked in the back-end to perform a particular
operation on the data that exists in the HDFS.
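For example, assuming Pig is installed and on the PATH, the execution mode can be chosen
with the -x flag when invoking Pig (a minimal sketch; these are the standard flag values):
$ pig -x local        # start the Grunt shell against the local file system
$ pig -x mapreduce    # start the Grunt shell against HDFS (the default mode)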
Apache Pig Execution Mechanisms
Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and
embedded mode.
• Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using
the Grunt shell. In this shell, you can enter the Pig Latin statements and get the output
(using Dump operator).
• Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig
Latin script in a single file with .pig extension.
• Embedded Mode (UDF) − Apache Pig provides the provision of defining our own
functions (User Defined Functions) in programming languages such as Java, and
using them in our script.
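For illustration of the first two mechanisms (the file names here are hypothetical),
interactive mode means typing statements at the Grunt prompt, while batch mode means saving
the same statements in a .pig file and passing it to Pig:
grunt> A = LOAD 'data.txt';
grunt> DUMP A;
$ pig -x mapreduce sample_script.pig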
Apache Pig Vs SQL Database
Listed below are the major differences between Apache Pig and SQL.
Pig
• Pig Latin is a procedural language.
• In Apache Pig, schema is optional. We can store data without designing a schema
(fields are then referenced positionally as $0, $1, etc.).
• The data model in Apache Pig is nested relational.
• Apache Pig provides limited opportunity for query optimization.
• An open-source, high-level data flow language with a multi-query approach.
• Suitable for complex as well as nested data structures.
• Handles semi-structured and structured data.
• Pig works on top of MapReduce.
• No strict requirement of a schema to store data.
SQL
• SQL is a declarative language.
• Schema is mandatory in SQL.
• The data model used in SQL is flat relational.
• There is more opportunity for query optimization in SQL.
• A general-purpose database language for analytical and transactional queries.
• A domain-specific language for relational database management systems.
• Not compatible with MapReduce programming.
• Strict use of schemas when storing data.
In addition to the above differences, Apache Pig Latin −
• Allows splits in the pipeline.
• Allows developers to store data anywhere in the pipeline.
• Declares execution plans.
• Provides operators to perform ETL (Extract, Transform, and Load) functions.
Grunt shell
Grunt is Pig's interactive shell. It enables users to enter Pig Latin interactively and provides a
shell for users to interact with HDFS. To enter Grunt, invoke Pig with no script or command
to run.
The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts. Apart from that, we can
invoke any shell commands using sh and fs.
sh Command
Using the sh command, we can invoke any shell command from the Grunt shell. However, we
cannot execute commands that are a part of the shell environment itself (e.g., cd).
Syntax
Given below is the syntax of sh command.
grunt> sh shell command parameters
Example
We can invoke the ls command of Linux shell from the Grunt shell using the sh option as
shown below. In this example, it lists out the files in the /pig/bin/ directory.
grunt> sh ls
pig
pig_1444799121955.log
pig.cmd
pig.py
fs Command
Using the fs command, we can invoke any FsShell commands from the Grunt shell.
Syntax
Given below is the syntax of fs command.
grunt> fs File System command parameters
Example
We can invoke the ls command of HDFS from the Grunt shell using fs command.
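For instance, listing the contents of the user's HDFS home directory from the Grunt shell
(the output will vary by cluster):
grunt> fs -ls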
Utility Commands
The Grunt shell provides a set of utility commands. These include utility commands such
as clear, help, history, quit, and set; and commands such as exec, kill, and run to control
Pig from the Grunt shell. Given below is the description of the utility commands provided by
the Grunt shell.
clear Command
The clear command is used to clear the screen of the Grunt shell.
Syntax
We can clear the screen of the grunt shell using the clear command as shown below.
grunt> clear
help Command
The help command gives a list of Pig commands or Pig properties.
history Command
This command displays a list of statements executed/used so far since the Grunt shell was
invoked.
set Command
The set command is used to show/assign values to keys used in Pig.
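For example (the values shown are illustrative; job.name and default_parallel are standard
Pig properties):
grunt> set job.name 'student analysis'
grunt> set default_parallel 10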
quit Command
We can quit from the Grunt shell using this command.
grunt> quit
exec Command
Using the exec command, we can execute Pig scripts from the Grunt shell.
Syntax
Given below is the syntax of the utility command exec.
grunt> exec [-param param_name = param_value] [-param_file file_name] [script]
run Command
We can run a Pig script from the Grunt shell using the run command.
Syntax
Given below is the syntax of the run command.
grunt> run [-param param_name = param_value] [-param_file file_name] script
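For example (the script path is hypothetical), both commands execute a script stored on
HDFS; the difference is that run executes the script in the current Grunt shell context, so
the relations it defines remain accessible afterwards, whereas exec runs it in a separate
context:
grunt> exec /scripts/sample_script.pig
grunt> run /scripts/sample_script.pig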
Pig Latin
Pig Latin is the language used to analyze data in Hadoop using Apache Pig. In this chapter,
we are going to discuss the basics of Pig Latin such as Pig Latin statements, data types,
general and relational operators, and Pig Latin UDF’s.
Pig Latin – Data Model
The data model of Pig is fully nested. A relation is the outermost structure of the Pig Latin
data model, and it is a bag, where −
• A bag is a collection of tuples.
• A tuple is an ordered set of fields.
• A field is a piece of data.
Pig Latin – Statements
While processing data using Pig Latin, statements are the basic constructs.
• These statements work with relations. They include expressions and schemas.
• Every statement ends with a semicolon (;).
• We will perform various operations using operators provided by Pig Latin, through
statements.
• Except for LOAD and STORE, while performing all other operations, Pig Latin
statements take a relation as input and produce another relation as output.
• As soon as you enter a Load statement in the Grunt shell, its semantic checking will
be carried out. To see the contents of the relation, you need to use the Dump operator.
Only after performing the dump operation will the MapReduce job for loading the data
be carried out.
Example
Given below is a Pig Latin statement, which loads data to Apache Pig.
grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
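The relation can then be inspected; Dump triggers the underlying MapReduce job and prints
the tuples on the console:
grunt> Dump Student_data;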
Pig Latin – Data types
The table below describes the Pig Latin data types.
1. int − Represents a signed 32-bit integer. Example: 8
2. long − Represents a signed 64-bit integer. Example: 5L
3. float − Represents a signed 32-bit floating point. Example: 5.5F
4. double − Represents a 64-bit floating point. Example: 10.5
5. chararray − Represents a character array (string) in Unicode UTF-8 format. Example: 'tutorials point'
6. bytearray − Represents a byte array (blob).
7. boolean − Represents a Boolean value. Example: true/false
8. datetime − Represents a date-time. Example: 1970-01-01T00:00:00.000+00:00
9. biginteger − Represents a Java BigInteger. Example: 60708090709
10. bigdecimal − Represents a Java BigDecimal. Example: 185.98376256272893883
Complex Types
11. tuple − A tuple is an ordered set of fields. Example: (raja, 30)
12. bag − A bag is a collection of tuples. Example: {(raju,30),(Mohammad,45)}
13. map − A map is a set of key-value pairs. Example: ['name'#'Raju', 'age'#30]
Null Values
Values for all the above data types can be NULL. Apache Pig treats null values in a similar
way as SQL does.
A null can be an unknown value or a non-existent value. It is used as a placeholder for
optional values. These nulls can occur naturally or can be the result of an operation.
Pig Latin – Arithmetic Operators
The following table describes the arithmetic operators of Pig Latin. Suppose a = 10 and b =
20.
+ (Addition) − Adds values on either side of the operator. Example: a + b will give 30
− (Subtraction) − Subtracts the right-hand operand from the left-hand operand. Example: a − b will give −10
* (Multiplication) − Multiplies values on either side of the operator. Example: a * b will give 200
/ (Division) − Divides the left-hand operand by the right-hand operand. Example: b / a will give 2
% (Modulus) − Divides the left-hand operand by the right-hand operand and returns the remainder. Example: b % a will give 0
?: (Bincond) − Evaluates the Boolean operators. It has three operands, as shown below:
variable x = (expression) ? value1 if true : value2 if false.
Example: b = (a == 1) ? 20 : 30; if a = 1 the value of b is 20; if a != 1 the value of b is 30.
CASE WHEN THEN ELSE END (Case) − The case operator is equivalent to the nested bincond operator.
Example: CASE f2 % 2 WHEN 0 THEN 'even' WHEN 1 THEN 'odd' END
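As a small sketch of these operators inside a Pig Latin statement (reusing the Student_data
relation loaded earlier; the derived field name parity is illustrative):
grunt> X = FOREACH Student_data GENERATE id, ((id % 2 == 0) ? 'even' : 'odd') AS parity;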
Pig Latin – Comparison Operators
The following table describes the comparison operators of Pig Latin.
== (Equal) − Checks if the values of two operands are equal or not; if yes, then the condition becomes true. Example: (a = b) is not true
!= (Not Equal) − Checks if the values of two operands are equal or not. If the values are not equal, then the condition becomes true. Example: (a != b) is true
> (Greater than) − Checks if the value of the left operand is greater than the value of the right operand. If yes, then the condition becomes true. Example: (a > b) is not true
< (Less than) − Checks if the value of the left operand is less than the value of the right operand. If yes, then the condition becomes true. Example: (a < b) is true
>= (Greater than or equal to) − Checks if the value of the left operand is greater than or equal to the value of the right operand. If yes, then the condition becomes true. Example: (a >= b) is not true
<= (Less than or equal to) − Checks if the value of the left operand is less than or equal to the value of the right operand. If yes, then the condition becomes true. Example: (a <= b) is true
matches (Pattern matching) − Checks whether the string on the left-hand side matches the constant on the right-hand side. Example: f1 matches '.*tutorial.*'
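A comparison operator is typically used inside a FILTER or bincond expression; for instance
(the city value is illustrative, reusing the Student_data relation):
grunt> chennai_students = FILTER Student_data BY city == 'Chennai';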
Pig Latin – Type Construction Operators
The following table describes the Type construction operators of Pig Latin.
() (Tuple constructor) − This operator is used to construct a tuple. Example: (Raju, 30)
{} (Bag constructor) − This operator is used to construct a bag. Example: {(Raju, 30), (Mohammad, 45)}
[] (Map constructor) − This operator is used to construct a map. Example: [name#Raja, age#30]
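For example, the tuple constructor can be used inside a FOREACH to nest fields of an
existing relation (again reusing the Student_data relation as a sketch):
grunt> names = FOREACH Student_data GENERATE id, (firstname, lastname);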
Pig Latin – Relational Operations
The following table describes the relational operators of Pig Latin.
Loading and Storing
LOAD − To load the data from the file system (local/HDFS) into a relation.
STORE − To save a relation to the file system (local/HDFS).
Filtering
FILTER − To remove unwanted rows from a relation.
DISTINCT − To remove duplicate rows from a relation.
FOREACH, GENERATE − To generate data transformations based on columns of data.
STREAM − To transform a relation using an external program.
Grouping and Joining
JOIN − To join two or more relations.
COGROUP − To group the data in two or more relations.
GROUP − To group the data in a single relation.
CROSS − To create the cross product of two or more relations.
Sorting
ORDER − To arrange a relation in a sorted order based on one or more fields (ascending or descending).
LIMIT − To get a limited number of tuples from a relation.
Combining and Splitting
UNION − To combine two or more relations into a single relation.
SPLIT − To split a single relation into two or more relations.
Diagnostic Operators
DUMP − To print the contents of a relation on the console.
DESCRIBE − To describe the schema of a relation.
EXPLAIN − To view the logical, physical, or MapReduce execution plans used to compute a relation.
ILLUSTRATE − To view the step-by-step execution of a series of statements.
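A short pipeline combining several of these operators (a sketch reusing the Student_data
relation loaded earlier):
grunt> by_city = GROUP Student_data BY city;
grunt> city_counts = FOREACH by_city GENERATE group, COUNT(Student_data);
grunt> DUMP city_counts;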
Apache Pig - User Defined Functions
In addition to the built-in functions, Apache Pig provides extensive support
for User Defined Functions (UDFs). Using these UDFs, we can define our own functions
and use them. UDF support is provided in six programming languages, namely, Java,
Jython, Python, JavaScript, Ruby and Groovy.
For writing UDFs, complete support is provided in Java and limited support is provided in
all the remaining languages. Using Java, we can write UDFs involving all parts of the
processing like data load/store, column transformation, and aggregation. Since Apache Pig
has been written in Java, UDFs written in Java work more efficiently than those written in
other languages.
In Apache Pig, we also have a Java repository for UDFs named Piggybank. Using
Piggybank, we can access Java UDFs written by other users, and contribute our own UDFs.
Types of UDFs in Java
While writing UDFs using Java, we can create and use the following three types of functions
−
• Filter Functions − The filter functions are used as conditions in filter statements.
These functions accept a Pig value as input and return a Boolean value.
• Eval Functions − The Eval functions are used in FOREACH-GENERATE
statements. These functions accept a Pig value as input and return a Pig result.
• Algebraic Functions − The Algebraic functions act on inner bags in a
FOREACH-GENERATE statement. These functions are used to perform full
MapReduce operations on an inner bag.
Writing UDFs using Java
To write a UDF using Java, we have to integrate the jar file Pig-0.15.0.jar. In this section, we
discuss how to write a sample UDF using Eclipse. Before proceeding further, make sure you
have installed Eclipse and Maven on your system.
Follow the steps given below to write a UDF function −
• Open Eclipse and create a new project (say myproject).
• Convert the newly created project into a Maven project.
• Copy the following content in the pom.xml. This file contains the Maven
dependencies for Apache Pig and Hadoop-core jar files.
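The pom.xml contents are not reproduced here. Once the project is set up, a filter UDF might
look like the following sketch (the class name and the even-id check are purely illustrative;
only the org.apache.pig.FilterFunc base class and its exec(Tuple) contract are assumed):

import java.io.IOException;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

// Illustrative filter UDF: keeps tuples whose first field is an even integer.
public class IsEvenId extends FilterFunc {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        // Reject null or empty input rather than failing the job.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return false;
        }
        int id = (Integer) input.get(0);
        return id % 2 == 0;
    }
}

After packaging the class into a jar, it would be registered with the REGISTER statement and
invoked by its fully qualified class name inside a FILTER statement in Pig Latin.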
Data Processing Operators :
Pig Latin, with its operators, is a high-level procedural language for querying large data sets
using Hadoop and the MapReduce platform.
A Pig Latin statement is an operator that takes a relation as input and produces another
relation as output.
These operators are the main tools Pig Latin provides to operate on the data.
They allow you to transform data by sorting, grouping, joining, projecting, and filtering.
The Apache Pig operators can be classified as :
Relational Operators :
Relational operators are the main tools Pig Latin provides to operate on the data.
Some of the Relational Operators are :
LOAD: The LOAD operator is used to load data from the file system or HDFS storage
into a Pig relation.
FOREACH: This operator generates data transformations based on columns of data. It is
used to add or remove fields from a relation.
FILTER: This operator selects tuples from a relation based on a condition.
JOIN: The JOIN operator is used to perform an inner, equijoin of two or more relations
based on common field values.
ORDER BY: Order By is used to sort a relation based on one or more fields in either
ascending or descending order using ASC and DESC keywords.
GROUP: The GROUP operator groups together the tuples with the same group key (key
field).
COGROUP: COGROUP is the same as the GROUP operator. For readability, programmers
usually use GROUP when only one relation is involved and COGROUP when multiple
relations are involved.
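As a sketch of JOIN and ORDER BY together (the second relation and its input file are
hypothetical; Student_data is the relation loaded earlier):
grunt> Marks_data = LOAD 'marks_data.txt' USING PigStorage(',') as (id:int, marks:int);
grunt> joined = JOIN Student_data BY id, Marks_data BY id;
grunt> ranked = ORDER joined BY marks DESC;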
Diagnostic Operator :
The load statement will simply load the data into the specified relation in Apache Pig.
To verify the execution of the Load statement, you have to use the Diagnostic Operators.
Some Diagnostic Operators are :
DUMP: The DUMP operator is used to run Pig Latin statements and display the results on
the screen.
DESCRIBE: Use the DESCRIBE operator to review the schema of a particular relation. The
DESCRIBE operator is best used for debugging a script.
ILLUSTRATE: ILLUSTRATE operator is used to review how data is transformed through
a sequence of Pig Latin statements. ILLUSTRATE command is your best friend when it
comes to debugging a script.
EXPLAIN: The EXPLAIN operator is used to display the logical, physical, and MapReduce
execution plans of a relation.
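For instance, applied to the relation loaded earlier (the exact output depends on the data and
cluster):
grunt> DESCRIBE Student_data;
grunt> EXPLAIN Student_data;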
Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides
on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
hive is a shell utility which can be used to run Hive queries in either interactive or batch
mode. HiveServer2 (introduced in Hive 0.11) has its own CLI called Beeline, which is a
JDBC client based on SQLLine.
Hive Command Line Options
To get help, run "hive -H" or "hive --help". Usage (as it is in Hive 0.9.0):
usage: hive
-d,--define <key=value> − Variable substitution to apply to Hive commands, e.g. -d A=B or --define A=B
-e <quoted-query-string> − SQL from the command line
-f <filename> − SQL from files
-H,--help − Print help information
-h <hostname> − Connect to Hive Server on a remote host
--hiveconf <property=value> − Use the given value for the given property
--hivevar <key=value> − Variable substitution to apply to Hive commands, e.g. --hivevar A=B
-i <filename> − Initialization SQL file
-p <port> − Connect to Hive Server on a port number
-S,--silent − Silent mode in interactive shell
-v,--verbose − Verbose mode (echo executed SQL to the console)
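For example (the table, query and file names are illustrative; the flags are the ones listed
above):
hive -e 'SELECT * FROM students LIMIT 10;'
hive -f report.hql --hiveconf mapred.reduce.tasks=4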
Hive Services :
The following are the services provided by Hive :
· Hive CLI: The Hive CLI (Command Line Interface) is a shell where we can execute
Hive queries and commands.
· Hive Web User Interface: The Hive Web UI is just an alternative of Hive CLI. It
provides a web-based GUI for executing Hive queries and commands.
· Hive metastore: It is a central repository that stores all the structure information of
various tables and partitions in the warehouse. It also includes metadata about columns and their
type information, the serializers and deserializers used to read and write data, and the
corresponding HDFS files where the data is stored.
· Hive Server: It is referred to as Apache Thrift Server. It accepts the request from
different clients and provides it to Hive Driver.
· Hive Driver: It receives queries from different sources like web UI, CLI, Thrift, and
JDBC/ODBC driver. It transfers the queries to the compiler.
· Hive Compiler: The purpose of the compiler is to parse the query and perform semantic
analysis on the different query blocks and expressions. It converts HiveQL statements into
MapReduce jobs.
· Hive Execution Engine: The optimizer generates the logical plan in the form of a DAG of
map-reduce tasks and HDFS tasks. In the end, the execution engine executes the incoming
tasks in the order of their dependencies.
MetaStore :
Hive metastore (HMS) is a service that stores Apache Hive and other metadata in a backend
RDBMS, such as MySQL or PostgreSQL.
Impala, Spark, Hive, and other services share the metastore.
The connections to and from HMS include HiveServer, Ranger, and the NameNode, which
represents HDFS.
Beeline, Hue, JDBC, and Impala shell clients make requests through thrift or JDBC to
HiveServer.
The HiveServer instance reads/writes data to HMS.
By default, redundant HMS operate in active/active mode.
The physical data resides in a backend RDBMS, one for HMS.
All connections are routed to a single RDBMS service at any given time.
HMS talks to the NameNode over thrift and functions as a client to HDFS.
HMS connects directly to Ranger and the NameNode (HDFS), and so does HiveServer.
One or more HMS instances on the backend can talk to other services, such as Ranger.
Comparison with Traditional Database :
• An RDBMS is used to maintain a database; Hive is used to maintain a data warehouse.
• An RDBMS uses SQL (Structured Query Language); Hive uses HQL (Hive Query Language).
• Schema is fixed in an RDBMS; in Hive the schema can vary.
• An RDBMS stores normalized data; Hive stores both normalized and de-normalized data.
• Tables in an RDBMS are sparse; tables in Hive are dense.
• An RDBMS doesn't support partitioning; Hive supports automatic partitioning.
• No partition method is used in an RDBMS; Hive uses the sharding method for partitioning.
HiveQL
HiveQL is a query language for Hive to analyze and process structured data in a Metastore.
It is a mixture of SQL-92, MySQL, and Oracle's SQL. It is very much similar to
SQL and highly scalable. It reuses familiar concepts from the relational database world,
such as tables, rows, columns and schema, to ease learning. Hive supports four file formats:
TEXTFILE, SEQUENCEFILE, ORC and RCFILE (Record Columnar File).
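A minimal HiveQL sketch (the table, columns, and file format are illustrative):
CREATE TABLE students (id INT, name STRING, city STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

SELECT city, COUNT(*) FROM students GROUP BY city;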
Hive internal tables vs external tables
There are two types of tables that you can create with Hive:
• Internal: Data is stored in the Hive data warehouse. The data warehouse is located
at /hive/warehouse/ on the default storage for the cluster.
Use internal tables when one of the following conditions applies:
o Data is temporary.
o You want Hive to manage the lifecycle of the table and data.
• External: Data is stored outside the data warehouse. The data can be stored on any
storage accessible by the cluster.
Use external tables when one of the following conditions applies (see the sketch after this list):
o The data is also used outside of Hive. For example, the data files are updated by
another process (that doesn't lock the files).
o Data needs to remain in the underlying location, even after dropping the table.
o We need a custom location, such as a non-default storage account.
o A program other than Hive manages the data format, location, and so on.
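For instance, an external table can point at data that already lives at a known location (the
path and columns here are illustrative):
CREATE EXTERNAL TABLE web_logs (line STRING)
LOCATION '/data/logs/';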
Querying data in Hive
Hive enables data summarization, querying, and analysis of data. Hive queries are written in
HiveQL, which is a query language similar to SQL. Hive allows you to project structure on
largely unstructured data. After you define the structure, you can use HiveQL to query the
data without knowledge of Java or MapReduce.
User-defined functions (UDF)
Hive can also be extended through user-defined functions (UDF). A UDF allows you to
implement functionality or logic that isn't easily modeled in HiveQL. For an example of
using UDFs with Hive, see the following documents:
• Use a Java user-defined function with Apache Hive
• Use a Python user-defined function with Apache Hive
• Use a C# user-defined function with Apache Hive
• How to add a custom Apache Hive user-defined function to HDInsight
• An example Apache Hive user-defined function to convert date/time formats to Hive
timestamp
What is HBase?
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is
an open-source project and is horizontally scalable.
HBase is a data model, similar to Google's Bigtable, designed to provide quick random
access to huge amounts of structured data. It leverages the fault tolerance provided by the
Hadoop File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access to data
in the Hadoop File System.
One can store the data in HDFS either directly or through HBase. Data consumer
reads/accesses the data in HDFS randomly using HBase. HBase sits on top of the Hadoop
File System and provides read and write access.
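As a sketch of this read/write access using the HBase shell (the table, column family, and
values are illustrative):
hbase> create 'students', 'info'
hbase> put 'students', 'row1', 'info:name', 'Raju'
hbase> get 'students', 'row1'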
• HDFS is a distributed file system suitable for storing large files; HBase is a database built on top of the HDFS.
• HDFS does not support fast individual record lookups; HBase provides fast lookups for larger tables.
• HDFS provides high latency batch processing; HBase provides low latency access to single rows from billions of records (random access).
• HDFS provides only sequential access of data; HBase internally uses hash tables and provides random access, and it stores the data in indexed HDFS files for faster lookups.
Features of HBase
• HBase is linearly scalable.
• It has automatic failure support.
• It provides consistent read and writes.
• It integrates with Hadoop, both as a source and a destination.
• It has an easy Java API for clients.
• It provides data replication across clusters.
Where to Use HBase
• Apache HBase is used to have random, real-time read/write access to Big Data.
• It hosts very large tables on top of clusters of commodity hardware.
• Apache HBase is a non-relational database modeled after Google's Bigtable. Bigtable
acts upon the Google File System; likewise, Apache HBase works on top of Hadoop and
HDFS.
Applications of HBase
• It is used whenever there is a need for write-heavy applications.
• HBase is used whenever we need to provide fast random access to available data.
• Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
HBase Vs. RDBMS
While comparing HBase with Traditional Relational databases, we have to take three key
areas into consideration. Those are data model, data storage, and data diversity.
• HBase is schema-less; an RDBMS has a fixed schema.
• HBase is a column-oriented datastore; an RDBMS is a row-oriented datastore.
• HBase is designed to store de-normalized data; an RDBMS is designed to store normalized data.
• HBase has wide and sparsely populated tables; an RDBMS contains thin tables.
• HBase supports automatic partitioning; an RDBMS has no built-in support for partitioning.
• HBase is well suited for OLAP systems; an RDBMS is well suited for OLTP systems.
• HBase reads only the relevant data from the database; an RDBMS retrieves one row at a time and hence could read unnecessary data if only some of the data in a row is required.
• Structured and semi-structured data can be stored and processed using HBase; structured data can be stored and processed using an RDBMS.
• HBase enables aggregation over many rows and columns; in an RDBMS, aggregation is an expensive operation.
Big SQL
Big SQL is a high performance massively parallel processing (MPP) SQL engine for Hadoop
that makes querying enterprise data from across the organization an easy and secure
experience. A Big SQL query can quickly access a variety of data sources including HDFS,
RDBMS, NoSQL databases, object stores, and WebHDFS by using a single database
connection or single query for best-in-class analytic capabilities.
What Big SQL looks like
Big SQL provides tools to help you manage your system and your databases, and you can use
popular analytic tools to visualize your data.
How Big SQL works
Big SQL's robust engine executes complex queries for relational data and Hadoop data. Big
SQL provides an advanced SQL compiler and a cost-based optimizer for efficient query
execution. Combining these with a massively parallel processing (MPP) engine helps distribute
query execution across nodes in a cluster.