
The Big Data Technology Landscape

Consistency, Availability, Partition tolerance

Learning Objectives and Learning Outcomes

Learning Objectives

The big data technology landscape
1. What are NoSQL databases?
2. Why NoSQL?
3. Key advantages of NoSQL
4. What is NewSQL?
5. SQL vs. NoSQL
6. Getting familiar with Hadoop

Learning Outcomes

a) To understand the significance of NoSQL databases
b) To understand the need for NewSQL
c) To understand the Hadoop platform and be able to appreciate the difference between Hadoop 1.0 and Hadoop 2.0


Session Plan

Lecture time 45 to 60 minutes

Q/A 15 minutes
Agenda
• NoSQL
  • What is it?
  • Types of NoSQL Databases
  • Why NoSQL?
  • Advantages of NoSQL
  • NoSQL Vendors
  • SQL versus NoSQL
  • NewSQL
  • Comparison of SQL, NoSQL and NewSQL
• Hadoop
  • Features of Hadoop
  • Key Advantages of Hadoop
  • Versions of Hadoop
What is NoSQL?

• Non-relational data storage systems
• No fixed table schema
• No joins
• No multi-document transactions
• Relaxes one or more ACID properties

Types of NoSQL

• Key-value data store: Riak, Redis, Membase
• Column-oriented data store: Cassandra, HBase, HyperTable
• Document data store: MongoDB, CouchDB, RavenDB
• Graph data store: InfiniteGraph, Neo4j, AllegroGraph
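
To make the four categories more concrete, here is a rough sketch that lays out the same made-up fact in each data model using plain Python types. The keys, names, and the `followed_by` helper are assumptions for illustration only; real products expose their own APIs and storage formats.

```python
import json

# The same (made-up) fact, "user u1 is Asha, lives in Pune, follows u2",
# laid out in each of the four NoSQL data models using plain Python types.

# Key-value store: an opaque value looked up by a single key.
key_value = {"user:u1": '{"name": "Asha", "city": "Pune"}'}

# Column-oriented store: row key -> column family -> column -> value.
column_oriented = {
    "u1": {
        "profile": {"name": "Asha", "city": "Pune"},
        "social": {"follows": "u2"},
    }
}

# Document store: a self-describing, nested document per entity.
document = {"_id": "u1", "name": "Asha", "city": "Pune", "follows": ["u2"]}

# Graph store: nodes plus labelled edges between them.
graph = {
    "nodes": {"u1": {"name": "Asha"}, "u2": {"name": "Ravi"}},
    "edges": [("u1", "FOLLOWS", "u2")],
}


def followed_by(graph_data, user_id):
    """Return the ids that user_id follows by walking FOLLOWS edges."""
    return [dst for src, label, dst in graph_data["edges"]
            if src == user_id and label == "FOLLOWS"]


name_from_kv = json.loads(key_value["user:u1"])["name"]
city_from_columns = column_oriented["u1"]["profile"]["city"]
follows = followed_by(graph, "u1")
```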
Advantages of NoSQL

• Cheap, easy to implement
• Easy to distribute
• Can easily scale up and down
• Relaxes the data consistency requirement
• Doesn't require a pre-defined schema
• Data can be replicated to multiple nodes and can be partitioned
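
The last advantage, partitioning plus replication, can be sketched as follows. This is a simplified illustration under stated assumptions: the node names, the replication factor of 2, and the modulo-based placement are all hypothetical, and production systems typically use more elaborate schemes such as consistent hashing.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
REPLICATION_FACTOR = 2  # assumed: each key is stored on two nodes


def replicas_for(key, nodes=NODES, replication_factor=REPLICATION_FACTOR):
    """Pick the nodes that hold `key`: hash to a primary, then take neighbours."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    primary = int(digest, 16) % len(nodes)
    return [nodes[(primary + i) % len(nodes)] for i in range(replication_factor)]


def put(cluster, key, value):
    """Write the value to every replica responsible for the key."""
    for node in replicas_for(key):
        cluster[node][key] = value


def get(cluster, key, failed=()):
    """Read from the first replica that is not marked as failed."""
    for node in replicas_for(key):
        if node not in failed:
            return cluster[node].get(key)
    raise RuntimeError("all replicas for this key are unavailable")


cluster = {node: {} for node in NODES}
put(cluster, "user:42", {"name": "Asha"})
primary = replicas_for("user:42")[0]
value_after_failure = get(cluster, "user:42", failed={primary})
```

Partitioning spreads keys across nodes, and replication is what lets the read succeed even when the primary node for the key is treated as failed.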
NoSQL Vendors

Company | Product | Most widely used by
Amazon | DynamoDB | LinkedIn, Mozilla
Facebook | Cassandra | Netflix, Twitter, eBay
Google | BigTable | Adobe Photoshop


SQL Vs. NoSQL

SQL | NoSQL
Relational database | Non-relational, distributed database
Relational model | Model-less approach
Pre-defined schema | Dynamic schema for unstructured data
Table-based databases | Document-based, graph-based, wide-column or key-value databases
Vertically scalable (by increasing system resources) | Horizontally scalable (by creating a cluster of commodity machines)
Uses SQL | Uses store-specific query languages or APIs; UnQL (Unstructured Query Language) was proposed as a common one
Not preferred for large datasets | Largely preferred for large datasets
Not the best fit for hierarchical data | Best fit for hierarchical storage, as it stores data as key-value pairs similar to JSON (JavaScript Object Notation)
Emphasis on ACID properties | Follows Brewer's CAP theorem
Excellent support from vendors | Relies heavily on community support
Supports complex querying and data keeping needs | Does not have good support for complex querying
Can be configured for strong consistency | Some support strong consistency (e.g., MongoDB); others can be configured for eventual consistency (e.g., Cassandra)
Examples: Oracle, DB2, MySQL, MS SQL, PostgreSQL, etc. | Examples: MongoDB, HBase, Cassandra, Redis, Neo4j, CouchDB, Couchbase, Riak, etc.
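
As a small, hedged illustration of two rows in this comparison (pre-defined schema with joins versus hierarchical, JSON-like storage), the sketch below answers the same question both ways. It uses Python's built-in `sqlite3` module as a stand-in relational database; the table names, columns, and sample data are made up for the example.

```python
import json
import sqlite3

# Relational side: a pre-defined schema and a join across two tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        item TEXT NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Asha');
    INSERT INTO orders VALUES (10, 1, 'book'), (11, 1, 'pen');
""")
sql_items = [row[0] for row in conn.execute(
    "SELECT o.item FROM customers c JOIN orders o ON o.customer_id = c.id "
    "WHERE c.name = ? ORDER BY o.id", ("Asha",)
)]
conn.close()

# Document side: the orders are nested inside the customer document,
# so the hierarchy is read directly without a join.
doc = json.loads('{"name": "Asha", "orders": [{"item": "book"}, {"item": "pen"}]}')
doc_items = [order["item"] for order in doc["orders"]]
```

Both approaches return the same items; the difference lies in where the structure lives (in the schema and query, or in the stored document itself).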
NewSQL

Characteristics of NewSQL

• SQL interface for application interaction
• ACID support for transactions
• An architecture that provides higher per-node performance than a traditional RDBMS solution
• Scale-out, shared-nothing architecture
• Non-locking concurrency control mechanism, so that real-time reads do not conflict with writes
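
One widely used family of non-locking concurrency control is multi-version concurrency control (MVCC). The slides do not say which mechanism a given NewSQL system uses, so the following is only a toy sketch under that assumption; the class `VersionedStore` and its methods are hypothetical names invented for this example.

```python
class VersionedStore:
    """Toy multi-version store: writers append versions, readers use snapshots."""

    def __init__(self):
        self._versions = {}   # key -> list of (commit_ts, value), oldest first
        self._clock = 0       # logical commit timestamp

    def write(self, key, value):
        """Commit a new version; older versions remain readable."""
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def snapshot(self):
        """Return the timestamp a reader should use for a consistent view."""
        return self._clock

    def read(self, key, snapshot_ts):
        """Return the newest value committed at or before snapshot_ts."""
        for commit_ts, value in reversed(self._versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None


store = VersionedStore()
store.write("balance", 100)
reader_ts = store.snapshot()        # a long-running read starts here
store.write("balance", 250)         # a concurrent write commits afterwards
seen_by_old_reader = store.read("balance", reader_ts)
seen_by_new_reader = store.read("balance", store.snapshot())
```

Because the earlier reader keeps seeing the version that existed at its snapshot, the later write never has to wait for it, and the reader never has to wait for the write.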
SQL Vs. NoSQL Vs. NewSQL

 | SQL | NoSQL | NewSQL
Adherence to ACID properties | Yes | No | Yes
OLTP/OLAP | Yes | No | Yes
Schema rigidity | Yes | No | Maybe
Adherence to data model | Adherence to relational model | |
Data format flexibility | No | Yes | Maybe
Scalability | Scale up (vertical scaling) | Scale out (horizontal scaling) | Scale out
Distributed computing | Yes | Yes | Yes
Community support | Huge | Growing | Slowly growing
Hadoop
Apache Open-Source Software Framework

Inspired by
- Google MapReduce
- Google File System

Hadoop Distributed File System


MapReduce
Key Advantages of Hadoop

• Stores data in its native format
• Scalable
• Cost-effective
• Resilient to failure
• Flexible
• Fast
Versions of Hadoop

Hadoop 1.0 (layers, top to bottom):
• MapReduce (Cluster Resource Manager & Data Processing)
• HDFS (redundant, reliable storage)

Hadoop 2.0 (layers, top to bottom):
• MapReduce (Data Processing) and Others (Data Processing)
• YARN (Cluster Resource Manager)
• HDFS (redundant, reliable storage)
Hadoop Ecosystem

• Ambari (Provisioning, Managing & Monitoring Hadoop Cluster)
• Mahout (Machine Learning)
• Pig (Data Flow)
• R (Statistics)
• Hive (Data Warehouse)
• Sqoop (Relational Database Data Collector)
• Oozie (Workflow)
• MapReduce (Distributed Processing)
• HBase (Distributed Table Store)
• Flume/Chukwa (Log Data Collector)
• ZooKeeper (Coordination)
• HDFS (Hadoop Distributed File System)
Hadoop Ecosystem

Components that help with Data Ingestion are:


1. Sqoop
2. Flume
Components that help with Data Processing are:
3. MapReduce
4. Spark
Components that help with Data Analysis are:
5. Pig
6. Hive
7. Impala
Three Differences between HBase and Hadoop/HDFS

• HDFS is a file system, whereas HBase is a database that runs on Hadoop. The relationship is like that between NTFS and MySQL.

• HDFS is WORM (write once, read many times). Recent versions support appending data, but this feature is rarely used. HBase, in contrast, supports real-time random reads and writes.

• HDFS is based on the Google File System (GFS), whereas HBase is based on Google Bigtable.
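
The access-pattern difference in the second point can be sketched with two toy classes. These are hypothetical stand-ins written for illustration, not the HDFS or HBase client APIs: one models a file that is written once and then only scanned, the other a table whose rows can be read or overwritten by key at any time.

```python
class WriteOnceFile:
    """Toy stand-in for an HDFS-style file: append only, then read many times."""

    def __init__(self):
        self._records = []
        self._closed = False

    def append(self, record):
        if self._closed:
            raise IOError("file is closed; existing data cannot be rewritten")
        self._records.append(record)

    def close(self):
        self._closed = True

    def scan(self):
        """Reads are sequential scans over everything that was written."""
        return list(self._records)


class RandomAccessTable:
    """Toy stand-in for an HBase-style table: read or overwrite any row by key."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, value):
        self._rows[row_key] = value   # overwriting an existing row is allowed

    def get(self, row_key):
        return self._rows.get(row_key)


raw_log = WriteOnceFile()
raw_log.append("login u1")
raw_log.append("login u2")
raw_log.close()
try:
    raw_log.append("late record")
    rewrite_allowed = True
except IOError:
    rewrite_allowed = False

table = RandomAccessTable()
table.put("u1", {"logins": 1})
table.put("u1", {"logins": 2})   # in-place update of a single row
current = table.get("u1")
```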
Hadoop Ecosystem Components for Data Ingestion

Sqoop:
• Sqoop stands for SQL to Hadoop. It can import data from external systems into HDFS and populate tables in Hive and HBase.
Flume:
• Flume is an important log aggregator component in the Hadoop ecosystem: it collects logs from different machines and places them in HDFS.
Hadoop Ecosystem Components for Data Processing

MapReduce:
• It is a programming paradigm that allows distributed and parallel processing of huge datasets. It is based on Google MapReduce.
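
The paradigm can be illustrated with the classic word-count example. The sketch below runs in a single Python process and only mimics the map, shuffle, and reduce phases; it is not Hadoop's Java API, and the function names are chosen for this illustration. In a real cluster the map and reduce calls would run in parallel on different machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over a list of input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data big ideas", "Data at scale"])
```

Each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, which is what makes the work easy to distribute.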

Spark:
• It is both a programming model and a computing model. It is an open-source big data processing framework.
• It is written in Scala. It provides in-memory computing for Hadoop.
• Spark can be used with Hadoop, coexisting smoothly with MapReduce (sitting on top of Hadoop YARN), or independently of Hadoop (standalone).
Hadoop Ecosystem Components for Data Analysis

Pig:
• It is a high-level scripting platform used with Hadoop. It serves as an alternative to writing MapReduce programs directly. It has two parts:
  • Pig Latin: a SQL-like scripting language.
  • Pig runtime: the runtime environment.
Hive:
• Hive is a data warehouse software project built on top of Hadoop. The three main tasks performed by Hive are summarization, querying and analysis.
Impala:
• It is a high-performance SQL engine that runs on a Hadoop cluster. It is ideal for interactive analysis. It has very low latency, measured in milliseconds. It supports a dialect of SQL called Impala SQL.
Answer a few quick questions …
Fill in the blanks

1. The expansion for CAP is _____________, ____________ and ___________________.
2. The expansion of BASE is ___________________.
3. MongoDB is ___________________ and ___________________.
4. Cassandra is ___________________ and ___________________.
5. ___________________ has no support for ACID properties of transactions.
6. ___________________ is a robust database that supports ACID properties of
transactions and has the scalability of NoSQL.
Answer Me

• Cite the differences between Hadoop 1.0 and Hadoop 2.0.

• Compare and contrast SQL, NoSQL and NewSQL.


Summary please…

Ask a few participants of the learning program to summarize the lecture.


References …
Further Readings

• http://www.mongodb.com/nosql-explained
• http://nosql-database.org/
• http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html
• http://hadoop.apache.org/
Thank you
