The Big Data Technology Landscape

Consistency, Availability, Partition tolerance

Uploaded by

Ponnusamy S Pichaimuthu

The Big Data Technology Landscape

Learning Objectives and Learning Outcomes

Learning Objectives

The big data technology landscape
1. What are NoSQL databases?
2. Why NoSQL?
3. Key advantages of NoSQL
4. What is NewSQL?
5. SQL vs. NoSQL
6. Getting familiar with Hadoop

Learning Outcomes

a) To understand the significance of NoSQL databases
b) To understand the need for NewSQL
c) To understand the Hadoop platform and be able to appreciate the difference between Hadoop 1.0 and Hadoop 2.0

Session Plan

Lecture time 45 to 60 minutes

Q/A 15 minutes
Agenda
• NoSQL
  - What is it?
  - Types of NoSQL Databases
  - Why NoSQL?
  - Advantages of NoSQL
  - NoSQL Vendors
  - SQL versus NoSQL
• NewSQL
  - Comparison of SQL, NoSQL and NewSQL
• Hadoop
  - Features of Hadoop
  - Key Advantages of Hadoop
  - Versions of Hadoop
What is NoSQL?

• Non-relational data storage systems
• No fixed table schema
• No joins
• No multi-document transactions
• Relaxes one or more ACID properties
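The "no fixed table schema" and "no joins" points can be illustrated with a minimal sketch. It uses plain Python dictionaries as stand-ins for documents; the `users` collection, its field names, and the `find` helper are hypothetical and do not correspond to any particular database's API.

```python
# Hypothetical in-memory "collection": a list of documents (dicts).
# Unlike rows in a relational table, documents need not share the same fields.
users = [
    {"_id": 1, "name": "Asha", "email": "asha@example.com"},
    {"_id": 2, "name": "Ravi", "phones": ["555-0100", "555-0101"]},
    {"_id": 3, "name": "Mei", "address": {"city": "Pune", "pin": "411001"}},
]


def find(collection, **criteria):
    """Return the documents whose top-level fields match every criterion."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


# Related data is embedded in the document rather than joined in from another table.
city_of_mei = find(users, name="Mei")[0]["address"]["city"]

# Each document carries its own set of fields.
fields_per_doc = [sorted(doc.keys()) for doc in users]
```

Here every document has a different shape, and the address is read directly from the matching document without a second lookup.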

Types of NoSQL

• Key-value data stores: Riak, Redis, Membase
• Column-oriented data stores: Cassandra, HBase, HyperTable
• Document data stores: MongoDB, CouchDB, RavenDB
• Graph data stores: InfiniteGraph, Neo4j, Allegro Graph
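As a rough illustration of how the four families differ, the sketch below models each one with ordinary Python data structures. All keys, field names, and the `neighbours` helper are made up for this example; real products expose their own APIs and storage formats.

```python
import json

# Key-value store: an opaque value looked up by its key.
key_value = {"user:1": json.dumps({"name": "Asha"})}

# Column-oriented (wide-column) store: row key -> column family -> column -> value.
wide_column = {
    "user:1": {
        "profile": {"name": "Asha"},
        "activity": {"last_login": "2024-01-01"},
    }
}

# Document store: a self-describing, possibly nested document.
document = {"_id": 1, "name": "Asha", "orders": [{"item": "book", "qty": 2}]}

# Graph store: nodes plus labelled edges between them.
nodes = {1: {"name": "Asha"}, 2: {"name": "Ravi"}}
edges = [(1, "FOLLOWS", 2)]


def neighbours(node_id, label):
    """Names of the nodes reachable from node_id over edges with the given label."""
    return [nodes[dst]["name"] for src, lbl, dst in edges if src == node_id and lbl == label]
```

The key-value store can only fetch the whole value by key, the wide-column and document models expose structure inside a record, and the graph model makes relationships first-class.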
Advantages of NoSQL

• Cheap, easy to implement
• Easy to distribute
• Can easily scale up and down
• Relaxes the data consistency requirement
• Doesn’t require a pre-defined schema
• Data can be replicated to multiple nodes and can be partitioned
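The last point, partitioning plus replication, can be sketched in a few lines. This is a simplified illustration under stated assumptions: the node names and replication factor are hypothetical, and plain hash-modulo placement is used for brevity, whereas production systems typically rely on more elaborate schemes such as consistent hashing.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster members
REPLICATION_FACTOR = 2                            # assumed number of copies per key


def owners(key, nodes=NODES, rf=REPLICATION_FACTOR):
    """Hash the key to pick a primary node, then use the next rf-1 nodes as replicas."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]


# Each key lands on a deterministic set of distinct nodes.
placement = {key: owners(key) for key in ["user:1", "user:2", "user:3"]}
```

Because placement depends only on the key, any client can work out which nodes hold a record without consulting a central directory.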
NoSQL Vendors

Company  | Product   | Most widely used by
Amazon   | DynamoDB  | LinkedIn, Mozilla
Facebook | Cassandra | Netflix, Twitter, eBay
Google   | BigTable  | Adobe Photoshop

SQL vs. NoSQL

SQL | NoSQL
Relational database | Non-relational, distributed database
Relational model | Model-less approach
Pre-defined schema | Dynamic schema for unstructured data
Table-based databases | Document-based, graph-based, wide-column store, or key-value pair databases
Vertically scalable (by increasing system resources) | Horizontally scalable (by creating a cluster of commodity machines)
Uses SQL | Uses UnQL (Unstructured Query Language) or, more commonly in practice, a product-specific query language or API
Not preferred for large datasets | Largely preferred for large datasets
Not a best fit for hierarchical data | Best fit for hierarchical storage, as it stores data as key-value pairs in a form similar to JSON (JavaScript Object Notation)
Emphasis on ACID properties | Follows Brewer’s CAP theorem
Excellent support from vendors | Relies heavily on community support
Supports complex querying and data-keeping needs | Does not have good support for complex querying
Can be configured for strong consistency | Some support strong consistency (e.g., MongoDB); others can be configured for eventual consistency (e.g., Cassandra)
Examples: Oracle, DB2, MySQL, MS SQL, PostgreSQL, etc. | Examples: MongoDB, HBase, Cassandra, Redis, Neo4j, CouchDB, Couchbase, Riak, etc.
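To make the "table based" versus "document-based" contrast concrete, the following sketch stores the same customer and orders twice: once in normalized tables queried with a join (using Python's built-in `sqlite3` module), and once as a single nested document. The table names, column names, and sample data are invented for illustration.

```python
import sqlite3

# Relational form: two tables with a pre-defined schema, combined by a join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        item TEXT
    );
    INSERT INTO customers VALUES (1, 'Asha');
    INSERT INTO orders VALUES (10, 1, 'book'), (11, 1, 'pen');
""")
rows = conn.execute(
    "SELECT c.name, o.item FROM customers c "
    "JOIN orders o ON o.customer_id = c.id ORDER BY o.id"
).fetchall()
sql_items = [item for _name, item in rows]
conn.close()

# Document form: the orders are embedded, so no join is needed to read them.
customer_doc = {"_id": 1, "name": "Asha", "orders": [{"item": "book"}, {"item": "pen"}]}
doc_items = [order["item"] for order in customer_doc["orders"]]
```

Both forms yield the same list of items; the difference lies in where the relationship is resolved, at query time in the relational case and at write time in the document case.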
NewSQL

Characteristics of NewSQL:

• SQL interface for application interaction
• ACID support for transactions
• An architecture that provides higher per-node performance vis-à-vis traditional RDBMS solutions
• Scale-out, shared-nothing architecture
• Non-locking concurrency control mechanism, so that real-time reads will not conflict with writes
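The slide does not name a specific non-locking mechanism. One common approach is multi-version concurrency control (MVCC), and the toy sketch below assumes it purely for illustration: writers append new versions, and a reader sees only the versions committed at or before its snapshot. The `VersionedStore` class is hypothetical, single-threaded, and omits everything a real engine needs (transactions, garbage collection, conflict detection).

```python
class VersionedStore:
    """Toy multi-version store: writes append versions, reads use a snapshot."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit_ts, value) in ascending order
        self._clock = 0      # logical commit timestamp

    def write(self, key, value):
        """Commit a new version of key and return its timestamp."""
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def snapshot(self):
        """Return a timestamp that fixes which versions a reader may see."""
        return self._clock

    def read(self, key, snapshot_ts):
        """Return the newest value committed at or before snapshot_ts, if any."""
        visible = [value for ts, value in self._versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None


store = VersionedStore()
store.write("balance", 100)
snap = store.snapshot()              # a reader starts here
store.write("balance", 80)           # a later write does not disturb that reader
old_view = store.read("balance", snap)
new_view = store.read("balance", store.snapshot())
```

The reader holding `snap` keeps seeing the value it started with, while a fresh snapshot sees the update; neither side had to wait for a lock.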
SQL vs. NoSQL vs. NewSQL

Feature | SQL | NoSQL | NewSQL
Adherence to ACID properties | Yes | No | Yes
OLTP/OLAP | Yes | No | Yes
Schema rigidity | Yes | No | Maybe
Adherence to data model | Adherence to relational model | |
Data format flexibility | No | Yes | Maybe
Scalability | Scale up (vertical scaling) | Scale out (horizontal scaling) | Scale out
Distributed computing | Yes | Yes | Yes
Community support | Huge | Growing | Slowly growing
Hadoop

Apache open-source software framework

Inspired by
- Google MapReduce
- Google File System

Core components
- Hadoop Distributed File System (HDFS)
- MapReduce
Key Advantages of Hadoop

• Stores data in its native format
• Scalable
• Cost-effective
• Resilient to failure
• Flexible
• Fast
Versions of Hadoop

Hadoop 1.0
- MapReduce (Cluster Resource Manager & Data Processing)
- HDFS (redundant, reliable storage)

Hadoop 2.0
- MapReduce (Data Processing) and Others (Data Processing)
- YARN (Cluster Resource Manager)
- HDFS (redundant, reliable storage)
Hadoop Ecosystem

• Ambari (Provisioning, Managing & Monitoring Hadoop Cluster)
• Mahout (Machine Learning)
• Pig (Data Flow)
• R (Statistics)
• Hive (Data Warehouse)
• Oozie (Workflow)
• Sqoop (Relational Database Data Collector)
• MapReduce (Distributed Processing)
• HBase (Distributed Table Store)
• Flume/Chukwa (Log Data Collector)
• ZooKeeper (Coordination)
• HDFS (Hadoop Distributed File System)
Hadoop Ecosystem

Components that help with Data Ingestion are:

1. Sqoop
2. Flume
Components that help with Data Processing are:
3. MapReduce
4. Spark
Components that help with Data Analysis are:
5. Pig
6. Hive
7. Impala
Three Differences between HBase and Hadoop/HDFS

• HDFS is the file system, whereas HBase is a Hadoop database. The relationship is analogous to that between NTFS and MySQL.

• HDFS is WORM (write once, read many times). Recent versions support appending data, but this feature is rarely used. HBase, by contrast, supports real-time random reads and writes.

• HDFS is based on the Google File System (GFS), whereas HBase is based on Google Bigtable.
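The access-pattern difference in the second point can be sketched with two toy classes. These are conceptual stand-ins only, not the HDFS or HBase client APIs; the class names, methods, and the choice of `PermissionError` are all assumptions made for this example.

```python
class WriteOnceFile:
    """Toy stand-in for a write-once file: append while open, read-only once closed."""

    def __init__(self):
        self._chunks = []
        self._closed = False

    def append(self, data):
        if self._closed:
            raise PermissionError("file is closed; its contents cannot be modified")
        self._chunks.append(data)

    def close(self):
        self._closed = True

    def read(self):
        return "".join(self._chunks)


class RandomAccessTable:
    """Toy stand-in for a table store: any cell can be read or overwritten by row key."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value):
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column):
        return self._rows.get(row_key, {}).get(column)


log_file = WriteOnceFile()
log_file.append("event-1\n")
log_file.append("event-2\n")
log_file.close()

table = RandomAccessTable()
table.put("user:1", "city", "Pune")
table.put("user:1", "city", "Chennai")  # overwrite in place
```

The file can be read as many times as needed but not changed after it is closed, while the table accepts an update to a single cell at any time.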
Hadoop Ecosystem Components for Data Ingestion

Sqoop:
• Sqoop stands for SQL to Hadoop. It can import data from external systems into HDFS and populate tables in Hive and HBase.
Flume:
• Flume is an important log aggregator in the Hadoop ecosystem: it collects logs from different machines and places them in HDFS.
Hadoop Ecosystem Components for Data Processing

MapReduce:
• It is a programming paradigm that allows distributed and parallel processing of huge datasets. It is based on Google MapReduce.
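The paradigm can be illustrated with the classic word-count example. The sketch below runs the map, shuffle, and reduce steps in a single Python process; it is not Hadoop code, and the function names and input lines are chosen only for this illustration. In a real cluster the map and reduce calls would run in parallel on different machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


lines = ["big data big ideas", "data beats ideas"]
mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across many workers.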

Spark:
• It is both a programming model and a computing model. It is an open-source big data processing framework.
• It is written in Scala. It provides in-memory computing for Hadoop.
• Spark can be used with Hadoop, coexisting smoothly with MapReduce (sitting on top of Hadoop YARN), or used independently of Hadoop (standalone).
Hadoop Ecosystem Components for Data Analysis

Pig:
• It is a high-level scripting language used with Hadoop. It serves as an alternative to writing MapReduce code directly. It has two parts:
  - Pig Latin: a SQL-like scripting language.
  - Pig runtime: the runtime environment.
Hive:
• Hive is a data warehouse software project built on top of Hadoop. The three main tasks performed by Hive are summarization, querying, and analysis.
Impala:
• It is a high-performance SQL engine that runs on a Hadoop cluster. It is ideal for interactive analysis. It has very low latency, measured in milliseconds. It supports a dialect of SQL called Impala SQL.
Answer a few quick questions …

Fill in the blanks

1. The expansion of CAP is _____________, ____________ and ___________________.
2. The expansion of BASE is ___________________.
3. MongoDB is ___________________ and ___________________.
4. Cassandra is ___________________ and ___________________.
5. ___________________ has no support for ACID properties of transactions.
6. ___________________ is a robust database that supports ACID properties of transactions and has the scalability of NoSQL.
Answer Me

• Cite the difference between Hadoop 1.0 and Hadoop 2.0.

• Compare and contrast SQL, NoSQL and NewSQL.

Summary please…

Ask a few participants of the learning program to summarize the lecture.

References …
Further Readings

• http://www.mongodb.com/nosql-explained
• http://nosql-database.org/
• http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html
• http://hadoop.apache.org/
Thank you
