Chapter 6 - Big Data Architecture Part 1

Big Data Architecture
Part 1
Introduction
The non-stop growth of data, the rapid
release of new electronic devices, and the
trend toward data-driven decision-making
in companies are fueling constant demand
for more efficient Big Data processing
systems.

Investment in Big Data architecture
has been growing rapidly in recent years,
and according to Gartner, businesses will
keep investing more in IT in 2018 and 2019,
focusing on IoT, blockchain, and Big Data.
Introduction
Big Data refers to huge amounts of
heterogeneous data from both
traditional and new sources, growing at
a higher rate than ever.

Due to this high heterogeneity, it is a
challenge to build systems that can
centrally process and efficiently analyze
such huge amounts of data, which are
both internal and external to an
organization.
Big Data
Architecture
Definition
“Big data architecture refers to the
logical and physical structure that
dictates how high volumes of data are
ingested, processed, stored, managed,
and accessed.” (OmniSci, 2020)

“A Big Data architecture describes the
blueprint of a system handling massive
volumes of data during their storage,
processing, analysis and visualization.”
(Kalipe & Behera, 2019)
Big Data
Architecture
Definition
The big data architecture framework
serves as a reference blueprint for big
data infrastructures and solutions,
logically defining how big data solutions
will work, the components that will be
used, how information will flow, and
security details.
How to Build a Big
Data Architecture
Designing a big data reference
architecture, while complex, follows the
same general procedure:
Analyze the Problem
Select A Vendor
Deployment Strategy
Capacity Planning
Infrastructure Sizing
Plan for Disaster Recovery
How to Build a Big
Data Architecture
• Analyze the Problem
• First determine if the business does in fact
have a big data problem, taking into
consideration criteria such as data variety,
velocity, and challenges with the current
system.
• Common use cases include data archival,
process offload, data lake implementation,
unstructured data processing, and data
warehouse modernization.
How to Build a Big
Data Architecture
• Select a Vendor
• Hadoop is one of the most widely
recognized big data architecture tools
for managing big data end to end
architecture.
• Popular vendors for Hadoop
distributions include Amazon Web
Services, Cloudera, Hortonworks,
IBM (BigInsights), MapR, and Microsoft.
How to Build a Big
Data Architecture
• Deployment Strategy
• Deployment can be either on-premises,
which tends to be more secure; cloud-
based, which is cost-effective and
provides flexibility regarding scalability;
or a mixed (hybrid) deployment strategy.
How to Build a Big
Data Architecture
• Capacity Planning
• When planning hardware and
infrastructure sizing, consider daily
data ingestion volume, data volume for
one-time historical load, the data
retention period, multi-data center
deployment, and the time period for
which the cluster is sized.
How to Build a Big
Data Architecture
• Infrastructure Sizing
• This is based on capacity planning and
determines the number of
clusters/environment required and the
type of hardware required.
• Consider the type and number of disks
per machine, the type and size of
memory, the number of CPUs and cores,
and the data retained and stored in
each environment.
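As a rough illustration of how the capacity-planning inputs above feed infrastructure sizing, the sketch below estimates total storage and node count. All figures (daily ingestion, retention, replication factor, per-node capacity) are hypothetical; real sizing must also account for compression, temp space, and growth.

```python
def size_cluster(daily_ingest_gb, retention_days, historical_gb,
                 replication=3, node_capacity_gb=4000, headroom=0.25):
    """Estimate storage and node count for a cluster (illustrative only)."""
    raw_gb = daily_ingest_gb * retention_days + historical_gb
    replicated_gb = raw_gb * replication          # e.g. HDFS 3x replication
    total_gb = replicated_gb * (1 + headroom)     # reserve working headroom
    nodes = -(-total_gb // node_capacity_gb)      # ceiling division
    return total_gb, int(nodes)

# Hypothetical scenario: 100 GB/day ingested, 1-year retention,
# plus a 10 TB one-time historical load
total, nodes = size_cluster(100, 365, 10_000)
print(total, nodes)  # 174375.0 44
```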
How to Build a Big
Data Architecture
• Plan for Disaster Recovery
• In developing a backup and disaster
recovery plan, consider the criticality of
data stored, the Recovery Point
Objective and Recovery Time Objective
requirements, backup interval, multi-
datacenter deployment, and whether
Active-Active or Active-Passive disaster
recovery is most appropriate.
General Big Data
architecture
General Big Data
architecture
• Big data solutions typically involve
one or more of the following types of
workload:
• Batch processing of big data sources
at rest.
• Real-time processing of big data in
motion.
• Interactive exploration of big data.
• Predictive analytics and machine
learning.
General Big Data
architecture
• Most big data architectures include
some or all of the following components:
• Data Sources
• Data Storage
• Batch Processing
• Real-time message ingestion
• Stream Processing
• Analytical Data Store
• Analysis & Reporting
• Orchestration
General Big Data
architecture
• Data sources: All big data solutions
start with one or more data sources.
Examples include:
• Application data stores, such as
relational databases.
• Static files produced by applications,
such as web server log files.
• Real-time data sources, such as IoT
devices.
General Big Data
architecture
• Data storage:
• Data for batch processing operations
is typically stored in a distributed file
store that can hold high volumes of
large files in various formats. This
kind of store is often called a data
lake.
• Examples: Azure Data Lake Store or
blob containers in Azure Storage.
General Big Data
architecture
• Batch processing
• Because the data sets are so large, often a big data
solution must process data files using long-running
batch jobs to filter, aggregate, and otherwise
prepare the data for analysis.
• Usually these jobs involve reading source files,
processing them, and writing the output to new files.
• Options include running U-SQL jobs in Azure Data
Lake Analytics, using Hive, Pig, or custom
Map/Reduce jobs in an HDInsight Hadoop cluster, or
using Java, Scala, or Python programs in an
HDInsight Spark cluster.
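The read-process-write cycle described above can be sketched in plain Python. This is only a stand-in for what a Hive, Pig, or Spark job would do at scale; the record format and field names are invented for illustration.

```python
import csv
import io
import collections

def batch_job(source_lines):
    """Filter, aggregate, and prepare raw CSV event records for analysis,
    mimicking the read -> process -> write cycle of a long-running batch job."""
    totals = collections.Counter()
    for row in csv.reader(source_lines):
        event, value = row[0], int(row[1])
        if value >= 0:                      # filter: drop bad records
            totals[event] += value          # aggregate per event type
    # "write the output to new files": emit CSV lines for the next stage
    out = io.StringIO()
    writer = csv.writer(out)
    for event, total in sorted(totals.items()):
        writer.writerow([event, total])
    return out.getvalue()

raw = ["click,3", "view,5", "click,-1", "click,2"]
print(batch_job(raw))  # click,5 then view,5
```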
General Big Data
architecture
• Real-time message ingestion
• If the solution includes real-time sources, the
architecture must include a way to capture and
store real-time messages for stream processing.
• This might be a simple data store, where incoming
messages are dropped into a folder for processing.
• However, many solutions need a message
ingestion store to act as a buffer for messages,
and to support scale-out processing, reliable
delivery, and other message queuing semantics.
Options include Azure Event Hubs, Azure IoT Hubs,
and Kafka.
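The buffering role that an ingestion store such as Kafka or Event Hubs plays can be mimicked in-memory with a thread-safe queue. This is purely illustrative of the decoupling semantics, not of any real broker API; all names are invented.

```python
import queue
import threading

# Stand-in for a message ingestion buffer: producers append messages,
# consumers drain them at their own pace, decoupling ingestion rate
# from processing rate.
buffer = queue.Queue()
received = []

def producer():
    for i in range(5):
        buffer.put({"device": "sensor-1", "reading": i})  # ingest a message
    buffer.put(None)                                      # end-of-stream marker

def consumer():
    while True:
        msg = buffer.get()
        if msg is None:
            break
        received.append(msg)      # hand off to stream processing

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start(); t1.join(); t2.join()
print(len(received))  # 5
```

A real ingestion store adds what the queue lacks: durable storage, scale-out partitioning, and reliable delivery guarantees.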
General Big Data
architecture
• Stream processing
• After capturing real-time messages, the solution
must process them by filtering, aggregating, and
otherwise preparing the data for analysis.
• The processed stream data is then written to an
output sink.
• Example: Azure Stream Analytics provides a
managed stream processing service based on
perpetually running SQL queries that operate on
unbounded streams. You can also use open source
Apache streaming technologies like Storm and
Spark Streaming in an HDInsight cluster.
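The filter-and-aggregate step over an unbounded stream is typically expressed as windowed computation. A generator-based sketch of a tumbling window follows; the window size and event names are arbitrary choices for illustration.

```python
from collections import Counter

def windowed_counts(stream, window_size=3):
    """Aggregate an unbounded stream in fixed-size (tumbling) windows,
    emitting one result per window -- the same shape of computation a
    streaming SQL query in Stream Analytics or Spark Streaming performs."""
    window = Counter()
    for i, event in enumerate(stream, start=1):
        window[event] += 1
        if i % window_size == 0:      # window closes: write to output sink
            yield dict(window)
            window.clear()

events = ["click", "view", "click", "view", "view", "view"]
print(list(windowed_counts(events)))
# [{'click': 2, 'view': 1}, {'view': 3}]
```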
General Big Data
architecture
• Analytical data store:
• Many big data solutions prepare data for analysis and
then serve the processed data in a structured format
that can be queried using analytical tools.
• The analytical data store used to serve these queries
can be a Kimball-style relational data warehouse, as
seen in most traditional business intelligence (BI)
solutions.
• Alternatively, the data could be presented through a
low-latency NoSQL technology such as HBase, or an
interactive Hive database that provides a metadata
abstraction over data files in the distributed data store.
General Big Data
architecture
• Analysis and reporting:
• The goal of most big data solutions is
to provide insights into the data
through analysis and reporting.
• To empower users to analyze the data,
the architecture may include a data
modeling layer, such as a
multidimensional OLAP cube or
tabular data model.
General Big Data
architecture
• Analysis and reporting:
• Analysis and reporting can also take the form
of interactive data exploration by data
scientists or data analysts.
• For these scenarios, many Azure services
support analytical notebooks, such as Jupyter,
enabling these users to leverage their
existing skills with Python or R. For large-
scale data exploration, you can use Microsoft
R Server, either standalone or with Spark.
General Big Data
architecture
• Orchestration
• Most big data solutions consist of repeated data
processing operations, encapsulated in
workflows, that transform source data, move
data between multiple sources and sinks, load
the processed data into an analytical data store,
or push the results straight to a report or
dashboard.
• To automate these workflows, you can use an
orchestration technology such as Azure Data
Factory or Apache Oozie and Sqoop.
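At its core, an orchestrator runs workflow steps in dependency order. The toy scheduler below sketches that idea; the task names and dependency graph are invented, and a real tool adds scheduling, retries, and monitoring on top.

```python
def run_workflow(tasks, deps):
    """Execute tasks in dependency order (a tiny stand-in for what an
    orchestrator like Data Factory or Oozie schedules): each task runs
    only after all tasks it depends on have completed."""
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for upstream in deps.get(name, []):   # run upstream tasks first
            run(upstream)
        tasks[name]()                         # then execute this step
        done.add(name)
        order.append(name)

    for name in tasks:
        run(name)
    return order

log = []
tasks = {
    "ingest":    lambda: log.append("ingest"),
    "transform": lambda: log.append("transform"),
    "load":      lambda: log.append("load"),
    "report":    lambda: log.append("report"),
}
deps = {"transform": ["ingest"], "load": ["transform"], "report": ["load"]}
print(run_workflow(tasks, deps))
# ['ingest', 'transform', 'load', 'report']
```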
The benefits of using an
‘open’ Big Data reference
architecture

• It provides a common language for the
various stakeholders;
• It encourages adherence to common
standards, specifications, and patterns;
• It provides consistent methods for
implementation of technology to solve
similar problem sets;
The benefits of using an
‘open’ Big Data reference
architecture

• It illustrates and improves
understanding of the various Big Data
components, processes, and systems, in
the context of a vendor- and
technology-agnostic Big Data
conceptual model;
• It facilitates analysis of candidate
standards for interoperability,
portability, reusability, and
extendibility.
Big Data Architecture
application in industry
• Use cases of Big Data based on
Architectural components
Types of Big Data
Architecture
• Lambda Architecture
• The lambda architecture is an approach to big
data processing that aims to achieve low
latency updates while maintaining the highest
possible accuracy.
• It is divided into three layers. The first,
the “batch layer,” is composed of a
distributed file system which stores the
entirety of the collected data.
• The same layer stores a set of predefined
functions to be run on the dataset to produce
what is called a batch view. Those views are
stored in a database constituting the “serving
layer,” from which they can be queried
interactively by the user.
Types of Big Data
Architecture
• Lambda Architecture
Types of Big Data
Architecture
• Lambda Architecture
• The third layer, called the “speed layer,”
computes incremental functions on
new data as it arrives in the system.
• It processes only the data generated
between two consecutive batch-view
recomputations, and it produces
real-time views which are also stored
in the serving layer. The different
views are queried together to obtain
the most accurate possible results.
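The interplay of the three layers can be sketched as functions over a shared dataset. All values and names here are illustrative; in practice each layer is a separate distributed system.

```python
# Minimal sketch of the Lambda architecture's three layers.

master_data = [1, 2, 3, 4]       # batch layer: immutable, append-only dataset
batch_view = sum(master_data)    # serving layer: view precomputed over all data
                                 # seen at the last batch recomputation

recent = [5, 6]                  # events that arrived since the last recompute
realtime_view = sum(recent)      # speed layer: incremental computation on
                                 # new data only

# Query time: merge the batch view and the real-time view so the
# answer reflects both historical and just-arrived data.
answer = batch_view + realtime_view
print(answer)  # 21
```

Note how the same computation (here, a sum) is written twice, once per layer; this duplication is exactly the disadvantage the Kappa architecture later removes.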
Types of Big Data
Architecture
• Lambda Architecture advantages:
• provides better accuracy, higher throughput
and lower latency for reads and updates
simultaneously without compromise on data
consistency
• more resilient thanks to the Distributed File
System used to store the master dataset,
mostly because it is less subject to human
errors (such as unintended bulk deletions)
than a traditional RDBMS
• helps achieve the main requirements of a
reliable Big Data system among which are
robustness and fault tolerance provided
through the batch layer.
Types of Big Data
Architecture
• Lambda Architecture disadvantages:
• The different layers of this architecture make it
complex. Synchronization between the layers can be
expensive, so it has to be handled carefully.
• Support and maintenance are difficult because of the
distinct and distributed batch and speed layers.
• Many technologies have emerged that can help in
building a Lambda architecture, but finding people
who have mastered these technologies can be difficult.
• It can be difficult to implement this architecture with
open-source technologies, and the difficulty increases
if it has to be implemented in the cloud.
• Maintaining the code of the architecture is also
difficult, as it has to produce the same results in a
distributed system.
Types of Big Data
Architecture
• Lambda Architecture disadvantages:
• one of the major disadvantages of
this architecture is the need to
maintain two similar code bases: one
in the speed layer and another in the
batch layer to perform the same
computation on different sets of
data.
• That implies redundancy, and it
requires two different skill sets to
write the logic for the streaming
data and for the batch data.
Types of Big Data
Architecture
• Lambda Architecture Implementation
• A particularly suitable application of the
Lambda architecture is found in Log
ingestion and analytics. The reason is that
log messages are immutable and often
generated at a high speed in systems that
need to offer high availability.
• The Lambda Architecture is preferred in
cases where there is an equal need for real-
time/fluid analysis of incoming data and for
periodic analysis of the entire repository of
data collected. Social media and especially
tweets analysis is a perfect example of such
an application.
Types of Big Data
Architecture
• Lambda Architecture Hardware
Requirement
Types of Big Data
Architecture
• Lambda Architecture Software
Requirement
• Batch layer
• The requirements of the batch layer
make Hadoop the most suitable
framework to use for its
implementation. HDFS provides the
perfect append-only technology to
accommodate the master dataset.
Types of Big Data
Architecture
• Lambda Architecture Software
Requirement

• Speed layer. The speed layer can be implemented using real-


time processing tools such as Storm or S4. Spark Streaming can
also be used although it treats data in micro-batches rather than
in real streams. The advantage is that the Spark code can be
reused of in the batch layer

• Serving layer. Any random-access NoSQL database
can host the real-time and batch views. Some
examples are HBase, CouchDB, Voldemort, or
even MongoDB. Cassandra is particularly preferred
because of the fast writes it provides.
Types of Big Data
Architecture
• Lambda Architecture Software
Requirement

• Queuing system. A queuing system is necessary to
ensure asynchronous and fault-tolerant transmission of
the real-time data to the batch and speed layers.
Popular options include Apache Kafka or Flume.
Types of Big Data
Architecture
• Kappa Architecture

• The Kappa architecture was proposed to reduce the
Lambda architecture's overhead of handling two
separate code bases for stream and batch processing.
• Its author, Jay Kreps, observed that the need for a
batch processing system came from the need to reprocess
previously streamed data when the code changed.
In the Kappa architecture the batch layer is removed and
the speed layer is enhanced to offer reprocessing
capabilities.
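In Kappa terms, "reprocessing" simply means replaying the retained event log through a new version of the same streaming code. The sketch below illustrates that single-code-path idea with invented data and logic; in practice the log would be a retained Kafka topic.

```python
# Kappa sketch: one streaming code path; reprocessing is replaying the
# retained log through a new version of the same stream function.

log = [2, 3, 5]                    # the durable, replayable event log

def process_v1(stream):            # original streaming logic
    return sum(stream)

def process_v2(stream):            # code changed: new logic to deploy
    return sum(x * x for x in stream)

live_result = process_v1(log)      # computed as events streamed in
reprocessed = process_v2(log)      # replay the whole log -- no batch layer,
                                   # no second code base
print(live_result, reprocessed)    # 10 38
```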
References
• Kalipe, Godson & Behera, Rajat. (2019). Big Data
Architectures: A Detailed and Application-Oriented
Review.
• OmniSci. (2020). Big Data Architecture.
https://www.omnisci.com/technical-glossary/big-data-architecture
• Microsoft. Big Data Architecture Style.
https://docs.microsoft.com/en-us/azure/architecture/guide/architecture-styles/big-data