Chapter 6 - Big Data Architecture Part 1

Big Data Architecture
Part 1
Introduction
The non-stop growth of data, the rapid
release of new electronic devices, and the
trend toward data-driven decision-making
in companies are fueling constant demand
for more efficient Big Data processing
systems.

Investment in Big Data architecture
has been growing rapidly in recent years,
and according to Gartner, businesses will
keep investing more in IT in 2018 and 2019,
focusing on IoT, blockchain, and Big Data.
Introduction
Big Data refers to huge amounts of
heterogeneous data from both
traditional and new sources, growing at
a higher rate than ever.

Due to this high heterogeneity, it is a
challenge to build systems that can
centrally process and efficiently analyze
such huge amounts of data, which are
both internal and external to an
organization.
Big Data
Architecture
Definition
“Big data architecture refers to the
logical and physical structure that
dictates how high volumes of data are
ingested, processed, stored, managed,
and accessed.” (OmniSci, 2020)

“A Big Data architecture describes the
blueprint of a system handling massive
volumes of data during their storage,
processing, analysis and visualization.”
(Kalipe & Behera, 2019)
Big Data
Architecture
Definition
The big data architecture framework
serves as a reference blueprint for big
data infrastructures and solutions,
logically defining how big data solutions
will work, the components that will be
used, how information will flow, and
security details.
How to Build a Big
Data Architecture
Designing a big data reference
architecture, while complex, follows the
same general procedure:
Analyze the Problem
Select A Vendor
Deployment Strategy
Capacity Planning
Infrastructure Sizing
Plan for Disaster Recovery
How to Build a Big
Data Architecture
• Analyze the Problem
• First determine if the business does in fact
have a big data problem, taking into
consideration criteria such as data variety,
velocity, and challenges with the current
system.
• Common use cases include data archival,
process offload, data lake implementation,
unstructured data processing, and data
warehouse modernization.
How to Build a Big
Data Architecture
• Select a Vendor
• Hadoop is one of the most widely
recognized big data architecture tools
for managing big data end to end
architecture.
• Popular vendors for Hadoop
distributions include Amazon Web
Services, Cloudera, Hortonworks,
IBM (BigInsights), MapR, and Microsoft.
How to Build a Big
Data Architecture
• Deployment Strategy
• Deployment can be either on-premises,
which tends to be more secure; cloud-
based, which is cost-effective and
provides flexibility regarding scalability;
or a mixed (hybrid) deployment strategy.
How to Build a Big
Data Architecture
• Capacity Planning
• When planning hardware and
infrastructure sizing, consider daily
data ingestion volume, data volume for
one-time historical load, the data
retention period, multi-data center
deployment, and the time period for
which the cluster is sized.
How to Build a Big
Data Architecture
• Infrastructure Sizing
• This is based on capacity planning and
determines the number of
clusters/environment required and the
type of hardware required.
• Consider the type and number of disks
per machine, the type and size of
memory, the number of CPUs and cores,
and the data retained and stored in
each environment.
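As a rough illustration of how the capacity-planning inputs above feed infrastructure sizing, the sketch below estimates total storage and node count. All figures (daily ingestion, retention, replication factor, per-node capacity) are hypothetical; real sizing must also account for compression, temp space, and growth.

```python
def size_cluster(daily_ingest_gb, retention_days, historical_gb,
                 replication=3, node_capacity_gb=4000, headroom=0.25):
    """Estimate storage and node count for a cluster (illustrative only)."""
    raw_gb = daily_ingest_gb * retention_days + historical_gb
    replicated_gb = raw_gb * replication          # e.g. HDFS 3x replication
    total_gb = replicated_gb * (1 + headroom)     # reserve working headroom
    nodes = -(-total_gb // node_capacity_gb)      # ceiling division
    return total_gb, int(nodes)

# Hypothetical scenario: 100 GB/day ingested, 1-year retention,
# plus a 10 TB one-time historical load
total, nodes = size_cluster(100, 365, 10_000)
print(total, nodes)  # 174375.0 44
```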
How to Build a Big
Data Architecture
• Plan for Disaster Recovery
• In developing a backup and disaster
recovery plan, consider the criticality of
data stored, the Recovery Point
Objective and Recovery Time Objective
requirements, backup interval, multi-
datacenter deployment, and whether
Active-Active or Active-Passive disaster
recovery is most appropriate.
General Big Data
architecture
General Big Data
architecture
• Big data solutions typically involve
one or more of the following types of
workload:
• Batch processing of big data sources
at rest.
• Real-time processing of big data in
motion.
• Interactive exploration of big data.
• Predictive analytics and machine
learning.
General Big Data
architecture
• Most big data architectures include
some or all of the following components:
• Data Sources
• Data Storage
• Batch Processing
• Real-time message ingestion
• Stream Processing
• Analytical Data Store
• Analysis & Reporting
• Orchestration
General Big Data
architecture
• Data sources: All big data solutions
start with one or more data sources.
Examples include:
• Application data stores, such as
relational databases.
• Static files produced by applications,
such as web server log files.
• Real-time data sources, such as IoT
devices.
General Big Data
architecture
• Data storage:
• Data for batch processing operations
is typically stored in a distributed file
store that can hold high volumes of
large files in various formats. This
kind of store is often called a data
lake.
• Examples: Azure Data Lake Store or
blob containers in Azure Storage.
General Big Data
architecture
• Batch processing
• Because the data sets are so large, often a big data
solution must process data files using long-running
batch jobs to filter, aggregate, and otherwise
prepare the data for analysis.
• Usually these jobs involve reading source files,
processing them, and writing the output to new files.
• Options include running U-SQL jobs in Azure Data
Lake Analytics, using Hive, Pig, or custom
Map/Reduce jobs in an HDInsight Hadoop cluster, or
using Java, Scala, or Python programs in an
HDInsight Spark cluster.
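The read-process-write cycle described above can be sketched in plain Python. This is only a stand-in for what a Hive, Pig, or Spark job would do at scale; the record format and field names are invented for illustration.

```python
import csv
import io
import collections

def batch_job(source_lines):
    """Filter, aggregate, and prepare raw CSV event records for analysis,
    mimicking the read -> process -> write cycle of a long-running batch job."""
    totals = collections.Counter()
    for row in csv.reader(source_lines):
        event, value = row[0], int(row[1])
        if value >= 0:                      # filter: drop bad records
            totals[event] += value          # aggregate per event type
    # "write the output to new files": emit CSV lines for the next stage
    out = io.StringIO()
    writer = csv.writer(out)
    for event, total in sorted(totals.items()):
        writer.writerow([event, total])
    return out.getvalue()

raw = ["click,3", "view,5", "click,-1", "click,2"]
print(batch_job(raw))  # click,5 then view,5
```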
General Big Data
architecture
• Real-time message ingestion
• If the solution includes real-time sources, the
architecture must include a way to capture and
store real-time messages for stream processing.
• This might be a simple data store, where incoming
messages are dropped into a folder for processing.
• However, many solutions need a message
ingestion store to act as a buffer for messages,
and to support scale-out processing, reliable
delivery, and other message queuing semantics.
Options include Azure Event Hubs, Azure IoT Hubs,
and Kafka.
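The buffering role that an ingestion store such as Kafka or Event Hubs plays can be mimicked in-memory with a thread-safe queue. This is purely illustrative of the decoupling semantics, not of any real broker API; all names are invented.

```python
import queue
import threading

# Stand-in for a message ingestion buffer: producers append messages,
# consumers drain them at their own pace, decoupling ingestion rate
# from processing rate.
buffer = queue.Queue()
received = []

def producer():
    for i in range(5):
        buffer.put({"device": "sensor-1", "reading": i})  # ingest a message
    buffer.put(None)                                      # end-of-stream marker

def consumer():
    while True:
        msg = buffer.get()
        if msg is None:
            break
        received.append(msg)      # hand off to stream processing

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start(); t1.join(); t2.join()
print(len(received))  # 5
```

A real ingestion store adds what the queue lacks: durable storage, scale-out partitioning, and reliable delivery guarantees.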
General Big Data
architecture
• Stream processing
• After capturing real-time messages, the solution
must process them by filtering, aggregating, and
otherwise preparing the data for analysis.
• The processed stream data is then written to an
output sink.
• Example: Azure Stream Analytics provides a
managed stream processing service based on
perpetually running SQL queries that operate on
unbounded streams. You can also use open source
Apache streaming technologies like Storm and
Spark Streaming in an HDInsight cluster.
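The filter-and-aggregate step over an unbounded stream is typically expressed as windowed computation. A generator-based sketch of a tumbling window follows; the window size and event names are arbitrary choices for illustration.

```python
from collections import Counter

def windowed_counts(stream, window_size=3):
    """Aggregate an unbounded stream in fixed-size (tumbling) windows,
    emitting one result per window -- the same shape of computation a
    streaming SQL query in Stream Analytics or Spark Streaming performs."""
    window = Counter()
    for i, event in enumerate(stream, start=1):
        window[event] += 1
        if i % window_size == 0:      # window closes: write to output sink
            yield dict(window)
            window.clear()

events = ["click", "view", "click", "view", "view", "view"]
print(list(windowed_counts(events)))
# [{'click': 2, 'view': 1}, {'view': 3}]
```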
General Big Data
architecture
• Analytical data store:
• Many big data solutions prepare data for analysis and
then serve the processed data in a structured format
that can be queried using analytical tools.
• The analytical data store used to serve these queries
can be a Kimball-style relational data warehouse, as
seen in most traditional business intelligence (BI)
solutions.
• Alternatively, the data could be presented through a
low-latency NoSQL technology such as HBase, or an
interactive Hive database that provides a metadata
abstraction over data files in the distributed data store.
General Big Data
architecture
• Analysis and reporting:
• The goal of most big data solutions is
to provide insights into the data
through analysis and reporting.
• To empower users to analyze the data,
the architecture may include a data
modeling layer, such as a
multidimensional OLAP cube or
tabular data model.
General Big Data
architecture
• Analysis and reporting:
• Analysis and reporting can also take the form
of interactive data exploration by data
scientists or data analysts.
• For these scenarios, many Azure services
support analytical notebooks, such as Jupyter,
enabling these users to leverage their
existing skills with Python or R. For large-
scale data exploration, you can use Microsoft
R Server, either standalone or with Spark.
General Big Data
architecture
• Orchestration
• Most big data solutions consist of repeated data
processing operations, encapsulated in
workflows, that transform source data, move
data between multiple sources and sinks, load
the processed data into an analytical data store,
or push the results straight to a report or
dashboard.
• To automate these workflows, you can use an
orchestration technology such as Azure Data
Factory or Apache Oozie and Sqoop.
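At its core, an orchestrator runs workflow steps in dependency order. The toy scheduler below sketches that idea; the task names and dependency graph are invented, and a real tool adds scheduling, retries, and monitoring on top.

```python
def run_workflow(tasks, deps):
    """Execute tasks in dependency order (a tiny stand-in for what an
    orchestrator like Data Factory or Oozie schedules): each task runs
    only after all tasks it depends on have completed."""
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for upstream in deps.get(name, []):   # run upstream tasks first
            run(upstream)
        tasks[name]()                         # then execute this step
        done.add(name)
        order.append(name)

    for name in tasks:
        run(name)
    return order

log = []
tasks = {
    "ingest":    lambda: log.append("ingest"),
    "transform": lambda: log.append("transform"),
    "load":      lambda: log.append("load"),
    "report":    lambda: log.append("report"),
}
deps = {"transform": ["ingest"], "load": ["transform"], "report": ["load"]}
print(run_workflow(tasks, deps))
# ['ingest', 'transform', 'load', 'report']
```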
The benefits of using an
‘open’ Big Data reference
architecture

• It provides a common language for the
various stakeholders;
• It encourages adherence to common
standards, specifications, and patterns;
• It provides consistent methods for
implementation of technology to solve
similar problem sets;
The benefits of using an
‘open’ Big Data reference
architecture

• It illustrates and improves
understanding of the various Big Data
components, processes, and systems, in
the context of a vendor- and
technology-agnostic Big Data
conceptual model;
• It facilitates analysis of candidate
standards for interoperability,
portability, reusability, and
extendibility.
Big Data Architecture
application in industry
• Use cases of Big Data based on
Architectural components
Types of Big Data
Architecture
• Lambda Architecture
• The lambda architecture is an approach to big
data processing that aims to achieve low
latency updates while maintaining the highest
possible accuracy.
• It is divided into three layers. The first,
the “batch layer,” is composed of a
distributed file system which stores the
entirety of the collected data.
• The same layer stores a set of predefined
functions to be run on the dataset to produce
what is called a batch view. Those views are
stored in a database constituting the “serving
layer,” from which they can be queried
interactively by the user.
Types of Big Data
Architecture
• Lambda Architecture
Types of Big Data
Architecture
• Lambda Architecture
• The third layer, called the “speed layer,”
computes incremental functions on
new data as it arrives in the system.
• It processes only the data generated
between two consecutive batch-view
recomputations, and it produces
real-time views which are also stored
in the serving layer. The different
views are queried together to obtain
the most accurate possible results.
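The interplay of the three layers can be sketched as functions over a shared dataset. All values and names here are illustrative; in practice each layer is a separate distributed system.

```python
# Minimal sketch of the Lambda architecture's three layers.

master_data = [1, 2, 3, 4]       # batch layer: immutable, append-only dataset
batch_view = sum(master_data)    # serving layer: view precomputed over all data
                                 # seen at the last batch recomputation

recent = [5, 6]                  # events that arrived since the last recompute
realtime_view = sum(recent)      # speed layer: incremental computation on
                                 # new data only

# Query time: merge the batch view and the real-time view so the
# answer reflects both historical and just-arrived data.
answer = batch_view + realtime_view
print(answer)  # 21
```

Note how the same computation (here, a sum) is written twice, once per layer; this duplication is exactly the disadvantage the Kappa architecture later removes.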
Types of Big Data
Architecture
• Lambda Architecture advantages:
• provides better accuracy, higher throughput
and lower latency for reads and updates
simultaneously without compromise on data
consistency
• more resilient thanks to the Distributed File
System used to store the master dataset,
mostly because it is less subject to human
errors (such as unintended bulk deletions)
than a traditional RDBMS
• helps achieve the main requirements of a
reliable Big Data system among which are
robustness and fault tolerance provided
through the batch layer.
Types of Big Data
Architecture
• Lambda Architecture disadvantages:
• The different layers of this architecture make it
complex. Synchronization between the layers can be
expensive, so it has to be handled carefully.
• Support and maintenance are difficult because of the
distinct and distributed batch and speed layers.
• Many technologies have emerged that can help in
building a Lambda architecture, but finding people
who have mastered these technologies can be difficult.
• It can be difficult to implement this architecture with
open-source technologies, and the difficulty increases
if it has to be implemented in the cloud.
• Maintaining the code of the architecture is also
difficult, as it has to produce the same results in a
distributed system.
Types of Big Data
Architecture
• Lambda Architecture disadvantages:
• one of the major disadvantages of
this architecture is the need to
maintain two similar code bases: one
in the speed layer and another in the
batch layer to perform the same
computation on different sets of
data.
• That implies redundancy, and it
requires two different skill sets to
write the logic for the streaming
data and for the batch data.
Types of Big Data
Architecture
• Lambda Architecture Implementation
• A particularly suitable application of the
Lambda architecture is found in Log
ingestion and analytics. The reason is that
log messages are immutable and often
generated at a high speed in systems that
need to offer high availability.
• The Lambda Architecture is preferred in
cases where there is an equal need for real-
time/fluid analysis of incoming data and for
periodic analysis of the entire repository of
data collected. Social media and especially
tweets analysis is a perfect example of such
an application.
Types of Big Data
Architecture
• Lambda Architecture Hardware
Requirement
Types of Big Data
Architecture
• Lambda Architecture Software
Requirement
• Batch layer
• The requirements of the batch layer
make Hadoop the most suitable
framework to use for its
implementation. HDFS provides the
perfect append-only technology to
accommodate the master dataset.
Types of Big Data
Architecture
• Lambda Architecture Software
Requirement

• Speed layer. The speed layer can be implemented using real-


time processing tools such as Storm or S4. Spark Streaming can
also be used although it treats data in micro-batches rather than
in real streams. The advantage is that the Spark code can be
reused of in the batch layer

• Serving layer. Any random-access NoSQL database
can host the real-time and batch views. Some
examples are HBase, CouchDB, Voldemort, or
even MongoDB. Cassandra is particularly preferred
because of the fast writes it provides.
Types of Big Data
Architecture
• Lambda Architecture Software
Requirement

• Queuing system. A queuing system is necessary to
ensure asynchronous and fault-tolerant transmission of
the real-time data to the batch and speed layers.
Popular options include Apache Kafka or Flume.
Types of Big Data
Architecture
• Kappa Architecture

• The Kappa architecture was proposed to reduce the
Lambda architecture's overhead of handling two
separate code bases for stream and batch processing.
• Its author, Jay Kreps, observed that the need for a
batch processing system came from the need to reprocess
previously streamed data when the code changed.
In the Kappa architecture the batch layer is removed and
the speed layer is enhanced to offer reprocessing
capabilities.
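In Kappa terms, "reprocessing" simply means replaying the retained event log through a new version of the same streaming code. The sketch below illustrates that single-code-path idea with invented data and logic; in practice the log would be a retained Kafka topic.

```python
# Kappa sketch: one streaming code path; reprocessing is replaying the
# retained log through a new version of the same stream function.

log = [2, 3, 5]                    # the durable, replayable event log

def process_v1(stream):            # original streaming logic
    return sum(stream)

def process_v2(stream):            # code changed: new logic to deploy
    return sum(x * x for x in stream)

live_result = process_v1(log)      # computed as events streamed in
reprocessed = process_v2(log)      # replay the whole log -- no batch layer,
                                   # no second code base
print(live_result, reprocessed)    # 10 38
```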
References
• Kalipe, Godson & Behera, Rajat. (2019). Big Data
Architectures: A Detailed and Application-Oriented
Review.
• OmniSci. (2020). Big Data Architecture.
https://www.omnisci.com/technical-glossary/big-data-architecture
• Microsoft. Big Data Architecture Style.
https://docs.microsoft.com/en-us/azure/architecture/guide/architecture-styles/big-data