Big Data Systems: Fault Tolerance & Analytics
Janardhanan PS
[email protected]
Topics for today
2
Distributed computing – Living with failures
3
Metrics
• MTTF - Mean Time To Failure
• Failure rate = 1 / MTTF (assuming average value over time)
• MTTF = 1 / failure rate = Total #hours of operation / Total #units (each unit run until failure)
• MTTF is an averaged value.
• MTTR - Mean Time to Recovery / Repair
• MTTR = Total #hours for maintenance / Total #repairs
• MTTD - Mean Time to Diagnose
• MTBF - Mean Time Between Failures
• MTBF = MTTD + MTTR + MTTF
https://www.epsilonengineer.com/mtbf-mttr-and-reliability.html
4
Availability
• In reality, the failure rate changes over time because it may depend on the age of the
component.
• Availability = Time system is UP and accessible / Total time observed
• Availability = MTTF / (MTTD* + MTTR + MTTF)
Since MTBF = MTTD + MTTR + MTTF, this simplifies to
• Availability = MTTF / MTBF
• A system is highly available when
• MTTF is high
• MTTR is low
* Unless specified one can assume MTTD = 0
(availabilitydigest.com)
5
Example
• A node in a cluster fails every 100 hours, while other parts never fail.
• On failure of the node, the whole system needs to be shut down and the faulty node replaced.
This takes 2 hours.
• The application then needs to be restarted, which takes another 2 hours.
• What is the availability of the cluster ?
• If downtime is $80k per hour, then what is the yearly cost ?
• Solution
• MTTF = 100 hours
• MTTR = 2 + 2 = 4 hours
• MTBF = MTTR + MTTF = 104 hours (MTTD = 0)
• Availability = MTTF/MTBF = 100/104 = 96.15%
• Cost of downtime per year = 80,000 x (100 - 96.15)/100 x 365 x 24 ≈ USD 27 million
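The same calculation as a small self-contained sketch (class and variable names are illustrative):

public class AvailabilityExample {
    public static void main(String[] args) {
        double mttf = 100.0;                 // hours of operation between failures
        double mttr = 2.0 + 2.0;             // node replacement + application restart
        double mtbf = mttf + mttr;           // MTTD assumed to be 0
        double availability = mttf / mtbf;   // 0.9615...
        double downtimeHoursPerYear = (1 - availability) * 365 * 24;
        double yearlyCost = 80_000 * downtimeHoursPerYear;   // ~USD 27 million
        System.out.printf("Availability = %.2f%%, yearly downtime cost = USD %,.0f%n",
                availability * 100, yearlyCost);
    }
}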
6
Little Maths
Availability
Availability of a module is the percentage of time the system is operational.
A = (1 – (MTTR/MTBF)) x 100%.
Availability is typically specified in "nines" notation: for example, 3-nines availability
corresponds to 99.9% availability and 5-nines availability to 99.999%.
Downtime
Downtime per year is a more intuitive way of understanding availability.
The table below compares availability levels and the corresponding downtime.
Availability          Downtime           Class of system
90% (1-nine)          36.5 days/year
99% (2-nines)         3.65 days/year
99.9% (3-nines)       8.76 hours/year
99.99% (4-nines)      52 minutes/year    Business Critical
99.999% (5-nines)     5 minutes/year     Mission Critical (considered as HA)
99.9999% (6-nines)    31 seconds/year    Fault Tolerant (no restart)
http://www.eventhelix.com/RealtimeMantra/FaultHandling/reliability_availability_basics.htm
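In general, downtime per year follows directly from availability A:
Downtime per year = (1 - A) x 365 x 24 hours; e.g. A = 99.9% gives 0.001 x 8760 ≈ 8.76 hours/year,
matching the table.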
7
The Nines of Availability
8
The 9s Game – Different platforms
9
Reliability - Serial assembly
• MTTF of a system as a function of the MTTFs of components connected serially
• Serial assembly of components
• Failure of any component results in system failure
• Failure rate of C = Failure rate of A + Failure rate of B = 1/ma + 1/mb
• Failure rate of system = SUM(1/MTTFi) over all components i
• MTTF of system = 1 / SUM(1/MTTFi) over all components i
[Diagram: components A (MTTF = ma) and B (MTTF = mb) connected in series form system C]
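Worked example (illustrative numbers): if MTTF of A = 1000 hours and MTTF of B = 500 hours, the
system failure rate is 1/1000 + 1/500 = 3/1000 per hour, so the system MTTF = 1000/3 ≈ 333 hours,
lower than that of either component.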
10
Reliability - Parallel assembly
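(The figure for this slide is not reproduced here. A standard result, assuming independent
components with exponential lifetimes and a system that fails only when all components have
failed: for n identical components each with MTTF m, the system MTTF is
m x (1 + 1/2 + ... + 1/n), e.g. two components in parallel give 1.5m.)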
11
Fault tolerance configurations - standby options
12
Fault tolerance configurations - standby options
[Figure: standby options ranging from active-active to active-passive; cost decreases while
MTTR increases]
Assumption: the same software bug or runtime fault will not recur in the standby
https://www.ibm.com/docs/en/cdfsp/7.6.0?topic=systems-high-availability-models
13
Fault tolerance configurations - cluster topologies
• N+1
• One standby node (S) is configured to take over the role of the primary (P) on failure
• N+M
• M standby nodes, used when multiple failures are anticipated, especially when running
multiple services in the cluster
• N-to-1
• The secondary failover node is a temporary replacement; once the primary is restored,
services must be reactivated on it
• N-to-N
• When any node fails, the workload is redistributed to the remaining active nodes, so there
is no special standby node
https://www.ibm.com/docs/en/cdfsp/7.6.0?topic=systems-high-availability-models
14
Fault Tolerant Clusters – Recovery
• Diagnosis
• Detection of failure and location of the failed component, e.g. using heartbeat
messages between nodes
• Backward (Rollback) recovery - Checkpoints
• Fault tolerance is achieved by periodically saving the processes' consistent states during
failure-free execution to stable storage
• Each saved state is called a checkpoint
• On failure, isolate the failed component, roll back to the last checkpoint and resume
normal operation (a code sketch follows this list)
• Easy to implement and independent of the application, but wastes execution time on
rollback, in addition to checkpointing work that goes unused
• Forward recovery
• Finding a new state from which the system can continue operation
• Real-time or time-critical systems cannot roll back, so state is reconstructed on the fly
from diagnosis data
• Application specific and may need additional hardware
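A minimal backward-recovery sketch in code (file name, state variable and checkpoint interval are
illustrative; real systems checkpoint coordinated, consistent states of many processes):

import java.io.*;

public class CheckpointDemo {
    static final File CHECKPOINT = new File("state.ckpt");

    // Save the current state (here just a counter) to stable storage.
    static void checkpoint(long state) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(CHECKPOINT))) {
            out.writeLong(state);
        }
    }

    // Restore the last checkpointed state, or start fresh if none exists.
    static long restore() throws IOException {
        if (!CHECKPOINT.exists()) return 0;
        try (DataInputStream in = new DataInputStream(new FileInputStream(CHECKPOINT))) {
            return in.readLong();
        }
    }

    public static void main(String[] args) throws IOException {
        long state = restore();          // rollback point after a crash
        while (state < 1_000_000) {
            state++;                     // failure-free execution
            if (state % 100_000 == 0) {
                checkpoint(state);       // periodic checkpoint
            }
        }
        // Work done between the last checkpoint and a crash is lost and redone:
        // that is the wasted execution time on rollback noted above.
    }
}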
15
Topics for today
16
Analytics - Definitions
• Extensive use of data, statistical and quantitative analysis, explanatory and predictive models,
and fact-based management to drive decisions and actions
• Purpose
✓ To unearth hidden patterns
✓ 2 / 100 stores had no sales for a promotion item because it was not on the right shelf
✓ To decipher unknown correlations
✓ The famous “Beer and diapers” story
✓ Understand the rationale behind trends
✓ What do users like about a popular product that has growing sales
✓ Mine useful business information
✓ Popular items during specific holiday sales
• Helps in
✓ Effective marketing
✓ Better customer service and satisfaction
✓ Improved operational efficiency
✓ Competitive advantage over rivals
17
Process of Analysis
Transformation of Data
18
Analytics 1.0, 2.0, 3.0
19
Analytics Maturity and Competitive Advantage
20
Types of Analytics - Descriptive
• Provides the ability to alert, explore and report, using mostly internal and external data
• Business Intelligence (BI) or Performance reporting
• Provides access to historical and current data
• Reports on events and occurrences of the past
• Usually data from legacy systems, ERP and CRM is used for analysis
• Based on relational databases and warehouses
• Structured and not very large data sets
• Sometimes also referred to as Analytics 1.0
• Era : mid 1950s to 2009
• Questions asked
✓ What happened?
✓ Why did it happen? (diagnostic analysis)
E.g. the number of infections is significantly higher this month than last month and is highly
correlated with factor X
21
Types of Analytics - Predictive
• Uses past data to predict the future
• Uses quantitative techniques like segmentation, forecasting etc., but also makes use of
descriptive analytics for data exploration
• Uses technologies like models and rule-based systems
• Based on large data sets gathered over a period of time
• Externally sourced data is also used
• Unstructured data may be included
• Technologies like Hadoop clusters and SQL-on-Hadoop are used
• aka Analytics 2.0
• Era : from 2005 to 2012
• Key questions
✓ What will happen?
✓ Why will it happen?
✓ When will it happen?
E.g. the number of infections will reach X in month Y and the likely cause will be Z
22
Types of Analytics - Prescriptive
• Uses data from the past to predict the future and, at the same time, make recommendations
to leverage the situation to one's advantage
• Suggests optimal behaviors and actions
• Uses a variety of quantitative techniques like optimization, and technologies like models,
machine learning and recommendation engines
• Data is a blend of Big Data and legacy systems, ERP, CRM etc.
• In-memory analysis etc.
• aka Analytics 3.0 = Descriptive + Predictive + Prescriptive
• Era : post 2012
• Questions:
✓ What will happen?
✓ When will it happen?
✓ Why will it happen?
✓ What actions should be taken?
E.g. the number of infections will reach X in month Y and the likely cause will be Z;
W is the best recommended action to keep the number in month Y below X/2
23
Alternative categorization of Analytics
• Basic Analytics
✓Slicing and dicing of data to help with basic insights
✓Reporting on historical data, basic visualizations etc.
• Operationalized Analytics
✓Analytics integrated in business processes
• Advanced Analytics
✓Forecasting the future by predictive modelling
• Monetized Analytics
✓Used for direct business revenue
24
Topics for today
25
Big Data Analytics
• Working with datasets whose huge volume, variety and velocity are beyond the storage and
processing capability of an RDBMS
• Better, faster decisions in real time
• Richer, faster insights into customers, partners and the business
• Technology-enabled Analytics
• IT's collaboration with business users and Data Scientists
• Support for both batch and stream processing of data
What makes you think about this differently ?
26
What Big Data Analytics is NOT
• Only a volume game
• A 'one size fits all!' solution based on RDBMS with shared disk and memory
• Just about technology
• Only meant for big data companies
• Meant to replace RDBMS
• Meant to replace the Data warehouse*
28
Adoption Challenges in Organizations
• Obtaining executive sponsorship for investments in big data and its related activities
• Getting business units to share data / information across organizational silos
• Finding the right skills (Business Analysts/Data Scientists and Data Engineers) to manage
large amounts and varieties of data and create insights from it
• Determining an approach to scale rapidly and elastically, and to address storage and processing
of the large volume, velocity and variety of Big Data
• Deciding whether to use structured or unstructured, internal or external data to make
business decisions
• Choosing optimal way to report findings and analysis of big data
• Determining what to do with the insights created from big data
29
Requirements of Big Data analytics
30
Technology challenges (1)
• Scale
✓ Need storage that can best withstand the large volume, velocity and variety of data
✓ Scale vertically or horizontally ?
✓ RDBMS or NoSQL ?
✓ How does compute scale with storage: coupled or de-coupled, i.e. is it a good idea to put
compute and storage on common nodes ?
• Security *
✓ Most recent NoSQL big data platforms have poor security mechanisms, e.g. challenges:
✓ Fine-grained control in semi-structured data, especially with columnar storage
✓ Options for inconsistent data complicate matters
✓ Larger attack surface across distributed nodes
✓ Encryption is often turned off for performance
✓ Lack of authorization techniques for safeguarding big data
✓ May contain PII data (personally identifiable information)
• Schema
✓ Need dynamic schemas; static / fixed schemas don't fit
• Continuous availability
✓ Needs 24 x 7 x 365 support, as data is continuously generated and needs to be processed
✓ Almost all RDBMS and NoSQL big data platforms have some sort of downtime
✓ Memory cleanup, replica rebalancing, indexing, ...
✓ Most large-scale NoSQL systems also need weekly maintenance
32
Technology challenges (3)
• Consistency
✓ Should one go for strict consistency or eventual consistency? Is the application like social
media comments, or does it need consistent reads ?
• Partition tolerance
✓ A system gets partitioned by hardware / software failures. How do we build partition-tolerant
systems ? When faults happen, is consistent data still available ?
✓ We will discuss the options under the CAP Theorem
• Data quality
✓ How to maintain data quality: data accuracy, completeness, timeliness etc. ?
✓ Do we have appropriate metadata in place, especially with semi-/un-structured data ?
33
Popular technologies
34
Topics for today
35
Common Big Data Platforms and their users
36
Introduction to Hadoop
What problems does Hadoop solve
✓ Storage problems
✓ Multiple partitions of data for parallel access, but more systems mean more failures
✓ Multiple nodes can make the system expensive
✓ Arbitrary data: binary, structured, ...
✓ Storage solution
✓ Replication Factor (RF) for failures: the number of copies of a given data item / data block
stored across the network (worked example below)
✓ Uses commodity heterogeneous hardware
✓ Multiple file formats
✓ Processing problems
✓ Data is spread across systems; how to process it quickly ?
✓ The challenge is to integrate data from different machines before processing
✓ Processing solution
✓ MapReduce programming model to process huge amounts of data with high throughput
✓ Compute is close to storage for handling large data sets
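Worked example for the Replication Factor (hypothetical numbers): with RF = 3 and 128 MB blocks,
a 1 GB file is split into 8 blocks and stored as 8 x 3 = 24 block replicas (3 GB of raw capacity),
so the failure of any single node still leaves at least two copies of every block.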
38
Basic Design Principles of Hadoop
1. All roads lead to scale-out: scale-up architectures are rarely used, and scale-out is the
standard in big data processing.
2. Share nothing: communication and dependencies are bottlenecks; individual components
should be as independent as possible, so that each can proceed regardless of whether
others fail.
3. Expect failure: components will fail at inconvenient times.
4. Smart software, dumb hardware: push the smartness into the software, which is responsible
for allocating work to generic hardware.
5. Move processing, not data: perform processing locally on the data. What moves through the
network are program binaries and status reports, which are dwarfed in size by the actual
data set.
6. Build applications, not infrastructure: data movement, job scheduling, error handling, and
coordination are handled by the platform, letting developers focus on their applications.
39
What’s different from a Distributed Database
• Distributed Databases
• Data model
• Tables and relations (tabular schema)
• Schema is predefined (during write)
• Supports partitioning
• Fast indexed reads
• Read and write many times
• Compute model
• Notion of a transaction (insert, update, delete, commit; rollback on errors)
• ACID properties; consistent states across a transaction
• Allows distributed transactions
• OLTP workloads
40
Hadoop in contrast …
• Data model
• Flat files supporting multiple formats, including binary
• No pre-defined schema (i.e. during write)
• Divides files automatically into DFS blocks
• Handles large files
• Optimized for write-once, read-many-times workloads
• Meant for scan workloads with high throughput
• Compute model
• Notion of a job divided into tasks
• MapReduce compute model: every task is a map or a reduce
• High-latency analytics and data-discovery workloads
[Figure: DFS blocks feeding map tasks, followed by a reduce task that produces the result]
41
Hadoop Distributions
42
Versions of Hadoop
Hadoop 1.0:
MapReduce (Cluster Resource Manager & Data Processing)
HDFS (redundant, reliable storage)
Hadoop 2.0:
MapReduce (Data Processing) | Others (Data Processing)
YARN (Cluster Resource Manager)
HDFS (redundant, reliable storage)
43
Hadoop 2.0 high level architecture
YARN Architecture
44
Advantages of using Hadoop
Scalability – simple to add nodes to the system for parallel processing and storage
45
Hadoop ecosystem
47
Hadoop Ecosystem Components for Data Ingestion
Sqoop:
• Sqoop stands for SQL-to-Hadoop. It can move data from external RDBMS systems onto HDFS
and populate tables in Hive and HBase.
Flume:
• Flume is an important log aggregator (aggregates logs from
different machines and places them in HDFS) component in the
Hadoop Ecosystem.
48
Hadoop Ecosystem Components for Data Processing
MapReduce:
• It is a programming paradigm that allows distributed and parallel processing of
huge datasets. It is based on Google MapReduce.
Spark:
• It is both a programming model as well as a computing model. It is an open
source big data processing framework.
• It is written in Scala. It provides in-memory computing for Hadoop.
• Spark can be used with Hadoop coexisting smoothly with MapReduce (sitting on
top of Hadoop YARN) or used independently of Hadoop (standalone).
49
Hadoop ecosystem components for Data Analysis
Pig
• It is a high level scripting language used with Hadoop. It serves as an alternative to
MapReduce. It has two parts:
• Pig Latin: a SQL-like scripting language.
• Pig runtime: the runtime environment in which Pig Latin scripts execute.
Hive:
• Hive is a data warehouse software project built on top of Hadoop. Three main tasks
performed by Hive are summarization, querying and analysis
Impala:
• It is a high performance SQL engine that runs on Hadoop cluster. It is ideal for
interactive analysis. It has very low latency measured in milliseconds. It supports a
dialect of SQL called Impala SQL.
50
RDBMS Vs Hadoop
51
Hadoop – Different modes of operation
• Standalone Mode
• Runs on a single system in a single JVM (Java Virtual Machine); also called Local mode
• None of the daemons run
• Jobs execute inside the single local JVM; no ResourceManager or NodeManager is needed
• Hadoop is mainly used in this mode for learning, testing, and debugging
• Pseudo-distributed Mode (demo cluster setup)
• Uses only a single node
• All the daemons (NameNode, DataNode, Secondary NameNode, ResourceManager, NodeManager, etc.)
run as separate processes in separate JVMs
• Because the daemons run as different Java processes, it is called pseudo-distributed
• Fully-Distributed Mode
• Hadoop runs on a cluster of machines (nodes)
• A few of the nodes run the master daemons: NameNode and ResourceManager
• The rest of the nodes run the slave daemons: DataNode and NodeManager
• Data is distributed across the different DataNodes
• The production mode of a Hadoop cluster, providing high availability of data through replication
52
Hands on
Hadoop cluster setup in
Pseudo-distributed Mode
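A minimal configuration for this mode, following the standard Apache Hadoop single-node setup
guide (host, port and paths are the usual defaults; adjust for your environment):

etc/hadoop/core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

etc/hadoop/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>  <!-- single node, so keep one replica -->
  </property>
</configuration>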
Map Reduce overview
The term MapReduce actually refers to two separate and distinct tasks that
Hadoop programs perform:
Map & Reduce
• Map : The first job, which takes a data set and converts it into another set of data,
where individual elements are broken down into tuples (key/value pairs).
• Reduce : Takes the output from a map as input and combines those data tuples into a
smaller set of tuples.
54
What the developer does
55
Example - Word count
Input data:
hello world bye world
hello hadoop goodbye hadoop

Mapper (one map task per input split):

public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {
  StringTokenizer itr = new StringTokenizer(value.toString());
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    context.write(word, one);
  }
}

File outputs of the map tasks:
map 1: <hello, 1> <world, 1> <bye, 1> <world, 1>
map 2: <hello, 1> <hadoop, 1> <goodbye, 1> <hadoop, 1>
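For completeness, the matching reducer from the standard Hadoop WordCount example sums the
counts for each key (result is an IntWritable field declared on the reducer class, analogous
to word and one in the mapper):

public void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
  int sum = 0;
  for (IntWritable val : values) {
    sum += val.get();            // add up the 1s emitted by the mappers
  }
  result.set(sum);
  context.write(key, result);    // e.g. <hadoop, 2>, <world, 2>
}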
✓ Its reliance on persistent storage to provide fault tolerance and its one-pass computation model
make MapReduce a poor fit for
▪ low-latency / interactive applications
▪ iterative computations, such as machine learning and graph algorithms
▪ There are extensions for iterative MapReduce, which we will study later
59
In-Memory computing
• In-memory computing
✓ Means using a type of middleware software that allows one to store data in RAM, across a cluster of computers,
and process it in parallel
• For example,
✓ Operational datasets are typically stored in a centralized database; with in-memory computing
they can instead be stored in "connected" RAM across multiple computers.
✓ RAM is roughly 5,000 times faster than traditional spinning disk.
✓ Native support for parallel processing makes it faster still.
Note:
Could be batch or streaming
60
Apache Spark: Fast In-Memory Computing for Your Big Data Applications
61
Fast and easy big data processing with Spark
• At its core, Spark provides a general programming model that enables developers to write
applications by composing arbitrary operators, such as
✓ mappers
✓ reducers
✓ joins
✓ group-bys
✓ filters
• This composition makes it easy to express a wide array of computations, including iterative
machine learning, streaming, complex queries, and batch.
• Spark keeps track of the data that each of the operators produces, and enables applications
to reliably store this data in memory using RDDs*.
✓ This is the key to Spark’s performance, as it allows applications to avoid costly disk accesses.
62
Example - Word count
val counts = sc.textFile("hdfs://...")        // read input from HDFS (the surrounding lines
  .flatMap(line => line.split(" "))           // follow the canonical Spark word count:
  .map(word => (word, 1))                     // map: transform each word into a <k,v> pair
  .reduceByKey(_ + _)                         // reduce: sum the counts per word
counts.saveAsTextFile("hdfs://...")           // write results back to HDFS
63
Ideal Apache Spark applications
• Low-latency computations
✓by caching the working dataset in memory and
then performing computations at memory speeds
• Efficient iterative algorithms
✓by having subsequent iterations share data
through memory, or repeatedly accessing the
same dataset
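A small sketch of this caching idea using Spark's Java API (application name, file path and
iteration count are illustrative):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CacheDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("CacheDemo"));

        // Load once and keep the working set in memory across iterations.
        JavaRDD<String> lines = sc.textFile("hdfs://.../input").cache();

        for (int i = 0; i < 10; i++) {
            // After the first pass these scans run at memory speed, not from
            // disk: this is what makes iterative access cheap in Spark.
            long errors = lines.filter(l -> l.contains("ERROR")).count();
            System.out.println("iteration " + i + ": " + errors);
        }
        sc.close();
    }
}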
64
Hands on with Apache Spark
• Both Hadoop and Spark are fault tolerant: they can handle node failures, data corruption,
and network issues without failing the overall execution.
• However, Hadoop and Spark take different approaches to fault tolerance
• Hadoop
• When any node fails, the workload is redistributed to the remaining active nodes, so there
is no special standby node (N-to-N)
• When a task fails on one node, YARN restarts it (tasks are stateless)
• Relies on data replication and checkpointing to ensure fault tolerance: it duplicates data
blocks across multiple nodes and periodically saves the state of the computation to disk
• HDFS maintains a secondary NameNode with automatic checkpointing of the updates happening
on the NameNode
• Hadoop provides high reliability and durability, but also consumes more disk space and
network bandwidth
• Spark
• Spark provides fault tolerance for RDDs by rebuilding them from their lineage
• Spark relies on data lineage and lazy evaluation: it tracks the dependencies and
transformations of RDDs and only executes them when needed (see the lineage sketch below)
• Spark provides high performance and flexibility, but also requires more memory and
computation power
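A quick way to see the lineage Spark would use to rebuild lost partitions, again via the Java
API (paths and names are illustrative; flatMap is shown with the Spark 2.x signature):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LineageDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("LineageDemo"));
        JavaRDD<String> lines = sc.textFile("hdfs://.../input");
        JavaRDD<String> words = lines.flatMap(l -> Arrays.asList(l.split(" ")).iterator());
        // toDebugString() prints the chain of transformations (the lineage).
        // If a partition of `words` is lost, Spark re-reads the corresponding
        // HDFS block and re-applies flatMap, rather than keeping extra replicas.
        System.out.println(words.toDebugString());
        sc.close();
    }
}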
66
Summary
67
Next Session:
Big Data Lifecycle, CAP Theorem, NoSQL