Disaster Recovery for Big Data
About us

We are nerds!

Started working in Big Data for international companies

Founded a start-up a few years ago:
 With colleagues working in related technical areas
 And who also knew business stuff!

We’ve been participating in different Big Data projects
Introduction
“I already have HDFS replication and High Availability in my services, why would I need Disaster Recovery (or backup)?”
Concepts

High Availability (HA)
 Protects from failing components: disks, servers, network
 Is generally a “systems” issue
 Redundant: doubles components
 Generally has strict network requirements
 Fully automated, immediate
Concepts

Backup
 Allows you to go back to a previous state in time: daily, monthly, etc.
 It is a “data” issue
 Protects from accidental deletion or modification
 Also used to check for unwanted modifications
 Takes some time to restore
Concepts

Disaster Recovery
 Allows you to work elsewhere
 It is a “business” issue
 Covers you against main-site failures such as power or network outages, fires, floods or building damage
 Similar to having insurance
 Medium time to be back online
The ideal Disaster Recovery

High Availability for datacenters

Exact duplicate of the main site:
 Seamless operation (no changes required)
 Same performance
 Same data

This is often very expensive and sometimes downright impossible
DR considerations

So, can we build a cheap(ish) DR?

We must evaluate some tradeoffs:
 What’s the cost of the service not being available? (Murphy’s Law: accidents will happen when you are busiest)
 Is all information equally important? Can we lose a small amount of data?
 Can we wait until we recover certain data from backup?
 Can we find other uses for the DR site?
DR considerations

Near or far?
 Availability
 Latency
 Legal considerations
DR considerations

Synchronous vs Asynchronous
 Synchronous replication requires a FAST connection
 Synchronous works at transaction level and is necessary for operational systems
 Asynchronous replication converges over time
 Asynchronous is not affected by delays, nor does it create them
Big Data DR

Data generally can’t be copied synchronously

No VM replication

Other DR rules apply:
 Since it impacts users, someone is in charge of the “starting gun”
 DNS and network changes to repoint clients

Main types:
 Storage replication
 Dual ingestion
Storage replication

Similar to non-Big Data solutions, where central storage is replicated

Generally implemented using distcp and HDFS snapshots

Data is ingested in the source cluster and then copied
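As a minimal sketch of this pattern (hosts, paths and snapshot names below are illustrative, not from the talk): a scheduled job snapshots the source directory and uses distcp’s snapshot-diff mode to ship only the changes since the previous run to the DR cluster.

```python
# A minimal sketch, assuming a snapshottable source path and a DR
# cluster reachable over HDFS; hosts, paths and names are illustrative.
import subprocess
from datetime import datetime, timezone

SRC = "hdfs://prod-nn:8020/data"
DST = "hdfs://dr-nn:8020/data"

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def replicate(prev_snapshot):
    """Snapshot the source, then copy only the delta since the last run."""
    snap = "s" + datetime.now(timezone.utc).strftime("%Y%m%d%H%M")
    run(["hdfs", "dfs", "-createSnapshot", SRC, snap])
    # distcp -diff ships only changes between two snapshots; it needs
    # -update and an unchanged, snapshotted target to apply them to.
    run(["hadoop", "distcp", "-update", "-diff", prev_snapshot, snap, SRC, DST])
    # Snapshot the target so the next incremental run has a baseline.
    run(["hdfs", "dfs", "-createSnapshot", DST, snap])
    return snap
```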
Storage replication

Administrative overhead:
 Copy jobs must be scheduled
 Metadata changes must be tracked

Good enough for data that comes from traditional ETLs, such as daily batches
Dual Ingestion

No files, just streams

Generally ingested from multiple outside sources through Kafka

Streams must be directed to both sites
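A minimal dual-write sketch using confluent-kafka; broker addresses, topic and payload are illustrative assumptions. Each event is produced to both sites, and delivery failures are surfaced for later reconciliation:

```python
from confluent_kafka import Producer

main = Producer({"bootstrap.servers": "kafka-main:9092"})
dr = Producer({"bootstrap.servers": "kafka-dr:9092"})

def report(err, msg):
    # In a real setup a failed leg would alert or be spooled for re-sync.
    if err is not None:
        print(f"delivery failed on {msg.topic()}: {err}")

def ingest(topic, key, value):
    # Produce the same event to both sites.
    for producer in (main, dr):
        producer.produce(topic, key=key, value=value, callback=report)
        producer.poll(0)  # serve delivery callbacks

ingest("events", b"sensor-1", b'{"temp": 21.5}')
main.flush()
dr.flush()
```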
Dual Ingestion

Adds complexity to apps
 NiFi can be set up as a front-end to both endpoints

Data consistency must be checked
 Checks can be automated via monitoring (see the sketch below)
 Consolidation processes (such as a monthly re-sync) might be needed
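One way to automate such a check (a hypothetical probe, assuming both sites host the same topic and partition layout): compare per-partition high watermarks as a cheap divergence signal. Exact equality is not guaranteed across independent clusters, so a mismatch is a trigger for deeper reconciliation, not proof of loss.

```python
from confluent_kafka import Consumer, TopicPartition

def end_offsets(bootstrap, topic):
    c = Consumer({"bootstrap.servers": bootstrap, "group.id": "dr-check"})
    meta = c.list_topics(topic, timeout=10)
    hi = {}
    for p in meta.topics[topic].partitions:
        # High watermark = offset of the next message to be written.
        _, hi[p] = c.get_watermark_offsets(TopicPartition(topic, p), timeout=10)
    c.close()
    return hi

main = end_offsets("kafka-main:9092", "events")
dr = end_offsets("kafka-dr:9092", "events")
for p, n in main.items():
    if n != dr.get(p):
        print(f"partition {p}: main={n} dr={dr.get(p)} -> investigate / re-sync")
```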
Others

Ingestion replication
 A variant of dual ingestion
 A consumer is set up in the source Kafka that in turn writes to a destination Kafka (see the sketch after this list)
 Becomes a bottleneck if the initial streams were generated by many producers

Mixed:
 Previous solutions are not mutually exclusive
 Storage replication for batch processes’ results
 Dual ingestion for streams
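A sketch of that consumer-to-producer bridge (a MirrorMaker-style loop; broker addresses, group id and topic are illustrative). Keys and payloads are forwarded intact so downstream partitioning stays comparable:

```python
from confluent_kafka import Consumer, Producer

src = Consumer({
    "bootstrap.servers": "kafka-main:9092",
    "group.id": "dr-replicator",
    "auto.offset.reset": "earliest",
})
dst = Producer({"bootstrap.servers": "kafka-dr:9092"})

src.subscribe(["events"])
while True:
    msg = src.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        print(msg.error())
        continue
    # Keep key and payload intact so partitioning stays comparable.
    dst.produce(msg.topic(), key=msg.key(), value=msg.value())
    dst.poll(0)
```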
Commercial offerings

Solutions that ease DR setup

Cloudera BDR
 Coordinates HDFS snapshots and copy jobs

WANdisco Fusion
 Continuous storage replication

Confluent Multi-site
 Allows multi-site Kafka data replication
Tips

Big Data clusters have many nodes
 Costly to replicate
 Performance / capacity tradeoff
 We can use cheaper servers in DR, since we don’t expect to use them often
Tips

Document and test procedures
 DR is rarely fully automated, so responsibilities and actions should be clearly defined
 Plan for (at least) a yearly DR run
 Track changes in software and configuration
Tips

Once you have a DR solution, other uses will surface

DR site can be used for backup
 Maintain HDFS snapshots

DR data can be used for testing / reporting
 Warning: it may alter stored data
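As an illustration of the backup use (path, snapshot and file names below are made up): HDFS exposes snapshots read-only under a .snapshot directory, so “going back in time” is just a copy out of that path.

```python
import subprocess

def restore(path, snapshot, filename):
    # Snapshots live under <path>/.snapshot/<name>/ and are read-only.
    src = f"{path}/.snapshot/{snapshot}/{filename}"
    subprocess.run(["hdfs", "dfs", "-cp", src, f"{path}/{filename}"], check=True)

restore("/data", "s202401", "events.parquet")
```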
Conclusions

Balance HA / Backup / DR as needed; they are not exclusive:
 Different costs
 Different impact

Big Data DR is different:
 Dedicated hardware
 No VMs, no storage array (SAN)

Plan for DATA-CENTRIC solutions
Questions

Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
