Processing Large Data with Apache Spark -- HasGeek
Agenda
• Big Data Overview
• Spark Overview
• Spark Internals
• Spark Libraries
BIG DATA OVERVIEW
Big Data -- Digital Data growth…
V-V-V (Volume, Velocity, Variety)
Legacy Architecture Pain Points
• Report arrival latency is quite high: hours to perform joins and aggregate data
• Existing frameworks cannot do both:
  • either stream processing of 100s of MB/s with low latency
  • or batch processing of TBs of data with high latency
• Expressing business logic in Hadoop MR is challenging
SPARK OVERVIEW
Why Spark?
• Separate, fast, MapReduce-like engine
• In-memory data storage for very fast iterative queries
• Better fault tolerance
• Combines SQL, streaming, and complex analytics
• Runs on Hadoop, Mesos, standalone, or in the cloud
• Data sources: HDFS, Cassandra, HBase, and S3
In Memory - Spark vs Hadoop
• Improves efficiency over MapReduce: 100x in memory, 2-10x on disk
• Up to 40x faster than Hadoop
Spark In & Out
Diagram: data flows in from RDBMSs, streaming sources, and Hadoop Input Formats; Spark exposes Streaming, SQL, GraphX, BlinkDB, and MLlib to applications, with Tachyon for storage; shipped in distributions such as CDH, HDP, MapR, and DSE.
Ref: http://training.databricks.com/intro.pdf
Spark Streaming + SQL
Benchmarking & Best Facts
SPARK INSIDE – AROUND RDD
Resilient Distributed Dataset (RDD)
Immutable + Distributed + Cacheable + Lazily evaluated
• Distributed collections of objects
• Can be cached in memory across cluster nodes
• Manipulated through various parallel operations (sketch below)
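A minimal sketch of these properties in Scala (the input file and the error filter are illustrative, not from the deck):

import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("RddBasics").setMaster("local[*]"))

    // Transformations are lazy: nothing executes until an action is called
    val lines  = sc.textFile("app.log")             // hypothetical input file
    val errors = lines.filter(_.contains("ERROR"))  // new immutable RDD

    errors.cache()                          // keep partitions in memory across nodes
    println(errors.count())                 // action 1: triggers the computation
    errors.take(5).foreach(println)         // action 2: served from the cache

    sc.stop()
  }
}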
RDD Types
RDD Operations
Memory and Persistence (sketch below)
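The deck's notes mention storage levels such as MEMORY_ONLY_2 and MEMORY_AND_DISK_2; a hedged sketch of choosing one explicitly (input file invented, sc from the sketch above):

import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY); the _2 variants
// replicate each partition on two nodes, and MEMORY_AND_DISK spills to disk
// whatever does not fit in memory.
val parsed = sc.textFile("events.csv").map(_.split(","))  // hypothetical input
parsed.persist(StorageLevel.MEMORY_AND_DISK)
println(parsed.count())  // first action materializes and persists the partitions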
Dependency Types
Spark Cluster Overview
o Application
o Driver program
o Cluster manager
o Worker node
o Job
o Stage
o Executor
o Task
Job Flow
Task Scheduler & DAG
• Pipelines functions within a stage
• Cache-aware data reuse & locality
• Partitioning-aware to avoid shuffles

Illustrative Scala-style snippet from the slide, lightly corrected (splitLines and key are placeholder functions):
rdd1.map(splitLines).filter(_.contains("ERROR"))  // map + filter pipeline into a single stage
rdd2.map(splitLines).groupBy(key)                 // groupBy triggers a shuffle
rdd2.join(rdd1).take(10)                          // join of pair RDDs; take(10) is the action
Fault Recovery & Checkpoints
• Efficient fault recovery using lineage: log one operation to apply to many elements
• Recompute lost partitions on failure
• Checkpoint RDDs to prevent long lineage chains during fault recovery (sketch below)
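A hedged sketch of truncating a long lineage chain with a checkpoint (the directory is illustrative; use a fault-tolerant store such as HDFS):

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  // hypothetical directory

var state = sc.parallelize(1 to 1000000)
for (_ <- 1 to 50) {
  state = state.map(_ + 1)  // each iteration grows the lineage chain
}
state.checkpoint()          // mark the RDD to be saved to reliable storage
state.count()               // the action materializes the RDD and truncates lineage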
QUICK DEMO
SPARK STACK DETAILS
Spark SQL
• Seamlessly mix SQL queries with Spark programs (sketch below)
• Load and query data from a variety of sources
• Standard connectivity through JDBC/ODBC
• Hive compatibility
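A hedged sketch of mixing SQL with a program. It uses the SparkSession entry point from Spark 2.x; the deck's Spark 1.x era used SQLContext instead, and the file name is invented:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SqlDemo").master("local[*]").getOrCreate()

// Load from one of many supported sources (JSON here)
val people = spark.read.json("people.json")  // hypothetical file

// Mix declarative SQL with programmatic DataFrame code
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 21").show()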
DataFrames
• A distributed collection of data organized into named columns
• Like a table in a relational database
[Figure from the Spark SQL paper: a JDBC console and user programs (Java, Scala, Python) sit on top of the DataFrame API, which feeds the Catalyst Optimizer and executes over Spark's Resilient Distributed Datasets. Caption: "Figure 1: Interfaces to Spark SQL, and interaction with Spark."]

Excerpt from the paper, section 3.1 DataFrame API: the main abstraction in Spark SQL's API is a DataFrame, a distributed collection of rows with a homogeneous schema. A DataFrame is equivalent to a table in a relational database, and can also be manipulated in similar ways to "native" distributed collections.
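The clipped right-hand column of that excerpt introduces the paper's running example of counting female employees per department; a hedged reconstruction in the DataFrame DSL (table and column names are assumptions, spark from the sketch above):

import org.apache.spark.sql.functions.count

val employees = spark.table("employees")  // assumed columns: deptId, gender, name
val dept      = spark.table("dept")       // assumed columns: id, name

employees
  .join(dept, employees("deptId") === dept("id"))
  .where(employees("gender") === "female")
  .groupBy(dept("id"), dept("name"))
  .agg(count("name"))
  .show()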
SparkR
• New R language API for Spark and Spark SQL
• Exposes existing Spark functionality in an R-friendly syntax via the DataFrame API
Spark Streaming
Inputs: Flume, HDFS, Kinesis, Kafka, Twitter; outputs: file systems, databases, dashboards.
• High-level API: joins, windows, … often 5x less code
• Fault-tolerant: exactly-once semantics, even for stateful ops
• Integration: works with MLlib, SQL, DataFrames, GraphX
Chop up the live stream into batches of X seconds; a DStream is represented by a continuous series of RDDs (sketch below).
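A minimal DStream sketch (the socket source, port, and 5-second batch interval are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[2]: one core for the receiver, one for processing
val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))  // chop the stream into 5s batches

// Each batch of lines becomes one RDD inside the DStream
val lines  = ssc.socketTextStream("localhost", 9999)  // hypothetical source
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)

counts.print()  // emit each batch's results
ssc.start()
ssc.awaitTermination()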
MLlib
• Scalable machine learning library
• Iterative computing -> high-quality algorithms, up to 100x faster than Hadoop
MLlib Algorithms
ML Pipeline (sketch below)
• Feature extraction
• Normalization
• Dimensionality reduction
• Model training
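A hedged sketch with the spark.ml Pipeline API, covering two of the stages above (feature extraction and model training); the column names and toy training set are invented:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Feature extraction: tokenize text, then hash tokens into feature vectors
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

val training = spark.createDataFrame(Seq(
  (0L, "spark is fast", 1.0),
  (1L, "hadoop map reduce", 0.0)
)).toDF("id", "text", "label")

// fit() runs the Estimator stages and returns a reusable PipelineModel
val model = pipeline.fit(training)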
GraphX
• Spark's API for graph and graph-parallel computation
• Graph abstraction: a directed multigraph with properties attached to each vertex and edge
• Works seamlessly with both the graph and collection views (sketch below)
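A hedged sketch of the property-graph abstraction (toy vertices and edges invented; sc is an existing SparkContext):

import org.apache.spark.graphx.{Edge, Graph}

// A directed multigraph with properties on vertices (names) and edges (labels)
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(2L, 3L, "follows"),
  Edge(3L, 1L, "follows")))

val graph = Graph(vertices, edges)

// Collection view and a graph-parallel algorithm on the same structure
println(graph.vertices.count())
graph.pageRank(0.001).vertices.collect().foreach(println)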
GraphX Framework & Algorithms
Spark Packages
Users & Distributors…
Thanks to Apache Spark…
• Started using it in our projects
• Contribute to the open source community
• Socialize Spark
Backup Slides
SPARK CLUSTER
Cluster Support (configuration sketch below)
• Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster
• Apache Mesos – a general cluster manager that can also run Hadoop MapReduce and service applications
• Hadoop YARN – the resource manager in Hadoop 2
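A hedged sketch of pointing the same application at each manager; host names and ports are illustrative:

import org.apache.spark.SparkConf

// The master URL selects the cluster manager; application code is unchanged.
val conf = new SparkConf().setAppName("ClusterDemo")
  .setMaster("spark://master:7077")    // standalone
  // .setMaster("mesos://master:5050") // Apache Mesos
  // .setMaster("yarn")                // Hadoop YARN (reads HADOOP_CONF_DIR)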
Spark on Mesos
Spark on YARN
Data Science Process
Data Science in Practice
• Data Collection
• Munging
• Analysis
• Visualization
• Decision
Real Time Feedback
SQL Optimization (Catalyst)
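A hedged way to inspect Catalyst's work, reusing the people DataFrame from the Spark SQL sketch earlier:

import spark.implicits._  // enables the $"column" syntax

// explain(true) prints the parsed, analyzed, optimized, and physical plans,
// making rules such as predicate pushdown and projection pruning visible.
people.where($"age" > 21).select($"name").explain(true)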
Project Tungsten
• Memory management and binary processing: leveraging application semantics to manage memory explicitly and eliminate the overhead of the JVM object model and garbage collection
• Cache-aware computation: algorithms and data structures to exploit the memory hierarchy
• Code generation: using code generation to exploit modern compilers and CPUs
BDAS - Berkeley Data Analytics Stack
https://amplab.cs.berkeley.edu/software/
BDAS, the Berkeley Data Analytics Stack, is an open source software stack that integrates software components being built by the AMPLab to make sense of Big Data.
Optimization
• groupByKey is costly – prefer reduceByKey() or aggregateByKey(), which combine map-side before the shuffle
• RDD storage level MEMORY_ONLY is preferable
Optimization Code Example
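A hedged sketch contrasting the two approaches with a word count (input file invented; sc is an existing SparkContext):

val words = sc.textFile("app.log").flatMap(_.split(" "))  // hypothetical input

// Costly: groupByKey ships every (word, 1) pair across the network
val slow = words.map(w => (w, 1)).groupByKey().mapValues(_.size)

// Cheaper: reduceByKey combines locally on each partition before the shuffle
val fast = words.map(w => (w, 1)).reduceByKey(_ + _)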
RDDs vs Distributed Shared Memory
DAG Visualization
Spark + Akka + Spray
SparkR Architecture
PySpark
GraphX representation
Links & References
• Spark
• Spark Submit 2015
• Spark External Projects
• Spark Central
Project Tungsten Roadmap
TACHYON
• Tachyon is a memory-centric distributed storage system enabling reliable data sharing at memory speed across cluster frameworks, such as Spark and MapReduce. It achieves high performance by leveraging lineage information and using memory aggressively. Tachyon caches working-set files in memory, thereby avoiding going to disk to load datasets that are frequently read. This enables different jobs/queries and frameworks to access cached files at memory speed.
BlinkDB
Batches…
• Chop up the live stream into batches of X seconds
• Spark treats each batch of data as RDDs and processes them using RDD operations
• Finally, the processed results of the RDD operations are returned in batches
Micro Batch
DStream (Discretized Stream)
A DStream is represented by a continuous series of RDDs
Window Operation & Checkpoint
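A hedged sketch of a sliding window over a keyed DStream such as counts from the word-count sketch earlier (window 30s, slide 10s; the inverse function requires a checkpoint directory):

ssc.checkpoint("hdfs:///tmp/streaming-ckpt")  // hypothetical directory

// Keep a 30-second window of counts, recomputed every 10 seconds.
// The inverse function subtracts the data that leaves the window.
val windowed = counts.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,
  (a: Int, b: Int) => a - b,
  Seconds(30), Seconds(10))
windowed.print()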
Streaming
• Scalable, high-throughput stream processing of live data
• Integrates with many sources
• Fault-tolerant: stateful exactly-once semantics out of the box
• Combines streaming with batch and interactive queries
Spark Streaming
Diagram: receivers ingest data streams, the stream is chopped into batches represented as RDDs, and results are returned as RDDs.
Streaming Fault Tolerance
Spark Streaming UI
Micro Batch (Near Real Time)
Spark with Storm
Spark + Cassandra
Big Data Landscape
100 open source Big Data architecture papers

Editor's Notes

  • #7: https://spark-summit.org/2013/wp-content/uploads/2013/10/Tully-SparkSummit4.pdf
  • #11: SparkR was introduced just a few days ago with Spark 1.4. This is Spark’s first new language API since PySpark was added in 2012. SparkR is based on Spark’s parallel DataFrame abstraction. Users can create SparkR DataFrames from “local” R data frames, or from any Spark data source such as Hive, HDFS, Parquet or JSON. SparkR DataFrames support all Spark DataFrame operations including aggregation, filtering, grouping, summary statistics, and other analytical functions. They also supports mixing-in SQL queries, and converting query results to and from DataFrames. Because SparkR uses the Spark’s parallel engine underneath, operations take advantage of multiple cores or multiple machines, and can scale to data sizes much larger than standalone R programs. The new DataFrames API was inspired by data frames in R and Pandas in Python. DataFrames integrate with Python, Java, Scala and R and give you state of the art optimization through the Spark SQL Catalyst optimizer. DataFrames are just a distributed collection of data organized into named columns and can be made from tables in Hive, external databases or existing RDDs. SQL: Spark’s module for working with structured data made of rows and columns. Spark SQL used to work against a special type of RDD called a SchemaRDD, but that is now being replaced with DataFrames. Spark SQL reuses the Hive frontend and metastore, which gives you full compatibility with existing Hive data, queries and UDFs. This allows you to run unmodified Hive queries on existing data warehouses. There is also standard connectivity through JDBC or ODBC via a Simba driver. Tableau uses this Simba driver to send queries down to Spark SQL to run at scale. Streaming: makes it easy to build scalable fault-tolerant streaming applications with stateful exactly-once symantics out of the box. Streaming allows you to reuse the same code for batch processing and stream processing. In 2012, Spark Streaming was able to process over 60 million records per second on 100 nodes at sub-second processing latency, which makes it 2 – 4x faster than comparable systems like Apache Storm on Yahoo’s S4. Netflix is one of the big users of Spark Streaming. (60 million / 100 = 600k) Spark Streaming is able to process 100,000-500,000 records/node/sec. This is much faster than Storm and comparable to other Stream processing systems. Sigmoid was able to consume 480,000 records per second per node machines using Kafka as a source. Kafka: Kafka basically acts as a buffer for incoming data. It is a high-throughput distributed messaging system. So Kafka maintains feeds of messages in categories called topics that get pushed or published into Kafka by producers. Then consumers like Spark Streaming can subscribe to topics and consume the feed of published messages. Each node in a Kafka cluster is called a broker. More than 75% of the time we see Kafka being used instead of Flume. Flume: Distributed log collection and aggregation service for moving large amounts of log data from many different sources to a centralized data store. So with Flume, data from external sources like web servers is consumed by a Flume source. When a Flume source receives an event, it stores it into one or more channels. The channels will keep the event until its consumed by a Flume sink. So, when Flume pushes the data into the sink, that’s where the data is buffered until Spark Streaming pulls the data from the sink. 
MLlib + GraphX: Mllib is Spark’s scalable machine learning library consisting of common algorithms and utilities including classification, regression, clustering, collaborative filtering, dimensionality reduction. MLlib’s datatypes are vectors and matrices and some of the underlying linear algebra operations on them are provided by Breeze and jblas. The major algorithmic components in Mllib are statistics (like max, min, mean, variance, # of non-zeroes, correlations (Pearson’s and Spearman’s correlations), Stratified Sampling, Hypothesis testing, Random Data Generation, Classification & Regression (like linear models, SVMs, logistic regression, linear regression, naïve Bayes, decision trees, random forests, gradient-boosted trees), Collaborative filtering (ALS), Clustering (K-means), Dimensionality reduction (Singular value decomposition/SVD and Principal Component Analysis/PCA), Feature extraction and transformation, optimization (like stochastic gradient descent and limited memory BFGS). Tachyon: memory based distributed storage system that allows data sharing across cluster frameworks like Spark or Hadoop MapReduce. Project has 60 contributors from 20 institutions. Has a Java-like API similar to that of java.io.File class providing InputStream and OutputStream interfaces. Tachyon also implements the Hadoop FileSystem interface, to allow frameworks that can read from Hadoop Input Formats like MapReduce or Spark to read the data. Tachyon has some interesting features for data in tables… like native support for multi-columned data with the option to put only hot columns in memory to save space. BlinkDB: is an approximate query engine for running interactive SQL queries on large volumes of data. It allows users to trade off query accuracy for response time by running queries on data samples and presenting results annotated with error bars. BlinkDB was demoed in 2012 on a 100 node Amazon EC2 cluster answering a range of queries on 17 TBs of data in less than 2 seconds (which is over 200x faster than Hive) with an error of 2 – 10%. To do this, BlinkDB uses an offline sampling module that creates uniform and stratified samples from underlying data. Two of the big users of BlinkDB are Conviva and Facebook. Tachyon is a memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster frameworks, such as Spark and MapReduce
  • #13: http://opensource.com/business/15/1/apache-spark-new-world-record Organizations from around the world often build dedicated sort machines (specialized software and sometimes specialized hardware) to compete in this benchmark. Spark actually tied for 1st place with a team from University of California San Diego who have been working on creating a specialized sorting system called TritonSort. Winning this benchmark as a general, fault-tolerant system marks an important milestone for the Spark project. It demonstrates that Spark is fulfilling its promise to serve as a faster and more scalable engine for data processing of all sizes, from GBs to TBs to PBs. Named after Jim Gray, the benchmark workload is resource intensive by any measure: sorting 100 TB of data following the strict rules requires reading and writing 500 TB of disk I/O and 200 TB of network I/O (because you have to replicate the output to make it fault tolerant). This was the first time a system based on a public cloud has won. Engineering investment in Spark: sort-based shuffle (SPARK-2045), Netty native network transport (SPARK-2468), external shuffle service (SPARK-3796). Clever application-level techniques: GC- and cache-friendly memory layout, pipelining. More info: http://sortbenchmark.org http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
  • #15: A resilient distributed dataset (RDD) is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat. All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
  • #17: Transformations (eg: map, filter, group by) : Create a new dataset from an existing one Actions ( eg: count, collect, save) : Return a value to the driver program after running a computation on the dataset
  • #18: Spark is persisting (or caching) a dataset in memory across operations MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. Same as the levels above, but replicate each partition on two cluster nodes. Kryo serialization: Spark can also use the Kryo library (version 2) to serialize objects more quickly. Kryo is significantly faster and more compact than Java serialization (often as much as 10x), but does not support all Serializable types and requires you to register the classes you’ll use in the program in advance for best performance. Tachyon is a memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster frameworks, such as Spark and MapReduce
  • #19: The Shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O. To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark’s map and reduce operations. Although the set of elements in each partition of newly shuffled data will be deterministic, and so is the ordering of partitions themselves, the ordering of these elements is not. If one desires predictably ordered data following shuffle then it’s possible to use: mapPartitions to sort each partition using, for example, .sorted repartitionAndSortWithinPartitions to efficiently sort partitions while simultaneously repartitioning sortBy to make a globally ordered RDD
  • #27: The new DataFrames API was inspired by data frames in R and Pandas in Python. DataFrames integrate with Python, Java, Scala and R and give you state of the art optimization through the Spark SQL Catalyst optimizer. DataFrames are just a distributed collection of data organized into named columns and can be made from tables in Hive, external databases or existing RDDs. Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as distributed SQL query engine. A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. Spark SQL supports operating on a variety of data sources through the DataFrame interface. A DataFrame can be operated on as normal RDDs and can also be registered as a temporary table
  • #28: SparkR was introduced just a few days ago with Spark 1.4. This is Spark’s first new language API since PySpark was added in 2012. SparkR is based on Spark’s parallel DataFrame abstraction. Users can create SparkR DataFrames from “local” R data frames, or from any Spark data source such as Hive, HDFS, Parquet or JSON. SparkR DataFrames support all Spark DataFrame operations including aggregation, filtering, grouping, summary statistics, and other analytical functions. They also supports mixing-in SQL queries, and converting query results to and from DataFrames. Because SparkR uses the Spark’s parallel engine underneath, operations take advantage of multiple cores or multiple machines, and can scale to data sizes much larger than standalone R programs.
  • #29: Chop up the live stream into batches of X seconds. Spark treats each batch of data as RDDs and processes them using RDD operations. Finally, the processed results of the RDD operations are returned in batches. Streaming makes it easy to build scalable fault-tolerant streaming applications with stateful exactly-once semantics out of the box, and allows you to reuse the same code for batch processing and stream processing. In 2012, Spark Streaming was able to process over 60 million records per second on 100 nodes at sub-second processing latency, which makes it 2 – 4x faster than comparable systems like Apache Storm on Yahoo's S4. Netflix is one of the big users of Spark Streaming. (60 million / 100 = 600k) Spark Streaming is able to process 100,000-500,000 records/node/sec. This is much faster than Storm and comparable to other stream processing systems. Sigmoid was able to consume 480,000 records per second per node using Kafka as a source.
  • #32: https://databricks.com/blog/2015/01/07/ml-pipelines-a-new-high-level-api-for-mllib.html A pipeline consists of a sequence of stages. There are two basic types of pipeline stages: Transformer and Estimator. A Transformer takes a dataset as input and produces an augmented dataset as output. E.g., a tokenizer is a Transformer that transforms a dataset with text into an dataset with tokenized words. An Estimator must be first fit on the input dataset to produce a model, which is a Transformer that transforms the input dataset. E.g., logistic regression is an Estimator that trains on a dataset with labels and features and produces a logistic regression model. In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variable) denoted X.
  • #33: GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and aggregateMessages) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
  • #34: PageRank measures the importance of each vertex in a graph, assuming an edge from u to v represents an endorsement of v’s importance by u. For example, if a Twitter user is followed by many others, the user will be ranked highly. The connected components algorithm labels each connected component of the graph with the ID of its lowest-numbered vertex. For example, in a social network, connected components can approximate clusters. A vertex is part of a triangle when it has two adjacent vertices with an edge between them.
  • #35: Spark Packages will feature integrations with various data sources, management tools, higher level domain-specific libraries, machine learning algorithms, code samples, and other Spark content.
  • #36: http://www.aerospike.com/blog/what-the-spark-introduction/
  • #44: https://en.wikipedia.org/wiki/Data_analysis
  • #45: http://www.business-software.com/wp-content/uploads/2014/09/Spark-Storm.jpg
  • #46: http://www.slideshare.net/databricks/introducing-dataframes-in-spark-for-large-scale-data-science Logical Optimization The logical optimization phase applies standard rule-based opti- mizations to the logical plan. Constant folding, Predicate pushdown Projection pruning null propagation Boolean ex- pression simplification
  • #47: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
  • #53: http://apachesparkcentral.com/page/7/
  • #56: GraphX adopts a vertex-cut approach to distributed graph partitioning. Rather than splitting graphs along edges, GraphX partitions the graph along vertices which can reduce both the communication and storage overhead.
  • #59: http://tachyon-project.org/
  • #63: https://spark.apache.org/docs/latest/streaming-programming-guide.html
  • #65: Chop up the live stream into batches of X seconds. Spark treats each batch of data as RDDs and processes them using RDD operations. Finally, the processed results of the RDD operations are returned in batches Spark Streaming brings Spark's language-integrated API to stream processing, letting you write streaming applications the same way you write batch jobs. It supports both Java and Scala. Spark Streaming lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state Spark Streaming can read data from HDFS, Flume, Kafka, Twitter and ZeroMQ. Since Spark Streaming is built on top of Spark, users can apply Spark's in-built machine learning algorithms (MLlib), and graph processing algorithms (GraphX) on data streams
  • #67: https://www.sigmoid.com/fault-tolerant-streaming-workflows/
  • #73: https://www.linkedin.com/pulse/100-open-source-big-data-architecture-papers-anil-madan