Introduction to Apache Spark

Basics of Apache Spark and Resilient Distributed Datasets (RDDs)


Introduction to Apache Spark

• Powerful cluster computing engine.
• Builds on the Hadoop ecosystem and adds new kinds of computation, such as:
• Interactive queries
• Stream processing
• The standout feature of Apache Spark is that it offers in-memory cluster computing.
• In-memory cluster computing greatly increases the processing speed of an application.
Introduction to Apache Spark

• Workloads in Apache Spark:
• Streaming and batch applications
• Iterative algorithms
• Interactive queries
• Spark provides high-level APIs in Java, Scala, Python and R; Spark itself is written in Scala.
• Spark also includes a framework for large-scale stream processing (Spark Streaming):
• Scales to hundreds of nodes
• Can achieve second-scale latencies
• Integrates with Spark's batch and interactive processing
• Provides a simple batch-like API for implementing complex algorithms
• Can ingest live data streams from Kafka, Flume, etc.
Why Spark is better than Hadoop
Apache Spark Component

• Apache Spark Core


• It is responsible for all basic I/O functionality.
• It schedules, distributes and monitors the jobs on the cluster.
• It is the foundation for programming with Spark and supports fault recovery.
• Its in-memory computation improves performance and overcomes the main drawbacks of MapReduce.
Apache Spark Component
• Apache Spark SQL
• Spark SQL provides an extensible optimizer for SQL processing and is fully compatible with Hive data.
• It helps in querying and analyzing large datasets stored in Hadoop files.
• Together with the DataFrame API, Spark SQL provides a common way to access several data sources.
• Spark SQL supports the analysis of both structured and semi-structured data.
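• For illustration, a minimal PySpark sketch of querying a DataFrame with Spark SQL (it assumes an active SparkSession named spark and a hypothetical people.json file with name and age fields):

df = spark.read.json("people.json")        # load a hypothetical JSON file into a DataFrame
df.createOrReplaceTempView("people")       # register it as a temporary SQL view
spark.sql("SELECT name, age FROM people WHERE age > 21").show()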
Apache Spark Component
• Apache Spark Streaming
• It allows easy, reliable and fast processing of live data streams.
• In Spark Streaming, streaming data can be combined with historical data, and the same code can be reused for stream and batch processing.
• Spark Streaming provides exactly-once message guarantees and helps recover lost data without requiring extra code or configuration.
• It allows Spark MLlib machine learning pipelines to be plugged into streaming data pathways.
Apache Spark Component

• MLlib (Machine Learning Library)


• It consists of common learning algorithms as well as utilities.
• The library includes classification, regression, clustering and many more.
• It is also capable of performing in-memory data processing, which drastically improves the performance of iterative algorithms.
Apache Spark Component
• GraphX
• It is a graph computation engine built on top of Spark.
• GraphX enables users to build, transform and reason about graph-structured data at scale.
• It ships with a library of common graph algorithms.
• Through GraphX, clustering, classification and pathfinding on graphs are possible, as well as graph-parallel computation.
• Spark therefore has its own graph computation engine for graphs and graph computations.
Apache Spark Component

• The two cluster managers, Apache Mesos and Hadoop YARN, manage the resources of the cluster on which Spark applications run.
Apache Spark Application Flow
Apache Spark RDD
• A Resilient Distributed Dataset (RDD) is Spark's core data abstraction: a collection of elements partitioned across the nodes of the cluster.
• Resilient means that lost partitions can be recomputed from their lineage (history), which makes RDDs fault tolerant.
Hands-on Exercise

There are three ways to create RDDs


1. Parallelizing an existing collection in the driver program
2. Referencing a dataset in an external storage system
3. Transforming an already existing RDD into a new RDD
Creation of RDD

• Using Parallelized collection:


• This is the basic method to create an RDD and is typically used in the very early stages of working with Spark.
• It creates an RDD quickly and lets us start operating on it right away.
• To use this method, the entire dataset must fit on one machine (the driver).
• RDDs are generally created from a parallelized collection, i.e. by taking an existing collection in the program and passing it to SparkContext's parallelize() method.
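• A minimal sketch, assuming an active SparkContext named sc (as in the PySpark shell):

>>> data = [1, 2, 3, 4, 5]
>>> rdd = sc.parallelize(data)   # distribute the local collection as an RDD
>>> rdd.count()
5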
• Once created, an RDD supports two types of operations: Transformations and Actions.

Other ways to create the data (see the sketches below):
• By using the createDataFrame() function
• By using the read and load functions, e.g. reading data from a CSV file
• By reading data from an existing dataset
• By reading a dataset from HDFS
• In each case you then get your RDD (or DataFrame) data back.
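• Hedged PySpark sketches of the creation paths listed above (they assume an active SparkSession spark, a SparkContext sc, and hypothetical file paths):

# From a local collection via createDataFrame()
df = spark.createDataFrame([(1, "spark"), (2, "hadoop")], ["id", "name"])
rdd = df.rdd                                   # the underlying RDD of Row objects

# From a CSV file via the read/load functions
csv_df = spark.read.csv("data.csv", header=True, inferSchema=True)

# From a text file stored in HDFS (hypothetical path and port)
lines = sc.textFile("hdfs://namenode:9000/user/data/input.txt")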
Apache Spark RDD

• We can perform two kinds of operations on an RDD:
• Transformations create a new dataset from an existing one. Because RDDs are immutable, we transform data from one RDD into another rather than modifying it in place.
• Actions are operations that return a value to the driver program. The transformations applied to an RDD are evaluated lazily and are only executed when an action is called.
RDD Operations

• Coarse-grained operations apply an operation to all elements of a dataset at once (across the entire cluster simultaneously); RDDs support coarse-grained transformations.
• Fine-grained operations apply an operation to individual elements or a smaller subset.
• The name RDD itself indicates its properties:
• Resilient – it can recover from losses by recomputing lost partitions.
• Distributed – the data is partitioned across different locations in the cluster.
• Datasets – the group of data on which we perform different operations.
Python Lambda Functions

• Lambda functions are anonymous functions (i.e., they do not have


a name) that are created at runtime.
• They can be used wherever function objects are required and are
syntactically restricted to a single expression. The following
example shows a lambda function that returns the sum of its two
arguments:
lambda a, b: a + b
• Lambdas are defined by the keyword lambda, followed by a
comma-separated list of arguments.
• A colon separates the function declaration from the function
expression.
Apache Spark Transformation Operations

• Map()
• Using map() transformation we take in any function, and
that function is applied to every element of RDD.
>>> data = [1, 2, 3, 4, 5, 6]
>>> rdd = sc.parallelize(data)
>>> map_result = rdd.map(lambda x: x * 2)
>>> map_result.collect()
[2, 4, 6, 8, 10, 12]
Apache Spark Transformation Operations

• Filter()
• The filter(func) function returns a new RDD containing only the
elements of the source that the supplied function returns as true.
The following example returns only the even numbers from the
source RDD:

>>> data = [1, 2, 3, 4, 5, 6]
>>> rdd = sc.parallelize(data)
>>> filter_result = rdd.filter(lambda x: x % 2 == 0)
>>> filter_result.collect()
[2, 4, 6]
Apache Spark Transformation Operations

• Distinct()
• The distinct() method returns a new RDD containing only the distinct
elements from the source RDD. The following example returns the
unique elements in a list:

>>> data = [1, 2, 3, 2, 4, 1]
>>> rdd = sc.parallelize(data)
>>> distinct_result = rdd.distinct()
>>> distinct_result.collect()
[1, 2, 3, 4]
Apache Spark Transformation Operations

• flatMap()
• The flatMap(func) function is similar to the map() function, except it
returns a flattened version of the results. For comparison, the following
examples return the original element from the source RDD and its
square. The example using the map() function returns the pairs as a list
within a list:

>>> data = [1, 2, 3, 4]


>>> rdd = sc.parallelize(data)
>>> map = rdd.map(lambda x: [x, pow(x,2)])
>>> map.collect()
[[1, 1], [2, 4], [3, 9], [4, 16]]
Apache Spark Transformation Operations

• While the flatMap() function concatenates the results, returning a


single list:

>>> data = [1, 2, 3, 4]
>>> rdd = sc.parallelize(data)
>>> flat_map = rdd.flatMap(lambda x: [x, pow(x,2)])
>>> flat_map.collect()
[1, 1, 2, 4, 3, 9, 4, 16]
Apache Spark Action Operations

• Reduce()
• The reduce() method aggregates elements in an RDD using a function,
which takes two arguments and returns one. The function used in the
reduce method is commutative and associative, ensuring that it can be
correctly computed in parallel. The following example returns the
product of all of the elements in the RDD:

>>> data = [2, 3]


>>> rdd = sc.parallelize(data)
>>> rdd.reduce(lambda a, b: a * b)
6
Apache Spark Action Operations

• Take()
• The take(n) method returns an array with the first n elements of the
RDD. The following example returns the first two elements of an RDD:

>>> data = [1, 2, 3]


>>> rdd = sc.parallelize(data)
>>> rdd.take(2)
[1, 2]
Apache Spark Action Operations

• Collect()
• The collect() method returns all of the elements of the RDD as an array.
The following example returns all of the elements from an RDD:
>>> data = [1, 2, 3, 4, 5]
>>> rdd = sc.parallelize(data)
>>> rdd.collect()
[1, 2, 3, 4, 5]
• It is important to note that calling collect() on a large dataset could cause the driver to run out of memory. To inspect a large RDD, use the take() method to retrieve only the first n elements. The following example returns the first 100 elements of the RDD to the driver:
• >>> rdd.take(100)
Apache Spark Action Operations

• takeOrdered()
• The takeOrdered(n, key=func) method returns the first n elements of
the RDD, in their natural order, or as specified by the function func. The
following example returns the first four elements of the RDD in
descending order:
>>> data = [6,1,5,2,4,3]
>>> rdd = sc.parallelize(data)
>>> rdd.takeOrdered(4, lambda s: -s)
[6, 5, 4, 3]
Perform Transformations and Actions on RDD
Apache Spark Transformation Operations

• Filter()
• Spark RDD filter() function returns a new RDD, containing only the
elements that meet a predicate. It is a narrow operation because it does
not shuffle data from one partition to many partitions.
• For example, Suppose RDD contains first five natural numbers (1, 2, 3, 4,
and 5) and the predicate is check for an even number. The resulting RDD
after the filter will contain only the even numbers i.e., 2 and 4.
>>> data = [1, 2, 3, 4, 5, 6]
>>> rdd = sc.parallelize(data)
>>> filter_result = rdd.filter(lambda x: x % 2 == 0)
>>> filter_result.collect()
[2, 4, 6]
Apache Spark Transformation Operations


• union(dataset)
• With the union() function, we get the elements of both RDDs in a new RDD. The key rule is that the two RDDs should be of the same type.
• For example, if the elements of RDD1 are (Spark, Spark, Hadoop, Flink) and those of RDD2 are (Big data, Spark, Flink), then rdd1.union(rdd2) will have the elements (Spark, Spark, Spark, Hadoop, Flink, Flink, Big data).
Apache Spark Transformation Operations

Scala Example: Union()

val rdd1 = spark.sparkContext.parallelize(Seq((1,"jan",2016),(3,"nov",2014),(16,"feb",2014)))
val rdd2 = spark.sparkContext.parallelize(Seq((5,"dec",2014),(17,"sep",2015)))
val rdd3 = spark.sparkContext.parallelize(Seq((6,"dec",2011),(16,"may",2015)))
val rddUnion = rdd1.union(rdd2).union(rdd3)
rddUnion.foreach(println)
Introduction to Apache Spark
Apache Spark Streaming
Spark Streaming

• Spark Streaming is an extension of the core Spark API.
• It enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources such as Kafka, Flume, Kinesis, or TCP sockets.
• Streams can be processed using complex algorithms expressed with high-level functions such as map, reduce, join and window. The processed data can then be pushed out to filesystems.
• In addition, Spark's machine learning and graph processing algorithms can be applied to data streams.
Spark Streaming

• Spark DStream (Discretized Stream) is the basic abstraction of Spark Streaming.
• It is a continuous sequence of Spark RDDs representing a continuous stream of data, where each RDD is Spark's abstraction of an immutable, distributed dataset.
• DStreams can be created from live data sources such as HDFS, Kafka or Flume.
Spark Streaming

• DStreams can also be generated by transforming existing DStreams using operations such as map, window and reduceByKey.
• Each DStream periodically generates a Spark RDD while the Spark Streaming program is running; that RDD is either built from live data or produced by transforming the RDD generated by a parent DStream.
Spark DStream (Discretized Stream)

• A DStream is characterised by a few basic properties:
• A list of other DStreams that the DStream depends on.
• The time interval at which the DStream generates an RDD.
• The function that is used to generate an RDD after each time interval.
Input DStreams and Receivers

• The stream of input data received from a streaming source is represented as a DStream, called an input DStream.
• There are two types of built-in streaming sources:
• Basic sources: sources directly available in the StreamingContext API, such as file systems and socket connections.
• Advanced sources: sources available through extra utility classes, for example Kafka, Flume, Kinesis and many more. These sources require linking against extra dependencies.
Input DStreams and Receivers

• We can receive many streams of data in parallel by creating multiple input DStreams.
• This creates multiple receivers, which receive the data streams simultaneously.
• A Spark executor/worker runs each receiver as a long-running task, so a receiver occupies one of the cores allocated to the Spark Streaming application.
• Importantly, the streaming application must be allocated enough cores to process the received data as well as to run the receivers.
How to create a stream

• Basic Sources
• File Streams
• Streams based on Custom Receivers
• The queue of RDDs as a Stream
• Advanced Sources
• Kafka: Kafka broker version 0.8.2.1 is compatible with Spark Streaming 2.2.0.
• Flume: Flume 1.6.0 is compatible with Spark Streaming 2.2.0.
• Kinesis: Kinesis Client Library 1.2.1 is compatible with Spark Streaming 2.2.0.
Hands-on Exercise

• Running a word count program in Spark Streaming and creating a stream from the console (socket) and from HDFS; a minimal sketch follows.
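• A minimal network word count sketch, assuming an active SparkContext sc and text being typed into a hypothetical socket on localhost:9999 (e.g. started with nc -lk 9999):

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)                         # 1-second batch interval
lines = ssc.socketTextStream("localhost", 9999)       # DStream from the console/socket
# For a file-based stream from HDFS, use e.g. ssc.textFileStream("hdfs://namenode:9000/stream_dir")
words = lines.flatMap(lambda line: line.split(" "))
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts.pprint()                                       # print the first elements of every batch
ssc.start()
ssc.awaitTermination()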
Apache Spark DStream Operations

• Spark DStream supports two types of operations:
• Transformations
• Stateless Transformations
• Processing of each batch does not depend on the data of previous batches; these are simple RDD transformations.
• map(), filter(), reduceByKey(), transform(), etc.
• Stateful Transformations
• These use data or intermediate results from previous batches, e.g. updateStateByKey() and window-based operations (see the sketch below).
• Output operations
• print() (pprint() in Python)
• Save operations such as saveAsTextFiles()
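• A hedged sketch of a stateful transformation using updateStateByKey(), which keeps a running word count across batches (it assumes the words DStream and StreamingContext ssc from the word count sketch above; checkpointing must be enabled for stateful transformations):

ssc.checkpoint("checkpoint")   # required for stateful transformations

def update_count(new_values, running_count):
    # new_values: counts that arrived in the current batch; running_count: the state so far
    return sum(new_values) + (running_count or 0)

running_counts = words.map(lambda w: (w, 1)).updateStateByKey(update_count)
running_counts.pprint()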
Hands-on Exercise

• Demonstrate Apache Spark DStream Operations


Spark Streaming Window Operations

• Spark Streaming takes advantage of windowed computations in Apache Spark: it allows transformations to be applied over a sliding window of data.
Spark Streaming Window Operations

• As the window slides over a source DStream, the source RDDs that fall within the window are combined and operated upon, producing the RDDs of the windowed DStream. In the case illustrated, the operation is applied over the last 3 time units of data and slides by 2 time units.
• Any Spark window operation requires specifying two parameters:
• Window length – the duration of the window (3 in the figure).
• Sliding interval – the interval at which the window operation is performed (2 in the figure).
• These two parameters must be multiples of the batch interval of the source DStream.
Spark Streaming Window Operations

• These operations take two parameters – windowLength and slideInterval.
• window(windowLength, slideInterval)
• Returns a new DStream, computed from windowed batches of the source DStream.
• countByWindow(windowLength, slideInterval)
• Returns a sliding window count of the elements in the stream.
• reduceByWindow(func, windowLength, slideInterval)
• Returns a new single-element stream, created by aggregating the elements in the stream over a sliding interval using func. The function must be commutative and associative so that it can be computed correctly in parallel.
Spark Streaming Window Operations

• reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
• When we call reduceByKeyAndWindow on a DStream of (K, V) pairs, it returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window.
• For grouping, it uses Spark's default number of parallel tasks (2 in local mode; in cluster mode the number is determined by the spark.default.parallelism config property). A different number of tasks can be set by passing the optional numTasks argument.
Spark Streaming Window Operations

• reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])
• This is a more efficient version of the above reduceByKeyAndWindow(): the reduced value of each window is calculated incrementally, using the reduced value of the previous window. It does this by reducing the new data that enters the sliding window and "inverse reducing" the old data that leaves the window (see the sketch below).
• Note: Checkpointing must be enabled for using this operation.
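• A hedged sketch of a windowed word count over the last 30 seconds of data, sliding every 10 seconds (it assumes a pairs DStream of (word, 1) tuples, e.g. built as in the word count sketch, and checkpointing enabled because an inverse function is supplied):

windowed_counts = pairs.reduceByKeyAndWindow(
    lambda a, b: a + b,    # reduce function applied to new data entering the window
    lambda a, b: a - b,    # inverse function applied to old data leaving the window
    30, 10)                # window length and sliding interval, in seconds
windowed_counts.pprint()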
Spark Streaming Window Operations

• countByValueAndWindow(windowLength, slideInterval, [numTasks])
• When we call countByValueAndWindow on a DStream of (K, V) pairs, it returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. It is very similar to the reduceByKeyAndWindow operation, and here too the number of reduce tasks can be configured through an optional argument.
Apache Spark Streaming Checkpoint

• Checkpointing is the process of writing received records to reliable storage such as HDFS at checkpoint intervals. It makes the application resilient to failures unrelated to the application logic and enables fault-tolerant stream processing pipelines.
• In Spark Streaming, DStreams can checkpoint data at specified time intervals. Checkpointing is of two types:
• Metadata checkpointing: used to recover from a failure of the node running the driver of the streaming application. Metadata includes:
• Configuration – the configuration used to create the streaming application.
• DStream operations – the operations that define the streaming application.
• Incomplete batches – jobs that are queued but have not completed yet.
Apache Spark Streaming Checkpoint

• Data checkpointing
• Generated RDDs are saved to reliable storage.
• For some stateful transformations it is necessary to combine data across multiple batches, because the generated RDDs depend on RDDs of previous batches.
• This causes the length of the dependency chain to keep growing with time, and with it the recovery time after a failure.
• Checkpointing the intermediate RDDs of stateful transformations to reliable storage cuts off these dependency chains.
Apache Spark Streaming Checkpoint

• When to enable Checkpointing in Spark Streaming


• When we use stateful transformations
• When recovering from failures of the driver running the application
• Types of Checkpointing in Spark Streaming
• Reliable Checkpointing
• Local Checkpointing
Hands-on Exercise

• Demonstrate Apache Spark Streaming Checkpoints
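• A hedged sketch of enabling checkpointing and recovering the driver from checkpoint data (the HDFS checkpoint directory is hypothetical):

from pyspark.streaming import StreamingContext

checkpoint_dir = "hdfs://namenode:9000/user/checkpoints"     # hypothetical path

def create_context():
    ssc = StreamingContext(sc, 5)          # 5-second batch interval
    ssc.checkpoint(checkpoint_dir)         # enable metadata and data checkpointing
    # ... define the DStream pipeline here ...
    return ssc

# Rebuild the context from checkpoint data if it exists, otherwise create a new one
ssc = StreamingContext.getOrCreate(checkpoint_dir, create_context)
ssc.start()
ssc.awaitTermination()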


Introduction to Apache Spark
Apache Spark MLlib
APACHE SPARK MLlib

• What is MLlib?
• MLlib is a Spark subproject providing machine learning primitives:
• initial contribution from AMPLab, UC Berkeley
• shipped with Spark since version 0.8
• 33 contributors
APACHE SPARK MLlib Tools

• Spark MLlib provides the following tools:


• ML Algorithms: ML Algorithms form the core of MLlib. These include
common learning algorithms such as classification, regression, clustering and
collaborative filtering.
• Featurization: Featurization includes feature extraction, transformation,
dimensionality reduction and selection.
• Pipelines: Pipelines provide tools for constructing, evaluating and tuning ML
Pipelines.
• Persistence: Persistence helps in saving and loading algorithms, models and
Pipelines.
• Utilities: Utilities for linear algebra, statistics and data handling.
MLlib Algorithms

• Algorithms:
• Basic Statistics
• Regression
• Classification
• Recommendation System
• Clustering
• Dimensionality Reduction
• Feature Extraction
• Optimization
Basic Statistics

• Calculating the correlation between two series of data is a common operation in statistics.
• spark.ml provides the flexibility to calculate pairwise correlations among many series.
• The currently supported correlation methods are:
• Pearson's correlation
• Spearman's correlation
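• For example, a minimal spark.ml sketch computing both correlation matrices over a small vector column (it assumes an active SparkSession spark):

from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation

data = [(Vectors.dense([1.0, 0.0, 3.0]),),
        (Vectors.dense([2.0, 5.0, 1.0]),),
        (Vectors.dense([4.0, 10.0, 0.5]),)]
df = spark.createDataFrame(data, ["features"])

pearson = Correlation.corr(df, "features").head()[0]             # Pearson by default
spearman = Correlation.corr(df, "features", "spearman").head()[0]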
Extracting, transforming and selecting
features
• Feature Extractors
• TF-IDF
• Word2Vec
• CountVectorizer
• FeatureHasher
Extracting, transforming and selecting
features
Feature Transformers
• Tokenizer
• Tokenization is the process of taking text (such as a sentence) and
breaking it into individual terms (usually words). A simple Tokenizer class
provides this functionality. The example below shows how to split
sentences into sequences of words.
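• A minimal sketch of the Tokenizer (it assumes an active SparkSession spark):

from pyspark.ml.feature import Tokenizer

sentences = spark.createDataFrame(
    [(0, "Hi I heard about Spark"), (1, "Logistic regression models are neat")],
    ["id", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenized = tokenizer.transform(sentences)
tokenized.select("words").show(truncate=False)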
StopWordsRemover
Stop words are words which should be excluded from the input, typically because
the words appear frequently and don’t carry as much meaning.

Assume that we have the following DataFrame with columns id and raw:

id | raw
----|----------
0 | [I, saw, the, red, baloon]
1 | [Mary, had, a, little, lamb]

Applying StopWordsRemover with raw as the input column and filtered as the output column,
we should get the following:

id | raw | filtered
----|-----------------------------|--------------------
0 | [I, saw, the, red, baloon] | [saw, red, baloon]
1 | [Mary, had, a, little, lamb]|[Mary, little, lamb]

In filtered, the stop words “I”, “the”, “had”, and “a” have been filtered out.
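• A hedged sketch reproducing the example above with StopWordsRemover:

from pyspark.ml.feature import StopWordsRemover

df = spark.createDataFrame(
    [(0, ["I", "saw", "the", "red", "baloon"]),
     (1, ["Mary", "had", "a", "little", "lamb"])],
    ["id", "raw"])
remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
remover.transform(df).show(truncate=False)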
Extracting, transforming and selecting
features
• n-gram
• An n-gram is a sequence of n tokens (typically words) for some integer n.
The NGram class can be used to transform input features into n-grams.

• NGram takes as input a sequence of strings (e.g. the output of a


Tokenizer). The parameter n is used to determine the number of terms
in each n-gram. The output will consist of a sequence of n-grams where
each n-gram is represented by a space-delimited string of n consecutive
words. If the input sequence contains fewer than n strings, no output is
produced.
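• A minimal sketch producing 2-grams (it assumes tokenized is the Tokenizer output from the earlier sketch, with a words array column):

from pyspark.ml.feature import NGram

ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngram.transform(tokenized).select("ngrams").show(truncate=False)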
Other Feature Transformers

• PolynomialExpansion
• Discrete Cosine Transform (DCT)
• StringIndexer
• IndexToString
• Interaction
• Normalizer
• StandardScaler
• MinMaxScaler
• MaxAbsScaler
• Bucketizer
Feature Selectors

• VectorSlicer
• VectorSlicer is a transformer that takes a feature vector and outputs a
new feature vector with a sub-array of the original features. It is useful
for extracting features from a vector column.

userFeatures       | features
-------------------|------------
[0.0, 10.0, 0.5]   | [10.0, 0.5]
["f1", "f2", "f3"] | ["f2", "f3"]
Hands-on Exercise

• Demonstrate MLlib for Basic Statistics
• Correlations
• Hypothesis testing
• Streaming Significance Testing
• How to work with images from HDFS
• Demonstrate MLlib feature selection, feature transformation and dimensionality reduction
• Demonstrate MLlib Classification and Clustering
Introduction to Apache Flink
Apache Flink

• Apache Flink is a next-generation Big Data tool, also known as the 4G of Big Data.
• It is a true stream processing framework (it does not cut streams into micro-batches).
• Flink's kernel (core) is a streaming runtime which also provides distributed processing, fault tolerance, etc.
• Flink processes events at a consistently high speed with low latency.
• It is a large-scale data processing framework which can process data generated at very high velocity.
What is Apache Flink

• Apache Flink is a powerful open-source platform which can address the following types of requirements efficiently:
• Batch Processing
• Interactive processing
• Real-time stream processing
• Graph Processing
• Iterative Processing
• In-memory processing
What is Apache Flink

• Flink is an alternative to MapReduce; for some workloads it can process data up to 100 times faster than MapReduce.
• It is independent of Hadoop, but it can use HDFS to read, write, store and process data.
• Flink does not provide its own data storage system; it takes data from distributed storage.
Apache Flink
How is Apache Flink different from Apache Hadoop
and Apache Spark ?

• Apache Flink uses a Kappa architecture, in which only streams (of data) are used for processing. Hadoop and Spark use a Lambda architecture, where batches (of data) and micro-batches (of streamed data), respectively, are used for processing.
• Cyclic or iterative processes are optimized in Flink, as Flink optimizes join algorithms, operator chaining and the reuse of partitioning and sorting.
How Apache Flink is related/comparable to Apache Hadoop
and Apache Spark ?

• Both Flink and Spark are general-purpose platforms for processing streamed data.
• Hadoop and Spark process data in batches. Flink can also do batch processing, by treating a batch of data simply as a bounded stream.
• Storm and MapReduce code can run on the Flink execution engine through compatibility layers.
• Flink has a machine learning module, FlinkML; Spark has a machine learning module, Spark MLlib.
Apache Flink Ecosystem Components
• The development of Flink started in 2009 at a technical university in Berlin under the name Stratosphere.
• It was incubated in Apache in April 2014 and became a top-level project in December 2014.
• Flink is a German word meaning swift / agile. The logo of Flink is a squirrel, in harmony with the Hadoop ecosystem.
Apache Flink Ecosystem Components
Storage / Streaming Component
• Flink has no storage system of its own; it is just a computation engine. Flink can read and write data from different storage systems and can consume data from streaming systems, including:
• HDFS – Hadoop Distributed File System
• Local-FS – Local File System
• S3 – Simple Storage Service from Amazon
• HBase – NoSQL Database in Hadoop ecosystem
• MongoDB – NoSQL Database
• RDBMS – Any relational database
• Kafka – Distributed messaging Queue
• RabbitMQ – Messaging Queue
• Flume – Data Collection and Aggregation Tool
Apache Flink Ecosystem Components

• The second layer is deployment/resource management. Flink can be deployed in the following modes:
• Local mode – on a single node, in a single JVM
• Cluster – on a multi-node cluster, with one of the following resource managers:
• Standalone – the default resource manager, shipped with Flink
• YARN – a very popular resource manager, part of Hadoop, introduced in Hadoop 2.x
• Mesos – a generalized resource manager
• Cloud – on Amazon or Google cloud
Apache Flink Ecosystem Components

• The next layer is the Runtime – the Distributed Streaming Dataflow, which is also called the kernel of Apache Flink.
• This is the core layer of Flink, which provides distributed processing, fault tolerance, reliability, native iterative processing capability, etc.
Apache Flink Ecosystem Components

DataSet API
• It handles data at rest and allows the user to implement operations like map, filter, join, group, etc. on datasets. It is mainly used for distributed batch processing. In fact, it is a special case of stream processing where the data source is finite; batch applications are also executed on the streaming runtime.
DataStream API
• It handles continuous streams of data. To process live data streams it provides operations like map, filter, update state, window, aggregate, etc. It can consume data from various streaming sources and write data to different sinks. It supports both Java and Scala.
Apache Flink Ecosystem Components
Table
• It enables users to perform ad-hoc analysis using an SQL-like expression language for relational stream and batch processing. It can be embedded in the DataSet and DataStream APIs.
• It saves users from writing complex code to process the data and instead allows them to run SQL queries on top of Flink.
Apache Flink Ecosystem Components

Gelly
• It is the graph processing engine which allows users to run a set of operations to create, transform and process graphs. Gelly also provides a library of common algorithms to simplify the development of graph applications. It leverages Flink's native iterative processing model to handle graphs efficiently. Its APIs are available in Java and Scala.
Apache Flink Ecosystem Components
FlinkML
• It is the machine learning library which provides intuitive APIs and efficient algorithms to handle machine learning applications. It is written in Scala. As machine learning algorithms are iterative in nature, Flink provides native support for iterative algorithms to handle them quite effectively and efficiently.
Flink Architecture
Flink Features
• High performance – Flink's data streaming runtime provides very high throughput.
• Low latency – Flink can process data in the sub-second range without delay.
• Event Time and Out-of-Order Events – Flink supports stream processing and windowing where events arrive delayed or out of order.
• Lightning-fast speed – Flink processes data at lightning-fast speed (hence it is also called the 4G of Big Data).
• Fault Tolerance – Failure of hardware, a node, software or a process doesn't affect the cluster.
• Memory management – Flink works with managed memory and avoids out-of-memory exceptions.
• Broad integration – Flink can be integrated with various storage systems to process their data, and it can be deployed with various resource management tools. It can also be integrated with several BI tools for reporting.
Flink Features
• Stream processing – Flink is a true streaming engine and can process live streams with sub-second latency.
• Program optimizer – Flink ships with an optimizer; programs are optimized before execution.
• Scalable – Flink is highly scalable; with increasing requirements we can scale the Flink cluster.
• Rich set of operators – Flink has many pre-defined operators to process data; all common operations can be done using these operators.
• Exactly-once semantics – Flink can maintain custom state during computation and still provide exactly-once processing guarantees.
• Highly flexible streaming windows – In Flink we can customize windows with flexible triggering conditions to get the required streaming patterns. We can create both time-based windows (e.g. from time t1 to t5) and data-driven windows.
Steps to execute the applications in Flink:
• Program – The developer writes the application program.
• Parse and Optimize – Code parsing, type extraction and optimization are done during this step.
• DataFlow Graph – Each job is converted into a data flow graph.
• Job Manager – The job manager schedules the tasks on the task managers and keeps the data flow metadata. It deploys the operators and monitors the intermediate task results.
• Task Manager – The tasks are executed on the task managers; they are the worker nodes.
Anatomy of a Flink Program

• Flink programs look like regular programs that transform collections of data. Each program consists of the same basic parts:
• Obtain an execution environment,
• Load/create the initial data,
• Specify transformations on this data,
• Specify where to put the results of your computations,
• Trigger the program execution.
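• As a rough illustration only, a sketch of these five steps using PyFlink's DataStream API (the calls are indicative; the course examples may use the Java/Scala APIs instead):

from pyflink.datastream import StreamExecutionEnvironment

# 1. Obtain an execution environment
env = StreamExecutionEnvironment.get_execution_environment()
# 2. Load/create the initial data
ds = env.from_collection(["to be or not to be"])
# 3. Specify transformations on this data
upper = ds.map(lambda line: line.upper())
# 4. Specify where to put the results of the computation
upper.print()
# 5. Trigger the program execution
env.execute("anatomy_sketch")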
FlinkCEP - Complex event processing for Flink
• FlinkCEP is the Complex Event Processing (CEP) library implemented
on top of Flink.
• It allows you to detect event patterns in an endless stream of events,
giving you the opportunity to get hold of what’s important in your
data.
Flink-The Pattern API

• Individual Patterns
• A Pattern can be either a singleton or a looping pattern.
• Singleton patterns accept a single event, while looping patterns can
accept more than one
• By default, a pattern is a singleton pattern and you can transform it
to a looping one by using Quantifiers.
• Each pattern can have one or more Conditions based on which it
accepts events.
Flink CEP Quantifiers

• // expecting 4 occurrences
start.times(4)
• // expecting 0 or 4 occurrences
start.times(4).optional()
• // expecting 2, 3 or 4 occurrences
start.times(2, 4)
• // expecting 2, 3 or 4 occurrences and repeating as many as possible
start.times(2, 4).greedy()
Flink CEP Quantifiers

• // expecting 0, 2, 3 or 4 occurrences
start.times(2, 4).optional()
• // expecting 0, 2, 3 or 4 occurrences and repeating as many as
possible
start.times(2, 4).optional().greedy()
• // expecting 1 or more occurrences
start.oneOrMore()
• // expecting 1 or more occurrences and repeating as many as
possible
start.oneOrMore().greedy()
Flink CEP Quantifiers

• // expecting 2 or more occurrences
start.timesOrMore(2)
• // expecting 2 or more occurrences and repeating as many as possible
start.timesOrMore(2).greedy()
• // expecting 0, 2 or more occurrences
start.timesOrMore(2).optional()
• // expecting 0, 2 or more occurrences and repeating as many as possible
start.timesOrMore(2).optional().greedy()
Flink CEP Conditions

• For every pattern you can specify a condition that an incoming event has to meet in order to be "accepted" into the pattern, e.g. its value should be larger than 5, or larger than the average value of the previously accepted events.
• Iterative Conditions: accept subsequent events based on properties of the previously accepted events or on a statistic over a subset of them.
• Simple Conditions: this type of condition extends the aforementioned IterativeCondition class and decides whether to accept an event or not based only on properties of the event itself.
Apache Flink CEP program to monitor rack
temperatures in a data center

• The Flink program monitors an incoming stream of monitoring events from a data center. The input stream contains events about the temperature and power consumption of the individual racks. Whenever two temperature events that exceed a certain threshold temperature occur within a given interval, a warning is raised. If the system detects two temperature warnings for the same rack with increasing temperatures, it generates an alert for this rack.
Comparison of Tools

• A quick review of Hadoop, Spark and Flink


Hadoop vs Spark vs Flink – Data Processing

• Hadoop: Apache Hadoop is built for batch processing. It takes a large data set as input, all at once, processes it and produces the result. Batch processing is very efficient for processing high volumes of data, but the output is delayed depending on the size of the data and the computational power of the system.
• Spark: Apache Spark is also part of the Hadoop ecosystem. It is a batch processing system at heart, but it also supports stream processing.
• Flink: Apache Flink provides a single runtime for both streaming and batch processing.
Hadoop vs Spark vs Flink – Streaming Engine

• Hadoop: MapReduce is a batch-oriented processing tool. It takes a large data set as input, all at once, processes it and produces the result.
• Spark: Apache Spark Streaming processes data streams in micro-batches. Each batch contains a collection of events that arrived over the batch period. This is not enough for use cases where we need to process large streams of live data and provide results in real time.
• Flink: Apache Flink is a true streaming engine. It uses streams for all workloads: streaming, SQL, micro-batch and batch. A batch is simply a finite set of streamed data.
Hadoop vs Spark vs Flink – Computation
Model

• Hadoop: MapReduce adopted the batch-oriented model. Batch processing operates on data at rest: it takes a large amount of data at once, processes it and then writes out the output.
• Spark: Spark has adopted micro-batching. Micro-batching is essentially a "collect and then process" computational model.
• Flink: Flink has adopted a continuous-flow, operator-based streaming model. A continuous-flow operator processes data as it arrives, without any delay for collecting the data before processing it.
Hadoop vs Spark vs Flink – Performance

• Hadoop: Apache Hadoop supports batch processing only. It doesn't process streamed data, hence its performance is slower when comparing Hadoop vs Spark vs Flink.
• Spark: Apache Spark has an excellent community background and is now considered one of the most mature communities. However, its stream processing is not as efficient as Apache Flink's, since it uses micro-batch processing.
• Flink: The performance of Apache Flink is excellent compared to other data processing systems. Apache Flink uses native closed-loop iteration operators, which make machine learning and graph processing faster when we compare Hadoop vs Spark vs Flink.
Hadoop vs Spark vs Flink – Memory
management

• Hadoop: It provides configurable memory management; this can be done dynamically or statically.
• Spark: It provides configurable memory management. Since the Spark 1.6 release it has moved towards automated memory management.
• Flink: It provides automatic memory management. It has its own memory management system, separate from Java's garbage collector.
Hadoop vs Spark vs Flink – Fault tolerance

• Hadoop: MapReduce is highly fault-tolerant. There is no need to restart the application from scratch in case of a failure in Hadoop.
• Spark: Apache Spark Streaming recovers lost work and, with no extra code or configuration, delivers exactly-once semantics out of the box.
• Flink: The fault-tolerance mechanism followed by Apache Flink is based on Chandy-Lamport distributed snapshots. The mechanism is lightweight, which results in maintaining high throughput rates while providing strong consistency guarantees at the same time.
Hadoop vs Spark vs Flink – Scalability

• Hadoop: MapReduce has incredible scalability potential and has been used in production on tens of thousands of nodes.
• Spark: It is highly scalable; we can keep adding nodes to the cluster. A large known Spark cluster has about 8000 nodes.
• Flink: Apache Flink is also highly scalable; we can keep adding nodes to the cluster. A large known Flink cluster has thousands of nodes.
Hadoop vs Spark vs Flink – Latency

• Hadoop: The MapReduce framework of Hadoop is relatively slower, since it is designed to support different formats, structures and huge volumes of data. That is why Hadoop has higher latency than both Spark and Flink.
• Spark: Apache Spark is another batch processing system, but it is relatively faster than Hadoop MapReduce since it caches much of the input data in memory as RDDs and keeps intermediate data in memory itself, eventually writing the data to disk upon completion or whenever required.
• Flink: With a small configuration effort, Apache Flink's data streaming runtime achieves low latency and high throughput.
Hadoop vs Spark vs Flink – Processing Speed
• Hadoop: MapReduce processes data more slowly than Spark and Flink. The slowness comes from the nature of MapReduce-based execution: it produces a lot of intermediate data, much data is exchanged between nodes, and this causes high disk I/O latency. Furthermore, it has to persist much data to disk for synchronization between phases so that it can support job recovery from failures. Also, there is no way in MapReduce to cache a subset of the data in memory.
• Spark: Apache Spark processes data faster than MapReduce because it caches much of the input data in memory as RDDs and keeps intermediate data in memory itself, eventually writing the data to disk upon completion or whenever required. Spark can be up to 100 times faster than MapReduce, which shows how Spark improves on Hadoop MapReduce.
• Flink: It processes data faster than Spark because of its streaming architecture. Flink increases job performance by only processing the part of the data that has actually changed.
Hadoop vs Spark vs Flink – Recovery

• Hadoop: MapReduce is naturally resilient to system faults or failures. It is a highly fault-tolerant system.
• Spark: Apache Spark RDDs allow recovery of partitions on failed nodes by re-computing the DAG, while also supporting a recovery style more similar to Hadoop's by way of checkpointing, which reduces the dependencies of RDDs.
• Flink: It supports a checkpointing mechanism that stores the position in the data sources and data sinks, the state of windows, as well as user-defined state, so that a streaming job can be recovered after a failure.
Hadoop vs Spark vs Flink – Cost

• Hadoop: MapReduce can typically run on less expensive hardware than some alternatives, since it does not attempt to store everything in memory.
• Spark: Since Spark requires a lot of RAM to run in memory, adding more RAM to the cluster gradually increases its cost.
• Flink: Apache Flink also requires a lot of RAM to run in memory, so its cost also increases gradually.
Hadoop vs Spark vs Flink – Caching and
Hardware Requirements

• Hadoop: MapReduce cannot cache data in memory for future requirements. MapReduce runs very well on commodity hardware.
• Spark: It can cache data in memory for further iterations, which enhances its performance. Apache Spark needs mid- to high-level hardware because of this in-memory caching.
• Flink: It can also cache data in memory for further iterations, which enhances its performance. Apache Flink likewise needs mid- to high-level hardware.
Hadoop vs Spark vs Flink – Windows criteria

• A data stream needs to be grouped into many logical streams, on each of which a window operator can be applied.
• Hadoop: It doesn't support streaming, so there is no need for window criteria.
• Spark: It has time-based window criteria.
• Flink: It has record-based or any custom user-defined window criteria.
Thank You
