BDA Lect5 Apache Spark 2023
• The two cluster managers, Apache Mesos and Hadoop YARN, allocate and manage the cluster resources on which Spark applications run.
Apache Spark Application Flow
Apache Spark RDD
• Resilient Distributed Datasets (RDDs) are the core data abstraction and API of Spark. An RDD is a collection of data elements partitioned across the nodes of the cluster.
• Resilient – an RDD can be re-computed from its lineage (history), which makes it fault tolerant.
Hands-on Exercise
Once created, an RDD supports two types of operations: transformations and actions.
• Create a DataFrame by using the createDataFrame() function
• Create a DataFrame by using the read and load functions
• Read data from a dataset
• Read a dataset from HDFS
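A minimal PySpark sketch of these creation methods (the file name and HDFS path are illustrative placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("handson").getOrCreate()
sc = spark.sparkContext

# 1) createDataFrame() from an in-memory collection
df1 = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# 2) read and load functions
df2 = spark.read.format("csv").option("header", "true").load("people.csv")

# 3) read a dataset from HDFS as an RDD of lines
lines = sc.textFile("hdfs://namenode:9000/data/people.csv")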
Apache Spark RDD
• Map()
• The map() transformation takes a function and applies it to every element of the RDD, returning a new RDD with the results.
>>> data = [1, 2, 3, 4, 5, 6]
>>> rdd = sc.parallelize(data)
>>> map_result = rdd.map(lambda x: x * 2)
>>> map_result.collect()
[2, 4, 6, 8, 10, 12]
Apache Spark Transformation Operations
• Filter()
• The filter(func) function returns a new RDD containing only the elements of the source for which the supplied function returns true.
The following example returns only the even numbers from the
source RDD:
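A sketch of such an example (illustrative data, assuming an active SparkContext sc):
>>> data = [1, 2, 3, 4, 5, 6]
>>> rdd = sc.parallelize(data)
>>> rdd.filter(lambda x: x % 2 == 0).collect()
[2, 4, 6]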
• Distinct()
• The distinct() method returns a new RDD containing only the distinct
elements from the source RDD. The following example returns the
unique elements in a list:
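A sketch of such an example (illustrative data; distinct() does not guarantee order, so the result is sorted for display):
>>> data = [1, 2, 3, 2, 4, 1]
>>> rdd = sc.parallelize(data)
>>> sorted(rdd.distinct().collect())
[1, 2, 3, 4]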
• flatMap()
• The flatMap(func) function is similar to the map() function, except it
returns a flattened version of the results. For comparison, the following
examples return the original element from the source RDD and its
square. The example using the map() function returns the pairs as a list
within a list:
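A sketch of the comparison described above (illustrative data):
>>> data = [1, 2, 3, 4]
>>> rdd = sc.parallelize(data)
>>> rdd.map(lambda x: [x, pow(x, 2)]).collect()
[[1, 1], [2, 4], [3, 9], [4, 16]]
>>> rdd.flatMap(lambda x: [x, pow(x, 2)]).collect()
[1, 1, 2, 4, 3, 9, 4, 16]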
• Reduce()
• The reduce() method aggregates the elements of an RDD using a function that takes two arguments and returns one. The function must be commutative and associative so that it can be computed correctly in parallel. The following example returns the product of all of the elements in the RDD:
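A sketch of such an example (illustrative data):
>>> data = [1, 2, 3]
>>> rdd = sc.parallelize(data)
>>> rdd.reduce(lambda a, b: a * b)
6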
• Take()
• The take(n) method returns an array with the first n elements of the
RDD. The following example returns the first two elements of an RDD:
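A sketch of such an example (illustrative data):
>>> data = [1, 2, 3, 4, 5]
>>> rdd = sc.parallelize(data)
>>> rdd.take(2)
[1, 2]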
• Collect()
• The collect() method returns all of the elements of the RDD as an array.
The following example returns all of the elements from an RDD:
>>> data = [1, 2, 3, 4, 5]
>>> rdd = sc.parallelize(data)
>>> rdd.collect()
[1, 2, 3, 4, 5]
• It is important to note that calling collect() on a large dataset can cause the driver to run out of memory. To inspect a large RDD, use the take() method to retrieve only the first n elements. The following example returns the first 100 elements of the RDD to the driver:
• >>> rdd.take(100)
Apache Spark Action Operations
• takeOrdered()
• The takeOrdered(n, key=func) method returns the first n elements of
the RDD, in their natural order, or as specified by the function func. The
following example returns the first four elements of the RDD in
descending order:
>>> data = [6,1,5,2,4,3]
>>> rdd = sc.parallelize(data)
>>> rdd.takeOrdered(4, lambda s: -s)
[6, 5, 4, 3]
Perform Transformations and Actions on RDD
Apache Spark Transformation Operations
• Filter()
• The Spark RDD filter() function returns a new RDD containing only the elements that satisfy a predicate. It is a narrow operation because it does not shuffle data from one partition to many partitions.
• For example, suppose an RDD contains the first five natural numbers (1, 2, 3, 4, and 5) and the predicate checks for even numbers. The resulting RDD after the filter will contain only the even numbers, i.e., 2 and 4.
>>> data = [1, 2, 3, 4, 5, 6]
>>> rdd = sc.parallelize(data)
>>> filter_result = rdd.filter(lambda x: x % 2 == 0)
>>> filter_result.collect()
[2, 4, 6]
Apache Spark Transformation Operations
• union(dataset)
• The union() function returns a new RDD containing the elements of both source RDDs. The two RDDs must be of the same type.
• For example, if the elements of rdd1 are (Spark, Spark, Hadoop, Flink) and those of rdd2 are (Big data, Spark, Flink), then rdd1.union(rdd2) will contain the elements (Spark, Spark, Spark, Hadoop, Flink, Flink, Big data).
Apache Spark Transformation Operations
val rdd1 = spark.sparkContext.parallelize(Seq((1,"jan",2016),(3,"nov",2014),(16,"feb",2014)))
val rdd2 = spark.sparkContext.parallelize(Seq((5,"dec",2014),(17,"sep",2015)))
val rdd3 = spark.sparkContext.parallelize(Seq((6,"dec",2011),(16,"may",2015)))
val rddUnion = rdd1.union(rdd2).union(rdd3)
rddUnion.foreach(println)
Introduction to Apache Spark
Apache Spark Streaming
Spark Streaming
• Basic Sources
• File streams
• Streams based on custom receivers
• A queue of RDDs as a stream
• Advanced Sources
• Kafka: Kafka broker version 0.8.2.1 or higher is compatible with Spark Streaming 2.2.0.
• Flume: Flume 1.6.0 is compatible with Spark Streaming 2.2.0.
• Kinesis: Kinesis Client Library 1.2.1 is compatible with Spark Streaming 2.2.0.
Hands-on Exercise
• Stateful Transformations
• Output operations
• Print()
• Save()
Hands-on Exercise
• As the window slides over a source DStream, the source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream. In this specific case, the operation is applied over the last 3 time units of data and slides by 2 time units.
• Any Spark window operation requires specifying two parameters:
• Window length – the duration of the window (3 in the figure).
• Sliding interval – the interval at which the window operation is performed (2 in the figure).
• Both parameters must be multiples of the batch interval of the source DStream.
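A sketch in PySpark (assuming a StreamingContext ssc with a 10-second batch interval; the socket source is illustrative):
>>> lines = ssc.socketTextStream("localhost", 9999)
>>> # window length = 30 s, slide interval = 20 s; both are multiples of the 10 s batch interval
>>> windowed = lines.window(30, 20)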
Spark Streaming Window Operations
• ReduceByKeyAndWindow(func, windowLength,
slideInterval, [numTasks])
• When we call reduceByKeyAndWindow on a DStream of (K, V) pairs, it returns a new DStream of (K, V) pairs in which the values of each key are aggregated using the given reduce function func over batches in a sliding window.
• By default it uses Spark's default number of parallel tasks for grouping (2 in local mode; in cluster mode the number is determined by the spark.default.parallelism configuration property). Pass the optional numTasks argument to set a different number of tasks.
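A sketch in PySpark, continuing the lines DStream from the previous sketch (the word-count pairs are illustrative):
>>> pairs = lines.flatMap(lambda line: line.split(" ")).map(lambda w: (w, 1))
>>> # sum the counts of each word over the last 30 s, sliding every 10 s;
>>> # the inverse function (a - b) enables incremental window updates and requires checkpointing
>>> windowed_counts = pairs.reduceByKeyAndWindow(lambda a, b: a + b, lambda a, b: a - b, 30, 10)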
Spark Streaming Window Operations
• CountByValueAndWindow(windowLength, slideInterval,
[numTasks])
• When we call countByValueAndWindow on a DStream of (K, V) pairs, it returns a new DStream of (K, Long) pairs, where the value of each key is its frequency within a sliding window. It is very similar to the reduceByKeyAndWindow operation, and here too the number of reduce tasks is configurable through an optional argument.
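A sketch in PySpark, again using the lines DStream from the earlier sketch:
>>> words = lines.flatMap(lambda line: line.split(" "))
>>> # frequency of each word within a 30 s window, sliding every 10 s
>>> word_freq = words.countByValueAndWindow(30, 10)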
Apache Spark Streaming Checkpoint
• Data checkpointing
• The generated RDDs are saved to reliable storage.
• Some stateful transformations combine data across multiple batches, so the RDDs they generate depend on RDDs of previous batches.
• This causes the length of the dependency chain to keep increasing with time, and recovery time grows with it.
• To avoid such unbounded growth in recovery time, intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage, which cuts off the dependency chains.
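A minimal sketch of enabling checkpointing in PySpark Streaming (the batch interval and HDFS path are illustrative):
>>> from pyspark.streaming import StreamingContext
>>> ssc = StreamingContext(sc, 10)
>>> ssc.checkpoint("hdfs://namenode:9000/spark/checkpoints")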
• What is MLlib?
• MLlib is a Spark subproject providing machine learning primitives:
• initial contribution from AMPLab, UC Berkeley
• shipped with Spark since version 0.8
• 33 contributors
APACHE SPARK MLlib Tools
• Algorithms:
• Basic Statistics
• Regression
• Classification
• Recommendation System
• Clustering
• Dimensionality Reduction
• Feature Extraction
• Optimization
Basic Statistics
Assume that we have the following DataFrame with columns id and raw:
id | raw
----|----------
0 | [I, saw, the, red, baloon]
1 | [Mary, had, a, little, lamb]
Applying StopWordsRemover with raw as the input column and filtered as the output column,
we should get the following:
id | raw | filtered
----|-----------------------------|--------------------
0 | [I, saw, the, red, baloon] | [saw, red, baloon]
1 | [Mary, had, a, little, lamb]|[Mary, little, lamb]
In filtered, the stop words “I”, “the”, “had”, and “a” have been filtered out.
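A minimal PySpark sketch of this example:
>>> from pyspark.ml.feature import StopWordsRemover
>>> sentenceData = spark.createDataFrame([
...     (0, ["I", "saw", "the", "red", "baloon"]),
...     (1, ["Mary", "had", "a", "little", "lamb"])
... ], ["id", "raw"])
>>> remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
>>> remover.transform(sentenceData).show(truncate=False)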
Extracting, transforming and selecting features
• n-gram
• An n-gram is a sequence of n tokens (typically words) for some integer n. The NGram class can be used to transform input features into n-grams (see the sketch after this list).
• PolynomialExpansion
• Discrete Cosine Transform (DCT)
• StringIndexer
• IndexToString
• Interaction
• Normalizer
• StandardScaler
• MinMaxScaler
• MaxAbsScaler
• Bucketizer
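A minimal PySpark sketch of the NGram transformer mentioned above (the input data is illustrative):
>>> from pyspark.ml.feature import NGram
>>> wordDataFrame = spark.createDataFrame([
...     (0, ["Hi", "I", "heard", "about", "Spark"])
... ], ["id", "words"])
>>> ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
>>> ngram.transform(wordDataFrame).select("ngrams").show(truncate=False)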
Feature Selectors
• VectorSlicer
• VectorSlicer is a transformer that takes a feature vector and outputs a
new feature vector with a sub-array of the original features. It is useful
for extracting features from a vector column.
userFeatures       | features
-------------------|------------------
[0.0, 10.0, 0.5]   | [10.0, 0.5]

userFeatures       | features
-------------------|------------------
[0.0, 10.0, 0.5]   | [10.0, 0.5]
["f1", "f2", "f3"] | ["f2", "f3"]
Hands-on Exercise
• Apache Flink is a next-generation Big Data tool, also known as the 4G of Big Data.
• It is a true stream processing framework (it does not cut the stream into micro-batches).
• Flink's kernel (core) is a streaming runtime that also provides distributed processing, fault tolerance, etc.
• Flink processes events at a consistently high rate with low latency.
• It processes data at lightning-fast speed.
• It is a large-scale data processing framework that can process data generated at very high velocity.
What is Apache Flink
DataSet API
• It handles data at rest and allows the user to implement operations like map, filter, join, group, etc. on datasets. It is mainly used for distributed batch processing. Batch processing is in fact a special case of stream processing where the data source is finite; batch applications are also executed on the streaming runtime.
DataStream API
• It handles a continuous stream of data. To process a live data stream it provides various operations like map, filter, update state, window, aggregate, etc. It can consume data from various streaming sources and can write the data to different sinks. It supports both Java and Scala.
Apache Flink Ecosystem Components
Table
• It enables users to perform ad-hoc analysis using an SQL-like expression language for relational stream and batch processing. It can be embedded in the DataSet and DataStream APIs.
• It saves users from writing complex code to process the data and instead allows them to run SQL queries on top of Flink.
Apache Flink Ecosystem Components
Gelly
• It is the graph processing engine that allows users to run a set of operations to create, transform and process graphs. Gelly also provides a library of algorithms to simplify the development of graph applications. It leverages Flink's native iterative processing model to handle graphs efficiently. Its APIs are available in Java and Scala.
Apache Flink Ecosystem Components
FlinkML
• It is the machine learning library that provides intuitive APIs and efficient algorithms for machine learning applications. It is written in Scala. Since machine learning algorithms are iterative in nature, Flink provides native support for iterative algorithms and handles them quite effectively and efficiently.
Flink Architecture
Flink Features
• High performance – Flink's data streaming runtime provides very high throughput.
• Low latency – Flink can process data in the sub-second range without any delay.
• Event time and out-of-order events – Flink supports stream processing and windowing where events arrive delayed or out of order.
• Lightning-fast speed – Flink processes data at lightning-fast speed (hence it is also called the 4G of Big Data).
• Fault tolerance – failure of hardware, a node, software or a process doesn't affect the cluster.
• Memory management – Flink works with its own managed memory and is designed to avoid out-of-memory exceptions.
• Broad integration – Flink can be integrated with various storage systems to process their data, it can be deployed with various resource management tools, and it can also be integrated with several BI tools for reporting.
Flink Features
• Stream processing – Flink is a true streaming engine that can process live streams with sub-second latency.
• Program optimizer – Flink ships with an optimizer; a program is optimized before execution.
• Scalable – Flink is highly scalable; as requirements grow, the Flink cluster can be scaled out.
• Rich set of operators – Flink has many pre-defined operators to process data; all the common operations can be done using these operators.
• Exactly-once semantics – Flink can maintain custom state during computation with exactly-once consistency guarantees.
• Highly flexible streaming windows – in Flink we can customize windows with flexible triggering conditions to get the required streaming patterns. We can create time-based windows (e.g., from time t1 to t5) as well as data-driven windows.
Steps to execute the applications in Flink:
• Program – the developer writes the application program.
• Parse and Optimize – code parsing, type extraction, and optimization are done in this step.
• DataFlow Graph – every job is converted into a dataflow graph.
• Job Manager – the job manager schedules the tasks on the task managers and keeps the dataflow metadata. It deploys the operators and monitors the intermediate task results.
• Task Manager – the tasks are executed on the task managers, which are the worker nodes.
Anatomy of a Flink Program
• Flink programs look like regular programs that transform collections of data. Each program consists of the same basic parts:
• Obtain an execution environment,
• Load/create the initial data,
• Specify transformations on this data,
• Specify where to put the results of your computations,
• Trigger the program execution
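A minimal sketch of these parts using the Python DataStream API (PyFlink), assuming PyFlink is installed; the data and job name are illustrative:
from pyflink.datastream import StreamExecutionEnvironment

# 1) obtain an execution environment
env = StreamExecutionEnvironment.get_execution_environment()

# 2) load/create the initial data
ds = env.from_collection([1, 2, 3, 4, 5])

# 3) specify transformations on this data
doubled = ds.map(lambda x: x * 2)

# 4) specify where to put the results (here: print to stdout)
doubled.print()

# 5) trigger the program execution
env.execute("anatomy_example")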
FlinkCEP - Complex event processing for Flink
• FlinkCEP is the Complex Event Processing (CEP) library implemented
on top of Flink.
• It allows you to detect event patterns in an endless stream of events,
giving you the opportunity to get hold of what’s important in your
data.
• Flink-The Pattern API
• Individual Patterns
• A Pattern can be either a singleton or a looping pattern.
• Singleton patterns accept a single event, while looping patterns can accept more than one.
• By default, a pattern is a singleton pattern, and you can transform it into a looping one by using Quantifiers.
• Each pattern can have one or more Conditions based on which it accepts events.
Flink CEP Quantifiers
// expecting 4 occurrences
start.times(4)
// expecting 0 or 4 occurrences
start.times(4).optional()
// expecting 2, 3 or 4 occurrences
start.times(2, 4)
// expecting 2, 3 or 4 occurrences and repeating as many as possible
start.times(2, 4).greedy()
Flink CEP Quantifiers
// expecting 0, 2, 3 or 4 occurrences
start.times(2, 4).optional()
// expecting 0, 2, 3 or 4 occurrences and repeating as many as possible
start.times(2, 4).optional().greedy()
// expecting 1 or more occurrences
start.oneOrMore()
// expecting 1 or more occurrences and repeating as many as possible
start.oneOrMore().greedy()