
Chapter 7 Spark Computing Engine

Foreword
⚫ This chapter describes Spark and its ecosystem, including Spark
Core, Spark SQL, Spark Streaming, RDD, DataFrame, DataSet,
DStream, their internal principles, and typical examples.

2
Objectives
⚫ Upon completion of this course, you will understand:
 Spark and its ecosystem
 Basic principles of Spark Core, Spark SQL, and Spark Streaming
 Concepts, characteristics, and principles of RDD, DataFrame, DataSet, and
DStream

3
Contents
1. Spark Overview

2. Spark Core

3. Spark SQL

4. Spark Streaming

4
Introduction
⚫ Apache Spark started as a research project at the UC Berkeley AMPLab
in 2009.
⚫ It is a fast, versatile, and scalable memory-based big data compute
engine.
⚫ It is a one-stop solution that integrates batch processing, real-time
streaming, interactive query, graph programming, and machine
learning.

5
Highlights

⚫ Fast: memory-based computing with subsecond-level latency.
⚫ Strong compatibility: integrates with other big data components.
⚫ Abundant APIs: provides APIs for multiple programming languages.
⚫ High universality: supports both batch and stream computing.
6
Spark Ecosystem
⚫ The Spark computing framework consists of Spark Core and other extended
libraries such as Spark SQL, Spark Streaming, MLlib, and GraphX, forming the
Spark ecosystem.

[Diagram: the Spark ecosystem, with Spark SQL (structured data), Spark Streaming (real-time computing), MLlib (machine learning), and GraphX (graph computing) built on top of Spark Core]

7
Contents
1. Spark Overview

2. Spark Core

3. Spark SQL

4. Spark Streaming

8
Introduction to Spark Core
⚫ Spark Core implements basic Spark functions, including task
scheduling, memory management, error recovery, and interaction
with the storage system. It is the core component of the Spark
framework.
⚫ Spark Core defines APIs for resilient distributed datasets (RDDs)
and provides multiple APIs for creating and using RDDs.

9
RDD Overview
⚫ An RDD represents a collection of elements that are distributed across multiple compute nodes and can be processed in parallel. An RDD is the main abstraction provided by Spark. Its essence is as follows:
 An RDD is a read-only, partitionable distributed dataset.
 By default, RDDs are stored in memory and are written to disk when memory is insufficient.
 RDD data is stored in a cluster as partitions.
 RDDs have a lineage mechanism, which allows rapid data recovery when data is lost.
[Diagram: data is read from HDFS into RDD 1 in the Spark cluster, transformed into RDD 2, and written to external storage]
10
RDD Features
⚫ From the perspective of source code, an RDD has the following five basic attributes:
 An RDD consists of a group of partitions.
 An RDD has a function for computing each partition (shard).
 An RDD maintains a list of dependencies on other RDDs.
 RDDs of the key-value type have partitioners, such as HashPartitioner.
 An RDD maintains a list of preferred locations for each partition, such as the block locations of HDFS files.
[Diagram: the partitions of an RDD are distributed across HDFS DataNodes]

11
RDD Dependencies
⚫ An RDD has two types of dependencies:
 Narrow dependencies: each partition of a parent RDD is used by at most one partition of the child RDD, for example, map, filter, union, and join with co-partitioned inputs.
 Wide dependencies: each partition of a parent RDD may be used by multiple child RDD partitions, for example, groupByKey and join with inputs that are not co-partitioned.
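⚫ The following minimal spark-shell sketch (the sample data and variable names are ours, not from the course material) shows how the two dependency types surface in an RDD's dependency list and lineage:

val words  = sc.parallelize(Seq("a", "b", "a", "c"))
val pairs  = words.map(w => (w, 1))      // narrow: each parent partition feeds exactly one child partition
val counts = pairs.reduceByKey(_ + _)    // wide: parent partitions are shuffled to multiple child partitions
println(pairs.dependencies)              // expected to show a OneToOneDependency
println(counts.toDebugString)            // the lineage shows a ShuffledRDD, that is, a stage boundary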

12
Wide Dependencies vs. Narrow Dependencies

⚫ Fault tolerance
 Wide: in extreme cases, all partitions of the parent RDDs need to be recalculated.
 Narrow: only the parent RDD partition corresponding to the lost child RDD partition needs to be recalculated.
⚫ Data transmission
 Wide: generates a shuffle; during running, partitions of the same parent RDD are transferred to different child RDD partitions, which may involve data transmission between multiple nodes.
 Narrow: each parent RDD partition is transferred to only one child RDD partition, so the conversion can typically be completed on a single node.
⚫ Stage division
 Wide: wide dependencies are the basis for stage division.
 Narrow: narrow dependencies do not create stage boundaries.

13
Major RDD Operation Types

⚫ Creation: used to create an RDD. An RDD can be created from a memory collection, from an external storage system, or through a transformation operation.
⚫ Transformation: transforms an RDD into a new RDD through certain operations. A transformation is a lazy operation: it only defines the new RDD and does not execute it immediately.
⚫ Action: an operation that triggers Spark execution. There are two types of action operations: one outputs the calculation result, and the other saves the RDD to an external file system or database.

14
Creating an RDD
⚫ From a collection
scala> val rdd = sc.parallelize(1 to 10)

⚫ From an external HDFS file system


scala> val rdd = [Link]("/hdfspath/datas/[Link]")

⚫ Through an RDD transformation operation


scala> val rdd1 = sc.parallelize(1 to 10)
scala> val rdd2 = rdd1.map(_ * 2)

15
RDD Transformation Operations and Common Operators
⚫ Spark has many built-in RDD transformation operation functions (also called
operators). Some transformation operators are as follows:
⚫ map(func): applies func to each element of the RDD that invokes map and returns a new RDD.
⚫ filter(func): applies func to each element of the RDD that invokes filter and returns a new RDD containing the elements for which func returns true.
⚫ reduceByKey(func, [numTasks]): similar to groupByKey, but the values of each key are aggregated with the provided func to obtain a new value.
⚫ join(otherDataset, [numTasks]): for a dataset of type (K, V) joined with a dataset of type (K, W), returns (K, (V, W)). leftOuterJoin, rightOuterJoin, and fullOuterJoin are supported.

⚫ Note: The transformation operator is a lazy operator. During execution, only the
transformation logic is recorded and calculation is not performed immediately.
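⚫ For illustration, the minimal spark-shell sketch below exercises the operators listed above; the sample data and variable names are illustrative, not taken from the course material:

val nums    = sc.parallelize(1 to 6)
val doubled = nums.map(_ * 2)                                  // map(func)
val evens   = nums.filter(_ % 2 == 0)                          // filter(func)
val sales   = sc.parallelize(Seq(("apple", 2), ("pear", 1), ("apple", 3)))
val totals  = sales.reduceByKey(_ + _)                         // reduceByKey(func): ("apple", 5), ("pear", 1)
val prices  = sc.parallelize(Seq(("apple", 5.0), ("pear", 3.0)))
val joined  = totals.join(prices)                              // join: ("apple", (5, 5.0)), ("pear", (1, 3.0))
// Nothing has been computed yet: all of the above are lazy transformations.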

16
RDD Action Operations and Common Operators
⚫ Spark has many built-in RDD action operation functions (also called operators). Some action
operators are as follows:

⚫ reduce(func): aggregates the elements of a dataset using the given function.
⚫ collect(): returns the elements of the RDD (for example, a filter result that is small enough) to the driver as an array.
⚫ count(): returns the number of elements in the RDD.
⚫ first(): returns the first element of the dataset.
⚫ take(n): returns the first n elements of the dataset as an array.
⚫ saveAsTextFile(path): writes the dataset to a local file or an external storage system such as HDFS.

⚫ Note: When each action operator is executed, program execution is triggered, that is, a job is
generated.
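⚫ For illustration, the minimal spark-shell sketch below calls each action operator on a small sample RDD (the data and output path are illustrative); each call triggers a job:

val nums = sc.parallelize(1 to 10)
nums.reduce(_ + _)                   // 55
nums.collect()                       // Array(1, 2, ..., 10)
nums.count()                         // 10
nums.first()                         // 1
nums.take(3)                         // Array(1, 2, 3)
nums.saveAsTextFile("/tmp/nums")     // writes the RDD as text files to the given path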

17
Classic Example of Spark Core — WordCount
Example: Count the occurrences of words in the /wordcount.txt text file in HDFS and store the result in HDFS. The text content is as follows:
An apple
A pair of shoes
Orange apple

In Spark shell, compile the following program:

scala> val rdd1 = [Link]("/[Link]") // Create an RDD.


scala> val rdd2 = [Link](_.split(" ")) // Perform the transformation operation.
scala> val rdd3 = [Link]((_,1)) // Perform the transformation operation.
scala> val rdd4 = [Link](_+_) // Perform the transformation operation.
scala> rdd4. saveAsTextFile("/result") // Perform the action operation.

18
WordCount Program Running Logic

[Diagram: textFile reads the file from HDFS into an RDD of lines ("An apple", "A pair of shoes", "Orange apple"); flatMap splits the lines into words; map turns each word into a (word, 1) pair; reduceByKey aggregates the pairs into counts such as (An, 1), (A, 1), (apple, 2), (pair, 1), (of, 1), (shoes, 1), (Orange, 1); saveAsTextFile writes the result back to HDFS.]

19
Key Concepts of Spark Core
⚫ The RDD operation example implies the following key concepts of Spark:

⚫ Application: an application is generated when a SparkContext object is initialized.
⚫ DAG: a directed acyclic graph formed after a series of transformations is applied to the original RDD.
⚫ Job: a job is generated each time an action operator is triggered.
⚫ Stage: jobs are divided into stages based on the dependencies between RDDs; each wide dependency introduces a stage boundary, so a job without wide dependencies has only one stage. A stage is essentially a task set.
⚫ Task: the basic execution unit in a Spark program; the tasks produced by stage division are sent to different executors for execution.

⚫ Note: There is a one-to-many relationship from application, to job, to stage, and then to
task.
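⚫ To relate these concepts to the WordCount example, the minimal spark-shell sketch below reuses the paths from that example; the comments state what follows from one action and one wide dependency:

val counts = sc.textFile("/wordcount.txt").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
println(counts.toDebugString)     // the ShuffledRDD in the lineage marks the wide dependency, that is, the stage boundary
counts.saveAsTextFile("/result")  // one action triggers one job, which is split into two stages by the wide dependency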
20
Spark Core Program Running View
⚫ The program running view is as follows:
[Diagram: the WordCount job forms a DAG with two stages. Stage 0 reads the input file from HDFS with textFile and applies flatMap and map; the wide dependency introduced by reduceByKey starts Stage 1, which aggregates the pairs and uses saveAsTextFile to write the result file (part-r-00000) back to HDFS.]
21
Contents
1. Spark Overview

2. Spark Core

3. Spark SQL

4. Spark Streaming

22
Introduction to Spark SQL
⚫ Spark SQL is a Spark module for structured data processing. Unlike the basic
Spark RDD API, the interfaces provided by Spark SQL provide Spark with
more information about the structure of both the data and the computation
being performed. Spark SQL provides two core programming abstractions:
DataFrame and DataSet, and supports two interaction modes: SQL and
Dataset API.
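⚫ As a brief illustration of the two interaction modes, the following minimal spark-shell sketch assumes a sample JSON file with name and age fields (the path and column names are assumptions, not from the course material):

val df = spark.read.json("/data/people.json")                    // assumed sample file
// SQL mode: register a temporary view and query it with SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 20").show()
// DataFrame/Dataset API mode: express the same query with operators.
df.filter($"age" > 20).select("name", "age").show()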

23
Basic Principles of Spark SQL
⚫ Spark SQL transforms the read data into DataFrames or DataSets. The DataFrames or
DataSets are transformed by the Spark SQL parser, analyzer, optimizer, and code generator,
compiled and packed, and then sent to the Spark execution engine for subsequent
computing and analysis.

[Diagram: source records such as "1,Tom,20", "2,Nancy,22", and "3,Marry,19" are read and transformed into a DataFrame with the schema ID:Int, Name:String, Age:Int; Spark SQL then parses, analyzes, optimizes, compiles, and packs the query before submitting it to the Spark execution engine.]

24
Spark SQL Execution Process
⚫ Parser: performs lexical and syntax analysis of SQL statements.
⚫ Analyzer: resolves unresolved logical plans into analyzed logical plans based on catalog information.
⚫ Optimizer: uses rules to transform analyzed logical plans into optimized logical plans.
⚫ Planner: transforms a logical plan into multiple physical plans and selects the optimal physical plan.
⚫ Code generator: generates Java code based on the SQL statements.

[Diagram: a SQL statement, DataFrame, or DataSet is parsed into an AST and an unresolved logical plan; analysis against the catalog produces analyzed logical plans; logical optimization produces optimized logical plans; the planner generates physical plans, from which a cost model selects one; code generation finally produces RDD operations.]

25
DataFrame
⚫ Like an RDD, a DataFrame is an immutable distributed collection of data.
Data is organized into named columns, like a table in a relational database.
DataFrame stores data in rows and maintains schemas and data.

[Diagram: a DataFrame with columns Name:String, Age:Int, and Salary:Double; the schema describes the column names and types, and the data is stored in rows.]

26
DataSet
⚫ A DataFrame is a special case of DataSet (DataFrame = Dataset[Row]). Therefore, you can use the as method to convert a DataFrame to a DataSet. Row is a generic type, so the table structure information is carried by the schema rather than by the element type.
⚫ A DataSet is a strongly typed dataset, for example, Dataset[Car] or Dataset[Person].
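⚫ A minimal spark-shell sketch of the as conversion (the Person case class and sample values are illustrative):

case class Person(Name: String, Age: Int, Salary: Double)
import spark.implicits._                   // provides the Encoder required by as[Person]
val df = Seq(Person("Tom", 30, 5435.87), Person("Nancy", 20, 6124.94)).toDF()
val ds = df.as[Person]                     // DataFrame (Dataset[Row]) converted to a typed Dataset[Person]
ds.show()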

27
DataFrame and DataSet Representations
⚫ Assume that the data in a DataFrame is as follows:

Name:String Age:Int Salary:Double


Tom 30 5435.87
Nancy 20 6124.94

⚫ The data in DataSet is displayed as follows:


Value:Person[Name:String,Age:Int,Salary:Double]
Person[Name=Tom,Age=30,Salary=5435.87]
Person[Name=Nancy,Age=20,Salary=6124.94]

28
Major Operation Types of DataFrame and DataSet

⚫ Creation: used to create a DataFrame or DataSet.
⚫ Transformation: transforms a DataFrame/DataSet into a new DataFrame/DataSet through certain operations. A transformation is a lazy operation: it only defines the new DataFrame/DataSet and does not execute it immediately.
⚫ Action: an operation that triggers Spark execution. There are two types of action operations: one outputs the calculation result, and the other saves the DataFrame/DataSet to an external file system or database.

29
Creating a DataFrame
 Defining a schema
scala> import org.apache.spark.sql.types._
scala> import org.apache.spark.sql.Row
scala> val schema = StructType(List(
         StructField("name", StringType, nullable = true),
         StructField("age", IntegerType, nullable = true),
         StructField("salary", DoubleType, nullable = true)
       ))

 Defining a row set

scala> val dataList = Seq[Row](
         Row("Xiaoming", 20, 6543.88),
         Row("xiaohong", 19, 7865.53),
         Row("xiaohua", 21, 3425.56))

 Creating a DataFrame using SparkSession


scala> val df = spark.createDataFrame(sc.parallelize(dataList), schema)

30
DataFrame Transformation and Action Operations
 Transformation: Multiply the salary in the DataFrame by 100 to generate a new DataFrame.

scala> import org.apache.spark.sql.functions.col
scala> val df2 = df.select(col("name"), col("age"), col("salary") * 100)

 Action: Print the content of df2.

scala> df2.show()

The command output is as follows:

+--------+---+--------------+
|    name|age|(salary * 100)|
+--------+---+--------------+
|Xiaoming| 20|      654388.0|
|xiaohong| 19|      786553.0|
| xiaohua| 21|      342556.0|
+--------+---+--------------+

31
Contents
1. Spark Overview

2. Spark Core

3. Spark SQL

4. Spark Streaming

32
Introduction to Spark Streaming
⚫ Spark Streaming is a streaming (real-time) computing framework based on
micro-batch processing. DStream is an abstraction of all data streams in
Spark Streaming. It is a continuous sequence of RDDs for real-time data
stream processing.
[Diagram: input sources such as Kafka, Flume, HDFS/S3, Kinesis, and Twitter feed Spark Streaming, which writes results to HDFS, databases, and dashboards.]

33
Spark Streaming Principles
⚫ The basic principle of Spark Streaming is to split real-time input data streams
by time slice (in seconds), and then use the Spark engine to process data of
each time slice in a way similar to batch processing.

[Diagram: the input data stream is split by Spark Streaming into batches of input data, which the Spark engine processes into batches of processed data.]
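⚫ A minimal spark-shell sketch of this micro-batch model (the socket source, host, port, and 5-second batch interval are assumptions for illustration):

import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(5))           // split the input stream into 5-second batches
val lines = ssc.socketTextStream("localhost", 9999)      // DStream of text lines read from a socket
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                           // output operation: print each batch's counts
ssc.start()                                              // start processing the stream
ssc.awaitTermination()                                   // run until the application is stopped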

34
DStream Overview
⚫ DStream: basic abstraction provided by Spark Streaming. It represents a continuous stream
of data. Internally, a DStream is represented by a continuous series of RDDs, which is Spark's
abstraction of an immutable, distributed dataset.
⚫ Each RDD in a DStream contains data from a certain interval. Any operation applied on a
DStream translates to operations on the underlying RDDs.

[Diagram: a DStream is a sequence of RDDs, one per interval: RDD@time1 holds data from time 0 to 1, RDD@time2 holds data from time 1 to 2, and so on.]

35
DStream Main Operations
⚫ DStream operations include creation, transformation, and output operations. The transformation and output operations are described as follows:
 Transformation: similar to RDD transformations, these operations allow the data from the input DStream to be modified. DStreams support many of the transformations available on normal Spark RDDs.
 Output: output operations allow a DStream's data to be pushed out to external systems such as a database or a file system.

36
Window Operations
⚫ Windowed computations allow you to apply transformations over a sliding window of data.
The RDDs that fall within the window are combined and operated upon to produce the
RDDs of the windowed DStream. Each window has two attributes:
 Window length: The duration of the window.
 Sliding interval: The interval at which the window operation is performed.
[Diagram: a window slides over the original DStream; the windows at time 1, time 3, and time 5 each combine the RDDs that fall within them to form the RDDs of the windowed DStream.]
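⚫ A minimal sketch of a window operation, continuing the (word, count) DStream from the Spark Streaming sketch earlier in this chapter (the 30-second window length and 10-second sliding interval are illustrative):

import org.apache.spark.streaming.Seconds
val windowedCounts = counts.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // combine counts for the same word across the batches in the window
  Seconds(30),                 // window length
  Seconds(10))                 // sliding interval
windowedCounts.print()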

37
Summary
⚫ This chapter described Spark and its ecosystem, including Spark Core, Spark SQL, Spark Streaming, RDD, DataFrame, DataSet, and DStream, along with their internal principles, and used RDD examples to analyze the key concepts and the program running view.

38
Q&A
1. Which of the following statements are true about the dependencies between RDDs?
A. If a partition of a parent RDD corresponds to multiple partitions of a child RDD, the dependencies
are wide.
B. If multiple partitions of a parent RDD correspond to one partition of a child RDD, the dependencies
are wide.
C. If a partition of a parent RDD can be inherited by only one partition of a child RDD, the
dependencies are narrow.
D. If a partition of a parent RDD can be inherited by multiple partitions of a child RDD, the
dependencies are narrow.
2. RDD stages are divided based on wide dependencies.
A. True
B. False

39
Assignment
1. Use Spark to compile a WordCount program and sort the statistics in
descending order of values.

2. Use Spark SQL to collect statistics on the average score of each subject in
each class.

40
Recommendations
⚫ Huawei Cloud websites
 Official website: [Link]
 Developer Institute: [Link]

Huawei Cloud
Developer Institute

41
Thank You.
Copyright© 2023 Huawei Technologies Co., Ltd. All Rights Reserved.
The information in this document may contain predictive statements including,
without limitation, statements regarding the future financial and operating results,
future product portfolio, new technology, etc. There are a number of factors that
could cause actual results and developments to differ materially from those
expressed or implied in the predictive statements. Therefore, such information is
provided for reference purpose only and constitutes neither an offer nor an
acceptance. Huawei may change the information at any time without notice.

42
