
Chapter 7 Spark Computing Engine

Foreword
⚫ This chapter describes Spark and its ecosystem, including Spark
Core, Spark SQL, Spark Streaming, RDD, DataFrame, DataSet,
DStream, their internal principles, and typical examples.

2
Objectives
⚫ Upon completion of this course, you will understand:
 Spark and its ecosystem
 Basic principles of Spark Core, Spark SQL, and Spark Streaming
 Concepts, characteristics, and principles of RDD, DataFrame, DataSet, and
DStream

3
Contents
1. Spark Overview

2. Spark Core

3. Spark SQL

4. Spark Streaming

4
Introduction
⚫ Apache Spark started as a research project at the UC Berkeley AMPLab
in 2009.
⚫ It is a fast, versatile, and scalable memory-based big data compute
engine.
⚫ It is a one-stop solution that integrates batch processing, real-time
streaming, interactive query, graph programming, and machine
learning.

5
Highlights

⚫ Fast: memory-based computing with subsecond-level latency.
⚫ Strong compatibility: integrates with other big data components.
⚫ Abundant APIs: provides APIs for multiple programming languages.
⚫ High universality: supports both batch and stream computing.
6
Spark Ecosystem
⚫ The Spark computing framework consists of Spark Core and other extended
libraries such as Spark SQL, Spark Streaming, MLlib, and GraphX, forming the
Spark ecosystem.

[Diagram: the Spark ecosystem, with Spark SQL (structured data), Spark Streaming (real-time computing), MLlib (machine learning), and GraphX (graph computing) built on top of Spark Core]

7
Contents
1. Spark Overview

2. Spark Core

3. Spark SQL

4. Spark Streaming

8
Introduction to Spark Core
⚫ Spark Core implements basic Spark functions, including task
scheduling, memory management, error recovery, and interaction
with the storage system. It is the core component of the Spark
framework.
⚫ Spark Core defines APIs for resilient distributed datasets (RDDs)
and provides multiple APIs for creating and using RDDs.

9
RDD Overview
⚫ An RDD represents a collection of elements that are distributed across multiple compute nodes and can be processed in parallel. An RDD is the main abstraction provided by Spark. Its essence is as follows:
 An RDD is a read-only, partitionable distributed dataset.
 By default, RDDs are stored in memory and are written to disk when memory is insufficient.
 RDD data is stored in a cluster as partitions.
 RDDs have a lineage mechanism, which allows rapid data recovery when data is lost.
[Diagram: data is read from HDFS into RDD 1 in the Spark cluster, transformed into RDD 2, and written to external storage]
10
RDD Features
⚫ From the perspective of source code, an RDD has the following five basic attributes:
 An RDD consists of a group of partitions.
 An RDD has a function for computing each partition (shard).
 An RDD maintains a list of dependencies on other RDDs.
 RDDs of the key-value type have partitioners, such as HashPartitioner.
 An RDD maintains a list of preferred locations for each partition, such as the block locations of HDFS files.
[Diagram: the partitions of an RDD are distributed across HDFS DataNodes]

11
RDD Dependencies
⚫ An RDD has two types of dependencies:
 Narrow dependencies: each partition of a parent RDD is used by at most one partition of the child RDD, for example, map, filter, union, and join with co-partitioned inputs.
 Wide dependencies: each partition of a parent RDD may be used by multiple child RDD partitions, for example, groupByKey and join with inputs that are not co-partitioned.
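⚫ The following minimal spark-shell sketch (the sample data and variable names are ours, not from the course material) shows how the two dependency types surface in an RDD's dependency list and lineage:

val words  = sc.parallelize(Seq("a", "b", "a", "c"))
val pairs  = words.map(w => (w, 1))      // narrow: each parent partition feeds exactly one child partition
val counts = pairs.reduceByKey(_ + _)    // wide: parent partitions are shuffled to multiple child partitions
println(pairs.dependencies)              // expected to show a OneToOneDependency
println(counts.toDebugString)            // the lineage shows a ShuffledRDD, that is, a stage boundary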

12
Wide Dependencies vs. Narrow Dependencies

⚫ Fault tolerance
 Wide: in extreme cases, all partitions of the parent RDDs need to be recalculated.
 Narrow: only the parent RDD partition corresponding to the lost child RDD partition needs to be recalculated.
⚫ Data transmission
 Wide: generates a shuffle; during running, partitions of the same parent RDD are transferred to different child RDD partitions, which may involve data transmission between multiple nodes.
 Narrow: each parent RDD partition is transferred to only one child RDD partition, so the conversion can typically be completed on a single node.
⚫ Stage division
 Wide: wide dependencies are the basis for stage division.
 Narrow: narrow dependencies do not create stage boundaries.

13
Major RDD Operation Types

⚫ Creation: used to create an RDD. An RDD can be created from a memory collection, from an external storage system, or through a transformation operation.
⚫ Transformation: transforms an RDD into a new RDD through certain operations. A transformation is a lazy operation: it only defines the new RDD and does not execute it immediately.
⚫ Action: an operation that triggers Spark execution. There are two types of action operations: one outputs the calculation result, and the other saves the RDD to an external file system or database.

14
Creating an RDD
⚫ From a collection
scala> val rdd = sc.parallelize(1 to 10)

⚫ From an external HDFS file system


scala> val rdd = [Link]("/hdfspath/datas/[Link]")

⚫ Through an RDD transformation operation


scala> val rdd1 = sc.parallelize(1 to 10)
scala> val rdd2 = rdd1.map(_ * 2)

15
RDD Transformation Operations and Common Operators
⚫ Spark has many built-in RDD transformation operation functions (also called
operators). Some transformation operators are as follows:
⚫ map(func): applies func to each element of the RDD that invokes map and returns a new RDD.
⚫ filter(func): applies func to each element of the RDD that invokes filter and returns a new RDD containing the elements for which func returns true.
⚫ reduceByKey(func, [numTasks]): similar to groupByKey, but the values of each key are aggregated with the provided func to obtain a new value.
⚫ join(otherDataset, [numTasks]): for a dataset of type (K, V) joined with a dataset of type (K, W), returns (K, (V, W)). leftOuterJoin, rightOuterJoin, and fullOuterJoin are supported.

⚫ Note: The transformation operator is a lazy operator. During execution, only the
transformation logic is recorded and calculation is not performed immediately.
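⚫ For illustration, the minimal spark-shell sketch below exercises the operators listed above; the sample data and variable names are illustrative, not taken from the course material:

val nums    = sc.parallelize(1 to 6)
val doubled = nums.map(_ * 2)                                  // map(func)
val evens   = nums.filter(_ % 2 == 0)                          // filter(func)
val sales   = sc.parallelize(Seq(("apple", 2), ("pear", 1), ("apple", 3)))
val totals  = sales.reduceByKey(_ + _)                         // reduceByKey(func): ("apple", 5), ("pear", 1)
val prices  = sc.parallelize(Seq(("apple", 5.0), ("pear", 3.0)))
val joined  = totals.join(prices)                              // join: ("apple", (5, 5.0)), ("pear", (1, 3.0))
// Nothing has been computed yet: all of the above are lazy transformations.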

16
RDD Action Operations and Common Operators
⚫ Spark has many built-in RDD action operation functions (also called operators). Some action
operators are as follows:

⚫ reduce(func): aggregates the elements of a dataset using the given function.
⚫ collect(): returns the elements of the RDD (for example, a filter result that is small enough) to the driver as an array.
⚫ count(): returns the number of elements in the RDD.
⚫ first(): returns the first element of the dataset.
⚫ take(n): returns the first n elements of the dataset as an array.
⚫ saveAsTextFile(path): writes the dataset to a local file or an external storage system such as HDFS.

⚫ Note: When each action operator is executed, program execution is triggered, that is, a job is
generated.
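⚫ For illustration, the minimal spark-shell sketch below calls each action operator on a small sample RDD (the data and output path are illustrative); each call triggers a job:

val nums = sc.parallelize(1 to 10)
nums.reduce(_ + _)                   // 55
nums.collect()                       // Array(1, 2, ..., 10)
nums.count()                         // 10
nums.first()                         // 1
nums.take(3)                         // Array(1, 2, 3)
nums.saveAsTextFile("/tmp/nums")     // writes the RDD as text files to the given path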

17
Classic Example of Spark Core — WordCount
Example: Count the occurrences of words in the /wordcount.txt text file in HDFS and store the result in HDFS. The text content is as follows:
An apple
A pair of shoes
Orange apple

In Spark shell, compile the following program:

scala> val rdd1 = [Link]("/[Link]") // Create an RDD.


scala> val rdd2 = [Link](_.split(" ")) // Perform the transformation operation.
scala> val rdd3 = [Link]((_,1)) // Perform the transformation operation.
scala> val rdd4 = [Link](_+_) // Perform the transformation operation.
scala> rdd4. saveAsTextFile("/result") // Perform the action operation.

18
WordCount Program Running Logic

[Diagram: textFile reads the file from HDFS into an RDD of lines ("An apple", "A pair of shoes", "Orange apple"); flatMap splits the lines into words; map turns each word into a (word, 1) pair; reduceByKey aggregates the pairs into counts such as (An, 1), (A, 1), (apple, 2), (pair, 1), (of, 1), (shoes, 1), (Orange, 1); saveAsTextFile writes the result back to HDFS.]

19
Key Concepts of Spark Core
⚫ The RDD operation example implies the following key concepts of Spark:

⚫ Application: an application is generated when a SparkContext object is initialized.
⚫ DAG: a directed acyclic graph formed after a series of transformations is applied to the original RDD.
⚫ Job: a job is generated each time an action operator is triggered.
⚫ Stage: jobs are divided into stages based on the dependencies between RDDs; each wide dependency introduces a stage boundary, so a job without wide dependencies has only one stage. A stage is essentially a task set.
⚫ Task: the basic execution unit in a Spark program; the tasks produced by stage division are sent to different executors for execution.

⚫ Note: There is a one-to-many relationship from application, to job, to stage, and then to
task.
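⚫ To relate these concepts to the WordCount example, the minimal spark-shell sketch below reuses the paths from that example; the comments state what follows from one action and one wide dependency:

val counts = sc.textFile("/wordcount.txt").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
println(counts.toDebugString)     // the ShuffledRDD in the lineage marks the wide dependency, that is, the stage boundary
counts.saveAsTextFile("/result")  // one action triggers one job, which is split into two stages by the wide dependency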
20
Spark Core Program Running View
⚫ The program running view is as follows:
[Diagram: the WordCount job forms a DAG with two stages. Stage 0 reads the input file from HDFS with textFile and applies flatMap and map; the wide dependency introduced by reduceByKey starts Stage 1, which aggregates the pairs and uses saveAsTextFile to write the result file (part-r-00000) back to HDFS.]
21
Contents
1. Spark Overview

2. Spark Core

3. Spark SQL

4. Spark Streaming

22
Introduction to Spark SQL
⚫ Spark SQL is a Spark module for structured data processing. Unlike the basic
Spark RDD API, the interfaces provided by Spark SQL provide Spark with
more information about the structure of both the data and the computation
being performed. Spark SQL provides two core programming abstractions:
DataFrame and DataSet, and supports two interaction modes: SQL and
Dataset API.
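⚫ As a brief illustration of the two interaction modes, the following minimal spark-shell sketch assumes a sample JSON file with name and age fields (the path and column names are assumptions, not from the course material):

val df = spark.read.json("/data/people.json")                    // assumed sample file
// SQL mode: register a temporary view and query it with SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 20").show()
// DataFrame/Dataset API mode: express the same query with operators.
df.filter($"age" > 20).select("name", "age").show()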

23
Basic Principles of Spark SQL
⚫ Spark SQL transforms the read data into DataFrames or DataSets. The DataFrames or
DataSets are transformed by the Spark SQL parser, analyzer, optimizer, and code generator,
compiled and packed, and then sent to the Spark execution engine for subsequent
computing and analysis.

[Diagram: source records such as "1,Tom,20", "2,Nancy,22", and "3,Marry,19" are read and transformed into a DataFrame with the schema ID:Int, Name:String, Age:Int; Spark SQL then parses, analyzes, optimizes, compiles, and packs the query before submitting it to the Spark execution engine.]

24
Spark SQL Execution Process
⚫ Parser: performs lexical and syntax analysis of SQL statements.
⚫ Analyzer: resolves unresolved logical plans into analyzed logical plans based on catalog information.
⚫ Optimizer: uses rules to transform analyzed logical plans into optimized logical plans.
⚫ Planner: transforms a logical plan into multiple physical plans and selects the optimal physical plan.
⚫ Code generator: generates Java code based on the SQL statements.

[Diagram: a SQL statement, DataFrame, or DataSet is parsed into an AST and an unresolved logical plan; analysis against the catalog produces analyzed logical plans; logical optimization produces optimized logical plans; the planner generates physical plans, from which a cost model selects one; code generation finally produces RDD operations.]

25
DataFrame
⚫ Like an RDD, a DataFrame is an immutable distributed collection of data.
Data is organized into named columns, like a table in a relational database.
DataFrame stores data in rows and maintains schemas and data.

[Diagram: a DataFrame with columns Name:String, Age:Int, and Salary:Double; the schema describes the column names and types, and the data is stored in rows.]

26
DataSet
⚫ A DataFrame is a special case of DataSet (DataFrame = Dataset[Row]). Therefore, you can use the as method to convert a DataFrame to a DataSet. Row is a generic type, so the table structure information is carried by the schema rather than by the element type.
⚫ A DataSet is a strongly typed dataset, for example, Dataset[Car] or Dataset[Person].
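⚫ A minimal spark-shell sketch of the as conversion (the Person case class and sample values are illustrative):

case class Person(Name: String, Age: Int, Salary: Double)
import spark.implicits._                   // provides the Encoder required by as[Person]
val df = Seq(Person("Tom", 30, 5435.87), Person("Nancy", 20, 6124.94)).toDF()
val ds = df.as[Person]                     // DataFrame (Dataset[Row]) converted to a typed Dataset[Person]
ds.show()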

27
DataFrame and DataSet Representations
⚫ Assume that the data in a DataFrame is as follows:

Name:String Age:Int Salary:Double


Tom 30 5435.87
Nancy 20 6124.94

⚫ The data in DataSet is displayed as follows:


Value:Person[Name:String,Age:Int,Salary:Double]
Person[Name=Tom,Age=30,Salary=5435.87]
Person[Name=Nancy,Age=20,Salary=6124.94]

28
Major Operation Types of DataFrame and DataSet

⚫ Creation: used to create a DataFrame or DataSet.
⚫ Transformation: transforms a DataFrame/DataSet into a new DataFrame/DataSet through certain operations. A transformation is a lazy operation: it only defines the new DataFrame/DataSet and does not execute it immediately.
⚫ Action: an operation that triggers Spark execution. There are two types of action operations: one outputs the calculation result, and the other saves the DataFrame/DataSet to an external file system or database.

29
Creating a DataFrame
 Defining a schema
scala> import org.apache.spark.sql.types._
scala> import org.apache.spark.sql.Row
scala> val schema = StructType(List(
         StructField("name", StringType, nullable = true),
         StructField("age", IntegerType, nullable = true),
         StructField("salary", DoubleType, nullable = true)
       ))

 Defining a row set

scala> val dataList = Seq[Row](
         Row("Xiaoming", 20, 6543.88),
         Row("xiaohong", 19, 7865.53),
         Row("xiaohua", 21, 3425.56))

 Creating a DataFrame using SparkSession


scala> val df = spark.createDataFrame(sc.parallelize(dataList), schema)

30
DataFrame Transformation and Action Operations
 Transformation: Multiply the salary in the DataFrame by 100 to generate a new DataFrame.

scala> import org.apache.spark.sql.functions.col
scala> val df2 = df.select(col("name"), col("age"), col("salary") * 100)

 Action: Print the content of df2.

scala> df2.show()

The command output is as follows:

+--------+---+--------------+
|    name|age|(salary * 100)|
+--------+---+--------------+
|Xiaoming| 20|      654388.0|
|xiaohong| 19|      786553.0|
| xiaohua| 21|      342556.0|
+--------+---+--------------+

31
Contents
1. Spark Overview

2. Spark Core

3. Spark SQL

4. Spark Streaming

32
Introduction to Spark Streaming
⚫ Spark Streaming is a streaming (real-time) computing framework based on
micro-batch processing. DStream is an abstraction of all data streams in
Spark Streaming. It is a continuous sequence of RDDs for real-time data
stream processing.
[Diagram: input sources such as Kafka, Flume, HDFS/S3, Kinesis, and Twitter feed Spark Streaming, which writes results to HDFS, databases, and dashboards.]

33
Spark Streaming Principles
⚫ The basic principle of Spark Streaming is to split real-time input data streams
by time slice (in seconds), and then use the Spark engine to process data of
each time slice in a way similar to batch processing.

[Diagram: the input data stream is split by Spark Streaming into batches of input data, which the Spark engine processes into batches of processed data.]
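⚫ A minimal spark-shell sketch of this micro-batch model (the socket source, host, port, and 5-second batch interval are assumptions for illustration):

import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(5))           // split the input stream into 5-second batches
val lines = ssc.socketTextStream("localhost", 9999)      // DStream of text lines read from a socket
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                           // output operation: print each batch's counts
ssc.start()                                              // start processing the stream
ssc.awaitTermination()                                   // run until the application is stopped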

34
DStream Overview
⚫ DStream: basic abstraction provided by Spark Streaming. It represents a continuous stream
of data. Internally, a DStream is represented by a continuous series of RDDs, which is Spark's
abstraction of an immutable, distributed dataset.
⚫ Each RDD in a DStream contains data from a certain interval. Any operation applied on a
DStream translates to operations on the underlying RDDs.

[Diagram: a DStream is a sequence of RDDs, one per interval: RDD@time1 holds data from time 0 to 1, RDD@time2 holds data from time 1 to 2, and so on.]

35
DStream Main Operations
⚫ DStream operations include creation, transformation, and output operations. The transformation and output operations are described as follows:
 Transformation: similar to RDD transformations, these operations allow the data from the input DStream to be modified. DStreams support many of the transformations available on normal Spark RDDs.
 Output: output operations allow a DStream's data to be pushed out to external systems such as a database or a file system.

36
Window Operations
⚫ Windowed computations allow you to apply transformations over a sliding window of data.
The RDDs that fall within the window are combined and operated upon to produce the
RDDs of the windowed DStream. Each window has two attributes:
 Window length: The duration of the window.
 Sliding interval: The interval at which the window operation is performed.
[Diagram: a window slides over the original DStream; the windows at time 1, time 3, and time 5 each combine the RDDs that fall within them to form the RDDs of the windowed DStream.]
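⚫ A minimal sketch of a window operation, continuing the (word, count) DStream from the Spark Streaming sketch earlier in this chapter (the 30-second window length and 10-second sliding interval are illustrative):

import org.apache.spark.streaming.Seconds
val windowedCounts = counts.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // combine counts for the same word across the batches in the window
  Seconds(30),                 // window length
  Seconds(10))                 // sliding interval
windowedCounts.print()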

37
Summary
⚫ This chapter described Spark and its ecosystem, including Spark Core, Spark SQL, Spark Streaming, RDD, DataFrame, DataSet, and DStream, along with their internal principles, and used RDD examples to analyze the key concepts and the program running view.

38
Q&A
1. Which of the following statements are true about the dependencies between RDDs?
A. If a partition of a parent RDD corresponds to multiple partitions of a child RDD, the dependencies
are wide.
B. If multiple partitions of a parent RDD correspond to one partition of a child RDD, the dependencies
are wide.
C. If a partition of a parent RDD can be inherited by only one partition of a child RDD, the
dependencies are narrow.
D. If a partition of a parent RDD can be inherited by multiple partitions of a child RDD, the
dependencies are narrow.
2. RDD stages are divided based on wide dependencies.
A. True
B. False

39
Assignment
1. Use Spark to compile a WordCount program and sort the statistics in
descending order of values.

2. Use Spark SQL to collect statistics on the average score of each subject in
each class.

40
Recommendations
⚫ Huawei Cloud websites
 Official website: [Link]
 Developer Institute: [Link]

Huawei Cloud
Developer Institute

41
Thank You.
Copyright© 2023 Huawei Technologies Co., Ltd. All Rights Reserved.
The information in this document may contain predictive statements including,
without limitation, statements regarding the future financial and operating results,
future product portfolio, new technology, etc. There are a number of factors that
could cause actual results and developments to differ materially from those
expressed or implied in the predictive statements. Therefore, such information is
provided for reference purpose only and constitutes neither an offer nor an
acceptance. Huawei may change the information at any time without notice.

42
