CS5412 / Lecture 25: Apache Spark and RDDs
Kishore Pusukuri, Spring 2019
HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2018SP 1
Recap
MapReduce
• A framework for easily writing applications that process vast amounts of data in parallel on large clusters, in a reliable, fault-tolerant manner
• Takes care of scheduling tasks, monitoring them, and re-executing failed tasks
HDFS & MapReduce: running on the same set of nodes → compute nodes and storage nodes are the same (keeping data close to the computation) → very high throughput
YARN & MapReduce: a single master resource manager, one slave node manager per node, and one AppMaster per application
2
Hadoop (summary)
MapReduce: The original scalable, general, processing
engine of the Hadoop ecosystem
• Disk-based data processing framework (HDFS files)
• Persists intermediate results to disk
• Data is reloaded from disk with every query → Costly I/O
• Best for ETL-like workloads (batch processing)
• Costly I/O → Not appropriate for iterative or stream processing
workloads
3
Spark (features)
Spark: General purpose computational framework that
substantially improves performance of MapReduce, but retains
the basic model
• Memory based data processing framework → avoids costly I/O
by keeping intermediate results in memory
• Leverages distributed memory for repeated read/writes
• Remembers the operations applied to the dataset via DAGs (lineage)
• Data locality based computation → High Performance
• Best for both iterative (or stream processing) and batch workloads
4
Today’s Topics
• Why Spark?
• Spark Basic Concepts
• Spark basic Programming
5
Apache Spark
** Spark can connect to several types of cluster managers (either
Spark’s own standalone cluster manager, Mesos or YARN)
[Figure: the Spark stack. Processing: Spark Core with Spark SQL, Spark Streaming, Spark ML, and other Spark applications. Resource managers: Spark's standalone scheduler, Mesos, etc., or Yet Another Resource Negotiator (YARN). Data ingestion systems: e.g., Apache Kafka, Flume, etc. Data storage: Hadoop Distributed File System (HDFS), Hadoop NoSQL database (HBase), S3, Cassandra, and other storage systems.]
6
Spark Ecosystem: A Unified Pipeline
Note: Spark is not designed for real-time IoT. The streaming layer is used for continuous input streams such as financial data from stock markets, where events occur steadily and must be processed as they arrive. But there is no notion of direct I/O from sensors/actuators, so Spark would not be suitable for IoT use cases.
7
Key ideas
In Hadoop, each developer tends to invent his or her own style of work
With Spark, a serious effort is made to standardize the parallel code that people write, which often runs for many “cycles” or “iterations” in which a lot of information is reused.
Spark centers on Resilient Distributed Datasets (RDDs), which capture the information being reused.
8
How this works
You express your application as a graph of RDDs.
The graph is only evaluated as needed: Spark computes only the RDDs actually needed for the output you have requested.
Spark can then be told to cache the reusable information either in memory, in SSD storage, or even on disk, based on when it will be needed again, how big it is, and how costly it would be to recreate.
You write the RDD logic and control all of this via hints.
9
Today’s Topics
• Motivation
• Spark Basics
• Spark Programming
10
Spark Basics(1)
Spark: Flexible, in-memory data processing framework written in Scala
Goals:
• Simplicity (Easier to use):
Rich APIs for Scala, Java, and Python
• Generality: APIs for different types of workloads
Batch, Streaming, Machine Learning, Graph
• Low Latency (Performance) : In-memory processing and caching
• Fault-tolerance: Ability to recover losses (Resilient)
11
Spark Basics(2)
There are two ways to manipulate data in Spark
• Spark Shell:
Interactive – for learning or data exploration
Python or Scala
• Spark Applications
For large scale data processing
12
Spark Shell
The Spark Shell provides interactive data exploration (REPL)
REPL: Read/Evaluate/Print Loop
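A minimal sketch (assuming Spark is installed and its bin directory is on the PATH): the Python and Scala shells are launched from the command line, and each starts with a preconfigured SparkContext named sc.
$ pyspark       # Python shell
$ spark-shell   # Scala shell
# inside the Python shell, sc is already defined:
>>> sc.parallelize([1, 2, 3]).count()   # => 3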
13
Spark Fundamentals
Example of an application:
• Spark Context
• Resilient Distributed Dataset (RDD)
• Transformations
• Actions
14
Spark Context (1)
• Every Spark application requires a Spark context: the main entry point to the Spark API
• Spark Shell provides a preconfigured Spark Context called “sc”
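Outside the shell, an application creates its own context. A minimal PySpark sketch (the application name "MyApp" and the master URL "local[*]" are illustrative choices, not part of the original slide):
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("MyApp").setMaster("local[*]")  # illustrative settings
sc = SparkContext(conf=conf)   # main entry point to the Spark API
# ... build and operate on RDDs via sc ...
sc.stop()                      # release resources when the application finishes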
15
Spark Context (3)
A Spark context works as a client and represents the connection to a Spark cluster
16
Spark Fundamentals
Example of an application:
• Spark Context
• Resilient Distributed Dataset (RDD)
• Transformations
• Actions
17
Resilient Distributed Dataset
RDD (Resilient Distributed Dataset) is the fundamental unit of data in Spark: an immutable collection of objects (or records, or elements) that can be operated on “in parallel” (spread across a cluster)
Resilient -- if data in memory is lost, it can be recreated
• Recover from node failures
• An RDD keeps its lineage information → it can be recreated from parent RDDs
Distributed -- processed across the cluster
• Each RDD is composed of one or more partitions (more partitions – more
parallelism)
Dataset -- initial data can come from a file or be created
18
RDDs
Key Idea: Write applications in terms of transformations on
distributed datasets. One RDD per transformation.
• Organize the RDDs into a DAG showing how data flows.
• RDD can be saved and reused or recomputed. Spark can save it to
disk if the dataset does not fit in memory
• Built through parallel transformations (map, filter, group-by, join,
etc). Automatically rebuilt on failure
• Controllable persistence (e.g. caching in RAM)
19
RDDs are designed to be “immutable”
• Create once, then reuse without changes. Spark knows the lineage → it can be recreated at any time → Fault-tolerance
• Avoids data inconsistency problems (no simultaneous updates) → Correctness
• Can live in memory as easily as on disk → Caching → Safe to share across processes/tasks → Improves performance
• Tradeoff: (Fault-tolerance & Correctness) vs (Disk, Memory & CPU)
20
Creating a RDD
Three ways to create a RDD
• From a file or set of files
• From data in memory
• From another RDD
21
Example: A File-based RDD
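The original slide shows this example as a figure; a minimal PySpark sketch along the same lines (the file name purplecow.txt is hypothetical):
mydata = sc.textFile("purplecow.txt")   # hypothetical input file
mydata.count()                          # number of lines in the file
mydata.take(2)                          # first two lines, returned to the driver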
22
Spark Fundamentals
Example of an application:
• Spark Context
• Resilient Distributed Dataset (RDD)
• Transformations
• Actions
23
RDD Operations
Two types of operations:
• Transformations: define a new RDD based on current RDD(s)
• Actions: return values
24
RDD Transformations
• Set of operations on an RDD that define how it should be transformed
• As in relational algebra, applying a transformation to an RDD yields a new RDD (because RDDs are immutable)
• Transformations are lazily evaluated, which allows for
optimizations to take place before execution
• Examples: map(), filter(), groupByKey(), sortByKey(),
etc.
25
Example: map and filter Transformations
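The original slide shows this example as a figure; a minimal sketch of the two transformations (the input values are illustrative):
nums = sc.parallelize([1, 2, 3, 4])
squares = nums.map(lambda x: x * x)            # => 1, 4, 9, 16
evens = squares.filter(lambda x: x % 2 == 0)   # => 4, 16
evens.collect()                                # action: returns [4, 16] to the driver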
26
RDD Actions
• Apply the transformation chain on the RDD, then perform some additional operation (e.g., counting)
• Some actions only store data to an external data source (e.g.
HDFS), others fetch data from the RDD (and its transformation
chain) upon which the action is applied, and convey it to the
driver
• Some common actions
count() – return the number of elements
take(n) – return an array of the first n elements
collect()– return an array of all elements
saveAsTextFile(file) – save to text file(s)
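A small sketch of these actions on a toy RDD (the values and the output directory "results" are illustrative):
data = sc.parallelize(["a", "b", "c"])
data.count()                     # => 3
data.take(2)                     # => ['a', 'b']
data.collect()                   # => ['a', 'b', 'c'] (all elements shipped to the driver)
data.saveAsTextFile("results")   # writes each partition as a text file under results/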
27
Lazy Execution of RDDs (1)
Data in RDDs is not processed
until an action is performed
28
Lazy Execution of RDDs (2)
Data in RDDs is not processed
until an action is performed
29
Lazy Execution of RDDs (3)
Data in RDDs is not processed
until an action is performed
30
Lazy Execution of RDDs (4)
Data in RDDs is not processed
until an action is performed
31
Lazy Execution of RDDs (5)
Data in RDDs is not processed
until an action is performed
Output: the action “triggers” the computation (pull model)
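A hedged illustration of this laziness (the file name log.txt is hypothetical): the transformations below return immediately without reading any data; only the final count() triggers execution.
lines = sc.textFile("log.txt")                           # nothing is read yet
errors = lines.filter(lambda s: s.startswith("ERROR"))   # still nothing computed
print(errors.toDebugString())                            # shows the lineage DAG that will be run
errors.count()                                           # action: now the file is read and filtered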
32
Example: Mine error logs
Load error messages from a log into memory, then interactively
search for various patterns:
lines = sc.textFile("hdfs://...")                        # HadoopRDD
errors = lines.filter(lambda s: s.startswith("ERROR"))   # FilteredRDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "foo" in s).count()
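Because messages is cached, later interactive queries can be served from memory rather than re-reading the log from HDFS; for example (the pattern "bar" is illustrative):
messages.filter(lambda s: "bar" in s).count()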
33
Key Idea: Elastic parallelism
RDD operations are designed to offer embarrassing parallelism.
Spark spreads tasks over the nodes where the data resides, offering a highly concurrent execution that minimizes delays. Term: “partitioned computation”.
If some component crashes, or is merely slow, Spark simply kills that task and launches a substitute.
34
RDD and Partitions (Parallelism example)
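The original slide shows this example as a figure; a small sketch of the idea, using an explicit partition count (the value 4 is arbitrary):
nums = sc.parallelize(range(100), 4)   # request 4 partitions
nums.getNumPartitions()                # => 4
nums.sum()                             # each partition is processed by a separate task in parallel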
35
RDD Graph: Data Set vs Partition Views
Much like in Hadoop MapReduce, each RDD is associated with (input) partitions
36
RDDs: Data Locality
•Data Locality Principle
Keep high-value RDDs precomputed, in cache or SSD
Run tasks that need the specific RDD with those same inputs
on the node where the cached copy resides.
This can maximize in-memory computational performance.
Requires cooperation between your hints to Spark when you build the RDD, the Spark runtime and optimization planner, and the underlying YARN resource manager (see the sketch below).
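One way to express such hints in PySpark is persist() with a storage level. A minimal sketch (the input path and storage-level choice are illustrative): this asks Spark to keep the RDD in memory and spill to disk only if it does not fit.
from pyspark import StorageLevel
ref = sc.textFile("hdfs://namenode:9000/data/reference.csv")   # hypothetical high-value input
ref_rdd = ref.map(lambda line: line.split(","))
ref_rdd.persist(StorageLevel.MEMORY_AND_DISK)   # hint: cache in memory, spill to disk if needed
# tasks that reuse ref_rdd are preferentially scheduled where its cached partitions reside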
37
Typical RDD pattern of use
Instead of doing a lot of work in each RDD, developers split
tasks into lots of small RDDs
These are then organized into a DAG.
The developer anticipates which RDDs will be costly to recompute and hints to Spark that it should cache those (see the sketch below).
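A sketch of the pattern (all names are illustrative): the intermediate RDD parsed feeds two downstream computations, so it is a good candidate to cache.
raw = sc.textFile("events.txt")                  # hypothetical input
parsed = raw.map(lambda line: line.split(","))   # small, reusable transformation
parsed.cache()                                   # hint: this RDD will be reused
by_user = parsed.map(lambda f: (f[0], 1)).reduceByKey(lambda a, b: a + b)
by_type = parsed.map(lambda f: (f[1], 1)).reduceByKey(lambda a, b: a + b)
by_user.count()   # first action computes and caches parsed
by_type.count()   # second action reuses the cached partitions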
38
Why is this a good strategy?
Spark tries to run tasks that will need the same intermediary data on the same nodes.
If MapReduce jobs were arbitrary programs, this wouldn’t help because reuse would be
very rare.
But in fact the MapReduce model is very repetitious and iterative, and often applies the
same transformations again and again to the same input files.
Those particular RDDs become great candidates for caching.
The MapReduce programmer may not know how many iterations will occur, but Spark itself is smart enough to evict RDDs if they do not actually get reused.
39
RDDs -- Summary
RDDs are partitioned, locality-aware, distributed collections
RDDs are immutable
RDDs are data structures that:
Either point to a direct data source (e.g., HDFS)
Or apply transformations to their parent RDD(s) to generate new data elements
Computations on RDDs
Are represented by lazily evaluated lineage DAGs composed of chained RDDs
40
Lifetime of a Job in Spark
41
Anatomy of a Spark Application
Cluster Manager
(YARN/Mesos)
42
Iterative Algorithms: Spark vs MapReduce
43
Today’s Topics
• Motivation
• Spark Basics
• Spark Programming
44
Spark Programming (1)
Creating RDDs
# Turn a Python collection into an RDD
sc.parallelize([1, 2, 3])
# Load text file from local FS, HDFS, or S3
sc.textFile("file.txt")
sc.textFile("directory/*.txt")
sc.textFile("hdfs://namenode:9000/path/file")
# Use existing Hadoop InputFormat (Java/Scala only)
sc.hadoopFile(keyClass, valClass, inputFmt, conf)
45
Spark Programming (2)
Basic Transformations
nums = sc.parallelize([1, 2, 3])
# Pass each element through a function
squares = nums.map(lambda x: x*x) # => {1, 4, 9}
# Keep elements passing a predicate
even = squares.filter(lambda x: x % 2 == 0) # => {4}
46
Spark Programming (3)
Basic Actions
nums = sc.parallelize([1, 2, 3])
# Retrieve RDD contents as a local collection
nums.collect() # => [1, 2, 3]
# Return first K elements
nums.take(2) # => [1, 2]
# Count number of elements
nums.count() # => 3
# Merge elements with an associative function
nums.reduce(lambda x, y: x + y) # => 6
47
Spark Programming (4)
Working with Key-Value Pairs
Spark’s “distributed reduce” transformations operate on RDDs of
key-value pairs
Python: pair = (a, b)
pair[0] # => a
pair[1] # => b
Scala: val pair = (a, b)
pair._1 // => a
pair._2 // => b
Java: Tuple2 pair = new Tuple2(a, b);
pair._1 // => a
pair._2 // => b
48
Spark Programming (5)
Some Key-Value Operations
pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])
pets.reduceByKey(lambda x, y: x + y) # => {(cat, 3), (dog, 1)}
pets.groupByKey() # => {(cat, [1, 2]), (dog, [1])}
pets.sortByKey() # => {(cat, 1), (cat, 2), (dog, 1)}
49
Example: Word Count
lines = sc.textFile("hamlet.txt")
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda x, y: x + y)
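To materialize the result, add an action; for example (the output directory name is illustrative):
counts.take(5)                           # a few (word, count) pairs returned to the driver
counts.saveAsTextFile("hamlet_counts")   # write all counts to text files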
50
Example: Spark Streaming
Represents streams as a series of RDDs over time (typically sub second intervals, but it is
configurable)
val spammers = sc.sequenceFile("hdfs://spammers.seq")
sc.twitterStream(...)
  .filter(t => t.text.contains("Santa Clara University"))
  .transform(tweets => tweets.map(t => (t.user, t)).join(spammers))
  .print()
51
Spark: Combining Libraries (Unified Pipeline)
# Load data using Spark SQL
points = spark.sql("select latitude, longitude from tweets")
# Train a machine learning model
model = KMeans.train(points, 10)
# Apply it to a stream
sc.twitterStream(...)
.map(lambda t: (model.predict(t.location), 1))
.reduceByWindow("5s", lambda a, b: a + b)
52
Spark: Setting the Level of Parallelism
All the pair RDD operations take an optional second parameter for
number of tasks
words.reduceByKey(lambda x, y: x + y, 5)
words.groupByKey(5)
visits.join(pageViews, 5)
53
Summary
Spark is a powerful “manager” for big data computing.
It centers on a job scheduler for Hadoop (MapReduce) that is smart about where to run each task: co-locating tasks with their data.
The data objects are “RDDs”: a kind of recipe for generating a file from an
underlying data collection. RDD caching allows Spark to run mostly from
memory-mapped data, for speed.
• Online tutorials: spark.apache.org/docs/latest
54