PySpark RDD Cheat Sheet
Python For Data Science
Learn PySpark RDD online at www.DataCamp.com

> Spark

PySpark is the Spark Python API that exposes the Spark programming model to Python.
> Initializing Spark

SparkContext

>>> from pyspark import SparkContext
>>> sc = SparkContext(master = 'local[2]')

Inspect SparkContext

>>> sc.version #Retrieve SparkContext version
>>> sc.pythonVer #Retrieve Python version
>>> sc.master #Master URL to connect to
>>> str(sc.sparkHome) #Path where Spark is installed on worker nodes
>>> str(sc.sparkUser()) #Retrieve name of the Spark User running SparkContext
>>> sc.appName #Return application name
>>> sc.applicationId #Retrieve application ID
>>> sc.defaultParallelism #Return default level of parallelism
>>> sc.defaultMinPartitions #Default minimum number of partitions for RDDs
Configuration

>>> from pyspark import SparkConf, SparkContext
>>> conf = (SparkConf()
         .setMaster("local")
         .setAppName("My app")
         .set("spark.executor.memory", "1g"))
>>> sc = SparkContext(conf = conf)
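The applied configuration can be read back from the running context, which is a quick sanity check that the settings took effect. A minimal sketch (getConf() returns a copy of the active SparkConf):

>>> sc.getConf().get("spark.executor.memory") #Read back a single setting, here "1g"
>>> sc.getConf().getAll() #All explicitly set (key, value) pairs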
Using The Shell

In the PySpark shell, a special interpreter-aware SparkContext is already created in the variable called sc.

$ ./bin/spark-shell --master local[2]
$ ./bin/pyspark --master local[4] --py-files code.py

Set which master the context connects to with the --master argument, and add Python .zip, .egg or .py files to the runtime path by passing a comma-separated list to --py-files.
> Loading Data

Parallelized Collections

>>> rdd = sc.parallelize([('a',7),('a',2),('b',2)])
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)])
>>> rdd3 = sc.parallelize(range(100))
>>> rdd4 = sc.parallelize([("a",["x","y","z"]),
                           ("b",["p", "r"])])
External Data

Read either one text file from HDFS, a local file system or any Hadoop-supported file system URI with textFile(), or read in a directory of text files with wholeTextFiles().

>>> textFile = sc.textFile("/my/directory/*.txt")
>>> textFile2 = sc.wholeTextFiles("/my/directory/")
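The two readers produce differently shaped RDDs: textFile() yields one element per line, while wholeTextFiles() yields one (path, content) pair per file. A minimal sketch of inspecting both, reusing the placeholder directory above:

>>> textFile.count() #Number of lines across all matched files
>>> textFile2.keys().collect() #File paths read by wholeTextFiles()
>>> textFile2.mapValues(len).collect() #(path, character count) per file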
> Retrieving RDD Information

Basic Information

>>> rdd.getNumPartitions() #List the number of partitions
>>> rdd.count() #Count RDD instances
3
>>> rdd.countByKey() #Count RDD instances by key
defaultdict(<type 'int'>,{'a':2,'b':1})
>>> rdd.countByValue() #Count RDD instances by value
defaultdict(<type 'int'>,{('b',2):1,('a',2):1,('a',7):1})
>>> rdd.collectAsMap() #Return (key,value) pairs as a dictionary
{'a': 2,'b': 2}
>>> rdd3.sum() #Sum of RDD elements
4950
>>> sc.parallelize([]).isEmpty() #Check whether RDD is empty
True
Summary

>>> rdd3.max() #Maximum value of RDD elements
99
>>> rdd3.min() #Minimum value of RDD elements
>>> rdd3.mean() #Mean value of RDD elements
49.5
>>> rdd3.stdev() #Standard deviation of RDD elements
28.866070047722118
>>> rdd3.variance() #Compute variance of RDD elements
833.25
>>> rdd3.histogram(3) #Compute histogram by bins
([0,33,66,99],[33,33,34])
>>> rdd3.stats() #Summary statistics (count, mean, stdev, max & min)
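stats() returns a single StatCounter, so the figures above can also be pulled from one pass over the data. A minimal sketch assuming the standard StatCounter accessors:

>>> st = rdd3.stats()
>>> (st.count(), st.mean(), st.stdev()) #Same figures as the individual calls above
(100, 49.5, 28.866070047722118)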
> Applying Functions

#Apply a function to each RDD element
>>> rdd.map(lambda x: x+(x[1],x[0])).collect()
[('a',7,7,'a'),('a',2,2,'a'),('b',2,2,'b')]
#Apply a function to each RDD element and flatten the result
>>> rdd5 = rdd.flatMap(lambda x: x+(x[1],x[0]))
>>> rdd5.collect()
['a',7,7,'a','a',2,2,'a','b',2,2,'b']
#Apply a flatMap function to each (key,value) pair of rdd4 without changing the keys
>>> rdd4.flatMapValues(lambda x: x).collect()
[('a','x'),('a','y'),('a','z'),('b','p'),('b','r')]
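For comparison, mapValues() also leaves the keys untouched but applies the function to each value as a whole instead of flattening the result. A minimal sketch using the rdd4 defined above:

>>> rdd4.mapValues(len).collect() #One result per key, values not flattened
[('a', 3), ('b', 2)]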
> Selecting Data

Getting

>>> rdd.collect() #Return a list with all RDD elements
[('a', 7), ('a', 2), ('b', 2)]
>>> rdd.take(2) #Take first 2 RDD elements
[('a', 7), ('a', 2)]
>>> rdd.first() #Take first RDD element
('a', 7)
>>> rdd.top(2) #Take top 2 RDD elements
[('b', 2), ('a', 7)]
Sampling

>>> rdd3.sample(False, 0.15, 81).collect() #Return sampled subset of rdd3
[3,4,27,31,40,41,42,43,60,76,79,80,86,97]
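sample() returns another RDD; when a plain Python list of fixed size is enough, takeSample() is an alternative. A minimal sketch (which elements come back depends on the seed):

>>> rdd3.takeSample(False, 5, 81) #Return a list of 5 sampled elements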
Filtering

>>> rdd.filter(lambda x: "a" in x).collect() #Filter the RDD
[('a',7),('a',2)]
>>> rdd5.distinct().collect() #Return distinct RDD values
['a',2,'b',7]
>>> rdd.keys().collect() #Return (key,value) RDD's keys
['a', 'a', 'b']
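The matching values() call returns the value side of each pair. A minimal sketch on the same rdd:

>>> rdd.values().collect() #Return (key,value) RDD's values
[7, 2, 2]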
> Iterating

>>> def g(x): print(x)
>>> rdd.foreach(g) #Apply a function to all RDD elements
('a', 7)
('b', 2)
('a', 2)
> Reshaping Data

Reducing

>>> rdd.reduceByKey(lambda x,y: x+y).collect() #Merge the rdd values for each key
[('a',9),('b',2)]
>>> rdd.reduce(lambda a,b: a+b) #Merge the rdd values
('a',7,'a',2,'b',2)
Grouping by

>>> rdd3.groupBy(lambda x: x % 2) #Return RDD of grouped values
        .mapValues(list)
        .collect()
>>> rdd.groupByKey() #Group rdd by key
       .mapValues(list)
       .collect()
[('a',[7,2]),('b',[2])]
Aggregating

>>> from operator import add
>>> seqOp = (lambda x,y: (x[0]+y,x[1]+1))
>>> combOp = (lambda x,y: (x[0]+y[0],x[1]+y[1]))
#Aggregate RDD elements of each partition and then the results
>>> rdd3.aggregate((0,0),seqOp,combOp)
(4950,100)
#Aggregate values of each RDD key
>>> rdd.aggregateByKey((0,0),seqOp,combOp).collect()
[('a',(9,2)), ('b',(2,1))]
#Aggregate the elements of each partition, and then the results
>>> rdd3.fold(0,add)
4950
#Merge the values for each key
>>> rdd.foldByKey(0, add).collect()
[('a',9),('b',2)]
#Create tuples of RDD elements by applying a function
>>> rdd3.keyBy(lambda x: x+x).collect()
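The (sum, count) pair produced by aggregate() is the usual way to derive a mean in a single pass. A minimal sketch reusing the seqOp and combOp defined above:

>>> total, count = rdd3.aggregate((0,0), seqOp, combOp)
>>> total / float(count) #Same value as rdd3.mean()
49.5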
> Mathematical Operations

>>> rdd.subtract(rdd2).collect() #Return each rdd value not contained in rdd2
[('b',2),('a',7)]
#Return each (key,value) pair of rdd2 with no matching key in rdd
>>> rdd2.subtractByKey(rdd).collect()
[('d', 1)]
>>> rdd.cartesian(rdd2).collect() #Return the Cartesian product of rdd and rdd2
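The Cartesian product pairs every element of rdd with every element of rdd2, so its size is the product of the two counts. A minimal sketch:

>>> rdd.cartesian(rdd2).count() #3 elements x 3 elements
9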
> Sort

>>> rdd2.sortBy(lambda x: x[1]).collect() #Sort RDD by given function
[('d',1),('b',1),('a',2)]
>>> rdd2.sortByKey().collect() #Sort (key, value) RDD by key
[('a',2),('b',1),('d',1)]
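Both sort calls accept an ascending flag to reverse the order. A minimal sketch:

>>> rdd2.sortByKey(ascending=False).collect() #Sort (key, value) RDD by key, descending
[('d',1),('b',1),('a',2)]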
> Repartitioning

>>> rdd.repartition(4) #New RDD with 4 partitions
>>> rdd.coalesce(1) #Decrease the number of partitions in the RDD to 1
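Both calls return a new RDD rather than changing the original, which getNumPartitions() can confirm. A minimal sketch:

>>> rdd.repartition(4).getNumPartitions()
4
>>> rdd.getNumPartitions() #The original rdd keeps its old partitioning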
> Saving

>>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent/child",
                         'org.apache.hadoop.mapred.TextOutputFormat')
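saveAsTextFile() writes a directory of part files (one per partition) rather than a single file, and each element is stored as its string form. A minimal sketch of reading the data back:

>>> sc.textFile("rdd.txt").collect() #Elements come back as strings such as "('a', 7)"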
> Stopping SparkContext

>>> sc.stop()
> Execution

$ ./bin/spark-submit examples/src/main/python/pi.py
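spark-submit accepts the same --master and --py-files options as the interactive shell. A minimal sketch using the bundled Pi example and the code.py placeholder from above:

$ ./bin/spark-submit --master local[4] --py-files code.py examples/src/main/python/pi.py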
Learn Data Skills Online at www.DataCamp.com