2 - Intro to PySpark RDD
PySpark RDD
Big Data Fundamentals with PySpark
Upendra Devisetty
Science Analyst, CyVerse
What is an RDD?
RDD = Resilient Distributed Datasets
numRDD = sc.parallelize([1,2,3,4])
type(numRDD)
<class 'pyspark.rdd.RDD'>
fileRDD = sc.textFile("README.md")
type(fileRDD)
<class 'pyspark.rdd.RDD'>
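The key idea behind `parallelize()` is that the input collection is split into partitions that Spark distributes across the cluster. A minimal plain-Python sketch of that partitioning step (no Spark needed; `parallelize` here is a hypothetical stand-in, not the real API):

```python
# Plain-Python sketch: parallelize() conceptually splits a collection
# into partitions that Spark would distribute across worker nodes.
def parallelize(data, num_partitions=2):
    """Split `data` into roughly equal chunks, like sc.parallelize()."""
    step = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + step] for i in range(0, len(data), step)]

partitions = parallelize([1, 2, 3, 4], num_partitions=2)
print(partitions)  # [[1, 2], [3, 4]]
```

Real Spark decides partition counts from the cluster configuration (or the optional `numSlices` argument), but the chunking idea is the same.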
textFile() method
Overview of PySpark operations
RDD = sc.parallelize([1,2,3,4])
RDD_map = RDD.map(lambda x: x * x)
RDD = sc.parallelize([1,2,3,4])
RDD_filter = RDD.filter(lambda x: x > 2)
inputRDD = sc.textFile("logs.txt")
errorRDD = inputRDD.filter(lambda x: "error" in x.split())
warningsRDD = inputRDD.filter(lambda x: "warnings" in x.split())
combinedRDD = errorRDD.union(warningsRDD)
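The same filter/union pipeline can be traced in plain Python to see what each step produces; the log lines below are made up for illustration, and list operations stand in for the lazy RDD transformations:

```python
# Plain-Python analogy of the filter()/union() pipeline above.
log_lines = [
    "2024-01-01 error disk full",
    "2024-01-01 info job started",
    "2024-01-01 warnings low memory",
]
# "error" in x.split() matches the whole word, not substrings
errors = [line for line in log_lines if "error" in line.split()]
warnings = [line for line in log_lines if "warnings" in line.split()]
combined = errors + warnings  # union() concatenates; duplicates are kept
print(combined)
```

Note that `union()`, like list concatenation here, does not deduplicate: a line containing both words would appear twice.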
collect()
take(N)
first()
count()
RDD_map.collect()
[1, 4, 9, 16]
RDD_map.take(2)
[1, 4]
RDD_map.first()
1
RDD_flatmap = sc.parallelize(["hello world", "how are you"]).flatMap(lambda x: x.split(" "))
RDD_flatmap.count()
5
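Since `collect()` brings the RDD back as a plain Python list, the four basic actions map directly onto list operations, as this no-Spark sketch shows:

```python
# What each action returns, mimicked on a plain list
# (the list plays the role of RDD_map.collect()).
squares = [x * x for x in [1, 2, 3, 4]]  # like RDD.map(lambda x: x * x)
print(squares)       # collect() -> [1, 4, 9, 16]
print(squares[:2])   # take(2)   -> [1, 4]
print(squares[0])    # first()   -> 1 (a single element, not a list)
print(len(squares))  # count()   -> 4
```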
Introduction to pair RDDs in PySpark
Real-life datasets are usually key/value pairs
Each row is a key that maps to one or more values
A pair RDD is a special data structure for working with this kind of dataset
RDD1 = sc.parallelize([('Messi', 34), ('Ronaldo', 32), ('Neymar', 24)])
RDD2 = sc.parallelize([('Ronaldo', 80), ('Neymar', 120), ('Messi', 100)])
RDD1.join(RDD2).collect()
[('Neymar', (24, 120)), ('Ronaldo', (32, 80)), ('Messi', (34, 100))]
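An inner join on pair RDDs matches elements by key and pairs up their values. A plain-Python sketch of that logic, using the player ages and goal counts implied by the output above (`inner_join` is a hypothetical helper, not a Spark API):

```python
# Plain-Python sketch of a pair-RDD inner join:
# for each key present in both datasets, pair up the values.
def inner_join(rdd1, rdd2):
    lookup = {}
    for key, value in rdd2:
        lookup.setdefault(key, []).append(value)
    return [(k, (v, w)) for k, v in rdd1 for w in lookup.get(k, [])]

ages = [("Messi", 34), ("Ronaldo", 32), ("Neymar", 24)]
goals = [("Ronaldo", 80), ("Neymar", 120), ("Messi", 100)]
print(inner_join(ages, goals))
```

Keys present in only one dataset are dropped, just as in `join()`; Spark makes no ordering guarantee on the result, which is why the slide's output order can differ from run to run.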
reduce() action
The reduce(func) action aggregates the elements of a regular RDD
The function must be commutative (changing the order of the operands does not change
the result) and associative (the grouping of operations does not change the result),
because partitions are combined in no fixed order
x = [1,3,4,6]
RDD = sc.parallelize(x)
RDD.reduce(lambda x, y : x + y)
14
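The same aggregation can be reproduced locally with Python's `functools.reduce`, which also makes it easy to see why commutativity and associativity matter:

```python
from functools import reduce

nums = [1, 3, 4, 6]
# Addition is commutative and associative, so the result is the same
# no matter how Spark combines partitions:
print(reduce(lambda x, y: x + y, nums))  # 14, like RDD.reduce() above

# Subtraction is neither commutative nor associative, so the answer
# would depend on how partitions happen to be merged -- unsafe here:
print((1 - 3) - 4)  # -6
print(1 - (3 - 4))  # 2
```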
# saveAsTextFile() writes one output file per partition
RDD.saveAsTextFile("tempFile")
# coalesce(1) merges the partitions so a single file is written
RDD.coalesce(1).saveAsTextFile("tempFile")
countByKey() action
rdd = sc.parallelize([('a', 1), ('b', 1), ('a', 1)])
sorted(rdd.countByKey().items())
[('a', 2), ('b', 1)]
collectAsMap() action
sc.parallelize([(1, 2), (3, 4)]).collectAsMap()
{1: 2, 3: 4}
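Since `collectAsMap()` returns the pair RDD as a Python dictionary, duplicate keys collapse: later values overwrite earlier ones, just as with Python's `dict()` constructor. A plain-Python sketch of that behavior:

```python
# collectAsMap() hands the pair RDD back as a dict; with duplicate
# keys, one value silently wins -- mimicked here with dict():
pairs = [(1, 2), (3, 4)]
print(dict(pairs))             # {1: 2, 3: 4}
print(dict([(1, 2), (1, 9)]))  # {1: 9} -- the later value for key 1 wins
```

This is why `collectAsMap()` is only appropriate when keys are unique (and the result is small enough to fit in driver memory).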