INTRODUCTION TO SPARK WITH PYTHON
Gökhan Atıl
GÖKHAN ATIL
➤ Database Administrator
➤ Oracle ACE Director (2016), ACE (2011)
➤ 10g/11g and R12 Oracle Certified Professional (OCP)
➤ Co-author of Expert Oracle Enterprise Manager 12c
➤ Founding Member and Vice President of TROUG
➤ Blogger (since 2008) gokhanatil.com
➤ Twitter: @gokhanatil
APACHE SPARK WITH PYTHON
➤ Introduction to Apache Spark
➤ Why Python (PySpark) instead of Scala?
➤ Spark RDD
➤ SQL and DataFrames
➤ Spark Streaming
➤ Spark GraphX
➤ Spark MLlib (Machine Learning)
INTRODUCTION TO APACHE SPARK
➤ A fast and general engine for large-scale data processing
➤ Top-Level Apache Project since 2014.
➤ A response to the limitations of MapReduce
➤ Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
➤ Implemented in Scala; supports Java, Scala, Python and R
➤ Runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud
DOWNLOAD AND RUN ON YOUR PC
➤ https://spark.apache.org/downloads.html
➤ Extract and Spark is ready:
tar -xzf spark-2.3.0-bin-hadoop2.7.tgz
spark-2.3.0-bin-hadoop2.7/bin/spark
➤ You can also use pip:
pip install pyspark
PYSPARK AND SPARK-SUBMIT
➤ PySpark is the interface that gives access to Spark using the
Python programming language
➤ The spark-submit script in Spark’s bin directory is used to
launch applications on a cluster
spark-submit example1.py
WHY PYTHON INSTEAD OF SCALA?
➤ If you know Scala, then use Scala!
➤ Learning curve: Python is comparatively easier to learn
➤ Easy to use: code readability, maintainability and familiarity are far better with Python
➤ Libraries: Python comes with great libraries for data analysis, statistics and visualization (numpy, pandas, matplotlib, etc.)
➤ Performance: Scala is faster than Python, but if your Python code just calls Spark libraries, the difference in performance is minimal (*)
Reminder: Any new feature added in Spark API will be
available in Scala first
RESILIENT DISTRIBUTED DATASET (RDD)
➤ RDDs are the core data structure in Spark
➤ Distributed, resilient, immutable and lazily evaluated; can store both unstructured and structured data
[Figure: one RDD split into partitions 1, 2 and 3, stored on nodes 1, 2 and 3]
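The idea of one dataset spread over several nodes can be sketched in plain Python. This is only an illustration of partitioning, not Spark's actual partitioning algorithm, and `split_into_partitions` is a hypothetical helper name:

```python
def split_into_partitions(items, num_partitions):
    """Split a list into num_partitions contiguous slices of near-equal size."""
    size, extra = divmod(len(items), num_partitions)
    partitions, start = [], 0
    for index in range(num_partitions):
        # The first `extra` partitions take one additional element
        end = start + size + (1 if index < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions


data = list(range(1, 8))                      # 7 elements
partitions = split_into_partitions(data, 3)   # one slice per "node"
# Each partition can be processed independently, then the results combined
partial_sums = [sum(part) for part in partitions]
total = sum(partial_sums)
print(partitions, partial_sums, total)
```

Because every partition is processed on its own, the per-partition work could run on different machines; only the small partial results need to be combined.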
RDD TRANSFORMATIONS AND ACTIONS
[Figure: sc.textFile(*) on the Spark context creates an RDD; the transformations .map(*), .filter(*) and .reduceByKey(*) (T1, T2, T3) are lazily evaluated and run only when the action .collect() is called]
TRANSFORMATIONS
➤ map
➤ filter
➤ flatMap
➤ mapPartitions
➤ reduceByKey
➤ union
➤ intersection
➤ join
ACTIONS
➤ collect
➤ count
➤ first
➤ take
➤ takeSample
➤ takeOrdered
➤ saveAsTextFile
➤ foreach
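The split between lazy transformations and eager actions can be illustrated without Spark. The following is a minimal pure-Python sketch, not Spark's implementation: a hypothetical `LazyPipeline` class records transformations and runs them only when the action `collect` is called.

```python
class LazyPipeline:
    """Hypothetical stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)  # immutable: each transformation returns a new pipeline

    def map(self, fn):
        return LazyPipeline(self._data, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyPipeline(self._data, self._steps + (("filter", fn),))

    def collect(self):
        # Action: only now is the recorded chain of transformations executed
        items = iter(self._data)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []  # records every element the map function actually touches


def traced_double(x):
    calls.append(x)
    return x * 2


pipeline = LazyPipeline([1, 2, 3, 4]).map(traced_double).filter(lambda x: x > 4)
calls_before_action = list(calls)   # still empty: nothing has run yet
result = pipeline.collect()         # the action triggers the work
print(calls_before_action, result)
```

Building the pipeline touches no data; `traced_double` is first called inside `collect`, which mirrors how an RDD chain does no work until an action runs.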
HOW TO CREATE RDD IN PYSPARK
➤ Referencing a dataset in an external storage system:
rdd = sc.textFile( ... )
➤ Parallelizing an existing collection:
rdd = sc.parallelize( ... )
➤ Creating an RDD from existing RDDs:
rdd2 = rdd1.map( ... )
USERS.CSV (MOVIELENS DATABASE)
id | age | gender | occupation | zip
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
6|42|M|executive|98101
7|57|M|administrator|91344
8|36|M|administrator|05201
EXAMPLE #1: USE RDD TO GROUP DATA FROM CSV
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
# Map each line to a (gender, 1) pair, then sum the counts per gender
counts = (sc.textFile("users.csv")
    .map(lambda x: (x.split("|")[2], 1))
    .reduceByKey(lambda x, y: x + y)
    .collect())
print(counts)
sc.stop()
Pairs after map: (M, 1), (M, 1), (F, 1), (M, 1), ...
[(u'M', 670), (u'F', 273)]
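To make the map and reduceByKey steps concrete, here is a pure-Python sketch (not PySpark) applied to the eight sample rows of users.csv listed earlier. The helper name `reduce_by_key` is hypothetical, and the counts differ from the output above because only the sample rows are used.

```python
from functools import reduce
from itertools import groupby

SAMPLE_ROWS = [
    "1|24|M|technician|85711",
    "2|53|F|other|94043",
    "3|23|M|writer|32067",
    "4|24|M|technician|43537",
    "5|33|F|other|15213",
    "6|42|M|executive|98101",
    "7|57|M|administrator|91344",
    "8|36|M|administrator|05201",
]


def reduce_by_key(pairs, fn):
    """Merge the values of each key with fn, as reduceByKey does conceptually."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return [
        (key, reduce(fn, (value for _, value in group)))
        for key, group in groupby(ordered, key=lambda kv: kv[0])
    ]


# map step: one (gender, 1) pair per input line
pairs = [(line.split("|")[2], 1) for line in SAMPLE_ROWS]
# reduce step: sum the 1s for each gender
counts = reduce_by_key(pairs, lambda x, y: x + y)
print(counts)
```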
SPARK SQL AND DATAFRAMES
[Figure: Spark SQL stack. Labels: SQL, DataFrames/Datasets, Catalyst, RDD, MLlib, GraphFrames, Structured Streaming]
➤ Spark SQL is Apache Spark's module for working with
structured data
DATAFRAMES AND DATASETS
➤ A DataFrame is a distributed collection of structured data, organized into named columns.
➤ Spark Datasets are statically typed, while Python is dynamically typed, so PySpark supports only DataFrames.
EXAMPLE #2: USE DATAFRAME TO GROUP DATA FROM CSV
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
# Read the pipe-delimited file, name the columns, then count rows per gender
(spark.read.load("users.csv", format="csv", sep="|")
    .toDF("id", "age", "gender", "occupation", "zip")
    .groupby("gender").count().show())
sc.stop()
DATAFRAME VERSUS RDD
CATALYST OPTIMIZER
➤ Spark SQL uses the Catalyst optimizer to optimize query plans.
➤ Supports cost-based optimization since Spark 2.2
[Figure: SQL, DataFrame and Dataset → Query Plan → Optimized Query Plan → Code Generation → RDD]
CONVERSION BETWEEN RDD AND DATAFRAME
➤ An RDD can be converted to a DataFrame using the createDataFrame or toDF method:
rdd = sc.parallelize([("osman", 21), ("ahmet", 25)])
df = rdd.toDF("name STRING, age INT")
df.show()
➤ You can access the underlying RDD of a DataFrame through its rdd property:
df.rdd.collect()
[Row(name=u'osman',age=21),Row(name=u'ahmet',age=25)]
EXAMPLE #3: CREATE TEMPORARY VIEWS FROM DATAFRAMES
(spark.read.load("users.csv", format="csv", sep="|")
    .toDF("id", "age", "gender", "occupation", "zip")
    .createOrReplaceTempView("users"))
spark.sql("select count(*) from users").show()
# Bucket users into age groups and count each bucket
spark.sql("""select case when age < 25 then '-25'
    when age between 25 and 39 then '25-39'
    when age >= 40 then '40+' end age_group,
    count(*) from users group by age_group order by 1""").show()
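The age-group query can be tried without a cluster. This sketch runs an equivalent query in SQLite through Python's built-in sqlite3 module, which stands in for Spark SQL purely as an assumption for illustration. It uses the eight sample rows of users.csv, and the middle bucket is labelled '25-39' to match its range.

```python
import sqlite3

SAMPLE_USERS = [
    (1, 24, "M", "technician", "85711"),
    (2, 53, "F", "other", "94043"),
    (3, 23, "M", "writer", "32067"),
    (4, 24, "M", "technician", "43537"),
    (5, 33, "F", "other", "15213"),
    (6, 42, "M", "executive", "98101"),
    (7, 57, "M", "administrator", "91344"),
    (8, 36, "M", "administrator", "05201"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table users (id integer, age integer, gender text, occupation text, zip text)"
)
conn.executemany("insert into users values (?, ?, ?, ?, ?)", SAMPLE_USERS)

# Same CASE expression as the Spark SQL example: bucket by age, then count per bucket
age_groups = conn.execute(
    """
    select case when age < 25 then '-25'
                when age between 25 and 39 then '25-39'
                when age >= 40 then '40+' end as age_group,
           count(*)
    from users
    group by age_group
    order by 1
    """
).fetchall()
conn.close()
print(age_groups)
```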
EXAMPLE #4: READ AND WRITE DATA
df = (spark.read.load("users.csv", format="csv", sep="|")
    .toDF("id", "age", "gender", "occupation", "zip"))
# Save as a persistent table (Hive)
df.write.saveAsTable("users")
# Save as JSON, then query the JSON files directly with SQL
df.write.save("users.json", format="json", mode="overwrite")
spark.sql("SELECT gender, count(*) FROM "
    "json.`users.json` GROUP BY gender").show()
SPARK STREAMING (DSTREAMS)
➤ Scalable, high-throughput, fault-tolerant stream processing of
live data streams
➤ Supports: File, Socket, Kafka, Flume, Kinesis
➤ Spark Streaming receives live input data streams and divides
the data into batches
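The batching step can be sketched in plain Python. This illustrates the idea only and is not how Spark Streaming is implemented: the hypothetical `to_batches` helper groups timestamped events into fixed-length intervals, similar in spirit to the batch duration given to a StreamingContext.

```python
from collections import defaultdict


def to_batches(events, interval):
    """Group (timestamp_seconds, value) events into consecutive batches of `interval` seconds."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # Start time of the batch this event falls into
        batch_start = int(timestamp // interval) * interval
        batches[batch_start].append(value)
    return [(start, batches[start]) for start in sorted(batches)]


# Hypothetical input: events arriving at 0.2s, 0.7s, 1.1s and 2.5s
events = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (2.5, "d")]
batches = to_batches(events, interval=1)
for start, values in batches:
    print(start, values)
```

Each batch can then be handed to the same processing code that would be used for a static dataset.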
EXAMPLE #5: DISCRETIZED STREAMS (DSTREAMS)
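from pyspark.streaming import StreamingContext

# Batch interval of 1 second
ssc = StreamingContext(sc, 1)
# Watch the directory for new files and split each line on commas
stream_data = (ssc.textFileStream("file:///tmp/stream")
    .map(lambda x: x.split(",")))
stream_data.pprint()
ssc.start()
ssc.awaitTermination()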
STRUCTURED STREAMING
➤ Stream processing engine built on the Spark SQL engine
➤ Supports File and Kafka sources for production; Socket and
Rate sources for testing
EXAMPLE #6: STRUCTURED STREAMING
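# Read CSV files as a stream, total the points per name, highest total first
stream_data = (spark.readStream
    .load(format="csv", path="/tmp/stream/*.csv",
          schema="name string, points int")
    .groupBy("name").sum("points")
    .orderBy("sum(points)", ascending=False))
# Print the complete result table to the console after each trigger
(stream_data.writeStream
    .start(format="console", outputMode="complete")
    .awaitTermination())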
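What the "complete" output mode means can be sketched in plain Python. This is a conceptual illustration with made-up batches, not Spark's implementation: running totals per name are kept across batches, and the whole sorted result table is produced again after every batch. The function name `run_complete_mode` and the sample names are hypothetical.

```python
from collections import Counter


def run_complete_mode(batches):
    """Yield the full (name, total) table, highest total first, after each batch."""
    totals = Counter()
    for batch in batches:
        for name, points in batch:
            totals[name] += points
        # "complete" mode: re-emit the entire aggregated table, not just changed rows
        yield sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: each inner list stands for one newly arrived CSV file
incoming = [
    [("ali", 10), ("veli", 5)],
    [("veli", 20)],
]
tables = list(run_complete_mode(incoming))
for table in tables:
    print(table)
```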
GRAPHX (GRAPHFRAMES)
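➤ GraphX is Spark's component for graphs and graph-parallel computation.
➤ GraphX has no Python API; the GraphFrames package provides DataFrame-based graphs that can be used from PySpark.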
EXAMPLE #7: GRAPHFRAMES
vertex = spark.createDataFrame([
    (1, "Ahmet"),
    (2, "Mehmet"),
    (3, "Cengiz"),
    (4, "Osman")],
    ["id", "name"])
edges = spark.createDataFrame([
    (1, 2, "friend"),
    (2, 1, "friend"),
    (2, 3, "friend"),
    (3, 2, "friend"),
    (2, 4, "friend"),
    (4, 2, "friend"),
    (3, 4, "friend"),
    (4, 3, "friend")],
    ["src", "dst", "relation"])
EXAMPLE #7: GRAPHFRAMES
pyspark --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11
import graphframes as gf
g = gf.GraphFrame(vertex, edges)
g.shortestPaths([4]).show()
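For intuition about the shortestPaths call, the sketch below computes hop counts from every vertex to landmark vertex 4 with a plain breadth-first search over the same edge list. It is a pure-Python illustration, not the GraphFrames implementation, and `hops_to_landmark` is a hypothetical helper name.

```python
from collections import deque

# (src, dst) pairs from the example; the "friend" relation label is omitted
EDGES = [(1, 2), (2, 1), (2, 3), (3, 2), (2, 4), (4, 2), (3, 4), (4, 3)]


def hops_to_landmark(edges, landmark):
    """Return {vertex: number of edges on a shortest directed path to landmark}."""
    # Walk the edges backwards from the landmark, so distances are measured "to" it
    incoming = {}
    for src, dst in edges:
        incoming.setdefault(dst, []).append(src)
    distances = {landmark: 0}
    queue = deque([landmark])
    while queue:
        current = queue.popleft()
        for neighbour in incoming.get(current, []):
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)
    return distances


distances = hops_to_landmark(EDGES, landmark=4)
print(sorted(distances.items()))
```

Vertices 2 and 3 are direct friends of vertex 4, while vertex 1 reaches it only through vertex 2, so it is two hops away.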
MLLIB (MACHINE LEARNING)
➤ Supports common ML Algorithms such as classification,
regression, clustering, and collaborative filtering
➤ Featurization:
➤ Feature extraction (TF-IDF, Word2Vec, CountVectorizer ...)
➤ Transformation (Tokenizer, StopWordsRemover ...)
➤ Selection (VectorSlicer, RFormula ... )
➤ Pipelines: combine multiple algorithms into a single pipeline,
or workflow
➤ The DataFrame-based API is the primary API
EXAMPLE #8: ALTERNATING LEAST SQUARES (ALS)
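from pyspark.ml.recommendation import ALS

def parseratings(x):
    # Fields are separated by "::"; the first three are user, item and rating
    v = x.split("::")
    return (int(v[0]), int(v[1]), float(v[2]))

ratings = (sc.textFile("ratings.dat").map(parseratings)
    .toDF(["user", "id", "rating"]))
als = ALS(userCol="user", itemCol="id", ratingCol="rating")
model = als.fit(ratings)
# Top 10 item recommendations for every user
model.recommendForAllUsers(10).show()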
Blog: www.gokhanatil.com Twitter: @gokhanatil

zOSCommserver
 
Essentials of Resource Planning in a Downturn
Essentials of Resource Planning in a DownturnEssentials of Resource Planning in a Downturn
Essentials of Resource Planning in a Downturn
OnePlan Solutions
 
Why Indonesia’s $12.63B Alt-Lending Boom Needs Loan Servicing Automation & Re...
Why Indonesia’s $12.63B Alt-Lending Boom Needs Loan Servicing Automation & Re...Why Indonesia’s $12.63B Alt-Lending Boom Needs Loan Servicing Automation & Re...
Why Indonesia’s $12.63B Alt-Lending Boom Needs Loan Servicing Automation & Re...
Prachi Desai
 
The rise of e-commerce has redefined how retailers operate—and reconciliation...
The rise of e-commerce has redefined how retailers operate—and reconciliation...The rise of e-commerce has redefined how retailers operate—and reconciliation...
The rise of e-commerce has redefined how retailers operate—and reconciliation...
Prachi Desai
 
Automating Map Production With FME and Python
Automating Map Production With FME and PythonAutomating Map Production With FME and Python
Automating Map Production With FME and Python
Safe Software
 

Introduction to Spark with Python

  • 2. GÖKHAN ATIL ➤ Database Administrator ➤ Oracle ACE Director (2016), ACE (2011) ➤ 10g/11g and R12 Oracle Certified Professional (OCP) ➤ Co-author of Expert Oracle Enterprise Manager 12c ➤ Founding Member and Vice President of TROUG ➤ Blogger (since 2008) gokhanatil.com ➤ Twitter: @gokhanatil
  • 3. APACHE SPARK WITH PYTHON ➤ Introduction to Apache Spark ➤ Why Python (PySpark) instead of Scala? ➤ Spark RDD ➤ SQL and DataFrames ➤ Spark Streaming ➤ Spark GraphX ➤ Spark MLlib (Machine Learning)
  • 4. INTRODUCTION TO APACHE SPARK ➤ A fast and general engine for large-scale data processing ➤ Top-level Apache project since 2014 ➤ A response to the limitations of MapReduce ➤ Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk ➤ Implemented in the Scala programming language; supports Java, Scala, Python and R ➤ Runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud
  • 5. DOWNLOAD AND RUN ON YOUR PC ➤ https://spark.apache.org/downloads.html ➤ Extract and Spark is ready:
        tar -xzf spark-2.3.0-bin-hadoop2.7.tgz
        spark-2.3.0-bin-hadoop2.7/bin/spark
    ➤ You can also use pip:
        pip install pyspark
  • 6. PYSPARK AND SPARK-SUBMIT ➤ PySpark is the interface that gives access to Spark using the Python programming language ➤ The spark-submit script in Spark’s bin directory is used to launch applications on a cluster:
        spark-submit example1.py
  • 7. WHY PYTHON INSTEAD OF SCALA? ➤ If you know Scala, then use Scala! ➤ Learning curve: Python is comparatively easier to learn ➤ Easy to use: Code readability, maintainability and familiarity are far better with Python ➤ Libraries: Python comes with great libraries for data analysis, statistics and visualization (numpy, pandas, matplotlib etc.) ➤ Performance: Scala is faster than Python, but if your Python code just calls Spark libraries, the difference in performance is minimal (*) Reminder: Any new feature added to the Spark API will be available in Scala first
  • 8. RESILIENT DISTRIBUTED DATASET (RDD) ➤ RDDs are the core data structure in Spark ➤ Distributed, resilient, immutable, can store unstructured and structured data, lazily evaluated [Diagram: one RDD split into partitions 1, 2 and 3, stored on nodes 1, 2 and 3]
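  • 9. RDD TRANSFORMATIONS AND ACTIONS [Diagram: the Spark context creates an RDD with sc.textFile(*); the transformations T1, T2, T3 (.map(*), .filter(*), .reduceByKey(*)) are lazily evaluated and only run when an action such as .collect() is called]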
  • 10. TRANSFORMATIONS: ➤ map ➤ filter ➤ flatMap ➤ mapPartitions ➤ reduceByKey ➤ union ➤ intersection ➤ join ACTIONS: ➤ collect ➤ count ➤ first ➤ take ➤ takeSample ➤ takeOrdered ➤ saveAsTextFile ➤ foreach
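To make the transformation/action split concrete, here is a minimal pure-Python sketch of lazy evaluation. It does not use Spark: `LazyPipeline` is a hypothetical toy class invented for illustration, and it only mimics the idea that `map` and `filter` record work while `collect` triggers it.

```python
class LazyPipeline:
    """Toy stand-in for an RDD: transformations are recorded, not executed."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: returns a new pipeline, runs nothing.
        return LazyPipeline(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: returns a new pipeline, runs nothing.
        return LazyPipeline(self._data, self._steps + (("filter", pred),))

    def collect(self):
        # Action: only now are the recorded steps applied to the data.
        items = iter(self._data)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []

def double(x):
    calls.append(x)  # record every real invocation
    return x * 2

pipeline = LazyPipeline([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: no transformation has run
result = pipeline.collect()       # the action runs the whole chain
print(calls_before_action, result)
```

The design mirrors the slide: each transformation returns a new immutable object, so nothing is computed until an action asks for a result.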
  • 11. HOW TO CREATE RDD IN PYSPARK ➤ Referencing a dataset in an external storage system:
        rdd = sc.textFile( ... )
    ➤ Parallelizing an already existing collection:
        rdd = sc.parallelize( ... )
    ➤ Creating an RDD from already existing RDDs:
        rdd2 = rdd1.map( ... )
  • 12. USERS.CSV (MOVIELENS DATABASE)
        id | age | gender | occupation | zip
        1|24|M|technician|85711
        2|53|F|other|94043
        3|23|M|writer|32067
        4|24|M|technician|43537
        5|33|F|other|15213
        6|42|M|executive|98101
        7|57|M|administrator|91344
        8|36|M|administrator|05201
    Gender totals: M = 670, F = 273
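  • 13. EXAMPLE #1: USE RDD TO GROUP DATA FROM CSV
        from pyspark import SparkContext
        sc = SparkContext.getOrCreate()
        # Emit a (gender, 1) pair per row, then sum the ones for each gender.
        print(
            sc.textFile("users.csv")
            .map(lambda x: (x.split("|")[2], 1))
            .reduceByKey(lambda x, y: x + y)
            .collect()
        )
        sc.stop()
    Intermediate pairs: (M, 1), (M, 1), (F, 1), (M, 1), ... Output: [(u'M', 670), (u'F', 273)] (the u prefix appears under Python 2 only)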
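The map and reduceByKey steps of Example #1 can be traced without a cluster. The sketch below is plain Python, not Spark: `reduce_by_key` is a hypothetical helper written for illustration, and the input is only the eight sample rows shown on the users.csv slide, so the counts differ from the full-file totals.

```python
from collections import defaultdict
from functools import reduce

# The eight sample rows from the users.csv slide.
rows = [
    "1|24|M|technician|85711",
    "2|53|F|other|94043",
    "3|23|M|writer|32067",
    "4|24|M|technician|43537",
    "5|33|F|other|15213",
    "6|42|M|executive|98101",
    "7|57|M|administrator|91344",
    "8|36|M|administrator|05201",
]

def reduce_by_key(pairs, fn):
    """Group values by key, then fold each group with fn (hypothetical helper)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(fn, values) for key, values in grouped.items()}

# Same lambdas as in Example #1: emit (gender, 1), then add the ones.
pairs = [(line.split("|")[2], 1) for line in rows]
counts = reduce_by_key(pairs, lambda x, y: x + y)
print(sorted(counts.items()))  # [('F', 2), ('M', 6)]
```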
  • 14. SPARK SQL AND DATAFRAMES ➤ Spark SQL is Apache Spark's module for working with structured data [Diagram labels: SQL, DataFrames/DataSets, Spark SQL, Catalyst, RDD, MLlib, GraphFrames, Structured Streaming]
  • 15. DATAFRAMES AND DATASETS ➤ A DataFrame is a distributed collection of "structured" data, organized into named columns. ➤ Spark DataSets are statically typed, while Python is a dynamically typed programming language, so Python supports only DataFrames.
  • 16. EXAMPLE #2: USE DATAFRAME TO GROUP DATA FROM CSV
        from pyspark import SparkContext
        from pyspark.sql import SparkSession
        sc = SparkContext.getOrCreate()
        spark = SparkSession(sc)
        # Read the pipe-delimited file, name the columns, then count rows per gender.
        (
            spark.read.load("users.csv", format="csv", sep="|")
            .toDF("id", "age", "gender", "occupation", "zip")
            .groupby("gender")
            .count()
            .show()
        )
        sc.stop()
  • 18. CATALYST OPTIMIZER ➤ Spark SQL uses the Catalyst optimizer to optimize query plans. ➤ Supports cost-based optimization since Spark 2.2 [Diagram: SQL / DataFrame / DataSet → Query Plan → Optimized Query Plan → RDDs (via code generation)]
  • 19. CONVERSION BETWEEN RDD AND DATAFRAME ➤ An RDD can be converted to a DataFrame using the createDataFrame or toDF method:
        rdd = sc.parallelize([("osman", 21), ("ahmet", 25)])
        df = rdd.toDF("name STRING, age INT")
        df.show()
    ➤ You can access the underlying RDD of a DataFrame using its rdd property:
        df.rdd.collect()
        [Row(name=u'osman', age=21), Row(name=u'ahmet', age=25)]
  • 20. EXAMPLE #3: CREATE TEMPORARY VIEWS FROM DATAFRAMES
        (
            spark.read.load("users.csv", format="csv", sep="|")
            .toDF("id", "age", "gender", "occupation", "zip")
            .createOrReplaceTempView("users")
        )
        spark.sql("select count(*) from users").show()
        spark.sql("""
            select case when age < 25 then '-25'
                        when age between 25 and 39 then '25-40'
                        when age >= 40 then '40+'
                   end age_group,
                   count(*)
            from users
            group by age_group
            order by 1
        """).show()
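The age-group query in Example #3 can be tried without Spark. As a rough sketch, the snippet below runs the same CASE expression in Python's built-in sqlite3 module over the eight sample rows from the users.csv slide. SQLite is a stand-in here, not Spark SQL; the assumption is that both dialects treat this particular query the same way.

```python
import sqlite3

# The eight sample rows from the users.csv slide.
users = [
    (1, 24, "M", "technician", "85711"),
    (2, 53, "F", "other", "94043"),
    (3, 23, "M", "writer", "32067"),
    (4, 24, "M", "technician", "43537"),
    (5, 33, "F", "other", "15213"),
    (6, 42, "M", "executive", "98101"),
    (7, 57, "M", "administrator", "91344"),
    (8, 36, "M", "administrator", "05201"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER, age INTEGER, gender TEXT, occupation TEXT, zip TEXT)"
)
conn.executemany("INSERT INTO users VALUES (?, ?, ?, ?, ?)", users)

# Same bucketing logic as the Spark SQL statement in Example #3.
age_groups = conn.execute("""
    SELECT CASE WHEN age < 25 THEN '-25'
                WHEN age BETWEEN 25 AND 39 THEN '25-40'
                WHEN age >= 40 THEN '40+'
           END AS age_group,
           COUNT(*)
    FROM users
    GROUP BY age_group
    ORDER BY 1
""").fetchall()
conn.close()
print(age_groups)  # [('-25', 3), ('25-40', 2), ('40+', 3)]
```

Note that the middle bucket is labeled '25-40' but covers ages 25 to 39, exactly as in the original query.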
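  • 21. EXAMPLE #4: READ AND WRITE DATA
        df = (
            spark.read.load("users.csv", format="csv", sep="|")
            .toDF("id", "age", "gender", "occupation", "zip")
        )
        # Save as a persistent table (the slide labels this target HIVE).
        df.write.saveAsTable("users")
        # Write the same data as JSON, replacing any existing output.
        df.write.save("users.json", format="json", mode="overwrite")
        # Query the JSON output directly with SQL.
        spark.sql("SELECT gender, count(*) FROM json.`users.json` GROUP BY gender").show()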
  • 22. SPARK STREAMING (DSTREAMS) ➤ Scalable, high-throughput, fault-tolerant stream processing of live data streams ➤ Supports: File, Socket, Kafka, Flume, Kinesis ➤ Spark Streaming receives live input data streams and divides the data into batches
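The idea of dividing a live stream into batches can be sketched in a few lines of plain Python. This is not Spark code: `micro_batches`, the event tuples and the timestamps are all hypothetical, and the sketch simply skips intervals that contain no events.

```python
from collections import defaultdict

def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # Events whose timestamps fall in the same interval share a batch.
        batches[int(timestamp // batch_interval)].append(value)
    return [batches[index] for index in sorted(batches)]

# Hypothetical events: (seconds since start, payload).
events = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (2.5, "d"), (2.9, "e")]
batches = micro_batches(events, batch_interval=1)
print(batches)  # [['a', 'b'], ['c'], ['d', 'e']]
```

With a 1-second interval, each inner list plays the role of one batch that the processing code would receive.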
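  • 23. EXAMPLE #5: DISCRETIZED STREAMS (DSTREAMS)
        from pyspark.streaming import StreamingContext
        # sc is an existing SparkContext; the batch interval is 1 second.
        ssc = StreamingContext(sc, 1)
        # Watch the directory for new files and split each line on commas.
        stream_data = ssc.textFileStream("file:///tmp/stream").map(lambda x: x.split(","))
        stream_data.pprint()
        ssc.start()
        ssc.awaitTermination()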
  • 25. STRUCTURED STREAMING ➤ Stream processing engine built on the Spark SQL engine ➤ Supports File and Kafka sources for production; Socket and Rate sources for testing
  • 26. EXAMPLE #6: STRUCTURED STREAMING
        stream_data = (
            spark.readStream
            .load(format="csv", path="/tmp/stream/*.csv", schema="name string, points int")
            .groupBy("name")
            .sum("points")
            .orderBy("sum(points)", ascending=0)
        )
        stream_data.writeStream.start(format="console", outputMode="complete").awaitTermination()
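Example #6 uses the "complete" output mode, where the full aggregated result is emitted again after every new batch of input. A small pure-Python sketch of that behaviour follows. It is not Spark code: `complete_mode_outputs`, the names and the point values are made up for illustration.

```python
from collections import Counter

def complete_mode_outputs(batches):
    """After each batch, emit the full running totals, highest points first."""
    totals = Counter()
    outputs = []
    for batch in batches:
        for name, points in batch:
            totals[name] += points
        # Complete mode: the whole result table is produced again, not just changes.
        outputs.append(sorted(totals.items(), key=lambda item: item[1], reverse=True))
    return outputs

# Hypothetical input: two batches of (name, points) rows.
batches = [[("ali", 5), ("veli", 3)], [("veli", 4)]]
outputs = complete_mode_outputs(batches)
print(outputs[0])  # [('ali', 5), ('veli', 3)]
print(outputs[1])  # [('veli', 7), ('ali', 5)]
```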
  • 28. GRAPHX (GRAPHFRAMES) ➤ GraphX is a new component in Spark for graphs and graph-parallel computation.
  • 29. EXAMPLE #7: GRAPHFRAMES
        vertex = spark.createDataFrame(
            [(1, "Ahmet"), (2, "Mehmet"), (3, "Cengiz"), (4, "Osman")],
            ["id", "name"])
        edges = spark.createDataFrame(
            [(1, 2, "friend"), (2, 1, "friend"), (2, 3, "friend"), (3, 2, "friend"),
             (2, 4, "friend"), (4, 2, "friend"), (3, 4, "friend"), (4, 3, "friend")],
            ["src", "dst", "relation"])
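  • 30. EXAMPLE #7: GRAPHFRAMES ➤ Start the PySpark shell with the GraphFrames package:
        pyspark --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11
    ➤ Then build the graph and compute shortest paths to vertex 4:
        import graphframes as gf
        g = gf.GraphFrame(vertex, edges)
        g.shortestPaths([4]).show()
    [Diagram: graph with vertices 1, 2, 3 and 4]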
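To see what a shortest-path query to landmark 4 computes on this small graph, here is a plain-Python breadth-first search over the same edge list. It is only a sketch of the idea (hop counts from every vertex to a landmark) and makes no claim about how GraphFrames implements it; `hops_to_landmark` is a hypothetical helper.

```python
from collections import deque

# (src, dst) pairs from the edges DataFrame in Example #7.
edges = [(1, 2), (2, 1), (2, 3), (3, 2), (2, 4), (4, 2), (3, 4), (4, 3)]

def hops_to_landmark(edges, landmark):
    """Return the number of hops from each vertex to the landmark."""
    # Walk edges backwards from the landmark so distances respect edge direction.
    incoming = {}
    for src, dst in edges:
        incoming.setdefault(dst, []).append(src)
    distances = {landmark: 0}
    queue = deque([landmark])
    while queue:
        node = queue.popleft()
        for prev in incoming.get(node, []):
            if prev not in distances:
                distances[prev] = distances[node] + 1
                queue.append(prev)
    return distances

distances = hops_to_landmark(edges, landmark=4)
print(sorted(distances.items()))  # [(1, 2), (2, 1), (3, 1), (4, 0)]
```

Vertex 1 (Ahmet) reaches vertex 4 (Osman) in two hops through vertex 2, while vertices 2 and 3 are direct neighbours of 4.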
  • 31. MLLIB (MACHINE LEARNING) ➤ Supports common ML algorithms such as classification, regression, clustering, and collaborative filtering ➤ Featurization: ➤ Feature extraction (TF-IDF, Word2Vec, CountVectorizer ...) ➤ Transformation (Tokenizer, StopWordsRemover ...) ➤ Selection (VectorSlicer, RFormula ...) ➤ Pipelines: combine multiple algorithms into a single pipeline, or workflow ➤ The DataFrame-based API is the primary API
  • 32. EXAMPLE #8: ALTERNATING LEAST SQUARES (ALS)
        from pyspark.ml.recommendation import ALS
        def parseratings(x):
            # Fields are separated by "::"; the first three are user, item id and rating.
            v = x.split("::")
            return (int(v[0]), int(v[1]), float(v[2]))
        ratings = (
            sc.textFile("ratings.dat")
            .map(parseratings)
            .toDF(["user", "id", "rating"])
        )
        als = ALS(userCol="user", itemCol="id", ratingCol="rating")
        model = als.fit(ratings)
        # Show the top 10 item recommendations for every user.
        model.recommendForAllUsers(10).show()