Walaa Assy
Software Developer, Giza Systems
SPARK
LIGHTNING-FAST UNIFIED ANALYTICS ENGINE
HOW DO WE HANDLE EVER-GROWING DATA THAT HAS BECOME BIG DATA?
• Basics of Spark
• Core API
• Cluster Managers
• Spark Maintenance
• Libraries
  - SQL
  - Streaming
  - MLlib
  - GraphX
• Troubleshooting
• Future of Spark
AGENDA
Apache spark - Architecture , Overview & libraries
• Readability
• Expressiveness
• Fast
• Testability
• Interactive
• Fault tolerant
• Unifies big data
Spark officially set a new record in large-scale sorting. Spark does perform computations on disk, and it also makes use of data cached in memory.
WHY SPARK? TINIER CODE LEADS TO ..
• MapReduce has a very narrow scope, centered on batch processing
• Each new kind of problem needed a new API to solve it
EXPLOSION OF MAP REDUCE
A UNIFIED PLATFORM FOR BIG DATA
SPARK PROGRAMMING LANGUAGES
• The most basic abstraction of Spark
• Spark operations fall into two main categories:
  - Transformations [lazily evaluated, only storing the intent]
  - Actions
• val textFile = sc.textFile("file:///spark/README.md")
• textFile.first() // action
RDD [RESILIENT DISTRIBUTED DATASET]
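The split between transformations and actions on this slide can be sketched roughly as follows. This is only an illustration, assuming Spark (the spark-sql artifact) is on the classpath and a local master is acceptable. The object name `LazyRddSketch` and the sample strings are hypothetical, and `sc.parallelize` stands in for `sc.textFile` so that no input file is needed.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch, not part of the original deck.
object LazyRddSketch {
  // Returns (number of matching lines, first matching line).
  def run(spark: SparkSession): (Long, String) = {
    val sc = spark.sparkContext
    // Stand-in for sc.textFile(...): a small in-memory dataset
    val lines = sc.parallelize(Seq("hello big data", "hello spark", "rdd basics"))
    // Transformation: nothing is computed yet, Spark only records the intent
    val helloLines = lines.filter(_.startsWith("hello"))
    // Actions: these trigger the actual computation
    (helloLines.count(), helloLines.first())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: run locally
      .appName("LazyRddSketch")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

Until `count()` or `first()` is called, the `filter` step exists only as a description of work to be done.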
HELLO BIG DATA
• sudo yum install wget
• sudo wget https://downloads.lightbend.com/scala/2.13.0-M4/scala-2.13.0-M4.tgz
• tar xvf scala-2.13.0-M4.tgz
• sudo mv scala-2.13.0-M4 /usr/lib
• sudo ln -s /usr/lib/scala-2.13.0-M4 /usr/lib/scala
• export PATH=$PATH:/usr/lib/scala/bin
SCALA INSTALLATION STEPS
• sudo wget https://www.apache.org/dyn/closer.lua/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
• tar xvf spark-2.3.1-bin-hadoop2.7.tgz
• ln -s spark-2.3.1-bin-hadoop2.7 spark
• export SPARK_HOME=$HOME/spark-2.3.1-bin-hadoop2.7
• export PATH=$PATH:$SPARK_HOME/bin
SPARK INSTALLATION - CENTOS 7
SPARK MECHANISM
• A collection of elements partitioned across the nodes of the cluster that can be operated on in parallel
• From the user's point of view, a collection similar to a list or an array
• Processed in parallel to speed up computation, and fault tolerant: lost partitions can be recomputed
• An RDD is immutable
• Transformations are lazy and are recorded in a DAG (directed acyclic graph)
• Actions trigger execution of the DAG
• A DAG is a directed graph of tasks with no cycles
• Each action triggers a fresh execution of the graph, unless intermediate results are cached
RDD
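The point that each action re-runs the graph can be made visible with a small experiment. The sketch below is an illustration under assumptions: Spark is on the classpath, the job runs on a local master with no task retries (accumulator updates inside transformations can be applied more than once if tasks are retried), and the names `RecomputeSketch`, `uncachedCalls`, and `cachedCalls` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch, not part of the original deck.
object RecomputeSketch {
  // Returns how often the map function ran: (without cache, with cache).
  def run(spark: SparkSession): (Long, Long) = {
    val sc = spark.sparkContext
    val data = sc.parallelize(1 to 100)

    val uncachedCalls = sc.longAccumulator("uncachedCalls")
    val uncached = data.map { x => uncachedCalls.add(1); x * 2 }
    uncached.count() // first action: runs the map over all elements
    uncached.count() // second action: runs the whole graph again

    val cachedCalls = sc.longAccumulator("cachedCalls")
    val cached = data.map { x => cachedCalls.add(1); x * 2 }.cache()
    cached.count() // computes the partitions and keeps them in memory
    cached.count() // served from the cache, the map is not re-run

    (uncachedCalls.sum, cachedCalls.sum)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: run locally
      .appName("RecomputeSketch")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

In a local run the first counter should end up at twice the dataset size and the second at exactly the dataset size, which is why caching matters for data that several actions reuse.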
INPUT DATASETS TYPES
• map
• flatMap
• filter
• distinct
• sample
• union
• intersection
• subtract
• cartesian
Transformations return RDDs
TRANSFORMATIONS IN MAP REDUCE
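Several of the transformations listed above can be tried on two tiny RDDs. This is a sketch under assumptions: Spark is on the classpath, a local master is used, and the object name `TransformationsSketch` and the sample numbers are hypothetical. `sample` and `cartesian` are left out to keep the results small and deterministic.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical sketch, not part of the original deck.
object TransformationsSketch {
  // Collects an RDD and sorts it so the results are easy to compare.
  private def sortedSeq(rdd: RDD[Int]): Seq[Int] = rdd.collect().sorted.toSeq

  def run(spark: SparkSession): Map[String, Seq[Int]] = {
    val sc = spark.sparkContext
    val a = sc.parallelize(Seq(1, 2, 2, 3, 4))
    val b = sc.parallelize(Seq(3, 4, 5))
    // Every call below returns a new RDD; only collect() in sortedSeq runs a job
    Map(
      "map"          -> sortedSeq(a.map(_ * 10)),
      "flatMap"      -> sortedSeq(a.flatMap(x => Seq(x, x * 10))),
      "filter"       -> sortedSeq(a.filter(_ % 2 == 0)),
      "distinct"     -> sortedSeq(a.distinct()),
      "union"        -> sortedSeq(a.union(b)),
      "intersection" -> sortedSeq(a.intersection(b)),
      "subtract"     -> sortedSeq(a.subtract(b))
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: run locally
      .appName("TransformationsSketch")
      .getOrCreate()
    try run(spark).foreach(println) finally spark.stop()
  }
}
```

Note that `union` and `subtract` keep duplicates, while `distinct` and `intersection` remove them.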
• collect()
• count()
• take(num)
• takeOrdered(num)(ordering)
• reduce(function)
• aggregate(zeroValue)(seqOp, combOp)
• foreach(function)
• Actions return different types, depending on the action
• saveAsObjectFile(path)
• saveAsTextFile(path) // saves as text file
• External connector
  - foreach(T => Unit) // one object at a time
  - foreachPartition(Iterator[T] => Unit) // one partition at a time
ACTIONS IN SPARK
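A few of the actions listed above, and the different types they return, can be sketched as follows. The example assumes Spark on the classpath and a local master; the object name `ActionsSketch` and the sample numbers are hypothetical. The save and `foreach` variants are omitted because they only produce side effects.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch, not part of the original deck.
object ActionsSketch {
  // Returns (count, take(2), takeOrdered(3), reduce sum, aggregate (sum, count)).
  def run(spark: SparkSession): (Long, Seq[Int], Seq[Int], Int, (Int, Int)) = {
    val nums = spark.sparkContext.parallelize(Seq(5, 3, 8, 1, 9, 2))
    val count = nums.count()                   // Long
    val firstTwo = nums.take(2).toSeq          // first elements in RDD order
    val smallest = nums.takeOrdered(3).toSeq   // smallest elements by natural ordering
    val total = nums.reduce(_ + _)             // Int
    // aggregate: seqOp folds elements into a (sum, count) pair per partition,
    // combOp merges the pairs coming from different partitions
    val sumAndCount = nums.aggregate((0, 0))(
      (acc, x) => (acc._1 + x, acc._2 + 1),
      (l, r) => (l._1 + r._1, l._2 + r._2)
    )
    (count, firstTwo, smallest, total, sumAndCount)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: run locally
      .appName("ActionsSketch")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```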
• SQL-like pairing
• join
• fullOuterJoin
• leftOuterJoin
• rightOuterJoin
• Pair saving
  - saveAs(NewAPI)HadoopFile
    - path
    - keyClass
    - valueClass
    - outputFormatClass
  - saveAs(NewAPI)HadoopDataset
    - conf
  - saveAsSequenceFile
    - saveAsHadoopFile(path, keyClass, valueClass, SequenceFileOutputFormat)
PAIR METHODS- CONTD
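The SQL-like joins on key-value RDDs can be sketched as below. This assumes Spark on the classpath and a local master; the object name `PairJoinSketch` and the sample users and orders are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch, not part of the original deck.
object PairJoinSketch {
  // Returns (inner join rows, left outer count, right outer count, full outer count).
  def run(spark: SparkSession): (Seq[(Int, (String, String))], Long, Long, Long) = {
    val sc = spark.sparkContext
    val users = sc.parallelize(Seq((1, "alice"), (2, "bob"), (3, "carol")))
    val orders = sc.parallelize(Seq((1, "book"), (1, "pen"), (4, "lamp")))

    // Inner join: only keys present on both sides
    val inner = users.join(orders).collect().sorted.toSeq
    // Outer joins wrap the possibly missing side in Option
    val left = users.leftOuterJoin(orders).count()   // keeps users without orders
    val right = users.rightOuterJoin(orders).count() // keeps orders without users
    val full = users.fullOuterJoin(orders).count()   // keeps both
    (inner, left, right, full)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: run locally
      .appName("PairJoinSketch")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```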
• Works like a distributed kernel
• A basic cluster manager is built into Spark (standalone mode)
• Hadoop's cluster manager, YARN
• Apache Mesos
PRIMARY CLUSTER MANAGER
SPARK-SUBMIT DEMO
SPARK SQL
• Spark SQL is Apache Spark's module for working with structured or semi-structured data.
• It is meant to be usable by people who are not big data specialists
• As Spark continues to grow, we want to enable wider audiences beyond “Big Data” engineers to leverage the power of distributed processing.
Databricks blog (http://bit.ly/17NM70s)
SPARK SQL
• Seamlessly mix SQL queries with Spark programs: Spark SQL lets you query structured data inside Spark programs, using either SQL or a familiar DataFrame API
• Connect to any data source the same way.
• It executes SQL queries.
• We can read data from an existing Hive installation using Spark SQL.
• When we run SQL from within another programming language, we get the result as a Dataset/DataFrame.
SPARK SQL FEATURES
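Mixing SQL with the DataFrame API, as described above, might look like the sketch below. It assumes Spark SQL on the classpath and a local master; the object name `SqlMixSketch`, the view name `people`, and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch, not part of the original deck.
object SqlMixSketch {
  // Runs the same query through SQL and through the DataFrame API.
  def run(spark: SparkSession): (Seq[String], Seq[String]) = {
    import spark.implicits._
    val people = Seq(("alice", 34), ("bob", 45), ("carol", 29)).toDF("name", "age")

    // SQL: register the DataFrame as a temporary view, then query it
    people.createOrReplaceTempView("people")
    val viaSql = spark.sql("SELECT name FROM people WHERE age > 30 ORDER BY name")

    // DataFrame API: the same query expressed as method calls
    val viaApi = people.filter($"age" > 30).orderBy("name").select("name")

    // Both results come back as DataFrames; convert them to plain strings
    (viaSql.as[String].collect().toSeq, viaApi.as[String].collect().toSeq)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: run locally
      .appName("SqlMixSketch")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```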
DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. You can even join data across these sources.
• Run SQL or HiveQL queries on existing warehouses. [Hive Integration]
• Connect through JDBC or ODBC. [Standard Connectivity]
• It is included with Spark
DATAFRAMES
• Introduced in the Spark 1.3 release. A DataFrame is a distributed collection of data organized into named columns. Conceptually, it is equivalent to a table in a relational database or a data frame in R/Python.
We can create a DataFrame using:
• Structured data files
• Tables in Hive
• External databases
• An existing RDD
SPARK DATAFRAME IS
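Of the creation routes listed above, building a DataFrame from an existing RDD is the easiest to try without external data. The sketch assumes Spark SQL on the classpath and a local master; the object name `DataFrameFromRddSketch`, the `Person` record type, and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch, not part of the original deck.
object DataFrameFromRddSketch {
  // Hypothetical record type; its field names become the column names
  final case class Person(name: String, age: Int)

  // Returns (column names, names of people older than 40).
  def run(spark: SparkSession): (Seq[String], Seq[String]) = {
    import spark.implicits._
    val rdd = spark.sparkContext.parallelize(
      Seq(Person("alice", 34), Person("bob", 45)))
    // toDF() derives the schema (named columns) from the case class
    val df = rdd.toDF()
    val older = df.filter($"age" > 40).select("name").as[String].collect().toSeq
    (df.columns.toSeq, older)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: run locally
      .appName("DataFrameFromRddSketch")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```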
DataFrames = SchemaRDD
EXAMPLES
SPARK SQL COMPETITION
• Hive
• Parquet
• JSON
• Avro
• Amazon Redshift
• CSV
• Others
It is recommended as a starting point for any Spark application, as it adds:
  - Predicate pushdown
  - Column pruning
  - Can use SQL & RDD
SPARK SQL DATA SOURCES
SPARK STREAMING
• Big & fast data
• Gigabytes per second
• Real-time fraud detection
• Marketing
• Makes it easy to build scalable, fault-tolerant streaming applications.
SPARK STREAMING
SPARK STREAMING COMPETITORS
Streaming data
• Kafka
• Flume
• Twitter
• Hadoop HDFS
• Others (live logs, system telemetry data, IoT device data, etc.)
SPARK MLLIB
• MLlib is a standard component of Spark providing machine learning primitives on top of Spark.
SPARK MLLIB
• MATLAB
• R
EASY TO USE BUT NOT SCALABLE
• Mahout
• GraphLab
Scalable, but at the cost of ease of use
• org.apache.spark.mllib
  RDD-based algorithms
• org.apache.spark.ml
  Pipeline API built on top of DataFrames
SPARK MLLIB COMPETITION
• Loading the data
• Extracting features
• Training on the data
• Testing on the data
• The new pipeline API allows tuning, testing, and early failure detection
MACHINE LEARNING FLOW
• Algorithms
  - Classification, e.g. naïve Bayes
  - Regression
    - Linear
    - Logistic
  - Collaborative filtering by ALS (alternating least squares)
  - Clustering by k-means
  - Dimensionality reduction by SVD (singular value decomposition)
• Feature extraction and transformations
  - TF-IDF: term frequency-inverse document frequency
ALGORITHMS IN MLLIB
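As one instance of the algorithms listed above, k-means clustering through the DataFrame-based `org.apache.spark.ml` API might be sketched as follows. It assumes the spark-mllib artifact is on the classpath and a local master is used; the object name `KMeansSketch`, the four sample points, and the seed are hypothetical choices.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Hypothetical sketch, not part of the original deck.
object KMeansSketch {
  // Returns the cluster index assigned to each of the four input points.
  def run(spark: SparkSession): Seq[Int] = {
    import spark.implicits._
    // Two obvious groups: points near (0, 0) and points near (9, 9)
    val points = Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.0, 1.0),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.0, 10.0)
    ).map(Tuple1.apply).toDF("features")

    // Train a model with k = 2 clusters; the seed makes runs repeatable
    val model = new KMeans().setK(2).setSeed(1L).fit(points)
    // transform adds a "prediction" column holding the cluster index
    model.transform(points).select("prediction").as[Int].collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: run locally
      .appName("KMeansSketch")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

Which group receives index 0 and which receives index 1 is not fixed, so only the grouping itself should be relied on.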
• Spam filtering
• Fraud detection
• Recommendation analysis
• Speech recognition
PRACTICAL USE
• Word2Vec (word to vector) algorithm
• This algorithm takes an input text and outputs a set of vectors representing a dictionary of words [to see word similarity]
• We cache the RDDs because MLlib makes multiple passes over the same data, so the in-memory cache can reduce processing time a lot
• Breeze is a numerical processing library used inside Spark
• It can perform mathematical operations on vectors
MLLIB DEMO
SPARK GRAPHX
• GraphX is Apache Spark's API for graphs and graph-parallel computation.
• Page ranking
• Producing evaluations
• It can be used in genetic analysis
• Algorithms
  - PageRank
  - Connected components
  - Label propagation
  - SVD++
  - Strongly connected components
  - Triangle count
GRAPHX - FROM A TABLE-STRUCTURED TO A GRAPH-STRUCTURED WORLD
COMPETITIONS
End-to-end PageRank performance (20 iterations, 3.7B edges)
• Joints (vertices) each have a unique ID
• Each vertex can have properties of a user-defined type and store metadata
ARCHITECTURE
• Arrows are relations, known as edges, which can store metadata; the vertex IDs they connect are of type Long
• A graph is built from two RDDs: one containing the collection of edges and the other the collection of vertices
• Another component is the edge triplet, an object that exposes the relation between an edge and its two vertices, containing all the information for each connection
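The pieces described above (a vertex RDD, an edge RDD, and edge triplets) can be put together in a short sketch. It assumes the spark-graphx artifact is on the classpath and a local master is used; the object name `GraphSketch`, the three users, and the "follows" relation are hypothetical.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical sketch, not part of the original deck.
object GraphSketch {
  // Returns one readable sentence per edge triplet.
  def run(spark: SparkSession): Seq[String] = {
    val sc = spark.sparkContext
    // Vertices: (unique Long id, user-defined property)
    val vertices: RDD[(VertexId, String)] =
      sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    // Edges: source id, destination id, and an attribute (metadata)
    val edges: RDD[Edge[String]] =
      sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

    val graph = Graph(vertices, edges)
    // A triplet exposes an edge together with both of its vertices
    graph.triplets
      .map(t => s"${t.srcAttr} ${t.attr} ${t.dstAttr}")
      .collect().sorted.toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: run locally
      .appName("GraphSketch")
      .getOrCreate()
    try run(spark).foreach(println) finally spark.stop()
  }
}
```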
WHO IS USING SPARK?
• http://spark.apache.org
• Tutorials: http://ampcamp.berkeley.edu
• Spark Summit: http://spark-summit.org
• GitHub: https://github.com/apache/spark
• https://data-flair.training/blogs/spark-sql-tutorial/
REFERENCES

More Related Content

What's hot (20)

Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
Abdullah Ƈetin ƇAVDAR
Ā 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Simplilearn
Ā 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Samy Dindane
Ā 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
Ā 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
Cloudera, Inc.
Ā 
Spark
SparkSpark
Spark
Koushik Mondal
Ā 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
Alexey Grishchenko
Ā 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
Databricks
Ā 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
Ā 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
Ā 
Spark SQL
Spark SQLSpark SQL
Spark SQL
Joud Khattab
Ā 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Databricks
Ā 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
Sohil Jain
Ā 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
Edureka!
Ā 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
datamantra
Ā 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
sudhakara st
Ā 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
Databricks
Ā 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
Databricks
Ā 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
Tathastu.ai
Ā 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Robert Sanders
Ā 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Simplilearn
Ā 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Samy Dindane
Ā 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
Ā 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
Cloudera, Inc.
Ā 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
Alexey Grishchenko
Ā 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
Databricks
Ā 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
Ā 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
Ā 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Databricks
Ā 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
Sohil Jain
Ā 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
Edureka!
Ā 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
datamantra
Ā 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
sudhakara st
Ā 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
Databricks
Ā 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
Databricks
Ā 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
Tathastu.ai
Ā 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Robert Sanders
Ā 

Similar to Apache spark - Architecture , Overview & libraries (20)

Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
Ryan Bosshart
Ā 
SPARK ARCHITECTURE
SPARK ARCHITECTURESPARK ARCHITECTURE
SPARK ARCHITECTURE
GauravBiswas9
Ā 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks
Ā 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache spark
UserReport
Ā 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
Anirudh
Ā 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
Ā 
Evolution of spark framework for simplifying data analysis.
Evolution of spark framework for simplifying data analysis.Evolution of spark framework for simplifying data analysis.
Evolution of spark framework for simplifying data analysis.
Anirudh Gangwar
Ā 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
C4Media
Ā 
20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 20150716 introduction to apache spark v3
20150716 introduction to apache spark v3
Andrey Vykhodtsev
Ā 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computing
Animesh Chaturvedi
Ā 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
Andrii Gakhov
Ā 
Bds session 13 14
Bds session 13 14Bds session 13 14
Bds session 13 14
Infinity Tech Solutions
Ā 
Apache spark
Apache sparkApache spark
Apache spark
Hitesh Dua
Ā 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
Spark Summit
Ā 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
Venkata Naga Ravi
Ā 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
BigDataEverywhere
Ā 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
RUHULAMINHAZARIKA
Ā 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Euangelos Linardos
Ā 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
DeepaThirumurugan
Ā 
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsfPyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
sasuke20y4sh
Ā 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
Ryan Bosshart
Ā 
SPARK ARCHITECTURE
SPARK ARCHITECTURESPARK ARCHITECTURE
SPARK ARCHITECTURE
GauravBiswas9
Ā 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks
Ā 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache spark
UserReport
Ā 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
Anirudh
Ā 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
Ā 
Evolution of spark framework for simplifying data analysis.
Evolution of spark framework for simplifying data analysis.Evolution of spark framework for simplifying data analysis.
Evolution of spark framework for simplifying data analysis.
Anirudh Gangwar
Ā 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
C4Media
Ā 
20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 20150716 introduction to apache spark v3
20150716 introduction to apache spark v3
Andrey Vykhodtsev
Ā 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computing
Animesh Chaturvedi
Ā 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
Andrii Gakhov
Ā 
Apache spark
Apache sparkApache spark
Apache spark
Hitesh Dua
Ā 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
Spark Summit
Ā 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
Venkata Naga Ravi
Ā 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
BigDataEverywhere
Ā 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
RUHULAMINHAZARIKA
Ā 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Euangelos Linardos
Ā 
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsfPyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
sasuke20y4sh
Ā 
Ad

Recently uploaded (20)

Multistream in SIP and NoSIP @ OpenSIPS Summit 2025
Multistream in SIP and NoSIP @ OpenSIPS Summit 2025Multistream in SIP and NoSIP @ OpenSIPS Summit 2025
Multistream in SIP and NoSIP @ OpenSIPS Summit 2025
Lorenzo Miniero
Ā 
STKI Israel Market Study 2025 final v1 version
STKI Israel Market Study 2025 final v1 versionSTKI Israel Market Study 2025 final v1 version
STKI Israel Market Study 2025 final v1 version
Dr. Jimmy Schwarzkopf
Ā 
Cyber Security Legal Framework in Nepal.pptx
Cyber Security Legal Framework in Nepal.pptxCyber Security Legal Framework in Nepal.pptx
Cyber Security Legal Framework in Nepal.pptx
Ghimire B.R.
Ā 
Dev Dives: System-to-system integration with UiPath API Workflows
Dev Dives: System-to-system integration with UiPath API WorkflowsDev Dives: System-to-system integration with UiPath API Workflows
Dev Dives: System-to-system integration with UiPath API Workflows
UiPathCommunity
Ā 
TrustArc Webinar: Mastering Privacy Contracting
TrustArc Webinar: Mastering Privacy ContractingTrustArc Webinar: Mastering Privacy Contracting
TrustArc Webinar: Mastering Privacy Contracting
TrustArc
Ā 
Measuring Microsoft 365 Copilot and Gen AI Success
Measuring Microsoft 365 Copilot and Gen AI SuccessMeasuring Microsoft 365 Copilot and Gen AI Success
Measuring Microsoft 365 Copilot and Gen AI Success
Nikki Chapple
Ā 
Nix(OS) for Python Developers - PyCon 25 (Bologna, Italia)
Nix(OS) for Python Developers - PyCon 25 (Bologna, Italia)Nix(OS) for Python Developers - PyCon 25 (Bologna, Italia)
Nix(OS) for Python Developers - PyCon 25 (Bologna, Italia)
Peter Bittner
Ā 
AI Emotional Actors: ā€œWhen Machines Learn to Feel and Perform"
AI Emotional Actors:  ā€œWhen Machines Learn to Feel and Perform"AI Emotional Actors:  ā€œWhen Machines Learn to Feel and Perform"
AI Emotional Actors: ā€œWhen Machines Learn to Feel and Perform"
AkashKumar809858
Ā 
Agentic AI - The New Era of Intelligence
Agentic AI - The New Era of IntelligenceAgentic AI - The New Era of Intelligence
Agentic AI - The New Era of Intelligence
Muzammil Shah
Ā 
New Ways to Reduce Database Costs with ScyllaDB
New Ways to Reduce Database Costs with ScyllaDBNew Ways to Reduce Database Costs with ScyllaDB
New Ways to Reduce Database Costs with ScyllaDB
ScyllaDB
Ā 
Gihbli AI and Geo sitution |use/misuse of Ai Technology
Gihbli AI and Geo sitution |use/misuse of Ai TechnologyGihbli AI and Geo sitution |use/misuse of Ai Technology
Gihbli AI and Geo sitution |use/misuse of Ai Technology
zainkhurram1111
Ā 
Droidal: AI Agents Revolutionizing Healthcare
Droidal: AI Agents Revolutionizing HealthcareDroidal: AI Agents Revolutionizing Healthcare
Droidal: AI Agents Revolutionizing Healthcare
Droidal LLC
Ā 
Offshore IT Support: Balancing In-House and Offshore Help Desk Technicians
Offshore IT Support: Balancing In-House and Offshore Help Desk TechniciansOffshore IT Support: Balancing In-House and Offshore Help Desk Technicians
Offshore IT Support: Balancing In-House and Offshore Help Desk Technicians
john823664
Ā 
The case for on-premises AI
The case for on-premises AIThe case for on-premises AI
The case for on-premises AI
Principled Technologies
Ā 
End-to-end Assurance for SD-WAN & SASE with ThousandEyes
End-to-end Assurance for SD-WAN & SASE with ThousandEyesEnd-to-end Assurance for SD-WAN & SASE with ThousandEyes
End-to-end Assurance for SD-WAN & SASE with ThousandEyes
ThousandEyes
Ā 
European Accessibility Act & Integrated Accessibility Testing
European Accessibility Act & Integrated Accessibility TestingEuropean Accessibility Act & Integrated Accessibility Testing
European Accessibility Act & Integrated Accessibility Testing
Julia Undeutsch
Ā 
Co-Constructing Explanations for AI Systems using Provenance
Co-Constructing Explanations for AI Systems using ProvenanceCo-Constructing Explanations for AI Systems using Provenance
Co-Constructing Explanations for AI Systems using Provenance
Paul Groth
Ā 
LSNIF: Locally-Subdivided Neural Intersection Function
LSNIF: Locally-Subdivided Neural Intersection FunctionLSNIF: Locally-Subdivided Neural Intersection Function
LSNIF: Locally-Subdivided Neural Intersection Function
Takahiro Harada
Ā 
Create Your First AI Agent with UiPath Agent Builder
Create Your First AI Agent with UiPath Agent BuilderCreate Your First AI Agent with UiPath Agent Builder
Create Your First AI Agent with UiPath Agent Builder
DianaGray10
Ā 
UiPath Community Zurich: Release Management and Build Pipelines
UiPath Community Zurich: Release Management and Build PipelinesUiPath Community Zurich: Release Management and Build Pipelines
UiPath Community Zurich: Release Management and Build Pipelines
UiPathCommunity
Ā 
Multistream in SIP and NoSIP @ OpenSIPS Summit 2025
Multistream in SIP and NoSIP @ OpenSIPS Summit 2025Multistream in SIP and NoSIP @ OpenSIPS Summit 2025
Multistream in SIP and NoSIP @ OpenSIPS Summit 2025
Lorenzo Miniero
Ā 
STKI Israel Market Study 2025 final v1 version
STKI Israel Market Study 2025 final v1 versionSTKI Israel Market Study 2025 final v1 version
STKI Israel Market Study 2025 final v1 version
Dr. Jimmy Schwarzkopf
Ā 
Cyber Security Legal Framework in Nepal.pptx
Cyber Security Legal Framework in Nepal.pptxCyber Security Legal Framework in Nepal.pptx
Cyber Security Legal Framework in Nepal.pptx
Ghimire B.R.
Ā 
Dev Dives: System-to-system integration with UiPath API Workflows
Dev Dives: System-to-system integration with UiPath API WorkflowsDev Dives: System-to-system integration with UiPath API Workflows
Dev Dives: System-to-system integration with UiPath API Workflows
UiPathCommunity
Ā 
TrustArc Webinar: Mastering Privacy Contracting
TrustArc Webinar: Mastering Privacy ContractingTrustArc Webinar: Mastering Privacy Contracting
TrustArc Webinar: Mastering Privacy Contracting
TrustArc
Ā 
Measuring Microsoft 365 Copilot and Gen AI Success
Measuring Microsoft 365 Copilot and Gen AI SuccessMeasuring Microsoft 365 Copilot and Gen AI Success
Measuring Microsoft 365 Copilot and Gen AI Success
Nikki Chapple
Ā 
Nix(OS) for Python Developers - PyCon 25 (Bologna, Italia)
Nix(OS) for Python Developers - PyCon 25 (Bologna, Italia)Nix(OS) for Python Developers - PyCon 25 (Bologna, Italia)
Nix(OS) for Python Developers - PyCon 25 (Bologna, Italia)
Peter Bittner
Ā 
AI Emotional Actors: ā€œWhen Machines Learn to Feel and Perform"
AI Emotional Actors:  ā€œWhen Machines Learn to Feel and Perform"AI Emotional Actors:  ā€œWhen Machines Learn to Feel and Perform"
AI Emotional Actors: ā€œWhen Machines Learn to Feel and Perform"
AkashKumar809858
Ā 
Agentic AI - The New Era of Intelligence
Agentic AI - The New Era of IntelligenceAgentic AI - The New Era of Intelligence
Agentic AI - The New Era of Intelligence
Muzammil Shah
Ā 
New Ways to Reduce Database Costs with ScyllaDB
New Ways to Reduce Database Costs with ScyllaDBNew Ways to Reduce Database Costs with ScyllaDB
New Ways to Reduce Database Costs with ScyllaDB
ScyllaDB
Ā 

Apache Spark - Architecture, Overview & Libraries

  • 2. HOW DO WE HANDLE EVER GROWING DATA THAT HAS BECOME BIG DATA?
  • 3. AGENDA
      - Basics of Spark
      - Core API
      - Cluster Managers
      - Spark Maintenance
      - Libraries
          - SQL
          - Streaming
          - MLlib
          - GraphX
      - Troubleshooting
      - Future of Spark
  • 5. WHY SPARK? TINIER CODE LEADS TO ..
      - Readability
      - Expressiveness
      - Fast
      - Testability
      - Interactive
      - Fault Tolerant
      - Unify Big Data
    Spark officially set a new record in large-scale sorting. Spark can run computations on disk, and it also makes use of data cached in memory.
  • 6. EXPLOSION OF MAP REDUCE
      - MapReduce has a very narrow scope, mainly batch processing
      - Each problem needed a new API to solve
  • 8. A UNIFIED PLATFORM FOR BIG DATA
  • 10. RDD [RESILIENT DISTRIBUTED DATASET]
      - The most basic abstraction of Spark
      - Spark operations fall into two main categories:
          - Transformations [lazily evaluated, only storing the intent]
          - Actions
      - val textFile = sc.textFile("file:///spark/README.md")
      - textFile.first // action
  • 12. SCALA INSTALLATION STEPS
      - sudo yum install wget
      - sudo wget https://downloads.lightbend.com/scala/2.13.0-M4/scala-2.13.0-M4.tgz
      - tar xvf scala-2.13.0-M4.tgz
      - sudo mv scala-2.13.0-M4 /usr/lib
      - sudo ln -s /usr/lib/scala-2.13.0-M4 /usr/lib/scala
      - export PATH=$PATH:/usr/lib/scala/bin
  • 13. SPARK INSTALLATION – CENTOS 7
      - sudo wget https://www.apache.org/dyn/closer.lua/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
      - tar xvf spark-2.3.1-bin-hadoop2.7.tgz
      - ln -s spark-2.3.1-bin-hadoop2.7 spark
      - export SPARK_HOME=$HOME/spark-2.3.1-bin-hadoop2.7
      - export PATH=$PATH:$SPARK_HOME/bin
  • 15. RDD
      - A collection of elements partitioned across the nodes of the cluster that can be operated on in parallel
      - From the user's point of view, a collection similar to a list or an array
      - Processed in parallel to speed up computation, with fault tolerance
      - An RDD is immutable
      - Transformations are lazy and stored in a DAG
      - Actions trigger DAGs
      - A DAG (directed acyclic graph) is like a one-way graph of tasks
      - Each action triggers a fresh execution of the graph
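The lazy behaviour listed on this slide can be illustrated without a cluster. The following is a minimal sketch in plain Scala, not Spark code: an `Iterator` stands in for an RDD, and the names `LazyPipelineSketch`, `pipeline`, and `mapCalls` are made up for illustration. Building the pipeline only records the steps; each "action" (`sum`, `size`) runs the whole chain again from the source.

```scala
// Plain-Scala model of lazy transformations (assumption: Iterator stands in for an RDD).
object LazyPipelineSketch {
  // Counts how many times the map step has actually executed.
  var mapCalls = 0

  // "Transformations": Iterator.map and Iterator.filter are lazy,
  // so calling this function computes nothing yet.
  def pipeline(source: Seq[Int]): Iterator[Int] =
    source.iterator.map { x => mapCalls += 1; x * 2 }.filter(_ > 4)

  def main(args: Array[String]): Unit = {
    val data = Seq(1, 2, 3, 4)

    val planned = pipeline(data)
    println(s"map calls after planning: $mapCalls") // still 0

    println(planned.sum)          // first "action": forces the work, prints 14
    println(pipeline(data).size)  // second "action": re-runs the chain, prints 2
    println(s"map calls after two actions: $mapCalls") // 8 = 4 elements x 2 runs
  }
}
```

In Spark itself, calling `cache()` or `persist()` on an RDD keeps the computed partitions so that later actions can reuse them instead of recomputing the graph.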
  • 18. TRANSFORMATIONS IN MAP REDUCE
      - map
      - flatMap
      - filter
      - distinct
      - sample
      - union
      - intersection
      - subtract
      - cartesian
    Transformations return RDDs
  • 24. ACTIONS IN SPARK
      - collect()
      - count()
      - take(num)
      - takeOrdered(num)(ordering)
      - reduce(function)
      - aggregate(zeroValue)(seqOp, combOp)
      - foreach(function)
      - Actions return different types according to each action
      - saveAsObjectFile(path)
      - saveAsTextFile(path) // saves as text file
      - External connector
          - foreach(T => Unit) // one object at a time
          - foreachPartition(Iterator[T] => Unit) // one partition at a time
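Of the actions above, `aggregate(zeroValue)(seqOp, combOp)` is the least self-explanatory. The sketch below models it in plain Scala under the assumption that a `Seq[Seq[Int]]` stands in for the partitions of an RDD; `AggregateSketch` and its members are hypothetical names, not Spark API. `seqOp` folds elements into a per-partition accumulator, and `combOp` merges the per-partition results.

```scala
// Plain-Scala model of aggregate(zeroValue)(seqOp, combOp); no Spark dependency.
object AggregateSketch {
  // Accumulator: (running sum, element count).
  type Acc = (Int, Int)

  // seqOp: folds one element into a partition-local accumulator.
  def seqOp(acc: Acc, x: Int): Acc = (acc._1 + x, acc._2 + 1)

  // combOp: merges the accumulators of two partitions.
  def combOp(a: Acc, b: Acc): Acc = (a._1 + b._1, a._2 + b._2)

  // Each inner Seq plays the role of one partition.
  def aggregate(partitions: Seq[Seq[Int]], zero: Acc): Acc =
    partitions
      .map(_.foldLeft(zero)(seqOp)) // one partial result per partition
      .foldLeft(zero)(combOp)       // merge the partial results

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6))
    val (sum, count) = aggregate(partitions, (0, 0))
    println(s"sum=$sum count=$count mean=${sum.toDouble / count}")
  }
}
```

Because the zero value is applied once per partition and again when merging, it should be a neutral element such as `(0, 0)` here; a non-neutral value would be counted several times.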
  • 28. PAIR METHODS - CONTD
      - SQL-like pairing
          - join
          - fullOuterJoin
          - leftOuterJoin
          - rightOuterJoin
      - Pair saving
          - saveAs(NewAPI)HadoopFile
              - path
              - keyClass
              - valueClass
              - outputFormatClass
          - saveAs(NewAPI)HadoopDataset
              - conf
          - saveAsSequenceFile
              - saveAsHadoopFile(path, keyClass, valueClass, SequenceFileOutputFormat)
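The result shapes of the SQL-like pair operations are easier to see on a small example. This is a plain-Scala sketch, assuming a `Seq` of key-value tuples in place of a pair RDD; `PairJoinSketch` and the sample data are hypothetical. The return types mirror those of the pair RDD methods `join`, `leftOuterJoin`, and `fullOuterJoin`: a missing side becomes an `Option`.

```scala
// Plain-Scala model of pair joins; Seq[(String, V)] stands in for a pair RDD.
object PairJoinSketch {
  type Pairs[V] = Seq[(String, V)]

  // Inner join: only keys present on both sides survive.
  def join[A, B](left: Pairs[A], right: Pairs[B]): Seq[(String, (A, B))] =
    for ((k, a) <- left; (k2, b) <- right if k == k2) yield (k, (a, b))

  // Left outer join: every left record is kept; the right value is optional.
  def leftOuterJoin[A, B](left: Pairs[A], right: Pairs[B]): Seq[(String, (A, Option[B]))] =
    left.flatMap { case (k, a) =>
      val matches = right.filter(_._1 == k).map(_._2)
      if (matches.isEmpty) Seq((k, (a, Option.empty[B])))
      else matches.map(b => (k, (a, Option(b))))
    }

  // Full outer join: records from both sides are kept; both values are optional.
  def fullOuterJoin[A, B](left: Pairs[A], right: Pairs[B]): Seq[(String, (Option[A], Option[B]))] = {
    val leftKeys = left.map(_._1).toSet
    val fromLeft = leftOuterJoin(left, right).map { case (k, (a, b)) => (k, (Option(a), b)) }
    val rightOnly = right.collect {
      case (k, b) if !leftKeys(k) => (k, (Option.empty[A], Option(b)))
    }
    fromLeft ++ rightOnly
  }

  def main(args: Array[String]): Unit = {
    val users  = Seq("u1" -> "Ann", "u2" -> "Bob")
    val orders = Seq("u1" -> 10, "u1" -> 20, "u3" -> 30)
    join(users, orders).foreach(println)          // only u1 rows
    leftOuterJoin(users, orders).foreach(println) // u2 appears with None
    fullOuterJoin(users, orders).foreach(println) // u3 appears with None on the left
  }
}
```

A right outer join is the mirror image of the left one. On a real cluster the order of the joined records is not guaranteed, unlike in this in-memory model.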
  • 29. PRIMARY CLUSTER MANAGER
      - Works like a distributed kernel
      - Spark's built-in basic manager (standalone)
      - Hadoop cluster manager: YARN
      - Apache Mesos
  • 32. SPARK SQL
      - Spark SQL is Apache Spark's module for working with structured or semi-structured data.
      - It is meant to be usable by people who are not big data engineers
      - As Spark continues to grow, we want to enable wider audiences beyond "Big Data" engineers to leverage the power of distributed processing. Databricks blog (http://bit.ly/17NM70s)
  • 33. SPARK SQL FEATURES
      - Seamlessly mix SQL queries with Spark programs: Spark SQL lets you query structured data inside Spark programs, using either SQL or a familiar DataFrame API
      - Connect to any data source the same way.
      - It executes SQL queries.
      - We can read data from an existing Hive installation using Spark SQL.
      - When we run SQL from within another programming language, we get the result as a Dataset/DataFrame.
  • 35. DATAFRAMES
    DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. You can even join data across these sources.
      - Run SQL or HiveQL queries on existing warehouses. [Hive Integration]
      - Connect through JDBC or ODBC. [Standard Connectivity]
      - It is included with Spark
  • 36. SPARK DATAFRAME IS
      - Introduced in the Spark 1.3 release. It is a distributed collection of data organized into named columns. Conceptually, it is equivalent to a table in a relational database or a data frame in R/Python. We can create a DataFrame from:
      - Structured data files
      - Tables in Hive
      - External databases
      - An existing RDD
    DataFrames = SchemaRDD
  • 39. SPARK SQL DATA SOURCES
      - Hive
      - Parquet
      - JSON
      - Avro
      - Amazon Redshift
      - CSV
      - Others
    It is recommended as a starting point for any Spark application, as it adds:
      - Predicate pushdown
      - Column pruning
      - Can use SQL & RDD
  • 41. SPARK STREAMING
      - Big & fast data
      - Gigabytes per second
      - Real-time fraud detection
      - Marketing
      - Makes it easy to build scalable fault-tolerant streaming applications.
  • 42. SPARK STREAMING COMPETITORS
    Streaming data
      - Kafka
      - Flume
      - Twitter
      - Hadoop HDFS
      - Others (live logs, system telemetry data, IoT device data, etc.)
  • 44. SPARK MLLIB
      - MLlib is a standard component of Spark providing machine learning primitives on top of Spark.
  • 45. SPARK MLLIB COMPETITION
      - MATLAB
      - R
    Easy to use but not scalable
      - Mahout
      - GraphLab
    Scalable, but at the cost of ease of use
      - org.apache.spark.mllib: RDD-based algorithms
      - org.apache.spark.ml: Pipeline API built on top of DataFrames
  • 46. MACHINE LEARNING FLOW
      - Loading the data
      - Extracting features
      - Training on the data
      - Testing on the data
      - The new pipeline allows tuning, testing, and early failure detection
  • 47. ALGORITHMS IN MLLIB
      - Algorithms
          - Classification, e.g. naïve Bayes
          - Regression: linear, logistic
          - Filtering by ALS (alternating least squares)
          - Clustering by k-means
          - Dimensionality reduction by SVD (singular value decomposition)
      - Feature extraction and transformations
          - TF-IDF: term frequency-inverse document frequency
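TF-IDF is simple enough to work through by hand. The sketch below computes it in plain Scala on tokenized documents, assuming raw counts for term frequency and the smoothed inverse document frequency log((N + 1) / (df + 1)) that the MLlib documentation describes; `TfIdfSketch` and the sample documents are hypothetical, and this is not how MLlib implements it internally.

```scala
// Plain-Scala TF-IDF sketch; a Seq of token lists stands in for a corpus.
object TfIdfSketch {
  // Term frequency: raw count of each term within one document.
  def tf(doc: Seq[String]): Map[String, Int] =
    doc.groupBy(identity).map { case (term, occurrences) => (term, occurrences.size) }

  // Smoothed inverse document frequency: log((N + 1) / (df + 1)),
  // where df is the number of documents containing the term.
  def idf(docs: Seq[Seq[String]]): Map[String, Double] = {
    val n = docs.size
    val df = docs.flatMap(_.distinct).groupBy(identity).map { case (t, o) => (t, o.size) }
    df.map { case (t, d) => (t, math.log((n + 1.0) / (d + 1.0))) }
  }

  // TF-IDF weight per term, per document.
  def tfIdf(docs: Seq[Seq[String]]): Seq[Map[String, Double]] = {
    val weights = idf(docs)
    docs.map(doc => tf(doc).map { case (t, c) => (t, c * weights(t)) })
  }

  def main(args: Array[String]): Unit = {
    val docs = Seq(
      Seq("spark", "is", "fast"),
      Seq("spark", "sql"),
      Seq("spark", "streaming", "is", "fast")
    )
    tfIdf(docs).foreach(println)
  }
}
```

A term that occurs in every document, such as "spark" here, gets a weight of 0, while a term unique to one document, such as "sql", gets the highest weight.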
  • 48. PRACTICAL USE
      - Spam filtering
      - Fraud detection
      - Recommendation analysis
      - Speech recognition
  • 49. MLLIB DEMO
      - Word-to-vector (Word2Vec) algorithm
      - This algorithm takes an input text and outputs a set of vectors representing a dictionary of words [to see word similarity]
      - We cache the RDDs because MLlib makes multiple passes over the same data, so this in-memory cache can reduce processing time a lot
      - Breeze: the numerical processing library used inside Spark
      - It can perform mathematical operations on vectors
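Word similarity between such vectors is commonly measured with cosine similarity. The sketch below shows that calculation in plain Scala as an illustration of the vector math involved; it uses neither Breeze nor MLlib, and `CosineSketch` and the three-dimensional vectors are made-up values, not real Word2Vec output.

```scala
// Plain-Scala cosine similarity; Seq[Double] stands in for a word vector.
object CosineSketch {
  // Dot product of two vectors of equal dimension.
  def dot(a: Seq[Double], b: Seq[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    a.zip(b).map { case (x, y) => x * y }.sum
  }

  // Euclidean length of a vector.
  def norm(a: Seq[Double]): Double = math.sqrt(dot(a, a))

  // Cosine of the angle between two non-zero vectors, in [-1, 1].
  def cosine(a: Seq[Double], b: Seq[Double]): Double =
    dot(a, b) / (norm(a) * norm(b))

  def main(args: Array[String]): Unit = {
    // Hypothetical embeddings, chosen only to show the calculation.
    val king  = Seq(0.9, 0.1, 0.4)
    val queen = Seq(0.8, 0.2, 0.5)
    val apple = Seq(0.1, 0.9, 0.0)
    println(f"king vs queen: ${cosine(king, queen)}%.3f")
    println(f"king vs apple: ${cosine(king, apple)}%.3f")
  }
}
```

Values close to 1 mean the vectors point in nearly the same direction, which is how two words would be judged similar.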
  • 51. GRAPHX - FROM A TABLE-STRUCTURED TO A GRAPH-STRUCTURED WORLD
      - GraphX is Apache Spark's API for graphs and graph-parallel computation.
      - Page ranking
      - Producing evaluations
      - It can be used in genetic analysis
      - ALGORITHMS
          - PageRank
          - Connected components
          - Label propagation
          - SVD++
          - Strongly connected components
          - Triangle count
  • 52. COMPETITION
    End-to-end PageRank performance (20 iterations, 3.7B edges)
  • 53. ARCHITECTURE
      - Vertices (joints) each have a unique ID
      - Each vertex can have properties of a user-defined type and store metadata
  • 54.
      - Arrows are relations, known as edges, that can store metadata; they refer to vertices by their IDs, which are of type Long
      - A graph is built from two RDDs: one containing the collection of edges and one containing the collection of vertices
  • 55.
      - Another component is the edge triplet, an object that exposes the relation between each edge and its vertices, containing all the information for each connection
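The relation between vertices, edges, and triplets can be modelled in a few lines. This is a plain-Scala sketch, assuming a `Map` and a `Seq` in place of the two RDDs; `TripletSketch`, `Edge`, and `Triplet` are hypothetical stand-ins whose field names (`srcId`, `dstId`, `attr`, `srcAttr`, `dstAttr`) follow the ones GraphX uses, but none of this is the GraphX API itself.

```scala
// Plain-Scala model of a property graph and its edge triplets.
object TripletSketch {
  type VertexId = Long

  // An edge connects two vertex IDs and carries its own metadata.
  final case class Edge[E](srcId: VertexId, dstId: VertexId, attr: E)

  // A triplet is an edge joined with the properties of both of its vertices.
  final case class Triplet[V, E](
    srcId: VertexId, srcAttr: V,
    dstId: VertexId, dstAttr: V,
    attr: E
  )

  // Looks up both endpoints of every edge in the vertex collection.
  def triplets[V, E](vertices: Map[VertexId, V], edges: Seq[Edge[E]]): Seq[Triplet[V, E]] =
    edges.map { e =>
      Triplet(e.srcId, vertices(e.srcId), e.dstId, vertices(e.dstId), e.attr)
    }

  def main(args: Array[String]): Unit = {
    val vertices = Map(1L -> "alice", 2L -> "bob", 3L -> "carol")
    val edges = Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"))
    triplets(vertices, edges).foreach { t =>
      println(s"${t.srcAttr} ${t.attr} ${t.dstAttr}")
    }
  }
}
```

Keeping vertices and edges in separate collections and joining them on demand is what lets the same data be viewed either as tables or as a graph.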
  • 56. WHO IS USING SPARK?
  • 58. REFERENCES
      - http://spark.apache.org
      - Tutorials: http://ampcamp.berkeley.edu
      - Spark Summit: http://spark-summit.org
      - GitHub: https://github.com/apache/spark
      - https://data-flair.training/blogs/spark-sql-tutorial/