Processing Large Data with Apache Spark -- HasGeek
Agenda
• Big Data Overview
• Spark Overview
• Spark Internals
• Spark Libraries
BIG DATA OVERVIEW
Big Data -- Digital Data growth…
V-V-V (Volume, Velocity, Variety)
Legacy Architecture Pain Points
• Report arrival latency is quite high: hours to perform joins and aggregate data
• Existing frameworks cannot do both:
  • either stream processing of 100s of MB/s with low latency
  • or batch processing of TBs of data with high latency
• Expressing business logic in Hadoop MR is challenging
SPARK OVERVIEW
Why Spark?
• Separate, fast, MapReduce-like engine
• In-memory data storage for very fast iterative queries
• Better fault tolerance
• Combines SQL, streaming, and complex analytics
• Runs on Hadoop, Mesos, standalone, or in the cloud
• Data sources: HDFS, Cassandra, HBase, and S3
In Memory - Spark vs Hadoop
• Improves efficiency over MapReduce: 100x in memory, 2-10x on disk
• Up to 40x faster than Hadoop
Spark In & Out
Diagram: data flows in from RDBMSs, streaming sources, and Hadoop Input Formats; Spark exposes Streaming, SQL, GraphX, BlinkDB, and MLlib to applications, with Tachyon for storage; shipped in distributions such as CDH, HDP, MapR, and DSE.
Ref: http://training.databricks.com/intro.pdf
Spark Streaming + SQL
Benchmarking & Best Facts
SPARK INSIDE – AROUND RDD
Resilient Distributed Dataset (RDD)
Immutable + Distributed + Cacheable + Lazily evaluated
• Distributed collections of objects
• Can be cached in memory across cluster nodes
• Manipulated through various parallel operations (sketch below)
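A minimal sketch of these properties in Scala (the input file and the error filter are illustrative, not from the deck):

import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("RddBasics").setMaster("local[*]"))

    // Transformations are lazy: nothing executes until an action is called
    val lines  = sc.textFile("app.log")             // hypothetical input file
    val errors = lines.filter(_.contains("ERROR"))  // new immutable RDD

    errors.cache()                          // keep partitions in memory across nodes
    println(errors.count())                 // action 1: triggers the computation
    errors.take(5).foreach(println)         // action 2: served from the cache

    sc.stop()
  }
}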
RDD Types
RDD Operations
Memory and Persistence (sketch below)
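The deck's notes mention storage levels such as MEMORY_ONLY_2 and MEMORY_AND_DISK_2; a hedged sketch of choosing one explicitly (input file invented, sc from the sketch above):

import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY); the _2 variants
// replicate each partition on two nodes, and MEMORY_AND_DISK spills to disk
// whatever does not fit in memory.
val parsed = sc.textFile("events.csv").map(_.split(","))  // hypothetical input
parsed.persist(StorageLevel.MEMORY_AND_DISK)
println(parsed.count())  // first action materializes and persists the partitions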
Dependency Types
Spark Cluster Overview
o Application
o Driver program
o Cluster manager
o Worker node
o Job
o Stage
o Executor
o Task
Job Flow
Task Scheduler & DAG
• Pipelines functions within a stage
• Cache-aware data reuse & locality
• Partitioning-aware to avoid shuffles

Illustrative Scala-style snippet from the slide, lightly corrected (splitLines and key are placeholder functions):
rdd1.map(splitLines).filter(_.contains("ERROR"))  // map + filter pipeline into a single stage
rdd2.map(splitLines).groupBy(key)                 // groupBy triggers a shuffle
rdd2.join(rdd1).take(10)                          // join of pair RDDs; take(10) is the action
Fault Recovery & Checkpoints
• Efficient fault recovery using lineage: log one operation to apply to many elements
• Recompute lost partitions on failure
• Checkpoint RDDs to prevent long lineage chains during fault recovery (sketch below)
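A hedged sketch of truncating a long lineage chain with a checkpoint (the directory is illustrative; use a fault-tolerant store such as HDFS):

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  // hypothetical directory

var state = sc.parallelize(1 to 1000000)
for (_ <- 1 to 50) {
  state = state.map(_ + 1)  // each iteration grows the lineage chain
}
state.checkpoint()          // mark the RDD to be saved to reliable storage
state.count()               // the action materializes the RDD and truncates lineage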
QUICK DEMO
SPARK STACK DETAILS
Spark SQL
• Seamlessly mix SQL queries with Spark programs (sketch below)
• Load and query data from a variety of sources
• Standard connectivity through JDBC/ODBC
• Hive compatibility
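A hedged sketch of mixing SQL with a program. It uses the SparkSession entry point from Spark 2.x; the deck's Spark 1.x era used SQLContext instead, and the file name is invented:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SqlDemo").master("local[*]").getOrCreate()

// Load from one of many supported sources (JSON here)
val people = spark.read.json("people.json")  // hypothetical file

// Mix declarative SQL with programmatic DataFrame code
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 21").show()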
DataFrames
• A distributed collection of data organized into named columns
• Like a table in a relational database
[Figure from the Spark SQL paper: a JDBC console and user programs (Java, Scala, Python) sit on top of the DataFrame API, which feeds the Catalyst Optimizer and executes over Spark's Resilient Distributed Datasets. Caption: "Figure 1: Interfaces to Spark SQL, and interaction with Spark."]

Excerpt from the paper, section 3.1 DataFrame API: the main abstraction in Spark SQL's API is a DataFrame, a distributed collection of rows with a homogeneous schema. A DataFrame is equivalent to a table in a relational database, and can also be manipulated in similar ways to "native" distributed collections.
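The clipped right-hand column of that excerpt introduces the paper's running example of counting female employees per department; a hedged reconstruction in the DataFrame DSL (table and column names are assumptions, spark from the sketch above):

import org.apache.spark.sql.functions.count

val employees = spark.table("employees")  // assumed columns: deptId, gender, name
val dept      = spark.table("dept")       // assumed columns: id, name

employees
  .join(dept, employees("deptId") === dept("id"))
  .where(employees("gender") === "female")
  .groupBy(dept("id"), dept("name"))
  .agg(count("name"))
  .show()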
SparkR
• New R language API for Spark and Spark SQL
• Exposes existing Spark functionality in an R-friendly syntax via the DataFrame API
Spark Streaming
Inputs: Flume, HDFS, Kinesis, Kafka, Twitter; outputs: file systems, databases, dashboards.
• High-level API: joins, windows, … often 5x less code
• Fault-tolerant: exactly-once semantics, even for stateful ops
• Integration: works with MLlib, SQL, DataFrames, GraphX
Chop up the live stream into batches of X seconds; a DStream is represented by a continuous series of RDDs (sketch below).
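A minimal DStream sketch (the socket source, port, and 5-second batch interval are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[2]: one core for the receiver, one for processing
val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))  // chop the stream into 5s batches

// Each batch of lines becomes one RDD inside the DStream
val lines  = ssc.socketTextStream("localhost", 9999)  // hypothetical source
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)

counts.print()  // emit each batch's results
ssc.start()
ssc.awaitTermination()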
MLlib
• Scalable machine learning library
• Iterative computing -> high-quality algorithms, up to 100x faster than Hadoop
MLlib Algorithms
ML Pipeline (sketch below)
• Feature extraction
• Normalization
• Dimensionality reduction
• Model training
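A hedged sketch with the spark.ml Pipeline API, covering two of the stages above (feature extraction and model training); the column names and toy training set are invented:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Feature extraction: tokenize text, then hash tokens into feature vectors
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

val training = spark.createDataFrame(Seq(
  (0L, "spark is fast", 1.0),
  (1L, "hadoop map reduce", 0.0)
)).toDF("id", "text", "label")

// fit() runs the Estimator stages and returns a reusable PipelineModel
val model = pipeline.fit(training)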
GraphX
• Spark's API for graph and graph-parallel computation
• Graph abstraction: a directed multigraph with properties attached to each vertex and edge
• Works seamlessly with both the graph and collection views (sketch below)
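A hedged sketch of the property-graph abstraction (toy vertices and edges invented; sc is an existing SparkContext):

import org.apache.spark.graphx.{Edge, Graph}

// A directed multigraph with properties on vertices (names) and edges (labels)
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(2L, 3L, "follows"),
  Edge(3L, 1L, "follows")))

val graph = Graph(vertices, edges)

// Collection view and a graph-parallel algorithm on the same structure
println(graph.vertices.count())
graph.pageRank(0.001).vertices.collect().foreach(println)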
GraphX Framework & Algorithms
Spark Packages
Users & Distributors…
Thanks to Apache Spark…
• Started using it in our projects
• Contribute to the open source community
• Socialize Spark
Backup Slides
SPARK CLUSTER
Cluster Support (configuration sketch below)
• Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster
• Apache Mesos – a general cluster manager that can also run Hadoop MapReduce and service applications
• Hadoop YARN – the resource manager in Hadoop 2
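A hedged sketch of pointing the same application at each manager; host names and ports are illustrative:

import org.apache.spark.SparkConf

// The master URL selects the cluster manager; application code is unchanged.
val conf = new SparkConf().setAppName("ClusterDemo")
  .setMaster("spark://master:7077")    // standalone
  // .setMaster("mesos://master:5050") // Apache Mesos
  // .setMaster("yarn")                // Hadoop YARN (reads HADOOP_CONF_DIR)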
Spark on Mesos
Spark on YARN
Data Science Process
Data Science in Practice
• Data Collection
• Munging
• Analysis
• Visualization
• Decision
Real Time Feedback
SQL Optimization (Catalyst)
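A hedged way to inspect Catalyst's work, reusing the people DataFrame from the Spark SQL sketch earlier:

import spark.implicits._  // enables the $"column" syntax

// explain(true) prints the parsed, analyzed, optimized, and physical plans,
// making rules such as predicate pushdown and projection pruning visible.
people.where($"age" > 21).select($"name").explain(true)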
Project Tungsten
• Memory management and binary processing: leveraging application semantics to manage memory explicitly and eliminate the overhead of the JVM object model and garbage collection
• Cache-aware computation: algorithms and data structures to exploit the memory hierarchy
• Code generation: using code generation to exploit modern compilers and CPUs
BDAS - Berkeley Data Analytics Stack
https://amplab.cs.berkeley.edu/software/
BDAS, the Berkeley Data Analytics Stack, is an open source software stack that integrates software components being built by the AMPLab to make sense of Big Data.
Optimization
• groupByKey is costly – prefer reduceByKey() or aggregateByKey(), which combine map-side before the shuffle
• RDD storage level MEMORY_ONLY is preferable
Optimization Code Example
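A hedged sketch contrasting the two approaches with a word count (input file invented; sc is an existing SparkContext):

val words = sc.textFile("app.log").flatMap(_.split(" "))  // hypothetical input

// Costly: groupByKey ships every (word, 1) pair across the network
val slow = words.map(w => (w, 1)).groupByKey().mapValues(_.size)

// Cheaper: reduceByKey combines locally on each partition before the shuffle
val fast = words.map(w => (w, 1)).reduceByKey(_ + _)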
RDDs vs Distributed Shared Memory
DAG Visualization
Spark + Akka + Spray
SparkR Architecture
PySpark
GraphX representation
Links & References
• Spark
• Spark Submit 2015
• Spark External Projects
• Spark Central
Project Tungsten Roadmap
TACHYON
• Tachyon is a memory-centric distributed storage system enabling reliable data sharing at memory speed across cluster frameworks, such as Spark and MapReduce. It achieves high performance by leveraging lineage information and using memory aggressively. Tachyon caches working-set files in memory, thereby avoiding going to disk to load datasets that are frequently read. This enables different jobs/queries and frameworks to access cached files at memory speed.
BlinkDB
Batches…
• Chop up the live stream into batches of X seconds
• Spark treats each batch of data as RDDs and processes them using RDD operations
• Finally, the processed results of the RDD operations are returned in batches
Micro Batch
DStream (Discretized Stream)
A DStream is represented by a continuous series of RDDs
Window Operation & Checkpoint
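A hedged sketch of a sliding window over a keyed DStream such as counts from the word-count sketch earlier (window 30s, slide 10s; the inverse function requires a checkpoint directory):

ssc.checkpoint("hdfs:///tmp/streaming-ckpt")  // hypothetical directory

// Keep a 30-second window of counts, recomputed every 10 seconds.
// The inverse function subtracts the data that leaves the window.
val windowed = counts.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,
  (a: Int, b: Int) => a - b,
  Seconds(30), Seconds(10))
windowed.print()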
Streaming
• Scalable, high-throughput stream processing of live data
• Integrates with many sources
• Fault-tolerant: stateful exactly-once semantics out of the box
• Combines streaming with batch and interactive queries
Spark Streaming
Diagram: receivers ingest data streams, the stream is chopped into batches represented as RDDs, and results are returned as RDDs.
Streaming Fault Tolerance
Spark Streaming UI
Micro Batch (Near Real Time)
Spark with Storm
Spark + Cassandra
Big Data Landscape
100 open source Big Data architecture papers

Editor's Notes

  • #7: https://spark-summit.org/2013/wp-content/uploads/2013/10/Tully-SparkSummit4.pdf
  • #11: SparkR was introduced just a few days ago with Spark 1.4. This is Spark’s first new language API since PySpark was added in 2012. SparkR is based on Spark’s parallel DataFrame abstraction. Users can create SparkR DataFrames from “local” R data frames, or from any Spark data source such as Hive, HDFS, Parquet or JSON. SparkR DataFrames support all Spark DataFrame operations including aggregation, filtering, grouping, summary statistics, and other analytical functions. They also supports mixing-in SQL queries, and converting query results to and from DataFrames. Because SparkR uses the Spark’s parallel engine underneath, operations take advantage of multiple cores or multiple machines, and can scale to data sizes much larger than standalone R programs. The new DataFrames API was inspired by data frames in R and Pandas in Python. DataFrames integrate with Python, Java, Scala and R and give you state of the art optimization through the Spark SQL Catalyst optimizer. DataFrames are just a distributed collection of data organized into named columns and can be made from tables in Hive, external databases or existing RDDs. SQL: Spark’s module for working with structured data made of rows and columns. Spark SQL used to work against a special type of RDD called a SchemaRDD, but that is now being replaced with DataFrames. Spark SQL reuses the Hive frontend and metastore, which gives you full compatibility with existing Hive data, queries and UDFs. This allows you to run unmodified Hive queries on existing data warehouses. There is also standard connectivity through JDBC or ODBC via a Simba driver. Tableau uses this Simba driver to send queries down to Spark SQL to run at scale. Streaming: makes it easy to build scalable fault-tolerant streaming applications with stateful exactly-once symantics out of the box. Streaming allows you to reuse the same code for batch processing and stream processing. In 2012, Spark Streaming was able to process over 60 million records per second on 100 nodes at sub-second processing latency, which makes it 2 – 4x faster than comparable systems like Apache Storm on Yahoo’s S4. Netflix is one of the big users of Spark Streaming. (60 million / 100 = 600k) Spark Streaming is able to process 100,000-500,000 records/node/sec. This is much faster than Storm and comparable to other Stream processing systems. Sigmoid was able to consume 480,000 records per second per node machines using Kafka as a source. Kafka: Kafka basically acts as a buffer for incoming data. It is a high-throughput distributed messaging system. So Kafka maintains feeds of messages in categories called topics that get pushed or published into Kafka by producers. Then consumers like Spark Streaming can subscribe to topics and consume the feed of published messages. Each node in a Kafka cluster is called a broker. More than 75% of the time we see Kafka being used instead of Flume. Flume: Distributed log collection and aggregation service for moving large amounts of log data from many different sources to a centralized data store. So with Flume, data from external sources like web servers is consumed by a Flume source. When a Flume source receives an event, it stores it into one or more channels. The channels will keep the event until its consumed by a Flume sink. So, when Flume pushes the data into the sink, that’s where the data is buffered until Spark Streaming pulls the data from the sink. 
MLlib + GraphX: Mllib is Spark’s scalable machine learning library consisting of common algorithms and utilities including classification, regression, clustering, collaborative filtering, dimensionality reduction. MLlib’s datatypes are vectors and matrices and some of the underlying linear algebra operations on them are provided by Breeze and jblas. The major algorithmic components in Mllib are statistics (like max, min, mean, variance, # of non-zeroes, correlations (Pearson’s and Spearman’s correlations), Stratified Sampling, Hypothesis testing, Random Data Generation, Classification & Regression (like linear models, SVMs, logistic regression, linear regression, naïve Bayes, decision trees, random forests, gradient-boosted trees), Collaborative filtering (ALS), Clustering (K-means), Dimensionality reduction (Singular value decomposition/SVD and Principal Component Analysis/PCA), Feature extraction and transformation, optimization (like stochastic gradient descent and limited memory BFGS). Tachyon: memory based distributed storage system that allows data sharing across cluster frameworks like Spark or Hadoop MapReduce. Project has 60 contributors from 20 institutions. Has a Java-like API similar to that of java.io.File class providing InputStream and OutputStream interfaces. Tachyon also implements the Hadoop FileSystem interface, to allow frameworks that can read from Hadoop Input Formats like MapReduce or Spark to read the data. Tachyon has some interesting features for data in tables… like native support for multi-columned data with the option to put only hot columns in memory to save space. BlinkDB: is an approximate query engine for running interactive SQL queries on large volumes of data. It allows users to trade off query accuracy for response time by running queries on data samples and presenting results annotated with error bars. BlinkDB was demoed in 2012 on a 100 node Amazon EC2 cluster answering a range of queries on 17 TBs of data in less than 2 seconds (which is over 200x faster than Hive) with an error of 2 – 10%. To do this, BlinkDB uses an offline sampling module that creates uniform and stratified samples from underlying data. Two of the big users of BlinkDB are Conviva and Facebook. Tachyon is a memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster frameworks, such as Spark and MapReduce
  • #13: http://opensource.com/business/15/1/apache-spark-new-world-record Organizations from around the world often build dedicated sort machines (specialized software and sometimes specialized hardware) to compete in this benchmark. Spark actually tied for 1st place with a team from University of California San Diego who have been working on creating a specialized sorting system called TritonSort. Winning this benchmark as a general, fault-tolerant system marks an important milestone for the Spark project. It demonstrates that Spark is fulfilling its promise to serve as a faster and more scalable engine for data processing of all sizes, from GBs to TBs to PBs. Named after Jim Gray, the benchmark workload is resource intensive by any measure: sorting 100 TB of data following the strict rules requires reading and writing 500 TB of disk I/O and 200 TB of network I/O (because you have to replicate the output to make it fault tolerant). This was the first time a system based on a public cloud has won. Engineering investment in Spark: sort-based shuffle (SPARK-2045), Netty native network transport (SPARK-2468), external shuffle service (SPARK-3796). Clever application-level techniques: GC- and cache-friendly memory layout, pipelining. More info: http://sortbenchmark.org http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
  • #15: A resilient distributed dataset (RDD) is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat. All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
  • #17: Transformations (eg: map, filter, group by) : Create a new dataset from an existing one Actions ( eg: count, collect, save) : Return a value to the driver program after running a computation on the dataset
  • #18: Spark is persisting (or caching) a dataset in memory across operations MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. Same as the levels above, but replicate each partition on two cluster nodes. Kryo serialization: Spark can also use the Kryo library (version 2) to serialize objects more quickly. Kryo is significantly faster and more compact than Java serialization (often as much as 10x), but does not support all Serializable types and requires you to register the classes you’ll use in the program in advance for best performance. Tachyon is a memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster frameworks, such as Spark and MapReduce
  • #19: The Shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O. To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark’s map and reduce operations. Although the set of elements in each partition of newly shuffled data will be deterministic, and so is the ordering of partitions themselves, the ordering of these elements is not. If one desires predictably ordered data following shuffle then it’s possible to use: mapPartitions to sort each partition using, for example, .sorted repartitionAndSortWithinPartitions to efficiently sort partitions while simultaneously repartitioning sortBy to make a globally ordered RDD
  • #27: The new DataFrames API was inspired by data frames in R and Pandas in Python. DataFrames integrate with Python, Java, Scala and R and give you state of the art optimization through the Spark SQL Catalyst optimizer. DataFrames are just a distributed collection of data organized into named columns and can be made from tables in Hive, external databases or existing RDDs. Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as distributed SQL query engine. A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. Spark SQL supports operating on a variety of data sources through the DataFrame interface. A DataFrame can be operated on as normal RDDs and can also be registered as a temporary table
  • #28: SparkR was introduced just a few days ago with Spark 1.4. This is Spark’s first new language API since PySpark was added in 2012. SparkR is based on Spark’s parallel DataFrame abstraction. Users can create SparkR DataFrames from “local” R data frames, or from any Spark data source such as Hive, HDFS, Parquet or JSON. SparkR DataFrames support all Spark DataFrame operations including aggregation, filtering, grouping, summary statistics, and other analytical functions. They also supports mixing-in SQL queries, and converting query results to and from DataFrames. Because SparkR uses the Spark’s parallel engine underneath, operations take advantage of multiple cores or multiple machines, and can scale to data sizes much larger than standalone R programs.
  • #29: Chop up the live stream into batches of X seconds. Spark treats each batch of data as RDDs and processes them using RDD operations. Finally, the processed results of the RDD operations are returned in batches. Streaming makes it easy to build scalable fault-tolerant streaming applications with stateful exactly-once semantics out of the box, and allows you to reuse the same code for batch processing and stream processing. In 2012, Spark Streaming was able to process over 60 million records per second on 100 nodes at sub-second processing latency, which makes it 2 – 4x faster than comparable systems like Apache Storm on Yahoo's S4. Netflix is one of the big users of Spark Streaming. (60 million / 100 = 600k) Spark Streaming is able to process 100,000-500,000 records/node/sec. This is much faster than Storm and comparable to other stream processing systems. Sigmoid was able to consume 480,000 records per second per node using Kafka as a source.
  • #32: https://databricks.com/blog/2015/01/07/ml-pipelines-a-new-high-level-api-for-mllib.html A pipeline consists of a sequence of stages. There are two basic types of pipeline stages: Transformer and Estimator. A Transformer takes a dataset as input and produces an augmented dataset as output. E.g., a tokenizer is a Transformer that transforms a dataset with text into an dataset with tokenized words. An Estimator must be first fit on the input dataset to produce a model, which is a Transformer that transforms the input dataset. E.g., logistic regression is an Estimator that trains on a dataset with labels and features and produces a logistic regression model. In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variable) denoted X.
  • #33: GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and aggregateMessages) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
  • #34: PageRank measures the importance of each vertex in a graph, assuming an edge from u to v represents an endorsement of v’s importance by u. For example, if a Twitter user is followed by many others, the user will be ranked highly. The connected components algorithm labels each connected component of the graph with the ID of its lowest-numbered vertex. For example, in a social network, connected components can approximate clusters. A vertex is part of a triangle when it has two adjacent vertices with an edge between them.
  • #35: Spark Packages will feature integrations with various data sources, management tools, higher level domain-specific libraries, machine learning algorithms, code samples, and other Spark content.
  • #36: http://www.aerospike.com/blog/what-the-spark-introduction/
  • #44: https://en.wikipedia.org/wiki/Data_analysis
  • #45: http://www.business-software.com/wp-content/uploads/2014/09/Spark-Storm.jpg
  • #46: http://www.slideshare.net/databricks/introducing-dataframes-in-spark-for-large-scale-data-science Logical Optimization The logical optimization phase applies standard rule-based opti- mizations to the logical plan. Constant folding, Predicate pushdown Projection pruning null propagation Boolean ex- pression simplification
  • #47: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
  • #53: http://apachesparkcentral.com/page/7/
  • #56: GraphX adopts a vertex-cut approach to distributed graph partitioning. Rather than splitting graphs along edges, GraphX partitions the graph along vertices which can reduce both the communication and storage overhead.
  • #59: http://tachyon-project.org/
  • #63: https://spark.apache.org/docs/latest/streaming-programming-guide.html
  • #65: Chop up the live stream into batches of X seconds. Spark treats each batch of data as RDDs and processes them using RDD operations. Finally, the processed results of the RDD operations are returned in batches Spark Streaming brings Spark's language-integrated API to stream processing, letting you write streaming applications the same way you write batch jobs. It supports both Java and Scala. Spark Streaming lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state Spark Streaming can read data from HDFS, Flume, Kafka, Twitter and ZeroMQ. Since Spark Streaming is built on top of Spark, users can apply Spark's in-built machine learning algorithms (MLlib), and graph processing algorithms (GraphX) on data streams
  • #67: https://www.sigmoid.com/fault-tolerant-streaming-workflows/
  • #73: https://www.linkedin.com/pulse/100-open-source-big-data-architecture-papers-anil-madan