Apache Spark Super Happy Funtimes - CHUG 2016 (Holden Karau)
This document provides an introduction to Apache Spark, including:
- An overview of what Spark is and the types of problems it can solve
- A brief look at the Spark API through the word count example
- Details on Spark's core abstractions of RDDs and how transformations and actions work
- Potential pitfalls of using groupByKey and why reduceByKey is preferable (a short sketch follows this list)
- Resources for learning more about Spark including books and video tutorials
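As a minimal illustration of the groupByKey point above, here is a hedged word count sketch, assuming a live SparkContext `sc` and a hypothetical input path; reduceByKey combines counts per partition before shuffling, while groupByKey would ship every value across the network first:

lines = sc.textFile("input.txt")  # hypothetical path
pairs = lines.flatMap(lambda x: x.split(" ")).map(lambda w: (w, 1))
word_counts = pairs.reduceByKey(lambda a, b: a + b)  # preferred
# word_counts = pairs.groupByKey().mapValues(sum)    # works, but shuffles every single value
print(word_counts.take(5))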
Debugging PySpark: Spark Summit East talk by Holden Karau (Spark Summit)
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging, as well as a look to the future for data property type accumulators, which may be coming to Spark in a future version.
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
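As a rough sketch of the accumulator-based debugging described above (the record format and field count are invented for illustration, and `sc` is assumed to be a live SparkContext):

bad_records = sc.accumulator(0)  # counts malformed lines as a side effect of parsing

def parse(line):
    fields = line.split(",")
    if len(fields) != 3:  # hypothetical expected field count
        bad_records.add(1)
        return []
    return [fields]

parsed = sc.textFile("records.csv").flatMap(parse)  # hypothetical path
parsed.count()  # an action must run before the accumulator value is meaningful
print("malformed records:", bad_records.value)
# Caveat from the abstract: recomputation (e.g. after a cache miss) can inflate this count.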
Getting started contributing to Apache Spark (Holden Karau)
Are you interested in contributing to Apache Spark? This workshop and associated slides walk through the basics of contributing to Apache Spark as a developer. This advice is based on my 3 years of contributing to Apache Spark but should not be considered official in any way.
Spark ML for custom models - FOSDEM HPC 2017 (Holden Karau)
- Spark ML pipelines involve estimators that are trained on datasets to produce immutable transformers.
- A transformer must define transformSchema() to validate the input schema, transform() to do the work, and copy() for cloning.
- Configurable transformers take parameters like inputCol and outputCol to allow configuration for meta algorithms.
- Estimators are similar, but fit() returns a model instead of directly transforming (see the PySpark sketch after this list).
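The bullets above describe the Scala-side API; in PySpark the analogous hook is `_transform`. A minimal configurable transformer sketch, not taken from the talk (the stage itself is invented for illustration):

from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql import functions as F

class StringLengthTransformer(Transformer, HasInputCol, HasOutputCol):
    # Hypothetical stage: adds a column with the length of the configured input column.
    def __init__(self, inputCol=None, outputCol=None):
        super(StringLengthTransformer, self).__init__()
        if inputCol is not None:
            self._set(inputCol=inputCol)
        if outputCol is not None:
            self._set(outputCol=outputCol)

    def _transform(self, dataset):
        return dataset.withColumn(self.getOutputCol(),
                                  F.length(F.col(self.getInputCol())))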
Introduction to and Extending Spark ML (Holden Karau)
This document discusses extending Spark ML pipelines with custom estimators and transformers. It begins with an overview of Spark ML and the pipeline API. Then it demonstrates how to build a simple hardcoded word count transformer and configurable transformer. It discusses important aspects like transforming the input schema, parameters, and model fitting. The document provides guidance on configuration, persistence, serving models, and resources for learning more about custom Spark ML components.
Extending spark ML for custom models now with python! (Holden Karau)
Are you interested in adding your own custom algorithms to Spark ML? This is the talk for you! See the companion examples in High Performance Spark and the Sparkling ML project.
Debugging Apache Spark - Scala & Python super happy fun times 2017 (Holden Karau)
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose. Holden and Joey demonstrate how to effectively search logs from Apache Spark to spot common problems and discuss options for logging from within your program itself. Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but Holden and Joey look at how to effectively use Spark’s current accumulators for debugging before gazing into the future to see the data property type accumulators that may be coming to Spark in future versions. And in addition to reading logs and instrumenting your program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems. Holden and Joey cover how to quickly use the UI to figure out if certain types of issues are occurring in our job.
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in PySpark, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging, as well as a look to the future for data property type accumulators, which may be coming to Spark in a future version.
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
Debuggers are a wonderful tool; however, when you have 100 computers the “wonder” can be a bit more like “pain”. This talk will look at how to connect remote debuggers, but also remind you that it’s probably not the easiest path forward.
This document summarizes a presentation on extending Spark ML pipelines. It discusses how pipeline stages can be estimators or transformers, with estimators needing to be trained to produce transformers. Pipeline stages must provide transformSchema and copy methods and can have configuration parameters. The document provides an example of a simple transformer and how to make it configurable. It also briefly discusses how to create an estimator by adding a fit method.
Scaling with apache spark (a lesson in unintended consequences) strange loo... (Holden Karau)
This document discusses scaling Apache Spark applications and some of the unintended consequences that can arise. It covers Spark's core abstractions of RDDs and DataFrames for distributed data and computation. It explains how Spark's lazy evaluation model and use of deterministic partitioning can impact reusing data and operations like groupByKey. It also discusses challenges that can arise from Spark's support for arbitrary functions and working with non-JVM languages like Python.
Streaming & Scaling Spark - London Spark Meetup 2016 (Holden Karau)
This talk walks through a number of common mistakes which can keep our Spark programs from scaling and examines the solutions, as well as general techniques useful for moving beyond a proof of concept to production. It covers topics like effective RDD re-use, considerations for working with key/value data, and finishes up with an introduction to Datasets with Structured Streaming (new in Spark 2.0) and how to do weird things with them.
Holden Karau walks attendees through a number of common mistakes that can keep your Spark programs from scaling and examines solutions and general techniques useful for moving beyond a proof of concept to production.
Topics include:
Working with key/value data
Replacing groupByKey for awesomeness
Key skew: your data probably has it and how to survive
Effective caching and checkpointing
Considerations for noisy clusters
Functional transformations with Spark Datasets: getting the benefits of Catalyst with the ease of functional development
How to make our code testable
Testing and validating distributed systems with Apache Spark and Apache Beam ... (Holden Karau)
As distributed data parallel systems, like Spark, are used for more mission-critical tasks, it is important to have effective tools for testing and validation. This talk explores the general considerations and challenges of testing systems like Spark through spark-testing-base and other related libraries.
With over 40% of folks automatically deploying the results of their Spark jobs to production, testing is especially important. Many of the tools for working with big data systems (like notebooks) are great for exploratory work, and can give a false sense of security (as well as additional excuses not to test). This talk explores why testing these systems is hard, special considerations for simulating "bad" partitioning, figuring out when your stream tests are stopped, and solutions to these challenges.
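For flavour, a minimal local test sketch; this uses a plain local SparkSession with pytest rather than spark-testing-base itself, so treat it as illustrative only:

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # A small local session is enough to exercise most transformation logic.
    session = SparkSession.builder.master("local[2]").appName("tests").getOrCreate()
    yield session
    session.stop()

def test_word_lengths(spark):
    df = spark.createDataFrame([("spark",), ("beam",)], ["word"])
    result = df.selectExpr("length(word) AS len").collect()
    assert [r.len for r in result] == [5, 4]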
A super fast introduction to Spark and glance at BEAM (Holden Karau)
Apache Spark is one of the most popular general purpose distributed systems, with built-in libraries to support everything from ML to SQL. Spark has APIs across languages including Scala, Java, Python, and R -- with more 3rd party language support (like Julia & C#). Apache BEAM is a cross-platform tool for building on top of different distributed systems, but it's in its early stages. This talk will introduce the core concepts of Apache Spark, and look to the potential future of Apache BEAM.
Apache Spark has two core abstractions for representing distributed data and computations. This talk will introduce the basics of RDDs and Spark DataFrames & Datasets, and Spark's method for achieving resiliency. Since it's a big data talk, we will include the almost-required wordcount example, and end the Spark part with follow-up pointers on Spark's new ML APIs. For folks who are interested, we'll then talk a bit about portability, and how Apache BEAM aims to improve portability (as well as its unique approach to cross-language support).
Slides from Holden's talk at https://www.meetup.com/Wellington-Data-Scaling-Chats/events/mdcsdpyxcbxb/
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018 (Holden Karau)
The document discusses Apache Spark Datasets and how they compare to RDDs and DataFrames. Some key points:
- Datasets provide better performance than RDDs due to a smarter optimizer, more efficient storage formats, and faster serialization. They also offer simplicity advantages over RDDs for things like windowed operations and multi-column aggregates.
- Datasets allow mixing of functional and relational styles more easily than RDDs or DataFrames. The optimizer has more information from Datasets' schemas and can perform optimizations like partial aggregation.
- Datasets address some of the limitations of DataFrames, making it easier to write UDFs and handle iterative algorithms. They provide a typed API compared to the untyped
Introduction to Spark ML Pipelines Workshop (Holden Karau)
Introduction to Spark ML Pipelines Workshop slides - companion Jupyter notebooks in Python & Scala are available from my github at https://github.com/holdenk/spark-intro-ml-pipeline-workshop
A fast introduction to PySpark with a quick look at Arrow based UDFs (Holden Karau)
This talk will introduce Apache Spark (one of the most popular big data tools), the different built ins (from SQL to ML), and, of course, everyone's favorite wordcount example. Once we've got the nice parts out of the way, we'll talk about some of the limitations and the work being undertaken to improve those limitations. We'll also look at the cases where Spark is more like trying to hammer a screw. Since we want to finish on a happy note, we will close out with looking at the new vectorized UDFs in PySpark 2.3.
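A small taste of those vectorized UDFs, assuming PySpark 2.3+ with PyArrow installed and an existing SparkSession `spark` (the column and conversion are made up for the example):

from pyspark.sql.functions import pandas_udf, PandasUDFType

# Scalar vectorized UDF: operates on pandas Series in batches rather than
# one pickled Python object at a time.
@pandas_udf("double", PandasUDFType.SCALAR)
def fahrenheit_to_celsius(f):
    return (f - 32) * 5.0 / 9.0

df = spark.createDataFrame([(212.0,), (32.0,)], ["temp_f"])
df.select(fahrenheit_to_celsius("temp_f").alias("temp_c")).show()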
Introducing Apache Spark's Data Frames and Dataset APIs workshop series (Holden Karau)
This session of the workshop introduces Spark SQL along with DataFrames and Datasets. Datasets give us the ability to easily intermix relational and functional style programming. So that we can explore the new Dataset API, this iteration will be focused on Scala.
Big Data Beyond the JVM - Strata San Jose 2018 (Holden Karau)
The document discusses accelerating big data processing beyond just the Java Virtual Machine (JVM). It introduces Rachel Warren and Holden Karau, the presenters. It then covers the current state of PySpark and its performance limitations due to serialization between Python and the JVM. Future improvements discussed include using Apache Arrow to accelerate UDFs, Dask for pure Python processing, and Apache Beam for additional languages. The presenters promote their new book on high performance Spark and take questions at the end.
Accelerating Big Data beyond the JVM - Fosdem 2018 (Holden Karau)
Many popular big data technologies (such as Apache Spark, BEAM, Flink, and Kafka) are built in the JVM, and many interesting tools are built in other languages (ranging from Python to CUDA). For simple operations the cost of copying the data can quickly dominate, and in complex cases can limit our ability to take advantage of specialty hardware. This talk explores how improved formats are being integrated to reduce these hurdles to co-operation.
Many popular big data technologies (such as Apache Spark, BEAM, and Flink) are built in the JVM, and many interesting AI tools are built in other languages, some requiring copying to the GPU. As many folks have experienced, while we may wish we could spend all of our time playing with cool algorithms, we often need to spend more of our time working on data prep. Having to copy our data slowly between the JVM and the target language of computation can remove much of the benefit of being able to access our specialized tooling. Thankfully, as illustrated in the soon-to-be-released Spark 2.3, Apache Arrow and related tools offer the ability to reduce this overhead. This talk will explore how Arrow is being integrated into Spark, and how it can be integrated into other systems, but also limitations and places where Apache Arrow will not magically save us.
Link: https://fosdem.org/2018/schedule/event/big_data_outside_jvm/
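One concrete piece of that integration, assuming Spark 2.3+ with PyArrow available and an existing SparkSession `spark`, is Arrow-backed toPandas():

# Enable Arrow for Spark -> pandas conversion; Spark falls back to the slow
# row-by-row path if Arrow is unavailable.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
df = spark.range(0, 1000000).selectExpr("id", "id * 2 AS doubled")
pdf = df.toPandas()  # columnar Arrow transfer instead of per-row pickling
print(pdf.head())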
Getting the best performance with PySpark - Spark Summit West 2016 (Holden Karau)
This talk assumes you have a basic understanding of Spark and takes us beyond the standard intro to explore what makes PySpark fast and how to best scale our PySpark jobs. If you are using Python and Spark together and want to get faster jobs – this is the talk for you. This talk covers a number of important topics for making scalable Apache Spark programs – from RDD re-use to considerations for working with Key/Value data, why avoiding groupByKey is important and more. We also include Python specific considerations, like the difference between DataFrames/Datasets and traditional RDDs with Python. We also explore some tricks to intermix Python and JVM code for cases where the performance overhead is too high.
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp... (Holden Karau)
This document provides a summary of a presentation on scaling Apache Spark. It discusses techniques for reusing RDDs through caching, persistence levels and checkpointing. It also covers best practices for working with key-value data to avoid problems from groupByKey, and using Spark SQL and accumulators. Finally, it previews bringing code generation to Spark ML to improve performance.
Introduction to Spark Datasets - Functional and relational together at last (Holden Karau)
Spark Datasets are an evolution of Spark DataFrames which allow us to work with both functional and relational transformations on big data with the speed of Spark.
Improving PySpark performance: Spark Performance Beyond the JVM (Holden Karau)
This talk covers a number of important topics for making scalable Apache Spark programs - from RDD re-use to considerations for working with Key/Value data, why avoiding groupByKey is important and more. We also include Python specific considerations, like the difference between DataFrames/Datasets and traditional RDDs with Python. We also explore some tricks to intermix Python and JVM code for cases where the performance overhead is too high.
Pandas is a fast and expressive library for data analysis that doesn't naturally scale to more data than can fit in memory. PySpark is the Python API for Apache Spark that is designed to scale to huge amounts of data but lacks the natural expressiveness of Pandas. This talk introduces Sparkling Pandas, a library that brings together the best features of Pandas and PySpark: expressiveness, speed, and scalability.
While both Spark 1.3 and Pandas have classes named ‘DataFrame’, the Pandas DataFrame API is broader and not fully covered by the ‘DataFrame’ class in Spark. This talk will explore some of the differences between Spark's DataFrames and Pandas' DataFrames and then examine some of the work done to implement Pandas-like DataFrames on top of Spark. In some cases, providing Pandas-like functionality is computationally expensive in a distributed environment, and we will explore some techniques to minimize this cost.
At the end of this talk you should have a better understanding of both Sparkling Pandas and Spark’s own DataFrames. Whether you end up using Sparkling Pandas or Spark directly, you will have a greater understanding of how to work with structured data in a distributed context using Apache Spark and familiar DataFrame APIs.
Streaming machine learning is being integrated in Spark 2.1+, but you don’t need to wait. Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Spark’s new Structured Streaming and walk you through creating your own streaming model. By the end of this session, you’ll have a better understanding of Spark’s Structured Streaming API as well as how machine learning works in Spark.
Making the big data ecosystem work together with Python & Apache Arrow, Apach... (Holden Karau)
Slides from PyData London exploring how the big data ecosystem (currently) works together as well as how different parts of the ecosystem work with Python. Proof-of-concept examples are provided using nltk & spacy with Spark. Then we look to the future and how we can improve.
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016 (Holden Karau)
Description
This talk assumes you have a basic understanding of Spark (if not check out one of the intro videos on youtube - http://bit.ly/hkPySpark ) and takes us beyond the standard intro to explore what makes PySpark fast and how to best scale our PySpark jobs. If you are using Python and Spark together and want to get faster jobs - this is the talk for you.
Abstract
This talk covers a number of important topics for making scalable Apache Spark programs - from RDD re-use to considerations for working with Key/Value data, why avoiding groupByKey is important and more. We also include Python specific considerations, like the difference between DataFrames and traditional RDDs with Python. Looking at Spark 2.0; we examine how to mix functional transformations with relational queries for performance using the new (to PySpark) Dataset API. We also explore some tricks to intermix Python and JVM code for cases where the performance overhead is too high.
An introduction into Spark ML plus how to go beyond when you get stuck (Data Con LA)
This document provides instructions for extending Spark ML pipelines by building new pipeline stages. It discusses the key components needed to build estimators and transformers, including implementing transformSchema, fit/transform methods, and parameter configuration. Examples are given of building a simple string indexer estimator and transformer. The document also briefly mentions additional features like persistence and serving that could be added.
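A very rough PySpark sketch of that estimator/transformer split (this is not the talk's string indexer; the mean-filling logic is invented for illustration):

from pyspark.ml import Estimator, Model
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql import functions as F

class MeanFillModel(Model, HasInputCol, HasOutputCol):
    # The transformer produced by fitting: fills nulls with the learned mean.
    def __init__(self, mean, inputCol, outputCol):
        super(MeanFillModel, self).__init__()
        self.mean = mean
        self._set(inputCol=inputCol, outputCol=outputCol)

    def _transform(self, dataset):
        filled = F.coalesce(F.col(self.getInputCol()), F.lit(self.mean))
        return dataset.withColumn(self.getOutputCol(), filled)

class MeanFillEstimator(Estimator, HasInputCol, HasOutputCol):
    # The estimator: _fit() learns the mean, then returns a MeanFillModel.
    def __init__(self, inputCol, outputCol):
        super(MeanFillEstimator, self).__init__()
        self._set(inputCol=inputCol, outputCol=outputCol)

    def _fit(self, dataset):
        mean = dataset.agg(F.avg(self.getInputCol())).first()[0]
        return MeanFillModel(mean, self.getInputCol(), self.getOutputCol())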
This document provides an introduction and overview of machine learning with Spark ML. It discusses the speaker and TAs, previews the topics that will be covered which include Spark's ML APIs, running an example with one API, model save/load, and serving options. It also briefly describes the different pieces of Spark including SQL, streaming, languages APIs, MLlib, and community packages. The document provides examples of loading data with Spark SQL and Spark CSV, constructing a pipeline with transformers and estimators, training a decision tree model, adding more features to the tree, and cross validation. Finally, it discusses serving models and exporting models to PMML format.
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ... (Jose Quesada)
The machine learning libraries in Apache Spark are an impressive piece of software engineering, and are maturing rapidly. What advantages does Spark.ml offer over scikit-learn? At Data Science Retreat we've taken a real-world dataset and worked through the stages of building a predictive model -- exploration, data cleaning, feature engineering, and model fitting; which would you use in production?
The machine learning libraries in Apache Spark are an impressive piece of software engineering, and are maturing rapidly. What advantages does Spark.ml offer over scikit-learn?
At Data Science Retreat we've taken a real-world dataset and worked through the stages of building a predictive model -- exploration, data cleaning, feature engineering, and model fitting -- in several different frameworks. We'll show what it's like to work with native Spark.ml, and compare it to scikit-learn along several dimensions: ease of use, productivity, feature set, and performance.
In some ways Spark.ml is still rather immature, but it also conveys new superpowers to those who know how to use it.
NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin (Zenika)
For the month of March, we're offering a Big Data theme built around Spark and Machine Learning!
We'll start with a presentation of Apache Spark 1.5: its distributed architecture and its capabilities will hold no more secrets for you.
We'll then move on to the fundamentals of Machine Learning: vocabulary (so you can finally understand what data scientists / data miners are talking about!), use cases, and an explanation of the most popular algorithms... We promise, the presentation contains no barbaric math formulas ;)
Then we'll put these two presentations into practice by building, together, your first predictive application with Apache Spark and Apache Zeppelin!
Simplifying Big Data Analytics with Apache Spark (Databricks)
Apache Spark is a fast and general-purpose cluster computing system for large-scale data processing. It improves on MapReduce by allowing data to be kept in memory across jobs, enabling faster iterative jobs. Spark consists of a core engine along with libraries for SQL, streaming, machine learning, and graph processing. The document discusses new APIs in Spark including DataFrames, which provide a tabular interface like in R/Python, and data sources, which allow plugging external data systems into Spark. These changes aim to make Spark easier for data scientists to use at scale.
End-to-end Data Pipeline with Apache Spark (Databricks)
This document discusses Apache Spark, a fast and general cluster computing system. It summarizes Spark's capabilities for machine learning workflows, including feature preparation, model training, evaluation, and production use. It also outlines new high-level APIs for data science in Spark, including DataFrames, machine learning pipelines, and an R interface, with the goal of making Spark more similar to single-machine libraries like SciKit-Learn. These new APIs are designed to make Spark easier to use for machine learning and interactive data analysis.
A really really fast introduction to PySpark - lightning fast cluster computi... (Holden Karau)
Apache Spark is a fast and general engine for distributed computing & big data processing with APIs in Scala, Java, Python, and R. This tutorial will briefly introduce PySpark (the Python API for Spark) with some hands-on-exercises combined with a quick introduction to Spark's core concepts. We will cover the obligatory wordcount example which comes in with every big-data tutorial, as well as discuss Spark's unique methods for handling node failure and other relevant internals. Then we will briefly look at how to access some of Spark's libraries (like Spark SQL & Spark ML) from Python. While Spark is available in a variety of languages this workshop will be focused on using Spark and Python together.
Alpine academy apache spark series #1 introduction to cluster computing wit... (Holden Karau)
Alpine academy apache spark series #1 introduction to cluster computing with python & a wee bit of scala. This is the first in the series and is aimed at the intro level, the next one will cover MLLib & ML.
A lecture on Apache Spark, the well-known open source cluster computing framework. The course consisted of three parts: a) installing the environment through Docker, b) an introduction to Spark as well as advanced features, and c) hands-on training on three (out of five) of its APIs, namely Core, SQL / DataFrames, and MLlib.
This document provides an introduction and overview of Apache Spark with Python (PySpark). It discusses key Spark concepts like RDDs, DataFrames, Spark SQL, Spark Streaming, GraphX, and MLlib. It includes code examples demonstrating how to work with data using PySpark for each of these concepts.
Sharing (or stealing) the jewels of python with big data & the jvm (1) (Holden Karau)
With the new Apache Arrow integration in PySpark 2.3, it is now starting become reasonable to look to the Python world and ask “what else do we want to steal besides tensorflow”, or as a Python developer look and say “how can I get my code into production without it being rewritten into a mess of Java?”
Regardless of your specific side(s) in the JVM/Python divide, collaboration is getting a lot faster, so lets learn how to share! In this brief talk we will examine sharing some of the wonders of Spacy with the Java world, which still has a somewhat lackluster set of options for NLP.
Building, Debugging, and Tuning Spark Machine Learning Pipelines (Joseph Bradl..., Spark Summit)
This document discusses Spark ML pipelines for machine learning workflows. It begins with an introduction to Spark MLlib and the various algorithms it supports. It then discusses how ML workflows can be complex, involving multiple data sources, feature transformations, and models. Spark ML pipelines allow specifying the entire workflow as a single pipeline object. This simplifies debugging, re-running on new data, and parameter tuning. The document provides an example text classification pipeline and demonstrates how data is transformed through each step via DataFrames. It concludes by discussing upcoming improvements to Spark ML pipelines.
Presented at the MLConf in Seattle, this presentation offers a quick introduction to Apache Spark, followed by an overview of two novel features for data science
Jump Start on Apache Spark 2.2 with Databricks (Anyscale)
Apache Spark 2.0 and subsequent releases of Spark 2.1 and 2.2 have laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four Hands-On Labs
Unified Big Data Processing with Apache Spark (C4Media)
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yNuLGF.
Matei Zaharia talks about the latest developments in Spark and shows examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code. Filmed at qconsf.com.
Matei Zaharia is an assistant professor of computer science at MIT, and CTO of Databricks, the company commercializing Apache Spark.
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara... (Databricks)
Apache Spark’s machine learning (ML) pipelines provide a lot of power, but sometimes the tools you need for your specific problem aren’t available yet. This talk introduces Spark’s ML pipelines, and then looks at how to extend them with your own custom algorithms. By integrating your own data preparation and machine learning tools into Spark’s ML pipelines, you will be able to take advantage of useful meta-algorithms, like parameter searching and pipeline persistence (with a bit more work, of course).
Even if you don’t have your own machine learning algorithms that you want to implement, this session will give you an inside look at how the ML APIs are built. It will also help you make even more awesome ML pipelines and customize Spark models for your needs. And if you don’t want to extend Spark ML pipelines with custom algorithms, you’ll still benefit by developing a stronger background for future Spark ML projects.
The examples in this talk will be presented in Scala, but any non-standard syntax will be explained.
Apache Spark is an open-source distributed processing engine that is up to 100 times faster than Hadoop for processing data stored in memory and 10 times faster for data stored on disk. It provides high-level APIs in Java, Scala, Python and SQL and supports batch processing, streaming, and machine learning. Spark runs on Hadoop, Mesos, Kubernetes or standalone and can access diverse data sources using its core abstraction called resilient distributed datasets (RDDs).
Spark is an open source cluster computing framework that allows processing of large datasets across clusters of computers using a simple programming model. It provides high-level APIs in Java, Scala, Python and R.
Typical machine learning workflows in Spark involve loading data, preprocessing, feature engineering, training models, evaluating performance, and tuning hyperparameters. Spark MLlib provides algorithms for common tasks like classification, regression, clustering and collaborative filtering.
The document provides an example of building a spam filtering application in Spark. It involves reading email data, extracting features using tokenization and hashing, training a logistic regression model, evaluating performance on test data, and tuning hyperparameters via cross validation.
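A compressed PySpark sketch of that spam-filtering workflow (the column names and toy data are invented; assumes a SparkSession `spark`):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Toy stand-in for the email data: a text column and a 0/1 spam label.
emails = spark.createDataFrame(
    [("win money now", 1.0), ("meeting at noon", 0.0)], ["text", "label"])

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),      # tokenization
    HashingTF(inputCol="words", outputCol="features"),  # hashed term frequencies
    LogisticRegression(maxIter=10),                     # uses "features"/"label" defaults
])

model = pipeline.fit(emails)
predictions = model.transform(emails)
print("AUC:", BinaryClassificationEvaluator().evaluate(predictions))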
Strata NYC 2015 - What's coming for the Spark community (Databricks)
In the last year Spark has seen substantial growth in adoption as well as the pace and scope of development. This talk will look forward and discuss both technical initiatives and the evolution of the Spark community.
On the technical side, I’ll discuss two key initiatives ahead for Spark. The first is a tighter integration of Spark’s libraries through shared primitives such as the data frame API. The second is across-the-board performance optimizations that exploit schema information embedded in Spark’s newer APIs. These initiatives are both designed to make Spark applications easier to write and faster to run.
On the community side, this talk will focus on the growing ecosystem of extensions, tools, and integrations evolving around Spark. I’ll survey popular language bindings, data sources, notebooks, visualization libraries, statistics libraries, and other community projects. Extensions will be a major point of growth in the future, and this talk will discuss how we can position the upstream project to help encourage and foster this growth.
Getting started with Apache Spark in Python - PyLadies Toronto 2016
1. Intro to Apache Spark
w/ML & Python
Lightning fast cluster computing with Python
For PyLadies Toronto 2016 :)
2. Who am I?
● Preferred pronouns are she/her
● I’m a Principal Software Engineer at IBM’s Spark Technology Center
● previously Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & Fast Data processing with Spark
○ co-author of a new book focused on Spark performance coming out this year*
● @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Spark Videos http://bit.ly/holdenSparkVideos
3. Who I think you wonderful humans are?
● Nice* people
● Don’t mind pictures of cats
● Want to learn about using PySpark for distributed computing
● Don’t overly mind a grab-bag of topics
Lori Erickson
4. What we are going to explore together!
● What is Spark?
● Spark’s primary distributed collection
● Word count
● How PySpark works
● Machine Learning with PySpark
Ryan McGilchrist
5. Companion notebook funtimes:
● Small companion Jupyter notebook to explore with:
○ http://bit.ly/hkMLExample
● If you want to use it you will need access to Apache Spark
○ Install from http://spark.apache.org
○ Or get access to one of the online notebook environments (IBM
BlueMix, DataBricks Cloud, Microsoft Spark HDInsights Cluster
Notebook, etc.)
David DeHetre
6. Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
Photo from Cocoa Dream
8. What is Spark?
● General purpose distributed system
○ With a really nice API
● Apache project (one of the most active)
● Much faster than Hadoop Map/Reduce
● Has Python APIs
Bernhard Latzko
9. What is Spark?
● General purpose distributed system
○ With a really nice API
● Apache project (one of the most
active)
● Much faster than Hadoop Map/Reduce
● Good when too big for a single
machine
● Built on top of two abstractions for
distributed data: RDDs & Datasets
10. The different pieces of Spark: 2.0+
● Apache Spark core
● SQL & DataFrames
● Streaming & Structured Streaming
● Language APIs: Scala, Java, Python, & R
● Graph Tools: bagel & Graph X
● Spark ML & MLLib
● Community Packages
11. SparkContext: entry to the world
● Can be used to create RDDs from many input sources
○ Native collections, local & remote FS
○ Any Hadoop Data Source
● Also create counters & accumulators
● Automatically created in the shells (called sc)
● Specify master & app name when creating
○ Master can be local[*], spark:// , yarn, etc.
○ app name should be human readable and make sense
● etc.
Petful
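A minimal sketch of the setup slide 11 describes, for running outside the shells where sc is not pre-created (the app name and paths are placeholders):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("pyladies-spark-intro")
sc = SparkContext(conf=conf)

nums = sc.parallelize([1, 2, 3, 4])   # from a native Python collection
lines = sc.textFile("README.md")      # from a local / remote file system path
counter = sc.accumulator(0)           # counters & accumulators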
12. RDDs: Spark’s Primary abstraction
RDD (Resilient Distributed Dataset)
● Distributed collection
● Recomputed on node failure
● Distributes data & work across the cluster
● Lazily evaluated (transformations & actions)
Helen Olney
13. Word count (in python)
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x + y))
word_count.saveAsTextFile("output")
Photo By: Will Keightley
14. Word count (in python)
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x + y))
word_count.saveAsTextFile("output")
No data is read or processed until after the saveAsTextFile line: it is an “action” which forces Spark to evaluate the RDD.
daniilr
15. Some common transformations & actions
Transformations (lazy)
● map
● filter
● flatMap
● reduceByKey
● join
● cogroup
Actions (eager)
● count
● reduce
● collect
● take
● saveAsTextFile
● saveAsHadoop
● countByValue
Photo by Steve
Photo by Dan G
17. Why lazy evaluation?
● Allows pipelining procedures
○ Less passes over our data, extra happiness
● Can skip materializing intermediate results which are
really really big*
● Figuring out where our code fails becomes a little
trickier
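A tiny illustration of the pipelining point above (assuming the SparkContext sc from earlier):

# These two lines only build a lineage -- nothing is read or computed yet.
nums = sc.parallelize(range(100000))
evens_squared = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# The action triggers one pipelined pass: filter and map run together per
# partition, with no materialized intermediate RDD in between.
print(evens_squared.count())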
18. So what happens when we run this code? [diagram: a driver and several workers sitting on top of HDFS / Cassandra / etc.]
19. So what happens when we run this code? [diagram: the driver ships the function to the workers]
20. So what happens when we run this code? [diagram: each worker reads its partition of the data from HDFS / Cassandra / etc.]
21. So what happens when we run this code? [diagram: each worker caches its partition and the counts flow back to the driver]
22. Spark in Scala, how does PySpark work?
● Py4J + pickling + magic
○ This can be kind of slow sometimes
● RDDs are generally RDDs of pickled objects
● Spark SQL (and DataFrames) avoid some of this
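A small illustration of that trade-off (assuming both sc and a SparkSession spark are available):

pairs = sc.parallelize([("cat", 1), ("dog", 2), ("cat", 3)])

# RDD path: the lambda runs in Python worker processes, so records are
# pickled back and forth across the JVM boundary.
totals_rdd = pairs.reduceByKey(lambda a, b: a + b).collect()

# DataFrame path: the aggregation is expressed declaratively and executed
# inside the JVM, avoiding per-record pickling.
df = spark.createDataFrame(pairs, ["animal", "n"])
totals_df = df.groupBy("animal").sum("n").collect()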
23. So what does that look like?
[diagram: the driver talks to the JVM over py4j; each worker (1 through K) launches Python worker processes connected via pipes]
24. Why should we consider Spark SQL?
● Performance
○ Smart optimizer
○ More efficient storage
○ Faster serialization
● Simplicity
○ Windowed operations
○ Multi-column & multi-type aggregates
● Integrated into the ML Pipeline API
Rikki's Refuge
26. Loading with sparkSQL & spark-csv
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("resources/adult.data"))
Jess Johnson
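As the footnote on the next slide hints, csv support moved into Spark itself in 2.0+, where the same load looks roughly like this (assuming a SparkSession spark):

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("resources/adult.data"))
df.printSchema()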
27. What about other data formats?
● Built in
○ Parquet
○ JDBC
○ Json (which is amazing!)
○ Orc
○ Hive
● Available as packages
○ csv*
○ Avro, Redshift, Mongo, Cassandra, Cloudant, Couchbase, etc.
○ +34 at
http://spark-packages.org/?q=tags%3A%22Data%20Sources%22
Michael Coghlan
*pre-2.0 package, 2.0+ built in hopefully
28. Lets explore training a Decision Tree
● Step 1: Data loading (done!)
● Step 2: Data prep (select features, etc.)
● Step 3: Train
● Step 4: Predict
29. Data prep / cleaning
● We need to predict a double (can be 0.0, 1.0, but type
must be double)
● We need to train with a vector of features
Imports:
from pyspark.mllib.linalg import Vectors
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.param import Param, Params
from pyspark.ml.feature import Bucketizer, VectorAssembler, StringIndexer
from pyspark.ml import Pipeline
Huang Yun Chung
30. Data prep / cleaning continued
# Combines a list of double input features into a vector
assembler = VectorAssembler(inputCols=["age", "education-num"],
                            outputCol="features")
# String indexer converts a set of strings into doubles
indexer = (StringIndexer(inputCol="category")
           .setOutputCol("category-index"))
# Can be used to combine pipeline components together
pipeline = Pipeline().setStages([assembler, indexer])
Huang Yun Chung
31. So a bit more about that pipeline
● Each of our previous components has “fit” & “transform”
stage
● Constructing the pipeline this way makes it easier to
work with (only need to call one fit & one transform)
● Can re-use the fitted model on future data
model=pipeline.fit(df)
prepared = model.transform(df)
Andrey
32. What does our pipeline look like so far?
[diagram] Input Data → Assembler → Input Data + Vectors → StringIndexer → Input Data + Vectors + Cat ID
While not an ML learning algorithm, the StringIndexer still needs to be fit.
The Assembler is a regular transformer - no fitting required.
33. Let's train a model on our prepared data:
# Specify model
dt = DecisionTreeClassifier(labelCol="category-index",
                            featuresCol="features")
# Fit it
dt_model = dt.fit(prepared)
# Or as part of the pipeline
pipeline_and_model = Pipeline().setStages([assembler, indexer, dt])
pipeline_model = pipeline_and_model.fit(df)
34. And predict the results on the same data:
pipeline_model.transform(df).select("prediction", "category-index").take(20)
35. Cross-validation
because saving a test set is effort
● Automagically* fit your model params
● Because thinking is effort
● org.apache.spark.ml.tuning has the tools
○ (not in Python yet so skipping for now)
Jonathan Kotta
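The slide skips the Python side, but recent PySpark versions expose the same tooling in pyspark.ml.tuning; a rough sketch reusing the dt, pipeline_and_model, and df from the earlier slides:

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

grid = ParamGridBuilder().addGrid(dt.maxDepth, [3, 5, 10]).build()
evaluator = MulticlassClassificationEvaluator(
    labelCol="category-index", predictionCol="prediction", metricName="accuracy")

cv = CrossValidator(estimator=pipeline_and_model,
                    estimatorParamMaps=grid,
                    evaluator=evaluator,
                    numFolds=3)
cv_model = cv.fit(df)  # picks the best maxDepth by cross-validated accuracy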
36. Pipeline API has many models:
● org.apache.spark.ml.classification
○ BinaryLogisticRegressionClassification, DecisionTreeClassification,
GBTClassifier, etc.
● org.apache.spark.ml.regression
○ DecisionTreeRegression, GBTRegressor, IsotonicRegression,
LinearRegression, etc.
● org.apache.spark.ml.recommendation
○ ALS
carterse
37. And the next book…..
First five chapters are available in “Early Release”*:
● Buy from O’Reilly - http://bit.ly/highPerfSpark
Get notified when updated & finished:
● http://www.highperformancespark.com
● https://twitter.com/highperfspark
* Early Release means extra mistakes, but also a chance to help us make a more awesome
book.
38. And some upcoming talks:
● September
○ This workshop (yay!)
○ New York City Strata Conf (Structured Streaming & Machine Learning)
● October
○ PyData DC - Making Spark go fast in Python (vroom vroom)
○ Salt Lake City Spark Meetup - TBD
○ London - OSCON - Getting Started Contributing to Spark
● December
○ Strata Singapore (Introduction to Datasets)
39. k thnx bye!
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
Will tweet results
“eventually” @holdenkarau
Any PySpark Users: Have some
simple UDFs you wish ran faster
you are willing to share?:
http://bit.ly/pySparkUDF
Pssst: Have feedback on the presentation? Give me a
shout ([email protected]) if you feel comfortable doing
so :)