This talk will focus on the journey of technical challenges, trade-offs, and ground-breaking achievements in building performant and scalable pipelines, drawn from our experience working with customers.
Scaling Up AI Research to Production with PyTorch and MLflow (Databricks)
PyTorch, the popular open-source ML framework, has continued to evolve rapidly since the introduction of PyTorch 1.0, which brought an accelerated workflow from research to production.
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr... (Databricks)
You will learn how CERN has implemented an Apache Spark-based data pipeline to support deep learning research work in High Energy Physics (HEP). HEP is a data-intensive domain. For example, the amount of data flowing through the online systems at LHC experiments is currently of the order of 1 PB/s, with particle collision events happening every 25 ns. Filtering is applied before storing data for later processing.
Improvements in the accuracy of the online event filtering system are key to optimize usage and cost of compute and storage resources. A novel prototype of event filtering system based on a classifier trained using deep neural networks has recently been proposed. This presentation covers how we implemented the data pipeline to train the neural network classifier using solutions from the Apache Spark and Big Data ecosystem, integrated with tools, software, and platforms familiar to scientists and data engineers at CERN. Data preparation and feature engineering make use of PySpark, Spark SQL and Python code run via Jupyter notebooks.
We will discuss key integrations and libraries that make Apache Spark able to ingest data stored using HEP data format (ROOT) and the integration with CERN storage and compute systems. You will learn about the neural network models used, defined using the Keras API, and how the models have been trained in a distributed fashion on Spark clusters using BigDL and Analytics Zoo. We will discuss the implementation and results of the distributed training, as well as the lessons learned.
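To make the pipeline shape concrete, here is a hypothetical, minimal sketch of the kind of PySpark feature preparation and Keras model definition described above; the paths, column names, and feature dimension are illustrative assumptions, and the distributed BigDL/Analytics Zoo training step itself is not shown.

```python
# Hypothetical sketch: PySpark feature preparation followed by a small Keras
# classifier, similar in spirit to the HEP event-filter pipeline described above.
# Paths, column names, and the feature dimension are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F
import tensorflow as tf

spark = SparkSession.builder.appName("hep-filter-prep").getOrCreate()

events = spark.read.parquet("/data/hep/events.parquet")   # pre-filtered events
features = (events
            .withColumn("label", F.when(F.col("is_signal"), 1).otherwise(0))
            .select("hlf_features", "label"))              # high-level features

# A fully connected binary classifier defined with the Keras API; in the talk,
# the equivalent model is trained in a distributed fashion with BigDL/Analytics Zoo.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(50, activation="relu", input_shape=(14,)),
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```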
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS... (Databricks)
GPU acceleration has been at the heart of scientific computing and artificial intelligence for many years now. GPUs provide the computational power needed for the most demanding applications, such as deep neural networks and nuclear or weather simulation. Since the launch of RAPIDS in mid-2018, this vast computational resource has become available for data science workloads too. The RAPIDS toolkit, which is now available on the Databricks Unified Analytics Platform, is a GPU-accelerated drop-in replacement for utilities such as pandas, NumPy, scikit-learn, and XGBoost. Through its use of Dask wrappers, the platform allows for true, large-scale computation with minimal, if any, code changes.
The goal of this talk is to discuss RAPIDS, its functionality and architecture, as well as the way it integrates with Spark, on many occasions providing several orders of magnitude of acceleration versus its CPU-only counterparts.
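As a quick illustration of the "drop-in replacement" idea, here is a hedged sketch comparing a pandas groupby with its cuDF equivalent; the input file and column names are hypothetical, and a RAPIDS installation with an NVIDIA GPU is assumed.

```python
# Hedged sketch of the RAPIDS drop-in idea: cuDF mirrors much of the pandas API,
# so the same groupby can run on CPU (pandas) or GPU (cuDF) with minimal changes.
import pandas as pd
import cudf  # assumes a RAPIDS installation and an NVIDIA GPU

pdf = pd.read_csv("transactions.csv")              # hypothetical input file
cpu_result = pdf.groupby("store_id")["amount"].mean()

gdf = cudf.read_csv("transactions.csv")            # same call, GPU-backed DataFrame
gpu_result = gdf.groupby("store_id")["amount"].mean()

print(cpu_result.head())
print(gpu_result.head())
```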
Extreme Apache Spark: how in 3 months we created a pipeline that can process ... (Josef A. Habdank)
The presentation is a bundle of pro tips and tricks for building an insanely scalable Apache Spark and Spark Streaming based data pipeline.
The presentation consists of four parts:
* Quick intro to Spark
* N-billion rows/day system architecture
* Data Warehouse and Messaging
* How to deploy spark so it does not backfire
Stage Level Scheduling Improving Big Data and AI Integration (Databricks)
In this talk, I will dive into the stage-level scheduling feature added in Apache Spark 3.1. Stage-level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration, and it also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, memory, or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference, and then send the data into a Deep Learning algorithm. Using stage-level scheduling combined with accelerator-aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance when there is data skew or when the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API, and use cases. I will demo how the stage-level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the TensorFlow Keras API on GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
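For orientation, the following is a minimal sketch of how the stage-level scheduling API looks in PySpark 3.1, requesting GPUs only for a later training stage; the resource amounts, discovery script path, and placeholder functions are assumptions for illustration, not the speaker's demo code.

```python
# Minimal sketch of Spark 3.1 stage-level scheduling in PySpark: ETL runs with the
# default (CPU) resources, then a later stage requests GPU-backed containers.
# Amounts, the discovery script path, and the placeholder functions are assumptions.
from pyspark.sql import SparkSession
from pyspark.resource import (ResourceProfileBuilder, TaskResourceRequests,
                              ExecutorResourceRequests)

spark = SparkSession.builder.appName("stage-level-scheduling").getOrCreate()

def preprocess(row):            # placeholder ETL transformation
    return row

def train_partition(rows):      # placeholder training function run per partition
    return [sum(1 for _ in rows)]

etl_rdd = spark.read.parquet("/data/raw").rdd.map(preprocess)

exec_reqs = (ExecutorResourceRequests().cores(4).memory("16g")
             .resource("gpu", 1, discoveryScript="/opt/spark/getGpus.sh"))
task_reqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
profile = ResourceProfileBuilder().require(exec_reqs).require(task_reqs).build

# Stages computed from this RDD run in containers matching the new profile.
result = etl_rdd.withResources(profile).mapPartitions(train_partition).collect()
```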
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric... (Databricks)
Recently, there has been increased interest in running analytics and machine learning workloads on top of serverless frameworks in the cloud. The serverless execution model provides fine-grained scaling and unburdens users from having to manage servers, but it also adds substantial performance overheads because all data and intermediate state of compute tasks are stored on remote shared storage.
In this talk I first provide a detailed performance breakdown from a machine learning workload using Spark on AWS Lambda. I show how the intermediate state of tasks — such as model updates or broadcast messages — is exchanged using remote storage and what the performance overheads are. Later, I illustrate how the same workload performs on-premise using Apache Spark and Apache Crail deployed on a high-performance cluster (100Gbps network, NVMe Flash, etc.). Serverless computing simplifies the deployment of machine learning applications. The talk shows that performance does not need to be sacrificed.
What’s New in the Upcoming Apache Spark 3.0 (Databricks)
Learn about the latest developments in the open-source community with Apache Spark 3.0 and DBR 7.0. The upcoming Apache Spark™ 3.0 release brings new capabilities and features to the Spark ecosystem. In this online tech talk from Databricks, we will walk through updates in the Apache Spark 3.0.0-preview2 release as part of our new Databricks Runtime 7.0 Beta, which is now available.
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service (Databricks)
Zeus is an efficient, highly scalable, distributed shuffle-as-a-service that powers all data processing (Spark and Hive) at Uber. Uber runs one of the largest Spark and Hive clusters on top of YARN in the industry, which leads to many issues such as hardware failures (burnt-out disks) and reliability and scalability challenges.
End-to-End Data Pipelines with Apache Spark (Burak Yavuz)
This presentation is about building a data product backed by Apache Spark. The source code for the demo can be found at http://brkyvz.github.io/spark-pipeline
Resource-Efficient Deep Learning Model Selection on Apache Spark (Databricks)
Deep neural networks (deep nets) are revolutionizing many machine learning (ML) applications. But there is a major bottleneck to broader adoption: the pain of model selection.
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it... (Databricks)
DeepLearning4J (DL4J) is a powerful open-source distributed framework that brings deep learning to the JVM (it can serve as a DIY tool for Java, Scala, Clojure, and Kotlin programmers). It can be used on distributed GPUs and CPUs, and it is integrated with Hadoop and Apache Spark. ND4J is an open-source, distributed, GPU-enabled library that brings the intuitive scientific computing tools of the Python community to the JVM. Training neural network models using DL4J, ND4J, and Spark is a powerful combination, but it presents some unexpected issues that can compromise performance and nullify the benefits of well-written code and good model design. In this talk I will walk through some of those problems and present some best practices to prevent them, coming from lessons learned when putting things into production.
Accelerating Data Processing in Spark SQL with Pandas UDFs (Databricks)
Spark SQL provides a convenient layer of abstraction for users to express their query’s intent while letting Spark handle the more difficult task of query optimization. Since Spark 2.3, the addition of pandas UDFs allows users to define arbitrary functions in Python that are executed in batches, giving them the flexibility required to write queries that suit very niche cases.
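A small, hedged example of the batch-oriented pandas UDFs mentioned above, written in the Spark 3.x type-hint style; the data and column names are made up for illustration.

```python
# Hedged example of a pandas UDF: the function receives and returns pandas Series,
# so Spark executes it on batches of rows rather than row by row.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()
df = spark.createDataFrame([(1, 20.0), (2, 35.5)], ["id", "temp_c"])

@pandas_udf("double")
def celsius_to_fahrenheit(c: pd.Series) -> pd.Series:
    return c * 9.0 / 5.0 + 32.0

df.withColumn("temp_f", celsius_to_fahrenheit("temp_c")).show()
```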
Advanced Natural Language Processing with Apache Spark NLP (Databricks)
NLP is a key component in many data science systems that must understand or reason about text. This hands-on tutorial uses the open-source Spark NLP library to explore advanced NLP in Python.
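A hedged sketch of the kind of Python usage such a tutorial typically starts from, using one of Spark NLP's pretrained pipelines; the pipeline name and result keys follow the library's public examples and should be treated as assumptions.

```python
# Hedged sketch: loading a Spark NLP pretrained pipeline and annotating text.
# The pipeline name and result keys follow Spark NLP's public examples (assumption).
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()  # starts a Spark session with Spark NLP on the classpath
pipeline = PretrainedPipeline("explain_document_dl", lang="en")
result = pipeline.annotate("Spark NLP provides production-grade NLP annotators for Python.")
print(result["entities"])
print(result["pos"])
```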
Hyperspace: An Indexing Subsystem for Apache Spark (Databricks)
At Microsoft, we store datasets (both from internal teams and external customers) ranging from a few GBs to 100s of PBs in our data lake. The scope of analytics on these datasets ranges from traditional batch-style queries (e.g., OLAP) to explorative, ‘finding needle in a haystack’ type of queries (e.g., point-lookups, summarization etc.).
Native Support of Prometheus Monitoring in Apache Spark 3.0 (Databricks)
All production environments require monitoring and alerting. Apache Spark has a configurable metrics system that allows users to report Spark metrics to a variety of sinks. Prometheus is a popular open-source monitoring and alerting toolkit that is frequently used together with Apache Spark.
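For reference, here is a hedged sketch of how the Prometheus-format endpoints added in Spark 3.0 can be switched on from PySpark; the settings mirror the Spark monitoring documentation, and the port and paths in the comments are the defaults, not values from this talk.

```python
# Hedged sketch: enabling the Prometheus-format endpoints introduced in Spark 3.0.
# spark.ui.prometheus.enabled exposes executor metrics on the driver UI, and the
# PrometheusServlet sink exposes driver metrics; values follow the Spark docs.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("prometheus-metrics-demo")
         .config("spark.ui.prometheus.enabled", "true")
         .config("spark.metrics.conf.*.sink.prometheusServlet.class",
                 "org.apache.spark.metrics.sink.PrometheusServlet")
         .config("spark.metrics.conf.*.sink.prometheusServlet.path",
                 "/metrics/prometheus")
         .getOrCreate())

# Prometheus can then scrape, for example:
#   http://<driver>:4040/metrics/prometheus            (driver metrics)
#   http://<driver>:4040/metrics/executors/prometheus  (executor metrics)
```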
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA (Databricks)
EFSA is the European agency providing independent scientific advice on existing and emerging risks across the entire food chain. On 27/03/2021, a new EU regulation (EU 2019/1381) was enacted, requiring EFSA to significantly increase the transparency of its risk assessment processes towards all citizens.
To comply with this new regulation, delaware has been supporting EFSA in undergoing a large digital transformation program. We have been designing and rolling out a modern data platform running on Azure and powered by Databricks. This platform acts as a central control tower brokering data between a variety of applications. It is built around modularity principles, making it adaptable and versatile while keeping the overall ecosystem aligned with changing processes and data models. At the heart of the platform lie two important patterns:
1. An Event-Driven Architecture (EDA): enabling an extremely loosely coupled system landscape. By centrally brokering events in near real time, consumer applications can react immediately to events from producer applications as they occur. Event producers are decoupled from consumers via a publish/subscribe mechanism.
2. A central data store built around a lakehouse architecture. The lakehouse collects, organizes, and serves data across all stages of the data processing cycle, all data types, and all data volumes. Event streams from the EDA layer feed into the store as curated data blocks and are complemented by other sources. This store in turn feeds into APIs, reporting, and applications, including the new Open EFSA portal: a public website developed by delaware hosting all relevant scientific data, updated in near real time.
At delaware we are very excited about this project and proud of what we have achieved with EFSA so far.
Koalas is an open source project that provides pandas APIs on top of Apache Spark. Pandas is the standard tool for data science and it is typically the first step to explore and manipulate a data set, but pandas does not scale well to big data. Koalas fills the gap by providing pandas equivalent APIs that work on Apache Spark.
There are also many other libraries trying to scale pandas APIs, such as Vaex and Modin. Dask is one of them and is very popular among pandas users; it runs on its own cluster, similar to how Koalas runs on top of a Spark cluster. In this talk, we will introduce Koalas and its current status, and compare Koalas with Dask, including benchmarking.
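A short, hedged illustration of the pandas-on-Spark idea: the same pandas-style groupby written with the Koalas API, executed by Spark underneath.

```python
# Hedged illustration of Koalas: pandas-style syntax backed by Spark execution.
import databricks.koalas as ks

kdf = ks.DataFrame({"store": ["a", "a", "b"], "amount": [10.0, 20.0, 5.0]})
print(kdf.groupby("store")["amount"].sum())

pdf = kdf.to_pandas()  # collect to a local pandas DataFrame when it fits in memory
```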
How Adobe Does 2 Million Records Per Second Using Apache Spark! (Databricks)
Adobe’s Unified Profile System is the heart of its Experience Platform. It ingests TBs of data a day and is PBs large. As part of this massive growth we have faced multiple challenges in our Apache Spark deployment, which is used from ingestion to processing.
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F... (Databricks)
In the big data field, Spark SQL is an important data processing module that lets Apache Spark work with structured, row-based data in a majority of its operators. A field-programmable gate array (FPGA) with highly customized intellectual property (IP) can bring not only better performance but also lower power consumption by accelerating the CPU-intensive segments of an application.
Spark started at Facebook as an experiment when the project was still in its early phases. Spark's appeal stemmed from its ease of use and an integrated environment to run SQL, MLlib, and custom applications. At that time the system was used by a handful of people to process small amounts of data. However, we've come a long way since then. Currently, Spark is one of the primary SQL engines at Facebook in addition to being the primary system for writing custom batch applications. This talk will cover the story of how we optimized, tuned and scaled Apache Spark at Facebook to run on 10s of thousands of machines, processing 100s of petabytes of data, and used by 1000s of data scientists, engineers and product analysts every day. In this talk, we'll focus on three areas:
* Scaling Compute: How Facebook runs Spark efficiently and reliably on tens of thousands of heterogeneous machines in disaggregated (shared-storage) clusters.
* Optimizing Core Engine: How we continuously tune, optimize and add features to the core engine in order to maximize the useful work done per second.
* Scaling Users: How we make Spark easy to use, and faster to debug to seamlessly onboard new users.
Speakers: Ankit Agarwal, Sameer Agarwal
Spark Summit 2016: Connecting Python to the Spark Ecosystem (Daniel Rodriguez)
This document discusses connecting Python to the Spark ecosystem. It covers the PyData and Spark ecosystems, package management for Python libraries in a cluster, leveraging Spark with tools like Sparkonda and Conda, and using Python with Spark including with NLTK, scikit-learn, TensorFlow, and Dask. Future directions include Apache Arrow for efficient data structures and leveraging alternative computing frameworks like TensorFlow.
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma... (Databricks)
Building accurate machine learning models has long been an art of data scientists: algorithm selection, hyperparameter tuning, feature selection, and so on. Recently, efforts to break through these “black arts” have begun. We have developed a Spark-based automatic predictive modeling system. The system automatically searches for the best algorithm, the best parameters, and the best features without any manual work. In this talk, we will share how the automation system is designed to exploit the attractive advantages of Spark. Our evaluation with real open data demonstrates that our system can explore hundreds of predictive models and discover a highly accurate predictive model in minutes on an Ultra High Density Server, which employs 272 CPU cores, 2 TB of memory, and 17 TB of SSD in a 3U chassis. We will also share open challenges in learning such a massive number of models on Spark, particularly from reliability and stability standpoints.
Speeding Up Spark with Data Compression on Xeon+FPGA with David Ojika (Databricks)
Data compression is a key aspect in big data processing frameworks, such as Apache Hadoop and Spark, because compression enables the size of the input, shuffle and output data to be reduced, thus potentially speeding up overall processing time by orders of magnitude, especially for large-scale systems. However, since many compression algorithms with good compression ratio are also very CPU-intensive, developers are often forced to use algorithms that are less CPU-intensive at the cost of reduced compression ratio.
In this session, you’ll learn about a field-programmable gate array (FPGA)-based approach for accelerating data compression in Spark. By opportunistically offloading compute-heavy compression tasks to the FPGA, the CPU is freed to perform other tasks, resulting in improved overall performance for end-user applications. In contrast to existing GPU methods for acceleration, this approach affords more performance/energy efficiency, which can translate to significant savings in power and cooling costs, especially for large datacenters. In addition, this implementation offers the benefit of reconfigurability, allowing the FPGA to be rapidly reprogrammed with a different algorithm to meet system or user requirements.
Using the Intel Xeon+FPGA platform, Ojika will share how they ported Swif (simplified workload-intuitive framework) to Spark, and the method used to enable an end-to-end, FPGA-aware Spark deployment. Swif is an in-house framework developed to democratize and simplify the deployment of FPGAs in heterogeneous datacenters. Using Swif’s application programming interface (API), he’ll describe how system architects and software developers can seamlessly integrate FPGAs into their Spark workflow and, in particular, deploy FPGA-based compression schemes that achieve improved performance compared to software-only approaches. In general, Swif’s software stack, along with the underlying Xeon+FPGA hardware platform, provides a workload-centric processing environment that streamlines the process of offloading CPU-intensive tasks to shared FPGA resources, while providing improved system throughput and high resource utilization.
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi... (Databricks)
Structured Streaming has proven to be the best platform for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark’s built-in functions make it easy for developers to express complex computations. Delta Lake, on the other hand, is the best way to store structured data because it is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. Together, these can make it very easy to build pipelines in many common scenarios. However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem that needs to be solved. Apache Spark, being a unified analytics engine doing both batch and stream processing, often provides multiple ways to solve the same problem. So understanding the requirements carefully helps you to architect a pipeline that solves your business needs in the most resource-efficient manner.
In this talk, I am going to examine a number of common streaming design patterns in the context of the following questions.
WHAT are you trying to consume? What are you trying to produce? What is the final output that the business wants? What are your throughput and latency requirements?
WHY do you really have those requirements? Would solving the requirements of the individual pipeline actually solve your end-to-end business requirements?
HOW are you going to architect the solution? And how much are you willing to pay for it?
Clarity in understanding the ‘what and why’ of any problem automatically brings much clarity on ‘how’ to architect it using Structured Streaming and, in many cases, Delta Lake.
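As a reference point for these questions, here is a hedged, minimal shape of such a pipeline: read a stream, apply the business logic, and sink to a Delta table with checkpointing. The Kafka broker, topic, and paths are placeholders, and the spark-sql-kafka and Delta Lake packages are assumed to be on the classpath.

```python
# Hedged, minimal shape of a streaming ETL pipeline: source -> logic -> Delta sink.
# Broker, topic, and paths are placeholders; spark-sql-kafka and delta-core
# packages are assumed to be available.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-to-delta").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

parsed = events.select(F.col("key").cast("string"),
                       F.col("value").cast("string"),
                       F.col("timestamp"))

query = (parsed.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/delta/_checkpoints/events")
         .start("/delta/events"))

query.awaitTermination()
```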
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De... (Databricks)
This document summarizes a presentation on extending Spark SQL Data Sources APIs with join push down. The presentation discusses how join push down can significantly improve query performance by reducing data transfer and exploiting data source capabilities like indexes. It provides examples of join push down in enterprise data pipelines and SQL acceleration use cases. The presentation also outlines the challenges of network speeds and exploiting data source capabilities, and how join push down addresses these challenges. Future work discussed includes building a cost model for global optimization across data sources.
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day (C4Media)
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2mAKgJi.
Ian Nowland and Joel Barciauskas talk about the challenges Datadog faces as the company has grown its real-time metrics systems that collect, process, and visualize data to the point they now handle trillions of points per day. They also talk about how the architecture has evolved, and what they are looking to in the future as they architect for a quadrillion points per day. Filmed at qconnewyork.com.
Ian Nowland is the VP Engineering Metrics and Alerting at Datadog. Joel Barciauskas currently leads Datadog's distribution metrics team, providing accurate, low latency percentile measures for customers across their infrastructure.
- Oracle Database 11g Release 2 provides many advanced features to lower IT costs including in-memory processing, automated storage management, database compression, and real application testing capabilities.
- It allows for online application upgrades using edition-based redefinition which allows new code and data changes to be installed without disrupting the existing system.
- Oracle provides multiple upgrade paths from prior database versions to 11g to allow for predictable performance and a safe upgrade process.
ADV Slides: Comparing the Enterprise Analytic Solutions (DATAVERSITY)
Data is the foundation of any meaningful corporate initiative. Fully master the necessary data, and you’re more than halfway to success. That’s why leverageable (i.e., multiple use) artifacts of the enterprise data environment are so critical to enterprise success.
Build them once (keep them updated), and use again many, many times for many and diverse ends. The data warehouse remains focused strongly on this goal. And that may be why, nearly 40 years after the first database was labeled a “data warehouse,” analytic database products still target the data warehouse.
How to design a database that can ingest more than four million ... (javier ramirez)
In this session I will walk through the technical decisions we made while developing QuestDB, an open-source time-series database compatible with Postgres, and how we managed to write more than four million rows per second without blocking or slowing down queries.
I will talk about topics such as (zero) garbage collection, instruction vectorization using SIMD, rewriting instead of reusing to shave off microseconds, taking advantage of advances in processors, hard drives, and operating systems (for example, io_uring support), and the balance between user experience and performance when new features are proposed.
This is a 200-level run-through of the Microsoft Azure Big Data Analytics for the Cloud data platform, based on the Cortana Intelligence Suite offerings.
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi... (Cisco DevNet)
Data gravity is a reality when dealing with massive amounts of data and globally distributed systems. Processing this data requires distributed analytics processing across the InterCloud. In this presentation we will share our real-world experience with storing, routing, and processing big data workloads on Cisco Cloud Services and Amazon Web Services clouds.
Estimating the Total Costs of Your Cloud Analytics Platform (DATAVERSITY)
Organizations today need a broad set of enterprise data cloud services with key data functionality to modernize applications and utilize machine learning. They need a platform designed to address multi-faceted needs by offering multi-function Data Management and analytics to solve the enterprise’s most pressing data and analytic challenges in a streamlined fashion. They need a worry-free experience with the architecture and its components.
Anil Desai presented on monitoring and optimizing SQL Server performance. The presentation covered monitoring tools like SQL Profiler and Performance Monitor, using the Database Engine Tuning Advisor to analyze workloads and optimize physical database structures, application design tips to improve performance, and troubleshooting common problems like blocking, locking, and deadlocks. The presentation provided an overview of SQL Server monitoring and performance optimization techniques.
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010 (Bhupesh Bansal)
Jan 22nd, 2010 Hadoop meetup presentation on Project Voldemort and how it plays well with Hadoop at LinkedIn. The talk focuses on the LinkedIn Hadoop ecosystem: how LinkedIn manages complex workflows, data ETL, data storage, and online serving of 100 GB to TBs of data.
The document discusses Project Voldemort, a distributed key-value storage system developed at LinkedIn. It provides an overview of Voldemort's motivation and features, including high availability, horizontal scalability, and consistency guarantees. It also describes LinkedIn's use of Voldemort and Hadoop for applications like event logging, online lookups, and batch processing of large datasets.
Migration to ClickHouse. Practical guide, by Alexander Zaitsev (Altinity Ltd)
This document provides a summary of migrating to ClickHouse for analytics use cases. It discusses the author's background and company's requirements, including ingesting 10 billion events per day and retaining data for 3 months. It evaluates ClickHouse limitations and provides recommendations on schema design, data ingestion, sharding, and SQL. Example queries demonstrate ClickHouse performance on large datasets. The document outlines the company's migration timeline and challenges addressed. It concludes with potential future integrations between ClickHouse and MySQL.
SSIS Best Practices, Israel BI User Group, Itay Braun (sqlserver.co.il)
This document provides best practices and recommendations for SQL Server Integration Services (SSIS). It discusses topics such as logging package runtime information, establishing performance baselines, package configuration, lookup optimization, data profiling, resource utilization, and network optimization. The document also provides tips on narrowing data types, sorting data, using SQL for set operations, and change data capture functionality.
The document discusses the benefits and challenges of running big data workloads on cloud native platforms. Some key points discussed include:
- Big data workloads are migrating to the cloud to take advantage of scalability, flexibility and cost effectiveness compared to on-premises solutions.
- Enterprise cloud platforms need to provide centralized management and monitoring of multiple clusters, secure data access, and replication capabilities.
- Running big data on cloud introduces challenges around storage, networking, compute resources, and security that systems need to address, such as consistency issues with object storage, network throughput reductions, and hardware variations across cloud vendors.
- The open source community is helping users address these challenges to build cloud native data architectures
Building a High Performance Analytics Platform (Santanu Dey)
The document discusses using flash memory to build a high performance data platform. It notes that flash memory is faster than disk storage and cheaper than RAM. The platform utilizes NVMe flash drives connected via PCIe for high speed performance. This allows it to provide in-memory database speeds at the cost and density of solid state drives. It can scale independently by adding compute nodes or storage nodes. The platform offers a unified database for both real-time and analytical workloads through common APIs.
This document provides recommendations for preparing infrastructure for the future. It suggests choosing a long-term supported Linux distribution like RHEL, CentOS, or Ubuntu LTS. Infrastructure should be designed to be scalable, robust, manageable, resilient, and cost-effective. Services should be split across redundant servers and high availability/failover implemented. Automation, configuration management, and continuous integration/deployment are also recommended. Comprehensive monitoring of infrastructure and applications is important for capacity planning and issue detection. The document provides additional tips regarding caching, security, change management, backups, and ongoing re-architecting.
Black Friday and Cyber Monday: Best Practices for Your E-Commerce Database (Tim Vaillancourt)
This document provides best practices for scaling e-commerce databases for Black Friday and Cyber Monday. It discusses scaling both synchronous and asynchronous applications, efficiently using data at scale through techniques like caching, queues, and counters. It also covers scaling out through techniques like sharding, pre-sharding, and kill switches. Testing performance and capacity, as well as asking the right questions at development time are also recommended.
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture (DATAVERSITY)
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Avoid building the data swamp, but not the data lake! The tool ecosystem is building up around the data lake and soon many will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
Serverless SQL provides a serverless analytics platform that allows users to analyze data stored in object storage without having to manage infrastructure. Key features include seamless elasticity, pay-per-query consumption, and the ability to analyze data directly in object storage without having to move it. The platform includes serverless storage, data ingest, data transformation, analytics, and automation capabilities. It aims to create a sharing economy for analytics by allowing various users like developers, data engineers, and analysts flexible access to data and analytics.
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
Data Lakehouse Symposium | Day 1 | Part 1 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop (Databricks)
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform (Databricks)
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
- Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
- Performing data quality validations using libraries built to work with Spark
- Dynamically generating pipelines that can be abstracted away from users
- Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
- Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
Learn to Use Databricks for Data Science (Databricks)
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML Monitoring (Databricks)
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix (Databricks)
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Simplify Data Conversion from Spark to TensorFlow and PyTorch (Databricks)
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in Parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in Spark DataFrames. In short, all of this engineering friction greatly reduces data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify this tedious conversion process. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a TensorFlow model and how simple it is to go from single-node training to distributed training on Databricks.
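For orientation, here is a minimal sketch of what that converter workflow looks like with the public Petastorm API; the DataFrame, cache path, and training calls are placeholders rather than the talk's actual example.

```python
from pyspark.sql import SparkSession
from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark = SparkSession.builder.getOrCreate()

# Petastorm materializes the DataFrame to a cache directory (placeholder path).
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///tmp/petastorm_cache")

df_train = spark.table("preprocessed_features")   # hypothetical Spark DataFrame
converter = make_spark_converter(df_train)        # caches the data as Parquet

# TensorFlow: consume the cached data as a tf.data.Dataset.
with converter.make_tf_dataset(batch_size=64) as tf_dataset:
    pass  # e.g. model.fit(tf_dataset, steps_per_epoch=..., epochs=...)

# PyTorch: the same converter yields a DataLoader-style iterator.
with converter.make_torch_dataloader(batch_size=64) as dataloader:
    pass  # iterate over batches in the training loop
```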
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
There is no doubt Kubernetes has emerged as the next generation of cloud-native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both machine learning and large-scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes with the scalable data processing of Apache Spark, you can run data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and how to orchestrate data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
– Understanding key traits of Apache Spark on Kubernetes
– Things to know when running Apache Spark on Kubernetes, such as autoscaling
– A demonstration of running analytics pipelines on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
Pipelines have become ubiquitous, as the need to string multiple functions together to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python scikit-learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications; however, it is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as scikit-learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
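The pipeline abstraction described in the talk is not shown here; purely as orientation, the toy sketch below illustrates the Ray task primitives (remote functions and futures) that such a pipeline layer maps fit/transform stages onto. The data shards and the scikit-learn transformer are made-up examples.

```python
import numpy as np
import ray
from sklearn.preprocessing import StandardScaler

ray.init(ignore_reinit_error=True)

@ray.remote
def transform_shard(transformer, shard):
    # Each "transform" stage runs as an independent Ray task on one data shard.
    return transformer.fit_transform(shard)

shards = [np.random.rand(1_000, 5) for _ in range(4)]          # synthetic data shards
futures = [transform_shard.remote(StandardScaler(), s) for s in shards]
results = ray.get(futures)                                      # scale-out execution
```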
Sawtooth Windows for Feature AggregationsDatabricks
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency, and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift caused by operations over change data that are not “abelian groups”.
We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1: Long-running Spark batch job – dispatch new jobs by polling a Redis queue
· Why? Custom queries on top of a table; we load the data once and query N times
· Why not Structured Streaming?
· A working solution using Redis
Niche 2: Distributed counters (a minimal sketch of this pattern follows below)
· Problems with Spark accumulators
· Utilize Redis hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
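As a rough illustration of the distributed-counter niche (not Adobe's production code), the sketch below increments a Redis hash from Spark executors with redis-py, using pipelining; the host, key, and input path are placeholders, and retries or speculative execution would still need idempotency handling.

```python
import redis
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def count_partition(rows):
    # One connection per partition; pipeline the HINCRBY calls to cut round trips.
    r = redis.Redis(host="redis.internal.example", port=6379)     # placeholder host
    pipe = r.pipeline()
    for row in rows:
        pipe.hincrby("job:123:status_counts", row["status"], 1)   # hypothetical key/field
    pipe.execute()
    # Caution: with task retries or speculative execution, increments can be applied
    # twice; guard with idempotency keys or disable speculation for this job.

events = spark.read.json("s3://bucket/events/")   # placeholder input
events.rdd.foreachPartition(count_partition)
```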
Re-imagine Data Monitoring with whylogs and SparkDatabricks
In the era of microservices, decentralized ML architectures, and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly and cumbersome, while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open-source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language- and platform-agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch, and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
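As a minimal local illustration (assuming the open-source whylogs v1 package; this is not the Spark-scale integration covered in the talk), profiling a small pandas frame looks roughly like this:

```python
import pandas as pd
import whylogs as why

df = pd.DataFrame({"amount": [10.5, 3.2, 99.0], "country": ["US", "DE", "US"]})

results = why.log(df)                    # builds a lightweight statistical profile
profile_view = results.profile().view()  # immutable summary of the profile
print(profile_view.to_pandas())          # per-column metrics (counts, types, distributions)
```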
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that:
(i) reduce unnecessary computations by passing information between the data processing and ML operators;
(ii) leverage operator transformations (e.g., turning a decision tree into a SQL expression or an equivalent neural network) to map operators to the right execution engine; and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta LakeDatabricks
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data with various linkage scenarios, powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels, such as email and advertisements. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake, and share our experiences.
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade Offs with Various formats
Anti-patterns used (String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
Machine Learning CI/CD for Email Attack DetectionDatabricks
Detecting advanced email attacks at scale is a challenging ML problem, particularly due to the rarity of attacks, the adversarial nature of the problem, and the scale of data. In order to move quickly and adapt to the newest threats, we needed to build a Continuous Integration / Continuous Delivery pipeline for the entire ML detection stack. Our goal is to enable detection engineers and data scientists to make changes to any part of the stack, including joined datasets for hydration, feature extraction code, and detection logic, and to develop and train ML models.
In this talk, we discuss why we decided to build this pipeline, how it is used to accelerate development and ensure quality, and dive into the nitty-gritty details of building such a system on top of an Apache Spark + Databricks stack.
apidays New York 2025 - CIAM in the wild by Michael Gruen (Layr)apidays
CIAM in the wild: What we learned while scaling from 1.5 to 3 million users
Michael Gruen, VP of Engineering at Layr
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
Convene 360 Madison, New York
May 14 & 15, 2025
apidays New York 2025 - Spring Modulith Design for Microservices by Renjith R...apidays
Spring Modulith Design for Microservices
Renjith Ramachandran, Senior Solutions Architect at BJS Wholesale Club
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
Convene 360 Madison, New York
May 14 & 15, 2025
apidays New York 2025 - Building Agentic Workflows with FDC3 Intents by Nick ...apidays
Building Agentic Workflows with FDC3 Intents
Nick Kolba, Co-founder & CEO at Connectifi
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
Convene 360 Madison, New York
May 14 & 15, 2025
PSUG 7 - 2025-06-03 - David Bianco on Splunk SURGeTomas Moser
Meet David Bianco, Staff Strategist with Splunk’s elite SURGe team, live in Prague. Get ready for an engaging deep dive into the cutting edge of cybersecurity—straight from the experts driving Splunk’s global security research.
apidays New York 2025 - Fast, Repeatable, Secure: Pick 3 with FINOS CCC by Le...apidays
Fast, Repeatable, Secure: Pick 3 with FINOS CCC
Leigh Capili, Kubernetes Contributor at Control Plane
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
May 14 & 15, 2025
apidays New York 2025 - Open Source and disrupting the travel distribution ec...apidays
Open Source and disrupting the travel distribution ecosystem
Stu Waldron, Advisor & Acting Director at OpenTravel
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
Convene 360 Madison, New York
May 14 & 15, 2025
apidays New York 2025 - Computers are still dumb by Ben Morss (DeepL)apidays
Computers are still dumb: bringing your AI magic to enterprises
Ben Morss, Developer Evangelist at DeepL
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
Convene 360 Madison, New York
May 14 & 15, 2025
apidays New York 2025 - Building Scalable AI Systems by Sai Prasad Veluru (Ap...apidays
Building Scalable AI Systems: Cloud Architecture for Performance
Sai Prasad Veluru, Software Engineer at Apple Inc
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
May 14 & 15, 2025
apidays New York 2025 - Why I Built Another Carbon Measurement Tool for LLMs ...apidays
Why I Built Another Carbon Measurement Tool for LLMs (And What I Learned Along the Way)
Pascal Joly, Sustainability Consultant and Instructor at IT Climate Ed
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
May 14 & 15, 2025
Ever wondered how to inject your dashboards with the power of Python? This presentation shows how combining Tableau with Python can unlock advanced analytics, predictive modeling, and automation that will make your dashboards not just smarter, but practically psychic.
Best Practices for Building Robust Data Platform with Apache Spark and Delta
1. Best Practices for Building a Robust Data Platform with Apache Spark & Delta
Vini Jaiswal
Spark+AI Summit - June 2020
https://www.linkedin.com/in/vinijaiswal/
2. Agenda
▪ Data Strategy: optimizing the cost to drive business value
▪ Performance and tuning with Delta Lake & Apache Spark
▪ Governance and security controls
▪ Bringing it all together - a reference architecture
4. Data Challenges
Data warehouses limit the potential of intelligence:
● Data volume is growing rapidly
● More variety of data -> different applications
● Need for faster processing and scalability
● Data silos limit innovation
The promise of the data lake:
1. Collect everything
2. Store it all in the data lake
3. Data science & machine learning
7. Usual Data Lake
The “ideal” data lake in practice comes with no atomicity, no quality enforcement, and no consistency / isolation: ✗ reliability, ✗ high-quality data.
Reliability and high-quality data require:
● Schema Enforcement
● ACID Transactions
● Time Travel
● Open Standards, Open Source
● Powered by
● Unifies Streaming / Batch
References: https://youtu.be/qtCxNSmTejk
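To make the feature list concrete, here is a small PySpark sketch (placeholder paths, assuming a Delta-enabled Spark session) touching ACID writes, schema enforcement, and time travel:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# ACID write: the table version is either fully committed or not visible at all.
events = spark.range(100).withColumnRenamed("id", "event_id")
events.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Schema enforcement: appending a mismatched schema raises an error instead of
# silently corrupting the table.
# spark.range(10).withColumnRenamed("id", "wrong_column") \
#      .write.format("delta").mode("append").save("/tmp/delta/events")

# Time travel: read the table as of an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")
```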
8. Getting the Data Right
Audience segmentation across data types (CSV, JSON, TXT, ...):
● Bronze: raw ingestion
● Silver: filtered, cleaned, augmented
● Gold: business-level aggregates
Table categorization should align with business outcomes: Is my data use case worthy? Is my data ready for analytics / ML?
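A condensed sketch of that Bronze → Silver → Gold flow with Delta tables might look as follows (paths, source, and columns are placeholders, not the presenter's pipeline):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: raw ingestion, stored as-is.
raw = spark.read.json("s3://bucket/raw/clicks/")                 # placeholder source
raw.write.format("delta").mode("append").save("/delta/bronze/clicks")

# Silver: filtered, cleaned, augmented.
bronze = spark.read.format("delta").load("/delta/bronze/clicks")
silver = bronze.filter(F.col("user_id").isNotNull()).dropDuplicates(["event_id"])
silver.write.format("delta").mode("overwrite").save("/delta/silver/clicks")

# Gold: business-level aggregates ready for analytics / ML.
gold = silver.groupBy("user_id").agg(F.count("*").alias("click_count"))
gold.write.format("delta").mode("overwrite").save("/delta/gold/click_counts")
```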
10. Best Practices for Cluster Sizing & Selection
1. Selection of Instance Types
a. Workload type
b. Use cases
2. Selection of node size
a. Observe Metrics
b. Tweak workloads
12. Selection of Instance Types
● Memory Optimized (AWS: r5, Azure: Dsv2): memory-intensive applications. Use case: ML workload with data caching.
● Compute Optimized (AWS: c5, Azure: Fsv2): Structured Streaming, distributed analytics, data science applications. Use case: ETL with full file scans and no data reuse.
● Storage Optimized (AWS: i3, Azure: Lsv2): use cases that require higher disk throughput and IO. Use case: analytics, using the storage-optimized i3 class with the Delta IO cache.
Reference for Azure types: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes
Reference for AWS types: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html
14. Selection of node size
Rule of thumb
1. Fewer big instances > more small instances
a. (larger heap = larger GC)
b. Multiple executors per machine
2. Size based on the number of tasks initially, tweak later
a. Run the job with a small cluster to get an idea of the number of tasks
b. Observe cluster metrics for CPU, memory and network utilization
15. Observe Spark UI & tweak the workloads
● Fully cached with room to spare? > Decrease instances
● Almost completely cached? > Increase cluster size
● Not even close to cached? > Consider an instance with SSD instead of EBS, or use the R class
Reference: https://docs.databricks.com/delta/optimizations/delta-cache.html#delta-and-rdd-cache-comparison
16. Observe Ganglia Metrics & tweak the workloads
○ Are we compute bound?
○ Are we network bound?
○ Are we spilling a ton?
18. Performance Symptoms
Can I make my Spark application run faster? Look for these four symptoms:
● Shuffle
● Spill
● Skew
● Small files
19. I found Shuffle, now what?
● Use broadcast join
● Review join order
Before (sort-merge join): query completion time 28 minutes, rows output: 2,509,189,313
After (broadcast join): query completion time 1.8 minutes, rows output: 1,023
Reference: https://spark.apache.org/docs/latest/sql-performance-tuning.html#broadcast-hint-for-sql-queries
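In code, the fix amounts to hinting the small side of the join; table and column names below are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

facts = spark.table("fact_events")        # large, shuffled side
dims = spark.table("dim_customers")       # small dimension table

# Broadcasting the small table avoids the sort-merge join and its shuffle.
joined = facts.join(broadcast(dims), "customer_id")
```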
20. I found Spill, now what?
● Increase shuffle partitions (in this example, from 16 to 48): set spark.sql.shuffle.partitions=48
● Reduce the number of cores: spark.executor.cores < total cores per worker
● Use a larger cluster with faster disks (SSDs)
The more spill you can remove, the larger the impact!
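In code, the slide's remediation is a one-line configuration change (48 is just the slide's example value):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Raise the shuffle partition count so each task's shuffle block fits in memory.
spark.conf.set("spark.sql.shuffle.partitions", 48)

# Equivalently, in SQL, as on the slide:
spark.sql("SET spark.sql.shuffle.partitions=48")
```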
21. I found Skew, now what?
Symptoms:
● Ganglia CPU usage stays low for a long time after the initial high usage
● Task duration: significant difference between the max and the 75% / 25% values
● Input size / records
What to do?
● Use broadcast join
● Use skew join
● Filter out large keys / salt the keys and set up multiple reduce steps
● Explicitly repartition the data on a different field
Reference: https://docs.databricks.com/delta/join-performance/skew-join.html
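The salting option can be sketched roughly like this (column names and the bucket count are placeholders): replicate the lookup side across N salt values, add a random salt to the skewed side, join on both keys, then drop the salt.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
N = 16  # number of salt buckets (tune to the degree of skew)

events = spark.table("events")          # skewed on "customer_id"
customers = spark.table("customers")    # lookup side, replicated N times below

salted_events = events.withColumn("salt", (F.rand() * N).cast("int"))
salted_customers = customers.crossJoin(
    spark.range(N).withColumnRenamed("id", "salt"))

joined = salted_events.join(salted_customers, ["customer_id", "salt"]).drop("salt")
```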
22. Adaptive Query Execution
● Reduces the manual effort of tuning spark.sql.shuffle.partitions
● Turned off by default; set spark.sql.adaptive.enabled=true
● Dynamically changes sort-merge join into broadcast-hash join
● Dynamically optimizes skew joins
*Available in DBR 7.x / Spark 3.0
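On Spark 3.0 / DBR 7.x, the relevant switches are plain configuration keys:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.adaptive.enabled", "true")            # turn on AQE
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")   # dynamic skew-join handling
```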
23. I found a lot of small files, now what?
Upstream:
● Fix the upstream application that is building tons of files
● Use a separate tool to compact them before processing with Spark
Changes in the Spark application:
● Write your own compaction job
● Delta solves this problem!
25. Compaction
● Improves read performance
● Solves the small-files problem
Reference: https://docs.delta.io/latest/best-practices.html#compact-files
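The referenced best-practices page describes compaction as a repartition-and-overwrite that does not change the table contents; a sketch with a placeholder path and file count:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

path = "/delta/events"   # placeholder Delta table path
num_files = 16           # target number of files

(spark.read.format("delta").load(path)
    .repartition(num_files)
    .write
    .option("dataChange", "false")   # rewrite files without changing the data
    .format("delta")
    .mode("overwrite")
    .save(path))
```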
26. Auto Optimize
Auto Optimize consists of two complementary features: Optimized Writes and Auto Compaction.
● Optimizes Apache Spark partitions
● Maximizes the throughput of data being written
● Compacts files within partitions
Reference: https://docs.databricks.com/delta/optimizations/auto-optimize.html#auto-optimize
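On Databricks, both features can be enabled per table through Delta table properties; "events" below is a placeholder table name:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("""
  ALTER TABLE events SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact'   = 'true'
  )
""")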
29. Data Governance with Delta Lake
● Data lake: create a retention policy to age out and erase raw data that may contain personal information
● Historical data repository: easy to navigate, with pseudonymization
● High-level aggregates (e.g. # of users that took an action): create tables that don't contain personal data
● Satisfy compliance requests using UPDATE / DELETE commands
Reference: https://www.youtube.com/watch?v=tCPslvUjG1w
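A small sketch of the compliance levers mentioned above, using the Delta Lake Python API (paths, predicates, and the retention window are placeholders):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

users = DeltaTable.forPath(spark, "/delta/silver/users")

# Satisfy an erasure request with a targeted DELETE.
users.delete("user_id = 'user-123'")

# Age out files that fall outside the retention window (here ~30 days).
users.vacuum(retentionHours=30 * 24)
```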
31. Governance - The Who/What/Where
Many players in the org: managing access, roles and responsibilities, as well as managing usage, is a must.
● Data Scientist: drives product innovation with state-of-the-art machine learning models applied to big data
● Data Engineer: performs standard extraction, transformation and loading (ETL) tasks and applies best coding practices, including source control, unit tests, and automation
● Data/Business Analyst: improves business processes by providing standardized and ad-hoc business analysis; acts as an intermediary between the analytics and business teams
● Automated Jobs: perform automated jobs based on data engineering configs
32. Bringing it together - A reference pipeline
Key building blocks of the reference architecture:
● Ingest: BLOB storage, DB/DW, streaming sources
● ETL / data processing: Delta pipelines (Bronze, Silver, Gold) for massively scalable data cleansing & transformation
● Data science & ML: ML Runtime, notebooks, models
● Serving the business unit: APIs, jobs, dashboards
● Execution: Databricks Runtime (reliability & performance), optimized Spark, clusters, storage
● Surrounding concerns: orchestration, CI/CD, operations & security
33. Bringing it together - A reference pipeline (recap)
The same reference pipeline, annotated with the four themes of the talk: data strategy, cost optimization & performance tuning, business value, and security.