SparkCruise: Automatic Computation Reuse in Apache Spark

Jun 29, 20200 likes399 views

Queries in production workloads and interactive data analytics are often overlapping, i.e., multiple queries share parts of the computation.

SparkCruise:
Automatic Computation Reuse in Spark
Abhishek Roy, Priyanka Gomatam
Microsoft

Agenda
1. SparkCruise
▪ Workload Optimization
▪ Computation Reuse Problem
▪ System Design
▪ Demonstration and Results
2. Workload Insights Notebook
▪ Motivation
▪ Demonstration

The Overwhelmed DBA
Database Administrators (DBA) tune the
queries with sophisticated admin tools
Schedule maintenance tasks (e.g., collect
statistics, create views) in offline cycles
DBAs and cloud developers can tune but lack
the tools to optimize massive workloads
No offline cycles in serverless systems
Cloud database servicesOn-premise databases
Workload
DS
DBA

Workload Optimization
Learn from past query workloads
Feedback LoopWhy Workload Optimization
Optimizer
Workload-based
feedback loop in Spark
From Query Optimizer
to Workload Optimizer
Excellent query optimizers +CBO +AQE
Production
Workloads
Many optimization opportunities
Reduce total cost for user
Adapt to changing workloads
Continuous feedback loop to optimizer
Query Workloads

Computation Reuse Problem
Overlapping queries
▪ Parts of computation are
duplicated across queries
▪ Overlapping queries
▪ 95% in TPC-DS
▪ 60% in production
Workloads are not fixed
but recurring
▪ Queries are run repeatedly, e.g., hourly, or daily, over
changing data possibly with different parameters
σ
Q1
σ
γ
Q2

Reuse with SparkCruise
σ
Q1 Q2
σ
γ
Q1
σ
Q2
γ
Workload on Day 2
Workload on Day 1
SparkCruise = ON

Catalyst Optimizer
SQL
SparkCruise Design
INGESTIONANALYSISFEEDBACK
ANNOTATED
QUERY PLANS
OPTIMIZER
EXTENSIONS

TPC-DS Results
Wall Clock Time ↓ 29%
CPU Time ↓ 31%
Geomean ↓ 33%
Running Time Ratio on
TPC-DS workload

SparkCruise Summary
Workload-based feedback loop in Spark query engine
Select high utility common computations
Automatic materialization and reuse
Works with out-of-the-box Spark 2.3+
Hands-free computation reuse system for Spark

Workload Insights Notebook
Users can analyze individual queries, but
not the complete workload
Per Query View of Spark History Server
Will SparkCruise benefit my workload?
Inform users about
Redundancies in their workload
Potential savings from SparkCruise

Workload Insights Notebook
Different entities in Spark
Creates tabular representation
of workload
Available for instant querying
Metadata Metrics

Resources
Software
SparkCruise available as an experimental feature on Azure HDInsight
To be released on Azure Synapse
Papers
Peregrine: Workload Optimization for Cloud Query Engines. SoCC 2019.
SparkCruise: Handsfree Computation Reuse in Spark. VLDB 2019 (Demo).
Contact
Abhishek Roy (abhishek.roy atmicrosoftdotcom)
Priyanka Gomatam (priyanka.gomatam atmicrosoftdotcom)

This document discusses approaches to processing data in Spark for a credit card rewards use case at Capital One. It describes filtering data at each stage versus enriching the data. Enriching avoids issues with filtering like difficulty debugging and back tracing data by keeping enriched data from each stage together rather than filtering it out. Enriching provides more insight into why data drops out at each stage and is now used successfully in Capital One's production Spark job that processes millions of transactions daily to award customer rewards.

Enabling Scalable Data Science Pipeline with Mlflow at Thermo Fisher ScientificDatabricks

Building a Streaming Microservice Architecture: with Apache Spark Structured ...Databricks

Running Apache Spark on Kubernetes: Best Practices and PitfallsDatabricks

Cloud-Native Apache Spark Scheduling with YuniKorn SchedulerDatabricks

Willump: Optimizing Feature Computation in ML InferenceDatabricks

Best Practices for Building Robust Data Platform with Apache Spark and DeltaDatabricks

The Revolution Will be StreamedDatabricks

Democratizing PySpark for Mobile Game PublishingDatabricks

Democratizing PySpark for Mobile Game Publishing at Zynga Zynga aimed to standardize on PySpark and make it accessible to all analytics teams. This was done through trainings, documentation, and mentoring. Teams are now responsible for data products in production using features like Pandas UDFs. Democratizing PySpark resulted in novel applications like propensity modeling, anomaly detection, and reinforcement learning models in production.

On Improving Broadcast Joins in Apache Spark SQLDatabricks

- The document discusses improving broadcast joins in Apache Spark SQL, which are more efficient than shuffle joins when the broadcasted data fits in memory. - Experimenting with increasing the broadcast threshold showed that executor-side broadcasting performs better than driver-side broadcasting by avoiding data shuffling to the driver. - Comparing the cost models of shuffle joins and broadcast joins showed that shuffle joins perform better with more cores while broadcast joins perform better when the size difference between tables is larger. - Applying these techniques to joins in Workday HR customer data pipelines showed that increasing the broadcast threshold did not always improve performance due to the presence of self-joins and outer joins.

Leveraging Apache Spark for Scalable Data Prep and Inference in Deep LearningDatabricks

Productionizing Machine Learning with a Microservices ArchitectureDatabricks

Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data AnalyticsDatabricks

Building a Real-Time Feature Store at iFoodDatabricks

Deploy and Serve Model from Azure Databricks onto Azure Machine LearningDatabricks

The document discusses deploying a model trained in Azure Databricks onto Azure Machine Learning. It covers model training in Databricks, packaging the model and storing it in Azure Blob Storage, registering the model with Azure ML, deploying it to an Azure Kubernetes Service cluster, and serving it as a web service. Demo sections show training a model for semantic type detection in Databricks and deploying it using Azure ML. The goal is to make model deployment and consumption seamless across Azure services.

Lessons Learned from Modernizing USCIS Data Analytics PlatformDatabricks

End-to-End Data Pipelines with Apache SparkBurak Yavuz

Superworkflow of Graph Neural Networks with K8S and FugueDatabricks

This document introduces a superworkflow for running Node2Vec on graphs using the Fugue framework on Kubernetes. It describes the Node2Vec algorithm and different steps in the superworkflow, including graph creation and indexing, random walks, Word2Vec preprocessing, and embedding training. The superworkflow provides advantages like parallelizing steps, efficient resource usage through auto-persist and checkpointing. Benchmark results show the superworkflow reduces runtime significantly compared to Spark MLlib, such as reducing a 100M node graph embedding from 6,800 CPU hours to 100 CPU hours and 16 GPU hours. Open source links for the Node2Vec on Fugue project are also provided.

Bring Satellite and Drone Imagery into your Data Science WorkflowsDatabricks

Scaling Data and ML with Apache Spark and FeastDatabricks

Efficiently Building Machine Learning Models for Predictive Maintenance in th...Databricks

Spark Summit EU talk by Josef HabdankSpark Summit

Is This Thing On? A Well State Model for the PeopleDatabricks

The document discusses using machine learning models to determine well production state (on vs off) from sensor data. It presents an existing data architecture and issues with data quality. A supervised learning model is proposed using a decision tree trained on labeled rod pump production data. The modeling workflow includes data preprocessing, feature engineering, hyperparameter tuning and grid search. Decision trees are chosen for their interpretability but the document notes larger models may perform better. Overall production state modeling could help optimize operations and outperform existing controllers.

Downscaling: The Achilles heel of Autoscaling Apache Spark ClustersDatabricks

Adding nodes at runtime (Upscale) to already running Spark-on-Yarn clusters is fairly easy. But taking away these nodes (Downscale) when the workload is low at some later point of time is a difficult problem. To remove a node from a running cluster, we need to make sure that it is not used for compute as well as storage. But on production workloads, we see that many of the nodes can’t be taken away because: Nodes are running some containers although they are not fully utilized i.e., containers are fragmented on different nodes. Example. – each node is running 1-2 containers/executors although they have resources to run 4 containers. Nodes have some shuffle data in the local disk which will be consumed by Spark application running on this cluster later. In this case, the Resource Manager will never decide to reclaim these nodes because losing shuffle data could lead to costly recomputation of stages. In this talk, we will talk about how we can improve downscaling in Spark-on-YARN clusters under the presence of such constraints. We will cover changes in scheduling strategy for container allocation in YARN and Spark task scheduler which together helps us achieve better packing of containers. This makes sure that containers are defragmented on fewer set of nodes and thus some nodes don’t have any compute. In addition to this, we will also cover enhancements to Spark driver and External Shuffle Service (ESS) which helps us to proactively delete shuffle data which we already know has been consumed. This makes sure that nodes are not holding any unnecessary shuffle data – thus freeing them from storage and hence available for reclamation for faster downscaling.

Tuning ML Models: Scaling, Workflows, and ArchitectureDatabricks

This document discusses best practices for tuning machine learning models. It covers architectural patterns like single-machine versus distributed training and training one model per group. It also discusses workflows for hyperparameter tuning including setting up full pipelines before tuning, evaluating metrics on validation data, and tracking results for reproducibility. Finally it provides tips for handling code, data, and cluster configurations for distributed hyperparameter tuning and recommends tools to use.

Operational Tips For Deploying Apache SparkDatabricks

Operational Tips for Deploying Apache Spark provides an overview of Apache Spark configuration, pipeline design best practices, and debugging techniques. It discusses how to configure Spark through command line options, programmatically, and Hadoop configs. It also covers topics like file formats, compression codecs, partitioning, and monitoring Spark jobs. The document provides tips on common issues like OutOfMemoryErrors, debugging SQL queries, and tuning shuffle partitions.

SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKzmhassan

As spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF), can be applied to monitor and archive system performance data in a containerized spark environment. In our examples, we will gather spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.

Track A-2 基於 Spark 的數據分析Etu Solution

Spark is a general purpose computational framework that provides more flexibility than MapReduce. It leverages distributed memory and uses directed acyclic graphs for data parallel computations while retaining MapReduce properties like scalability, fault tolerance, and data locality. Cloudera has embraced Spark and is working to integrate it into their Hadoop ecosystem through projects like Hive on Spark and optimizations in Spark Core, MLlib, and Spark Streaming. Cloudera positions Spark as the future general purpose framework for Hadoop, while other specialized frameworks may still be needed for tasks like SQL, search, and graphs.

More Related Content

What's hot (20)

Best Practices for Building Robust Data Platform with Apache Spark and DeltaDatabricks

The Revolution Will be StreamedDatabricks

Democratizing PySpark for Mobile Game PublishingDatabricks

On Improving Broadcast Joins in Apache Spark SQLDatabricks

Leveraging Apache Spark for Scalable Data Prep and Inference in Deep LearningDatabricks

Productionizing Machine Learning with a Microservices ArchitectureDatabricks

Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data AnalyticsDatabricks

Building a Real-Time Feature Store at iFoodDatabricks

Deploy and Serve Model from Azure Databricks onto Azure Machine LearningDatabricks

Lessons Learned from Modernizing USCIS Data Analytics PlatformDatabricks

End-to-End Data Pipelines with Apache SparkBurak Yavuz

Superworkflow of Graph Neural Networks with K8S and FugueDatabricks

Bring Satellite and Drone Imagery into your Data Science WorkflowsDatabricks

Scaling Data and ML with Apache Spark and FeastDatabricks

Efficiently Building Machine Learning Models for Predictive Maintenance in th...Databricks

Spark Summit EU talk by Josef HabdankSpark Summit

Is This Thing On? A Well State Model for the PeopleDatabricks

Downscaling: The Achilles heel of Autoscaling Apache Spark ClustersDatabricks

Tuning ML Models: Scaling, Workflows, and ArchitectureDatabricks

Operational Tips For Deploying Apache SparkDatabricks

Best Practices for Building Robust Data Platform with Apache Spark and DeltaDatabricks

The Revolution Will be StreamedDatabricks

Democratizing PySpark for Mobile Game PublishingDatabricks

On Improving Broadcast Joins in Apache Spark SQLDatabricks

Leveraging Apache Spark for Scalable Data Prep and Inference in Deep LearningDatabricks

Productionizing Machine Learning with a Microservices ArchitectureDatabricks

Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data AnalyticsDatabricks

Building a Real-Time Feature Store at iFoodDatabricks

Deploy and Serve Model from Azure Databricks onto Azure Machine LearningDatabricks

Lessons Learned from Modernizing USCIS Data Analytics PlatformDatabricks

End-to-End Data Pipelines with Apache SparkBurak Yavuz

Superworkflow of Graph Neural Networks with K8S and FugueDatabricks

Bring Satellite and Drone Imagery into your Data Science WorkflowsDatabricks

Scaling Data and ML with Apache Spark and FeastDatabricks

Efficiently Building Machine Learning Models for Predictive Maintenance in th...Databricks

Spark Summit EU talk by Josef HabdankSpark Summit

Is This Thing On? A Well State Model for the PeopleDatabricks

Downscaling: The Achilles heel of Autoscaling Apache Spark ClustersDatabricks

Tuning ML Models: Scaling, Workflows, and ArchitectureDatabricks

Operational Tips For Deploying Apache SparkDatabricks

Similar to SparkCruise: Automatic Computation Reuse in Apache Spark (20)

SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKzmhassan

Track A-2 基於 Spark 的數據分析Etu Solution

Apache Spark Performance is too hard. Let's make it easierDatabricks

Apache Spark is a dynamic execution engine that can take relatively simple Scala code and create complex and optimized execution plans. In this talk, we will describe how user code translates into Spark drivers, executors, stages, tasks, transformations, and shuffles. We will then describe how this is critical to the design of Spark and how this tight interplay allows very efficient execution. We will also discuss various sources of metrics on how Spark applications use hardware resources, and show how application developers can use this information to write more efficient code. Users and operators who are aware of these concepts will become more effective at their interactions with Spark.

An Insider’s Guide to Maximizing Spark SQL PerformanceTakuya UESHIN

This document provides an overview of optimizing Spark SQL performance. It begins with introducing the speaker and their background with Spark. It then discusses reading query plans, interpreting them to understand optimizations, and tuning plans by pushing down filters, avoiding implicit casts, and other techniques. It emphasizes tracking query execution through the Spark UI to analyze jobs, stages and tasks for bottlenecks. The document aims to help understand how to maximize Spark SQL performance.

Profiling & Testing with SparkRoger Rafanell Mas

Apache Spark 2.0 includes improvements that provide considerable speedups for CPU-intensive queries through techniques like code generation. Profiling tools like flame graphs can help analyze where CPU cycles are spent by visualizing stack traces. Flame graphs are useful for performance troubleshooting but have limitations. Testing Spark applications locally and through unit tests allows faster iteration compared to running on clusters and saves resources. It is also important to test with local approximations of distributed components like HDFS and Hive.

Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...Databricks

This talk is about sharing experience and lessons learned on setting up and running the Apache Spark service inside the database group at CERN. It covers the many aspects of this change with examples taken from use cases and projects at the CERN Hadoop, Spark, streaming and database services. The talks is aimed at developers, DBAs, service managers and members of the Spark community who are using and/or investigating “Big Data” solutions deployed alongside relational database processing systems. The talk highlights key aspects of Apache Spark that have fuelled its rapid adoption for CERN use cases and for the data processing community at large, including the fact that it provides easy to use APIs that unify, under one large umbrella, many different types of data processing workloads from ETL, to SQL reporting to ML. Spark can also easily integrate a large variety of data sources, from file-based formats to relational databases and more. Notably, Spark can easily scale up data pipelines and workloads from laptops to large clusters of commodity hardware or on the cloud. The talk also addresses some key points about the adoption process and learning curve around Apache Spark and the related “Big Data” tools for a community of developers and DBAs at CERN with a background in relational database operations.

Rally--OpenStack Benchmarking at ScaleMirantis

This document discusses benchmarking OpenStack at scale using Rally. Rally allows OpenStack developers and operators to generate relevant and repeatable benchmarking data on how their cloud operates under different workloads and levels of load. It provides examples of synthetic stress tests and real-life workload scenarios that can be used for benchmarking. The goals of Rally are to help identify performance bottlenecks, validate optimizations, and provide historical data for comparing cloud performance over time as OpenStack and deployments evolve.

Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...Databricks

As Apache Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF: https://www.cncf.io/projects/), can be applied to monitor and archive system performance data in a containerized spark environment. In our examples, we will gather spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.

Energy analytics with Apache Spark workshopQuantUniversity

Energy companies deal with huge amounts of data and Apache Spark is an ideal platform to develop machine learning applications for forecasting and pricing. In this talk, we will discuss how Apache Spark’s MLlib library can be used to build scalable analytics for clustering, classification and forecasting primarily for energy applications using electricity and weather datasets.Through a demo, we will illustrate a workflow approach to accomplish an end-to-end pipeline from data pre-processing to deployment for the above use-case using PySpark, Python etc.

Improving Reporting PerformanceDhiren Gala

SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14thSnappyData

Tordatasci meetup-precima-retail-analytics-201901WeCloudData

The document discusses Precima's analytics processes and pipeline. It describes moving from on-premise systems like SAS and shell scripting to using AWS services like S3, Control-M, Luigi, and Redshift. It outlines considerations for pipeline design and reviews both past and current systems. The future vision involves using Databricks for data pipelines and Snowflake for queries, allowing decoupled, scalable computing and storage.

Presentación Oracle Database Migración consideraciones 10g/11g/12cRonald Francisco Vargas Quesada

Oracle Database Performance Tuning Basicsnitin anjankar

The document discusses Oracle database performance tuning. It covers identifying and resolving performance issues through tools like AWR and ASH reports. Common causes of performance problems include wait events, old statistics, incorrect execution plans, and I/O issues. The document recommends collecting specific data when analyzing problems and provides references and scripts for further tuning tasks.

What's coming in Airflow 2.0? - NYC Apache Airflow MeetupKaxil Naik

The document discusses upcoming features and changes in Apache Airflow 2.0. Key points include: 1. Scheduler high availability will use an active-active model with row-level locks to allow killing a scheduler without interrupting tasks. 2. DAG serialization will decouple DAG parsing from scheduling to reduce delays, support lazy loading, and enable features like versioning. 3. Performance improvements include optimizing the DAG file processor and using a profiling tool to identify other bottlenecks. 4. The Kubernetes executor will integrate with KEDA for autoscaling and allow customizing pods through templating. 5. The official Helm chart, functional DAGs, and smaller usability changes

Analyze database system using a 3 d methodAjith Narayanan

The document discusses analyzing database systems using a 3D method for performance analysis. It introduces the 3D method, which looks at performance from the perspectives of the operating system (OS), Oracle database, and applications. The 3D method provides a holistic view of the system that can help identify issues and direct solutions. It also covers topics like time-based analysis in Oracle, how wait events are classified, and having a diagnostic framework for quick troubleshooting using tools like the Automatic Workload Repository report.

SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersLucidworks

This document discusses how SQL can be used in Lucidworks Fusion for various purposes like aggregating signals to compute relevance scores, ingesting and transforming data from various sources using Spark SQL, enabling self-service analytics through tools like Tableau and PowerBI, and running experiments to compare variants. It provides examples of using SQL for tasks like sessionization with window functions, joining multiple data sources, hiding complex logic in user-defined functions, and powering recommendations. The document recommends SQL in Fusion for tasks like analytics, data ingestion, machine learning, and experimentation.

Apache Spark 2.0: Faster, Easier, and SmarterDatabricks

In this webcast, Reynold Xin from Databricks will be speaking about Apache Spark's new 2.0 major release. The major themes for Spark 2.0 are: - Unified APIs: Emphasis on building up higher level APIs including the merging of DataFrame and Dataset APIs - Structured Streaming: Simplify streaming by building continuous applications on top of DataFrames allow us to unify streaming, interactive, and batch queries. - Tungsten Phase 2: Speed up Apache Spark by 10X

Maximizing Database Tuning in SAP SQL AnywhereSAP Technology

Performance Stability, Tips and Tricks and UnderscoresJitendra Singh

This document provides an overview of upgrading to Oracle Database 19c and ensuring performance stability after the upgrade. It discusses gathering statistics before the upgrade to speed up the process, using AutoUpgrade for upgrades, and various testing tools like AWR Diff Reports and SQL Performance Analyzer to check for performance regressions after the upgrade. Maintaining good statistics and thoroughly testing upgrades are emphasized as best practices for a successful upgrade.

SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKzmhassan

Track A-2 基於 Spark 的數據分析Etu Solution

Apache Spark Performance is too hard. Let's make it easierDatabricks

An Insider’s Guide to Maximizing Spark SQL PerformanceTakuya UESHIN

Profiling & Testing with SparkRoger Rafanell Mas

Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...Databricks

Rally--OpenStack Benchmarking at ScaleMirantis

Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...Databricks

Energy analytics with Apache Spark workshopQuantUniversity

Improving Reporting PerformanceDhiren Gala

SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14thSnappyData

Tordatasci meetup-precima-retail-analytics-201901WeCloudData

Presentación Oracle Database Migración consideraciones 10g/11g/12cRonald Francisco Vargas Quesada

Oracle Database Performance Tuning Basicsnitin anjankar

What's coming in Airflow 2.0? - NYC Apache Airflow MeetupKaxil Naik

Analyze database system using a 3 d methodAjith Narayanan

SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersLucidworks

Apache Spark 2.0: Faster, Easier, and SmarterDatabricks

Maximizing Database Tuning in SAP SQL AnywhereSAP Technology

Performance Stability, Tips and Tricks and UnderscoresJitendra Singh

More from Databricks (20)

DW Migration Webinar-March 2022.pptxDatabricks

The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.

Data Lakehouse Symposium | Day 1 | Part 1Databricks

The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse. Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today. Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow. This is an educational event. Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.

Data Lakehouse Symposium | Day 1 | Part 2Databricks

Data Lakehouse Symposium | Day 2Databricks

Data Lakehouse Symposium | Day 4Databricks

The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.

5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks

In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.

Democratizing Data Quality Through a Centralized PlatformDatabricks

Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale. At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including: Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal Performing data quality validations using libraries built to work with spark Dynamically generating pipelines that can be abstracted away from users Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time

Learn to Use Databricks for Data ScienceDatabricks

Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data.. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.

Why APM Is Not the Same As ML MonitoringDatabricks

Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications. As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored. In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.

The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks

Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.

Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks

In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs. There are other ETL use cases where users want to change CPU and memory resources between stages, for instance there is data skew or perhaps the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs. The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.

Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks

In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks. Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model? The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity. The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters. In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.

Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks

There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine pipelines on this infrastructure while effectively utilizing resources at disposal. In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). Following topics will be covered: – Understanding key traits of Apache Spark on Kubernetes- Things to know when running Apache Spark on Kubernetes such as autoscaling- Demonstrate running analytics pipelines on Apache Spark orchestrated with Apache Airflow on Kubernetes cluster.

Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks

Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark. Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations. Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.

Sawtooth Windows for Feature AggregationsDatabricks

In this talk about zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties about sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high-throughput, low-read latency and tunable write latency for serving machine learning features.We will also talk about a simple deployment strategy for correcting feature drift – due operations that are not “abelian groups”, that operate over change data.

Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks

We want to present multiple anti patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark.All examples presented are tried and tested in production at Scale at Adobe. The most common integration is spark-redis which interfaces with Redis as a Dataframe backing Store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high throughput applications in Spark. Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue · Why? o Custom queries on top a table; We load the data once and query N times · Why not Structured Streaming · Working Solution using Redis Niche 2 : Distributed Counters · Problems with Spark Accumulators · Utilize Redis Hashes as distributed counters · Precautions for retries and speculative execution · Pipelining to improve performance

Re-imagine Data Monitoring with whylogs and SparkDatabricks

In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly, and cumbersome while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data. In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.

Raven: End-to-end Optimization of ML Prediction QueriesDatabricks

Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components. We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure. This allows us to introduce optimization rules that (i) reduce unnecessary computations by passing information between the data processing and ML operators (ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and (iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator. We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.

Processing Large Datasets for ADAS Applications using Apache SparkDatabricks

Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis. Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them. Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy. This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.

Massive Data Processing in Adobe Using Delta LakeDatabricks

At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is a bunch of complex ingestion of a mix of normalized and denormalized data with various linkage scenarios power by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements etc. We will go over how we built a cost effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences. What are we storing? Multi Source – Multi Channel Problem Data Representation and Nested Schema Evolution Performance Trade Offs with Various formats Go over anti-patterns used (String FTW) Data Manipulation using UDFs Writer Worries and How to Wipe them Away Staging Tables FTW Datalake Replication Lag Tracking Performance Time!

DW Migration Webinar-March 2022.pptxDatabricks

Data Lakehouse Symposium | Day 1 | Part 1Databricks

Data Lakehouse Symposium | Day 1 | Part 2Databricks

Data Lakehouse Symposium | Day 2Databricks

Data Lakehouse Symposium | Day 4Databricks

5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks

Democratizing Data Quality Through a Centralized PlatformDatabricks

Learn to Use Databricks for Data ScienceDatabricks

Why APM Is Not the Same As ML MonitoringDatabricks

The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks

Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks

Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks

Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks

Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks

Sawtooth Windows for Feature AggregationsDatabricks

Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks

Re-imagine Data Monitoring with whylogs and SparkDatabricks

Raven: End-to-end Optimization of ML Prediction QueriesDatabricks

Processing Large Datasets for ADAS Applications using Apache SparkDatabricks

Massive Data Processing in Adobe Using Delta LakeDatabricks

Recently uploaded (20)

1022_ExtendEnrichExcelUsingPythonWithTableau_04_16+04_17 (1).pdfelinavihriala

apidays New York 2025 - Unifying OpenAPI & AsyncAPI by Naresh Jain & Hari Kri...apidays

Unifying OpenAPI & AsyncAPI: Designing JSON Schemas+Examples for Reuse Naresh Jain, Co-founder & CEO at Specmatic Hari Krishnan, Co-founder & CTO at Specmatic apidays New York 2025 API Management for Surfing the Next Innovation Waves: GenAI and Open Banking May 14 & 15, 2025 ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdfSiddharthSean

Chapter 5.1.pptxsertj you can get it done before the election and I willSotheaPheng

lecture 33333222234555555555555555556.pptxobsinaafilmakuush

apidays New York 2025 - Why an SDK is Needed to Protect APIs from Mobile Apps...apidays

Why an SDK is Needed to Protect APIs from Mobile Apps Pearce Erensel, Global VP of Sales at Approov Mobile Security apidays New York 2025 API Management for Surfing the Next Innovation Waves: GenAI and Open Banking May 14 & 15, 2025 ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

THE FRIEDMAN TEST ( Biostatics B. Pharm)JishuHaldar

The Friedman Test is a valuable non-parametric alternative to the Repeated Measures ANOVA, allowing for the comparison of three or more related groups when data is ordinal or not normally distributed. By ranking data instead of using raw values, the test overcomes the limitations of parametric tests, making it ideal for small sample sizes and real-world applications in medicine, psychology, pharmaceutical sciences, and education. However, while it effectively detects differences among groups, it does not indicate which specific groups differ, requiring further post-hoc analysis.

apidays New York 2025 - Two tales of API Change Management by Eric Koleda (Coda)apidays

Two tales of API Change Management from my time at Google Eric Koleda, Developer Advocate at Coda apidays New York 2025 API Management for Surfing the Next Innovation Waves: GenAI and Open Banking Convene 360 Madison, New York May 14 & 15, 2025 ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

Chapter4.1.pptx you can come to the house and statisticsSotheaPheng

2.5-DESPATCH-ORDINARY MAILS.pptxlminub7b7t6f7h7t6f6g7g6fgmk1227103

Tableau Cloud - what to consider before making the move update 2025.pdfelinavihriala

apidays New York 2025 - Building Scalable AI Systems by Sai Prasad Veluru (Ap...apidays

Building Scalable AI Systems: Cloud Architecture for Performance Sai Prasad Veluru, Software Engineer at Apple Inc apidays New York 2025 API Management for Surfing the Next Innovation Waves: GenAI and Open Banking May 14 & 15, 2025 ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

apidays New York 2025 - Spring Modulith Design for Microservices by Renjith R...apidays

Spring Modulith Design for Microservices Renjith Ramachandran, Senior Solutions Architect at BJS Wholesale Club apidays New York 2025 API Management for Surfing the Next Innovation Waves: GenAI and Open Banking Convene 360 Madison, New York May 14 & 15, 2025 ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

Philippine-Constitution-and-Law in hospitalitykikomendoza006

apidays New York 2025 - Computers are still dumb by Ben Morss (DeepL)apidays

Computers are still dumb: bringing your AI magic to enterprises Ben Morss, Developer Evangelist at DeepL apidays New York 2025 API Management for Surfing the Next Innovation Waves: GenAI and Open Banking Convene 360 Madison, New York May 14 & 15, 2025 ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

apidays New York 2025 - Building Green Software by Marissa Jasso & Katya Drey...apidays

Building Green Software: How Cloud-Native Platforms Can Power Sustainable App Development Katya Dreyer-Oren, Lead Software Engineer at Heroku (Salesforce) Marissa Jasso, Product Manager at Heroku (Salesforce) apidays New York 2025 API Management for Surfing the Next Innovation Waves: GenAI and Open Banking Convene 360 Madison, New York May 14 & 15, 2025 ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

apidays New York 2025 - Lessons From Two Technical Transformations by Leah Hu...apidays

You Can't Outrun Complexity - But You Can Orchestrate It: Lessons From Two Technical Transformations Leah Hurwich Adler, Senior Staff Product Manager at Apollo GraphQL apidays New York 2025 API Management for Surfing the Next Innovation Waves: GenAI and Open Banking May 14 & 15, 2025 ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

egc.pdf tài liệu tiếng Anh cho học sinh THPThuyenmy200809

apidays New York 2025 - The FINOS Common Domain Model for Capital Markets by ...apidays

The FINOS Common Domain Model for Capital Markets Tom Healey, Founder & Director at FINXIS LLC Daniel Schwartz, Managing Partner at FT Advisory LLC apidays New York 2025 API Management for Surfing the Next Innovation Waves: GenAI and Open Banking May 14 & 15, 2025 ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

Arrays in c programing. practicals and .pptCarlos701746