This talk focuses on the journey of technical challenges, trade-offs, and ground-breaking achievements in building performant and scalable pipelines, drawn from our experience working with customers.
Delta Lake brings reliability, performance, and security to data lakes. It provides ACID transactions, schema enforcement, and unified handling of batch and streaming data to make data lakes more reliable. Delta Lake also features lightning fast query performance through its optimized Delta Engine. It enables security and compliance at scale through access controls and versioning of data. Delta Lake further offers an open approach and avoids vendor lock-in by using open formats like Parquet that can integrate with various ecosystems.
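As a minimal, illustrative sketch of these ideas (not a Databricks-specific setup), the PySpark snippet below writes a Delta table and then reads the same path as both a batch DataFrame and a streaming source; the table path and column names are placeholders, and it assumes the open source delta-spark package is installed.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("delta-quickstart")
         # Assumes the open source delta-spark package is available on the classpath.
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Batch append with ACID guarantees; schema enforcement rejects incompatible appends.
events = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "action"])
events.write.format("delta").mode("append").save("/tmp/delta/events")

# The same table can be read as a batch DataFrame or as a streaming source.
batch_df = spark.read.format("delta").load("/tmp/delta/events")
stream_df = spark.readStream.format("delta").load("/tmp/delta/events")
```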
Following Matei Zaharia’s keynote presentation, join this session for the nitty-gritty details. Tableau is joining forces with Databricks and the Delta Lake open source community to announce Delta Sharing and the new open Delta Sharing protocol for secure data sharing. For Tableau customers, Delta Sharing simplifies and enriches data while supporting the development of a data culture. Join this session to see a live demo of Tableau on Delta Sharing. Tableau customers can choose between two connection workflows. The first, “Direct Connect,” leverages a Tableau Web Data Connector (WDC). The second takes a hybrid approach: querying live over the Delta Sharing protocol while using the Tableau Hyper in-memory data engine for fast data ingestion and analytical query processing.
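Outside of Tableau, the same shares can be consumed with the open source Delta Sharing Python connector; the sketch below is hedged and hypothetical, with the profile file path and the share/schema/table names standing in for values a data provider would supply.

```python
import delta_sharing  # pip install delta-sharing

# The profile file is issued by the data provider; path and table names below are placeholders.
profile = "/path/to/provider.share"

client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())  # discover the shares, schemas, and tables you can access

# Load one shared table into pandas; the "#share.schema.table" suffix selects the table.
df = delta_sharing.load_as_pandas(f"{profile}#my_share.my_schema.my_table")
print(df.head())
```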
The document provides an overview of the Databricks platform, which offers a unified environment for data engineering, analytics, and AI. It describes how Databricks addresses the complexity of managing data across siloed systems by providing a single "data lakehouse" platform where all data and analytics workloads can be run. Key features highlighted include Delta Lake for ACID transactions on data lakes, auto loader for streaming data ingestion, notebooks for interactive coding, and governance tools to securely share and catalog data and models.
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S... – Spark Summit
What if you could get the simplicity, convenience, interoperability, and storage niceties of an old-fashioned CSV with the speed of a NoSQL database and the storage requirements of a gzipped file? Enter Parquet.
At The Weather Company, Parquet files are a quietly awesome and deeply integral part of our Spark-driven analytics workflow. Using Spark + Parquet, we’ve built a blazing fast, storage-efficient, query-efficient data lake and a suite of tools to accompany it.
We will give a technical overview of how Parquet works and how recent improvements from Tungsten enable SparkSQL to take advantage of this design to provide fast queries by overcoming two major bottlenecks of distributed analytics: communication costs (IO bound) and data decoding (CPU bound).
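A rough PySpark sketch of the pattern described above, with hypothetical paths and column names: write partitioned Parquet, then read it back with column projection and a filter that Spark can push down to the Parquet scan.

```python
# Hypothetical paths and column names.
raw = spark.read.json("/data/raw/weather.json")

(raw.write
    .mode("overwrite")
    .partitionBy("date")                       # directory partitioning for coarse pruning
    .parquet("/data/lake/weather"))

readings = (spark.read.parquet("/data/lake/weather")
            .select("station_id", "temperature")   # projection: only these columns are decoded
            .where("temperature > 35"))            # predicate pushed down to Parquet statistics
readings.explain()                                 # "PushedFilters" shows what reached the scan
```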
Using Databricks as an Analysis Platform – Databricks
Over the past year, YipitData spearheaded a full migration of its data pipelines to Apache Spark via the Databricks platform. Databricks now empowers its 40+ data analysts to independently create data ingestion systems, manage ETL workflows, and produce meaningful financial research for our clients.
Data Warehouse or Data Lake, Which Do I Choose? – DATAVERSITY
Today’s data-driven companies have a choice to make – where do we store our data? As the move to the cloud continues to be a driving factor, the choice becomes either the data warehouse (Snowflake et al.) or the data lake (AWS S3 et al.). There are pros and cons to each approach. While data warehouses give you strong data management with analytics, they handle semi-structured and unstructured data poorly, tightly couple storage and compute, and often bring expensive vendor lock-in. On the other hand, data lakes let you store all kinds of data and are extremely affordable, but they are only meant for storage and by themselves provide no direct value to an organization.
Enter the Open Data Lakehouse, the next evolution of the data stack that gives you the openness and flexibility of the data lake with the key aspects of the data warehouse like management and transaction support.
In this webinar, you’ll hear from Ali LeClerc who will discuss the data landscape and why many companies are moving to an open data lakehouse. Ali will share more perspective on how you should think about what fits best based on your use case and workloads, and how some real world customers are using Presto, a SQL query engine, to bring analytics to the data lakehouse.
Lessons learned from building practical deep learning systems – Xavier Amatriain
1. There are many lessons to be learned from building practical deep learning systems, including choosing the right evaluation metrics, being thoughtful about your data and potential biases, and understanding dependencies between data, models, and systems.
2. It is important to optimize only what matters and beware of biases in your data. Simple models are often better than complex ones, and feature engineering is crucial.
3. Both supervised and unsupervised learning are important, and ensembles often perform best. Your AI infrastructure needs to support both experimentation and production.
Cloud-native Semantic Layer on Data Lake – Databricks
As data lakes store larger volumes of increasingly real-time data, managing that data and serving analytics and applications becomes more complex. With differing service interfaces, data definitions, and performance characteristics across scenarios, business users begin to lose confidence in the quality and efficiency of getting insight from the data.
H&M uses machine learning for various use cases including logistics, production, sales, marketing, and design/buying. MLOps principles like model versioning, reproducibility, scalability, and automated training are applied to manage the machine learning lifecycle. The technical stack includes Kubernetes, Docker, Azure Databricks for interactive development, Airflow for automated training, and Seldon for model serving. The goal is to apply MLOps at scale for various prediction scenarios through a continuous integration/continuous delivery pipeline.
Challenge And Evolution Of Data Orchestration at Rakuten Data System – Alluxio, Inc.
1) Rakuten has evolved its data warehouse and lake over the past decade from a Teradata-based warehouse to incorporating HDFS and other data sources, creating complexity.
2) Challenges include too much data replication across storage systems, difficulty combining data sources, and fully coupling downstream apps to data sources.
3) Rakuten developed a data orchestration layer to unify pipelines, provide a common layer for consumption, enable caching for performance, and decouple downstream apps from data sources.
The document outlines topics covered in "The Impala Cookbook" published by Cloudera. It discusses physical and schema design best practices for Impala, including recommendations for data types, partition design, file formats, and block size. It also covers estimating and managing Impala's memory usage, and how to identify the cause when queries exceed memory limits.
Batch Processing at Scale with Flink & Iceberg – Flink Forward
Flink Forward San Francisco 2022.
Goldman Sachs's Data Lake platform serves as the firm's centralized data platform, ingesting 140K (and growing!) batches per day of Datasets of varying shape and size. Powered by Flink and using metadata configured by platform users, ingestion applications are generated dynamically at runtime to extract, transform, and load data into centralized storage where it is then exported to warehousing solutions such as Sybase IQ, Snowflake, and Amazon Redshift. Data Latency is one of many key considerations as producers and consumers have their own commitments to satisfy. Consumers range from people/systems issuing queries, to applications using engines like Spark, Hive, and Presto to transform data into refined Datasets. Apache Iceberg allows our applications to not only benefit from consistency guarantees important when running on eventually consistent storage like S3, but also allows us the opportunity to improve our batch processing patterns with its scalability-focused features.
by Andreas Hailu
Apache Arrow is a new standard for in-memory columnar data processing. It is a complement to Apache Parquet and Apache ORC. In this deck we review key design goals and how Arrow works in detail.
The document is a presentation deck on building a supply chain twin using Neo4j and Google technologies. It discusses how supply chain data can be modeled as a graph and stored in Neo4j to power use cases like identifying product and part shortfalls, evaluating supply chain risk, and enabling scenario planning. The deck outlines an architecture that ingests supply chain data from Google BigQuery into Neo4j, then leverages Neo4j technologies like Graph Data Science, Bloom, and Keymaker to operationalize queries and deliver insights to applications.
Iceberg: A modern table format for big data (Strata NY 2018) – Ryan Blue
Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait.
Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a table layout addressing the challenges of current Hive tables, with properties specifically designed for cloud object stores such as S3. Iceberg is an Apache-licensed open source project. It specifies a portable table format and standardizes many important features, including:
* All reads use snapshot isolation without locking.
* No directory listings are required for query planning.
* Files can be added, removed, or replaced atomically.
* Full schema evolution supports changes in the table over time.
* Partitioning evolution enables changes to the physical layout without breaking existing queries.
* Data files are stored as Avro, ORC, or Parquet.
* Support for Spark, Pig, and Presto.
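As an illustrative, non-authoritative example of using Iceberg from Spark, the snippet below configures a Hadoop-type Iceberg catalog and creates a partitioned table; the catalog name, warehouse path, and schema are assumptions, and the iceberg-spark runtime package must be on the classpath.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.extensions",
                 "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
         .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.demo.type", "hadoop")
         .config("spark.sql.catalog.demo.warehouse", "s3://my-bucket/warehouse")  # placeholder
         .getOrCreate())

spark.sql("""
  CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, ts TIMESTAMP, action STRING)
  USING iceberg
  PARTITIONED BY (days(ts))          -- hidden partitioning via a transform
""")
spark.sql("INSERT INTO demo.db.events VALUES (1, current_timestamp(), 'click')")
spark.table("demo.db.events").show()
```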
This document discusses Apache Arrow, an open source cross-language development platform for in-memory analytics. It provides an overview of Arrow's goals of being cross-language compatible, optimized for modern CPUs, and enabling interoperability between systems. Key components include core C++/Java libraries, integrations with projects like Pandas and Spark, and common message patterns for sharing data. The document also describes how Arrow is implemented in practice in systems like Dremio's Sabot query engine.
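A small pyarrow sketch of the interoperability idea: a pandas DataFrame is converted to a columnar Arrow table in memory and bridged to Parquet on disk. The data and file path are invented for illustration.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

pdf = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})   # made-up data
table = pa.Table.from_pandas(pdf)              # columnar, in-memory Arrow representation
pq.write_table(table, "/tmp/values.parquet")   # Arrow as the bridge to on-disk Parquet
roundtrip = pq.read_table("/tmp/values.parquet").to_pandas()
print(roundtrip)
```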
Deep Learning at Extreme Scale (in the Cloud) with the Apache Kafka Open Sou... – Kai Wähner
How to Build a Machine Learning Infrastructure with Kafka, Connect, Streams, KSQL, etc…
This talk shows how to build Machine Learning models at extreme scale and how to productionize the built models in mission-critical real time applications by leveraging open source components in the public cloud. The session discusses the relation between TensorFlow and the Apache Kafka ecosystem - and why this is a great fit for machine learning at extreme scale.
The Machine Learning architecture includes: Kafka Connect for continuous high volume data ingestion into the public cloud, TensorFlow leveraging Deep Learning algorithms to build an analytic model on powerful GPUs, Kafka Streams for model deployment and inference in real time, and KSQL for real time analytics of predictions, alerts and model accuracy.
Sensor analytics for predictive alerting in real time is used as real world example from Internet of Things scenarios. A live demo shows the out-of-the-box integration and dynamic scalability of these components on Google Cloud.
Key takeaways for the audience
• Learn how to build a Machine Learning infrastructure at extreme scale and how to productionize the built models in mission-critical real time applications
• Understand the benefits of a machine learning platform on the public cloud
• Learn about an extreme scale Machine Learning architecture around the Apache Kafka open source ecosystem including Kafka Connect, Kafka Streams and KSQL
• See a live demo for an Internet of Things use case: Sensor analytics for predictive alerting in real time
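The talk’s pipeline uses Kafka Connect, Kafka Streams, and KSQL on the JVM; purely as a hedged Python approximation of the “score events inside the stream” idea, the sketch below consumes sensor events with the confluent-kafka client and writes alert scores back to another topic. The topic names, event fields, and the trivial scoring function are placeholders standing in for a real TensorFlow model.

```python
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({"bootstrap.servers": "localhost:9092",
                     "group.id": "sensor-scoring",
                     "auto.offset.reset": "earliest"})
consumer.subscribe(["sensor-events"])                 # hypothetical input topic
producer = Producer({"bootstrap.servers": "localhost:9092"})

def score(event):
    # Stand-in for invoking a trained TensorFlow model on the event features.
    return 1.0 if event.get("temperature", 0) > 90 else 0.0

for _ in range(1000):                                 # bounded loop for the sketch
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    alert = {"sensor": event.get("sensor_id"), "score": score(event)}
    producer.produce("sensor-alerts", json.dumps(alert).encode("utf-8"))
    producer.poll(0)                                  # serve delivery callbacks
producer.flush()
```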
Airbnb aims to democratize data within the company by building a graph database of all internal data resources connected by relationships. This graph is queried through a search interface to help employees explore, discover, and build trust in company data. Challenges include modeling complex data dependencies and proxy nodes, merging graph updates from different sources, and designing a data-dense interface simply. Future goals are to gamify content production, deliver recommendations, certify trusted content, and analyze the information network.
Elsevier: Empowering Knowledge Discovery in Research with Graphs – Neo4j
This document summarizes a presentation about enabling knowledge discovery with graphs. It discusses Elsevier's use of Neo4j's graph database to build structured search applications and power recommendations. Some key points include:
- Elsevier connects over 4 billion relationships in its graph, including references, grants, works, authors, and more to enable queries like finding all papers by an author.
- The graph helps build new product experiences across Elsevier's portfolio like enhanced author profiles and citation counts in search results.
- Graphs and embeddings provide a more precise understanding of author expertise and how their fields of study may have changed over time.
- The graph supports data science and accelerates analytics like evaluating academic impact with PageRank.
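A generic sketch of querying such a graph with the Neo4j Python driver; the labels, relationship types, and property names are hypothetical and not Elsevier’s actual schema.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Hypothetical labels, relationship types, and properties, purely for illustration.
query = """
MATCH (a:Author {name: $name})-[:AUTHORED]->(p:Paper)
RETURN p.title AS title, p.year AS year
ORDER BY p.year DESC
"""
with driver.session() as session:
    for record in session.run(query, name="Ada Lovelace"):
        print(record["title"], record["year"])
driver.close()
```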
This is Part 4 of the GoldenGate series on Data Mesh - a series of webinars helping customers understand how to move off of old-fashioned monolithic data integration architecture and get ready for more agile, cost-effective, event-driven solutions. The Data Mesh is a kind of Data Fabric that emphasizes business-led data products running on event-driven streaming architectures, serverless, and microservices based platforms. These emerging solutions are essential for enterprises that run data-driven services on multi-cloud, multi-vendor ecosystems.
Join this session to get a fresh look at Data Mesh; we'll start with core architecture principles (vendor agnostic) and transition into detailed examples of how Oracle's GoldenGate platform is providing capabilities today. We will discuss essential technical characteristics of a Data Mesh solution, and the benefits that business owners can expect by moving IT in this direction. For more background on Data Mesh, Part 1, 2, and 3 are on the GoldenGate YouTube channel: https://www.youtube.com/playlist?list=PLbqmhpwYrlZJ-583p3KQGDAd6038i1ywe
Webinar Speaker: Jeff Pollock, VP Product (https://www.linkedin.com/in/jtpollock/)
Mr. Pollock is an expert technology leader for data platforms, big data, data integration and governance. Jeff has been CTO at California startups and a senior exec at Fortune 100 tech vendors. He is currently Oracle VP of Products and Cloud Services for Data Replication, Streaming Data and Database Migrations. While at IBM, he was head of all Information Integration, Replication and Governance products, and previously Jeff was an independent architect for US Defense Department, VP of Technology at Cerebra and CTO of Modulant – he has been engineering artificial intelligence based data platforms since 2001. As a business consultant, Mr. Pollock was a Head Architect at Ernst & Young’s Center for Technology Enablement. Jeff is also the author of “Semantic Web for Dummies” and "Adaptive Information,” a frequent keynote at industry conferences, author for books and industry journals, formerly a contributing member of W3C and OASIS, and an engineering instructor with UC Berkeley’s Extension for object-oriented systems, software development process and enterprise architecture.
Snowflake: The most cost-effective agile and scalable data warehouse ever! – Visual_BI
In this webinar, the presenter will take you through the most revolutionary data warehouse, Snowflake with a live demo and technical and functional discussions with a customer. Ryan Goltz from Chesapeake Energy and Tristan Handy, creator of DBT Cloud and owner of Fishtown Analytics will also be joining the webinar.
Garbage collection has largely removed the need to think about memory management when you write Java code, but there is still a benefit to understanding and minimizing the memory usage of your applications, particularly with the growing number of deployments of Java on embedded devices. This session gives you insight into the memory used as you write Java code and provides you with guidance on steps you can take to minimize your memory usage and write more-memory-efficient code. It shows you how to
• Understand the memory usage of Java code
• Minimize the creation of new Java objects
• Use the right Java collections in your application
• Identify inefficiencies in your code and remove them
Video available from Parleys.com:
https://www.parleys.com/talk/how-write-memory-efficient-java-code
Using MLOps to Bring ML to Production/The Promise of MLOps – Weaveworks
In this final Weave Online User Group of 2019, David Aronchick asks: have you ever struggled with having different environments to build, train and serve ML models, and how to orchestrate between them? While DevOps and GitOps have made huge traction in recent years, many customers struggle to apply these practices to ML workloads. This talk will focus on the ways MLOps has helped to effectively infuse AI into production-grade applications through establishing practices around model reproducibility, validation, versioning/tracking, and safe/compliant deployment. We will also talk about the direction for MLOps as an industry, and how we can use it to move faster, with more stability, than ever before.
The recording of this session is on our YouTube Channel here: https://youtu.be/twsxcwgB0ZQ
Speaker: David Aronchick, Head of Open Source ML Strategy, Microsoft
Bio: David leads Open Source Machine Learning Strategy at Azure. This means he spends most of his time helping humans to convince machines to be smarter. He is only moderately successful at this. Previously, David led product management for Kubernetes at Google, launched GKE, and co-founded the Kubeflow project. David has also worked at Microsoft, Amazon and Chef and co-founded three startups.
Sign up for a free Machine Learning Ops Workshop: http://bit.ly/MLOps_Workshop_List
Weaveworks will cover concepts such as GitOps (operations by pull request), Progressive Delivery (canary, A/B, blue-green), and how to apply those approaches to your machine learning operations to mitigate risk.
EY + Neo4j: Why graph technology makes sense for fraud detection and customer... – Neo4j
Graph databases can help insurance companies address challenges like siloed data systems, identity resolution issues, and an inability to gain a full view of customers. They allow for a unified customer 360 view across different business units. Graph databases perform better than SQL for data that is interconnected, requires optimal querying of relationships, and has an evolving data model. Specifically for insurance, graphs can increase cross-sell/upsell opportunities, retention rates, and customer satisfaction while reducing costs and fraud. EY has experience implementing graph solutions for use cases like fraud detection and customer 360 projects.
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J... – Flink Forward
Netflix processes trillions of events and petabytes of data a day in the Keystone data pipeline, which is built on top of Apache Flink. As Netflix has scaled up original productions annually enjoyed by more than 150 million global members, data integration across the streaming service and the studio has become a priority. Scalably integrating data across hundreds of different data stores in a way that enables us to holistically optimize cost, performance and operational concerns presented a significant challenge. Learn how we expanded the scope of the Keystone pipeline into the Netflix Data Mesh, our real-time, general-purpose, data transportation platform for moving data between Netflix systems. The Keystone Platform’s unique approach to declarative configuration and schema evolution, as well as our approach to unifying batch and streaming data and processing will be covered in depth.
The columnar roadmap: Apache Parquet and Apache Arrow – DataWorks Summit
The Hadoop ecosystem has standardized on columnar formats—Apache Parquet for on-disk storage and Apache Arrow for in-memory. With this trend, deep integration with columnar formats is a key differentiator for big data technologies. Vertical integration from storage to execution greatly improves the latency of accessing data by pushing projections and filters to the storage layer, reducing time spent in IO reading from disk, as well as CPU time spent decompressing and decoding. Standards like Arrow and Parquet make this integration even more valuable as data can now cross system boundaries without incurring costly translation. Cross-system programming using languages such as Spark, Python, or SQL can become as fast as native internal performance.
In this talk we’ll explain how Parquet is improving at the storage level, with metadata and statistics that will facilitate more optimizations in query engines in the future. We’ll detail how the new vectorized reader from Parquet to Arrow enables much faster reads by removing abstractions as well as several future improvements. We will also discuss how standard Arrow-based APIs pave the way to breaking the silos of big data. One example is Arrow-based universal function libraries that can be written in any language (Java, Scala, C++, Python, R, ...) and will be usable in any big data system (Spark, Impala, Presto, Drill). Another is a standard data access API with projection and predicate push downs, which will greatly simplify data access optimizations across the board.
Speaker
Julien Le Dem, Principal Engineer, WeWork
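To make the projection and predicate pushdown point concrete, here is a small pyarrow sketch (paths and column names invented): only the requested columns are decoded, and row groups whose statistics rule out the filter are skipped.

```python
import pyarrow.parquet as pq

# Only the listed columns are decoded, and row groups whose statistics
# rule out the filter are skipped entirely.
table = pq.read_table("/data/lake/events.parquet",          # placeholder path
                      columns=["user_id", "latency_ms"],
                      filters=[("latency_ms", ">", 500)])
df = table.to_pandas()   # columns are handed to pandas without re-encoding
```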
In this webinar, we’ll show you how Cloudera SDX reduces the complexity in your data management environment and lets you deliver diverse analytics with consistent security, governance, and lifecycle management against a shared data catalog.
Making Apache Spark Better with Delta Lake – Databricks
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies the streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
In this talk, we will cover:
* What data quality problems Delta helps address
* How to convert your existing application to Delta Lake
* How the Delta Lake transaction protocol works internally
* The Delta Lake roadmap for the next few releases
* How to get involved!
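For the “convert your existing application” point, a minimal sketch using the delta-spark Python API with a placeholder path: an existing Parquet directory is converted in place by writing a Delta transaction log next to the files, after which readers simply switch formats.

```python
from delta.tables import DeltaTable

# Convert an existing Parquet directory in place: the data files are not rewritten,
# a Delta transaction log is simply created alongside them.
DeltaTable.convertToDelta(spark, "parquet.`/data/lake/events`")   # placeholder path

# Existing readers then just switch formats.
df = spark.read.format("delta").load("/data/lake/events")
```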
Operating and Supporting Delta Lake in Production – Databricks
The document discusses strategies for optimizing and managing metadata in Delta Lake. It provides an overview of optimize, auto-optimize, and optimize write strategies and how to choose the appropriate strategy based on factors like workload, data size, and cluster resources. It also discusses Delta Lake transaction logs, configurations like log retention duration, and tips for working with Delta Lake metadata.
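As an illustration of the knobs mentioned above, a short sketch follows; availability of OPTIMIZE/ZORDER depends on your Delta Lake or Databricks runtime version, and the table path and retention values are placeholders.

```python
# Compact small files and co-locate data on a frequently filtered column.
spark.sql("OPTIMIZE delta.`/data/lake/events` ZORDER BY (user_id)")

# Transaction-log housekeeping via table properties.
spark.sql("""
  ALTER TABLE delta.`/data/lake/events`
  SET TBLPROPERTIES (
    'delta.logRetentionDuration' = 'interval 30 days',
    'delta.deletedFileRetentionDuration' = 'interval 7 days'
  )
""")
```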
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc... – Databricks
Spark SQL is a highly scalable and efficient relational processing engine with ease-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze the data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of SparkSQL spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and understand how to tune Spark SQL performance.
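A small, generic example of the tuning workflow the talk describes: enable adaptive query execution, adjust the broadcast join threshold, and inspect the physical plan with explain(). The table names are placeholders.

```python
spark.conf.set("spark.sql.adaptive.enabled", "true")                 # adaptive query execution
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

orders = spark.table("orders")          # placeholder tables
regions = spark.table("dim_region")
joined = orders.join(regions, "region_id")
joined.explain(mode="formatted")        # inspect the physical plan: broadcast vs. shuffle join
```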
Healthcare Claim Reimbursement using Apache Spark – Databricks
The document discusses rewriting a claims reimbursement system using Spark. It describes how Spark provides better performance, scalability and cost savings compared to the previous Oracle-based system. Key points include using Spark for ETL to load data into a Delta Lake data lake, implementing the business logic in a reusable Java library, and seeing significant increases in processing volumes and speeds compared to the prior system. Challenges and tips for adoption are also provided.
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update... – Databricks
The convergence of big data technology toward the traditional database domain has become an industry trend. Open source big data processing engines such as Apache Spark, Apache Hadoop, and Apache Flink already support SQL interfaces, and SQL now dominates their usage. Companies use this open source software to build their own ETL frameworks and OLAP technology. OLTP, however, remains a strength of traditional databases, largely because of their ACID support.
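The update capability Delta Lake adds on top of such engines is typically expressed as a MERGE; below is a generic delta-spark sketch, where the paths, join key, and change-set source are assumptions rather than the talk’s actual pipeline.

```python
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/data/lake/claims")              # placeholder path
updates = spark.read.parquet("/data/incoming/claims_changes")        # hypothetical change set

(target.alias("t")
   .merge(updates.alias("u"), "t.claim_id = u.claim_id")
   .whenMatchedUpdateAll()
   .whenNotMatchedInsertAll()
   .execute())
```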
SQLDAY 2023 Chodkowski Adrian Databricks Performance Tuning – SeeQuality.net
This document contains a presentation on performance tuning strategies for Azure Databricks. It discusses techniques like enabling the Databricks disk cache, using Autoloader for ingestion, implementing dynamic and static partition pruning, leveraging file pruning using statistics, optimizing layout using Z-ordering, and additional tips around query optimization, adaptive query processing, and cluster configuration. The presentation provides technical details on how each strategy works and when it should be applied to improve query performance on Databricks.
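A few of these strategies expressed as a hedged PySpark sketch: the disk-cache setting is Databricks-specific (verify the exact key against your runtime documentation), and the Auto Loader paths and target table are placeholders.

```python
# Databricks disk cache; the key is Databricks-specific, verify it against your runtime docs.
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Auto Loader: incremental file discovery with the cloudFiles source (paths are placeholders).
stream = (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", "/mnt/schemas/events")
          .load("/mnt/raw/events"))

(stream.writeStream
   .option("checkpointLocation", "/mnt/checkpoints/events")
   .trigger(availableNow=True)
   .toTable("bronze_events"))
```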
Delta Lake: Open Source Reliability w/ Apache Spark – George Chow
As presented: Sajith Appukuttan, Solution Architect, Databricks
Sept 12, 2019 at Vancouver Spark Meetup
Abstract: Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard – Paris Data Engineers !
Delta Lake is an open source framework that lives on top of Parquet in your data lake to provide reliability and performance. It was open-sourced by Databricks this year and is gaining traction to become the de facto data lake table format.
We’ll see all the good Delta Lake can do for your data: ACID transactions, DDL operations, schema enforcement, batch and stream support, and more!
Open Source Reliability for Data Lake with Apache Spark by Michael Armbrust – Data Con LA
Abstract: Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
In this talk, we will cover:
* All technical aspects of Delta features
* What’s coming
* How to get started using it
* How to contribute
Bio: Michael Armbrust is committer and PMC member of Apache Spark and the original creator of Spark SQL. He currently leads the team at Databricks that designed and built Structured Streaming and Databricks Delta. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization.
This document discusses optimizing Apache Spark (PySpark) workloads in production. It provides an agenda for a presentation on various Spark topics including the primary data structures (RDD, DataFrame, Dataset), executors, cores, containers, stages and jobs. It also discusses strategies for optimizing joins, parallel reads from databases, bulk loading data, and scheduling Spark workflows with Apache Airflow. The presentation is given by a solution architect from Accionlabs, a global technology services firm focused on emerging technologies like Apache Spark, machine learning, and cloud technologies.
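One of the techniques mentioned, parallel reads from a database, looks roughly like the following in PySpark; the connection details and partitioning bounds are invented for illustration.

```python
claims = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://db-host:5432/claims")   # hypothetical connection
          .option("dbtable", "public.claims")
          .option("user", "reader")
          .option("password", "***")
          .option("partitionColumn", "claim_id")    # numeric column used to split the read
          .option("lowerBound", "1")
          .option("upperBound", "10000000")
          .option("numPartitions", "16")            # 16 concurrent JDBC partitions
          .load())
```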
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ... – Chetan Khatri
This document summarizes a presentation about scaling terabytes of data with Apache Spark and Scala. The key points are:
1) The presenter discusses how to use Apache Spark and Scala to process large scale data in a distributed manner across clusters. Spark operations like RDDs, DataFrames and Datasets are covered.
2) A case study is presented about reengineering a data processing platform for a retail business to improve performance. Changes included parallelizing jobs, tuning Spark hyperparameters, and building a fast data architecture using Spark, Kafka and data lakes.
3) Performance was improved through techniques like dynamic resource allocation in YARN, reducing memory and cores per executor to better utilize cluster resources, and processing data
"The common use cases of Spark SQL include ad hoc analysis, logical warehouse, query federation, and ETL processing. Spark SQL also powers the other Spark libraries, including structured streaming for stream processing, MLlib for machine learning, and GraphFrame for graph-parallel computation. For boosting the speed of your Spark applications, you can perform the optimization efforts on the queries prior employing to the production systems. Spark query plans and Spark UIs provide you insight on the performance of your queries. This talk discloses how to read and tune the query plans for enhanced performance. It will also cover the major related features in the recent and upcoming releases of Apache Spark.
"
Simplify and Scale Data Engineering Pipelines with Delta Lake – Databricks
We’re always told to ‘Go for the Gold!,’ but how do we get there? This talk will walk you through the process of moving your data across the finish line to get that gold medal! A common data engineering pipeline architecture uses tables that correspond to different quality levels, progressively adding structure to the data: data ingestion (‘Bronze’ tables), transformation/feature engineering (‘Silver’ tables), and machine learning training or prediction (‘Gold’ tables). Combined, we refer to these tables as a ‘multi-hop’ architecture. It allows data engineers to build a pipeline that begins with raw data as a ‘single source of truth’ from which everything flows. In this session, we will show how to build a scalable data engineering pipeline using Delta Lake, so you can be the champion in your organization.
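A compressed sketch of such a multi-hop pipeline with Structured Streaming and Delta; the paths, columns, and checkpoint locations are placeholders, and a real pipeline would add expectations and error handling.

```python
# Bronze: raw ingestion, schema-on-read (paths and columns are placeholders).
bronze = spark.readStream.format("delta").load("/lake/bronze/orders")

# Silver: cleaned, de-duplicated, conformed records.
silver = bronze.filter("order_ts IS NOT NULL").dropDuplicates(["order_id"])
(silver.writeStream.format("delta")
   .option("checkpointLocation", "/lake/_chk/silver_orders")
   .start("/lake/silver/orders"))

# Gold: business-level aggregate fed to BI or ML.
gold = (spark.readStream.format("delta").load("/lake/silver/orders")
        .groupBy("customer_id").count())
(gold.writeStream.format("delta")
   .outputMode("complete")
   .option("checkpointLocation", "/lake/_chk/gold_order_counts")
   .start("/lake/gold/order_counts"))
```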
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays – Demi Ben-Ari
Everybody wants to ride the “Big Data” hype cycle, “to do scale,” and to use the coolest tools on the market like Hadoop, Apache Spark, Apache Cassandra, etc.
But do they ask themselves whether there is really a reason for that?
In the talk we’ll give a brief overview of all of the technologies in the Big Data world nowadays, and we’ll talk about the problems that really emerge when you’d like to enter the great world of Big Data handling.
A tour of the Hadoop ecosystem, Apache Spark, and all of the distributed tools leading the market today will give you a notion of the real costs of entering that world.
Promise that I’ll share some stories from the trenches :)
(And about the “pool” thing...I don’t really know how to swim)
Delta from a Data Engineer's Perspective – Databricks
This document describes the Delta architecture, which unifies batch and streaming data processing. Delta achieves this through a continuous data flow model using Structured Streaming. It allows data engineers to read consistent data while it is being written, incrementally read large tables at scale, roll back in case of errors, replay and process historical data along with new data, and handle late-arriving data without delays. Delta uses transaction logging, optimistic concurrency, and Spark to scale metadata handling for large tables. This provides a simplified solution to common challenges data engineers face.
Spark Gotchas and Lessons Learned (2/20/20) – Jen Waller
Presentation from the Boulder/Denver Big Data Meetup on 2/20/2020 in Boulder, CO. Topics covered: Troubleshooting Spark jobs (groupby, shuffle) for big data, tuning AWS EMR Spark clusters, EMR cluster resource utilization, writing scaleable Scala for scanning S3 metadata.
An Insider’s Guide to Maximizing Spark SQL Performance – Takuya UESHIN
This document provides an overview of optimizing Spark SQL performance. It begins with introducing the speaker and their background with Spark. It then discusses reading query plans, interpreting them to understand optimizations, and tuning plans by pushing down filters, avoiding implicit casts, and other techniques. It emphasizes tracking query execution through the Spark UI to analyze jobs, stages and tasks for bottlenecks. The document aims to help understand how to maximize Spark SQL performance.
The document discusses the Delta Architecture, which is a step beyond the Lambda Architecture. It provides a continuous data flow model to unify batch and streaming processing. Some key characteristics of the Delta Architecture include adopting a continuous data flow model, using intermediate hops for reliability and troubleshooting, optimizing storage layout based on access patterns, enabling reprocessing by clearing result tables and restarting streams, and incrementally improving data quality. The architecture helps reduce end-to-end pipeline SLAs, maintenance burden, and infrastructure costs compared to other approaches.
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
Data Lakehouse Symposium | Day 1 | Part 1 – Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2 – Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop – Databricks
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform – Databricks
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with Spark (see the sketch after this list)
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
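The checks above are described at the platform level; as a rough illustration only (not Zillow's actual implementation), a single expectation can boil down to a Spark aggregation like the following, where the table name, columns, and thresholds are hypothetical.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical expectations: listing_id must never be null, price must be non-negative.
df = spark.table("listings")

metrics = df.agg(
    F.count("*").alias("row_count"),
    F.sum(F.col("listing_id").isNull().cast("int")).alias("null_listing_ids"),
    F.sum((F.col("price") < 0).cast("int")).alias("negative_prices"),
).first()

# Flag the dataset before downstream consumers use it.
if metrics.null_listing_ids > 0 or metrics.negative_prices > 0:
    raise ValueError(f"Data quality check failed: {metrics.asDict()}")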
Learn to Use Databricks for Data Science – Databricks
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML Monitoring – Databricks
Application performance monitoring (APM) has become the cornerstone of software engineering, allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix – Databricks
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI Integration – Databricks
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration, and it also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, memory, or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference, and then send the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator-aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance when there is data skew or when the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the TensorFlow Keras API using GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
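As a rough sketch of what the Spark 3.1 stage level scheduling API looks like (values are illustrative, and raw_rdd / preprocess are hypothetical placeholders), a ResourceProfile lets a later stage request different containers:

from pyspark.resource import (ExecutorResourceRequests, TaskResourceRequests,
                              ResourceProfileBuilder)

# Executor and task requirements for the GPU (training) stage; values are illustrative.
exec_reqs = ExecutorResourceRequests().cores(4).memory("16g").resource("gpu", 1)
task_reqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
gpu_profile = ResourceProfileBuilder().require(exec_reqs).require(task_reqs).build

# The ETL stages run with the cluster's default profile; only the stage that feeds
# the deep learning algorithm asks for GPU containers.
prepared = raw_rdd.map(preprocess)                 # hypothetical ETL step
training_input = prepared.withResources(gpu_profile)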
Simplify Data Conversion from Spark to TensorFlow and PyTorch – Databricks
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GB, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess it using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may hit a problem: how can I convert my Spark DataFrame into a format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in Parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all of these engineering frictions greatly reduce data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify this tedious data conversion process. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a TensorFlow model and how simple it is to go from single-node training to distributed training on Databricks.
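As a hedged sketch of what the converter API looks like (the cache path, column names, and the Keras model are hypothetical, and exact behavior may vary by Petastorm version):

from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Directory where the converter materializes the DataFrame as Parquet (hypothetical path).
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///dbfs/tmp/petastorm/cache")

converter = make_spark_converter(preprocessed_df)   # preprocessed_df: your Spark DataFrame

with converter.make_tf_dataset(batch_size=64) as dataset:
    # Batches arrive as namedtuples keyed by column name; map them to (features, label).
    dataset = dataset.map(lambda batch: (batch.features, batch.label))
    model.fit(dataset, steps_per_epoch=100, epochs=1)   # hypothetical Keras model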
Scaling your Data Pipelines with Apache Spark on Kubernetes – Databricks
There is no doubt Kubernetes has emerged as the next generation of cloud-native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both machine learning and large-scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes with the scalable data processing of Apache Spark, you can run data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and how to orchestrate data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered: understanding key traits of Apache Spark on Kubernetes; things to know when running Apache Spark on Kubernetes, such as autoscaling; and a demonstration of running analytics pipelines on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster.
Scaling and Unifying SciKit Learn and Apache Spark Pipelines – Databricks
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications; however, it is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
Sawtooth Windows for Feature Aggregations – Databricks
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency, and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations that are not “abelian groups” operating over change data.
We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
· Why?
o Custom queries on top of a table; we load the data once and query N times
· Why not Structured Streaming
· Working Solution using Redis
Niche 2 : Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis Hashes as distributed counters (see the sketch after this list)
· Precautions for retries and speculative execution
· Pipelining to improve performance
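As an illustrative sketch of the distributed-counter niche (not the exact code from the talk; the host, key names, the status column, and df are hypothetical), each partition batches its HINCRBY calls through a pipeline:

import redis

def count_statuses(rows):
    # One connection per partition; HINCRBY is atomic, so concurrent tasks are safe.
    client = redis.Redis(host="redis-host", port=6379)        # hypothetical endpoint
    pipe = client.pipeline(transaction=False)                 # pipeline to cut round trips
    for row in rows:
        pipe.hincrby("job:record_counts", row["status"], 1)   # Redis hash as the counter
    pipe.execute()

# Beware retries and speculative execution: disable speculation or make the update
# idempotent (e.g., key by a deterministic task identifier) to avoid double counting.
df.foreachPartition(count_statuses)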
Re-imagine Data Monitoring with whylogs and Spark – Databricks
In the era of microservices, decentralized ML architectures, and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly and cumbersome, while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language- and platform-agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
Raven: End-to-end Optimization of ML Prediction Queries – Databricks
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that:
(i) reduce unnecessary computations by passing information between the data processing and ML operators,
(ii) leverage operator transformations (e.g., turning a decision tree into a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
Processing Large Datasets for ADAS Applications using Apache Spark – Databricks
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta Lake – Databricks
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is a set of complex ingestion flows over a mix of normalized and denormalized data with various linkage scenarios, powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements, etc. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade Offs with Various formats
Go over anti-patterns used
(String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
apidays New York 2025 - Spring Modulith Design for Microservices by Renjith R... – apidays
Spring Modulith Design for Microservices
Renjith Ramachandran, Senior Solutions Architect at BJS Wholesale Club
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
Convene 360 Madison, New York
May 14 & 15, 2025
apidays New York 2025 - The Challenge is Not the Pattern, But the Best Integr... – apidays
The Challenge is Not the Pattern, But the Best Integration
Yisrael Gross, CEO at Ammune.ai
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
May 14 & 15, 2025
apidays Singapore 2025 - Enhancing Developer Productivity with UX (Government... – apidays
Enhancing Developer Productivity with UX
Petrine Tang, UX Designer at Government Technology Agency
Faith Ang, Product Manager at Government Technology Agency
apidays Singapore 2025
Where APIs Meet AI: Building Tomorrow's Intelligent Ecosystems
April 15 & 16, 2025
apidays New York 2025 - Open Source and disrupting the travel distribution ec... – apidays
Open Source and disrupting the travel distribution ecosystem
Stu Waldron, Advisor & Acting Director at OpenTravel
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
Convene 360 Madison, New York
May 14 & 15, 2025
apidays New York 2025 - Unifying OpenAPI & AsyncAPI by Naresh Jain & Hari Kri... – apidays
Unifying OpenAPI & AsyncAPI: Designing JSON Schemas+Examples for Reuse
Naresh Jain, Co-founder & CEO at Specmatic
Hari Krishnan, Co-founder & CTO at Specmatic
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
May 14 & 15, 2025
Report based on the findings of quantitative research conducted by the research agency New Image Marketing Group, commissioned by the NGO Detector Media, and compiled by Marta Naumova, PhD in Sociology.
apidays New York 2025 - Why I Built Another Carbon Measurement Tool for LLMs ... – apidays
Why I Built Another Carbon Measurement Tool for LLMs (And What I Learned Along the Way)
Pascal Joly, Sustainability Consultant and Instructor at IT Climate Ed
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
May 14 & 15, 2025
apidays Singapore 2025 - 4 Identity Essentials for Scaling SaaS in Large Orgs... – apidays
4 identity factors you didn't know you needed to support large organizations in your SaaS
Daizen Ikehara, Principal Developer Advocate at Auth0
apidays Singapore 2025
Where APIs Meet AI: Building Tomorrow's Intelligent Ecosystems
April 15 & 16, 2025
apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon... – apidays
Building Finance Innovation Ecosystems
Umang Moondra, CEO at APIX
apidays Singapore 2025
Where APIs Meet AI: Building Tomorrow's Intelligent Ecosystems
April 15 & 16, 2025
apidays New York 2025 - The FINOS Common Domain Model for Capital Markets by ... – apidays
The FINOS Common Domain Model for Capital Markets
Tom Healey, Founder & Director at FINXIS LLC
Daniel Schwartz, Managing Partner at FT Advisory LLC
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
May 14 & 15, 2025
THE FRIEDMAN TEST (Biostatistics, B. Pharm) – JishuHaldar
The Friedman Test is a valuable non-parametric alternative to the Repeated Measures ANOVA, allowing for the comparison of three or more related groups when data is ordinal or not normally distributed. By ranking data instead of using raw values, the test overcomes the limitations of parametric tests, making it ideal for small sample sizes and real-world applications in medicine, psychology, pharmaceutical sciences, and education. However, while it effectively detects differences among groups, it does not indicate which specific groups differ, requiring further post-hoc analysis.
apidays New York 2025 - Breaking Barriers: Lessons Learned from API Integrati... – apidays
Breaking Barriers: Lessons Learned from API Integration with Large Hotel Chains and the Role of Standardization
Constantine Nikolaou, Manager Business Solutions Architect at Booking.com
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
May 14 & 15, 2025
Best Practices for Building Robust Data Platform with Apache Spark and Delta
1. Best Practices for Building Robust Data Platform with Apache Spark & Delta
Vini Jaiswal
Spark+AI Summit - June 2020
https://www.linkedin.com/in/vinijaiswal/
2. Agenda
▪ Data Strategy - Optimizing the cost to drive business value
▪ Performance and tuning with Delta Lake & Apache Spark
▪ Governance and security controls
▪ Bringing it all together - A reference architecture
4. Data Challenges
Data Warehouse limits the potential of intelligence
Data Volume is growing rapidly
More Variety of data -> Different applications
Need for faster processing and scalability
Data silos limit innovation
Promise of the Data Lake:
1. Collect Everything
2. Store it all in the Data Lake
3. Data Science & Machine Learning
7. Usual Data Lake vs. the ideal data lake
Usual Data Lake: no atomicity, no quality enforcement, no consistency / isolation -> ✗ Reliability - High Quality Data
Ideal data lakes with:
● Schema Enforcement
● ACID Transactions
● Time Travel
● Open Standards, Open Source
● Powered by
● Unifies Streaming / Batch
References: https://youtu.be/qtCxNSmTejk
8. Getting the Data Right - Table Categorization
Audience Segmentation
Bronze - Raw Ingestion: CSV, JSON, TXT… data types
Silver - Filtered, Cleaned, Augmented
Gold - Business-level Aggregates
Align with Business Outcomes: Is my data use case worthy? Is my data ready for Analytics / ML?
10. Best Practices for Cluster Sizing & Selection
1. Selection of Instance Types
a. Workload type
b. Use cases
2. Selection of node size
a. Observe Metrics
b. Tweak workloads
11. Best Practices for Cluster Sizing & Selection
1. Selection of Instance Types
a. Workload type
b. Use cases
2. Selection of node size
a. Observe Metrics
b. Tweak workloads
12. Selection of Instance Types

Workload Type     | AWS Type | Azure Type | Recommended Use Case
Memory Optimized  | r5       | Dsv2       | Memory-intensive applications. Use Case: ML workload with data caching
Compute Optimized | c5       | Fsv2       | Structured Streaming, Distributed Analytics, Data Science applications. Use Case: ETL with full file scans and no data reuse
Storage Optimized | i3       | Lsv2       | Use cases that require higher disk throughput and IO. Use Case: Analytics - Storage Optimized i3 class with Delta IO Cache

Reference for Azure Type: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes
Reference for AWS Type: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html
13. Best Practices for Cluster Sizing & Selection
1. Selection of Instance Types
a. Workload type
b. Use cases
2. Selection of node size
a. Observe Metrics
b. Tweak workloads
14. Selection of node size
Rule of thumb
1. Fewer big instances > more small instances
a. (larger heap = larger GC)
b. Multiple executors per machine
2. Size based on the number of tasks initially, tweak later
a. Run the job with a small cluster to get an idea of the # of tasks
b. Observe Cluster metrics for CPU, memory and network utilization
15. Observe Spark UI & tweak the workloads
Fully cached with room to spare? > decrease instances
Almost completely cached? > increase cluster size
Not even close to cached? > consider instance with SSD instead of EBS or use R class
Reference: https://docs.databricks.com/delta/optimizations/delta-cache.html#delta-and-rdd-cache-comparison
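If the caching observations point toward the Delta (IO) cache on SSD-backed instances, the relevant switch on Databricks is a single Spark conf; shown here as an illustrative snippet (it is already on by default on some storage-optimized instance types).

# Enable the Databricks Delta/disk cache so Parquet data read from cloud storage
# is cached on the workers' local SSDs (complementary to Spark's RDD cache).
spark.conf.set("spark.databricks.io.cache.enabled", "true")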
16. Observe Ganglia Metrics & tweak the workloads
○ Are we compute bound?
○ Are we network bound?
○ Are we spilling a ton?
18. Performance Symptoms
Can I make my Spark application run faster? Look for these 4 symptoms:
Shuffle
Spill
Skew
Small Files
19. I found Shuffle, now what?
Use broadcast join; review join order.
Before (Sort Merge Join): query completion time 28 minutes, rows output: 2,509,189,313
After (broadcast join): query completion time 1.8 minutes, rows output: 1023
Reference: https://spark.apache.org/docs/latest/sql-performance-tuning.html#broadcast-hint-for-sql-queries
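A minimal sketch of the broadcast-join hint referenced on this slide (the table and column names are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

fact = spark.table("sales_fact")    # large table (hypothetical)
dim = spark.table("store_dim")      # small dimension table (hypothetical)

# Hint Spark to broadcast the small side so the join avoids shuffling the large table.
joined = fact.join(broadcast(dim), "store_id")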
20. I found Spill, now what? (this example starts from Shuffle Partitions = 16)
● Increase Shuffle Partitions (for this example: 48): set spark.sql.shuffle.partitions=48
● Reduce the number of cores: spark.executor.cores < total cores per worker
● Larger cluster - faster disks (SSDs)
The more spill you can remove, the larger the impact!
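For reference, the two knobs mentioned on this slide look like the following when set from a notebook; 48 is the slide's example value, and spark.executor.cores is normally fixed in the cluster or spark-submit configuration rather than at runtime.

# Raise shuffle partitions so each task's shuffle block fits in memory (slide's example: 48).
spark.conf.set("spark.sql.shuffle.partitions", "48")

# Fewer cores per executor leaves more memory per task; set at cluster creation, e.g.:
#   --conf spark.executor.cores=4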
21. I found Skew, now what?
Symptoms:
● Ganglia CPU usage becomes low for a long time after the initial high usage
● Task duration -> significant difference between the max and the 75% / 25% values
● Input Size/Records
What to do?
● Use broadcast join
● Use Skew Join
● Filter out large keys / salt keys and set up multiple reduce steps
● Explicitly repartition the data on a different field
Reference: https://docs.databricks.com/delta/join-performance/skew-join.html
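One of the remedies listed above, salting the join keys, can be sketched roughly as follows (large_df, small_df, join_key, and the bucket count are hypothetical; the Databricks skew-join hint or AQE's skew handling are alternatives):

from pyspark.sql import functions as F

SALT_BUCKETS = 16  # tune to the degree of skew (assumption)

# Large, skewed side: attach a random salt to every row.
salted_large = large_df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Small side: replicate each row once per salt value so every (key, salt) pair matches.
salts = spark.range(SALT_BUCKETS).select(F.col("id").cast("int").alias("salt"))
replicated_small = small_df.crossJoin(salts)

# Joining on (join_key, salt) spreads hot keys across many tasks.
result = salted_large.join(replicated_small, ["join_key", "salt"])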
22. Adaptive Query Execution
Reduces the manual effort of tuning spark.sql.shuffle.partitions
By default it is turned off; set spark.sql.adaptive.enabled=true
▪ Dynamically changes sort-merge joins into broadcast-hash joins
▪ Dynamically optimizes skew joins
*Available in DBR 7.x / Spark 3.0
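Enabling AQE and its skew-join optimization is a configuration change; an illustrative snippet for Spark 3.0+ / DBR 7.x:

# Let Spark re-optimize plans at runtime: coalesce shuffle partitions,
# switch eligible sort-merge joins to broadcast joins, and split skewed partitions.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")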
23. I found a lot of small files, now what?
Upstream:
● Fix the upstream application building tons of files
● Use a separate tool to compact them before processing with Spark
Changes in the Spark application:
● Write your own compaction job
● Delta solves this problem!
25. Compaction
● Improves the read performance
● Solves the small files problem
Reference: https://docs.delta.io/latest/best-practices.html#compact-files
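The Delta best-practices page referenced above describes compaction as rewriting a table (or partition) into fewer files; a minimal sketch with a hypothetical path and target file count:

path = "/delta/events"   # hypothetical Delta table path
num_files = 16           # target number of files after compaction

(spark.read
      .format("delta")
      .load(path)
      .repartition(num_files)
      .write
      .option("dataChange", "false")   # marks this commit as compaction-only for readers/streams
      .format("delta")
      .mode("overwrite")
      .save(path))

On Databricks, the OPTIMIZE SQL command achieves the same effect with bin-packing.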
26. Auto Optimize
● Optimizes Apache Spark partitions
● Maximizes the throughput of data being written
● Compacts files for partitions
Auto Optimize consists of two complementary features: Optimized Writes and Auto Compaction.
Reference: https://docs.databricks.com/delta/optimizations/auto-optimize.html#auto-optimize
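On Databricks, both features are typically switched on per table through table properties; an illustrative snippet with a hypothetical table name:

# Enable Optimized Writes and Auto Compaction for an existing Delta table.
spark.sql("""
  ALTER TABLE events SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact'   = 'true'
  )
""")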
29. Data Governance with Delta Lake
● Create a retention policy to age out and erase raw data that may contain personal information
● Satisfy compliance requests using UPDATE / DELETE commands
● Create tables that don't contain personal data: high-level aggregates (e.g. # of users that took an action)
● Maintain a historical data repository that is easy to navigate, with pseudonymization
Reference: https://www.youtube.com/watch?v=tCPslvUjG1w
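As an illustrative sketch of the UPDATE/DELETE and retention ideas on this slide (the table name, column names, and the 90-day window are hypothetical):

# Satisfy a right-to-erasure request on a Delta table.
spark.sql("DELETE FROM user_events WHERE user_id = 'subject-123'")

# Retention policy: age out raw rows that may contain personal information.
spark.sql("DELETE FROM user_events WHERE ingest_date < date_sub(current_date(), 90)")

# Physically remove data files no longer referenced by the table
# (168 hours = the default 7-day safety window).
spark.sql("VACUUM user_events RETAIN 168 HOURS")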
31. Governance - The Who/What/Where
Many players in the org: managing access, roles and responsibilities, as well as managing usage, is a must.
● Data Engineer: performs standard extraction, transformation and loading (ETL) tasks and applies best coding practices, including source control, unit tests, and automation
● Data Scientist: drives product innovation with state-of-the-art Machine Learning models applied to big data
● Data/Business Analyst: improves business processes by providing standardized and ad-hoc business analysis; acts as an intermediary between the Analytics and Business teams
● Automated Jobs: perform automated jobs based on Data Engineering configs
32. Bringing it together - A reference pipeline (architecture diagram)
Ingest: BLOB, DB/DW, Streaming
ETL/Data Processing: Delta Pipelines (Bronze -> Silver -> Gold) - massively scalable data cleansing & transformation
Execution: Databricks Runtime (Reliability & Performance) - Optimized Spark Clusters, Storage
Data Science & ML: ML Runtime
Serving (Business Unit): APIs, Jobs, Models, Notebooks, Dashboards
Cross-cutting: Orchestration, CI/CD, Operations & Security
33. Bringing it together - A reference pipeline (the same diagram as the previous slide, annotated with the talk's themes)
Data Strategy
Cost Optimization & Performance Tuning
Business Value
Security