A hands-on tutorial on installing an IPFS node and creating smart contracts that use IPFS for data storage. As an example of IPFS usage in smart contracts, we create an ERC-721 NFT that references a file stored on IPFS (a minimal upload sketch follows the tool list below).
Tools and technologies used in this tutorial:
GCP https://console.cloud.google.com/home
ApiDapp https://apidapp.com/
Etherscan https://kovan.etherscan.io/
Solidity https://solidity.readthedocs.io/en/v0.6.1/
Open Zeppelin https://openzeppelin.com/contracts/
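To ground the storage step, here is a minimal sketch of adding (and pinning) a file on a local IPFS node over its HTTP API and deriving the token URI an ERC-721 contract would reference. It assumes a node listening on the default API port 5001 and a hypothetical artwork.png; the field names follow the public /api/v0/add endpoint.

```python
import requests

# Add (and pin) a file on a local IPFS node; the daemon's HTTP API
# listens on port 5001 by default.
with open("artwork.png", "rb") as f:  # hypothetical asset for the NFT
    resp = requests.post("http://127.0.0.1:5001/api/v0/add", files={"file": f})
resp.raise_for_status()

cid = resp.json()["Hash"]       # content identifier of the uploaded file
token_uri = f"ipfs://{cid}"     # value a tokenURI() implementation can return
print(token_uri)
```

The contract then only stores this short URI string on-chain; the file bytes stay on IPFS, which keeps gas costs low.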
Fine tune and deploy Hugging Face NLP models - OVHcloud
This webinar discusses fine-tuning and deploying Hugging Face NLP models. The agenda includes an overview of Hugging Face and NLP, a demonstration of fine-tuning a model, a demonstration of deploying a model in production, and a summary. Hugging Face is presented as the most popular open source NLP library with over 4,000 models. Fine-tuning models allows them to be adapted for specific tasks and domains and is more data efficient than training from scratch. OVHcloud is highlighted as providing tools for full AI workflows from storage and processing to training and deployment.
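As a rough illustration of the fine-tuning workflow the webinar discusses, the sketch below adapts a pretrained checkpoint to a classification task with the transformers Trainer API. The distilbert-base-uncased checkpoint, the imdb dataset, and the tiny training subset are illustrative assumptions, not choices from the webinar.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")  # assumed example dataset
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=8),
    # small subset so the sketch runs quickly
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
)
trainer.train()
```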
End to end Machine Learning using Kubeflow - Build, Train, Deploy and Manage - Animesh Singh
This document discusses Kubeflow, an end-to-end machine learning platform for Kubernetes. It covers various Kubeflow components like Jupyter notebooks, distributed training operators, hyperparameter tuning with Katib, model serving with KFServing, and orchestrating the full ML lifecycle with Kubeflow Pipelines. It also talks about IBM's contributions to Kubeflow and shows how Watson AI Pipelines can productize Kubeflow Pipelines using Tekton.
This is the presentation I gave at JavaDay Kiev 2015 on the architecture of Apache Spark. It covers the memory model, the shuffle implementations, DataFrames, and some other high-level stuff, and can be used as an introduction to Apache Spark.
Tupperware: Containerized Deployment at FB - Docker, Inc.
Tupperware is Facebook's system for containerized deployment of services at scale. It handles provisioning machines, distributing binaries, monitoring processes, and failover to ensure services run smoothly in production. Engineers focus on application logic rather than deployment details. Tupperware uses Linux containers to isolate over 300,000 processes across 15,000+ services running on thousands of machines. It incorporates features like service discovery, logging, resource limits, and automated rollouts to efficiently manage infrastructure. Lessons learned include releasing often, using canaries for rollouts, and providing sane defaults to reduce user burden.
Log System As Backbone – How We Built the World’s Most Advanced Vector Database - StreamNative
Milvus is an open-source vector database that leverages a novel data fabric to build and manage vector similarity search applications. As the world's most popular vector database, it has already been adopted in production by thousands of companies around the world, including Lucidworks, Shutterstock, and Cloudinary. With the launch of Milvus 2.0, the community aims to introduce a cloud-native, highly scalable and extendable vector similarity solution, and the key design concept is log as data.
Milvus relies on Pulsar as its log pub/sub system. Pulsar helps Milvus reduce system complexity by loosely decoupling each microservice, and it makes the system stateless by disaggregating log storage from computation, which also makes the system more extendable. In this talk we will introduce the overall design and implementation details of Milvus, and its roadmap.
Takeaways:
1) Get a general idea of what a vector database is and its real-world use cases.
2) Understand the major design principles of Milvus 2.0.
3) Learn how to build a complex system with the help of a modern log system like Pulsar.
An introduction to computer vision with Hugging Face - Julien SIMON
In this code-level talk, Julien will show you how to quickly build and deploy computer vision applications based on Transformer models. Along the way, you'll learn about the portfolio of open source and commercial Hugging Face solutions, and how they can help you deliver high-quality solutions faster than ever before.
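For flavor, here is a minimal sketch of the kind of Transformer-based vision inference the talk covers, using the transformers pipeline API; the ViT checkpoint and the local image path are assumptions for illustration.

```python
from transformers import pipeline

# Image classification with a pretrained Vision Transformer checkpoint
classifier = pipeline("image-classification",
                      model="google/vit-base-patch16-224")  # assumed checkpoint

# Any local image path works; returns top labels with confidence scores
for pred in classifier("cat.jpg"):
    print(pred["label"], round(pred["score"], 3))
```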
Deep dive into Kubeflow Pipelines, and details about Tekton backend implementation for KFP, including compiler, logging, artifacts and lineage tracking
The document provides an overview of Vertex AI, Google Cloud's managed machine learning platform. It discusses topics such as managing datasets, building and training machine learning models using both automated and custom approaches, implementing explainable AI, and deploying models. The document also includes references to the Vertex AI documentation and contact information for further information.
Stream Processing – Concepts and Frameworks - Guido Schmutz
More and more data sources today provide a constant stream of data, from IoT devices to social media streams. It is one thing to collect these events at the velocity they arrive without losing a single message; an event hub and a data flow engine can help here. It is another thing to do some (complex) analytics on the data. There is always the option to first store the data in a sink of choice and analyze it later. Storing even a high-volume event stream is feasible and no longer a challenge. But this adds to the end-to-end latency, and it takes minutes if not hours to present results. If you need to react fast, you simply can’t afford to store the data first; you need to process it directly on the data stream. This is called Stream Processing or Stream Analytics. In this talk I will present the important concepts a Stream Processing solution should support, and then dive into some of the most popular frameworks available on the market and how they compare.
Kafka is an open-source distributed commit log service that provides high-throughput messaging functionality. It is designed to handle large volumes of data and different use cases, like online and offline processing, more efficiently than alternatives like RabbitMQ. Kafka works by splitting topics into partitions spread across clusters of machines and replicating those partitions for fault tolerance. It can be used as a central data hub or pipeline for collecting, transforming, and streaming data between systems and applications.
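A minimal sketch of the publish/consume flow described above, using the kafka-python client (an assumed dependency; the broker address and topic name are placeholders):

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: append messages to a topic; Kafka routes them to partitions
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"user-42 viewed product-7")
producer.flush()

# Consumer: read the partitioned log back, starting from the earliest offset
consumer = KafkaConsumer("events",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for msg in consumer:
    print(msg.partition, msg.offset, msg.value)
```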
As enterprises build their cloud adoption strategy, conducting a cloud readiness assessment is of strategic importance. HCL believes that application readiness and compliance controls are the prerequisites for determining a business-oriented cloud adoption strategy.
HCL’s Cloud Assessment and Readiness Tool (CART) enables organizations to assess their cloud readiness using a powerful self-driven analysis. It uses a broad array of 25 technical, business, compliance, and security parameters to perform the cloud-readiness analysis across three aspects: Technical & Architecture, Application, and Security. The tool is an Excel-based workbook with an easy-to-use methodology for filling in the data, and it helps generate a standard, baseline cloud assessment within minutes.
Building an Observability platform with ClickHouse - Altinity Ltd
The document discusses using ClickHouse as the backend storage for observability data in SigNoz, an open source observability platform. ClickHouse is well-suited for storing observability data due to its ability to handle wide tables, perform fast aggregation queries, and provide compression capabilities. The document outlines SigNoz's architecture, demonstrates how tracing and metrics data is modeled in ClickHouse, and shows how ClickHouse outperforms Elasticsearch for ingesting and querying logs and metrics at scale. Overall, ClickHouse is presented as an efficient and less resource-intensive solution than alternatives like Druid for storing observability data, especially for open source projects.
This document outlines the objectives, key concepts, and curriculum for a Big Data and Hadoop training module. The objectives are to understand what Big Data is, the Hadoop ecosystem and its features, career opportunities, and the training curriculum. It defines Big Data, Hadoop, and the Hadoop ecosystem. It discusses the V's of Big Data and domains where Big Data is applicable. It also outlines job roles in the Big Data industry, potential employers, career paths, and the 10-module training curriculum covering topics like Hadoop, MapReduce, Pig, Hive, HBase, Zookeeper and Oozie.
Machine learning is overhyped nowadays. There is a strong belief that this area is exclusively for data scientists with a deep mathematical background who leverage the Python (scikit-learn, Theano, TensorFlow, etc.) or R ecosystems and use specialized tools like MATLAB or Octave. Of course, there is a big grain of truth in this statement, but we Java engineers can also take the best of the machine learning universe from an applied perspective, using our native language and familiar frameworks like Apache Spark. During this introductory presentation, you will get acquainted with the simplest machine learning tasks and algorithms, like regression, classification, and clustering, widen your outlook, and use Apache Spark MLlib to distinguish pop music from heavy metal and simply have fun.
Source code: https://github.com/tmatyashovsky/spark-ml-samples
Design by Yarko Filevych: http://filevych.com/
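The talk itself is Java-oriented, but the same MLlib pipeline translates almost one-to-one to PySpark; the following sketch, with made-up lyric snippets as training rows, shows the tokenize/featurize/classify flow one might use to separate pop from heavy metal:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("genre-classifier").getOrCreate()

# Hypothetical training rows: lyrics labeled 1.0 (heavy metal) or 0.0 (pop)
train = spark.createDataFrame([
    ("we rode the storm of steel and fire", 1.0),
    ("dance all night under neon lights", 0.0),
    ("darkness falls upon the frozen throne", 1.0),
    ("baby you make my heart go round", 0.0),
], ["lyrics", "label"])

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="lyrics", outputCol="words"),     # split into tokens
    HashingTF(inputCol="words", outputCol="features"),   # bag-of-words vectors
    LogisticRegression(maxIter=10),                      # binary classifier
])
model = pipeline.fit(train)
model.transform(train).select("lyrics", "prediction").show(truncate=False)
```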
MLflow is an MLOps tool that enables data scientists to quickly productionize their machine learning projects. To achieve this, MLflow has four major components: Tracking, Projects, Models, and Registry. MLflow lets you train, reuse, and deploy models with any library and package them into reproducible steps. It is designed to work with any machine learning library and requires minimal changes to integrate into an existing codebase. In this session, we will cover common pain points of machine learning developers, such as experiment tracking, reproducibility, deployment tooling, and model versioning. Get ready to get your hands dirty by doing a quick ML project with MLflow and releasing it to production to understand the MLOps lifecycle.
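A minimal sketch of the Tracking component mentioned above, logging a parameter, a metric, and a model artifact for one run; the scikit-learn model and values are illustrative:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

with mlflow.start_run():                      # one tracked experiment run
    mlflow.log_param("C", 0.1)                # hyperparameter
    clf = LogisticRegression(C=0.1, max_iter=200).fit(X, y)
    mlflow.log_metric("train_accuracy", clf.score(X, y))
    mlflow.sklearn.log_model(clf, "model")    # model artifact for the Registry
```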
Knowledge Graphs and Generative AI - GraphSummit Minneapolis Sept 20 - Neo4j
This document discusses using knowledge graphs to ground large language models (LLMs) and improve their abilities. It begins with an overview of generative AI and LLMs, noting their opportunities but also challenges like lack of knowledge and inability to verify sources. The document then proposes using a knowledge graph like Neo4j to provide context and ground LLMs, describing how graphs can be enriched with algorithms, embeddings and other data. Finally, it demonstrates how contextual searches and responses can be improved by retrieving relevant information from the knowledge graph to augment LLM responses.
This presentation discusses the following topics:
What is Hadoop?
Need for Hadoop
History of Hadoop
Hadoop Overview
Advantages and Disadvantages of Hadoop
Hadoop Distributed File System
Comparing: RDBMS vs. Hadoop
Advantages and Disadvantages of HDFS
Hadoop frameworks
Modules of Hadoop frameworks
Features of Hadoop
Hadoop Analytics Tools
MLOps (a compound of “machine learning” and “operations”) is a practice for collaboration and communication between data scientists and operations professionals to help manage the production machine learning lifecycle. Similar to the DevOps term in the software development world, MLOps looks to increase automation and improve the quality of production ML while also focusing on business and regulatory requirements. MLOps applies to the entire ML lifecycle - from integrating with model generation (software development lifecycle, continuous integration/continuous delivery), orchestration, and deployment, to health, diagnostics, governance, and business metrics.
To watch the full presentation click here: https://info.cnvrg.io/mlopsformachinelearning
In this webinar, we’ll discuss core practices in MLOps that will help data science teams scale to the enterprise level. You’ll learn the primary functions of MLOps and what tasks are suggested to accelerate your team’s machine learning pipeline. Join us in a discussion with cnvrg.io Solutions Architect Aaron Schneider, and learn how teams use MLOps for more productive machine learning workflows.
- Reduce friction between science and engineering
- Deploy your models to production faster
- Health, diagnostics and governance of ML models
- Kubernetes as a core platform for MLOps
- Support advanced use-cases like continual learning with MLOps
- What are Internal Developer Portals (IDP) and Platform Engineering?
- What is Backstage?
- How Backstage can help developers build a developer portal that makes their jobs easier
Jirayut Nimsaeng
Founder & CEO
Opsta (Thailand) Co., Ltd.
Youtube Record: https://youtu.be/u_nLbgWDwsA?t=850
Dev Mountain Tech Festival @ Chiang Mai
November 12, 2022
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, GetInData
Did you like it? Check out our E-book: Apache NiFi - A Complete Guide
https://ebook.getindata.com/apache-nifi-complete-guide
Apache NiFi is one of the most popular services for running ETL pipelines, even though it’s not the youngest technology. The talk describes the details of migrating pipelines from an old Hadoop platform to Kubernetes, managing everything as code, monitoring all of NiFi’s corner cases, and making it a robust solution that is user-friendly even for non-programmers.
Author: Albert Lewandowski
Linkedin: https://www.linkedin.com/in/albert-lewandowski/
___
GetInData is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of the best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets.
Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies, including Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle, and many others from the pharmaceutical, media, finance, and FMCG industries.
https://getindata.com
A Comprehensive Review of Large Language Models for Code Generation - SaiPragnaKancheti
The document presents a review of large language models (LLMs) for code generation. It discusses different types of LLMs including left-to-right, masked, and encoder-decoder models. Existing models for code generation like Codex, GPT-Neo, GPT-J, and CodeParrot are compared. A new model called PolyCoder with 2.7 billion parameters trained on 12 programming languages is introduced. Evaluation results show PolyCoder performs less well than comparably sized models but outperforms others on C language tasks. In general, performance improves with larger models and longer training, but training solely on code can be sufficient or advantageous for some languages.
Designing a complete CI/CD pipeline using Argo Events, Workflow and CD products - Julian Mazzitelli
https://www.youtube.com/watch?v=YmIAatr3Who
Presented at Cloud and AI DevFest GDG Montreal on September 27, 2019.
Are you looking to get more flexibility out of your CI/CD platform? Interested in how GitOps fits into the mix? Learn how Argo CD, Workflows, and Events can be combined to craft custom CI/CD flows, all while staying Kubernetes native, enabling you to leverage existing observability tooling.
Apache Kafka is a high-throughput distributed messaging system that allows for both streaming and offline log processing. It uses Apache Zookeeper for coordination and supports activity stream processing and real-time pub/sub messaging. Kafka bridges the gaps between pure offline log processing and traditional messaging systems by providing features like batching, transactions, persistence, and support for multiple consumers.
Modern machine learning (ML) workloads, such as deep learning and large-scale model training, are compute-intensive and require distributed execution. Ray is an open-source, distributed framework from U.C. Berkeley’s RISELab that easily scales Python applications and ML workloads from a laptop to a cluster, with an emphasis on the unique performance challenges of ML/AI systems. It is now used in many production deployments.
This talk will cover Ray’s overview, architecture, core concepts, and primitives, such as remote Tasks and Actors; briefly discuss Ray native libraries (Ray Tune, Ray Train, Ray Serve, Ray Datasets, RLlib); and Ray’s growing ecosystem.
Through a demo using XGBoost for classification, we will demonstrate how you can scale training, hyperparameter tuning, and inference—from a single node to a cluster, with tangible performance difference when using Ray.
The takeaways from this talk are:
Learn Ray architecture, core concepts, and Ray primitives and patterns
Why Distributed computing will be the norm not an exception
How to scale your ML workloads with Ray libraries:
Training on a single node vs. Ray cluster, using XGBoost with/without Ray
Hyperparameter search and tuning, using XGBoost with Ray Tune
Inferencing at scale, using XGBoost with/without Ray
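As a concrete taste of the remote Tasks and Actors primitives mentioned above, here is a minimal Ray sketch; the function, class, and values are illustrative:

```python
import ray

ray.init()  # start Ray locally; the same code scales to a cluster

@ray.remote
def square(x):
    """A stateless remote task, executed in parallel on workers."""
    return x * x

@ray.remote
class Counter:
    """A stateful actor: method calls run serially against its state."""
    def __init__(self):
        self.n = 0

    def incr(self):
        self.n += 1
        return self.n

print(ray.get([square.remote(i) for i in range(4)]))       # [0, 1, 4, 9]
counter = Counter.remote()
print(ray.get([counter.incr.remote() for _ in range(3)]))  # [1, 2, 3]
```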
Building an ML Platform with Ray and MLflow - Databricks
This document summarizes a talk on building an ML platform with Ray and MLflow. Ray is an open-source framework for distributed computing and machine learning. It provides libraries like Ray Tune for hyperparameter tuning and Ray Serve for model serving. MLflow is a tool for managing the machine learning lifecycle including tracking experiments, managing models, and deploying models. The talk demonstrates how to build an end-to-end ML platform by integrating Ray and MLflow for distributed training, hyperparameter tuning, model tracking, and low-latency serving.
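A rough sketch of the tuning side of that combination, using a Ray 1.x-style Tune API with a toy objective; the search space and trainable are assumptions, and Tune’s reporting calls have shifted names across Ray releases:

```python
from ray import tune

def train_fn(config):
    # Toy objective standing in for real model training
    loss = (config["lr"] - 0.05) ** 2
    tune.report(loss=loss)  # Ray 1.x-style metric reporting

analysis = tune.run(
    train_fn,
    config={"lr": tune.uniform(0.001, 0.1)},  # hyperparameter search space
    num_samples=8,
    metric="loss",
    mode="min",
)
print(analysis.best_config)  # best hyperparameters found
```

Each trial’s parameters and metrics can then be logged to MLflow Tracking, which is the integration the talk describes.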
Distributed computing with Ray. Find your hyper-parameters, speed up your Pandas pipelines - Jan Margeta
In this talk we will explore Ray - a high-performance and low latency distributed execution framework which will allow you to run your Python code on multiple cores, and scale the same code from your laptop to a large cluster.
Ray uses several interesting ideas, like actors, a fast zero-copy shared-memory object store, and bottom-up scheduling. Moreover, on top of a succinct API, Ray builds tools to make your Pandas pipelines faster, tools that find the best hyper-parameters for your machine learning models or train state-of-the-art reinforcement learning algorithms, and much more. Come to the talk and learn some more.
Updated the talk with Kubernetes
https://www.pydays.at/
AI/ML Infra Meetup | The power of Ray in the era of LLM and multi-modality AI - Alluxio, Inc.
AI/ML Infra Meetup
Nov. 7, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Zhe Zhang (Distinguished Engineer @ NVIDIA)
In this talk, Zhe Zhang (NVIDIA, ex-Anyscale) introduced Ray and its applications in the LLM and multi-modal AI era. He shared his perspective on ML infrastructure, noting that it presents more unstructured challenges, and recommended using Ray and Alluxio as solutions for increasingly data-intensive multi-modal AI workloads.
Ray (https://github.com/ray-project/ray) is a framework developed at UC Berkeley and maintained by Anyscale for building distributed AI applications. Over the last year, the broader machine learning ecosystem has been rapidly adopting Ray as the primary framework for distributed execution. In this talk, we will overview how libraries such as Horovod (https://horovod.ai/), XGBoost, and Hugging Face Transformers, have integrated with Ray. We will then showcase how Uber leverages Ray and these ecosystem integrations to simplify critical production workloads at Uber. This is a joint talk between Anyscale and Uber.
Distributed computing and hyper-parameter tuning with Ray - Jan Margeta
Come for a while and learn how to execute your Python computations in parallel, to seamlessly scale the training of your machine learning models from a single machine to a cluster, find the right hyper-parameters, or accelerate your Pandas pipelines with large dataframes.
In this talk, we will cover Ray in action with plenty of examples. Ray is a flexible, high-performance distributed execution framework. Ray is well suited to deep learning workflows but its utility goes far beyond that.
Ray has several interesting and unique components, such as actors, or Plasma - an in-memory object store with zero-copy reads (particularly useful for working on large objects), and includes powerful hyper-parameter tuning tools.
We will compare Ray with its alternatives such as Dask or Celery, and see when it is more convenient or where it might even completely replace them.
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark - Databricks
With the rapid evolution of AI in recent years, we need to embrace advanced and emerging AI technologies to gain insights and make decisions based on massive amounts of data. Ray (https://github.com/ray-project/ray) is a fast and simple framework open-sourced by UC Berkeley RISELab particularly designed for easily building advanced AI applications in a distributed fashion.
The document provides an overview of data science with Python and integrating Python with Hadoop and Apache Spark frameworks. It discusses:
- Why Python should be integrated with Hadoop and the ecosystem including HDFS, MapReduce, and Spark.
- Key concepts of Hadoop including HDFS for storage, MapReduce for processing, and how Python can be integrated via APIs.
- Benefits of Apache Spark like speed, simplicity, and efficiency through its RDD abstraction and how PySpark enables Python access.
- Examples of using Hadoop Streaming and PySpark to analyze data and determine word counts from documents.
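A minimal PySpark word-count sketch matching the last example above (the input path is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

counts = (spark.sparkContext.textFile("hdfs:///data/docs/*.txt")  # placeholder path
          .flatMap(lambda line: line.split())    # split lines into words
          .map(lambda word: (word.lower(), 1))   # pair each word with a count of 1
          .reduceByKey(lambda a, b: a + b))      # sum counts per word

# Print the ten most frequent words
for word, count in counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(word, count)
```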
Highly-scalable Reinforcement Learning RLlib for Real-world Applications - Bill Liu
This document discusses Ray and RLlib, a reinforcement learning library built on Ray. It provides three key points:
1. Ray is a framework for building distributed applications and services with shared memory abstraction. It allows ML workloads to scale beyond a single machine.
2. RLlib is a scalable reinforcement learning library that uses Ray. It supports a wide range of algorithms and execution models. This allows for easy implementation and comparison of RL techniques.
3. The Ray community is growing rapidly and provides resources like tutorials and Slack support to help users adopt Ray and RLlib for their distributed Python and reinforcement learning applications.
Ray: The alternative to distributed frameworks - Andrew Li
Ray is a distributed framework for building and running large-scale machine learning applications. It has four main components:
1. A global control store that maintains metadata and enables fault tolerance.
2. A bottom-up distributed scheduler that considers data locality while scheduling tasks.
3. An in-memory distributed object store for sharing data between tasks with low latency.
4. Tools for deploying Ray clusters on cloud platforms and Kubernetes.
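Component 3 above, the in-memory object store, is what ray.put and ray.get expose; here is a small sketch of sharing one large object across tasks without copying it to each (array size and task count are arbitrary):

```python
import numpy as np
import ray

ray.init()

big_array = np.zeros((1000, 1000))
ref = ray.put(big_array)  # store once in the shared-memory object store

@ray.remote
def column_sum(arr):
    # Workers on the same node read the array zero-copy from the store
    return arr.sum(axis=0)

# Pass the reference, not the data; Ray resolves it before the task runs
results = ray.get([column_sum.remote(ref) for _ in range(4)])
print(len(results), results[0].shape)
```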
Ray Serve: A new scalable machine learning model serving library on Ray - Simon Mo
When a machine learning model needs to be served for interactive use cases, the model is either wrapped inside a Flask server or deployed using external services like SageMaker. Both methods come with flaws. In this talk, you will learn how Ray Serve uses Ray to address the limitations of current approaches and enable scalable model serving.
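For orientation, a minimal Ray Serve deployment sketch; it assumes a recent Ray version (the Serve API has changed across releases), and the doubling "model" is a stand-in for a real one:

```python
import ray
from ray import serve

ray.init()

@serve.deployment  # a replicated, scalable service managed by Serve
class ModelServer:
    def __init__(self):
        self.model = lambda x: 2 * x  # stand-in for a trained model

    async def __call__(self, request):
        payload = await request.json()
        return {"prediction": self.model(payload["x"])}

serve.run(ModelServer.bind())  # serves HTTP on localhost:8000 by default
# e.g. curl -X POST localhost:8000/ -d '{"x": 21}'  ->  {"prediction": 42}
```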
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
The Future of Computing is Distributed
Professor Ion Stoica, UC Berkeley RISELab
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Operationalizing Machine Learning at Scale with Sameer Nori - Databricks
Machine learning has quickly become the hot new tool in the big data ecosystem. Virtually every organization is looking to leverage machine learning and build deeper and richer predictive analytics into their applications.
How does this work though, in practice? What are the challenges organizations run into as they look to move hundreds of models into production? How can they make the age of both data and models closer to real-time?
This session will focus on how leading practitioners have been able to scale their machine learning deployments in production with the MapR Converged Data Platform.
Use cases that will be featured include autonomous cars and analytics as a service for retail and financial services.
Introduction to NetGuardians' Big Data Software Stack - Jérôme Kehrli
NetGuardians runs its Big Data Analytics Platform on three key Big Data components underneath: ElasticSearch, Apache Mesos, and Apache Spark. This is a presentation of the behaviour of this software stack.
In this one-day workshop, we will introduce Spark at a high level. Spark is fundamentally different from writing MapReduce jobs, so no prior Hadoop experience is needed. You will learn how to interact with Spark on the command line and conduct rapid in-memory data analyses. We will then work on writing Spark applications to perform large cluster-based analyses, including SQL-like aggregations, machine learning applications, and graph algorithms. The course will be conducted in Python using PySpark.
Energy companies deal with huge amounts of data, and Apache Spark is an ideal platform to develop machine learning applications for forecasting and pricing. In this talk, we will discuss how Apache Spark’s MLlib library can be used to build scalable analytics for clustering, classification, and forecasting, primarily for energy applications using electricity and weather datasets. Through a demo, we will illustrate a workflow approach to accomplish an end-to-end pipeline from data pre-processing to deployment for the above use case using PySpark, Python, etc.
This document provides an overview of next generation analytics with YARN, Spark and GraphLab. It discusses how YARN addressed limitations of Hadoop 1.0 like scalability, locality awareness and shared cluster utilization. It also describes the Berkeley Data Analytics Stack (BDAS) which includes Spark, and how companies like Ooyala and Conviva use it for tasks like iterative machine learning. GraphLab is presented as ideal for processing natural graphs and the PowerGraph framework partitions such graphs for better parallelism. PMML is introduced as a standard for defining predictive models, and how a Naive Bayes model can be defined and scored using PMML with Spark and Storm.
This document summarizes the upcoming features in Spark 2.0, including major performance improvements from Tungsten optimizations, unifying DataFrames and Datasets into a single API, and new capabilities for streaming data with Structured Streaming. Spark 2.0 aims to further simplify programming models while delivering up to 10x speedups for queries through compiler techniques that generate efficient low-level execution plans.
Using pySpark with Google Colab & Spark 3.0 preview - Mario Cartia
Spark is an open source engine for large-scale data processing. It provides APIs in multiple languages including Python. Spark includes libraries for SQL, streaming, machine learning and runs on single machines or clusters. The document discusses Spark's architecture, RDD API, transformations and actions, shuffle operations, persistence, and structured APIs like DataFrames and SQL.
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
Data Lakehouse Symposium | Day 1 | Part 1 - Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2 - Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop - Databricks
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform - Databricks
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
- Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
- Performing data quality validations using libraries built to work with Spark
- Dynamically generating pipelines that can be abstracted away from users
- Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
- Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
Learn to Use Databricks for Data Science - Databricks
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML Monitoring - Databricks
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix - Databricks
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI Integration - Databricks
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance there is data skew or perhaps the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
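A sketch of what the stage-level scheduling API looks like from PySpark, requesting GPU executors only for the training stage; the resource amounts are illustrative and assume Spark 3.1+ on a cluster with GPU discovery configured:

```python
from pyspark.resource import (ExecutorResourceRequests, ResourceProfileBuilder,
                              TaskResourceRequests)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stage-level-demo").getOrCreate()

# ETL stage runs with the default resource profile
etl_rdd = spark.sparkContext.parallelize(range(1000)).map(lambda x: x * 2)

# Build a profile requesting GPU executors for the training stage
ereqs = ExecutorResourceRequests().cores(4).memory("8g").resource("gpu", 1)
treqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
gpu_profile = ResourceProfileBuilder().require(ereqs).require(treqs).build

# Only the stages computing this RDD get containers matching gpu_profile
trained = etl_rdd.withResources(gpu_profile).mapPartitions(
    lambda part: [sum(part)])  # stand-in for a GPU training step
print(trained.collect())
```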
Simplify Data Conversion from Spark to TensorFlow and PyTorch - Databricks
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
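A condensed sketch of that converter flow, based on Petastorm's Spark converter API; the cache directory, DataFrame contents, and downstream model are placeholders:

```python
from petastorm.spark import SparkDatasetConverter, make_spark_converter
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("converter-demo").getOrCreate()

# Intermediate files are materialized under this cache directory
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///tmp/petastorm_cache")

df = spark.range(1000).selectExpr("CAST(id AS FLOAT) AS feature",
                                  "CAST(id % 2 AS LONG) AS label")
converter = make_spark_converter(df)  # Spark DataFrame -> reusable converter

with converter.make_tf_dataset(batch_size=32) as dataset:
    # dataset is a tf.data.Dataset yielding named-tuple batches;
    # a Keras model could consume it via model.fit(...)
    for batch in dataset.take(1):
        print(batch.feature.shape, batch.label.shape)
```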
Scaling your Data Pipelines with Apache Spark on Kubernetes - Databricks
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both machine learning and large-scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes with the scalable data processing of Apache Spark, you can run any data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
- Understanding key traits of Apache Spark on Kubernetes
- Things to know when running Apache Spark on Kubernetes, such as autoscaling
- Demonstration of running analytics pipelines on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster
Scaling and Unifying SciKit Learn and Apache Spark Pipelines - Databricks
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
Sawtooth Windows for Feature Aggregations - Databricks
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency while still maintaining high throughput, low read latency, and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations that are not “abelian groups” and that operate over change data.
We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1: Long-Running Spark Batch Job – Dispatch New Jobs by Polling a Redis Queue
· Why? Custom queries on top of a table; we load the data once and query N times
· Why not Structured Streaming?
· Working solution using Redis
Niche 2: Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis hashes as distributed counters (see the sketch below)
· Precautions for retries and speculative execution
· Pipelining to improve performance
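A minimal sketch of the distributed-counter idea (the hostname and key names are illustrative; note the retry caveat in the comments):

import redis
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

def count_partition(rows):
    # One connection per partition. HINCRBY is atomic, so concurrent
    # executors can safely update the same hash field.
    r = redis.Redis(host="redis-host", port=6379)  # illustrative host
    n = sum(1 for _ in rows)
    # Caveat: task retries and speculative execution can apply this
    # increment twice; guard with an idempotency key in real jobs.
    r.hincrby("job:counters", "rows_processed", n)

df.foreachPartition(count_partition)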
Re-imagine Data Monitoring with whylogs and SparkDatabricks
In the era of microservices, decentralized ML architectures, and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly and cumbersome, while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open-source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language- and platform-agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch, and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large-scale data profiling, and we will show how users can apply this integration to existing data and ML pipelines.
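For flavor, single-node profiling with whylogs looks roughly like this (a sketch against the whylogs v1 Python API with toy data; the Spark integration exposes the same kind of profiles at scale):

import pandas as pd
import whylogs as why

df = pd.DataFrame({"age": [25, 31, 47], "country": ["US", "DE", "IN"]})

results = why.log(df)            # profile the dataframe
profile_view = results.view()    # lightweight statistical profile
print(profile_view.to_pandas())  # per-column summary statistics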
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that:
(i) reduce unnecessary computations by passing information between the data processing and ML operators;
(ii) leverage operator transformations (e.g., turning a decision tree into a SQL expression or an equivalent neural network) to map operators to the right execution engine (a sketch of this idea follows the list); and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
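As a sketch of rule (ii), here is one way (not necessarily Raven's) to compile a trained decision tree into an equivalent SQL CASE expression; the column names and toy data are illustrative:

from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]
clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
cols = ["f0", "f1"]  # hypothetical input column names

def to_sql(t, node=0):
    if t.children_left[node] == -1:  # leaf: emit the predicted class
        return str(int(t.value[node].argmax()))
    cond = f"{cols[t.feature[node]]} <= {t.threshold[node]:.6f}"
    left = to_sql(t, t.children_left[node])
    right = to_sql(t, t.children_right[node])
    return f"CASE WHEN {cond} THEN {left} ELSE {right} END"

print(f"SELECT {to_sql(clf.tree_)} AS prediction FROM inputs")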
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta LakeDatabricks
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data with various linkage scenarios, powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated on multiple platforms and channels, such as email and advertisements. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
What are we storing?
Multi-Source, Multi-Channel Problem
Data Representation and Nested Schema Evolution (sketch below)
Performance Trade-offs with Various Formats
Go over anti-patterns used (String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe Them Away
Staging Tables FTW
Data Lake Replication Lag Tracking
Performance Time!
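As a small sketch of the nested-schema-evolution point (requires the delta-spark package on the classpath; the path and fields are illustrative, not Adobe's actual schema):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df1 = spark.createDataFrame([(1, {"name": "a"})],
                            "id INT, profile STRUCT<name: STRING>")
df1.write.format("delta").mode("overwrite").save("/tmp/profiles")

# A later batch adds a nested field; mergeSchema evolves the table's
# schema in place instead of failing the write.
df2 = spark.createDataFrame([(2, {"name": "b", "email": "b@x.com"})],
                            "id INT, profile STRUCT<name: STRING, email: STRING>")
(df2.write.format("delta")
     .option("mergeSchema", "true")
     .mode("append")
     .save("/tmp/profiles"))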
SAP Extended Warehouse Management (EWM) is a part of SAP S/4HANA offering advanced warehouse and logistics capabilities. It enables efficient handling of goods movement, storage, and inventory in real-time.
PSUG 7 - 2025-06-03 - David Bianco on Splunk SURGeTomas Moser
Meet David Bianco, Staff Strategist with Splunk’s elite SURGe team, live in Prague. Get ready for an engaging deep dive into the cutting edge of cybersecurity—straight from the experts driving Splunk’s global security research.
apidays Singapore 2025 - Enhancing Developer Productivity with UX (Government...apidays
Enhancing Developer Productivity with UX
Petrine Tang, UX Designer at Government Technology Agency
Faith Ang, Product Manager at Government Technology Agency
apidays Singapore 2025
Where APIs Meet AI: Building Tomorrow's Intelligent Ecosystems
April 15 & 16, 2025
apidays New York 2025 - Why an SDK is Needed to Protect APIs from Mobile Apps...apidays
Why an SDK is Needed to Protect APIs from Mobile Apps
Pearce Erensel, Global VP of Sales at Approov Mobile Security
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
May 14 & 15, 2025
apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...apidays
Building Finance Innovation Ecosystems
Umang Moondra, CEO at APIX
apidays Singapore 2025
Where APIs Meet AI: Building Tomorrow's Intelligent Ecosystems
April 15 & 16, 2025
apidays Singapore 2025 - The Future of AI Autonomy: How APIs Enable Agentic A...apidays
Aksel Yap, Principal Solution Engineer at Mulesoft from Salesforce
Jonathan Seetoh, Lead Solution Engineer at Mulesoft from Salesforce
apidays Singapore 2025
Where APIs Meet AI: Building Tomorrow's Intelligent Ecosystems
April 15 & 16, 2025
apidays New York 2025 - CIAM in the wild by Michael Gruen (Layr)apidays
CIAM in the wild: What we learned while scaling from 1.5 to 3 million users
Michael Gruen, VP of Engineering at Layr
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
Convene 360 Madison, New York
May 14 & 15, 2025
apidays New York 2025 - Building Green Software by Marissa Jasso & Katya Drey...apidays
Building Green Software: How Cloud-Native Platforms Can Power Sustainable App Development
Katya Dreyer-Oren, Lead Software Engineer at Heroku (Salesforce)
Marissa Jasso, Product Manager at Heroku (Salesforce)
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
Convene 360 Madison, New York
May 14 & 15, 2025
Ever wondered how to inject your dashboards with the power of Python? This presentation shows how combining Tableau with Python can unlock advanced analytics, predictive modeling, and automation that will make your dashboards not just smarter, but practically psychic.
A report based on the findings of quantitative research conducted by the research agency New Image Marketing Group, commissioned by the NGO Detector Media and compiled by PhD in Sociology Marta Naumova.
apidays New York 2025 - Building Agentic Workflows with FDC3 Intents by Nick ...apidays
Building Agentic Workflows with FDC3 Intents
Nick Kolba, Co-founder & CEO at Connectifi
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
Convene 360 Madison, New York
May 14 & 15, 2025
apidays New York 2025 - Open Source and disrupting the travel distribution ec...apidays
Open Source and disrupting the travel distribution ecosystem
Stu Waldron, Advisor & Acting Director at OpenTravel
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
Convene 360 Madison, New York
May 14 & 15, 2025
6. Chart: Python usage growth, 2012–2016.
Python growth is driven by ML/AI and other data science workloads, while model sizes, and therefore compute requirements, are outstripping Moore's Law: roughly 35x every 18 months versus Moore's Law's 2x every 18 months. Hence, there is a pressing need for robust, easy-to-use solutions for distributed Python.
12. Diverse Compute Requirements Motivated the Creation of Ray!
Diagram: a complex agent pairs a simulator (game engine, robot sim, factory floor sim, …) with neural network “stuff”, with repeated play, over and over again, to train for achieving the best reward.
16. Production Is a Pain
Diagram: an API Gateway fronting REST microservices (µ-service 1, µ-service 2, µ-service 3), each replicated a different number of times; simulators are microservices, too.
! Each microservice has a different number of instances for scalability & resiliency
! But they have to be managed explicitly
20. Where Spark Excels
▪ Massive-scale data sets
▪ Uniform records with a schema
▪ Efficient, parallelized transformations
▪ SQL
▪ Batch analytics
▪ Stream processing
▪ Intuitive, high-level abstractions for data science & engineering tasks
21. Where Ray Excels
▪ Highly non-uniform data graphs
▪ Think typical “in-memory object models”, but distributed
▪ Handles distributed state intuitively
▪ Highly non-uniform task graphs
▪ Small to large scale tasks
▪ Intuitive API for the “90%” of cases
▪ Supports compute problems ranging from general services, games, and simulators to stochastic gradient descent, HPO, …
23. If you’re already using these…
▪ asyncio
▪ joblib
▪ multiprocessing.Pool
▪ Use Ray’s implementations
▪ Drop-in replacements
▪ Change import statements
▪ Break the one-node limitation!
For example, from this:
from multiprocessing.pool import Pool
To this:
from ray.util.multiprocessing.pool import Pool
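A minimal, self-contained sketch of the swap (assuming a local Ray install; the same code fans out across a cluster unchanged):

import ray
from ray.util.multiprocessing import Pool  # drop-in for multiprocessing.Pool

ray.init()  # connect to a cluster if one exists, else start Ray locally

def square(x):
    return x * x

pool = Pool()  # workers are Ray processes, potentially on many nodes
print(pool.map(square, range(10)))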
24. Ray Community and Resources
▪ ray.io - entry point for all things Ray
▪ Tutorials: anyscale.com/academy
▪ github.com/ray-project/ray
▪ anyscale.com/events