
NVIDIA Dynamo Adds GPU Autoscaling, Kubernetes Automation, and Networking Optimizations


At NVIDIA GTC 2025, we announced NVIDIA Dynamo, a high-throughput, low-latency open-source inference serving framework for deploying generative AI and reasoning models in large-scale distributed environments. 

The latest v0.2 release of Dynamo includes:

  • A planner for prefill and decode GPU autoscaling.
  • Kubernetes automation for large-scale Dynamo deployments.
  • Support for AWS Elastic Fabric Adapter (EFA) for internode data transfers on AWS.

In this post, we’ll walk through these features and how they can help you get more out of your GPU investments.

GPU autoscaling for disaggregated serving workloads

One of the key drivers behind the rapid adoption of cloud computing in the early 2000s was autoscaling—the ability to automatically adjust compute capacity based on real-time demand. By eliminating the need to provision infrastructure for peak loads in advance, autoscaling offers both cost efficiency and operational flexibility. While the concept is well established, applying it effectively to LLM inference workloads remains a significant challenge.

Traditional autoscaling relies on straightforward metrics like queries per second (QPS). However, in modern LLM-serving environments where not all inference requests are equal—particularly those using techniques like disaggregated serving—QPS alone is a poor predictor of system load. A single request with a long input sequence length (ISL) can consume significantly more resources than multiple requests with short ISLs.

Additionally, long ISLs place pressure on prefill GPUs, while long output sequence lengths (OSLs) stress decode GPUs. In a disaggregated setup, the workload distribution can shift dramatically from moment to moment, making autoscaling strategies based on naive metrics such as QPS ineffective. This complexity makes it difficult to balance load across prefill and decode GPUs. 
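
To make this concrete, consider the toy Python calculation below (the request mixes are invented for illustration): two workloads with identical QPS place very different loads on the prefill and decode GPUs.

```python
# Toy illustration (not Dynamo code): why QPS alone misleads.
# Prefill cost roughly tracks input tokens (ISL); decode cost tracks output tokens (OSL).

def load(requests):
    """requests: list of (isl, osl) token counts observed in one second."""
    prefill_tokens = sum(isl for isl, _ in requests)
    decode_tokens = sum(osl for _, osl in requests)
    return {"qps": len(requests),
            "prefill_tokens": prefill_tokens,
            "decode_tokens": decode_tokens}

short_requests = [(128, 128)] * 10      # 10 chat-style requests
long_context = [(32_000, 256)] * 10     # 10 long-document requests

print(load(short_requests))  # {'qps': 10, 'prefill_tokens': 1280, 'decode_tokens': 1280}
print(load(long_context))    # {'qps': 10, 'prefill_tokens': 320000, 'decode_tokens': 2560}
```

Both mixes report the same QPS, yet the second one demands roughly 250x more prefill work, which is exactly the signal a QPS-based autoscaler never sees.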

What’s needed is a purpose-built planning engine—one that understands LLM-specific inference patterns and disaggregated serving architectures. It must decide not only when to scale, but what kind of GPUs to scale and in which direction. It should also support local experimentation, so developers can factor in demand fluctuations during prototyping, and it should run alongside Kubernetes autoscaling in large-scale production deployments.

With the v0.2 release of Dynamo, we’re introducing the first version of the NVIDIA Dynamo Planner—an inference-aware autoscaler for disaggregated serving. Developers using the Dynamo Serve CLI (Command Line Interface) can now run the GPU Planner alongside their deployments, enabling it to monitor workload patterns and dynamically manage compute resources across prefill and decode phases.

Figure 1. The Dynamo Planner uses prefill- and decode-specific metrics to scale GPUs up and down in disaggregated setups, ensuring optimal GPU utilization and reducing inference costs

How Dynamo Planner works

  • Monitors the average KV block utilization across all decode GPUs.
  • Tracks the volume of pending prefill requests in the global prefill queue.
  • Compares these values against heuristic thresholds, which developers can manually define during deployment setup.
  • Automatically rebalances resources—either shifting GPUs between prefill and decode phases or provisioning new GPUs from a shared pool.
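
Putting these steps together, here is a minimal Python sketch of such a control loop. The metric names, thresholds, and scaling hooks are invented placeholders for illustration, not the Dynamo Planner's actual implementation.

```python
# Minimal sketch of the planning loop described above, assuming hypothetical
# `metrics` and `scaler` objects; illustrative only, not Dynamo Planner source.
import time

KV_UTIL_HIGH = 0.90        # decode KV block utilization ceiling (assumed threshold)
KV_UTIL_LOW = 0.40         # floor below which decode capacity can be released
PREFILL_QUEUE_HIGH = 32    # pending prefill requests that trigger a scale-up

def plan_once(metrics, scaler):
    """One planning pass over the current cluster state."""
    kv_util = metrics.avg_decode_kv_block_utilization()   # decode-side signal
    queued = metrics.pending_prefill_requests()            # prefill-side signal

    # Compare against the developer-defined thresholds and rebalance.
    if queued > PREFILL_QUEUE_HIGH:
        scaler.add_prefill_worker()          # prefill is the bottleneck
    if kv_util > KV_UTIL_HIGH:
        scaler.add_decode_worker()           # decode KV memory is under pressure
    elif kv_util < KV_UTIL_LOW and queued == 0:
        scaler.release_decode_worker()       # return idle GPUs to the shared pool

def run(metrics, scaler, interval_s=10):
    """Re-plan on a fixed interval."""
    while True:
        plan_once(metrics, scaler)
        time.sleep(interval_s)
```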

The next release of Dynamo will be able to run the GPU Planner alongside the NVIDIA Dynamo Deploy CLI, enabling autoscaling in large-scale Kubernetes deployments. 

Scaling LLMs from local dev to production with a single command

A key challenge facing AI inference teams is the operational complexity of transitioning LLMs from local development environments into scalable, production-ready Kubernetes deployments. While local environments are ideal for experimentation and prototyping, they rarely reflect the realities of data center-scale production systems.

Historically, moving applications from local development to production on Kubernetes has involved a series of manual steps: containerizing the application and its supporting components, crafting Kubernetes YAML configuration files by hand, configuring Kubernetes resources, and managing container image publishing to Kubernetes-compatible registries. This process is often time-consuming, error-prone, and heavily reliant on developer expertise—slowing down development cycles and delaying time to market for new LLM-enabled applications.

As LLMs evolve in size and complexity, the requirements for deploying them in production have grown significantly. Modern inference servers now demand additional components such as:

  • LLM-aware request routing for optimal request distribution across inference nodes.
  • KV cache offloading to efficiently handle large-scale inference memory demands.
  • Dynamic GPU allocation to ensure efficient and cost-effective use of compute resources.

These components are tightly integrated, which makes manually containerizing and deploying them on Kubernetes an additional challenge in production.

To address these challenges, the v0.2 release features an NVIDIA Dynamo Kubernetes Operator to automate Kubernetes deployment. The operator includes image building and graph management capabilities designed to fully automate the complexities of production deployment. This enables AI inference teams to move from a Dynamo prototype on a desktop GPU or local GPU node to a scalable, data center-scale deployment across thousands of GPUs with a single CLI command.

This automation eliminates the need for manual Dockerfile creation and YAML configuration, and dynamically provisions and scales inference workloads across GPU-optimized Kubernetes clusters. It also integrates seamlessly with leading Kubernetes-native MLOps tools, making it easier to deploy to Kubernetes clusters on all major cloud service providers.
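
To give a sense of what the operator automates under the hood, the hedged sketch below submits a Dynamo deployment as a Kubernetes custom resource using the official Kubernetes Python client. The group, version, kind, plural, and spec fields are assumed placeholders, not the operator's documented schema; in practice, the Dynamo CLI generates and applies the equivalent manifest for you.

```python
# Hedged sketch: the CRD group/version/kind, plural, and spec fields below are
# assumed placeholders, not the Dynamo operator's documented schema.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

deployment = {
    "apiVersion": "nvidia.com/v1alpha1",          # placeholder group/version
    "kind": "DynamoGraphDeployment",              # placeholder kind
    "metadata": {"name": "llm-service", "namespace": "dynamo"},
    "spec": {                                     # placeholder spec fields
        "image": "registry.example.com/my-dynamo-graph:latest",
        "replicas": {"prefill": 2, "decode": 4},
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="nvidia.com",                           # placeholder
    version="v1alpha1",
    namespace="dynamo",
    plural="dynamographdeployments",              # placeholder
    body=deployment,
)
```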

By abstracting away the manual development and configuration processes, Dynamo empowers AI teams to focus on model innovation and delivery, rather than infrastructure setup. It dramatically reduces time to production and provides the reliability, flexibility, and scalability required for modern data center-scale LLM deployment in distributed environments.

Optimizing KV cache transfers on NVIDIA-powered Amazon EC2 instances 

A major contributor to the inference costs of LLMs is the KV cache. Although it is typically hidden from end users and managed behind the scenes, optimizing how it is generated and stored can lead to significant cost savings. Techniques such as disaggregated serving and KV cache offloading have emerged to do exactly that.

These methods, however, depend on low-latency transfer of data across GPU nodes (often interconnected through networking protocols like InfiniBand or Ethernet) and may involve moving data to external storage such as file systems or object stores. Integrating the wide array of networking and storage libraries into inference-serving frameworks is time-consuming and often leads to inefficient KV cache movement, increasing latency and driving up inference costs.

To address these challenges, Dynamo includes the open-source NVIDIA Inference Transfer Library (NIXL). NIXL is a high-performance, low-latency point-to-point communication library purpose-built for moving data across heterogeneous environments. It provides a consistent API for fast, asynchronous transfers across memory and storage tiers, abstracting away the complexity of dealing with different hardware and protocols. It supports integration with GDS (GPUDirect Storage), UCX (Unified Communication X), Amazon S3 (Simple Storage Service), and more, and works across interconnects like NVIDIA NVLink-C2C, NVIDIA NVLink Switch, InfiniBand, RoCE, and Ethernet. 

With NIXL, developers no longer need to manually handle protocol selection or integration logic. Instead, they can use simple get, push, and delete commands through NIXL’s front-end API, while the library intelligently selects the optimal backend and handles all underlying bookkeeping, drastically simplifying development and improving performance.
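
As a rough illustration of that workflow, the sketch below uses invented stand-ins for the agent and its methods (these are not NIXL's actual Python bindings); the point is the shape of the API: publish a block once, let the library choose the backend, then fetch and release blocks as they are consumed.

```python
# Toy stand-in for a NIXL-style transfer agent; real backends (UCX, GDS, EFA,
# NVLink) are selected by the library, not the caller. Class and method names
# are illustrative, not NIXL's actual API.
from dataclasses import dataclass

@dataclass
class KVBlock:
    request_id: str
    layer: int
    data: bytes                      # stand-in for a GPU buffer

class TransferAgent:
    def __init__(self):
        self._store = {}             # bookkeeping the real library hides

    def push(self, key, block):      # send a block to a peer
        self._store[key] = block

    def get(self, key):              # fetch a block published by a peer
        return self._store[key]

    def delete(self, key):           # drop the block once it has been consumed
        self._store.pop(key, None)

# A prefill worker publishes KV cache blocks; a decode worker pulls and frees them.
agent = TransferAgent()
agent.push("req-42/layer-0", KVBlock("req-42", 0, b"\x00" * 16))
block = agent.get("req-42/layer-0")
agent.delete("req-42/layer-0")
```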

The latest release of Dynamo expands NIXL’s support to include AWS Elastic Fabric Adapter (EFA). EFA is an internode communications network interface for EC2 instances. With the v0.2 Dynamo release, AI service providers deploying LLMs on the AWS cloud can now take advantage of Dynamo’s distributed and disaggregated serving capabilities in multinode setups running on NVIDIA-powered EC2 instances, such as the P5 family powered by NVIDIA Hopper and the P6 family powered by NVIDIA Blackwell.

Join us at the next user meetup

We are building NVIDIA Dynamo in the open and have published our roadmap on GitHub. We’re excited to keep the conversation going and hear how you’re building with NVIDIA Dynamo. 

Join us for our first in-person user meetup on June 5 in San Francisco, where we’ll dive deeper into the v0.2 release and the Dynamo roadmap. We’d love to see you there.
