Skip to content

Conversation

@gujingit
Copy link
Collaborator

Ⅰ. Motivation

Ⅱ. Modifications

Add Role coordination KEP

Ⅲ. Does this pull request fix one issue?

NONE

Ⅳ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.

Ⅴ. Describe how to verify it

VI. Special notes for reviews

Checklist

  • Format your code make fmt.
  • Add unit tests or integration tests.
  • Update the documentation related to the change.

@codecov-commenter
Copy link

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@gujingit gujingit force-pushed the feature/role-coordination branch from 4688446 to c845bb4 Compare October 17, 2025 08:52
title: KEP Template
kep-number: NNNN
authors:
- "@jane.doe"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the information of this file.

Each coordination step will:

1. Apply the specified strategies to the relevant roles
2. Monitor the status of those roles
Copy link
Collaborator

@cheyang cheyang Oct 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest adding the detail design to instruct the router component to split requests between the old and new Pods in real time during a RoleBasedGroup coordinated upgrade, so that it can integrate with SGLang Router's rolling update workflow

@gujingit gujingit force-pushed the feature/role-coordination branch from c845bb4 to 470ee3c Compare October 21, 2025 13:15
Copy link
Contributor

@ZYecho11 ZYecho11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this association upgrade strategy can be considered as an enhancement and supplement to the original upgrade strategy, rather than a replacement relationship. For example, if the user's original Role maxUnavailable is 5, but wants to configure a 4:1 association upgrade relationship, does it need to modify the original maxUnavailable to 4?

@gujingit
Copy link
Collaborator Author

API design has updated. Remove maxUnavailable & maxSurge, and use batchSize to represent the ratio between roles.

@cheyang cheyang self-requested a review October 24, 2025 02:51
@RongGu RongGu requested a review from Copilot October 26, 2025 15:39
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces KEP-30 (Role Coordination for RoleBasedGroup), which proposes a coordination mechanism for managing updates across multiple roles in a RoleBasedGroup. The enhancement enables phased rollouts where roles are updated in a coordinated sequence while maintaining specific ratios (e.g., 2:1 prefill-to-decode for LLM inference workloads).

Key Changes:

  • New coordination API to define cross-role update strategies with configurable batch sizes and partitions
  • Controller logic to execute coordinated updates based on defined strategies
  • Support for maintaining role ratios during rolling updates (e.g., 4P2D pattern for PD-disaggregated LLM inference)

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
keps/NNNN-kep-template/kep.yaml Template KEP metadata file added
keps/NNNN-kep-template/README.md Template KEP documentation added
keps/30-role-coordination/kep.yaml KEP-30 metadata with incorrect kep-number (21 instead of 30) and title
keps/30-role-coordination/README.md Complete KEP-30 specification including API design, user stories, and implementation details

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@@ -0,0 +1,15 @@
title: KEP Template
kep-number: 21
Copy link

Copilot AI Oct 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The kep-number is set to 21 but the directory name and README title indicate this is KEP-30. This inconsistency should be corrected to 30.

Suggested change
kep-number: 21
kep-number: 30

Copilot uses AI. Check for mistakes.
template.

Ensure the TOC is wrapped with
<code>&lt;!-- toc --&rt;&lt;!-- /toc --&rt;</code>
Copy link

Copilot AI Oct 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected HTML entity closing tags from '&rt;' to '>' in the TOC generation instruction.

Suggested change
<code>&lt;!-- toc --&rt;&lt;!-- /toc --&rt;</code>
<code>&lt;!-- toc --&gt;&lt;!-- /toc --&gt;</code>

Copilot uses AI. Check for mistakes.
type RoleBasedGroupSpec struct {
// Existing fields...

// Coordination defines how roles should be coordinated
Copy link

Copilot AI Oct 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is missing a period at the end. Go documentation comments should be complete sentences ending with punctuation.

Suggested change
// Coordination defines how roles should be coordinated
// Coordination defines how roles should be coordinated.

Copilot uses AI. Check for mistakes.
- role: decode
updateStrategy:
partition: 5
batchSize: 4
Copy link

Copilot AI Oct 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The batchSize for the decode role is set to 4, but according to the example description and the strategy in lines 211-220, decode should have batchSize: 1 to maintain the 4:1 ratio with prefill (batchSize: 4).

Suggested change
batchSize: 4
batchSize: 1

Copilot uses AI. Check for mistakes.
40 prefill pods and 10 decode pods were upgraded in a rolling update with a 4:1 ratio of prefill to decode.
After rolling updates of 20 prefill pods and 5 decode pods, the process was paused.
```yaml
apiVersion: workloads.x-k8s.io/v1alpha1
Copy link
Collaborator

@cheyang cheyang Oct 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend retaining support for multiple strategy types such as updateStrategy, scaleStrategy, and gangStrategy, while using a unified config structure internally implemented as a map. The API would be:

apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
spec:
  coordinations:
    - name: "prefill-decode"
      strategies:
        - role: prefill
          strategyType: updateStrategy   # Explicit strategy type
          config:                        # Universal map-based config
            partition: 20
            batchSize: 2
        - role: decode
          strategyType: updateStrategy
          config:
            partition: 5
            batchSize: 1
            
    - name: "router-planner"
      strategies:
        - role: router
          strategyType: gangStrategy    # Different strategy type
          config:                       # Same config structure
            minAvailable: 2
            timeout: 10m
        - role: planner
          strategyType: scaleStrategy   # Third strategy type
          config:
            minReplicas: 3
            maxReplicas: 10
apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
metadata:
  name: role-coordination
spec:
  coordination:
    - name: "prefill-decode-update" 
      type: updateStrategy # Explicit strategy type
      strategy:
      - role: prefill  
        config:   # Universal map-based config
          type: Recreate # [Recreate, InplaceIfPossible], default is Recreate
          batchSize: 2
      - role: decode
        config:
          type: Recreate # [Recreate, InplaceIfPossible], default is Recreate
          batchSize: 1
    # other coordination strategy type demo
    - name: "prefill-decode-scaling" # Support later
      type: scalingStrategy    # Different strategy type
      strategies:
        - role: prefill
          config:                       # Same config structure
            batchSize: 2
        - role: decode
          config:
            batchSize: 1
    - name: "prefill-decode-gang" # Other type will support on demand
      type: gangStrategy    # Different strategy type
      strategies:
        - role: prefill
          config:                       # Same config structure
            minAvailable: 2
            timeout: 10m
        - role: decode
          config:
            minAvailable: 2
            timeout: 10m

Key Implementation Features:

  1. Unified Config Structure

    config:  # Common structure for all strategies
      key1: value1
      key2: value2
      ...
  2. Strategy Type Flexibility

    type Strategy struct {
        Role        string
        StrategyType string    // updateStrategy|scaleStrategy|gangStrategy
        Config      map[string]string
    }
  3. Validation Handling

    func Validate(config map[string]string, strategyType string) error {
        switch strategyType {
        case "updateStrategy":
            return validateUpdateConfig(config)
        case "gangStrategy":
            return validateGangConfig(config)
        case "scaleStrategy":
            return validateScaleConfig(config)
        default:
            return fmt.Errorf("unknown strategy type: %s", strategyType)
        }
    }

This maintains API consistency while enabling diverse strategy implementations through:

  • Single config field with flexible key-value pairs
  • Explicit strategyType designation for behavior dispatch
  • Backward compatibility with existing updateStrategy configurations
  • Clean extension path for future strategy types

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it necessary to set gangPolicy under the coordinates field? What is User Story?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to deploy a minimal structure consisting of at least one router, two prefiller, and two decoder instances using RBG.

apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
spec:
  coordinations:
    - name: "minimal-deployment"
      strategies:
        - role: router
          strategyType: gangStrategy
          config:
            minAvailable: 1   # Minimum 1 router
            
        - role: prefill
          strategyType: gangStrategy
          config:
            minAvailable: 2   # Minimum 2 prefill
            
        - role: decode
          strategyType: gangStrategy
          config:
            minAvailable: 2   # Minimum 2 decode

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this information already defined in the podgroup and podgroupPolicy? Why do we need to repeat the definition?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference is that coordination needs to support multiple atomic gangs. For example:

  • Role prefill requires 22 replicas
  • Role decode requires 10 replicas
  • But each atomic gang must contain exactly 2 prefill + 1 decode pods each time to place in the same node.

This necessitates coordinated sequential gangs – a single PodGroup cannot express this multi-batch requirement.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, get it

name: role-coordination
spec:
coordination:
- name: "prefill-decode-update" # strategy 1: reconcile prefill & decode at 4:1 ratio
Copy link
Contributor

@veophi veophi Oct 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about

coordination:
- name: prefill-decode-update
  type: RollingUpdate
  roles:
  - prefill
  - decode
  strategy:
    maxSkew: 1% # the max skew of updated replicas between prefill and decode. For example, one rbg with (200p, 100d) allows `abs(updated_prefill/200 - updated_decode/100) < 1%`.
    partition: 99%

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This design will binds the rolling update ratio to the replicas of prefill and decode. If the prefill or decode has HPA , their replica count is not accurate, the update ratio will not meet expectations.

Copy link
Collaborator Author

@gujingit gujingit Nov 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this design can not support this case. There are 40 Prefill and 10 Decode pods and users want to update prefill:decode = 2:1.

// existing independent role update behavior
}

func (r *RoleBasedGroupReconciler) executeCoordinationStrategy() {
Copy link
Contributor

@veophi veophi Oct 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Behavior when a single Pod becomes unhealthy during rollout?

If one Pod enters an Unhealthy state (e.g., readiness/failure probe fails due to some node problems) during a rolling upgrade, will the rbg controller block the entire upgrade — even when maxUnavailable is explicitly set to >1?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rolling update process of the rbg controller follows the K8s sts design. If a old version pod enters an unhealthy state, it will not block the upgrade. If a new version pod (only single pod) enters an unhealthy state, it will block the upgrade process.

Copy link
Collaborator

@cheyang cheyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@cheyang cheyang merged commit 22e7ded into sgl-project:main Nov 5, 2025
3 checks passed
@veophi veophi mentioned this pull request Nov 5, 2025
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants