add role coordination kep
gujingit committed Oct 17, 2025
commit d78c75d68a516908bc653eadbc3df15d27c822e8
257 changes: 257 additions & 0 deletions keps/30-role-coordination/README.md
@@ -0,0 +1,257 @@
# KEP-30: Role Coordination for RoleBasedGroup

## Table of Contents

<!-- toc -->

- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [User Stories](#user-stories)
- [Story 1: Coordinated Rolling Update](#story-1-coordinated-rolling-update)
- [Implementation Details](#implementation-details)
- [API Changes](#api-changes)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Test Plan](#test-plan)
- [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)

<!-- /toc -->

## Release Signoff Checklist

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial
KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test
refactors)
- [ ] e2e Tests for all Beta API Operations (endpoints)
- [ ] (R) Ensure GA e2e tests meet requirements
for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit
by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
within one minor version of promotion to GA
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings,
relevant PRs/issues, release notes

## Summary

This KEP proposes adding role coordination capabilities to the RoleBasedGroup (RBG) controller.
Currently, RBG manages multiple roles independently, but many real-world applications require coordinated updates
across multiple roles. This enhancement introduces a coordination mechanism that allows defining complex update
strategies spanning multiple roles, such as updating a frontend role partially, then updating a backend role completely,
and finally completing the frontend update.

## Motivation

In complex distributed applications, individual components often need to be updated in a specific sequence to
maintain application availability and consistency. The current RoleBasedGroup implementation updates each role
independently, which can lead to service disruptions during updates. For example, in PD-disaggregated
(prefill/decode-disaggregated) LLM inference, the prefill and decode roles often need to be updated at a fixed ratio.
The table below walks through an update of a group with 4 prefill pods and 2 decode pods (4P2D).

| Stage | Upgrade Process | Comments |
|-------|--------------------------------------------------------------|-------------------------------------------------------------------------------------------------|
| 1 | Old Prefill: 4; Old Decode: 2 | The RBG update begins. |
| 2 | Old Prefill: 2, New Prefill: 2; Old Decode: 2 | Update 2 prefill pods first. |
| 3 | Old Prefill: 2, New Prefill: 2; Old Decode: 1, New Decode: 1 | Pause the prefill update and update one decode pod. The P:D ratio must be maintained at 2:1. |
| 4 | New Prefill: 4; Old Decode: 1, New Decode: 1 | Resume and finish updating the prefill pods. |
| 5 | New Prefill: 4; New Decode: 2 | The update is complete. |

### Goals

1. Enable coordinated updates across multiple roles in a RoleBasedGroup
2. Support phased rollouts where some roles are updated partially while others wait
3. Provide a flexible coordination strategy definition mechanism
4. Maintain backward compatibility with existing RoleBasedGroup resources
5. Allow rollback capabilities for coordinated updates

### Non-Goals

1. Implement coordination across multiple RoleBasedGroupSet resources
2. Handle coordination of non-workload resources (e.g., ConfigMaps, Secrets)

## Proposal

This KEP introduces a new `coordination` field to the RoleBasedGroup specification that defines how roles should
be coordinated in relation to each other. The coordination strategy consists of an ordered series of steps, each
specifying which roles to update and which strategy to apply to each of them.

### User Stories

#### Story 1: Coordinated Rolling Update

In the PD-disaggregated scenario for LLM inference, the input/output pattern is relatively fixed,
with an optimal P:D ratio of 2:1.
Each time 2 Prefill Pods are updated, 1 Decode Pod needs to be updated accordingly to maintain this ratio.

##### Coordinated Rolling Update Process

The coordinated rolling update process ensures that the P:D ratio is maintained throughout the update cycle.
Here's how it works:

1. **Initial State**: The system starts with all old Prefill and Decode pods running
2. **Step-by-Step Update**:
   - Update Prefill pods in batches of 2
   - For each batch of 2 Prefill pods updated, update 1 Decode pod
   - Monitor readiness of updated pods before proceeding
3. **Completion**: Continue until all Prefill and Decode pods are updated while maintaining the 2:1 ratio

This approach ensures service continuity and optimal resource utilization during the update process,
preventing performance degradation due to imbalanced P:D ratios.
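
The process above can be sketched as a small planner that lists the intermediate states of the update. This is only an illustration of the 2:1 batching rule, not the controller's implementation: `Stage`, `planStages`, and `minInt` are hypothetical names, and the sketch ignores readiness checks, which a real controller would wait on between batches.

```go
package main

import "fmt"

// Stage records how many pods of each role already run the new revision.
type Stage struct {
	NewPrefill int
	NewDecode  int
}

// planStages lists the states of a coordinated update in which each batch of
// prefillBatch Prefill pods is followed by a batch of decodeBatch Decode pods.
// It returns nil for non-positive batch sizes, which would never make progress.
func planStages(prefill, decode, prefillBatch, decodeBatch int) []Stage {
	if prefillBatch < 1 || decodeBatch < 1 {
		return nil
	}
	stages := []Stage{{NewPrefill: 0, NewDecode: 0}}
	p, d := 0, 0
	for p < prefill || d < decode {
		if p < prefill {
			p = minInt(p+prefillBatch, prefill)
			stages = append(stages, Stage{NewPrefill: p, NewDecode: d})
		}
		if d < decode {
			d = minInt(d+decodeBatch, decode)
			stages = append(stages, Stage{NewPrefill: p, NewDecode: d})
		}
	}
	return stages
}

// minInt returns the smaller of two ints.
func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func main() {
	const prefill, decode = 4, 2
	for i, s := range planStages(prefill, decode, 2, 1) {
		fmt.Printf("stage %d: old prefill %d, new prefill %d; old decode %d, new decode %d\n",
			i+1, prefill-s.NewPrefill, s.NewPrefill, decode-s.NewDecode, s.NewDecode)
	}
}
```

For 4 Prefill and 2 Decode pods with batches of 2 and 1, the planner yields the same five stages as the table in the Motivation section.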

### Implementation Details

#### API Changes

Add a new `Coordination` field to the RoleBasedGroup spec:

```go
import "k8s.io/apimachinery/pkg/util/intstr"

type RoleBasedGroupSpec struct {
	// Existing fields...

	// Coordination defines how roles should be coordinated.
	// +optional
	Coordination *Coordination `json:"coordination,omitempty"`
}

type Coordination struct {
	// Steps defines the ordered sequence of coordination steps.
	Steps []CoordinationStep `json:"steps"`
}

type CoordinationStep struct {
	// Roles lists the roles involved in this step.
	Roles []string `json:"roles"`

	// RoleStrategies holds the strategy for each role in this step.
	RoleStrategies map[string]RoleStrategy `json:"roleStrategies,omitempty"`
}

type RoleStrategy struct {
	UpdateStrategy RoleUpdateStrategy `json:"updateStrategy,omitempty"`
	// ScalingStrategy
	// DeletingStrategy
	// ...
}

type RoleUpdateStrategy struct {
	Partition      *int32             `json:"partition,omitempty"`
	MaxUnavailable intstr.IntOrString `json:"maxUnavailable,omitempty"`
	// The type of MaxSurge is not stated in this draft; intstr.IntOrString
	// is assumed here to mirror MaxUnavailable.
	MaxSurge intstr.IntOrString `json:"maxSurge,omitempty"`
}
```
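
As an illustration, the 4P2D update from the Motivation section might be expressed with these types roughly as follows. This is a sketch under stated assumptions, not a confirmed usage: the types are simplified stand-ins (only `Partition` is kept so that the example needs nothing beyond the standard library), the role names `prefill` and `decode` and the helpers `partitionStep` and `fourPTwoD` are hypothetical, and `partition` is assumed to follow StatefulSet semantics, where pods with an ordinal greater than or equal to the partition are updated.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified stand-ins for the proposed API types. MaxUnavailable and
// MaxSurge are left out to avoid a dependency on k8s.io/apimachinery.
type RoleUpdateStrategy struct {
	Partition *int32 `json:"partition,omitempty"`
}

type RoleStrategy struct {
	UpdateStrategy RoleUpdateStrategy `json:"updateStrategy,omitempty"`
}

type CoordinationStep struct {
	Roles          []string                `json:"roles"`
	RoleStrategies map[string]RoleStrategy `json:"roleStrategies,omitempty"`
}

type Coordination struct {
	Steps []CoordinationStep `json:"steps"`
}

// partitionStep builds a step that moves a single role to the given partition.
// Assumption: ordinals >= partition run the new revision, as in StatefulSets.
func partitionStep(role string, partition int32) CoordinationStep {
	return CoordinationStep{
		Roles: []string{role},
		RoleStrategies: map[string]RoleStrategy{
			role: {UpdateStrategy: RoleUpdateStrategy{Partition: &partition}},
		},
	}
}

// fourPTwoD returns the four steps that take 4 prefill and 2 decode pods
// from the old revision to the new one: 2 prefill, 1 decode, 2 prefill, 1 decode.
func fourPTwoD() Coordination {
	return Coordination{Steps: []CoordinationStep{
		partitionStep("prefill", 2),
		partitionStep("decode", 1),
		partitionStep("prefill", 0),
		partitionStep("decode", 0),
	}}
}

func main() {
	// Print the value as JSON to show the shape of the coordination field.
	out, err := json.MarshalIndent(fourPTwoD(), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Because `Partition` is a pointer, an explicit partition of 0 is still serialized, which lets a step distinguish "update every pod" from "no partition set".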

Add a new `CoordinationState` field to the RoleBasedGroup status:

```go
import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// Add coordination status to RoleBasedGroupStatus.
type RoleBasedGroupStatus struct {
	CoordinationState CoordinationState `json:"coordinationState,omitempty"`
}

type CoordinationState struct {
	// CurrentPhase is the phase currently being coordinated.
	CurrentPhase string `json:"currentPhase,omitempty"`
	// Progress holds coordination progress information.
	Progress       map[string]string `json:"progress,omitempty"`
	LastUpdateTime metav1.Time       `json:"lastUpdateTime,omitempty"`
}
```
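
One possible way to fill in this status is sketched below. The KEP does not define the contents of `CurrentPhase` or `Progress`, so the phase name `step-1` and the `updated/total` progress format are assumptions made for illustration, as is the helper `recordProgress`. `time.Time` stands in for `metav1.Time` to keep the sketch free of external dependencies.

```go
package main

import (
	"fmt"
	"time"
)

// CoordinationState mirrors the proposed status type, with time.Time
// standing in for metav1.Time.
type CoordinationState struct {
	CurrentPhase   string
	Progress       map[string]string
	LastUpdateTime time.Time
}

// recordProgress notes the active phase, stores "updated/total" for one role,
// and stamps the time of the change. The progress format is an assumption.
func recordProgress(s *CoordinationState, phase, role string, updated, total int, now time.Time) {
	if s.Progress == nil {
		s.Progress = map[string]string{}
	}
	s.CurrentPhase = phase
	s.Progress[role] = fmt.Sprintf("%d/%d", updated, total)
	s.LastUpdateTime = now
}

func main() {
	var state CoordinationState
	recordProgress(&state, "step-1", "prefill", 2, 4, time.Now())
	fmt.Printf("%s %v\n", state.CurrentPhase, state.Progress)
}
```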

### Risks and Mitigations

1. **Complexity Risk**: Adding coordination logic increases controller complexity
   - Mitigation: Implement thorough unit and integration tests

2. **Deadlock Risk**: Poorly configured coordination strategies could cause updates to stall
   - Mitigation: Add timeouts and clear status reporting

3. **Backward Compatibility**: Existing RoleBasedGroups should continue to work unchanged
   - Mitigation: Only apply coordination logic when `coordination` is specified
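
Some misconfigurations that could stall an update can be rejected before any step runs. The following is a minimal validation sketch, not part of the proposal: the rules it checks (at least one step, every step lists known roles, every strategy belongs to a listed role) and the name `validateSteps` are assumptions, and the types are reduced to the fields the checks need.

```go
package main

import (
	"errors"
	"fmt"
)

// RoleStrategy is empty here because validation only looks at role names.
type RoleStrategy struct{}

// CoordinationStep is a reduced stand-in for the proposed API type.
type CoordinationStep struct {
	Roles          []string
	RoleStrategies map[string]RoleStrategy
}

// validateSteps returns an error for configurations that could never complete:
// no steps, a step without roles, a role the group does not define, or a
// strategy attached to a role that the step does not list.
func validateSteps(steps []CoordinationStep, knownRoles map[string]bool) error {
	if len(steps) == 0 {
		return errors.New("coordination must define at least one step")
	}
	for i, step := range steps {
		if len(step.Roles) == 0 {
			return fmt.Errorf("step %d lists no roles", i)
		}
		listed := make(map[string]bool, len(step.Roles))
		for _, role := range step.Roles {
			if !knownRoles[role] {
				return fmt.Errorf("step %d references unknown role %q", i, role)
			}
			listed[role] = true
		}
		for role := range step.RoleStrategies {
			if !listed[role] {
				return fmt.Errorf("step %d has a strategy for role %q, which is not in its roles list", i, role)
			}
		}
	}
	return nil
}

func main() {
	known := map[string]bool{"prefill": true, "decode": true}
	steps := []CoordinationStep{
		{Roles: []string{"prefill"}, RoleStrategies: map[string]RoleStrategy{"prefill": {}}},
		{Roles: []string{"router"}}, // "router" is not a role of this group
	}
	fmt.Println(validateSteps(steps, known))
}
```

Checks of this kind could run in an admission webhook or at the start of reconciliation; which one fits better is left open here.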

## Design Details

The implementation will modify the main reconciliation loop
in `RoleBasedGroupReconciler` to check for a coordination strategy. If present, it will execute the coordinated
update logic; otherwise, it will fall back to the existing independent role update behavior.

Each coordination step will:

1. Apply the specified strategies to the relevant roles
2. Monitor the status of those roles

Review comment from @cheyang (Collaborator), Oct 19, 2025:

I suggest adding a detailed design for instructing the router component to split requests between the old and new Pods in real time during a RoleBasedGroup coordinated upgrade, so that it can integrate with SGLang Router's rolling update workflow.

3. Proceed to the next step only when the current step is complete

The controller will use the existing workload reconcilers (StatefulSetReconciler, DeploymentReconciler, etc.) but with
modified parameters based on the coordination strategy.
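
The step loop described above can be sketched as a pure function that decides, on each reconcile, whether to stay on the current step or move to the next one. This is an illustration only: `RoleTarget`, `Step`, `stepComplete`, and `nextStep` are hypothetical names, and a step is assumed to be complete once every role in it has at least the target number of updated, ready pods.

```go
package main

import "fmt"

// RoleTarget states how many pods of a role must be updated and ready
// before the step that contains it counts as complete.
type RoleTarget struct {
	Role         string
	UpdatedReady int
}

// Step groups the targets that must all be met before moving on.
type Step struct {
	Targets []RoleTarget
}

// stepComplete reports whether every role in the step has reached its target.
// observed maps a role name to its current count of updated, ready pods.
func stepComplete(step Step, observed map[string]int) bool {
	for _, t := range step.Targets {
		if observed[t.Role] < t.UpdatedReady {
			return false
		}
	}
	return true
}

// nextStep returns the index of the step to work on after one reconcile.
// It advances by at most one step, and len(steps) means the update is done.
func nextStep(steps []Step, current int, observed map[string]int) int {
	if current < 0 {
		current = 0
	}
	if current >= len(steps) {
		return len(steps)
	}
	if stepComplete(steps[current], observed) {
		return current + 1
	}
	return current
}

func main() {
	steps := []Step{
		{Targets: []RoleTarget{{Role: "prefill", UpdatedReady: 2}}},
		{Targets: []RoleTarget{{Role: "decode", UpdatedReady: 1}}},
		{Targets: []RoleTarget{{Role: "prefill", UpdatedReady: 4}}},
		{Targets: []RoleTarget{{Role: "decode", UpdatedReady: 2}}},
	}
	observed := map[string]int{"prefill": 2, "decode": 0}
	fmt.Println(nextStep(steps, 0, observed)) // step 0 is met, so move to step 1
	fmt.Println(nextStep(steps, 1, observed)) // decode is not ready yet, so stay
}
```

Advancing by a single step per reconcile keeps the decision easy to test and matches the rule that a step starts only after the previous one has completed.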

### Test Plan

#### Unit Tests

- Test coordination strategy parsing and validation
- Test step execution logic
- Test status tracking and updates
- Test edge cases (empty steps, invalid configurations)

#### Integration Tests

- Test full coordination flow with multiple roles
- Test partial updates within steps
- Test rollback scenarios
- Test interaction with existing independent role updates

#### E2E Tests

- Deploy a multi-role application with coordination strategy
- Execute coordinated update and verify correct sequence
- Verify application availability during update

### Graduation Criteria

#### Alpha

- Basic coordination strategy implementation
- Support for simple sequential role updates
- Unit and integration tests
- Documentation and examples

#### Beta

- Support for complex coordination patterns
- Comprehensive e2e tests
- Metrics and monitoring
- User feedback and iterations

#### GA

- Proven stability in production environments
- Complete documentation and best practices
- No critical bugs reported for 2 consecutive releases

## Implementation History

- 2025-10-17: KEP created
- TBD: Alpha implementation
- TBD: Beta implementation
- TBD: GA implementation

## Drawbacks

1. Increased complexity in the RoleBasedGroup controller
2. Additional status tracking and state management
3. Potential for misconfigured coordination strategies to block updates

51 changes: 51 additions & 0 deletions keps/30-role-coordination/kep.yaml
@@ -0,0 +1,51 @@
title: KEP Template
kep-number: NNNN
authors:
- "@jane.doe"

Review comment (Collaborator): Please update the information of this file.

owning-sig: sig-xyz
participating-sigs:
- sig-aaa
- sig-bbb
status: provisional|implementable|implemented|deferred|rejected|withdrawn|replaced
creation-date: yyyy-mm-dd
reviewers:
- TBD
- "@alice.doe"
approvers:
- TBD
- "@oscar.doe"

see-also:
- "/keps/sig-aaa/1234-we-heard-you-like-keps"
- "/keps/sig-bbb/2345-everyone-gets-a-kep"
replaces:
- "/keps/sig-ccc/3456-replaced-kep"

# The target maturity stage in the current dev cycle for this KEP.
# If the purpose of this KEP is to deprecate a user-visible feature
# and a Deprecated feature gates are added, they should be deprecated|disabled|removed.
stage: alpha|beta|stable

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.19"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  alpha: "v1.19"
  beta: "v1.20"
  stable: "v1.22"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
  - name: MyFeature
    components:
      - kube-apiserver
      - kube-controller-manager
disable-supported: true

# The following PRR answers are required at beta release
metrics:
- my_feature_metric