health_checker: clear last_hc_http_status_ on network failures #45687

Open: MyUmmaGumma wants to merge 2 commits into envoyproxy:main from MyUmmaGumma:kee-hds-clear-stale-status
@MyUmmaGumma


Commit Message: health_checker: clear last_hc_http_status_ on network failures so the HDS EndpointHealthResponse does not ship a stale code from the last successful response after the upstream becomes unreachable.

Additional Description:

The HTTP health checker records the response status via host_->setLastHealthCheckHttpStatus() in HttpActiveHealthCheckSession::onResponseComplete(). The field is consulted by HdsDelegate::sendResponse() and attached to the HDS report's health_metadata as http_status_code. It is not cleared on subsequent probe failures, so once an upstream has had a single successful HTTP response, the cached status persists until another HTTP response overwrites it, even across network-level probe failures (connection refused, timeouts, GOAWAY, stream reset).
An HDS server that interprets http_status_code for health states then sees:

t1: probe succeeds          → http_status_code = 200
t2: probe fails at network  → http_status_code = 200 (stale)
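
The sequence above can be reproduced with a small standalone model. This is only a sketch, not Envoy code: `FakeHost`, `FakeReport`, `onNetworkFailurePreFix`, and `buildReport` are hypothetical stand-ins for the host's cached status and for what HdsDelegate::sendResponse() attaches to the report, and the 2xx health rule is a simplification. It models only the pre-fix behavior described here.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for per-host state; not Envoy's real Host class.
struct FakeHost {
  std::optional<std::uint64_t> last_hc_http_status; // cached by the HTTP checker
  bool healthy = true;                              // latest probe outcome
};

// Hypothetical stand-in for one endpoint entry in an HDS report.
struct FakeReport {
  bool healthy;
  std::optional<std::uint64_t> http_status_code; // absent when nothing is cached
};

// A completed HTTP response caches its status code.
// Simplification (assumption): any 2xx code counts as healthy.
inline void onResponseComplete(FakeHost& host, std::uint64_t code) {
  host.last_hc_http_status = code;
  host.healthy = (code >= 200 && code < 300);
}

// Pre-fix behavior: a network failure flips health but leaves the cache alone.
inline void onNetworkFailurePreFix(FakeHost& host) { host.healthy = false; }

// Copies whatever is cached on the host into the report entry.
inline FakeReport buildReport(const FakeHost& host) {
  return FakeReport{host.healthy, host.last_hc_http_status};
}
```

In this model, calling `onResponseComplete(host, 200)` and then `onNetworkFailurePreFix(host)` yields a report that is unhealthy yet still carries `http_status_code = 200`, which is the t2 row above.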

The stale code is inconsistent with the HealthStatus enum, which reflects the latest probe outcome. The fix is to clear last_hc_http_status_ in handleFailure() when the failure type is NETWORK or NETWORK_TIMEOUT. This restores the contract that http_status_code in health_metadata is the response code of the most recent probe if that probe produced an HTTP response, and is absent otherwise.

This change does not affect ACTIVE failures (where the upstream returned a non-2XX response code): in that case setLastHealthCheckHttpStatus() has already been called in HttpActiveHealthCheckSession::onResponseComplete() with the actual response code, which is current and correct.
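
The rule can be sketched in isolation as follows. This is a hedged illustration rather than the actual patch: `FailureType`, `FakeHost`, and `isNetworkLevel` are hypothetical names that mirror the failure types mentioned above, and the real handleFailure() in Envoy does considerably more than this.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical mirror of the failure types named in the description.
enum class FailureType { Active, Network, NetworkTimeout };

// Hypothetical stand-in for the host's cached HTTP status.
struct FakeHost {
  std::optional<std::uint64_t> last_hc_http_status;
};

// True for failures where the probe produced no HTTP response.
inline bool isNetworkLevel(FailureType type) {
  return type == FailureType::Network || type == FailureType::NetworkTimeout;
}

// Sketch of the proposed rule: drop the cached code only when the probe
// failed below the HTTP layer.
inline void handleFailure(FakeHost& host, FailureType type) {
  if (isNetworkLevel(type)) {
    host.last_hc_http_status.reset();
  }
  // Active failures keep the code that the response handler just recorded.
}
```

With this rule, a cached 200 followed by a `NetworkTimeout` failure leaves the status empty, while a 503 recorded just before an `Active` failure is preserved.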

Risk Level: Low

Narrowly scoped change in the base health-checker session's failure handler. No API changes. A new unit test exercises the cleared-on-timeout case.

Testing: Unit

Added HttpHealthCheckerImplTest.LastHealthCheckHttpStatusClearedOnNetworkFailure that verifies the recorded HTTP status is cleared after a timeout-driven probe failure following a successful 200 response.

Docs Changes: No

Release Notes: Added entry under bug_fixes: in changelogs/current.yaml:

- area: upstream
  change: |
    health_checker: clear cached HTTP response status code on network-level
    health check failures so the HDS report does not ship a stale code after
    the upstream becomes unreachable.

Platform Specific Features: N/A

Signed-off-by: Keerti Narayan <keerti2882@gmail.com>
@repokitteh-read-only


Hi @MyUmmaGumma, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.

🐱

Caused by: #45687 was opened by MyUmmaGumma.

@mathetake

Member

This is a follow-up to #43804.

@mathetake mathetake self-assigned this Jun 17, 2026
mathetake previously approved these changes Jun 17, 2026
Signed-off-by: Keerti Narayan <keerti2882@gmail.com>
@MyUmmaGumma MyUmmaGumma force-pushed the kee-hds-clear-stale-status branch from dcf4fbf to 72920cf Compare June 18, 2026 00:12
@MyUmmaGumma MyUmmaGumma temporarily deployed to external-contributors June 18, 2026 00:12 — with GitHub Actions Inactive
@mathetake mathetake enabled auto-merge (squash) June 18, 2026 00:13
@mathetake

Member

/retest
