health_checker: clear last_hc_http_status_ on network failures#45687
Open
MyUmmaGumma wants to merge 2 commits into
Open
health_checker: clear last_hc_http_status_ on network failures#45687MyUmmaGumma wants to merge 2 commits into
MyUmmaGumma wants to merge 2 commits into
Conversation
The HTTP health checker records the response status in host_->setLastHealthCheckHttpStatus() in HttpActiveHealthCheckSession::onResponseComplete(). The field is consulted by HdsDelegate::sendResponse() and attached to the HDS report's health_metadata as http_status_code. It is never cleared on subsequent probe failures, so once an upstream has had a single successful HTTP response, the cached status persists forever — even across network-level probe failures (connection refused, timeouts, goaway, reset stream). An HDS server that interprets http_status_code for richer health states then sees: t1: probe succeeds; metadata.status_code = 200 t2: probe fails at network layer; metadata.status_code = 200 (stale) This is asymmetric with the HealthStatus enum, which DOES reflect the latest probe outcome. Clear last_hc_http_status_ in handleFailure() when the failure type is NETWORK or NETWORK_TIMEOUT, restoring the contract that http_status_code in health_metadata represents the response code from the most recent HTTP exchange (if any), absent otherwise. Does not affect ACTIVE failures (where the upstream returned a non-2XX response code): in that case setLastHealthCheckHttpStatus() has already been called in HttpActiveHealthCheckSession::onResponseComplete() with the actual response code, which is current and correct. Signed-off-by: Keerti Narayan <keerti2882@gmail.com>
|
Hi @MyUmmaGumma, welcome and thank you for your contribution. We will try to review your Pull Request as quickly as possible. In the meantime, please take a look at the contribution guidelines if you have not done so already. |
Member
|
This is a follow up on #43804 |
mathetake
previously approved these changes
Jun 17, 2026
acc32f8 to
dcf4fbf
Compare
dcf4fbf to
72920cf
Compare
mathetake
approved these changes
Jun 18, 2026
Member
|
/retest |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Commit Message: health_checker: clear last_hc_http_status_ on network failures so the HDS EndpointHealthResponse does not ship a stale code from the last successful response after the upstream becomes unreachable.
Additional Description:
The HTTP health checker records the response status in
host_->setLastHealthCheckHttpStatus()inHttpActiveHealthCheckSession::onResponseComplete(). The field is consulted byHdsDelegate::sendResponse()and attached to the HDS report'shealth_metadataashttp_status_code. It is not cleared on subsequent probe failures, so once an upstream has had a single successful HTTP response, the cached status persists forever — even across network-level probe failures (connection refused, timeouts, goaway, reset stream).An HDS server that interprets
http_status_codefor health states then sees:This is asymmetric with the
HealthStatusenum, which reflects the latest probe outcome. The fix is to clearlast_hc_http_status_inhandleFailure()when the failure type isNETWORKorNETWORK_TIMEOUT, restoring the contract thathttp_status_codeinhealth_metadatarepresents the response code from the most recent HTTP exchange (if any), absent otherwise.Does not affect
ACTIVEfailures (where the upstream returned a non-2XX response code): in that casesetLastHealthCheckHttpStatus()has already been called inHttpActiveHealthCheckSession::onResponseComplete()with the actual response code, which is current and correct.Risk Level: Low
Narrowly-scoped change in the base health-checker session's failure handler. No API changes. New unit test exercises the cleared-on-timeout case.
Testing: Unit
Added
HttpHealthCheckerImplTest.LastHealthCheckHttpStatusClearedOnNetworkFailurethat verifies the recorded HTTP status is cleared after a timeout-driven probe failure following a successful 200 response.Docs Changes: No
Release Notes: Added entry under
bug_fixes:inchangelogs/current.yaml:Platform Specific Features: N/A