[WIP] dns: add support for clusters based on SRV DNS record#35160
[WIP] dns: add support for clusters based on SRV DNS record#35160mikhainin wants to merge 42 commits into
Conversation
Signed-off-by: Mikhail Galanin <mikhail.galanin@yahoo.com>
|
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
Signed-off-by: Mikhail Galanin <mikhail.galanin@yahoo.com>
Signed-off-by: Mikhail Galanin <mikhail.galanin@yahoo.com>
|
Testing locally with Consul: Add service nodes: Verify that the service |
|
Hi there, apologies for openning in this state - I acknowledge that there still lots. |
|
Thanks for you contribution. But based on our contributing rules (see https://github.com/envoyproxy/envoy/blob/main/CONTRIBUTING.md), any extension, should have a maintainer sponsor to help review and maintain the code. And I think this should be a DNS extension rather than a new cluster extension? |
@wbpcode, Thank you for your comment. I posted a message in slack some while ago but didn't hear anything back. What would be the right way for this?
Based on previous discussion, it was decided to go with the cluster extension. Basically, I followed the previous proposal. I will be happy to discuss the new design if someone helps me to find the right place. The initial issue doesn't seem to have activity :/ |
|
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
|
This isn't yet forgotten, going to ping maintainers in the maillist |
|
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
Signed-off-by: Mikhail Galanin <mikhail.galanin@yahoo.com>
Signed-off-by: Mikhail Galanin <mikhail.galanin@yahoo.com>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a new cluster type, envoy.clusters.dns_srv, enabling service discovery through DNS SRV records. It extends the DnsResolver interface with a resolveSrv method and provides a concrete implementation for the c-ares resolver. The DnsSrvCluster manages a two-stage resolution process, first retrieving SRV records and then resolving the resulting hostnames into IP addresses. Feedback identifies a critical bug where the cluster fails to remove stale hosts because the current host list is not correctly maintained across updates. Additionally, the reviewer suggests incorporating the SRV record's TTL into the cluster refresh logic and removing the hardcoded restriction that limits SRV support exclusively to the c-ares resolver extension.
| for (const auto& dns : response) { | ||
|
|
||
| ENVOY_LOG(debug, "SRV: host: {}, port: {}, weight: {}, prio: {}", dns.srv().target_, | ||
| dns.srv().port_, dns.srv().weight_, dns.srv().priority_); | ||
|
|
||
| if (auto address = Envoy::Network::Utility::parseInternetAddressNoThrow( | ||
| dns.srv().target_, 0, false); | ||
| address != nullptr) { | ||
| // SRV record target is an IP address, not a hostname. | ||
| ResolveTargetPtr target = std::make_unique<ResolveTarget>( | ||
| *active_resolve_list_, dns_resolver_, dns_lookup_family_, dns.srv().target_, | ||
| dns.srv().priority_, dns.srv().weight_, dns.srv().port_); | ||
|
|
||
| active_resolve_list_->addResolvedTarget(std::move(target), address); | ||
| } else { | ||
| active_resolve_list_->addTarget(std::make_unique<ResolveTarget>( | ||
| *active_resolve_list_, dns_resolver_, dns_lookup_family_, dns.srv().target_, | ||
| dns.srv().priority_, dns.srv().weight_, dns.srv().port_)); | ||
| } | ||
| } |
There was a problem hiding this comment.
If it is simpler, I think for now you could take the minimum of the TTLs from A/AAAA vs SRV response, and use that
| if (cluster.typed_dns_resolver_config().name() != "envoy.network.dns_resolver.cares") { | ||
| return absl::InvalidArgumentError( | ||
| fmt::format("Only c-ares supports resolve of SRV records, " | ||
| "please use typed_dns_resolver_config.name = " | ||
| "'envoy.network.dns_resolver.cares'. Current value: '{}'", | ||
| cluster.typed_dns_resolver_config().name())); | ||
| } |
There was a problem hiding this comment.
There was a problem hiding this comment.
If an out-of-tree resolver supported SRV, this code would disallow it, so it would be good to fix this
There was a problem hiding this comment.
What is "out-of-tree resolver"? I thought, we could add other resolvers in the following change sets...
|
|
||
| ASSERT_TRUE(response->complete()); | ||
| EXPECT_EQ("200", response->headers().getStatusValue()); | ||
| } |
There was a problem hiding this comment.
Please add an integration test that adds and removes hosts via srv response, as well as via changed A/AAAA response.
| EXPECT_CALL(dns_resolver_factory_, createDnsResolver(_, _, _)) | ||
| .WillRepeatedly(testing::Return(dns_resolver)); | ||
|
|
||
| EXPECT_CALL(*dns_resolver, resolve(_, _, _)) |
There was a problem hiding this comment.
This isn't really an integration test if it doesn't use a real resolver, with asyncronous resolution. Can you refactor this to use c-ares, and make the test respond with real dns responses?
There was a problem hiding this comment.
Hey @ggreenway,
I think this one deserves a bit of discussion (and probably help?).
As far as i understand, this patter is being used in test/extensions/clusters/common/logical_host_integration_test.cc, that's how it appeared here.
There is a FakeUdpDnsServer, which I think is used in benchmark only.
source/extensions/network/dns_resolver/cares/dns_impl.cc already does some sort of network communication, and we could extract it to some common place to reuse between tests, but this looks like a significant change.
How would you suggest to approach?
| friend class DnsSrvClusterFactory; | ||
| friend class DnsSrvClusterTest; |
There was a problem hiding this comment.
Can you refactor the tests so the friend isn't needed? Maybe add accessors for the data you need in the test?
There was a problem hiding this comment.
Hmm, this looks like a common pattern for clusters. But if you think, it worth changing, I can give it a try...
| namespace Clusters { | ||
| class DnsSrvClusterTest; | ||
| class DnsSrvClusterTest_CreateClusterWithMinimalConfig_Test; | ||
| } | ||
| } | ||
|
|
||
| namespace Upstream { | ||
|
|
||
| class DnsSrvClusterFactory; | ||
| class DnsSrvClusterTest; |
There was a problem hiding this comment.
These are declared in different namespaces
Co-authored-by: Greg Greenway <ggreenway@apple.com> Signed-off-by: Mikhail Galanin <195510+mikhainin@users.noreply.github.com>
Co-authored-by: Greg Greenway <ggreenway@apple.com> Signed-off-by: Mikhail Galanin <195510+mikhainin@users.noreply.github.com>
Co-authored-by: Greg Greenway <ggreenway@apple.com> Signed-off-by: Mikhail Galanin <195510+mikhainin@users.noreply.github.com>
Signed-off-by: Mikhail Galanin <mikhail.galanin@yahoo.com>
Signed-off-by: Mikhail Galanin <mikhail.galanin@yahoo.com>
ggreenway
left a comment
There was a problem hiding this comment.
Claude code says the following test cases are missing:
Missing Test Cases
Resolution lifecycle:
1. SRV failure at cluster level — resolveSrv returns ResolutionStatus::Failure. Verify update_failure_ incremented, timer armed, onPreInitComplete called.
2. Empty SRV response — SRV succeeds with zero records. Cluster should end up with no hosts.
3. All A/AAAA queries fail — SRV returns 3 targets, all address resolutions fail. some_targets_resolved stays false — verify host list is NOT updated (existing hosts preserved? or empty?).
4. Multiple IPs per hostname — One SRV target hostname resolves to 3 A records. Verify all 3 become hosts with the
same port/priority/weight.
Timer and re-resolution:
5. Timer re-resolve — After initial resolution, verify timer fires and triggers a new SRV cycle.
6. Re-resolve changes hosts — First cycle returns hosts A, B. Second cycle returns B, C. Verify A removed, C added, B kept.
7. TTL respect — With respect_dns_ttl: true, verify timer is armed with the minimum TTL from A/AAAA responses.
8. SRV TTL — Even with the current design (SRV TTL ignored), a test documenting this behavior would be useful.
Destruction and cancellation:
9. Cluster destroyed during SRV query — Verify no crash/leak (exercises ~DnsSrvCluster cancel path).
10. Cluster destroyed during A/AAAA queries — Verify ResolveTarget destructors cancel outstanding queries cleanly.
Edge cases:
11. Duplicate addresses — Two different SRV hostnames resolve to the same IP:port. Verify deduplication
(all_new_hosts logic).
12. Weight=0 — SRV record with weight 0. Verify host is created and functional.
13. High SRV priority values — Priority=10, 20. Verify hosts end up at the correct priority levels (or document
that priorities > 0 are broken).
14. Synchronous DNS completion — Mock dns_resolver_->resolve() to return nullptr (synchronous). Verify the
resolution still completes correctly.
Factory/Config validation:
15. Non-c-ares resolver rejected — Config with getaddrinfo resolver should fail with an error.
16. Missing typed_dns_resolver_config — What happens if the cluster config doesn't specify a resolver? The name
check will reject "".
17. Proto validation — Empty srv_name rejected by validate.rules.
c-ares level:
18. SRV cancellation — Start SRV query, cancel it, verify no callback and clean destruction.
19. ARES_EDESTRUCTION during SRV — Resolver destroyed while SRV query pending (currently crashes due to null dnsrec dereference).
20. Multiple SRV records in DNS response — Verify all records are parsed and returned.
21. SRV with CNAME indirection — DNS returns CNAME then SRV records.
The most critical missing tests are #9/#10 (destruction safety), #3 (all failures), and #5/#6 (re-resolution).
These exercise the paths most likely to have lifetime bugs in production.
/wait
| void DnsResolverImpl::PendingSrvResolution::onAresSrvCallback(ares_status_t status, size_t timeouts, | ||
| const ares_dns_record_t* dnsrec) { | ||
|
|
||
| unsigned int aresAnswerQueryId = ares_dns_record_get_id(dnsrec); |
There was a problem hiding this comment.
I think dnsrec can be null here, and this would crash
There was a problem hiding this comment.
Looking at the source, it looks it won't crash here. But it will return zero, which is a valid query_id and will crash on the assertion below (when assertions enabled). Updated implementation, considering dnsrec == nullptr as a non-successful result.
| target->weight_, constLocalitySharedPool()->getObject(locality_lb_endpoints.locality()), | ||
| lb_endpoint.endpoint().health_check_config(), target->priority_, | ||
| lb_endpoint.health_status()); |
There was a problem hiding this comment.
The PR comments say weight and priority aren't implemented yet, but you're at least half-implementing them here. I think you should set weight and priority to zero, and make a followup PR to implement them fully.
| void addResolvedTarget(Network::Address::InstanceConstSharedPtr address); | ||
|
|
||
| ResolveList& parent_; | ||
| Network::DnsResolverSharedPtr dns_resolver_; |
There was a problem hiding this comment.
I think this is unused and can be deleted
There was a problem hiding this comment.
I think, it is used in DnsSrvCluster::ResolveTarget::startResolve()...
| dereferencing | ||
| deprioritized | ||
| differentially | ||
| dnsrec |
There was a problem hiding this comment.
You can put terms like this in backticks in comments so they don't need to be added to the dictionary
Signed-off-by: Mikhail Galanin <mikhail.galanin@yahoo.com>
Signed-off-by: Mikhail Galanin <mikhail.galanin@yahoo.com>
Signed-off-by: Mikhail Galanin <mikhail.galanin@yahoo.com>
Signed-off-by: Mikhail Galanin <mikhail.galanin@yahoo.com>
Signed-off-by: Mikhail Galanin <mikhail.galanin@yahoo.com>
Signed-off-by: Mikhail Galanin <mikhail.galanin@yahoo.com>
Signed-off-by: Mikhail Galanin <mikhail.galanin@yahoo.com>
Signed-off-by: Mikhail Galanin <mikhail.galanin@yahoo.com>
Signed-off-by: Mikhail Galanin <mikhail.galanin@yahoo.com>
Signed-off-by: Mikhail Galanin <mikhail.galanin@yahoo.com>
One more attempt to address #125 (resolve cluster IPs via SRV-record)
Yet on early stage. Tested with Consul and the collowing configuration:
Based on discussion and design doc by @kylebevans in #125 (comment)
Current limitations:
srv_namedns_refresh_rateweightandpriorityare ignored (going to support them in the future PRs)