[WIP] dns: add support for clusters based on SRV DNS record #35160

Draft
mikhainin wants to merge 42 commits into envoyproxy:main from mikhainin:srv-record

@mikhainin mikhainin commented Jul 11, 2024

Contributor

One more attempt to address #125 (resolve cluster IPs via SRV-record)

Still at an early stage. Tested with Consul and the following configuration:

  clusters:
  - name: local-php
    connect_timeout: 0.25s
    lb_policy: round_robin
    # http2_protocol_options: {}
    cluster_type:
      name: envoy.clusters.dns_srv
      typed_config:
        "@type": "type.googleapis.com/envoy.extensions.clusters.dns_srv.v3.DnsSrvClusterConfig"
        srv_name: _local-php._tcp.service.consul.
    dns_resolvers:
    typed_dns_resolver_config:
      name: envoy.network.dns_resolver.cares
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.network.dns_resolver.cares.v3.CaresDnsResolverConfig
        resolvers:
          - socket_address:
              address: 127.0.0.1
              port_value: 8600

Based on discussion and design doc by @kylebevans in #125 (comment)

Current limitations:

  • Only one srv_name
  • Resolution is strictly sequential: the SRV record is resolved first, then Envoy resolves the A record for each target. The whole process repeats after dns_refresh_rate.
  • weight and priority are ignored (support is planned for future PRs)
  • All resolved IP addresses are added to the cluster (acting like a Strict DNS cluster).
  • Both the initial SRV record and the subsequent A/AAAA records are resolved via the same DNS resolver.
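
The sequential flow in the list above can be pictured with a minimal C++ sketch. It is not the PR's code: `SrvTarget`, `Endpoint`, `resolveCluster`, and the two lookup callbacks are hypothetical stand-ins for the asynchronous resolver calls, and the lookups here are synchronous for brevity. Weight and priority are left out, matching the stated limitation.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical types for illustration only; they do not exist in Envoy.
struct SrvTarget {
  std::string host;
  uint16_t port;
};

struct Endpoint {
  std::string ip;
  uint16_t port;
};

// Stand-ins for the resolver: one callback answers SRV queries,
// the other answers A/AAAA queries for a single hostname.
using SrvLookup = std::function<std::vector<SrvTarget>(const std::string&)>;
using AddressLookup = std::function<std::vector<std::string>(const std::string&)>;

// Stage 1: resolve the SRV name into targets.
// Stage 2: resolve every target hostname and pair each address with the
// port taken from that target's SRV record.
std::vector<Endpoint> resolveCluster(const std::string& srv_name, const SrvLookup& lookup_srv,
                                     const AddressLookup& lookup_address) {
  std::vector<Endpoint> endpoints;
  for (const SrvTarget& target : lookup_srv(srv_name)) {
    for (const std::string& ip : lookup_address(target.host)) {
      endpoints.push_back(Endpoint{ip, target.port});
    }
  }
  return endpoints;
}
```

Every address returned for a target becomes its own endpoint, which is the "Strict DNS"-like behaviour mentioned in the list.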

Signed-off-by: Mikhail Galanin <mikhail.galanin@yahoo.com>
@repokitteh-read-only


As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #35160 was opened by mikhainin.

see: more, trace.

@repokitteh-read-only


CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @wbpcode
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #35160 was opened by mikhainin.

see: more, trace.

Signed-off-by: Mikhail Galanin <mikhail.galanin@yahoo.com>
@mikhainin

Contributor Author

Testing locally with Consul.

Run Consul:

consul agent -server -advertise=127.0.0.1 -data-dir=consul-data -ui -bootstrap-expect=1 -dev -log-level=debug

Add service nodes:

consul services register -name=local-php -id=web-1 -address=192.168.1.102 -port=80 -tag=v1 -tag=prod
consul services register -name=local-php -id=web-2 -address=192.168.1.102 -port=81 -tag=v1 -tag=prod

Verify that the service local-php can be resolved via Consul:

dig @127.0.0.1 -p 8600 local-php.service.consul. SRV
# or
dig @127.0.0.1 -p 8600 _local-php._tcp.service.consul. SRV

Main doc: https://developer.hashicorp.com/consul/docs/services/discovery/dns-static-lookups

Signed-off-by: Mikhail Galanin <mikhail.galanin@yahoo.com>
@mikhainin

Contributor Author

Hi there, apologies for opening the PR in this state. I acknowledge that there is still a lot to do.
I just need someone to point out what I'm missing and to give feedback on whether I'm going in the right direction.

@mikhainin mikhainin marked this pull request as ready for review August 19, 2024 19:39
@wbpcode wbpcode marked this pull request as draft August 21, 2024 12:49
@wbpcode

wbpcode commented Aug 21, 2024

Member

Thanks for your contribution. However, per our contributing rules (see https://github.com/envoyproxy/envoy/blob/main/CONTRIBUTING.md), any extension should have a maintainer sponsor to help review and maintain the code.

And I think this should be a DNS extension rather than a new cluster extension?

@mikhainin

mikhainin commented Aug 21, 2024

Contributor Author

> should have a maintainer sponsor to help review and maintain the code

@wbpcode, thank you for your comment. I posted a message in Slack a while ago but didn't hear anything back. What would be the right way to go about this?

> And I think this should be a DNS extension rather than a new cluster extension?

Based on the previous discussion, it was decided to go with a cluster extension. Basically, I followed the previous proposal.

I will be happy to discuss the new design if someone helps me find the right place. The initial issue doesn't seem to have any activity :/

@github-actions


This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions Bot added the stale stalebot believes this issue/PR has not been touched recently label Sep 20, 2024
@mikhainin

Contributor Author

This isn't forgotten yet; I'm going to ping the maintainers on the mailing list.

@github-actions github-actions Bot removed the stale stalebot believes this issue/PR has not been touched recently label Sep 26, 2024
@github-actions


This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions Bot added the stale stalebot believes this issue/PR has not been touched recently label Oct 26, 2024
Signed-off-by: Mikhail Galanin <mikhail.galanin@yahoo.com>
@ggreenway

Member

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a new cluster type, envoy.clusters.dns_srv, enabling service discovery through DNS SRV records. It extends the DnsResolver interface with a resolveSrv method and provides a concrete implementation for the c-ares resolver. The DnsSrvCluster manages a two-stage resolution process, first retrieving SRV records and then resolving the resulting hostnames into IP addresses. Feedback identifies a critical bug where the cluster fails to remove stale hosts because the current host list is not correctly maintained across updates. Additionally, the reviewer suggests incorporating the SRV record's TTL into the cluster refresh logic and removing the hardcoded restriction that limits SRV support exclusively to the c-ares resolver extension.

Comment thread source/extensions/clusters/dns_srv/dns_srv_cluster.cc Outdated
Comment on lines +71 to +90
for (const auto& dns : response) {

  ENVOY_LOG(debug, "SRV: host: {}, port: {}, weight: {}, prio: {}", dns.srv().target_,
            dns.srv().port_, dns.srv().weight_, dns.srv().priority_);

  if (auto address = Envoy::Network::Utility::parseInternetAddressNoThrow(
          dns.srv().target_, 0, false);
      address != nullptr) {
    // SRV record target is an IP address, not a hostname.
    ResolveTargetPtr target = std::make_unique<ResolveTarget>(
        *active_resolve_list_, dns_resolver_, dns_lookup_family_, dns.srv().target_,
        dns.srv().priority_, dns.srv().weight_, dns.srv().port_);

    active_resolve_list_->addResolvedTarget(std::move(target), address);
  } else {
    active_resolve_list_->addTarget(std::make_unique<ResolveTarget>(
        *active_resolve_list_, dns_resolver_, dns_lookup_family_, dns.srv().target_,
        dns.srv().priority_, dns.srv().weight_, dns.srv().port_));
  }
}

Contributor

medium

The SRV record's TTL should be tracked to influence the cluster's refresh rate. Currently, only the TTLs from the subsequent A/AAAA lookups are used. If the SRV record itself has a shorter TTL, the cluster might use stale targets.

Member

If it is simpler, I think for now you could take the minimum of the TTLs from the A/AAAA and SRV responses, and use that.
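
A minimal sketch of that suggestion, assuming the TTLs from the SRV response and from each A/AAAA response have already been collected. `nextRefreshInterval` and its parameters are hypothetical names, not part of the PR or of Envoy's API.

```cpp
#include <chrono>
#include <optional>
#include <vector>

// Hypothetical helper: picks the refresh interval as the smallest TTL seen
// across the SRV response and all A/AAAA responses. Falls back to
// `default_refresh` when no TTL was collected at all.
std::chrono::seconds nextRefreshInterval(const std::vector<std::chrono::seconds>& srv_ttls,
                                         const std::vector<std::chrono::seconds>& address_ttls,
                                         std::chrono::seconds default_refresh) {
  std::optional<std::chrono::seconds> min_ttl;
  // Track the running minimum over one list of TTLs.
  auto consider = [&min_ttl](const std::vector<std::chrono::seconds>& ttls) {
    for (const std::chrono::seconds ttl : ttls) {
      if (!min_ttl.has_value() || ttl < *min_ttl) {
        min_ttl = ttl;
      }
    }
  };
  consider(srv_ttls);
  consider(address_ttls);
  return min_ttl.value_or(default_refresh);
}
```

With this shape, a short-lived SRV record shortens the refresh cycle even when the address records it points at have long TTLs.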

Comment on lines +327 to +333
if (cluster.typed_dns_resolver_config().name() != "envoy.network.dns_resolver.cares") {
  return absl::InvalidArgumentError(
      fmt::format("Only c-ares supports resolve of SRV records, "
                  "please use typed_dns_resolver_config.name = "
                  "'envoy.network.dns_resolver.cares'. Current value: '{}'",
                  cluster.typed_dns_resolver_config().name()));
}

Contributor

medium

Hardcoding a check for the c-ares resolver name is restrictive. While currently only c-ares supports SRV in this implementation, it would be better to check for a capability or interface support rather than a specific extension name, allowing for future or custom resolvers that might support SRV.

Member

If an out-of-tree resolver supported SRV, this code would disallow it, so it would be good to fix this

Contributor Author

What is an "out-of-tree resolver"? I thought we could add other resolvers in follow-up change sets...
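
One way to read the suggestion in this thread is a capability query on the resolver instead of a name comparison. The sketch below uses stand-in classes; `supportsSrv()` and `validateResolver` are hypothetical and are not claimed to be part of Envoy's `Network::DnsResolver` interface.

```cpp
#include <optional>
#include <string>

// Stand-in for a resolver interface. The supportsSrv() method is an
// assumption made for this sketch.
class Resolver {
public:
  virtual ~Resolver() = default;
  // Resolvers that cannot answer SRV queries keep this default.
  virtual bool supportsSrv() const { return false; }
};

// A resolver (in-tree or out-of-tree) that opts in to SRV support.
class SrvCapableResolver : public Resolver {
public:
  bool supportsSrv() const override { return true; }
};

// A resolver that only handles A/AAAA lookups.
class AddressOnlyResolver : public Resolver {};

// Returns an error message when the resolver cannot serve SRV queries,
// or std::nullopt when the configuration is acceptable.
std::optional<std::string> validateResolver(const Resolver& resolver,
                                            const std::string& resolver_name) {
  if (!resolver.supportsSrv()) {
    return "DNS resolver '" + resolver_name + "' does not support SRV records";
  }
  return std::nullopt;
}
```

Because the check asks the resolver object rather than comparing extension names, a resolver maintained outside the Envoy repository would pass as long as it reports SRV support.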

@ggreenway ggreenway left a comment

Member

/wait

Comment thread source/extensions/clusters/dns_srv/dns_srv_cluster.cc Outdated
Comment thread test/extensions/clusters/dns_srv/dns_srv_cluster_integration_test.cc Outdated

  ASSERT_TRUE(response->complete());
  EXPECT_EQ("200", response->headers().getStatusValue());
}

Member

Please add an integration test that adds and removes hosts via the SRV response, as well as via a changed A/AAAA response.

EXPECT_CALL(dns_resolver_factory_, createDnsResolver(_, _, _))
    .WillRepeatedly(testing::Return(dns_resolver));

EXPECT_CALL(*dns_resolver, resolve(_, _, _))

Member

This isn't really an integration test if it doesn't use a real resolver with asynchronous resolution. Can you refactor this to use c-ares, and make the test respond with real DNS responses?

Contributor Author

Hey @ggreenway,
I think this one deserves a bit of discussion (and probably some help?).
As far as I understand, this pattern is used in test/extensions/clusters/common/logical_host_integration_test.cc, which is how it ended up here.

There is a FakeUdpDnsServer, which I think is used only in benchmarks.

source/extensions/network/dns_resolver/cares/dns_impl.cc already does some sort of network communication, and we could extract it to a common place to reuse between tests, but this looks like a significant change.

How would you suggest approaching this?

Comment thread source/extensions/clusters/dns_srv/dns_srv_cluster.h Outdated
Comment thread source/extensions/network/dns_resolver/cares/dns_impl.h Outdated
Comment thread envoy/network/dns.h Outdated
Comment thread source/extensions/clusters/dns_srv/dns_srv_cluster.cc Outdated
Comment on lines +64 to +65
friend class DnsSrvClusterFactory;
friend class DnsSrvClusterTest;

Member

Can you refactor the tests so the friend isn't needed? Maybe add accessors for the data you need in the test?

Contributor Author

Hmm, this looks like a common pattern for clusters. But if you think it's worth changing, I can give it a try...

Comment on lines +23 to +32
namespace Clusters {
class DnsSrvClusterTest;
class DnsSrvClusterTest_CreateClusterWithMinimalConfig_Test;
}
}

namespace Upstream {

class DnsSrvClusterFactory;
class DnsSrvClusterTest;

Member

These are declared in different namespaces

Co-authored-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Mikhail Galanin <195510+mikhainin@users.noreply.github.com>
mikhainin and others added 6 commits April 29, 2026 19:46
Co-authored-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Mikhail Galanin <195510+mikhainin@users.noreply.github.com>
Signed-off-by: Mikhail Galanin <mikhail.galanin@yahoo.com>

@ggreenway ggreenway left a comment

Member

Claude Code says the following test cases are missing:

  Missing Test Cases

  Resolution lifecycle:
  1. SRV failure at cluster level — resolveSrv returns ResolutionStatus::Failure. Verify update_failure_ incremented, timer armed, onPreInitComplete called.
  2. Empty SRV response — SRV succeeds with zero records. Cluster should end up with no hosts.
  3. All A/AAAA queries fail — SRV returns 3 targets, all address resolutions fail. some_targets_resolved stays false — verify host list is NOT updated (existing hosts preserved? or empty?).
  4. Multiple IPs per hostname — One SRV target hostname resolves to 3 A records. Verify all 3 become hosts with the
  same port/priority/weight.

  Timer and re-resolution:
  5. Timer re-resolve — After initial resolution, verify timer fires and triggers a new SRV cycle.
  6. Re-resolve changes hosts — First cycle returns hosts A, B. Second cycle returns B, C. Verify A removed, C added, B kept.
  7. TTL respect — With respect_dns_ttl: true, verify timer is armed with the minimum TTL from A/AAAA responses.
  8. SRV TTL — Even with the current design (SRV TTL ignored), a test documenting this behavior would be useful.

  Destruction and cancellation:
  9. Cluster destroyed during SRV query — Verify no crash/leak (exercises ~DnsSrvCluster cancel path).
  10. Cluster destroyed during A/AAAA queries — Verify ResolveTarget destructors cancel outstanding queries cleanly.

  Edge cases:
  11. Duplicate addresses — Two different SRV hostnames resolve to the same IP:port. Verify deduplication
  (all_new_hosts logic).
  12. Weight=0 — SRV record with weight 0. Verify host is created and functional.
  13. High SRV priority values — Priority=10, 20. Verify hosts end up at the correct priority levels (or document
  that priorities > 0 are broken).
  14. Synchronous DNS completion — Mock dns_resolver_->resolve() to return nullptr (synchronous). Verify the
  resolution still completes correctly.

  Factory/Config validation:
  15. Non-c-ares resolver rejected — Config with getaddrinfo resolver should fail with an error.
  16. Missing typed_dns_resolver_config — What happens if the cluster config doesn't specify a resolver? The name
  check will reject "".
  17. Proto validation — Empty srv_name rejected by validate.rules.

  c-ares level:
  18. SRV cancellation — Start SRV query, cancel it, verify no callback and clean destruction.
  19. ARES_EDESTRUCTION during SRV — Resolver destroyed while SRV query pending (currently crashes due to null dnsrec dereference).
  20. Multiple SRV records in DNS response — Verify all records are parsed and returned.
  21. SRV with CNAME indirection — DNS returns CNAME then SRV records.

  The most critical missing tests are #9/#10 (destruction safety), #3 (all failures), and #5/#6 (re-resolution).
  These exercise the paths most likely to have lifetime bugs in production.

/wait
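
Cases 6 and 11 from the list above come down to comparing two endpoint sets. A rough, self-contained sketch of that comparison follows; `Endpoint`, `HostDelta`, and `diffHosts` are illustrative names and do not mirror the cluster's actual `all_new_hosts` logic.

```cpp
#include <cstdint>
#include <set>
#include <string>
#include <utility>
#include <vector>

// An endpoint is identified by IP address and port. Storing endpoints in a
// std::set collapses duplicates, e.g. when two SRV targets resolve to the
// same IP:port.
using Endpoint = std::pair<std::string, uint16_t>;

struct HostDelta {
  std::vector<Endpoint> added;
  std::vector<Endpoint> removed;
};

// Compares the hosts currently in the cluster with the freshly resolved set.
// Endpoints present in both sets appear in neither list and are kept as-is.
HostDelta diffHosts(const std::set<Endpoint>& current, const std::set<Endpoint>& resolved) {
  HostDelta delta;
  for (const Endpoint& endpoint : resolved) {
    if (current.count(endpoint) == 0) {
      delta.added.push_back(endpoint);
    }
  }
  for (const Endpoint& endpoint : current) {
    if (resolved.count(endpoint) == 0) {
      delta.removed.push_back(endpoint);
    }
  }
  return delta;
}
```

A test for case 6 could feed hosts A, B as `current` and B, C as `resolved`, then expect C in `added` and A in `removed`.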

void DnsResolverImpl::PendingSrvResolution::onAresSrvCallback(ares_status_t status, size_t timeouts,
                                                              const ares_dns_record_t* dnsrec) {

  unsigned int aresAnswerQueryId = ares_dns_record_get_id(dnsrec);

Member

I think dnsrec can be null here, and this would crash

@mikhainin mikhainin May 23, 2026

Contributor Author

Looking at the source, it looks like it won't crash here. Instead, it will return zero, which is a valid query_id, and then crash on the assertion below (when assertions are enabled). I updated the implementation to treat dnsrec == nullptr as an unsuccessful result.
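
The ordering described in this reply (check for a null record before reading anything from it) might look like the sketch below. `DnsRecord` and `usableQueryId` are stand-ins invented for illustration; the real callback works with c-ares types and is not reproduced here.

```cpp
#include <cstdint>
#include <optional>

// Stand-in for a parsed DNS reply. This is NOT the c-ares record type.
struct DnsRecord {
  uint16_t query_id;
};

// Returns the query id only when the lookup succeeded and a record is
// present. A missing record is reported as "no value" instead of being
// confused with the legitimate query id 0.
std::optional<uint16_t> usableQueryId(bool status_ok, const DnsRecord* record) {
  if (!status_ok || record == nullptr) {
    return std::nullopt;
  }
  return record->query_id;
}
```

Returning `std::nullopt` keeps the "no record" case distinct from a reply whose id happens to be zero, which is the ambiguity the reply points out.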

Comment thread source/extensions/network/dns_resolver/cares/dns_impl.cc Outdated
Comment on lines +142 to +144
target->weight_, constLocalitySharedPool()->getObject(locality_lb_endpoints.locality()),
lb_endpoint.endpoint().health_check_config(), target->priority_,
lb_endpoint.health_status());

Member

The PR comments say weight and priority aren't implemented yet, but you're at least half-implementing them here. I think you should set weight and priority to zero, and make a followup PR to implement them fully.

void addResolvedTarget(Network::Address::InstanceConstSharedPtr address);

ResolveList& parent_;
Network::DnsResolverSharedPtr dns_resolver_;

Member

I think this is unused and can be deleted

Contributor Author

I think it is used in DnsSrvCluster::ResolveTarget::startResolve()...

dereferencing
deprioritized
differentially
dnsrec

Member

You can put terms like this in backticks in comments so they don't need to be added to the dictionary

Comment thread test/extensions/clusters/dns_srv/dns_srv_cluster_test.cc Outdated
Signed-off-by: Mikhail Galanin <mikhail.galanin@yahoo.com>
mikhainin added 9 commits May 17, 2026 11:16
Signed-off-by: Mikhail Galanin <mikhail.galanin@yahoo.com>