
Retina can break connectivity of pods to the Kubernetes Cluster IP in clusters using Cilium #252

@andreev-io

Description


This issue is seen in both AKS and GCP. See notes for AKS at #252 (comment)

Describe the bug
After Retina is installed, pods in a GKE cluster using managed Cilium (Dataplane V2) can lose connectivity to the Kubernetes service ClusterIP.

To Reproduce

  1. In the Google Cloud console, start creating a standard GKE cluster.
  2. Select the Standard: You manage your cluster option (see screenshot 1).
  3. Specify GKE version 1.26.11-gke.1055000 using the No channel option in the release channel selector (see screenshot 2). We suspect the issue occurs with other versions too, but we pinned this one for reproducibility.
  4. [Optional] Configure the cluster to run in one AZ with fewer nodes than the default to manage cost.
  5. [Important] In the Networking configuration tab for the entire cluster, select Enable Dataplane V2 to enable managed Cilium-powered networking (see screenshot 3).
  6. Create the cluster and wait for all default pods in the cluster to come up.
  7. Install Retina and wait for the agent pods to start.
> VERSION=$( curl -sL https://api.github.com/repos/microsoft/retina/releases/latest | jq -r .name)
helm install retina oci://ghcr.io/microsoft/retina/charts/retina \
    --set namespace=kube-system \
    --version $VERSION \
    --namespace kube-system \
    --set image.tag=$VERSION \
    --set operator.tag=$VERSION \
    --set image.pullPolicy=Always \
    --set logLevel=info \
    --set operator.enabled=true \
    --set operator.enableRetinaEndpoint=true \
    --set enabledPlugin_linux="\[packetparser\]" \
    --set enablePodLevel=true \
    --set remoteContext=true

Note: if you are running a cluster with small nodes, you might need to manually edit the retina-agent DaemonSet to lower resource requests. Wait until retina-agent pods start.
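If you do need to lower the requests, a strategic-merge patch along these lines should work. This is a sketch, not the project's documented procedure: the container name retina and the request values are assumptions, so verify them against the DaemonSet spec first.

> kubectl -n kube-system get daemonset retina-agent -o yaml | grep -A4 resources
> kubectl -n kube-system patch daemonset retina-agent -p '
spec:
  template:
    spec:
      containers:
      - name: retina
        resources:
          requests:
            cpu: 100m
            memory: 128Mi'
> kubectl -n kube-system rollout status daemonset retina-agent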

  8. Find the metrics-server pod running in the kube-system namespace and check its logs (a command sketch follows the excerpt below). You will see errors such as:
E0409 15:21:23.378785       1 webhook.go:202] Failed to make webhook authorizer request: Post "https://10.114.192.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s": context canceled
E0409 15:21:23.378851       1 errors.go:77] Post "https://10.114.192.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s": context canceled
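To pull these logs yourself, something like the following should work; the k8s-app=metrics-server label selector is an assumption about how GKE labels the deployment, so adjust it if your pods are labeled differently.

> kubectl -n kube-system logs -l k8s-app=metrics-server --tail=100 | grep -E 'webhook|context canceled'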
  9. Identify the cluster IP and the endpoint IP:
> kubectl get service
NAME         TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.114.192.1   <none>        443/TCP   45m
> kubectl get ep     
NAME         ENDPOINTS        AGE
kubernetes   10.128.0.7:443   45m
  10. Connect to another pod and check connectivity to both addresses. You'll see that there is connectivity to the endpoint IP but not to the service IP; the 403 below is expected, since curl sends no credentials and the API server treats the request as system:anonymous.
> kubectl debug -ti --image="nixery.dev/shell/curl" kube-dns-ff4bbcc87-tvzm7 -n kube-system
bash-5.2# curl https://10.114.192.1 -v -k
...
bash-5.2# curl https://10.128.0.7 -v -k
*   Trying 10.128.0.7:443...
* Connected to 10.128.0.7 (10.128.0.7) port 443
* ALPN: curl offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=34.173.138.225
*  start date: Apr  9 14:52:44 2024 GMT
*  expire date: Apr  8 14:54:44 2029 GMT
*  issuer: CN=ca353e3b-048b-4feb-aa93-19a7c8a6aa89
*  SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* using HTTP/2
* [HTTP/2] [1] OPENED stream for https://10.128.0.7/
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: 10.128.0.7]
* [HTTP/2] [1] [:path: /]
* [HTTP/2] [1] [user-agent: curl/8.4.0]
* [HTTP/2] [1] [accept: */*]
> GET / HTTP/2
> Host: 10.128.0.7
> User-Agent: curl/8.4.0
> Accept: */*
> 
* received GOAWAY, error=0, last_stream=1
< HTTP/2 403 
< audit-id: 2c7f6280-d595-4ddf-850f-abf1cadd85d8
< cache-control: no-cache, private
< content-type: application/json
< x-content-type-options: nosniff
< x-kubernetes-pf-flowschema-uid: 759447f6-3823-412a-86a3-09c764ef91eb
< x-kubernetes-pf-prioritylevel-uid: 2707b41b-d15c-402a-a039-b0df8aff1c2d
< content-length: 217
< date: Tue, 09 Apr 2024 15:45:36 GMT
< 
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
* Closing connection
* TLSv1.3 (OUT), TLS alert, close notify (256):
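To capture the contrast without waiting on a hung connection, a time-bounded check can be run from the same debug shell. This is a sketch assuming the same service and endpoint IPs as above; substitute your cluster's values.

bash-5.2# curl -sk --max-time 5 -o /dev/null -w '%{http_code}\n' https://10.114.192.1 || echo timeout
bash-5.2# curl -sk --max-time 5 -o /dev/null -w '%{http_code}\n' https://10.128.0.7

The first command should print 000 followed by timeout; the second should print 403.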

Expected behaviour
No connectivity impact when installing Retina.

Screenshots
Step (2). Select Standard: You manage your cluster.

Step (3). Select No channel when specifying the version, then specify version 1.26.11-gke.1055000.

Step (5). Select Enable Dataplane V2 in the cluster network configuration tab.

Platform (please complete the following information):
See steps to reproduce.

Additional context
N/A
