Overview
In Kubernetes v1.16, NodeController has been split into NodeIpamController and NodeLifecycleController; this article focuses on NodeLifecycleController.
The main responsibilities of NodeLifecycleController are:
(1) Periodically check node heartbeats. When a node has not reported a heartbeat for a certain interval, update its ready condition to false or unknown, and, if taint-based eviction is enabled, add a NoExecute taint to the node.
(2) When taint-based eviction is disabled, evict (delete) the pods on a node once its ready condition has been false or unknown for a configurable period of time.
(3) When taint-based eviction is enabled, once a node carries a NoExecute taint, pods that cannot tolerate the taint are evicted (deleted) immediately; pods that do tolerate it are evicted (deleted) only after the smallest toleration time among all their taint tolerations has elapsed.
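To make the toleration timing concrete, here is a minimal, self-contained Go sketch of picking the smallest tolerationSeconds that bounds the eviction delay. The types and function are simplified stand-ins for illustration, not the actual Kubernetes API types or the taintManager's real helper:

```go
package main

import "fmt"

// Toleration is a simplified stand-in for the Kubernetes API type; only the
// fields relevant to NoExecute handling are kept.
type Toleration struct {
	Key               string
	TolerationSeconds *int64 // nil means "tolerate forever"
}

// minTolerationSeconds returns the smallest TolerationSeconds among the given
// tolerations, and whether the pod tolerates the taint indefinitely. A pod
// with no matching toleration at all would be evicted immediately; that case
// is left to the caller in this sketch.
func minTolerationSeconds(tolerations []Toleration) (seconds int64, forever bool) {
	forever = true
	for _, t := range tolerations {
		if t.TolerationSeconds == nil {
			continue // this toleration never expires, so it does not bound the delay
		}
		if forever || *t.TolerationSeconds < seconds {
			seconds = *t.TolerationSeconds
			forever = false
		}
	}
	return seconds, forever
}

func main() {
	s1, s2 := int64(300), int64(60)
	secs, forever := minTolerationSeconds([]Toleration{
		{Key: "node.kubernetes.io/not-ready", TolerationSeconds: &s1},
		{Key: "node.kubernetes.io/unreachable", TolerationSeconds: &s2},
	})
	fmt.Println(secs, forever) // the 60s toleration bounds the eviction delay
}
```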
Source Code Analysis
The analysis is split into three parts:
(1) startup parameters;
(2) initialization and the relevant structs;
(3) processing logic.
1. Startup Parameters
// cmd/kube-controller-manager/app/core.go
func startNodeLifecycleController(ctx ControllerContext) (http.Handler, bool, error) {
	lifecycleController, err := lifecyclecontroller.NewNodeLifecycleController(
		ctx.InformerFactory.Coordination().V1().Leases(),
		ctx.InformerFactory.Core().V1().Pods(),
		ctx.InformerFactory.Core().V1().Nodes(),
		ctx.InformerFactory.Apps().V1().DaemonSets(),
		// node lifecycle controller uses existing cluster role from node-controller
		ctx.ClientBuilder.ClientOrDie("node-controller"),
		ctx.ComponentConfig.KubeCloudShared.NodeMonitorPeriod.Duration,
		ctx.ComponentConfig.NodeLifecycleController.NodeStartupGracePeriod.Duration,
		ctx.ComponentConfig.NodeLifecycleController.NodeMonitorGracePeriod.Duration,
		ctx.ComponentConfig.NodeLifecycleController.PodEvictionTimeout.Duration,
		ctx.ComponentConfig.NodeLifecycleController.NodeEvictionRate,
		ctx.ComponentConfig.NodeLifecycleController.SecondaryNodeEvictionRate,
		ctx.ComponentConfig.NodeLifecycleController.LargeClusterSizeThreshold,
		ctx.ComponentConfig.NodeLifecycleController.UnhealthyZoneThreshold,
		ctx.ComponentConfig.NodeLifecycleController.EnableTaintManager,
		utilfeature.DefaultFeatureGate.Enabled(features.TaintBasedEvictions),
	)
	if err != nil {
		return nil, true, err
	}
	go lifecycleController.Run(ctx.Stop)
	return nil, true, nil
}
As seen above, startNodeLifecycleController passes a number of kube-controller-manager (kcm) startup parameters into lifecyclecontroller.NewNodeLifecycleController:
(1) ctx.ComponentConfig.KubeCloudShared.NodeMonitorPeriod.Duration
This is the kcm startup parameter --node-monitor-period, default 5s: the period at which NodeLifecycleController syncs a node object's status (its taints and condition values).
fs.DurationVar(&o.NodeMonitorPeriod.Duration, "node-monitor-period", o.NodeMonitorPeriod.Duration,
"The period for syncing NodeStatus in NodeController.")
(2) ctx.ComponentConfig.NodeLifecycleController.NodeStartupGracePeriod.Duration
This is the kcm startup parameter --node-startup-grace-period, default 60s: how long a newly started node is allowed to be unresponsive before its conditions are updated and it is marked unhealthy.
fs.DurationVar(&o.NodeStartupGracePeriod.Duration, "node-startup-grace-period", o.NodeStartupGracePeriod.Duration,
"Amount of time which we allow starting Node to be unresponsive before marking it unhealthy.")
(3) ctx.ComponentConfig.NodeLifecycleController.NodeMonitorGracePeriod.Duration
This is the kcm startup parameter --node-monitor-grace-period, default 40s: once more than 40s have passed since the last heartbeat, the node's conditions are updated to unknown (the kubelet reports heartbeats by renewing its node lease).
fs.DurationVar(&o.NodeMonitorGracePeriod.Duration, "node-monitor-grace-period", o.NodeMonitorGracePeriod.Duration,
"Amount of time which we allow running Node to be unresponsive before marking it unhealthy. "+
"Must be N times more than kubelet's nodeStatusUpdateFrequency, "+
"where N means number of retries allowed for kubelet to post node status.")
(4) ctx.ComponentConfig.NodeLifecycleController.PodEvictionTimeout.Duration
This is the kcm startup parameter --pod-eviction-timeout, default 5 minutes. It only takes effect when taint-based eviction is disabled: once a node's ready condition has been false or unknown for 5 minutes, the pods on that node are evicted (deleted).
fs.DurationVar(&o.PodEvictionTimeout.Duration, "pod-eviction-timeout", o.PodEvictionTimeout.Duration, "The grace period for deleting pods on failed nodes.")
(5) ctx.ComponentConfig.NodeLifecycleController.EnableTaintManager
This is the kcm startup parameter --enable-taint-manager, default true: start the taintManager, which evicts pods already scheduled onto a node that cannot tolerate the node's NoExecute taints. If false, the taintManager is not started and evictions are driven by --pod-eviction-timeout instead.
fs.BoolVar(&o.EnableTaintManager, "enable-taint-manager", o.EnableTaintManager, "WARNING: Beta feature. If set to true enables NoExecute Taints and will evict all not-tolerating Pod running on Nodes tainted with this kind of Taints.")
(6) utilfeature.DefaultFeatureGate.Enabled(features.TaintBasedEvictions)
This is the kcm startup parameter --feature-gates=TaintBasedEvictions=xxx, default true. It works together with --enable-taint-manager: taint-based eviction is enabled only when both are true.
(7) ctx.ComponentConfig.NodeLifecycleController.NodeEvictionRate
This is the kcm startup parameter --node-eviction-rate, default 0.1: when a zone (the zone concept is introduced below) is healthy, the number of nodes per second on which pod eviction should be triggered; the default of 0.1 means pod eviction is triggered on one node every 10s.
fs.Float32Var(&o.NodeEvictionRate, "node-eviction-rate", 0.1, "Number of nodes per second on which pods are deleted in case of node failure when a zone is healthy (see --unhealthy-zone-threshold for definition of healthy/unhealthy). Zone refers to entire cluster in non-multizone clusters.")
(8) ctx.ComponentConfig.NodeLifecycleController.SecondaryNodeEvictionRate
This is the kcm startup parameter --secondary-node-eviction-rate. When the fraction of unhealthy nodes in a zone exceeds --unhealthy-zone-threshold (default 0.55), the eviction rate is reduced: if the zone is not a LargeCluster (i.e. it has no more than --large-cluster-size-threshold nodes, default 50), eviction stops entirely; if it is a LargeCluster, the eviction rate drops to --secondary-node-eviction-rate nodes per second, default 0.01.
fs.Float32Var(&o.SecondaryNodeEvictionRate, "secondary-node-eviction-rate", 0.01, "Number of nodes per second on which pods are deleted in case of node failure when a zone is unhealthy (see --unhealthy-zone-threshold for definition of healthy/unhealthy). Zone refers to entire cluster in non-multizone clusters. This value is implicitly overridden to 0 if the cluster size is smaller than --large-cluster-size-threshold.")
(9) ctx.ComponentConfig.NodeLifecycleController.LargeClusterSizeThreshold
This is the kcm startup parameter --large-cluster-size-threshold, default 50: a zone with more nodes than this is treated as a LargeCluster; for zones that are not, the SecondaryNodeEvictionRate setting is effectively overridden to 0.
fs.Int32Var(&o.LargeClusterSizeThreshold, "large-cluster-size-threshold", 50, "Number of nodes from which NodeController treats the cluster as large for the eviction logic purposes. --secondary-node-eviction-rate is implicitly overridden to 0 for clusters this size or smaller.")
(10) ctx.ComponentConfig.NodeLifecycleController.UnhealthyZoneThreshold
This is the kcm startup parameter --unhealthy-zone-threshold, default 0.55: the threshold above which a zone is considered unhealthy, which in turn determines when the secondary eviction rate kicks in. A zone is considered unhealthy when more than 55% of its nodes are not ready (ready condition not true).
fs.Float32Var(&o.UnhealthyZoneThreshold, "unhealthy-zone-threshold", 0.55, "Fraction of Nodes in a zone which needs to be not Ready (minimum 3) for zone to be treated as unhealthy. ")
(11) --feature-gates=NodeLease=xxx
Default true: report node heartbeats via Lease objects instead of the old approach of updating node status, which greatly reduces the load on the apiserver.
The zone concept
Nodes are partitioned into zones according to the region and zone label values on each node object: nodes with identical region and zone values belong to the same zone.
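A minimal sketch of this grouping follows. The label keys are the region/zone keys used in v1.16 (later releases moved to the topology.kubernetes.io equivalents); the function itself is a simplified stand-in for the real helper, which joins the two values with a non-printable separator:

```go
package main

import "fmt"

// zoneKey groups nodes into zones: nodes sharing both the region and the zone
// label values get the same key and therefore land in the same zone.
func zoneKey(labels map[string]string) string {
	region := labels["failure-domain.beta.kubernetes.io/region"]
	zone := labels["failure-domain.beta.kubernetes.io/zone"]
	return region + ":" + zone // simplified separator for illustration
}

func main() {
	a := zoneKey(map[string]string{
		"failure-domain.beta.kubernetes.io/region": "us-east-1",
		"failure-domain.beta.kubernetes.io/zone":   "us-east-1a",
	})
	b := zoneKey(map[string]string{
		"failure-domain.beta.kubernetes.io/region": "us-east-1",
		"failure-domain.beta.kubernetes.io/zone":   "us-east-1b",
	})
	// Same region but different zone labels: two distinct zones.
	fmt.Println(a == b)
}
```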
Zone states
A zone can be in one of four states:
(1) Initial: the initial state;
(2) FullDisruption: the number of ready nodes is 0 and the number of not-ready nodes is greater than 0;
(3) PartialDisruption: the number of not-ready nodes is greater than 2 and their fraction is at least unhealthyZoneThreshold;
(4) Normal: any situation not covered by the three states above.
Note how the secondary eviction rate (the kcm startup parameter --secondary-node-eviction-rate) affects evictions: when the fraction of unhealthy nodes in a zone exceeds --unhealthy-zone-threshold (default 0.55), the eviction rate is reduced; if the zone is not a LargeCluster (no more than --large-cluster-size-threshold nodes, default 50), eviction stops entirely, and if it is a LargeCluster, the rate drops to --secondary-node-eviction-rate nodes per second, default 0.01.
// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func (nc *Controller) ComputeZoneState(nodeReadyConditions []*v1.NodeCondition) (int, ZoneState) {
	readyNodes := 0
	notReadyNodes := 0
	for i := range nodeReadyConditions {
		if nodeReadyConditions[i] != nil && nodeReadyConditions[i].Status == v1.ConditionTrue {
			readyNodes++
		} else {
			notReadyNodes++
		}
	}
	switch {
	case readyNodes == 0 && notReadyNodes > 0:
		return notReadyNodes, stateFullDisruption
	case notReadyNodes > 2 && float32(notReadyNodes)/float32(notReadyNodes+readyNodes) >= nc.unhealthyZoneThreshold:
		return notReadyNodes, statePartialDisruption
	default:
		return notReadyNodes, stateNormal
	}
}
2. Initialization and Related Structs
2.1 The Controller struct
Key fields of the Controller struct:
(1) taintManager: the manager responsible for taint-based eviction;
(2) enterPartialDisruptionFunc: returns the eviction rate used when a zone is in the PartialDisruption state (if the zone's node count exceeds largeClusterThreshold, it returns secondaryEvictionLimiterQPS, i.e. the kcm startup parameter --secondary-node-eviction-rate; otherwise it returns 0);
(3) enterFullDisruptionFunc: returns the eviction rate used when a zone is in the FullDisruption state (it simply returns NodeEvictionRate, the kcm startup parameter --node-eviction-rate);
(4) computeZoneStateFunc: the method that computes a zone's state, i.e. the ComputeZoneState method shown in the zone-states section above;
(5) nodeHealthMap: records the most recently observed health of every node;
(6) zoneStates: records the state of every zone;
(7) nodeMonitorPeriod, nodeStartupGracePeriod, nodeMonitorGracePeriod, podEvictionTimeout, evictionLimiterQPS, secondaryEvictionLimiterQPS, largeClusterThreshold, unhealthyZoneThreshold: already covered in the startup-parameter analysis above;
(8) runTaintManager: set from the kcm startup parameter --enable-taint-manager; whether to start the taintManager;
(9) useTaintBasedEvictions: whether taint-based eviction is enabled, set from the kcm startup parameter --feature-gates=TaintBasedEvictions=xxx, default true; it works together with --enable-taint-manager, and taint-based eviction is on only when both are true.
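The two rate functions described in (2) and (3) can be sketched as follows; names and signatures are illustrative rather than copied from the controller:

```go
package main

import "fmt"

// partialDisruptionQPS sketches the PartialDisruption behavior: zones larger
// than the LargeCluster threshold fall back to the secondary eviction rate,
// while smaller zones stop evicting entirely (rate 0).
func partialDisruptionQPS(nodeNum int, largeClusterThreshold int32, secondaryQPS float32) float32 {
	if int32(nodeNum) > largeClusterThreshold {
		return secondaryQPS
	}
	return 0
}

// fullDisruptionQPS sketches the FullDisruption behavior: the normal
// eviction rate is used regardless of zone size.
func fullDisruptionQPS(nodeNum int, evictionQPS float32) float32 {
	return evictionQPS
}

func main() {
	// A 100-node zone exceeds the default threshold of 50: secondary rate.
	fmt.Println(partialDisruptionQPS(100, 50, 0.01))
	// A 30-node zone does not: evictions stop.
	fmt.Println(partialDisruptionQPS(30, 50, 0.01))
	// FullDisruption keeps the normal rate.
	fmt.Println(fullDisruptionQPS(30, 0.1))
}
```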
Two key queues in the Controller struct:
(1) zonePodEvictor: a queue of nodes whose pods need to be evicted (used only when taint-based eviction is disabled). When a node's ready condition has been false or unknown for podEvictionTimeout, the node is put into this queue; a worker then reads nodes from the queue and evicts the pods on them.
(2) zoneNoExecuteTainter: a queue of nodes whose taints need updating. When a node's ready condition becomes false or unknown, the node is put into this queue; a worker then reads nodes from the queue and updates their taints (adding the notReady or unreachable taint).
// pkg/controller/nodelifecycle/node_lifecycle_controller.go
type Controller struct {
	...
	taintManager *scheduler.NoExecuteTaintManager
	// This timestamp is to be used instead of LastProbeTime stored in Condition. We do this
	// to avoid the problem with time skew across the cluster.
	now func() metav1.Time
	enterPartialDisruptionFunc func(nodeNum int) float32
	enterFullDisruptionFunc    func(nodeNum int) float32
	computeZoneStateFunc       func(nodeConditions []*v1.NodeCondition) (int, ZoneState)
	knownNodeSet map[string]*v1.Node
	// per Node map storing last observed health together with a local time when it was observed.
	nodeHealthMap *nodeHealthMap
	// evictorLock protects zonePodEvictor and zoneNoExecuteTainter.
	// TODO(#83954): API calls shouldn't be executed under the lock.
	evictorLock sync.Mutex
	nodeEvictionMap *nodeEvictionMap
	// workers that evicts pods from unresponsive nodes.
	zonePodEvictor map[string]*scheduler.RateLimitedTimedQueue
	// workers that are responsible for tainting node