Overview
In Kubernetes v1.16, NodeController has been split into NodeIpamController and NodeLifecycleController; this article focuses on NodeLifecycleController.
The main responsibilities of NodeLifecycleController are:
(1) Periodically check node heartbeats. When a node has not reported a heartbeat for a certain interval, update its ready condition to false or unknown, and, if taint-based eviction is enabled, add a NoExecute taint to the node.
(2) When taint-based eviction is disabled, evict (delete) the pods on a node once its ready condition has been false or unknown for a configurable period of time.
(3) When taint-based eviction is enabled, once a node carries a NoExecute taint, pods that cannot tolerate the taint are evicted (deleted) immediately; pods that do tolerate it are evicted (deleted) only after the smallest toleration time among all their taint tolerations has elapsed.
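To make the toleration timing concrete, here is a minimal, self-contained Go sketch of picking the smallest tolerationSeconds that bounds the eviction delay. The types and function are simplified stand-ins for illustration, not the actual Kubernetes API types or the taintManager's real helper:

```go
package main

import "fmt"

// Toleration is a simplified stand-in for the Kubernetes API type; only the
// fields relevant to NoExecute handling are kept.
type Toleration struct {
	Key               string
	TolerationSeconds *int64 // nil means "tolerate forever"
}

// minTolerationSeconds returns the smallest TolerationSeconds among the given
// tolerations, and whether the pod tolerates the taint indefinitely. A pod
// with no matching toleration at all would be evicted immediately; that case
// is left to the caller in this sketch.
func minTolerationSeconds(tolerations []Toleration) (seconds int64, forever bool) {
	forever = true
	for _, t := range tolerations {
		if t.TolerationSeconds == nil {
			continue // this toleration never expires, so it does not bound the delay
		}
		if forever || *t.TolerationSeconds < seconds {
			seconds = *t.TolerationSeconds
			forever = false
		}
	}
	return seconds, forever
}

func main() {
	s1, s2 := int64(300), int64(60)
	secs, forever := minTolerationSeconds([]Toleration{
		{Key: "node.kubernetes.io/not-ready", TolerationSeconds: &s1},
		{Key: "node.kubernetes.io/unreachable", TolerationSeconds: &s2},
	})
	fmt.Println(secs, forever) // the 60s toleration bounds the eviction delay
}
```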
Source Code Analysis
The analysis is split into three parts:
(1) startup parameters;
(2) initialization and the relevant structs;
(3) processing logic.
1. Startup Parameters
// cmd/kube-controller-manager/app/core.go
func startNodeLifecycleController(ctx ControllerContext) (http.Handler, bool, error) {
	lifecycleController, err := lifecyclecontroller.NewNodeLifecycleController(
		ctx.InformerFactory.Coordination().V1().Leases(),
		ctx.InformerFactory.Core().V1().Pods(),
		ctx.InformerFactory.Core().V1().Nodes(),
		ctx.InformerFactory.Apps().V1().DaemonSets(),
		// node lifecycle controller uses existing cluster role from node-controller
		ctx.ClientBuilder.ClientOrDie("node-controller"),
		ctx.ComponentConfig.KubeCloudShared.NodeMonitorPeriod.Duration,
		ctx.ComponentConfig.NodeLifecycleController.NodeStartupGracePeriod.Duration,
		ctx.ComponentConfig.NodeLifecycleController.NodeMonitorGracePeriod.Duration,
		ctx.ComponentConfig.NodeLifecycleController.PodEvictionTimeout.Duration,
		ctx.ComponentConfig.NodeLifecycleController.NodeEvictionRate,
		ctx.ComponentConfig.NodeLifecycleController.SecondaryNodeEvictionRate,
		ctx.ComponentConfig.NodeLifecycleController.LargeClusterSizeThreshold,
		ctx.ComponentConfig.NodeLifecycleController.UnhealthyZoneThreshold,
		ctx.ComponentConfig.NodeLifecycleController.EnableTaintManager,
		utilfeature.DefaultFeatureGate.Enabled(features.TaintBasedEvictions),
	)
	if err != nil {
		return nil, true, err
	}
	go lifecycleController.Run(ctx.Stop)
	return nil, true, nil
}
As seen above, startNodeLifecycleController passes a number of kube-controller-manager (kcm) startup parameters into lifecyclecontroller.NewNodeLifecycleController:
(1) ctx.ComponentConfig.KubeCloudShared.NodeMonitorPeriod.Duration
This is the kcm startup parameter --node-monitor-period, default 5s: the period at which NodeLifecycleController syncs a node object's status (its taints and condition values).
fs.DurationVar(&o.NodeMonitorPeriod.Duration, "node-monitor-period", o.NodeMonitorPeriod.Duration,
"The period for syncing NodeStatus in NodeController.")
(2) ctx.ComponentConfig.NodeLifecycleController.NodeStartupGracePeriod.Duration
This is the kcm startup parameter --node-startup-grace-period, default 60s: how long a newly started node is allowed to be unresponsive before its conditions are updated and it is marked unhealthy.
fs.DurationVar(&o.NodeStartupGracePeriod.Duration, "node-startup-grace-period", o.NodeStartupGracePeriod.Duration,
"Amount of time which we allow starting Node to be unresponsive before marking it unhealthy.")
(3) ctx.ComponentConfig.NodeLifecycleController.NodeMonitorGracePeriod.Duration
This is the kcm startup parameter --node-monitor-grace-period, default 40s: once more than 40s have passed since the last heartbeat, the node's conditions are updated to unknown (the kubelet reports heartbeats by renewing its node lease).
fs.DurationVar(&o.NodeMonitorGracePeriod.Duration, "node-monitor-grace-period", o.NodeMonitorGracePeriod.Duration,
"Amount of time which we allow running Node to be unresponsive before marking it unhealthy. "+
"Must be N times more than kubelet's nodeStatusUpdateFrequency, "+
"where N means number of retries allowed for kubelet to post node status.")
(4) ctx.ComponentConfig.NodeLifecycleController.PodEvictionTimeout.Duration
This is the kcm startup parameter --pod-eviction-timeout, default 5 minutes. It only takes effect when taint-based eviction is disabled: once a node's ready condition has been false or unknown for 5 minutes, the pods on that node are evicted (deleted).
fs.DurationVar(&o.PodEvictionTimeout.Duration, "pod-eviction-timeout", o.PodEvictionTimeout.Duration, "The grace period for deleting pods on failed nodes.")
(5) ctx.ComponentConfig.NodeLifecycleController.EnableTaintManager
This is the kcm startup parameter --enable-taint-manager, default true: start the taintManager, which evicts pods already scheduled onto a node that cannot tolerate the node's NoExecute taints. If false, the taintManager is not started and evictions are driven by --pod-eviction-timeout instead.
fs.BoolVar(&o.EnableTaintManager, "enable-taint-manager", o.EnableTaintManager, "WARNING: Beta feature. If set to true enables NoExecute Taints and will evict all not-tolerating Pod running on Nodes tainted with this kind of Taints.")
(6) utilfeature.DefaultFeatureGate.Enabled(features.TaintBasedEvictions)
This is the kcm startup parameter --feature-gates=TaintBasedEvictions=xxx, default true. It works together with --enable-taint-manager: taint-based eviction is enabled only when both are true.
(7) ctx.ComponentConfig.NodeLifecycleController.NodeEvictionRate
This is the kcm startup parameter --node-eviction-rate, default 0.1: when a zone (the zone concept is introduced below) is healthy, the number of nodes per second on which pod eviction should be triggered; the default of 0.1 means pod eviction is triggered on one node every 10s.
fs.Float32Var(&o.NodeEvictionRate, "node-eviction-rate", 0.1, "Number of nodes per second on which pods are deleted in case of node failure when a zone is healthy (see --unhealthy-zone-threshold for definition of healthy/unhealthy). Zone refers to entire cluster in non-multizone clusters.")
(8) ctx.ComponentConfig.NodeLifecycleController.SecondaryNodeEvictionRate
This is the kcm startup parameter --secondary-node-eviction-rate. When the fraction of unhealthy nodes in a zone exceeds --unhealthy-zone-threshold (default 0.55), the eviction rate is reduced: if the zone is not a LargeCluster (i.e. it has no more than --large-cluster-size-threshold nodes, default 50), eviction stops entirely; if it is a LargeCluster, the eviction rate drops to --secondary-node-eviction-rate nodes per second, default 0.01.
fs.Float32Var(&o.SecondaryNodeEvictionRate, "secondary-node-eviction-rate", 0.01, "Number of nodes per second on which pods are deleted in case of node failure when a zone is unhealthy (see --unhealthy-zone-threshold for definition of healthy/unhealthy). Zone refers to entire cluster in non-multizone clusters. This value is implicitly overridden to 0 if the cluster size is smaller than --large-cluster-size-threshold.")
(9) ctx.ComponentConfig.NodeLifecycleController.LargeClusterSizeThreshold
This is the kcm startup parameter --large-cluster-size-threshold, default 50: a zone with more nodes than this is treated as a LargeCluster; for zones that are not, the SecondaryNodeEvictionRate setting is effectively overridden to 0.
fs.Int32Var(&o.LargeClusterSizeThreshold, "large-cluster-size-threshold", 50, "Number of nodes from which NodeController treats the cluster as large for the eviction logic purposes. --secondary-node-eviction-rate is implicitly overridden to 0 for clusters this size or smaller.")
(10) ctx.ComponentConfig.NodeLifecycleController.UnhealthyZoneThreshold
This is the kcm startup parameter --unhealthy-zone-threshold, default 0.55: the threshold above which a zone is considered unhealthy, which in turn determines when the secondary eviction rate kicks in. A zone is considered unhealthy when more than 55% of its nodes are not ready (ready condition not true).
fs.Float32Var(&o.UnhealthyZoneThreshold, "unhealthy-zone-threshold", 0.55, "Fraction of Nodes in a zone which needs to be not Ready (minimum 3) for zone to be treated as unhealthy. ")
(11) --feature-gates=NodeLease=xxx
Default true: report node heartbeats via Lease objects instead of the old approach of updating node status, which greatly reduces the load on the apiserver.
The zone concept
Nodes are partitioned into zones according to the region and zone label values on each node object: nodes with identical region and zone values belong to the same zone.
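A minimal sketch of this grouping follows. The label keys are the region/zone keys used in v1.16 (later releases moved to the topology.kubernetes.io equivalents); the function itself is a simplified stand-in for the real helper, which joins the two values with a non-printable separator:

```go
package main

import "fmt"

// zoneKey groups nodes into zones: nodes sharing both the region and the zone
// label values get the same key and therefore land in the same zone.
func zoneKey(labels map[string]string) string {
	region := labels["failure-domain.beta.kubernetes.io/region"]
	zone := labels["failure-domain.beta.kubernetes.io/zone"]
	return region + ":" + zone // simplified separator for illustration
}

func main() {
	a := zoneKey(map[string]string{
		"failure-domain.beta.kubernetes.io/region": "us-east-1",
		"failure-domain.beta.kubernetes.io/zone":   "us-east-1a",
	})
	b := zoneKey(map[string]string{
		"failure-domain.beta.kubernetes.io/region": "us-east-1",
		"failure-domain.beta.kubernetes.io/zone":   "us-east-1b",
	})
	// Same region but different zone labels: two distinct zones.
	fmt.Println(a == b)
}
```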
Zone states
A zone can be in one of four states:
(1) Initial: the initial state;
(2) FullDisruption: the number of ready nodes is 0 and the number of not-ready nodes is greater than 0;
(3) PartialDisruption: the number of not-ready nodes is greater than 2 and their fraction is at least unhealthyZoneThreshold;
(4) Normal: any situation not covered by the three states above.
Note how the secondary eviction rate (the kcm startup parameter --secondary-node-eviction-rate) affects evictions: when the fraction of unhealthy nodes in a zone exceeds --unhealthy-zone-threshold (default 0.55), the eviction rate is reduced; if the zone is not a LargeCluster (no more than --large-cluster-size-threshold nodes, default 50), eviction stops entirely, and if it is a LargeCluster, the rate drops to --secondary-node-eviction-rate nodes per second, default 0.01.
// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func (nc *Controller) ComputeZoneState(nodeReadyConditions []*v1.NodeCondition) (int, ZoneState) {
	readyNodes := 0
	notReadyNodes := 0
	for i := range nodeReadyConditions {
		if nodeReadyConditions[i] != nil && nodeReadyConditions[i].Status == v1.ConditionTrue {
			readyNodes++
		} else {
			notReadyNodes++
		}
	}
	switch {
	case readyNodes == 0 && notReadyNodes > 0:
		return notReadyNodes, stateFullDisruption
	case notReadyNodes > 2 && float32(notReadyNodes)/float32(notReadyNodes+readyNodes) >= nc.unhealthyZoneThreshold:
		return notReadyNodes, statePartialDisruption
	default:
		return notReadyNodes, stateNormal
	}
}
2. Initialization and Related Structs
2.1 The Controller struct
Key fields of the Controller struct:
(1) taintManager: the manager responsible for taint-based eviction;
(2) enterPartialDisruptionFunc: returns the eviction rate used when a zone is in the PartialDisruption state (if the zone's node count exceeds largeClusterThreshold, it returns secondaryEvictionLimiterQPS, i.e. the kcm startup parameter --secondary-node-eviction-rate; otherwise it returns 0);
(3) enterFullDisruptionFunc: returns the eviction rate used when a zone is in the FullDisruption state (it simply returns NodeEvictionRate, the kcm startup parameter --node-eviction-rate);
(4) computeZoneStateFunc: the method that computes a zone's state, i.e. the ComputeZoneState method shown in the zone-states section above;
(5) nodeHealthMap: records the most recently observed health of every node;
(6) zoneStates: records the state of every zone;
(7) nodeMonitorPeriod, nodeStartupGracePeriod, nodeMonitorGracePeriod, podEvictionTimeout, evictionLimiterQPS, secondaryEvictionLimiterQPS, largeClusterThreshold, unhealthyZoneThreshold: already covered in the startup-parameter analysis above;
(8) runTaintManager: set from the kcm startup parameter --enable-taint-manager; whether to start the taintManager;
(9) useTaintBasedEvictions: whether taint-based eviction is enabled, set from the kcm startup parameter --feature-gates=TaintBasedEvictions=xxx, default true; it works together with --enable-taint-manager, and taint-based eviction is on only when both are true.
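The two rate functions described in (2) and (3) can be sketched as follows; names and signatures are illustrative rather than copied from the controller:

```go
package main

import "fmt"

// partialDisruptionQPS sketches the PartialDisruption behavior: zones larger
// than the LargeCluster threshold fall back to the secondary eviction rate,
// while smaller zones stop evicting entirely (rate 0).
func partialDisruptionQPS(nodeNum int, largeClusterThreshold int32, secondaryQPS float32) float32 {
	if int32(nodeNum) > largeClusterThreshold {
		return secondaryQPS
	}
	return 0
}

// fullDisruptionQPS sketches the FullDisruption behavior: the normal
// eviction rate is used regardless of zone size.
func fullDisruptionQPS(nodeNum int, evictionQPS float32) float32 {
	return evictionQPS
}

func main() {
	// A 100-node zone exceeds the default threshold of 50: secondary rate.
	fmt.Println(partialDisruptionQPS(100, 50, 0.01))
	// A 30-node zone does not: evictions stop.
	fmt.Println(partialDisruptionQPS(30, 50, 0.01))
	// FullDisruption keeps the normal rate.
	fmt.Println(fullDisruptionQPS(30, 0.1))
}
```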
Two key queues in the Controller struct:
(1) zonePodEvictor: a queue of nodes whose pods need to be evicted (used only when taint-based eviction is disabled). When a node's ready condition has been false or unknown for podEvictionTimeout, the node is put into this queue; a worker then reads nodes from the queue and evicts the pods on them.
(2) zoneNoExecuteTainter: a queue of nodes whose taints need updating. When a node's ready condition becomes false or unknown, the node is put into this queue; a worker then reads nodes from the queue and updates their taints (adding the notReady or unreachable taint).
// pkg/controller/nodelifecycle/node_lifecycle_controller.go
type Controller struct {
	...
	taintManager *scheduler.NoExecuteTaintManager
	// This timestamp is to be used instead of LastProbeTime stored in Condition. We do this
	// to avoid the problem with time skew across the cluster.
	now func() metav1.Time
	enterPartialDisruptionFunc func(nodeNum int) float32
	enterFullDisruptionFunc    func(nodeNum int) float32
	computeZoneStateFunc       func(nodeConditions []*v1.NodeCondition) (int, ZoneState)
	knownNodeSet map[string]*v1.Node
	// per Node map storing last observed health together with a local time when it was observed.
	nodeHealthMap *nodeHealthMap
	// evictorLock protects zonePodEvictor and zoneNoExecuteTainter.
	// TODO(#83954): API calls shouldn't be executed under the lock.
	evictorLock sync.Mutex
	nodeEvictionMap *nodeEvictionMap
	// workers that evicts pods from unresponsive nodes.
	zonePodEvictor map[string]*scheduler.RateLimitedTimedQueue
	// workers that are responsible for tainting node