Openshift v4 Infrastructure Day To Day Activities 300 Questions Answers
https://assistedcloud.com/
1. How do you list all nodes in the cluster and check their status?
Command: oc get nodes
Description: oc get nodes lists all registered nodes (control plane and workers).
The STATUS column shows if a node is Ready (healthy and schedulable), NotReady
(unhealthy or unreachable), or Ready,SchedulingDisabled (healthy but cordoned).
Adding -o wide provides additional valuable information, including the node's internal
and external IP addresses, OS image, kernel version, and container runtime version,
giving a richer snapshot of the node's state and configuration.
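The node listing can be filtered with standard shell tools. A minimal sketch (the helper name `unhealthy_nodes` is hypothetical, not part of the oc CLI):

```shell
#!/bin/sh
# Print only nodes whose STATUS is not exactly "Ready"
# (catches NotReady and Ready,SchedulingDisabled nodes).
unhealthy_nodes() {
  # NR>1 skips the header row; $1 is NAME, $2 is STATUS in `oc get nodes` output.
  oc get nodes | awk 'NR>1 && $2 != "Ready" {print $1, $2}'
}
```

Running `unhealthy_nodes` during a morning health check gives a quick shortlist of nodes needing attention.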
2. How can you determine the Kubelet version running on a specific node?
Command: oc get node <node_name> -o jsonpath='{.status.nodeInfo.kubeletVersion}'
Description: The Kubelet is the primary agent running on each node that registers the node with the
API server and manages pods and containers. Knowing its version is crucial for:
Compatibility: Ensuring the node's Kubelet version aligns with the control plane version,
especially during or after upgrades.
Troubleshooting: Identifying if issues might be related to known bugs in a specific
Kubelet version.
This command directly queries the node's status information reported to the API server
and extracts the specific kubeletVersion field.
3. Explain the difference between a node's Capacity and Allocatable resources. How
do you check them?
Command to Check: oc describe node <node_name> | grep -E 'Capacity|Allocatable'
Description:
Capacity: Represents the total amount of resources (CPU, memory, maximum pods,
ephemeral storage) physically present on the node.
Allocatable: Represents the amount of resources available for user pods to consume. It is
calculated by subtracting resources reserved for the operating system, the container
runtime (CRI-O), and the Kubelet itself from the total Capacity.
Why the difference matters: The Kubernetes scheduler uses the Allocatable value when
deciding where to place pods. Understanding this difference is key for accurate capacity
planning and troubleshooting situations where pods won't schedule even if Capacity
seems sufficient. The oc describe node command displays both values clearly.
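Both values can also be pulled side by side with a jsonpath template. A sketch (the function name is hypothetical; the `.status.capacity` and `.status.allocatable` fields are standard Node status fields):

```shell
#!/bin/sh
# Show Capacity vs Allocatable CPU and memory for one node.
node_resources() {
  node="$1"
  oc get node "$node" -o jsonpath='capacity: {.status.capacity.cpu} cpu, {.status.capacity.memory}{"\n"}allocatable: {.status.allocatable.cpu} cpu, {.status.allocatable.memory}{"\n"}'
}

# Usage: node_resources worker-1.example.com
```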
4. How would you check the current CPU and memory usage of a specific node?
Command: oc adm top node <node_name>
Description: This command provides a real-time snapshot of the actual resource consumption on the
specified node. It relies on the metrics-server component being deployed and healthy in the cluster
(which it usually is by default in OCP 4).
It shows CPU usage (in cores/millicores) and memory usage (in bytes, typically MiB or
GiB).
It also shows the percentage of the node's allocatable resources being used.
This is essential for identifying nodes under heavy load or diagnosing performance
bottlenecks.
5. What is the purpose of cordoning a node, and what command achieves this?
Command: oc adm cordon <node_name>
Purpose: Cordoning marks a node as unschedulable. This means the Kubernetes scheduler will not
place any new pods onto this node.
Use Case: It's the first step when preparing a node for maintenance (like patching, hardware
changes, or rebooting). It prevents new work from landing on the node while allowing existing pods
to continue running without disruption until they are deliberately drained or terminate naturally.
6. How do you make a cordoned node schedulable again?
Command: oc adm uncordon <node_name>
Purpose: This command removes the unschedulable taint added by the cordon command.
Use Case: After node maintenance is complete and the node is verified to be healthy, uncordoning
makes it available again for the scheduler to place new pods onto it, bringing it back into full service
within the cluster.
7. Describe the process and command for safely draining a node for maintenance.
What precautions should be taken?
Command: oc adm drain <node_name> --ignore-daemonsets --delete-emptydir-data
Process: Draining automates the process of safely removing workloads before node maintenance. It
performs two main actions:
Cordons the node (marks it unschedulable) so no new pods land on it while the drain runs.
Evicts (gracefully terminates and reschedules) all regular pods running on the node. It
respects PodDisruptionBudgets (PDBs), ensuring application availability isn't compromised
below configured levels.
Command Flags:
--ignore-daemonsets: DaemonSet pods are meant to run on specific (or all) nodes and
are managed differently; they are not evicted by drain. This flag tells drain to proceed
even though DaemonSet pods will remain.
--delete-emptydir-data: Pods using emptyDir volumes will lose their data when evicted
(as emptyDir is tied to the pod lifecycle on that specific node). This flag confirms you
understand and accept this data loss.
Precautions:
PodDisruptionBudgets (PDBs): Ensure critical applications have PDBs configured correctly
before draining. A PDB that's too restrictive (e.g., requiring 100% availability) can block
the drain indefinitely.
Stateful Workloads: Understand how stateful applications handle termination and
rescheduling. Ensure data is persisted correctly (using Persistent Volumes) or that the
application can gracefully handle leader election changes or instance restarts.
Cluster Capacity: Verify there is enough capacity on other nodes to accommodate the
pods being evicted from the drained node.
Drain Timeout: Drains can take time, especially if pods have long termination grace
periods. Monitor the process.
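A typical maintenance run strings cordon, drain, and uncordon together. A minimal sketch (the function name and the 300s timeout are illustrative choices, not required values):

```shell
#!/bin/sh
# Cordon, drain, then (after maintenance) uncordon a node.
node_maintenance() {
  node="$1"
  # Stop new pods landing on the node.
  oc adm cordon "$node"
  # Evict regular pods, respecting PodDisruptionBudgets; bail out after 5 minutes.
  oc adm drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s
  # ... perform the actual maintenance (patching, reboot, etc.) here ...
  # Return the node to service once verified healthy.
  oc adm uncordon "$node"
}

# Usage: node_maintenance worker-1.example.com
```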
8. How can you verify whether a node is currently cordoned?
oc get nodes: Look at the STATUS column. If it includes SchedulingDisabled, the node is
cordoned/unschedulable (e.g., Ready,SchedulingDisabled).
oc describe node <node_name>: Look for the Taints section. A cordoned node will have a
taint like node.kubernetes.io/unschedulable:NoSchedule.
9. Why might you add labels to nodes? Provide the command to add a label.
Command: oc label node <node_name> key=value (e.g., oc label node worker-gpu-1.example.com
accelerator=nvidia-a100)
Purpose: Labels are key-value pairs used to organize and categorize nodes. Common use cases
include:
Targeting Workloads: Using nodeSelector or nodeAffinity in pod specifications to ensure pods
run only on nodes with specific hardware (like GPUs), geographic location (region=east),
environment (env=prod), or specific storage capabilities.
Applying Configurations: Targeting specific nodes for MachineConfigs or Tuned profiles.
Inventory/Grouping: Simply organizing nodes for easier filtering and management (e.g.,
role=infra, project=billing).
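Labels are often applied in bulk. A sketch that labels every worker node (the `role=infra` key/value is an example convention from the list above, not a required name):

```shell
#!/bin/sh
# Apply role=infra to every node carrying the worker role.
label_workers() {
  # -o name yields node/<name>, which oc label accepts directly.
  for n in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
    oc label "$n" role=infra --overwrite
  done
}
```

`--overwrite` makes the loop idempotent: re-running it on already-labeled nodes succeeds instead of erroring.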
10. How do you remove a label from a node using the oc command?
Command: oc label node <node_name> key- (Note the trailing hyphen after the key).
Description: This command removes the label identified by key from the specified node. This is used
when a label is no longer relevant, was applied incorrectly, or the node's role/characteristic has
changed.
11. What are taints and tolerations, and what taint effects exist?
Taint: Applied to a node. It acts as a "repellant" mark, indicating that pods generally
shouldn't schedule there unless they explicitly tolerate the taint.
Toleration: Applied to a pod. It indicates that the pod is "willing" to schedule onto nodes that
have matching taints.
Taint effects:
NoSchedule: No new pods will be scheduled unless they tolerate the taint. Existing pods are
unaffected. (Cordoning uses this.)
PreferNoSchedule: The scheduler will try not to schedule pods without the toleration onto
the node, but it's not a strict requirement.
NoExecute: New pods won't schedule, and existing pods running on the node without the
toleration will be evicted. Often used for node conditions like NotReady or DiskPressure.
Use Case: Dedicating nodes for specific functions (e.g., infra workloads, specific hardware), ensuring
pods only run on appropriate nodes, or automatically evicting pods from unhealthy nodes.
12. How do you add a taint to a node?
Command: oc adm taint node <node_name> key=value:NoSchedule
Description: This command applies a taint with the specified key, value, and the NoSchedule effect.
Any pod wanting to schedule on this node must now have a toleration in its spec for key=value (or
just key if the operator is Exists). This is commonly used to reserve nodes for specific types of
workloads that are configured with the necessary toleration.
13. How do you remove a taint from a node?
Command: oc adm taint node <node_name> key:NoSchedule- (Note the trailing hyphen.)
Description: This removes the taint identified by the specified key and effect. If multiple taints exist
with the same key but different effects, only the one matching the specified effect is removed.
Removing a NoSchedule taint makes the node generally available for scheduling again (assuming no
other taints prevent it).
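The add/remove pair above can be wrapped as a reserve/release workflow. A sketch (the `dedicated=gpu` key/value is a hypothetical example):

```shell
#!/bin/sh
# Reserve a node for workloads that tolerate dedicated=gpu, then release it.
reserve_gpu_node() {
  oc adm taint node "$1" dedicated=gpu:NoSchedule
}
release_gpu_node() {
  # The trailing hyphen removes the taint with this key and effect.
  oc adm taint node "$1" dedicated:NoSchedule-
}

# Usage: reserve_gpu_node worker-gpu-1.example.com
```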
14. What information can you find using oc describe node <node_name>?
Description: This command provides a wealth of detailed information about a node's current state
and configuration as known by the API server. Key sections include:
Conditions: The node's health status (Ready, MemoryPressure, DiskPressure, PIDPressure,
NetworkUnavailable) with reasons and transition times. Crucial for diagnosing NotReady
states.
Capacity & Allocatable: Total vs. available resources (CPU, memory, pods, ephemeral-
storage).
System Info: OS Image, Kernel Version, Kubelet Version, CRI-O Version, Operating System,
Architecture.
Pods: A list of non-terminating pods currently scheduled on the node, including their
resource requests/limits.
Events: A chronological list of recent events related to the node (e.g., health checks,
scheduling decisions, image pulls, volume mounts, reboots detected via Kubelet restarts).
Essential for troubleshooting.
15. How do you find the internal and external IP addresses assigned to a node?
Commands:
oc get node <node_name> -o wide: The INTERNAL-IP and EXTERNAL-IP columns show the
primary addresses.
oc describe node <node_name>: The Addresses section lists all types clearly.
Description: Nodes typically have an internal IP used for cluster communication and may have an
external IP for access from outside the cluster network (common in cloud environments). Knowing
these is vital for networking configuration and troubleshooting.
16. How can you check if the Kubelet on a node is healthy using the API?
Command: oc get --raw /api/v1/nodes/<node_name>/proxy/healthz
Description: This command uses the API server as a proxy to directly access the Kubelet's /healthz
endpoint on the specified node.
If it fails or times out, it indicates a problem with the Kubelet process itself or network
connectivity between the API server and the Kubelet on that node. This is a more direct
health check than just relying on the Ready status, which involves other factors.
17. What is the recommended way to gain shell access to an RHCOS node for
debugging?
Command: oc debug node/<node_name>
Description: This is the standard, supported method in OpenShift 4. It works by:
Launching a temporary privileged debug pod that is scheduled onto the target node.
This pod mounts the node's host filesystem at /host within the pod.
From inside the pod's shell, you can run chroot /host to enter the node's actual filesystem
context.
Why preferred: It doesn't require managing SSH keys for nodes, leverages cluster
authentication/authorization, and provides the necessary privileges within a temporary, managed
container.
18. Once you have shell access via a debug pod, how would you check the node's disk
usage?
Command: chroot /host df -h
Description:
First, you use chroot /host to change the root filesystem context from the debug pod's
filesystem to the node's actual host filesystem (mounted at /host).
Then, you run standard Linux commands like df -h (disk free, human-readable) to see the
usage statistics for the node's mounted filesystems (root partition, /var, etc.). This is crucial
for diagnosing DiskPressure conditions.
19. How would you check the status of the kubelet or crio systemd services on an
RHCOS node?
Commands (inside an oc debug node/... pod, after chroot /host):
systemctl status kubelet
systemctl status crio
Description: After using chroot /host within the debug pod, you can use standard systemctl
commands to interact with the node's systemd services. systemctl status <service> checks if the
service is active (running), enabled, and shows recent log entries, helping diagnose issues if these
core components are not running correctly.
20. How do you identify which node a particular pod is currently scheduled on?
Command: oc get pod <pod_name> -o wide -n <project_name>
Description: The -o wide output format for oc get pod includes a NODE column that explicitly shows
the name of the node where the pod is running. This is essential for correlating pod issues with
potential node problems or for accessing node-specific information related to the pod.
21. How do you list all pods running on a specific node?
Command: oc get pods --all-namespaces -o wide --field-selector spec.nodeName=<node_name>
Description: This command filters the list of pods across all projects (--all-namespaces) to show only
those whose spec.nodeName field matches the specified node. The -o wide output helps see details
like pod IP and readiness status alongside the node name. It's useful for understanding the workload
distribution on a node or identifying all potentially affected pods if a node has issues.
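For a cluster-wide view of workload distribution, the same listing can be aggregated per node. A sketch (the helper name is hypothetical; `status.phase=Running` is a standard pod field selector):

```shell
#!/bin/sh
# Count running pods on each node across all namespaces.
pods_per_node() {
  # With -A -o wide, the NODE column is field 8; NR>1 skips the header.
  oc get pods --all-namespaces -o wide --field-selector status.phase=Running \
    | awk 'NR>1 {count[$8]++} END {for (n in count) print n, count[n]}'
}
```

A heavily skewed count can point at scheduling-constraint problems (taints, labels, resource pressure) on the underloaded nodes.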
22. How can you check the version of the container runtime (CRI-O) on a node?
Command: oc get node <node_name> -o jsonpath='{.status.nodeInfo.containerRuntimeVersion}'
Description: Similar to checking the Kubelet version, this retrieves the specific version of the
container runtime (CRI-O in OpenShift 4) reported by the node. This is useful for checking
compatibility or known issues related to the runtime.
23. When might you need to force delete a pod, and what is the command? What are
the risks?
Command: oc delete pod <pod_name> -n <project_name> --grace-period=0 --force
When Needed: This is a last resort used when a pod is stuck in the Terminating state indefinitely. This
usually happens because the Kubelet on the node cannot successfully stop the container(s) within
the pod, often due to unresponsive processes, storage issues (unmountable volumes), or network
problems preventing cleanup.
Risks:
Bypasses Graceful Shutdown: The container process does not receive a SIGTERM signal
and has no chance to shut down cleanly. This can lead to data corruption or inconsistent
state, especially for stateful applications.
Resource Leaks: Resources held by the pod (like network endpoints or potentially
mounted volumes on the node) might not be cleaned up correctly by the Kubelet,
potentially requiring manual intervention or a node reboot later.
StatefulSet Issues: Force deleting pods managed by a StatefulSet can violate its ordering
and uniqueness guarantees, potentially leading to data inconsistencies or "split-brain"
scenarios if not handled carefully. Use oc delete pod ... --force --grace-period=0 very
cautiously.
24. How are node certificates managed in OpenShift 4, and how can you check their
status?
Management: Node (Kubelet server and client) certificates are managed automatically by the cluster.
1. When a node joins, its Kubelet generates a Certificate Signing Request (CSR).
2. The CSR is approved, either automatically by the cluster machine approver or manually by an
administrator.
3. The signed certificate is issued back to the Kubelet, which uses it for secure communication
with the API server.
4. Certificates have a relatively short lifespan (e.g., 1 year). The Kubelet automatically requests
renewal before expiration, repeating the CSR process.
Checking Status:
Node Conditions: The primary indicator is the Ready condition of the node (oc describe
node <node_name>). Certificate issues often cause the Kubelet to fail communication,
leading to a NotReady status with relevant messages in the conditions or events.
CSRs: You can list CSRs with oc get csr. Look for pending or failed requests related to
nodes (kubernetes.io/kubelet-serving or kubernetes.io/kube-apiserver-client-kubelet).
Usually, approved CSRs are cleaned up quickly.
Kubelet Logs: If a node is NotReady, checking Kubelet logs (oc debug node/..., chroot
/host journalctl -u kubelet) often reveals specific certificate errors (e.g., expired, unable
to request).
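When node CSRs pile up in Pending (for example after adding nodes while the automatic approver was unavailable), they can be approved in a batch. A sketch using the common go-template filter for CSRs that have no status yet; review the `oc get csr` output before approving blindly in production:

```shell
#!/bin/sh
# Approve every CSR that has not yet been approved or denied (no .status set).
approve_pending_csrs() {
  for csr in $(oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}'); do
    oc adm certificate approve "$csr"
  done
}
```

The loop is a no-op when nothing is pending, so it is safe to re-run.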
25. In an IPI environment, how do you link a Node object back to its corresponding
Machine object?
Command: oc get node <node_name> -o
jsonpath='{.metadata.annotations.machine\.openshift\.io/machine}'
Description: In Installer-Provisioned Infrastructure (IPI) or environments where the Machine API
Operator (MAO) manages nodes, each Node object has an annotation
(machine.openshift.io/machine) that stores the name and namespace of the corresponding Machine
custom resource. The Machine object represents the underlying infrastructure instance (e.g., EC2
instance, vSphere VM) and manages its lifecycle. This command extracts that annotation value.
26. How can you determine the MachineSet that manages a specific Machine object?
Command: First, get the Machine name (using the previous question's command if starting from a
Node). Then:
oc get machine <machine_name> -n openshift-machine-api -o
jsonpath='{.metadata.ownerReferences[?(@.kind=="MachineSet")].name}'
Description: Machine objects are typically created and managed by a MachineSet (analogous to how
ReplicaSets manage Pods). The MachineSet defines the template (instance type, image, user data)
and the desired number of replicas for a group of identical machines. A Machine object's metadata
contains an ownerReferences field pointing to its controlling MachineSet. This command filters the
owner references to find the one whose kind is MachineSet and extracts its name.
27. What's a key indicator that a node is managed by the Machine API Operator
(MAO)?
Indicator: The presence of the machine.openshift.io/machine annotation on the Node object.
Command: oc get node <node_name> -o jsonpath='{.metadata.annotations.machine\.openshift\.io/machine}'
Description: If this command returns a value (like openshift-machine-api/worker-LZNQ-machine-0),
it strongly indicates the node's lifecycle is tied to a Machine object, which is managed by the
Machine API Operator. Nodes provisioned manually outside of the Machine API (common in some
UPI scenarios) will typically lack this specific annotation.
28. How do you get a quick overview of overall cluster health via Cluster Operators?
Command: oc get clusteroperators (or the short form oc get co)
Description: This is the primary command for a high-level cluster health check. Cluster Operators are
controllers that manage specific components of OpenShift (like networking, authentication, registry,
etc.). This command lists all operators and their current status (AVAILABLE, PROGRESSING,
DEGRADED, UNKNOWN). A healthy cluster typically shows all operators as AVAILABLE=True,
PROGRESSING=False, and DEGRADED=False.
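The operator list can be reduced to just the problem cases. A sketch (the helper name is hypothetical; the jsonpath filters on the standard Degraded condition in ClusterOperator status):

```shell
#!/bin/sh
# Print the names of cluster operators whose Degraded condition is True.
degraded_operators() {
  oc get clusteroperators -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="Degraded")].status}{"\n"}{end}' \
    | awk '$2 == "True" {print $1}'
}
```

Each name this prints is a candidate for `oc describe co <name>` as the next troubleshooting step.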
29. Explain the meaning of the AVAILABLE, PROGRESSING, and DEGRADED statuses for
a Cluster Operator.
Description: These statuses indicate the operator's ability to manage its component:
AVAILABLE (True): The operator is running and the component it manages (its
operand) is functional and available according to the operator's checks. This is
the desired healthy state.
PROGRESSING (True): The operator is actively working to deploy or update its
managed component to a desired state. This is expected during cluster upgrades,
configuration changes, or initial deployment. It should be temporary; if an
operator stays PROGRESSING for an extended period, it might indicate a
problem.
DEGRADED (True): The operator is encountering errors that prevent it or its
managed component from functioning correctly. The component might be
unavailable or experiencing significant issues. This status requires immediate
investigation as it indicates a problem with a core cluster function.
30. If a Cluster Operator is DEGRADED, what is the first command you would use to
investigate?
Command: oc describe clusteroperator <operator_name> (e.g., oc describe co authentication)
Description: The output of this command includes:
Status Conditions: More granular details about why the operator is in its current
state (e.g., specific error messages, failing checks).
Related Objects: References to the operand resources it manages.
Events: Recent events associated with the operator, which often contain specific
error logs or failure reasons. This helps pinpoint the root cause of the
degradation.
31. How can you find the specific version of a deployed Cluster Operator?
Command: oc get clusteroperator <operator_name> -o
jsonpath='{.status.versions[?(@.name=="operator")].version}'
Description: Cluster Operators manage operands, and both the operator code and the operand code
have versions. This command specifically extracts the version of the operator controller itself from
the operator's status field. Knowing the operator version is useful for checking compatibility or
identifying bugs specific to that operator release.
32. How would you typically find and view the logs for the pods managed by a specific
Cluster Operator (e.g., the authentication operator)?
Process:
Find the Pods: List the operator's pods in its namespace: oc get pods -n
openshift-<operator-name>.
View Logs: Use oc logs with the pod name and namespace: oc logs <operator-
pod-name> -n openshift-<operator-name> [-f] (-f to follow logs).
Description: Operator logs contain detailed information about the operator's actions, decisions, and
any errors encountered while managing its component. This is often necessary for deep
troubleshooting when oc describe co doesn't provide enough detail.
33. What command shows the overall installed version of the OpenShift cluster?
Command: oc get clusterversion
Description: This command queries the ClusterVersion object, which is a singleton resource (version)
that reports the currently installed OpenShift version, the desired version (if an upgrade is in
progress), and the overall status of the cluster version/upgrade.
34. How do you view the cluster's version and upgrade history?
Command: oc get clusterversion version -o jsonpath='{.status.history}'
Description: This extracts the history field from the ClusterVersion object's status. It provides a list of
previous versions the cluster has run, the state (Completed or Partial), and the start/completion
times for each version transition. This is useful for tracking the cluster's upgrade path and identifying
when specific versions were installed.
35. How do you check if updates are available for your cluster?
Command: oc adm upgrade
Description: For clusters with internet connectivity (or access to a mirrored OpenShift Update
Service), this command queries the update service based on the cluster's current version and
configured channel. It reports the current version and lists any available updates (newer versions)
within that channel, along with recommended update paths if applicable.
36. What is an update channel in OpenShift 4, and how do you check the currently
configured channel?
Command to Check: oc get clusterversion version -o jsonpath='{.spec.channel}' (or oc adm upgrade |
grep Channel)
Description: An update channel dictates the stream of OpenShift updates offered to a cluster.
Common channels include:
stable-4.x: Receives generally available (GA) Z-stream (patch) releases for the
specified minor version (e.g., 4.14). Recommended for production.
fast-4.x: Receives GA Z-stream releases slightly earlier than stable. Suitable for
environments wanting faster access to patches.
eus-4.x: Extended Update Support channels, providing patches for specific minor
versions for a longer period (requires EUS subscription).
The channel determines the upgrade path and stability level of the versions presented to
the cluster.
37. How would you assess the health of the core control plane components like the API
server and etcd?
Process:
Check Cluster Operators: The primary method. Check the status of kube-apiserver, etcd,
kube-controller-manager, and kube-scheduler operators using oc get co. Any DEGRADED
or non-AVAILABLE status needs investigation via oc describe co.
Check Component Pods: List pods in the relevant namespaces (e.g., openshift-kube-
apiserver, openshift-etcd, openshift-kube-controller-manager, openshift-kube-scheduler)
to ensure they are Running and haven't restarted frequently. Check their logs if issues
are suspected.
API Responsiveness: Use basic oc commands (like oc get nodes) to gauge API server
responsiveness. Check API server latency metrics in monitoring.
Etcd Health (Specific): Use oc describe co etcd and potentially etcdctl commands (see
next question).
Description: The control plane is the brain of the cluster. Assessing its health involves checking the
operators managing these components, the component pods themselves, and observing overall API
responsiveness.
38. What are the steps to check the health of the etcd cluster specifically?
Steps:
Check Operator: oc get co etcd (and oc describe co etcd if it is not healthy).
Check Pods: oc get pods -n openshift-etcd. Ensure all pods (typically 3 on the masters)
are Running. Check for restarts.
Check Endpoints: oc get endpoints etcd-client -n openshift-etcd -o yaml. Verify it lists
endpoints for all etcd pods.
Check member list: etcdctl member list -w table --cacert ... --cert ... --key ...
Check Metrics: Look at etcd performance metrics in monitoring (disk sync times, leader
changes, proposal failures).
Description: Etcd health is critical as it stores all cluster state. Checking involves verifying the
managing operator, the etcd pods themselves, and potentially using etcdctl for direct member status
and health checks. Performance metrics are also key indicators.
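The etcdctl step can be run from inside one of the etcd pods, which already carry the client certificate environment in OCP 4. A sketch (the `app=etcd` label selector is an assumption about the pod labels; verify with `oc get pods -n openshift-etcd --show-labels`):

```shell
#!/bin/sh
# Run etcdctl inside the first etcd pod to check endpoint health.
etcd_health() {
  pod=$(oc get pods -n openshift-etcd -l app=etcd -o name | head -n 1)
  oc rsh -n openshift-etcd "$pod" etcdctl endpoint health --cluster -w table
}
```

The same pattern works for `etcdctl member list -w table` to confirm all members are started and exactly one is leader.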
39. How do you verify the status and health of the internal container image registry?
Steps:
Check Pods: oc get pods -n openshift-image-registry. Ensure pods are Running and check
logs (oc logs <registry_pod> -n openshift-image-registry) for errors if needed.
Check Storage: If using persistent storage, check the associated PVC status: oc get pvc -n
openshift-image-registry.
(Optional) Test Push/Pull: Try pushing/pulling a test image using podman or docker
logged into the internal registry route (if exposed) or test image pulls within cluster pods.
Description: Ensures the cluster's built-in registry for storing application and S2I images is
operational. Problems here affect builds and deployments.
40. How do you verify the status and health of the cluster's Ingress Controllers?
Steps:
Check Pods: oc get pods -n openshift-ingress. Ensure router pods are Running. Check logs
(oc logs <router_pod> -n openshift-ingress) for errors (e.g., config reload issues,
connection problems).
Check Route Status: Check specific application routes (oc get route <route_name>) for
errors or admission status.
Test External Access: Try accessing an application via its Route URL from outside the
cluster.
Description: Verifies the components responsible for routing external traffic to internal services are
healthy. Problems impact application accessibility.
41. Where can you find the unique Cluster ID for your OpenShift installation?
Command: oc get clusterversion version -o jsonpath='{.spec.clusterID}'
Description: Retrieves the globally unique identifier assigned to the cluster during installation. This
ID is used for various purposes, including telemetry reporting (if enabled) and identifying the cluster
in Red Hat support systems and the OpenShift Cluster Manager portal.
42. How do you check the status of the cluster monitoring stack components
(Prometheus, Grafana, Alertmanager)?
Steps:
Check Pods: List pods in the openshift-monitoring namespace. Specifically check for:
Check UIs: Access the Grafana and Alertmanager routes (oc get routes -n openshift-
monitoring) to ensure they are responsive.
Description: Ensures the core components responsible for metrics collection, alerting, and
visualization are running correctly.
43. If the cluster logging stack is installed, how do you check its overall health?
Steps:
Check Operator: oc get co logging (if using the Red Hat OpenShift Logging operator). If
not healthy, oc describe co logging.
Check Pods: List pods in the openshift-logging namespace. Specifically check for:
Check Elasticsearch Health: Use the curl command (from Q#135 in the previous doc) or
check the Kibana Stack Management UI for cluster health (green/yellow/red).
Check Kibana UI: Access the Kibana route (oc get route kibana -n openshift-logging) and
verify logs are searchable and dashboards load.
Description: Verifies the end-to-end health of the log aggregation pipeline (collection, storage,
visualization).
44. How can you generally determine which operator is responsible for managing a
specific Custom Resource Definition (CRD)?
Methods:
CRD Naming Convention: Often, the CRD's group name hints at the operator (e.g.,
consoles.operator.openshift.io is managed by the console operator,
etcds.operator.openshift.io by the etcd operator).
oc describe crd <crd_name>: While it doesn't explicitly list the operator, the description
and related resources might provide clues.
Operator Descriptions: Check the descriptions of installed operators (oc describe co
<operator_name>) - they sometimes list the CRDs they manage.
Operator YAML: Look at the Cluster Operator's deployment YAML (oc get deployment -n
openshift-<operator-name> -o yaml) - the RBAC rules might show which CRDs it has
permissions for.
Documentation: Operator documentation usually lists the CRDs it introduces and
manages.
Description: Understanding which operator controls a CRD is essential when troubleshooting issues
related to Custom Resources (CRs) based on that CRD.
45. Under what circumstances might you temporarily set a Cluster Operator's
managementState to Unmanaged, and what is the command?
Command: oc patch clusteroperator <operator_name> --type=merge -p '{"spec":
{"managementState": "Unmanaged"}}'
Risks: The operator will no longer ensure its components are in the desired state, enforce
configurations, or perform updates. This can lead to configuration drift, instability, and
prevent future automated management or upgrades until set back to Managed. Never do
this unless explicitly instructed and fully understanding the consequences. To revert: oc
patch clusteroperator <operator_name> --type=merge -p '{"spec": {"managementState":
"Managed"}}'.
46. How do you list all Persistent Volume Claims (PVCs) in a project?
Command: oc get pvc -n <project_name>
Description: This command retrieves all PVC objects within the specified project (namespace). Key
information displayed typically includes the PVC name, STATUS (Bound or Pending), the bound
VOLUME (PV) name, CAPACITY, ACCESS MODES, and STORAGECLASS.
This overview is crucial for understanding application storage requests and their fulfillment status
within a project.
47. What is the difference between a Persistent Volume (PV) and a Persistent Volume
Claim (PVC)?
Description: These two objects form the core of the Kubernetes/OpenShift persistent storage
abstraction:
Persistent Volume (PV): Represents a piece of actual storage in the cluster (e.g., an NFS
share, a vSphere VMDK, an AWS EBS volume, a Ceph RBD volume). PVs are cluster
resources, managed by administrators. They have a lifecycle independent of any
individual pod and contain details about the storage capacity, access modes, and type.
PVs can be provisioned statically (pre-created by an admin) or dynamically (created on-
demand by a StorageClass provisioner).
Persistent Volume Claim (PVC): Represents a request for storage by a user or application
within a project. A PVC specifies the desired storage size, access modes, and optionally a
specific StorageClass. It acts like a voucher that consumes the resources of a matching
PV. Pods mount PVCs, not PVs directly.
Analogy: Think of PVs as the available lockers (storage) in a gym, and PVCs as the request slip a
member (application) uses to get assigned a specific locker that meets their size requirements.
48. How can you check if a PVC is Bound or Pending? What does Pending usually
indicate?
Command: oc get pvc <pvc_name> -n <project_name> (Check the STATUS column).
Description:
Bound: This is the desired state. It means the PVC has successfully found and claimed a
matching PV (either pre-existing or dynamically provisioned). The application can now
use this PVC in its pods.
Pending: This indicates the PVC's request cannot currently be fulfilled. Common reasons
include:
1. No Matching PV: For static provisioning, no available PV meets the
PVC's requirements (size, access modes, labels).
2. Dynamic Provisioning Issues: If using a StorageClass, the provisioner
might be failing (check StorageClass, CSI driver pods), or there might
be insufficient capacity in the underlying storage pool.
3. StorageClass Not Found: The StorageClass specified in the PVC
doesn't exist.
4. Quota Limits: The project might have hit its storage quota limits.
Troubleshooting a Pending PVC usually involves checking oc describe pvc <pvc_name> for
events and verifying the availability and configuration of PVs and StorageClasses.
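Those troubleshooting steps can be gathered in one pass. A sketch (the function name and arguments are hypothetical placeholders):

```shell
#!/bin/sh
# Basic triage for a Pending PVC: events, StorageClasses, unbound PVs.
pvc_triage() {
  ns="$1"; pvc="$2"
  # The Events section usually names the exact provisioning failure.
  oc describe pvc "$pvc" -n "$ns" | sed -n '/Events:/,$p'
  # Does the requested StorageClass exist, and is one marked (default)?
  oc get storageclass
  # Any Available PVs that could satisfy a static claim? (STATUS is field 5.)
  oc get pv | awk 'NR==1 || $5 == "Available"'
}

# Usage: pvc_triage my-project my-claim
```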
49. How do you find the name of the PV that a specific PVC is bound to?
Command: oc get pvc <pvc_name> -n <project_name> -o jsonpath='{.spec.volumeName}'
Description: Once a PVC is Bound, its spec.volumeName field holds the name of the specific PV it has
claimed. This command extracts that field value directly. Knowing the PV name allows you to
investigate the underlying storage volume using oc describe pv <pv_name>.
50. What details about a PVC can you find using oc describe pvc?
Command: oc describe pvc <pvc_name> -n <project_name>
Description: This command provides comprehensive details about a PVC, including its Status
(Bound/Pending), the bound Volume (PV) name, the requested StorageClass, Capacity and Access
Modes, the pods currently mounting it (Used By), and recent Events such as provisioning successes
or failures.
51. How do you list all Persistent Volumes (PVs) in the cluster? What statuses can a PV
have?
Command: oc get pv
Description: Lists all PV objects known to the cluster. Key statuses include:
Available: The PV is ready and has not yet been claimed by any PVC.
Bound: The PV has been successfully claimed by a PVC and is in use.
Released: The PVC that was bound to this PV has been deleted, but the PV itself has not
yet been reclaimed (its fate depends on the persistentVolumeReclaimPolicy). It's not
available for a new PVC yet.
Failed: The PV encountered an error during provisioning or operation.
52. What command provides detailed information about a PV, including its reclaim
policy and source?
Command: oc describe pv <pv_name>
53. How do you determine which PVC, if any, a specific PV is currently bound to?
Command: oc get pv <pv_name> -o jsonpath='{.spec.claimRef}'
Description: A Bound PV has a claimRef field in its spec that references the PVC claiming it. This
command extracts that reference, which includes the PVC's name and namespace. If the PV is not
Bound, this field will likely be null or empty.
54. What is a StorageClass in OpenShift/Kubernetes? How do you list available ones?
Command to List: oc get storageclass or oc get sc
Description: A StorageClass provides a way for administrators to define different "classes" or types of
storage they offer. It acts as a template or blueprint for dynamic provisioning. When a PVC requests a
specific StorageClass, the cluster uses the provisioner defined in that StorageClass to automatically
create a matching PV and the underlying storage volume. Key elements defined in a StorageClass
include the provisioner, its parameters, the reclaimPolicy, and the volumeBindingMode.
55. How can you find out which provisioner is used by a specific StorageClass?
Command: oc get sc <storageclass_name> -o jsonpath='{.provisioner}'
Description: This command directly extracts the provisioner field from the StorageClass definition.
This tells you which storage plugin (internal, or more commonly now, a CSI driver) is responsible for
creating PVs based on this StorageClass. Knowing the provisioner is key to troubleshooting dynamic
provisioning failures, as you'd then check the logs and status of the corresponding provisioner pods.
56. How does OpenShift determine which StorageClass to use if a PVC doesn't specify
one? How do you identify the default?
Mechanism: If a PVC is created without explicitly setting the storageClassName field,
OpenShift/Kubernetes will use the StorageClass marked as the default for the cluster. Only one
StorageClass can be marked as default. If no default is set and the PVC doesn't specify a class,
dynamic provisioning won't occur, and the PVC will only bind to a pre-existing PV that matches its
requirements and doesn't have a StorageClass specified.
Description: The default is designated by the annotation
storageclass.kubernetes.io/is-default-class: "true" on the StorageClass object.
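As a sketch, the default class can be spotted in the list output or extracted by its annotation (the escaped dots are the usual oc/kubectl jsonpath convention for annotation keys):

```shell
# The default StorageClass is marked "(default)" in the list output
oc get storageclass

# Or extract the name directly by filtering on the annotation
oc get sc -o jsonpath='{.items[?(@.metadata.annotations.storageclass\.kubernetes\.io/is-default-class=="true")].metadata.name}'
```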
57. What are Volume Snapshots, and how do you check if VolumeSnapshotClasses are
available?
Command to Check: oc get volumesnapshotclass
Description: Volume Snapshots capture a point-in-time copy of a PVC's data through a CSI driver
that supports the snapshot feature, letting you restore or clone volumes later.
The command oc get volumesnapshotclass lists the available classes, indicating if the
snapshot feature is configured and available for use with corresponding CSI drivers.
58. How would you check the health of the pods belonging to a specific CSI storage
driver?
Process:
Identify Driver Name: Determine the name of the CSI provisioner (e.g., from the
StorageClass using oc get sc <sc_name> -o jsonpath='{.provisioner}').
Find Namespace: CSI drivers usually run their components (controller pods, node
daemonsets) in a dedicated namespace, often openshift-cluster-csi-drivers or a specific
namespace like openshift-storage (for ODF) or one named after the driver.
List Pods: Use oc get pods -n <csi_driver_namespace> and filter using labels associated
with the driver (e.g., app=<driver_name>).
Check Status: Ensure the controller pods and the node daemonset pods are Running
without frequent restarts. Check their logs if issues are suspected.
Description: Container Storage Interface (CSI) drivers are the modern way Kubernetes interacts with
storage systems. They typically consist of controller components (handling provisioning, attaching)
and node components (handling mounting). Checking the health of these pods is crucial for
troubleshooting any storage operation failures (provisioning, attaching, mounting, snapshots).
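The steps above can be sketched as a command sequence; the namespace openshift-cluster-csi-drivers is the common default, while the sidecar container name csi-provisioner is typical but varies per driver:

```shell
# 1. Identify the driver from the StorageClass
oc get sc <sc_name> -o jsonpath='{.provisioner}'

# 2./3. List the driver's controller and node pods
oc get pods -n openshift-cluster-csi-drivers -o wide

# 4. Inspect logs of a controller pod (sidecar name is driver-specific)
oc logs <csi_controller_pod> -n openshift-cluster-csi-drivers -c csi-provisioner
```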
59. A user reports their PVC is stuck in Pending. How would you troubleshoot this?
Troubleshooting Steps:
Check StorageClass: Does the StorageClass referenced by the PVC exist, and is it spelled
correctly? If no class is specified, is a default StorageClass set?
Check Provisioner Health: Check the status of the pods for the CSI driver/provisioner
associated with the StorageClass (see Q58). Look for errors in their logs.
Check PV Availability (Static Provisioning): If not using dynamic provisioning, check if any
Available PVs match the PVC's requirements (size, access modes): oc get pv.
Check Quotas: Does the project have ResourceQuotas defined for storage? Check if limits
have been reached: oc describe resourcequota -n <project_name>.
Check Underlying Storage: Are there issues in the backend storage system itself (e.g.,
pool full, connectivity issues)?
1. Describe PV: oc describe pv <pv_name>. Look at the Events section and the Message
field in the status for error details. This often indicates why it failed (e.g., provisioner
error, invalid configuration, underlying storage issue).
3. Check Underlying Storage: Investigate the storage system directly using its
management tools. Was the volume creation initiated? Did it encounter errors
there?
Integrating Existing Storage: You have pre-existing storage volumes (e.g., NFS exports,
iSCSI LUNs, cloud disks) that you want to make available to the cluster without using a
provisioner.
Unsupported Provisioner: The storage system doesn't have a dynamic provisioner (or CSI
driver) available or configured.
Fine-grained Control: You need absolute control over specific volume parameters or
lifecycle that dynamic provisioning doesn't offer easily.
Process: You create a YAML manifest defining the PV, specifying its capacity, access modes, reclaim
policy, and crucially, the details of the underlying storage source (e.g., NFS server/path, volume IDs).
Then apply it with oc apply -f my-pv.yaml.
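A minimal sketch of such a manifest, assuming an NFS export; the server, path, name, and size are all placeholders:

```shell
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-static-pv              # hypothetical name
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain   # safer for pre-existing data
  nfs:                            # storage source; differs per backend
    server: nfs.example.com
    path: /exports/data
EOF
```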
62. Explain the different Persistent Volume Reclaim Policies (Delete, Retain) and their
implications.
Description: The persistentVolumeReclaimPolicy field on a PV dictates what happens to the
underlying storage volume in the storage system when the PV becomes Released (i.e., after its
bound PVC is deleted).
Delete: When the PVC is deleted, the PV object is deleted, and OpenShift instructs the
storage provisioner (if dynamic) or system admin (if static) to delete the actual storage
volume in the backend (e.g., delete the EBS volume, delete the VMDK file, remove the
NFS directory contents). Data is lost. This is common for dynamically provisioned
volumes where the data lifecycle matches the PVC lifecycle.
Retain: When the PVC is deleted, the PV status changes to Released, but the PV object
and the underlying storage volume are kept. The data remains intact on the volume. An
administrator must manually clean up the PV object (oc delete pv <pv_name>) and
decide what to do with the underlying storage volume (reuse it, delete it manually). This
is safer for critical data, preventing accidental deletion, but requires manual cleanup.
Recycle: (Deprecated) Attempted basic cleanup (rm -rf /volume/*). Not recommended
and often unavailable.
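The policy on an existing PV can be changed in place, for example switching a dynamically provisioned volume to Retain before deleting its PVC (the PV name is a placeholder):

```shell
oc patch pv <pv_name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
```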
63. How do you check the storage capacity defined for a specific PV?
Command: oc get pv <pv_name> -o jsonpath='{.spec.capacity.storage}'
Description: This command extracts the storage value from the spec.capacity field of the PV object,
showing the total size of the storage volume represented by this PV (e.g., 10Gi, 1Ti).
64. How can you identify which running pods are currently using a specific PVC?
Methods:
1. oc describe pvc <pvc_name> -n <project_name>: Look for the Mounted By field near
the top. It lists the names of the pods currently mounting this claim.
2. (Manual/Scripted): List all pods in the namespace (oc get pods -n <project_name> -o
yaml or -o json) and inspect the spec.volumes section of each pod definition. Look
for volumes of type persistentVolumeClaim where the claimName matches the PVC
you're interested in.
Description: This is important for understanding which applications depend on a specific piece of
storage, especially before attempting to delete a PVC or perform maintenance that might affect the
volume.
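The manual method can be scripted with a jsonpath loop rather than reading full YAML; a sketch, with the grep filtering on the PVC name:

```shell
# Print each pod followed by the PVC claim names it mounts, then filter
oc get pods -n <project_name> \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.volumes[*].persistentVolumeClaim.claimName}{"\n"}{end}' \
  | grep <pvc_name>
```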
Description: The Cluster Network Operator is responsible for deploying and managing the cluster's
core networking components (like OpenShift SDN or OVN-Kubernetes). This command provides a
quick health check. Look for AVAILABLE=True, PROGRESSING=False, DEGRADED=False. If the status is
not healthy, use oc describe co network to get detailed error messages and events.
66. How can you identify whether the cluster is using OpenShift SDN or OVN-
Kubernetes as its CNI plugin?
Command: oc get network.config.openshift.io cluster -o jsonpath='{.spec.networkType}'
Description: OpenShift 4 supports different network plugins (CNIs). This command queries the
cluster-wide network configuration object and extracts the networkType field, which will explicitly
state either OpenShiftSDN or OVNKubernetes. Knowing the CNI plugin is crucial as configuration,
features, and troubleshooting steps differ between them.
67. How would you check the status of the main SDN/OVN pods running on the cluster
nodes?
Commands:
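The commands depend on the CNI in use; a sketch for the two standard namespaces (exact pod names vary slightly by OpenShift version):

```shell
# OVN-Kubernetes: ovnkube node and control-plane pods
oc get pods -n openshift-ovn-kubernetes -o wide

# OpenShift SDN: sdn and ovs DaemonSet pods
oc get pods -n openshift-sdn -o wide
```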
Description: A Kubernetes Service is an abstraction that defines a logical set of Pods (usually
determined by a label selector) and a policy by which to access them. Services provide a stable
endpoint (ClusterIP, NodePort, or LoadBalancer IP) for accessing pods, even as pods are created,
destroyed, or rescheduled. They act as an internal load balancer and service discovery mechanism
within the cluster. Listing services shows these stable endpoints available within a project.
ClusterIP: (Default type) Exposes the Service on an internal IP address within the cluster.
This IP is only reachable from within the cluster. This is the most common type for
internal service-to-service communication.
NodePort: Exposes the Service on each Node's IP address at a static port (the NodePort).
A ClusterIP Service (to which the NodePort routes) is automatically created. This allows
external traffic to reach the Service by accessing <NodeIP>:<NodePort>. It's often used
as a building block for external load balancers or for direct access during
development/testing, but less common for production external access due to node IP
management challenges.
LoadBalancer: Exposes the Service externally using a cloud provider's load balancer (e.g.,
AWS ELB, Azure Load Balancer, GCP Load Balancer) or an on-premise solution like
MetalLB. The cloud provider (or MetalLB) creates a load balancer, which then directs
traffic to the Service's NodePorts (which are automatically created, along with the
ClusterIP). This is the standard way to expose services directly to the internet in
supported cloud/on-prem environments. The external IP address of the load balancer is
populated in the Service's status.
Description: This command extracts the stable internal IP address assigned to the Service. Pods
within the cluster can use this IP (along with the service port) to reliably connect to the pods backing
the Service. If the value is None, it might be a Headless Service.
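The extraction being described is presumably the standard jsonpath query (service and project names are placeholders):

```shell
oc get svc <service_name> -n <project_name> -o jsonpath='{.spec.clusterIP}'
```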
71. What are Service Endpoints, and how do you check them for a specific Service?
Command to Check: oc get endpoints <service_name> -n <project_name> or oc get ep
<service_name> -n <project_name>
Description: An Endpoints object holds the list of actual IP addresses and ports of the healthy Pods
that match the Service's label selector. When a connection is made to a Service's ClusterIP, kube-
proxy (or OVN) uses the information in the Endpoints object to route the traffic to one of the listed
pod IPs. Checking endpoints is crucial for verifying that a Service is correctly selecting healthy
backend pods.
72. What does it mean if a Service has no endpoints listed? How would you
troubleshoot?
Meaning: It means the Service selector is not matching any currently running and ready
pods. Traffic sent to the Service's ClusterIP will fail because there are no backend pods to
route to.
Troubleshooting Steps:
1. Check the Service Selector: Confirm which labels the Service selects on: oc get svc
<service_name> -n <project_name> -o jsonpath='{.spec.selector}'.
2. Check Pod Labels: List pods intended to be part of the service: oc get pods -n
<project_name> -l <key>=<value> (using the selector labels). Do any pods exist with
exactly these labels? Check for typos.
3. Check Pod Status: Are the matching pods Running? Are they Ready? Services only
include pods that are marked as ready (i.e., passing their readiness probes). Use oc
get pods <pod_name> -o wide and oc describe pod <pod_name> to check status and
readiness probe results.
4. Check Pod Namespace: Ensure the pods and the Service are in the same namespace.
5. Check Readiness Probes: If pods are Running but not Ready, investigate why their
readiness probes are failing (oc describe pod, oc logs).
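The selector-versus-labels comparison can be sketched as (all names are placeholders):

```shell
# 1. What selector does the Service use?
oc get svc <service_name> -n <project_name> -o jsonpath='{.spec.selector}'

# 2. Which pods actually carry those labels, and are they Ready?
oc get pods -n <project_name> -l <key>=<value> -o wide

# 3. Does the Endpoints object now list pod IPs?
oc get endpoints <service_name> -n <project_name>
```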
73. What is an OpenShift Route? How does it differ from a Kubernetes Ingress object?
Description: A Route is an OpenShift-specific API object that exposes a Service at an externally
reachable hostname, with native support for edge, passthrough, and re-encrypt TLS termination.
Key Differences: Routes are tightly integrated with the OpenShift Ingress Controller and offer more
built-in features out-of-the-box compared to the base Kubernetes Ingress specification, which relies
more heavily on the capabilities of the specific controller implementation being used. OpenShift can
automatically generate hostnames for Routes based on the cluster's ingress domain.
Description: This command shows all the Route objects defined in the specified project, listing their
names, assigned hostnames, the services they point to, ports, and TLS termination status.
75. How do you find the publicly accessible hostname generated for a Route?
Command: oc get route <route_name> -n <project_name> -o jsonpath='{.spec.host}'
Description: This extracts the host field from the Route's specification. This is the DNS hostname that
external clients use to access the application exposed by this Route. DNS must be configured (often
automatically via wildcard DNS for the *.apps domain) to point this hostname to the OpenShift
router's public IP address.
76. How can you tell which Service a particular Route is directing traffic towards?
Command: oc get route <route_name> -n <project_name> -o jsonpath='{.spec.to.name}'
Description: A Route must target an internal Kubernetes Service. This command extracts the name of
the target Service (spec.to.name) from the Route definition, showing where the Ingress Controller
will forward incoming requests that match the Route's host/path.
77. How do you check the status of the Ingress Controller (router) pods?
Commands:
Description: The Ingress Controller runs as regular pods (typically managed by a Deployment) within
the openshift-ingress namespace. Checking these pods ensures the router instances responsible for
handling Route traffic are running and healthy. Checking the ingress Cluster Operator verifies the
overall health of the ingress subsystem.
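The commands in question are presumably along these lines:

```shell
# The router pods themselves
oc get pods -n openshift-ingress

# Overall ingress subsystem health
oc get co ingress
oc get ingresscontroller default -n openshift-ingress-operator -o yaml
```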
78. How would you typically find the public IP address used by the OpenShift
router/Ingress Controller?
Method: This depends on how the router Service is exposed, which varies by platform:
Cloud Provider (AWS, Azure, GCP etc.): The router service is usually of type LoadBalancer.
Find its external IP: oc get svc router-default -n openshift-ingress -o
jsonpath='{.status.loadBalancer.ingress[0].ip}' or
{.status.loadBalancer.ingress[0].hostname}.
Bare Metal (with MetalLB): Similar to cloud providers, check the LoadBalancer service: oc
get svc router-default -n openshift-ingress. The external IP will be assigned from a
MetalLB pool.
vSphere/Other UPI: Often uses NodePort or external HAProxy/F5. You might check the
NodePort service (oc get svc router-default -n openshift-ingress) and then find the public
IPs of the worker nodes designated for ingress traffic, or check the configuration of the
external load balancer VIP.
Description: This IP address is the external entry point for all Route traffic. DNS records for Route
hostnames must resolve to this IP (or the IPs of the load balancer/nodes).
Function: By default, all pods within a project can communicate with each other. Network Policies
allow administrators to define rules specifying which pods (based on labels) are allowed to connect
to other pods, or which pods are allowed to receive incoming connections from specific sources
(other pods, namespaces, or IP blocks) on particular ports/protocols. They are crucial for
implementing security principles like zero-trust networking and least privilege.
80. How do you list all Network Policies applied within a project?
Command: oc get networkpolicy -n <project_name> or oc get netpol -n <project_name>
Description: This command shows all the NetworkPolicy objects currently defined within the
specified project, giving an overview of the network segmentation rules in place.
81. How can you view the specific rules (selectors, ingress/egress rules) defined in a
Network Policy?
Commands: oc describe networkpolicy <policy_name> -n <project_name> or oc get networkpolicy
<policy_name> -n <project_name> -o yaml
Description: describe gives a readable summary of the pod selector and the ingress/egress rules,
while the YAML output shows the full definition of the policy.
82. How would you implement a "default deny" network stance for a project?
Method: Apply a Network Policy that selects all pods in the namespace but allows no ingress traffic.
Example YAML:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {} # Selects all pods in the namespace
  policyTypes:
  - Ingress # Applies only to ingress rules
  # Implicitly denies all ingress because the 'ingress' list is empty or omitted
Description: This policy selects every pod (podSelector: {}) and specifies it applies to Ingress. By not
defining any ingress rules, it effectively blocks all incoming traffic to all pods from any source (within
or outside the namespace), unless allowed by other more specific Network Policies. You would then
create additional policies to explicitly allow necessary traffic (e.g., allow ingress from the router,
allow ingress from specific app tiers).
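For example, a follow-up policy commonly used to re-allow router traffic; the namespaceSelector label below matches the default ingress controller endpoint strategy and differs if the router runs on the host network:

```shell
cat <<'EOF' | oc apply -n <project_name> -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-openshift-ingress
spec:
  podSelector: {}          # all pods in the namespace
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          network.openshift.io/policy-group: ingress
  policyTypes:
  - Ingress
EOF
```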
83. Describe how you would test network connectivity between two specific pods in
different projects.
Process:
1. Identify Pod IPs: Get the IP addresses of the source and destination pods: oc get pod
<pod_name> -n <namespace> -o wide.
2. Exec into Source Pod: Start an interactive shell in the source pod: oc exec
<source_pod_name> -n <source_namespace> -it -- /bin/bash (or /bin/sh).
3. Install Test Tools (if needed): The base container image might not have tools like
ping, curl, or telnet. You might need to install them temporarily (e.g., yum install
iputils curl telnet on UBI) if possible, or use a debug container with these tools.
4. Test Connectivity: From the source pod, attempt to reach the destination pod's IP
and port (e.g., curl or ping against the destination pod IP).
5. Check Network Policies: If connectivity fails, check Network Policies in both the
source and destination namespaces. Ensure an egress policy in the source
namespace allows traffic to the destination pod/namespace/IP, AND an ingress
policy in the destination namespace allows traffic from the source
pod/namespace/IP on the required port.
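The exec-and-test steps can be collapsed into one-shot commands (pod names, IP, and port are placeholders):

```shell
# HTTP check from the source pod to the destination pod IP (5-second timeout)
oc exec <source_pod_name> -n <source_namespace> -- curl -s -m 5 http://<destination_pod_ip>:<port>/

# Or a bare ICMP reachability probe if the image has ping
oc exec <source_pod_name> -n <source_namespace> -- ping -c 3 <destination_pod_ip>
```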
84. What is an Egress IP in OpenShift, and how would you check if one is configured for
a project?
Description: Egress IP allows you to assign a specific, predictable source IP address to traffic
originating from pods within one or more designated projects when that traffic leaves the OpenShift
cluster network (e.g., goes to the internet or legacy systems). This is often required by external
firewalls that filter based on source IP. OpenShift automatically configures routing on the node
hosting the Egress IP to NAT the outgoing traffic. High availability can be configured.
Commands to Check:
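The commands differ by CNI; a sketch for both:

```shell
# OVN-Kubernetes: EgressIP is a cluster-scoped custom resource
oc get egressip

# OpenShift SDN: egress IPs live on the project's NetNamespace and on HostSubnets
oc get netnamespace <project_name> -o jsonpath='{.egressIPs}'
oc get hostsubnet
```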
85. How can you check the configured MTU for the cluster network?
Command (OVN-Kubernetes): oc get network.config.openshift.io cluster -o
jsonpath='{.status.networkType}{"\n"}{.status.clusterNetworkMTU}'
Command (OpenShift SDN): Check the network.config.openshift.io object or the configuration of the
openshift-sdn pods/operator. Often inferred or set via CNO config.
Description: Checks the Maximum Transmission Unit (packet size) configured for the pod network
overlay. Mismatches between the overlay MTU and the underlying physical network MTU can cause
packet fragmentation or loss, leading to performance degradation or connectivity failures.
86. How do you find the defined CIDR block for the Cluster Network (pod network)?
Command: oc get network.config.openshift.io cluster -o jsonpath='{.spec.clusterNetwork[*].cidr}'
Description: Retrieves the IP address range(s) from which pods within the cluster are assigned their
IP addresses.
87. How do you find the defined CIDR block for the Service Network (ClusterIP range)?
Command: oc get network.config.openshift.io cluster -o jsonpath='{.spec.serviceNetwork}'
Description: Retrieves the IP address range from which Services of type ClusterIP are assigned their
virtual IP addresses.
88. If using Multus for multiple networks, how would you check the status of its
components?
Process:
1. Check CNO: The Cluster Network Operator manages Multus deployment. Check oc
get co network.
2. Check DaemonSet: Multus typically runs as a DaemonSet to install the Multus CNI
binary on nodes. Check pods in openshift-multus or kube-system (depending on
version/config): oc get pods -n openshift-multus or oc get ds -n openshift-multus.
3. Check NetworkAttachmentDefinitions: List the custom resources that define the
additional networks: oc get network-attachment-definitions -A or oc get net-attach-
def -A.
Description: Multus allows pods to connect to multiple networks simultaneously (e.g., the default
pod network plus an SR-IOV network). Checking involves verifying the Multus plugin deployment and
the definitions of the additional networks.
89. How would you view the logs for the core OVN or SDN components on a specific
node?
Method: Use oc debug node/<node_name> to get shell access, then use journalctl.
Description: Accessing the detailed logs of the node-level networking agents (OVN controller, OVS, or
SDN agent) is essential for deep troubleshooting of pod networking issues, policy enforcement
problems, or overlay network failures on that specific node.
90. How can you test DNS resolution using the cluster's internal DNS service from
within a pod?
Method: Exec into a pod and use a DNS lookup tool (dig, nslookup) directed specifically at the cluster
DNS service IP.
1. Find DNS Service IP: CLUSTER_DNS_IP=$(oc get svc -n openshift-dns dns-default -o
jsonpath='{.spec.clusterIP}') (Run this outside the pod or pass it in).
Description: This bypasses the pod's local /etc/resolv.conf settings and directly queries the CoreDNS
service responsible for internal cluster name resolution. It helps isolate whether a DNS issue lies with
the pod's configuration or the central DNS service itself. Ensure the pod's container image has dig or
nslookup installed (often in bind-utils or dnsutils packages).
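Putting the method together; the service name dns-default and namespace openshift-dns are standard, while the pod name and lookup target are placeholders:

```shell
# 1. Find the cluster DNS Service IP (run outside the pod)
CLUSTER_DNS_IP=$(oc get svc dns-default -n openshift-dns -o jsonpath='{.spec.clusterIP}')

# 2. Query CoreDNS directly from inside a pod, bypassing /etc/resolv.conf
oc exec <pod_name> -n <project_name> -- dig @"${CLUSTER_DNS_IP}" kubernetes.default.svc.cluster.local +short
```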
Standard SCCs:
restricted (and its newer variant restricted-v2): The default SCC applied to most
user workloads; it blocks privileged containers, host access, and running as root.
nonroot (and its newer variant nonroot-v2): Requires pods to run with a non-root UID, but is
slightly less restrictive than restricted in other areas.
anyuid: Allows pods to run with any UID (including root/UID 0), but still restricts
other privileged settings.
privileged: The least restrictive SCC, granting almost all capabilities, including
running privileged containers and accessing the host filesystem/network. Access
is tightly controlled and usually reserved for cluster infrastructure pods.
Description: Listing SCCs with oc get scc shows all the Security Context Constraint objects defined in
the cluster, with their names and some basic settings (like whether privileged containers are allowed,
default add/drop capabilities). It provides an overview of the different security profiles available.
93. How can you view the specific permissions and settings defined within an SCC like
restricted?
Command: oc describe scc restricted or oc get scc restricted -o yaml
Description:
describe gives a human-readable summary of the SCC's settings.
get ... -o yaml provides the full YAML definition, showing the precise
configuration of every field within the SCC object. This is useful for
understanding the exact constraints it enforces.
94. How do you determine which SCCs a specific service account is allowed to use?
Command: oc adm policy scc-subject-review -z <service_account_name> -n <project_name>
Description: This command checks the RBAC permissions (Roles/ClusterRoles bound to the service
account and its groups) and determines which SCCs the specified service account (-z <name>) in the
given namespace (-n <namespace>) is authorized to use. OpenShift will try to validate a pod against
the allowed SCCs in order of priority (usually most restrictive first).
Alternatively, you can check which users/groups can use a specific SCC: oc adm policy who-can use
scc <scc_name>.
95. What is the command to grant a service account access to a specific SCC? Why
should this be done cautiously?
Command: oc adm policy add-scc-to-user <scc_name> -z <service_account_name> -n
<project_name>
Description: This command directly binds an SCC to a specific service account within a namespace.
Caution: Granting access to less restrictive SCCs (like anyuid, hostaccess, or especially privileged)
significantly increases the potential security risk if a pod running under that service account is
compromised. It bypasses many default security protections. This should only be done when
absolutely necessary for the application's function and after carefully evaluating the security
implications. Always grant the least permissive SCC that meets the pod's requirements.
96. Explain the concept of Role-Based Access Control (RBAC) in OpenShift/Kubernetes.
Concept: RBAC is the standard mechanism for controlling who (Users, Groups, Service Accounts -
called "Subjects") can perform what actions (Verbs like get, list, create, delete, patch) on which
resources (like pods, deployments, secrets, nodes) within the cluster or specific projects
(namespaces).
Role: Contains rules that grant permissions within a specific namespace. A Role
can only grant access to namespaced resources (like pods, deployments, secrets
within its namespace). It cannot grant access to cluster-scoped resources (like
nodes, clusterroles, sccs) or resources in other namespaces.
ClusterRole: Contains rules that can grant permissions cluster-wide. It can grant
access to namespaced resources across all namespaces, cluster-scoped
resources (like nodes, persistentvolumes, clusterroles, sccs), or non-resource
URLs (/healthz, /version).
99. How do you list all Roles defined within a specific project?
Command: oc get roles -n <project_name>
Description: This command retrieves all the Role objects that exist within the specified namespace
(<project_name>). These define sets of permissions scoped to that project.
Description: This command retrieves all ClusterRole objects defined cluster-wide. This includes
default roles (like cluster-admin, admin, edit, view) and any custom cluster roles created by
administrators or operators.
101. How can you inspect the specific API permissions (verbs, resources) granted by a Role
or ClusterRole?
Command: oc describe role <role_name> -n <project_name> or oc describe clusterrole
<clusterrole_name>
Description: The describe command provides a human-readable summary of the rules section within
the Role or ClusterRole. It lists the allowed API Resources (like pods, services, nodes), Non-Resource
URLs, and the permitted Verbs (like get, list, watch, create, update, patch, delete) for each. This
clearly shows what actions the role allows.
102. How do you check which users, groups, or service accounts are bound to a specific
Role within a project?
Command: oc get rolebinding -n <project_name> -o wide (Inspect bindings referencing the role) or
oc describe rolebinding <rolebinding_name> -n <project_name>
Description: You need to look at RoleBinding objects within the project.
List all bindings (oc get rolebindings -n <project_name>) and find the ones where
the ROLE column matches the Role you're interested in. The USER, GROUP, and
SERVICE ACCOUNT columns (in -o wide or describe) show the bound subjects.
Alternatively, if you know the binding name, describe it to see the Role it
references and the Subjects it applies to.
103. How do you check which subjects are bound to a specific ClusterRole cluster-wide?
Command: oc get clusterrolebinding -o wide (Inspect bindings referencing the ClusterRole) or oc
describe clusterrolebinding <clusterrolebinding_name>
Description: Similar to RoleBindings, you list ClusterRoleBinding objects (oc get clusterrolebindings)
and find those referencing the ClusterRole in question (check the ROLE column or roleRef field in
describe or YAML output). The Subjects section of the binding shows the users, groups, or service
accounts granted those cluster-wide permissions.
104. How can you verify if a particular user has the permission to perform a specific action
(e.g., delete pods) in a certain project?
Command: oc auth can-i <verb> <resource> -n <project_name> --as <user_name> (e.g., oc auth can-i
delete pods -n my-app-dev --as john.doe)
Description: This is a direct authorization check. It simulates the action request as the specified user
(--as) and tells you (yes or no) if their combined RBAC permissions allow them to perform that
specific verb on that resource within the given namespace. It's very useful for quickly verifying
permissions without needing to trace through all role bindings.
105. What command grants a user the standard edit role within a project?
Command: oc policy add-role-to-user edit <user_name> -n <project_name>
Description: This command creates or updates a RoleBinding within the specified project (-n). It
binds the default edit ClusterRole (which allows modifying most standard application resources but
not RBAC rules or quotas) to the specified user (<user_name>). This is a common way to give
developers permissions to manage their applications within a project.
Description: This command finds the RoleBinding that grants the specified <role_name> to the
specified <user_name> within the project (-n) and removes that user from the binding's subjects list.
If the user was the only subject, the binding might be deleted. This effectively revokes those specific
permissions from the user within that project.
107. What is a Kubernetes Secret used for? How do you list secrets in a project?
Command to List: oc get secrets -n <project_name>
Description: A Secret is a Kubernetes object designed to store small amounts of sensitive data, such
as passwords, OAuth tokens, SSH keys, TLS certificates, or API keys. Storing this information in Secrets
is more secure and flexible than hardcoding it into pod definitions or container images. Secrets are
stored (by default) base64 encoded in etcd, and potentially encrypted at rest if etcd encryption is
enabled. Pods can access secrets as mounted volumes or environment variables. Listing secrets
shows the available secret objects within a project.
108. How can you view the decoded data stored within a Secret? What precautions are
needed?
Command: oc get secret <secret_name> -n <project_name> -o jsonpath='{.data}' | jq
'map_values(@base64d)' (Requires jq utility)
Alternatively, get the YAML (-o yaml), copy a base64 encoded value, and decode
it manually: echo "<base64_encoded_value>" | base64 --decode
Precautions: Secret data is only base64 encoded, not encrypted, so anyone with read access to the
Secret (or to etcd backups) can decode it. Avoid echoing decoded values into shared terminals, logs,
or shell history, and restrict read access to Secrets via RBAC.
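Since the data values are plain base64 rather than encryption, a round trip is easy to demonstrate locally; the sample string is arbitrary:

```shell
# Encode and decode a sample value the way Secret data is stored
encoded=$(echo -n 's3cr3t' | base64)
echo "$encoded"                      # czNjcjN0
echo "$encoded" | base64 --decode    # s3cr3t
```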
Description: A Service Account provides an identity for processes running inside pods to interact with
the Kubernetes API server or external services. When a pod needs to talk to the API (e.g., to list other
pods, modify resources), it authenticates using the token associated with its Service Account. Each
namespace has a default service account, but it's best practice to create dedicated service accounts
for applications with specific RBAC permissions assigned (principle of least privilege).
110. How would you check the expiration date and issuer of the cluster's API server
certificate?
Command: echo | openssl s_client -connect $(oc whoami --show-server | sed 's|https://||')
2>/dev/null | openssl x509 -noout -text | grep -E 'Issuer:|Not After'
Description: This command connects to the Kubernetes API server's secure endpoint (the host:port
reported by oc whoami --show-server, typically port 6443), retrieves its TLS certificate using openssl
s_client, pipes the certificate details to openssl x509 for parsing, and then filters the output to show
the Issuer (who signed the certificate, often an internal CA) and the Not After field (the expiration
date). Monitoring certificate expiration is crucial for cluster stability.
111. How would you check the expiration date and issuer of the default Ingress (router)
certificate?
Command: echo | openssl s_client -connect $(oc get route console -n openshift-console -o
jsonpath='{.spec.host}'):443 -servername $(oc get route console -n openshift-
console -o jsonpath='{.spec.host}') 2>/dev/null | openssl x509 -noout -text | grep -E 'Issuer:|Not
After' (Uses the console route as an example hostname in the apps domain)
Description: This command takes the hostname of a known route (like the console route) in the
*.apps domain, connects to the Ingress Controller (router) on port 443 using that hostname
(important for SNI), retrieves the TLS certificate presented by the router, and displays
its Issuer and expiration date (Not After). This checks the validity of the certificate securing external
application access via Routes.
112. What are Certificate Signing Requests (CSRs) used for in OpenShift, and how do you
list them?
Command to List: oc get csr
Description: CSRs are the mechanism by which clients (primarily Kubelets on nodes) request TLS
certificates from the cluster's internal Certificate Authority (managed by the kube-controller-
manager). When a new node joins or an existing node needs to renew its certificate, its Kubelet
creates a CSR object. The cluster then validates and approves (usually automatically for nodes) the
CSR, and the certificate is issued. Listing CSRs shows pending, approved, or denied requests.
113. Under what circumstances might you need to manually approve a CSR? What is the
command?
Command: oc adm certificate approve <csr_name>
Circumstances: Manual approval is generally not required for node Kubelet certificates in a standard
OCP 4 installation, as this is handled by automated approvers. However, you might need manual
approval if:
You are adding nodes in a user-provisioned infrastructure (UPI) installation, where the
bootstrap and serving CSRs for new nodes must be approved by hand.
The cluster was shut down or unreachable long enough for node certificates to expire,
leaving pending CSRs after restart.
The automated approver itself is unhealthy and CSRs accumulate in the Pending state.
114. How can you check the configured audit policy for the Kubernetes API server?
Method: Audit configuration is part of the API server's configuration, managed by the kube-
apiserver-operator.
For detailed policy: The actual policy file might be referenced in the operator config or
mounted directly into the API server pods. You might need to inspect the kube-apiserver-
operator config or the static pod manifest on master nodes (oc debug node/...) to find the
exact policy file path and content if a custom policy is used.
Description: Audit logging records actions performed against the Kubernetes API. The audit
policy defines what events are logged (e.g., metadata only, requests, responses) and at what
level (e.g., metadata, request, requestResponse). Checking the policy helps understand the
scope and detail of audit logging.
115. How would you typically find the location of API server audit logs on the master
nodes?
Method: Requires access to the master nodes, usually via oc debug node/<master_node_name>.
1. chroot /host
2. ls /var/log/kube-apiserver/ (audit logs such as audit.log are written here by default;
oc adm node-logs <master_node_name> --path=kube-apiserver/ can list them without a debug shell)
Description: Audit logs contain sensitive records of API activity. Finding their location on the master
nodes (or the webhook configuration) is necessary for security analysis, compliance checks, or
detailed troubleshooting.
116. How do you list the Identity Providers (IDPs) configured for cluster authentication?
Command: oc get oauth cluster -o jsonpath='{.spec.identityProviders}'
Description: This command queries the central OAuth configuration object and extracts the list of
configured IDPs. This shows how users can log in to the cluster (e.g., htpasswd, ldap, github, oidc).
Each entry in the list contains the name and configuration details for that specific IDP.
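For reference, a minimal OAuth configuration with a single HTPasswd IDP looks roughly like this (the provider name local-users and the secret name htpass-secret are illustrative):

```yaml
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: local-users          # example provider name, shown on the login page
    mappingMethod: claim
    type: HTPasswd
    htpasswd:
      fileData:
        name: htpass-secret    # Secret in openshift-config holding the htpasswd file
```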
117. How do you access the Grafana dashboards provided by the monitoring stack?
Description: OpenShift includes a pre-configured Grafana instance providing dashboards for
visualizing cluster and node metrics collected by Prometheus. This command finds the Route
exposing the Grafana web UI. You access this URL in a browser and typically log in using your
OpenShift credentials (via OAuth integration).
118. What is Alertmanager, and how do you access its UI?
Description: Alertmanager is a component of the monitoring stack responsible for handling alerts
sent by Prometheus. It deduplicates, groups, and routes alerts to configured receivers (like email,
Slack, PagerDuty). It also manages silencing (muting) alerts. The UI allows you to view currently firing
alerts, check receiver configurations, and manage silences. Accessing its Route URL provides access
to this UI.
119. How can you check which alerts are currently firing in the cluster?
Methods:
Alertmanager UI: Access the Alertmanager UI (see previous question). The main
page displays currently active (firing) alerts.
Description: Identifying active alerts is crucial for proactive cluster management. Alertmanager is the
primary tool for viewing and managing these active alerts.
120. Explain how you would temporarily silence a specific, known alert.
Method: Use the Alertmanager UI.
Process:
Locate the firing alert (or define matchers for its labels) and choose "Silence".
Set a duration for the silence (e.g., 1 hour, 2 days). Add a comment explaining
the reason.
Click "Create".
Description: Silencing temporarily stops Alertmanager from sending notifications for alerts matching
specific criteria. This is useful during planned maintenance, for known issues being addressed, or to
reduce noise from flapping alerts while investigating the root cause. Silences are temporary and
expire automatically.
121. How do you check the status of the core Prometheus pods responsible for cluster
monitoring?
Command: oc get pods -n openshift-monitoring -l app.kubernetes.io/name=prometheus
Description: The core cluster monitoring relies on a highly available Prometheus deployment
(typically 2 replicas: prometheus-k8s-0 and prometheus-k8s-1). This command lists these specific
pods within the openshift-monitoring namespace. Check that they are Running and have minimal
restarts. These pods scrape metrics, evaluate alerting rules, and store time-series data.
122. How do you check the status of the Alertmanager pods?
Description: Alertmanager typically runs as a StatefulSet (e.g., alertmanager-main-0, -1, -2) for high
availability. This command lists the Alertmanager pods. Ensure they are Running and stable.
Problems here can prevent alert notifications from being delivered.
123. How do you check the status of the Grafana pod?
Description: This command lists the pod(s) running the Grafana web UI and backend. Ensure it's
Running to allow users access to monitoring dashboards.
124. What is user workload monitoring, and how would you check the status of its
components if enabled?
Description: User Workload Monitoring (UWM) is an optional feature in OpenShift that allows
developers and application owners to monitor their own applications within their projects using the
same Prometheus-based stack used for core cluster monitoring. It deploys a separate Prometheus
instance (prometheus-user-workload) that discovers and scrapes metrics from user-defined
ServiceMonitor and PodMonitor resources within allowed namespaces.
Details: This command lists the pods specific to UWM, primarily the prometheus-user-workload-*
pods and potentially thanos-ruler-user-workload-* pods if configured. Check that these are Running.
125. How can you verify if user workload monitoring is enabled for the cluster?
Method: Check the cluster monitoring configuration ConfigMap.
Description: Look inside the data.config.yaml section of this ConfigMap for a setting like
enableUserWorkload: true. If this key exists and is set to true, UWM is enabled. The presence and
health of the openshift-user-workload-monitoring namespace and its pods is also a strong indicator.
126. Describe how you could query a specific metric directly from the cluster's Prometheus
instance.
Method: Use the Prometheus web UI via port-forwarding or a Route (if exposed).
Process:
Expose Prometheus:
Access UI: Open the Prometheus URL (localhost:9090 or the Route URL) in a
browser.
Query: Use the "Graph" or "Table" view. Enter a PromQL (Prometheus Query
Language) query in the expression bar (e.g.,
node_memory_MemAvailable_bytes,
sum(rate(container_cpu_usage_seconds_total{namespace="my-app"}[5m])) by
(pod)).
Description: Allows direct interaction with the Prometheus query engine to retrieve specific time-
series data, test alert rule expressions, or perform advanced analysis beyond the standard Grafana
dashboards. Requires understanding PromQL.
127. If cluster logging is installed, how do you typically access the Kibana UI?
Method: Access via its Route.
Description: The OpenShift Logging stack (based on Elasticsearch, Fluentd, Kibana - EFK) includes
Kibana as the web UI for searching, visualizing, and analyzing the aggregated logs. This command
finds the Route exposing the Kibana UI. Login is typically via OpenShift credentials.
128. How do you check the status of the Elasticsearch pods used for logging?
Command: oc get pods -n openshift-logging -l component=elasticsearch (or similar label depending
on deployment method).
Description: Elasticsearch runs as a StatefulSet to store and index log data. This command lists the
Elasticsearch pods. Check that they are Running, stable (minimal restarts), and that the desired
number of replicas are present. Health issues here impact log storage and search capabilities.
129. How do you check the status of the Fluentd pods responsible for collecting logs from
nodes?
Command: oc get pods -n openshift-logging -l component=fluentd (or similar label). Check the
DaemonSet: oc get ds fluentd -n openshift-logging.
Description: Fluentd runs as a DaemonSet, meaning one pod runs on each eligible node in the
cluster. These pods collect container and node logs and forward them to Elasticsearch. This
command lists these collector pods. Ensure one is Running on each expected node. Problems here
mean logs from specific nodes might be missing.
130. How do you check the status of the Kibana pod?
Description: This command lists the pod(s) running the Kibana web UI and backend service. Ensure
it's Running for users to access the log exploration interface.
131. How would you check the health status (e.g., green, yellow, red) of the Elasticsearch
cluster used for logging?
Methods:
oc exec <any_es_pod_name> -n openshift-logging -c elasticsearch -- curl -s -k -u elastic:$(oc get
secret elasticsearch -n openshift-logging -o jsonpath='{.data.admin-password}' | base64 -d)
"https://localhost:9200/_cluster/health?pretty"
green: All primary and replica shards are allocated and active. Healthy.
yellow: All primary shards are active, but some replica shards are not allocated
(e.g., not enough nodes). Cluster is functional but lacks full redundancy.
red: Some primary shards are not allocated. Cluster is non-functional, data might
be missing, searches will likely fail. Requires immediate investigation.
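The verbose health JSON can be reduced to the fields worth watching by appending jq; here a canned response (illustrative values) stands in for the live API:

```shell
# Stand-in for the _cluster/health response returned by the curl above:
health='{"cluster_name":"elasticsearch","status":"yellow","number_of_nodes":3,"unassigned_shards":2}'

# Summarize the fields that matter for alerting:
echo "$health" | jq -r '"status=\(.status) nodes=\(.number_of_nodes) unassigned=\(.unassigned_shards)"'
```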
132. How can you view the logs of the Fluentd log collector running on a particular node?
Process:
Find Pod: Identify the Fluentd pod running on the target node: oc get pods -n
openshift-logging -o wide --field-selector spec.nodeName=<node_name> -l
component=fluentd
Get Logs: Use oc logs with the pod name found: oc logs <fluentd_pod_name> -n
openshift-logging
Description: Checking the logs of a specific Fluentd pod is essential for troubleshooting log collection
issues originating from that particular node. Logs might show errors connecting to Elasticsearch,
parsing specific log formats, or reading log files.
133. How would you troubleshoot if logs from applications are not appearing in Kibana?
Troubleshooting Steps:
Check Fluentd on Node: Is the Fluentd pod running on the node where the
application pod resides (oc get pods -n openshift-logging -o wide)? Check its logs
(oc logs ...) for errors related to that application's logs or connection issues to
Elasticsearch.
Check Elasticsearch Health: Is the ES cluster healthy (green/yellow)? (oc exec ...
_cluster/health or Kibana UI). If red/yellow, logs might not be indexing correctly.
Check Elasticsearch Disk Space: Is ES running out of disk space? (oc get pvc -n
openshift-logging). Full disks prevent indexing.
Check Kibana Index Pattern: In Kibana -> Stack Management -> Index Patterns,
ensure the correct index pattern (e.g., app-*, infra-*) is configured and includes
the relevant indices. Refresh the pattern if needed.
Check Time Range in Kibana: Ensure the time filter selected in Kibana covers the
period when the logs were generated.
134. How do you monitor the disk usage of the Elasticsearch cluster?
Methods:
Kibana UI: Stack Management -> Index Management often shows disk usage per
node and shard information.
ES API: Use _cat/allocation?v or _cluster/stats APIs via curl inside an ES pod
(similar to the health check command) to get detailed node disk usage.
Description: Monitoring Elasticsearch disk usage is critical because Elasticsearch will stop indexing
new logs if it runs low on disk space, leading to log loss. Regular monitoring and capacity planning
(or index lifecycle management) are essential.
135. What is the role of the node-exporter pods in the monitoring stack? How do you
check their status?
Command to Check Status: oc get pods -n openshift-monitoring -l app.kubernetes.io/name=node-
exporter. Check the DaemonSet: oc get ds node-exporter -n openshift-monitoring.
Role: node-exporter is an official Prometheus exporter that runs as a DaemonSet on every node in
the cluster. Its role is to collect hardware and OS-level metrics from the host node it's running on
(CPU usage, memory usage, disk I/O, network statistics, filesystem usage, etc.). Prometheus then
scrapes these metrics from each node-exporter pod.
Description: These pods provide the fundamental host-level metrics visible in Grafana dashboards
for node performance analysis. Ensuring they are running correctly on all nodes is vital for complete
node monitoring coverage.
136. What is the role of the kube-state-metrics pods? How do you check their status?
Command to Check Status: oc get pods -n openshift-monitoring -l app.kubernetes.io/name=kube-
state-metrics. Check the Deployment: oc get deployment kube-state-metrics -n openshift-
monitoring.
Role: kube-state-metrics listens to the Kubernetes API server and converts information about the
state of Kubernetes objects (like Deployments, Pods, Nodes, Services, PVCs) into metrics that
Prometheus can scrape. For example, it generates metrics for the number of desired vs. available
replicas in a Deployment, pod statuses, PVC statuses, node conditions, etc.
137. How do you check which cluster updates are available?
oc adm upgrade
Key Output: Look for lines like Updates: which list available versions, and Channel: which shows the
currently configured update stream (e.g., stable-4.12, fast-4.13).
138. What is the command to initiate a cluster upgrade to a specific version or the latest
recommended one?
To upgrade to the latest recommended version within the current channel (as shown by oc adm
upgrade): oc adm upgrade --to-latest=true. To upgrade to a specific version: oc adm upgrade
--to=<version>.
Important: Always review the release notes for the target version before initiating an upgrade.
Ensure cluster health and prerequisites are met.
139. How can you monitor the real-time progress of an ongoing cluster upgrade?
There are several ways:
oc adm upgrade: Running this command while an upgrade is in progress will show the target
version and often indicate which component (like a specific Cluster Operator or Machine
Config Pool) is currently being updated.
oc get clusterversion: This shows the overall status, the target version (spec.desiredUpdate),
and the history of applied updates (status.history). The status.conditions will indicate if the
upgrade is Progressing.
oc get clusteroperator or oc get co: Monitor the status of individual operators. During an
upgrade, many operators will temporarily enter the Progressing=True state. Watch for any
operators becoming DEGRADED=True.
watch oc get co
oc get machineconfigpool or oc get mcp: Monitor the status of node pools. They will show
UPDATING=True as nodes within the pool are rebooted with the new configuration. Check
the UPDATEDMACHINECOUNT, READYMACHINECOUNT, and MACHINECOUNT columns.
140. Is it possible to pause an ongoing cluster upgrade? If so, how, and why might you do
it?
Yes, it is possible to pause an ongoing cluster upgrade, but it should be done with caution and
typically only when troubleshooting a blocking issue. The usual mechanism is pausing a
MachineConfigPool so its nodes stop rolling: oc patch mcp/worker --type merge -p
'{"spec":{"paused":true}}' (set paused back to false to resume). Reasons include:
A critical Cluster Operator becomes DEGRADED and blocks progress, requiring investigation
and manual intervention.
An unexpected issue arises in the infrastructure or critical applications during the upgrade
process that needs immediate attention before proceeding.
Caution: Pausing upgrades for extended periods is generally not recommended as it can
leave the cluster in an inconsistent state.
141. What are Machine Config Pools (MCPs), and how do you check their status during an
upgrade?
Machine Config Pools (MCPs): Groups of nodes (typically master and worker, but custom pools can
exist) that share the same MachineConfig. The Machine Config Operator (MCO) manages updates to
nodes within a pool sequentially to apply new configurations (including OS updates delivered via
MachineConfigs during an OCP upgrade).
oc get mcp
Look for the UPDATING column. If True, the pool is actively being updated.
Monitor UPDATEDMACHINECOUNT increasing towards MACHINECOUNT.
Monitor READYMACHINECOUNT to ensure nodes become ready after rebooting.
Check the DEGRADED column for any issues.
oc describe mcp <pool_name> provides more detailed status and events.
142. How can you see which specific MachineConfig version is currently applied to the
nodes in an MCP?
Use oc describe machineconfigpool <pool_name>. Look for the CurrentMachineConfig field (or
status.configuration.name in the YAML output). This shows the name of the rendered MachineConfig
that the pool's nodes are currently running or attempting to apply.
143. How do you monitor the status of individual nodes within an MCP as they are being
updated?
List nodes in the pool: Use labels associated with the pool.
Observe Node Status: During an update, nodes in the pool will be cordoned, drained, rebooted, and
uncordoned one by one (or based on maxUnavailable settings). Watch the STATUS column in oc get
nodes. Nodes will transition through Ready,SchedulingDisabled -> NotReady,SchedulingDisabled ->
Ready.
Check MCD Logs: For detailed progress on a specific node, check the Machine Config Daemon logs
(see Q11).
144. What steps would you take if a Cluster Operator becomes DEGRADED and halts the
upgrade process?
1. Identify the Degraded Operator: Use oc get co to find which operator(s) have
DEGRADED=True.
4. Check Operand Logs: The operator manages other components (operands). Check
the logs of the pods related to the operator's function (e.g., for the ingress operator,
check router pods in openshift-ingress).
8. Attempt Remediation: Based on the findings, attempt to fix the underlying issue
(e.g., fix a configuration error, address resource constraints, resolve network issues).
145. What checks should you perform before initiating a cluster upgrade?
1. Read Release Notes: Thoroughly review the release notes for the target OpenShift
version for known issues, prerequisites, deprecated features, and breaking changes.
3. Check Node Status: Ensure all nodes are in the Ready state (oc get nodes). Address
any NotReady nodes.
4. Check PodDisruptionBudgets (PDBs): Verify that critical application PDBs allow for
sufficient disruptions (Allowed Disruptions > 0) so node drains during MCP updates
do not stall (oc get pdb -A). Misconfigured PDBs are a common cause of upgrade
delays.
5. Check Resource Usage: Ensure sufficient CPU, memory, and storage resources are
available on nodes, especially control plane nodes, to handle the upgrade process.
6. Backup: Perform a recent etcd backup and ensure application data backups (PVs,
databases) are current. Back up critical CRs/YAMLs.
7. Check Network Connectivity: Ensure the cluster can reach required endpoints
(Update Service, Quay.io/Registry.redhat.io, or mirror registry).
9. Check Operator Subscriptions: Ensure any installed Operators (from OperatorHub)
are compatible with the target OpenShift version and their update channels are
appropriate.
146. What is the Machine Config Daemon (MCD), and how do you find its pod on a
specific node?
Machine Config Daemon (MCD): A DaemonSet managed by the Machine Config Operator (MCO). An
MCD pod runs on every node in the cluster. Its primary responsibility is to watch for changes to the
desired MachineConfig for the node it's running on, apply those changes (e.g., writing files,
modifying systemd units), and report status back to the MCO. It orchestrates the node updates
during upgrades or custom config rollouts.
Finding the Pod: MCD pods run in the openshift-machine-config-operator namespace. Use a
field selector to find the pod on a specific node: oc get pods -n openshift-machine-config-operator
-l k8s-app=machine-config-daemon --field-selector spec.nodeName=<node_name>
147. How can you check the logs of the MCD to troubleshoot node update issues?
Once you have identified the MCD pod name on the specific node (using the command from the
previous question), use oc logs: oc logs <mcd_pod_name> -n openshift-machine-config-operator -c
machine-config-daemon
The logs will show details about which MachineConfig it's trying to apply, steps being taken (writing
files, reloading services), interactions with rpm-ostree (for RHCOS), drain/cordon operations, and any
errors encountered during the update process.
148. What is a MachineConfig object? How are custom node configurations typically
applied?
MachineConfig Object: A Kubernetes Custom Resource (CR) used by the Machine Config Operator
(MCO) to define the configuration state of nodes (specifically RHCOS nodes) in an OpenShift cluster.
They can contain Ignition configuration snippets, systemd units, files, kernel arguments, etc. The
MCO combines multiple MachineConfigs (base OS config, cluster-specific settings, custom settings)
into a single "rendered" MachineConfig for each pool.
1. Create a new MachineConfig YAML file defining your desired change (e.g.,
adding a kernel argument, creating a file). Label it for the target pool; if
no label is specified, it typically applies to all pools where its settings don't
conflict.
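As a sketch of step 1, a custom MachineConfig that writes a file on all worker nodes might look like the following (the name, file path, and contents are examples; the role label selects the target pool):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-custom-motd                      # example name; 99- orders it after base configs
  labels:
    machineconfiguration.openshift.io/role: worker # targets the worker pool
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - path: /etc/motd.d/cluster-banner           # example file to create on each node
        mode: 0644
        overwrite: true
        contents:
          source: data:,Managed%20by%20the%20MCO%0A
```

Applying it with oc apply -f causes the MCO to render a new config and roll it out to the pool node by node.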
149. How can you view the final "rendered" MachineConfig for a pool, which combines
multiple configuration sources?
1. First, find the name of the current rendered config for the pool using oc describe
mcp <pool_name> and look for CurrentMachineConfig or status.configuration.name.
2. Then view it: oc get machineconfig <rendered_config_name> -o yaml
This YAML will contain the combined configuration from the base OS, cluster settings, and any
applied custom MachineConfigs for that specific pool.
150. Explain the role of PodDisruptionBudgets (PDBs) during node maintenance and
upgrades.
PodDisruptionBudgets (PDBs): Kubernetes objects that limit the number of pods of a specific
application (identified by labels) that can be voluntarily disrupted simultaneously. Voluntary
disruptions include actions like node drains performed during upgrades or maintenance.
Role during Upgrades/Maintenance: When the MCO (via MCD) or an administrator initiates a node
drain (oc adm drain), the drain process respects PDBs. Before evicting a pod covered by a PDB, the
system checks if the eviction would violate the budget (i.e., cause the number of available pods for
that application to fall below the PDB's specified minimum available or maximum unavailable count).
Impact: If evicting a pod would violate its PDB, the node drain operation will block until the PDB
allows the disruption (e.g., after other pods become ready elsewhere). This ensures application
availability but can stall upgrades or maintenance if PDBs are too restrictive (e.g., minAvailable: 1 for
a single-replica deployment) or if pods cannot be rescheduled successfully. It's crucial to configure
PDBs correctly to balance availability with the ability to perform cluster maintenance.
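For illustration, a PDB that lets node drains proceed while keeping at least one replica of a hypothetical two-replica application available might look like:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb        # example name
  namespace: my-app       # example project
spec:
  minAvailable: 1         # a drain may evict one pod at a time when 2 replicas are Ready
  selector:
    matchLabels:
      app: my-app         # must match the application's pod labels
```

With 2 Ready replicas, Allowed Disruptions is 1; note that the same minAvailable: 1 on a single-replica deployment would block drains entirely.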
151. How do you list all Users known to the OpenShift cluster?
Use the oc get users command.
Note: This only shows users recognized by OpenShift, not necessarily all users defined in an external
identity provider (like LDAP) unless they have logged in.
152. How do you list all Groups known to the OpenShift cluster?
Use the oc get groups command. This lists Group objects within the cluster. These groups can be
synchronized from an external identity provider (like LDAP groups) or created manually within
OpenShift using oc adm groups new.
oc get groups
Example:
155. How do you remove users from a group?
Use the oc adm groups remove-users command. Specify the group name and the user(s) to remove.
oc adm groups remove-users <group_name> <user_name_1> <user_name_2> ...
# Example: Remove 'testuser' from the 'app-devs' group
oc adm groups remove-users app-devs testuser
156. If using the HTPasswd identity provider, what is the process for adding a new user?
Adding a user via HTPasswd involves modifying the htpasswd file used by the provider and updating
the corresponding secret in OpenShift:
1. Update the htpasswd file locally: htpasswd -bB users.htpasswd <new_user> <password>
2. Replace the secret referenced by the IDP (commonly htpass-secret in openshift-config):
oc create secret generic htpass-secret --from-file=htpasswd=users.htpasswd
--dry-run=client -o yaml -n openshift-config | oc replace -f -
157. How can you verify the group memberships for a specific user?
Use oc get user <user_name> -o yaml. The output YAML will contain a groups: field listing all the
groups OpenShift recognizes that user as being a member of.
oc get user developer -o yaml
Look for the groups: section in the output. It might be null if the user belongs to no groups
recognized by OpenShift.
158. How do you assign the cluster-level cluster-admin role to a specific user? What are
the risks?
Command: Use oc adm policy add-cluster-role-to-user.
oc adm policy add-cluster-role-to-user cluster-admin <user_name>
Risks: Assigning cluster-admin grants unrestricted superuser access to the entire OpenShift cluster.
The user can perform any action on any resource in any project, including modifying cluster
configurations, managing nodes, deleting projects, viewing all secrets, and changing security settings.
This role should be assigned extremely sparingly and only to trusted cluster administrators
responsible for the overall health and management of the platform. Accidental or malicious actions
by a cluster-admin can have catastrophic consequences.
159. How do you revoke the cluster-admin role from a user?
oc adm policy remove-cluster-role-from-user cluster-admin <user_name>
This removes the ClusterRoleBinding that grants the specified user the cluster-admin ClusterRole.
160. What is the command to create a new project (namespace) with a display name and
description?
Use the oc new-project command.
oc new-project <project_name> --display-name="Your Display Name" --description="Project
description here"
# Example:
oc new-project my-app-prod --display-name="My App (Production)" --description="Production
environment for My App"
This creates the Kubernetes Namespace and associated OpenShift Project object, applying any
default templates or configurations defined by the cluster administrator. The user running the
command automatically gets the admin role within the new project.
161. How can you inspect the template used to create default resources when a new
project is requested?
New projects are typically created based on a cluster-level template. You can inspect this template,
usually named project-request, located in the openshift-config namespace.
oc describe template project-request -n openshift-config
# Or view the full YAML
oc get template project-request -n openshift-config -o yaml
This template defines default objects like RoleBindings (granting the creator admin rights),
LimitRanges, or potentially default NetworkPolicies that are created automatically whenever oc new-
project is executed.
162. What is a ResourceQuota object used for? How would you apply one to a project?
Purpose: A ResourceQuota object constrains the total amount of compute resources (CPU, memory),
storage resources (PVC count, total storage capacity), or object counts (pods, services, secrets) that
can be consumed within a specific project (namespace). It helps prevent resource exhaustion and
ensures fair usage across different projects or teams.
Applying:
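A sketch of a ResourceQuota manifest (the name and limits are illustrative) that could be applied with oc apply -f quota.yaml -n <project_name>:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota              # example name
spec:
  hard:
    requests.cpu: "4"              # total CPU requested across all pods
    requests.memory: 8Gi           # total memory requested
    limits.memory: 16Gi            # total memory limits
    pods: "20"                     # maximum number of pods
    persistentvolumeclaims: "5"    # maximum number of PVCs
```

Usage can be checked afterwards with oc describe quota -n <project_name>.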
163. What is a LimitRange object used for? How would you apply one to a project?
Purpose: A LimitRange object defines constraints on resource requests and limits for individual Pods
or Containers within a project. It can set default request/limit values if not specified by the container,
enforce minimum/maximum values, and control the ratio between requests and limits. This helps
ensure pods have reasonable resource settings even if not explicitly defined.
Applying:
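A sketch of a LimitRange manifest (name and values are illustrative) that could be applied with oc apply -f limits.yaml -n <project_name>:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-limits    # example name
spec:
  limits:
  - type: Container
    defaultRequest:         # applied when a container specifies no request
      cpu: 100m
      memory: 256Mi
    default:                # applied as the limit when none is specified
      cpu: 500m
      memory: 512Mi
    max:                    # hard per-container ceiling
      memory: 2Gi
```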
164. How can you display the authentication token currently used by your oc CLI session?
Use the oc whoami --show-token command. This will output the bearer token associated with your
current login session. This token is used by oc to authenticate subsequent requests to the OpenShift
API server.
oc whoami --show-token
Handle this token carefully, as it grants the permissions associated with your user account.
165. What command initiates a login using a username and password?
Use the oc login command, providing the API server URL and credentials.
oc login <api_server_url> -u <username> -p <password>
# Example: oc login https://api.mycluster.example.com:6443 -u developer -p mysecretpassword
You can often omit the password (-p) and be prompted securely. This command authenticates against
the identity provider configured for the cluster (e.g., HTPasswd, LDAP).
168. How do you get a list of all projects your current user has access to?
Use the oc projects command. This queries the API server and lists all the projects (namespaces) for
which your currently logged-in user has at least view permissions.
oc projects
169. How would you check the overall health of the internal image registry?
oc get co image-registry
# Ensure AVAILABLE=True, PROGRESSING=False, DEGRADED=False
oc describe co image-registry # Check for detailed status messages/errors
Check the Deployment: Verify the image-registry deployment in the openshift-image-registry
namespace is available and its pods are running and ready.
170. How would you find the external URL (Route) for the internal registry, if one is
configured?
By default, the internal registry is not exposed externally with a Route. If it has been manually
exposed:
Check Operator Configuration: The exposure might be configured via the registry operator's config
(spec.defaultRoute: true creates a default-route in openshift-image-registry).
171. What is the standard internal service hostname and port for the OpenShift image
registry?
The internal registry is accessible within the cluster using its Kubernetes service name and port:
Hostname: image-registry.openshift-image-registry.svc
Port: 5000
Pods within the cluster use this address to push and pull images from the
internal registry.
172. How can you check the storage backend configuration (e.g., PVC, S3, filesystem) for
the internal registry?
Examine the imageregistry.operator.openshift.io cluster configuration resource:
oc get config.imageregistry.operator.openshift.io/cluster -o jsonpath='{.spec.storage}'
# Or view the full YAML for more context:
oc get config.imageregistry.operator.openshift.io/cluster -o yaml
The spec.storage field will show the configured backend, such as pvc, s3, azure, gcs, swift, or
emptyDir (not recommended for production). It will also contain specific parameters for the
chosen backend (like PVC name/claim, bucket names, credentials secret).
173. If the registry uses persistent storage (PVC), how do you find the associated PVC?
Check Operator Config: Get the storage configuration as shown above (oc get config.imageregistry...
-o yaml). If spec.storage.pvc is configured, it will contain the claim name.
Get PVC: Use the claim name found in the config to get the PVC details in the openshift-image-
registry namespace.
174. Explain the purpose of oc adm prune images. What options can control its behavior?
Purpose: The oc adm prune images command removes unused image layers and manifests from the
internal OpenShift registry to reclaim storage space. It identifies images that are no longer
referenced by any ImageStream tags and image layers (blobs) that are not part of any remaining
image manifest stored in the registry.
Common Options:
--all: Prune images even if they are not part of any image stream (use
cautiously).
--keep-tag-revisions=<n>: Keep the n most recent revisions of each tag.
--keep-younger-than=<duration>: Do not prune images newer than this age.
--confirm: Actually perform the prune; without it, the command is a dry run.
175. How can you check if the automated image pruner CronJob is configured and running
successfully?
Check CronJob: Look for the image-pruner CronJob in the openshift-image-registry namespace.
Check Last Schedule/Run: The output shows the schedule (SCHEDULE), suspend status (SUSPEND),
last scheduled time (LAST SCHEDULE), and age.
Check Job History: List the jobs created by the CronJob to see recent runs and their completion
status.
Check Job Logs: View the logs of a completed pruner job's pod for details on what was pruned or any
errors.
176. How do you list the ImageStreams in a project?
oc get is -n <project_name>
177. How can you view the different tags within an ImageStream and the image digests
they point to?
Use oc describe imagestream or oc describe is.
The output will list each tag (e.g., latest, v1.0, prod) and show the image digest (SHA)
it currently points to, along with the registry location and when it was created or
updated. It also shows the history of images previously associated with each tag.
178. What command imports an image from an external registry into an OpenShift
ImageStream?
Use the oc import-image command. This command inspects the external image and updates or
creates a tag within the specified ImageStream to point to that external image's digest.
oc import-image <imagestream_name_or_imagestreamtag_name> --
from=<external_registry/image:tag> --confirm -n <project_name>
# Example: Import a specific image and tag it as 'stable' in the 'my-app' imagestream
--confirm: Required when the import would create a new ImageStream or change its
import location.
If the ImageStream doesn't exist, this command can create it if you specify
<name>:<tag>.
179. Where is the cluster's global pull secret stored, and how do you inspect its contents?
Location: The cluster-wide pull secret, containing credentials needed by nodes to pull images
(including for OpenShift components from Red Hat registries), is stored as a Secret named pull-secret
in the openshift-config namespace.
Inspection:
The actual credentials are in the .data[".dockerconfigjson"] field, base64 encoded. To
decode and view the JSON content:
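A minimal inspection sketch, assuming jq is available for pretty-printing:

```shell
# Inspect the Secret object itself
oc get secret pull-secret -n openshift-config -o yaml

# Decode the embedded dockerconfigjson
oc get secret pull-secret -n openshift-config \
  -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq .
```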
180. Describe the process for adding credentials for a new private registry to the cluster's
global pull secret.
Modifying the global pull secret requires care as it affects the entire cluster.
Get Current Secret: Extract the current decoded .dockerconfigjson data into a file:
Prepare New Credentials: Create a temporary Docker config.json file containing only the credentials
for the new private registry. You can often generate this by running podman login
<your_private_registry> or docker login <your_private_registry> locally and copying the relevant
entry from your local ~/.docker/config.json or ~/.config/containers/auth.json. It will look something
like:
"auths": {
  "my-private-registry.example.com": {
    "auth": "BASE64_ENCODED_USERNAME:PASSWORD",
    "email": "your-email@example.com"
  }
}
Merge Credentials: Merge the new registry credentials into the current_config.json file downloaded
in step 1. You can do this manually by editing the JSON or using a tool like jq. Ensure the final
structure is correct JSON with multiple entries under "auths".
Patch the Secret: Update the pull-secret in openshift-config with the new merged and base64
encoded content.
The cluster nodes will gradually pick up the updated secret. This process ensures existing credentials
(like Red Hat registry access) are preserved.
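The merge step can be sketched with jq. In this self-contained example the contents of current_config.json are fabricated placeholders, and the oc commands that would bracket it on a real cluster are shown only as comments:

```shell
# On a real cluster, step 1 would be:
#   oc get secret pull-secret -n openshift-config \
#     -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d > current_config.json
# Here we fabricate a minimal current_config.json instead:
cat > current_config.json <<'EOF'
{"auths":{"registry.redhat.io":{"auth":"EXISTING_TOKEN"}}}
EOF

# New credentials for the private registry (all values are placeholders)
cat > new_registry.json <<'EOF'
{"auths":{"my-private-registry.example.com":{"auth":"BASE64_ENCODED_USERNAME:PASSWORD","email":"your-email@example.com"}}}
EOF

# Merge the two "auths" maps; on a key collision the new entry wins
jq -s '{auths: (.[0].auths + .[1].auths)}' \
  current_config.json new_registry.json > merged_config.json

# On a real cluster, step 4 would push the merged file back:
#   oc set data secret/pull-secret -n openshift-config \
#     --from-file=.dockerconfigjson=merged_config.json
```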
Usage for Mirroring: In disconnected or restricted network environments, ICSPs are crucial. They tell
nodes: "When you need to pull an image from registry.redhat.io/ubi8/ubi, try pulling it from
mymirror.internal:5000/ubi8/ubi instead." This redirects pulls for specific repositories (or entire
registries) to a local mirror that contains copies of the required images, avoiding the need for direct
internet access from cluster nodes. Multiple mirrors can be specified for redundancy.
182. How do you list the currently configured ICSPs in the cluster?
Use oc get imagecontentsourcepolicy.
oc get imagecontentsourcepolicy
# Or using the short name:
oc get icsp
# Output Columns: NAME AGE
# redhat-mirror 120d
# my-app-mirror 55d
183. How can you check if policies related to image signature verification are configured?
Image signature verification policies are configured in the cluster-wide image configuration resource:
Look within the spec: section for fields related to policy, such as policyJson or
references to ClusterImagePolicy objects (if using the image-policy-operator).
The policyJson field (if used directly) contains the detailed policy
rules defining trusted registries, keys, and enforcement actions (reject,
allow). Alternatively, list ClusterImagePolicy resources:
oc get clusterimagepolicy
These backups are crucial for disaster recovery scenarios where the etcd cluster
becomes corrupted or lost, as etcd holds the definitive state of the cluster.
The operator ensures backups are taken consistently across the etcd members. While
manual triggering is possible via scripts on the master nodes, relying on the automated
backups configured via the operator is the standard practice.
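If a manual backup is needed, the documented approach is to run the cluster-backup.sh script on a master node; the output directory below is only an example:

```shell
# Open a shell on a master node
oc debug node/<master_node_name>

# Inside the debug pod, switch into the host filesystem and run the backup script
chroot /host
/usr/local/bin/cluster-backup.sh /home/core/assets/backup

# The script produces an etcd snapshot (snapshot_<timestamp>.db) and an archive
# of static pod resources (static_kuberesources_<timestamp>.tar.gz)
```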
185. Where are the etcd backups typically stored on the master nodes?
By default, the etcd operator stores these backups on the local filesystem of each master node.
Common default locations include:
/etc/kubernetes/static-pod-resources/etcd-backup/
/var/lib/etcd-backup/
The exact path can be confirmed by inspecting the configuration of the etcd Cluster Operator
or the static pod definition for etcd (/etc/kubernetes/manifests/etcd-pod.yaml on the
masters).
Crucially for Disaster Recovery: These on-node backups must be copied off the cluster nodes to a
secure, external location (e.g., remote storage like NFS, S3, or a dedicated backup server). Relying
solely on backups stored locally on the masters does not protect against complete node or site
failure.
186. What are common strategies for backing up application data stored in Persistent
Volumes? Mention OADP/Velero.
Backing up Persistent Volume (PV) data requires considering the storage backend and application
consistency needs. Common strategies include:
Storage-Level Snapshots: Many underlying storage systems (SAN, NAS, Cloud Provider Block Storage,
ODF/Ceph) offer native snapshot capabilities. These create point-in-time copies of volumes quickly
and often efficiently at the block level. Integration might require vendor-specific tools or APIs.
CSI VolumeSnapshots: The Kubernetes Container Storage Interface (CSI) standard includes support
for volume snapshots. If your storage driver supports this, you can create Kubernetes
VolumeSnapshot objects, which trigger the underlying storage provider to create a snapshot in a
vendor-neutral way. This is becoming the preferred method for Kubernetes-integrated volume
snapshots.
Application-Level Backups: For stateful applications like databases, simply snapshotting the disk
might not guarantee data consistency. Using application-specific tools (pg_dump, mysqldump,
application export features) to create consistent backups is often essential. These backup files can
then be stored either within another PV or, more commonly, pushed to external backup storage (like
S3).
OADP (OpenShift API for Data Protection) / Velero: This is the Red Hat recommended cloud-native
solution. OADP, built upon the upstream Velero project, provides a framework for backing up and
restoring OpenShift applications.
187. How can you perform a basic backup of OpenShift resource definitions (like
Deployments, Services) as YAML files?
You can use the oc get command combined with the -o yaml output format and shell redirection.
Limitations:
This method requires manual identification of all necessary resource types.
It doesn't automatically handle dependencies between resources.
Restoring requires applying files in the correct order.
It captures the state at that moment and doesn't include PV data.
Tools like OADP/Velero are generally preferred for comprehensive application backups as
they handle these complexities better.
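A simple sketch of such an export loop (the project name and resource list are illustrative):

```shell
PROJECT=my-app
BACKUP_DIR=./backup/$PROJECT
mkdir -p "$BACKUP_DIR"

# Export common application resource types, one file per type
for kind in deployment service route configmap secret pvc; do
  oc get "$kind" -n "$PROJECT" -o yaml > "$BACKUP_DIR/$kind.yaml"
done
```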
188. How would you back up User and Group definitions?
User and Group objects are cluster-scoped resources in OpenShift. You can export their definitions
using oc get:
This only backs up the OpenShift representation of the users and groups.
You must have a separate backup strategy for your IDP itself.
Restoring just these OpenShift objects without the backing IDP may result in
incomplete user/group information or login failures.
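A minimal export sketch (file names are arbitrary; Identity objects are included because they link Users to their IDP identities):

```shell
# Cluster-scoped resources, so no namespace flag is needed
oc get users -o yaml > users-backup.yaml
oc get groups -o yaml > groups-backup.yaml
oc get identities -o yaml > identities-backup.yaml
```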
189. How would you back up cluster-wide and project-specific RBAC definitions?
Role-Based Access Control (RBAC) definitions include Roles, ClusterRoles, RoleBindings, and
ClusterRoleBindings.
Project-Specific (Roles, RoleBindings): These are namespaced. You can back them up per project or
across all projects.
Note: Backing up RBAC is crucial for restoring application permissions correctly. OADP/Velero
typically includes relevant RBAC resources when backing up namespaces.
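For example (project and file names are illustrative):

```shell
# Cluster-scoped RBAC
oc get clusterroles,clusterrolebindings -o yaml > cluster-rbac-backup.yaml

# Namespaced RBAC for one project
oc get roles,rolebindings -n my-app -o yaml > my-app-rbac-backup.yaml

# Namespaced RBAC across all projects
oc get roles,rolebindings --all-namespaces -o yaml > all-rbac-backup.yaml
```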
190. If using OADP (Velero), how do you check the status of the Velero pods?
The OADP Operator installs Velero components, usually into the openshift-adp namespace (this can
be customized during installation). Check the pods in that namespace:
Look for:
The main velero deployment pod(s): Responsible for coordinating backups and restores.
The node-agent DaemonSet pods (one per node): Used if employing file-level PV backups
(like Restic/Kopia). Not always present if only using CSI snapshots.
Plugin pods for specific providers (e.g., velero-plugin-for-aws, velero-plugin-for-vsphere).
Ensure these pods are in the Running state and have their containers ready (e.g., 1/1 or 2/2).
191. How do you trigger an ad-hoc backup using the velero CLI?
Once the velero command-line tool is installed and configured to point to your cluster and backup
storage location, use the velero backup create command.
--wait: Waits for the backup to complete and reports the status.
Alternatively, create a Backup Custom Resource definition in YAML and apply it using oc
apply -f backup-crd.yaml -n openshift-adp.
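Two hedged examples (the backup names, selector, and TTL are illustrative):

```shell
# Back up a single namespace and wait for the result
velero backup create my-app-backup --include-namespaces my-app --wait

# Scope by label instead of namespace, with a 30-day retention (TTL)
velero backup create nightly-frontend --selector app=frontend --ttl 720h
```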
192. How do you check the status and details of completed OADP/Velero backups?
Use the velero backup get and velero backup describe commands provided by the Velero CLI.
# List all backups, their status, creation time, expiration, etc.
velero backup get
# Show detailed information about a specific backup
velero backup describe <backup_name>
# Example: velero backup describe my-app-backup-202504251100
# Download logs for a specific backup (useful for failures)
velero backup logs <backup_name>
velero backup get shows the PHASE (e.g., Completed, PartiallyFailed, Failed).
You can also view the Backup Custom Resources using oc: oc get backups -n openshift-adp.
193. Describe, at a high level, the process involved in restoring the cluster from an etcd
backup. Why is it a DR scenario?
High-Level Process: Restoring from an etcd backup is a critical Disaster Recovery (DR) procedure
used only when the etcd cluster (which holds the entire cluster state) is corrupt, lost, or otherwise
unrecoverable. The general steps are:
1. Stop Control Plane: Ensure the Kubernetes API server and other control
plane components are stopped on all master nodes to prevent conflicting
writes.
3. Initialize Restore: On one master node, use etcd utilities (etcdctl snapshot
restore) or documented OpenShift recovery procedures/scripts to restore
the snapshot into a new etcd data directory.
4. Reconfigure Etcd: Adjust the etcd configuration to reflect the restored state
and potentially a single-member initial cluster.
5. Start Initial Node: Start the etcd service and potentially the API server on
this first restored master.
6. Clean & Join Other Masters: On the other master nodes, completely remove
their old etcd data directories. Configure them to join the etcd cluster hosted
by the first restored node.
7. Verify & Restart: Once etcd quorum is re-established and stable, restart all
control plane components across all masters and verify cluster health.
Restart nodes if necessary.
Why DR:
Requires Full Control Plane Outage: The API server must be down during the
core restore process.
Data Loss: All cluster changes (new applications, configurations, secrets, etc.)
made after the timestamp of the etcd backup being restored are
permanently lost. The cluster reverts entirely to the state captured in the
backup.
Last Resort: It's used only when the cluster state database is fundamentally
broken and cannot be repaired through normal operator recovery or
quorum adjustments.
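On OpenShift, the middle steps are largely wrapped by the documented cluster-restore.sh recovery script. A rough sketch of its use on the chosen recovery master (the path is the commonly documented default; verify against your version's DR procedure):

```shell
# On the recovery master (via SSH or oc debug + chroot /host), with the backup
# directory containing the snapshot_<timestamp>.db file and the
# static_kuberesources archive copied into place:
sudo -E /usr/local/bin/cluster-restore.sh /home/core/assets/backup

# Then restart the kubelet on each master and watch the control plane recover
sudo systemctl restart kubelet
```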
194. How would you restore application resources if you only had YAML backups?
Restoring from individual YAML files requires careful planning and execution, especially regarding
dependencies.
1. Prepare Target: Ensure the target project (namespace) exists (oc new-project ... if
needed). Verify cluster-level dependencies like required CRDs or StorageClasses are
present on the target cluster.
2. Determine Apply Order: Apply foundational resources (Secrets, ConfigMaps, PVCs)
before the workloads, Services, and Routes that depend on them.
3. Apply Resources: Use oc apply -f <filename> -n <project_name> for each YAML file
in the determined order. Using oc apply is generally safer than oc create as it handles
existing resources.
4. Verify: Check the status of restored pods, services, and routes (oc get pods, oc
describe pod, oc logs).
Limitations: This method does not restore PV data. It can be error-prone due to dependency
ordering. Generated resource names might cause issues if not handled correctly.
195. How is PV data typically restored when using storage-level snapshots or CSI
VolumeSnapshots?
Storage-Level Snapshots: The exact procedure depends on the storage vendor's tools and
capabilities:
2. Use the storage vendor's interface (CLI/GUI) to create a new volume cloned
from that snapshot. Restoring in-place over the original volume is possible
but often riskier.
# Example restored-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data-restored
spec:
  storageClassName: ocs-storagecluster-ceph-rbd # Example
  dataSource:
    name: my-app-data-snapshot-20250425 # Name of VolumeSnapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
196. How do you initiate a restore operation using the velero CLI?
Use the velero restore create command, referencing the backup you want to restore from.
197. How do you monitor the progress and check the status of an OADP/Velero restore
operation?
Use the Velero CLI commands:
velero restore get shows the PHASE (e.g., New, InProgress, Completed, PartiallyFailed,
Failed).
velero restore describe is essential for details. It lists the total items to restore, how many
have been processed, warnings (e.g., resource already exists), and critical errors that caused
failures.
Monitor the target namespace(s) directly using oc get pods, oc get events, etc., to see
resources being created and pods starting up.
The Restore Custom Resources can also be checked via oc get restores -n openshift-adp.
The install-config.yaml file used by the OpenShift installer contains the fundamental, user-provided
configuration choices made before the cluster existed. Keeping a safe copy is crucial for several
reasons:
199. How do you identify the pods consuming the most CPU across the entire cluster?
This command queries the cluster's metrics server (usually deployed by default)
and lists pods from all namespaces, ordered by their current CPU consumption
(typically shown in millicores). This helps quickly identify potential CPU hotspots.
200. How do you identify the pods consuming the most memory across the entire cluster?
Similar to CPU, use oc adm top pods with the -A flag and sort by memory.
This lists pods from all namespaces ordered by their current memory
consumption (typically shown in MiB or GiB). This helps identify pods that might
be causing memory pressure on nodes.
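The corresponding command sketch:

```shell
# Pods across all namespaces, highest memory consumers first
oc adm top pods -A --sort-by=memory
```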
201. How can you check the configured CPU/Memory requests and limits for a specific
running container?
Use the oc describe pod command and inspect the Resources section for the specific container.
In the output, navigate to the Containers: section, find the relevant container
name, and look under its Resources: subsection. This will show the configured
Requests (amount guaranteed) and Limits (maximum allowed) for both cpu and
memory. If not explicitly set, defaults from a LimitRange might apply, or they
might be unset.
202. What is the most common way to check if a pod was terminated due to exceeding its
memory limit (OOMKilled)?
The most common way is to use oc describe pod.
1. Container Status: Under the State: or Last State: (if it terminated) of the
relevant container, the Reason: field will often show OOMKilled.
2. Events: The Events section at the bottom might show events related to the
pod being killed due to OOM, often indicating which node it occurred on.
OOMKilled means the container used more memory than its configured limit, and
the Linux kernel terminated the process.
203. How can you investigate if a container is being CPU throttled?
CPU throttling occurs when a container tries to use more CPU time than its configured limit allows
over a period. You can investigate this using:
1. Metrics: Query the cluster's Prometheus instance (via Grafana dashboards or direct
query) for metrics like:
2. oc adm top pod --containers: While this primarily shows current usage, consistently
high usage near the limit might correlate with throttling, although metrics are more
definitive.
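The cAdvisor CFS metrics are the usual signal here; a hedged PromQL sketch computing the fraction of scheduling periods in which each container was throttled:

```promql
# Near 0 = rarely throttled; near 1 = throttled almost every period
sum by (namespace, pod, container) (
  rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])
)
/
sum by (namespace, pod, container) (
  rate(container_cpu_cfs_periods_total{container!=""}[5m])
)
```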
204. What is the purpose of the Performance Addon Operator and PerformanceProfiles?
Purpose: The Performance Addon Operator is designed to optimize OpenShift nodes for high-performance,
low-latency workloads, often required in fields like Telco (NFV), High-Performance
Computing (HPC), and real-time financial applications.
205. How can you verify the tuned profile currently active on a node?
The tuned daemon applies system tuning profiles. To check the active profile on an RHCOS node:
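A sketch of two ways to check (the Node Tuning Operator's Profile objects report the same information cluster-side):

```shell
# Query the active tuned profile directly from the host
oc debug node/<node_name> -- chroot /host tuned-adm active

# Or list the applied profiles as seen by the Node Tuning Operator
oc get profiles.tuned.openshift.io -n openshift-cluster-node-tuning-operator
```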
206. How do you check the status of the Performance Addon Operator components?
The operator typically runs in the openshift-performance-addon-operator namespace (or similar,
check operator installation details).
Ensure the operator deployment is available and its pods are running. Check the
status conditions of any applied PerformanceProfile CRs for errors.
Benefit: For applications that manage large amounts of memory (like databases,
JVMs, scientific computing), using the standard small page size can lead to frequent
TLB misses, as the TLB can only hold a limited number of mappings. Hugepages
(typically 2MB or 1GB) allow single TLB entries to map much larger memory regions.
This significantly reduces TLB misses, improving memory access performance and
overall application throughput for memory-intensive workloads.
default_hugepagesz=<size>
3. Apply MachineConfig: Apply the custom MachineConfig. The MCO will roll out the
change to the nodes in the pool, requiring node reboots.
4. Pod Specification: Applications needing hugepages must request them in their pod
spec's resources.limits section (e.g., hugepages-1Gi: 8Gi).
The Performance Addon Operator can also automate hugepage configuration as part of a
PerformanceProfile.
209. How do you check the number of hugepages configured and available on a node?
Use oc describe node <node_name> and look at the Capacity and Allocatable sections in the output.
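For example (the values shown are illustrative and depend on the node's configuration):

```shell
oc describe node <node_name> | grep -i hugepages
# Typical fields in Capacity/Allocatable:
#   hugepages-1Gi:  8Gi
#   hugepages-2Mi:  0
```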
210. What Prometheus metrics would you check to assess etcd performance, particularly
disk latency?
Monitoring etcd performance is critical for cluster stability. Key Prometheus metrics related to disk
latency include:
etcd_disk_wal_fsync_duration_seconds_bucket: Histogram of WAL (Write Ahead
Log) fsync durations. High latencies here indicate slow disk writes for transaction
logging, which severely impacts performance. Check the higher percentile
buckets (e.g., le="0.1", le="0.5").
etcd_disk_backend_commit_duration_seconds_bucket: Histogram of backend
commit durations (writing state to disk). High latencies indicate slow disk
performance for persisting the main database.
etcd_server_leader_changes_seen_total: Frequent leader changes can indicate
network instability or performance issues.
General etcd health metrics (etcd_server_has_leader,
etcd_server_health_success, etcd_server_health_failures).
These are often visualized in the default OpenShift etcd Grafana dashboard.
211. What Prometheus metric helps measure Kubernetes API server request latency?
The primary metric for API server request latency is:
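In recent Kubernetes versions this is the apiserver_request_duration_seconds histogram. A hedged sketch of a p99 latency query broken down by verb and resource:

```promql
histogram_quantile(0.99,
  sum by (verb, resource, le) (
    rate(apiserver_request_duration_seconds_bucket{job="apiserver"}[5m])
  )
)
```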
212. How would you monitor the performance of the OpenShift Ingress Controllers?
OpenShift Ingress Controllers (routers) are typically based on HAProxy and expose HAProxy metrics
that can be scraped by Prometheus. Key aspects to monitor include:
Resource Usage: CPU and Memory usage of the router pods (oc adm top pod ... -
n openshift-ingress).
These metrics are usually available in the default HAProxy Grafana dashboard
provided by OpenShift monitoring.
213. What are some strategies to optimize container image pull times within the cluster?
Slow image pulls delay application startup and scaling. Strategies include:
Optimize Images:
ImageStream Pre-pulling (Less common): For critical images, potentially use DaemonSets or
CronJobs to explicitly pull specific ImageStreamTags onto nodes ahead of time, though this adds
complexity.
1. Check Node Conditions: oc describe node <node_name>. Look at the Conditions
section (e.g., MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable,
KubeletReady). Note the LastTransitionTime and Reason/Message for clues.
2. Check Kubelet Logs: This is often the most critical step. Use oc debug
node/<node_name> to get a shell, then chroot /host journalctl -u kubelet -f --since
"10 minutes ago". Look for errors related to PLEG (Pod Lifecycle Event Generator),
communication with the API server, resource pressure, CNI issues, or certificate
problems.
3. Check Node Resource Usage: Use oc adm top node <node_name> to check real-time
CPU/Memory usage. Use chroot /host df -h (in debug pod) for disk usage, especially
on /var/lib/containers or /var/log. High usage can cause instability.
4. Check Network Connectivity: From the node (via debug pod), try pinging/curling the
API server internal endpoint (api-int.<cluster_name>.<base_domain>). Check DNS
resolution (chroot /host resolvectl status). Check node network interface status
(chroot /host ip a).
5. Check CRI-O Logs: Use chroot /host journalctl -u crio -f --since "10 minutes ago" to
check for container runtime issues.
215. After starting a cluster upgrade, several Cluster Operators go into a DEGRADED state.
What is your troubleshooting approach?
This indicates problems applying the new version or configuration for those components.
1. Identify Degraded Operators: oc get co. Note which specific operators are
DEGRADED=True.
5. Check Operand Logs: Check logs of the components managed by the operator (e.g.,
for the etcd operator, check etcd pods in openshift-etcd; for the ingress operator,
check router pods in openshift-ingress).
6. Check Related Resources: oc describe co lists related objects. Check their status (oc
get deployment/daemonset/...).
7. Check Upgrade Progress: Run oc adm upgrade again. It might provide specific
blocking messages.
8. Consult Release Notes: Re-check the target version's release notes for known
upgrade issues related to the specific operators.
9. Consider Pausing: If the issue isn't immediately obvious, consider pausing further
machine config rollouts (oc patch mcp worker --type=merge -p '{"spec":{"paused":
true}}') to prevent further node changes while investigating.
216. A user reports their pod is stuck in the Pending state. What are the most likely causes
you would check first?
Pending means the scheduler cannot place the pod onto a suitable node.
0/X nodes are available: X node(s) had taints that the pod didn't
tolerate. (Pod cannot run on available nodes due to taints).
2. Check Resource Requests: Does the pod request more CPU/memory than any single
node can provide (oc describe node <node> shows Allocatable resources)?
3. Check Node Availability: Are there enough nodes in the Ready state (oc get nodes)?
Are worker nodes cordoned (SchedulingDisabled)?
4. Check Taints/Tolerations: Do available nodes have taints (oc describe node <node> |
grep Taints) that the pod doesn't tolerate (oc describe pod <pod> shows
Tolerations)?
5. Check Node Selectors/Affinity: Does the pod spec have nodeSelector or nodeAffinity
rules (oc describe pod <pod>) that don't match any available node labels (oc get
node --show-labels)?
6. Check PVC Status: If the pod mounts a PVC, is the PVC Bound (oc get pvc -n
<project>)? If the PVC is also Pending, troubleshoot the storage issue first (see Q12).
7. Check Quotas: Has the project hit its resource quota limits (oc describe
resourcequota -n <project>) for pods, CPU, or memory?
217. A pod is stuck in ContainerCreating. What are potential reasons and how would you diagnose
them?
ContainerCreating means the node's Kubelet is trying to start the container but is encountering
problems before the container process itself begins.
2. Check Node Status: Is the node healthy (oc get node <node_name>)? Check disk
space (df -h via debug pod), especially /var/lib/containers.
3. Check Image: Although ImagePullBackOff is distinct, sometimes image issues
manifest here. Verify the image exists and can be pulled manually (podman pull ... on
a node or bastion). Check pull secrets.
4. Check Security Context/SCCs: While less common for ContainerCreating, sometimes
restrictive SCCs might prevent actions needed before container start (like setting up
certain volume types). Check the pod's securityContext and allowed SCCs.
218. Pods are failing with ImagePullBackOff errors. List the potential causes and checks you would
perform.
ImagePullBackOff means the Kubelet failed repeatedly to pull the container image.
2. Verify Image Name/Tag: Double-check the image: field in the pod spec/deployment
YAML for typos in the registry, repository name, or tag. Does the specified tag
actually exist in the registry?
3. Check Registry Connectivity: Can the node where the pod is scheduled reach the
image registry?
If pulling from a private registry, does the pod's Service Account reference
the correct image pull secret (oc describe sa <sa_name>)?
Does the secret contain valid, non-expired credentials for the registry (oc get
secret <secret_name> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d)?
6. Check Registry Status: Is the target registry itself operational? Check its status page
if external, or check internal registry components (oc get co image-registry, oc get
pods -n openshift-image-registry) if internal.
7. Check Image Manifest: Sometimes the error indicates issues with the image
manifest itself (e.g., manifest for platform linux/arm64 requested on an amd64
node).
219. You are unable to connect to an RHCOS node using oc debug node. What could be
wrong?
Failure to start a debug session can stem from various issues:
1. Node Not Ready/Unreachable: Is the target node in the Ready state (oc get node
<node_name>)? If NotReady or unreachable from the API server, the debug pod
cannot be scheduled or started. Troubleshoot the node status first.
2. API Server Issues: Is the OpenShift API server responsive (oc cluster-info)? oc debug
needs to communicate with the API.
3. RBAC Permissions: Does your user account have the necessary permissions?
Running oc debug node requires privileges typically granted by the cluster-admin
role or a custom role allowing pod creation with hostPath mounts and privileged
security contexts (often needing the privileged SCC). Use oc auth can-i use scc
privileged and oc auth can-i create pods --subresource=debug -n default to check.
4. Scheduling Failure: The debug pod itself might fail to schedule onto the target node.
Check for pending pods in the default namespace (or specified namespace) on that
node: oc get pods -n default -o wide --field-selector spec.nodeName=<node_name>.
Describe the pending debug pod (oc describe pod <debug_pod_name>) to see why
it failed scheduling (e.g., resource constraints, taints).
220. Applications within the cluster are experiencing DNS resolution failures. How do you
troubleshoot this?
DNS issues can be tricky. Follow these steps:
Verify the nameserver points to the ClusterIP of the dns-default service (oc
get svc dns-default -n openshift-dns -o jsonpath='{.spec.clusterIP}').
Check the search domains – they should include relevant suffixes like
<namespace>.svc.cluster.local, svc.cluster.local, cluster.local.
Try resolving different types of names using dig or getent hosts (install bind-utils
or getent if needed):
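Some example lookups, to be run from inside an application pod (the service and namespace names are placeholders):

```shell
# Short service name (relies on the pod's search domains)
getent hosts my-service

# Fully qualified internal service name
dig +short my-service.my-namespace.svc.cluster.local

# Well-known internal name that should always resolve
dig +short kubernetes.default.svc.cluster.local

# External name, to separate internal failures from upstream DNS failures
dig +short www.redhat.com
```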
4. Check Network Policies: Ensure NetworkPolicies are not blocking DNS traffic
(UDP/TCP port 53) from the application pods to the CoreDNS pods/service IP in the
openshift-dns namespace.
5. Check Node DNS: Use oc debug node/<node_name> and check the node's
/etc/resolv.conf (chroot /host cat /etc/resolv.conf) and test resolution from the node
itself (chroot /host dig ...). Node issues can affect pod DNS.
221. The etcd Cluster Operator becomes DEGRADED, reporting an unhealthy cluster. What
common issues would you investigate?
Etcd is the distributed brain of the cluster; its health is paramount.
1. Describe the Operator: oc describe co etcd. Read the Degraded condition message
carefully. It often indicates quorum loss, slow requests, or member health issues.
2. Check Etcd Pod Logs: oc logs -n openshift-etcd -l k8s-app=etcd. Look for errors
related to peer communication, leader election failures, slow disk writes (wal),
database corruption, or snapshot issues.
oc debug node/<master_node>
chroot /host etcdctl endpoint health --cluster (Requires etcdctl and certs
configured, often easier via oc exec if pods are running).
4. Check Master Node Resources: Are master nodes under high CPU, memory, or disk
I/O pressure (oc adm top nodes, iostat via debug pod)? Etcd is sensitive to resource
starvation.
5. Check Disk Performance: Etcd requires low-latency disk writes. Use tools like fio (via
debug pod) to benchmark disk performance on /var/lib/etcd mount points if slow
writes are suspected based on logs/metrics.
6. Check Network Connectivity: Verify stable, low-latency network connectivity
between all master nodes on the etcd peer ports (usually 2380). Use ping or other
tools between master debug pods. Network partitioning is a common cause of
quorum loss.
7. Check Clock Skew: Ensure time is synchronized accurately across all master nodes
using NTP. Significant clock skew can disrupt etcd. Check chronyc sources via debug
pod.
8. Check Etcd Metrics: Look at Prometheus metrics (see Performance section) for disk
latency, leader changes, etc.
222. Users report the Kubernetes API server is slow or timing out. What areas would you
check?
API server performance issues impact all cluster interactions.
1. Check API Server Operator/Pods:
2. Check Etcd Health: The API server relies heavily on etcd. If etcd is slow or unhealthy
(see Q8), the API server will be impacted. Troubleshoot etcd first if it shows issues.
3. Check Master Node Resources: Are the master nodes hosting the API server pods
overloaded (CPU, Memory)? Use oc adm top nodes.
5. Check Network: Verify connectivity from clients (oc CLI, web console, controllers) to
the API server endpoints (external api.*, internal api-int.*). Check load balancers if
applicable.
6. Identify Problematic Clients: Are specific users, controllers, or applications making
excessive or inefficient API calls? API server audit logs can sometimes help identify
sources of high load, though parsing them can be complex.
223. Traffic is not reaching an application exposed via a Route. How would you
troubleshoot the Ingress path?
Troubleshoot layer by layer from the outside in:
1. DNS Resolution: Does the Route hostname (oc get route <route_name> -o
jsonpath='{.spec.host}') resolve correctly (using dig or nslookup from outside the
cluster) to the public IP address of the OpenShift router/Load Balancer?
2. External Connectivity/Firewall: Can you reach the router's public IP on the correct
port (usually 80/443) from outside? Check external firewalls, security groups, and Load
Balancer health checks. Use curl -v http(s)://<route_host>.
Are the router pods running and ready in openshift-ingress (oc get pods -n
openshift-ingress)?
Check router pod logs for errors related to the specific route or backend
connections (oc logs <router_pod> -n openshift-ingress).
Does the target Service exist (oc get svc <service_name> -n <project>)?
Does the Service have active Endpoints (oc get endpoints
<service_name> -n <project>)? If not, the pods matching the Service
selector are not ready or don't exist.
6. Pod Status: Are the application pods targeted by the Service running, ready, and
passing readiness probes (oc get pods -l <service_selector> -n <project>)?
7. Network Policies: Is there a NetworkPolicy blocking traffic from the openshift-ingress
namespace to the application pods in the target project on the required port?
8. Application Logs: Check the application pod logs (oc logs <app_pod> -n <project>) to
see if requests are reaching the application but failing internally.
224. During a node update managed by the Machine Config Operator, a node gets stuck and
doesn't update. How do you investigate?
Node updates involve cordoning, draining, applying config, and rebooting. Issues can occur at any
stage.
1. Check MCP Status: oc get mcp. Note the status of the pool the node belongs to
(UPDATING, DEGRADED). Compare MACHINECOUNT vs READYMACHINECOUNT vs
UPDATEDMACHINECOUNT.
2. Check MCD Logs: This is crucial. Find the Machine Config Daemon pod on the stuck
node (oc get pods -n openshift-machine-config-operator -o wide --field-selector
spec.nodeName=<node_name>) and check its logs (oc logs <mcd_pod> -n openshift-
machine-config-operator). Look for errors applying the configuration (for example,
failures writing files, rpm-ostree errors, or drain timeouts).
3. Check Node oc describe: oc describe node <node_name>. Look at recent Events for
drain failures, CNI errors, or Kubelet issues.
4. Check Console Access: If possible (e.g., VM console, BMC), check the node's console
during boot/runtime for kernel panics or systemd errors.
5. Check Rendered Config: Did the MCO successfully create the target rendered
MachineConfig (oc get mc <rendered_config>)?
6. Check MCO Logs: Check the Machine Config Operator logs (oc logs
deployment/machine-config-operator -n openshift-machine-config-operator) for
higher-level errors about managing the pool update.
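A stuck node usually shows a mismatch between the machineconfiguration currentConfig and desiredConfig node annotations. The sketch below demonstrates that comparison; `oc` is stubbed with sample rendered-config names (hypothetical values) so it runs anywhere, while the jsonpath queries are the ones you would issue against a live cluster.

```shell
# Compare the MCD's current vs desired rendered config for a node.
# 'oc' is stubbed with illustrative values; remove the stub on a real cluster.
oc() {
  case "$*" in
    *currentConfig*) echo "rendered-worker-aaa111" ;;
    *desiredConfig*) echo "rendered-worker-bbb222" ;;
  esac
}

node=worker-0
current=$(oc get node "$node" -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}')
desired=$(oc get node "$node" -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig}')

if [ "$current" != "$desired" ]; then
  status="stuck: $node still on $current, wants $desired"
else
  status="up to date: $node on $current"
fi
echo "$status"
```

If the two values differ for longer than the expected drain-and-reboot window, the MCD logs from step 2 are the next place to look.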
225. A user cannot create a PVC; it remains Pending. What storage-related issues might be
the cause?
Pending PVCs usually mean the storage provisioner cannot fulfill the request.
1. Check StorageClass: Does the PVC reference a valid StorageClass (oc get sc)? If no
class was specified, is a default StorageClass configured?
2. Check Provisioner Pods: Find the pods for the relevant storage provisioner (e.g., CSI
driver pods in openshift-cluster-csi-drivers, ODF pods in openshift-storage). Check
their logs for errors related to volume creation.
3. Check Underlying Storage: Is the backend storage system (SAN, NAS, Cloud Provider,
Ceph) healthy, and does it have sufficient capacity? Check the storage system's
console/logs.
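The fastest signal for a Pending PVC is usually its Events. This sketch greps `oc describe pvc` output for provisioning failures; the output (PVC name, CSI driver, error message) is stubbed with sample data here so the pipeline is runnable without a cluster.

```shell
# Surface provisioning errors from a Pending PVC's events.
# 'oc' is stubbed with sample 'describe' output for illustration only.
oc() {
  cat <<'EOF'
Name:          data-pvc
Status:        Pending
Events:
  Warning  ProvisioningFailed  2m  ebs.csi.aws.com  failed to provision volume: quota exceeded
EOF
}

errors=$(oc describe pvc data-pvc -n demo | grep -i 'ProvisioningFailed')
echo "$errors"
```

The event names both the provisioner and the reason (here a quota error), which tells you whether to look at the StorageClass, the driver pods, or the backend system next.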
226. You notice gaps in metrics data in Grafana or alerts aren't firing as expected. How do
you troubleshoot the monitoring stack?
Issues in the monitoring pipeline can cause data loss or alert failures.
1. Check Prometheus: Are the prometheus-k8s pods in openshift-monitoring running
and ready? Check their logs and PVC/disk usage, and review target status for failed
scrapes.
2. Check Alertmanager: Are the alertmanager-main pods running and ready? Check
their logs and verify the receiver configuration.
3. Check Network: Is there network connectivity between Prometheus and its scrape
targets? Between Prometheus and Alertmanager? Between Alertmanager and
notification receivers? Check NetworkPolicies.
4. Check Grafana: If dashboards are failing, check the Grafana pods (oc get pods -n
openshift-monitoring -l app.kubernetes.io/name=grafana) and their logs. Check the data
source configuration within the Grafana UI.
227. Application logs are missing from Kibana or the logging stack reports errors. What
steps would you take?
Troubleshoot the logging pipeline (Fluentd -> Elasticsearch -> Kibana).
1. Check Logging Operator: oc get co logging (if using Red Hat OpenShift Logging).
Ensure Available/not Degraded. oc describe co logging.
2. Check Fluentd:
Are Fluentd pods running on all nodes (oc get pods -n openshift-logging -l
component=fluentd)?
Check Fluentd pod logs on nodes where logs are missing. Look for errors
connecting to Elasticsearch, buffer overflows, parsing errors, or permission
issues reading container logs (/var/log/pods/...).
3. Check Elasticsearch:
Check ES cluster health (via curl in an ES pod or oc describe co logging). Look for
red or yellow status.
Check ES pod logs for errors (shard allocation failures, disk watermark issues,
configuration errors).
Check ES PVCs/disk usage (oc get pvc -n openshift-logging ...). Is the cluster
running out of disk space?
4. Check Kibana: Are the Kibana pods running and ready? Check their logs and confirm
the expected index patterns exist.
5. Check Application: Is the application actually generating logs to stdout/stderr? Use
oc logs <app_pod> to confirm.
6. Check Network: Verify network connectivity between the Fluentd pods and the
Elasticsearch service. Check NetworkPolicies.
228. How would you identify which specific pods are causing consistently high resource
usage on a particular node?
Use oc adm top pods with a Node Selector: Filter the top pods output to show only pods running on
the specific node, then sort by the resource of interest (CPU or memory).
List Pods and Check Individually: Get all pods on the node and then check usage individually if
needed.
oc get pods -A -o wide --field-selector spec.nodeName=<node_name> | awk '{if(NR>1) print "-n "$1"
"$2}'
# Then check specific pods if needed (less efficient for finding top consumers)
Use Monitoring Dashboards: Grafana dashboards often have views that allow filtering by node and
sorting pods by resource consumption, providing a visual way to identify top consumers over time.
Look for dashboards related to "Node Details" or "Pod Resources".
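Ranking consumers can also be scripted by sorting `oc adm top pods` output. In this sketch the `oc` output (namespace, pod, CPU, memory columns with sample pod names) is stubbed so the sort logic is demonstrable offline; on a live cluster the stub would be dropped.

```shell
# Find the heaviest CPU consumer from 'oc adm top pods' style output.
# 'oc' is stubbed with sample rows: NAMESPACE POD CPU(cores) MEMORY(bytes).
oc() {
  cat <<'EOF'
ns1   api-7c9f   512m   900Mi
ns2   worker-x1  1200m  300Mi
ns1   cache-z9   80m    2100Mi
EOF
}

# 'sort -n' parses the leading digits of '1200m', so the unit suffix is harmless.
top_cpu=$(oc adm top pods -A --no-headers | sort -k3 -rn | head -1 | awk '{print $2}')
echo "Top CPU consumer: $top_cpu"
```

Sorting on column 4 instead ranks by memory; combining this with the node field-selector shown above narrows the ranking to a single node.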
Infrastructure Resource: Check the cluster-level Infrastructure object. The status.platformStatus.type
field indicates the underlying platform (AWS, vSphere, BareMetal, etc.), but not directly IPI/UPI.
However, IPI installations typically populate more fields under status.platformStatus.
Machine API Resources: The most reliable indicator is the presence and active use of Machine API
resources (Machines, MachineSets). IPI relies heavily on these to manage cluster nodes; UPI
installations can optionally use them but often manage nodes externally.
oc get machinesets -A
oc get machines -A
If multiple MachineSets exist and correspond to your control plane and worker nodes, it's almost
certainly an IPI installation. If these namespaces/resources are mostly empty or absent, it's likely UPI.
install-config.yaml: If available, the original install config clearly defines the platform and implies the
method (IPI usually has more platform-specific automation fields).
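The MachineSet heuristic above reduces to counting rows. A minimal sketch, with `oc` stubbed to return two sample MachineSet rows (names are illustrative); on a real cluster the stub is removed and the query runs against openshift-machine-api.

```shell
# Heuristic: MachineSets present and in use suggests IPI; none suggests UPI.
# 'oc' is stubbed with sample output for demonstration.
oc() {
  cat <<'EOF'
mycluster-worker-us-east-1a   1   1   1   1   45d
mycluster-worker-us-east-1b   1   1   1   1   45d
EOF
}

count=$(oc get machinesets -n openshift-machine-api --no-headers 2>/dev/null | wc -l | tr -d ' ')
if [ "$count" -gt 0 ]; then
  verdict="likely IPI ($count MachineSets found)"
else
  verdict="likely UPI (no MachineSets)"
fi
echo "$verdict"
```

Remember this is only a heuristic: UPI clusters can optionally define MachineSets, so cross-check against install-config.yaml when it is available.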
230. How do you configure cluster-wide HTTP/HTTPS proxy settings for outbound traffic?
Cluster-wide proxy settings are configured using the Proxy cluster object named cluster
(oc edit proxy/cluster).
1. Modify spec: Add or update the following fields within the spec: section:
httpProxy: URL of the HTTP proxy.
httpsProxy: URL of the HTTPS proxy (often the same as the HTTP proxy URL).
noProxy: Comma-separated list of domains, CIDRs, or IPs that should not use
the proxy (e.g., .cluster.local,.svc,.example.com,192.168.1.0/24). It's crucial
to include internal cluster domains (.svc, .cluster.local), API server endpoints,
and any internal registries/services.
2. Save Changes: The Cluster Network Operator watches this object and propagates the
proxy environment variables (HTTP_PROXY, HTTPS_PROXY, NO_PROXY) to relevant
cluster components (like operator pods) and newly created pods (via admission
webhook). Existing pods generally need to be recreated to pick up the new settings.
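The edit above can also be applied as a merge patch. The sketch below builds the payload (proxy URLs and noProxy entries are placeholders for your environment) and validates it as JSON locally before it would ever be sent to a cluster; the actual `oc patch` invocation is shown but left commented out.

```shell
# Hypothetical proxy settings - substitute your own proxy host and noProxy list.
PATCH='{"spec":{"httpProxy":"http://proxy.example.com:3128","httpsProxy":"http://proxy.example.com:3128","noProxy":".cluster.local,.svc,169.254.169.254,10.0.0.0/16"}}'

# Sanity-check the payload locally before touching the cluster:
echo "$PATCH" | python3 -m json.tool > /dev/null && echo "patch is valid JSON"

# On a live cluster (not run here):
#   oc patch proxy/cluster --type=merge -p "$PATCH"
```

Validating the payload first avoids a partially applied spec; a malformed noProxy list is one of the more common causes of cluster-wide pull and operator failures after enabling a proxy.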
231. What is the process for adding a custom CA certificate bundle to be trusted by cluster
components and workloads?
To make cluster components (like operators pulling images) and potentially workloads trust internal
CAs or proxies performing TLS inspection:
1. Prepare CA Bundle: Concatenate all necessary CA certificates (in PEM format) into a
single file (e.g., custom-ca-bundle.crt).
2. Create ConfigMap: Create a ConfigMap containing the bundle in the openshift-config
namespace (e.g., oc create configmap custom-ca --from-file=ca-bundle.crt=custom-ca-bundle.crt
-n openshift-config).
3. Patch Cluster Proxy: If the CA is needed for trusting the configured HTTP/HTTPS
proxy, reference the ConfigMap via spec.trustedCA on the Proxy object.
4. Patch Image Config: If the CA is needed for trusting image registries (internal or
external mirrors), reference it via spec.additionalTrustedCA on the
image.config.openshift.io/cluster resource.
5. Propagation: Cluster operators and node services (like CRI-O) will detect these
changes and update their trust stores. Node updates might involve Machine Config
Operator rollouts. Pods generally need to be recreated to mount the updated trust
bundles (often mounted via the openshift-service-ca.crt ConfigMap, which gets updated).
232. How do you check the status of the Machine API Operator and its associated pods?
Check Cluster Operator:
oc get co machine-api
Check Pods: The operator components run in the openshift-machine-api namespace
(oc get pods -n openshift-machine-api).
233. What is a MachineSet, and how do you list the ones defined in the cluster?
MachineSet: A Machine API resource (similar in concept to a ReplicaSet for Pods) that ensures a specified
number of Machine objects exist for a given configuration. It defines a template for creating new
Machines (specifying instance type, image, availability zone, user data, etc.). If a Machine managed
by a MachineSet is deleted or fails health checks, the MachineSet controller creates a new one to
maintain the desired replica count. MachineSets are primarily used in IPI environments to manage
worker node scaling.
Listing: oc get machinesets -n openshift-machine-api
234. In an IPI environment, how do you scale the number of worker nodes using
MachineSets?
Use the oc scale command, targeting the specific MachineSet you want to adjust in the openshift-
machine-api namespace, e.g., oc scale machineset <machineset_name> --replicas=<count>
-n openshift-machine-api.
235. How can you monitor the provisioning status of new Machines created by a
MachineSet?
1. List Machines: Filter machines potentially owned by the MachineSet (labels often
help, or check ownerReferences) with oc get machines -n openshift-machine-api.
2. Check Machine Phase: The PHASE column in oc get machines shows the status (e.g.,
Provisioning, Provisioned, Running, Deleting, Failed).
3. Describe Machine: Get detailed status and events for a specific machine:
oc describe machine <machine_name> -n openshift-machine-api.
236. How do you find the underlying cloud provider instance ID (e.g., AWS EC2 instance ID,
vSphere VM name) associated with an OpenShift Node object in an IPI cluster?
The spec.providerID field on the Node object usually holds this information
(oc get node <node_name> -o jsonpath='{.spec.providerID}').
237. What is the Node Tuning Operator used for? How do you check its status?
Purpose: The Node Tuning Operator manages the tuned daemon on RHCOS nodes. It allows
administrators to apply custom system-level performance tunings (beyond the defaults) to groups of
nodes based on labels. It uses Tuned Custom Resources to deliver these profiles, which can adjust
kernel parameters, CPU affinities, disk schedulers, etc., often for specific workload requirements (like
low latency or high throughput).
Checking Status:
Operator Pods: oc get pods -n openshift-cluster-node-tuning-operator
Tuned DaemonSet: Check the tuned DaemonSet pods running on each node.
238. How can you list any custom Tuned profiles applied in the cluster?
List the Tuned Custom Resources in the operator's namespace (oc get tuned -n
openshift-cluster-node-tuning-operator). Custom profiles are typically created
by administrators in addition to the default rendered profiles managed by other operators (like the
Performance Addon Operator).
239. If using MetalLB for bare metal LoadBalancer services, how do you check the status of
its components?
MetalLB typically runs components in the metallb-system namespace (oc get pods -n metallb-system).
Look for:
controller Deployment pod: Handles IP address assignment for Services.
speaker DaemonSet pods (one per node): Announce service IPs using
BGP or L2 protocols.
Ensure these pods are Running and Ready. Check their logs for any configuration or announcement
errors.
240. How do you configure the address pools MetalLB assigns service IPs from?
ConfigMap (Older Method): Edit the config ConfigMap in the metallb-system namespace. Define
address-pools within the data.config section.
CRDs (Operator Method - Recommended): If installed via the MetalLB Operator, use Custom
Resources like MetalLB, AddressPool, BGPAdvertisement, L2Advertisement. Create/edit AddressPool
CRs to define the ranges of IPs MetalLB can use.
241. If using the Local Storage Operator, how do you check the status of its pods?
The Local Storage Operator components usually run in the openshift-local-storage namespace
(oc get pods -n openshift-local-storage).
Look for: the operator Deployment pod(s) and the diskmaker DaemonSet pods that discover and
manage local devices on each node.
242. How do you list the LocalVolume resources managed by the Local Storage Operator?
The operator creates LocalVolume Custom Resources representing the discovered storage devices on
nodes that match the operator's configuration (oc get localvolume -n openshift-local-storage).
This shows the discovered volumes, their capacity, node affinity, and
status. These are then used to provision Persistent Volumes with node
affinity.
243. If using OpenShift Data Foundation (ODF), how do you check the status of its core
component pods?
ODF (formerly OpenShift Container Storage/OCS) deploys its components primarily in the openshift-
storage namespace (oc get pods -n openshift-storage).
Ensure key pods (operator, MONs, OSDs, CSI drivers) are Running and Ready.
244. How do you quickly check the health status of the underlying Ceph cluster managed by ODF?
Check CephCluster CR: The CephCluster resource provides a high-level health summary
(oc get cephcluster -n openshift-storage).
Use Ceph Tools Pod: For detailed status, exec into the Rook Ceph tools pod and run ceph status.
This provides detailed health checks, MON/OSD status, pool status, PG (Placement Group) status,
IO activity, etc.
245. How can you check the overall storage capacity and usage within ODF?
Ceph Status: The ceph status command (run via the tools pod as above) shows overall capacity
(SIZE), used space (USED), and available space (AVAIL).
Ceph Block Pools: Check capacity and usage per storage pool (often backing StorageClasses), e.g.,
with ceph df via the tools pod.
ODF Dashboards: The OpenShift Console often includes ODF-specific dashboards (under Storage)
that visualize capacity, usage, performance (IOPS, throughput), and health. Grafana dashboards for
Ceph are also usually available via cluster monitoring.
246. What are two ways to find the URL for the OpenShift web console?
oc whoami --show-console: If logged in via oc, this command directly outputs the console URL.
oc whoami --show-console
oc get route console -n openshift-console: Get the Route object for the console and extract the
hostname.
247. How do you check the status of the OpenShift Console Operator and its pods?
oc get co console
oc get pods -n openshift-console
# Look for 'console-*' pods (main UI) and 'downloads-*' pods (serving CLI tools etc.)
248. How can the appearance (e.g., login page, branding) of the OpenShift Console be
customized?
Customizations are applied by editing the Console cluster resource named cluster.
Refer to the official documentation for the specific fields and ConfigMap structure for logos.
249. What command lists all Custom Resource Definitions (CRDs) installed in the cluster?
Use oc get crd.
oc get crd
# Or use the full name
oc get customresourcedefinitions
This lists all the custom resource types (beyond core Kubernetes types like Pods, Services) that have
been defined in the cluster, often installed by Operators.
250. How can you check if etcd encryption at rest is enabled and what mode is used?
Check the APIServer cluster resource named cluster, e.g.,
oc get apiserver cluster -o jsonpath='{.spec.encryption.type}'.
The output shows the encryption type currently configured. Common values are aescbc and
aesgcm (encryption enabled) and identity (no encryption).
An empty output or absence of the spec.encryption field usually implies identity (no
encryption).
251. Describe the high-level process for rotating etcd encryption keys.
Rotating etcd encryption keys is a sensitive operation performed to enhance security. It involves
generating new keys and migrating existing data to be encrypted with them.
1. Key Generation: The cluster operators generate a new encryption key.
2. API Server Reconfiguration: The operator updates the API server configuration to
use both the old and new keys for decryption but only the new key for encrypting
new data. API servers are rolled out with this new config.
3. Data Migration: The operator initiates a background process in which the API server
reads all resources from etcd, decrypts them (using the old or new key), and rewrites
them encrypted with the new key. This happens gradually.
4. Finalization: Once migration is complete, the operator may automatically (or via
another trigger) update the API server config again to use only the new key,
effectively retiring the old key.
Important: This process requires the cluster to be healthy and should be done during a maintenance
window, following the official documentation precisely.
252. How can you check which instance of a scaled control plane component (like kube-
controller-manager) holds the leader election lease?
Core Kubernetes control plane components use a leader election mechanism (usually based on
Leases or Endpoints) to ensure only one instance is active at a time.
Identify Namespace: Find the namespace where the component runs (e.g., openshift-kube-
controller-manager).
Get Lease/Endpoint: Check for a Lease object (newer Kubernetes versions) or an Endpoints object
(older versions), often named after the component itself, within that namespace.
Inspect Holder Identity: Look for fields like holderIdentity (in Leases) or an annotation like control-
plane.alpha.kubernetes.io/leader (in Endpoints). The value typically contains the hostname or pod
name of the current leader instance.
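Extracting the holderIdentity can be done with a one-line awk. In this sketch the Lease YAML is stubbed (the lease name and holder value are illustrative), so the extraction runs offline; on a cluster the stub is replaced by the real `oc get lease ... -o yaml`.

```shell
# Pull the current leader from a Lease object's YAML.
# 'oc' is stubbed with a sample Lease; names/values are illustrative.
oc() {
  cat <<'EOF'
apiVersion: coordination.k8s.io/v1
kind: Lease
spec:
  holderIdentity: master-1_a1b2c3d4
  leaseDurationSeconds: 15
EOF
}

leader=$(oc get lease kube-controller-manager -n kube-system -o yaml \
  | awk '/holderIdentity:/ {print $2}')
echo "Current leader: $leader"
```

The holderIdentity value typically embeds the hostname of the active instance, which tells you which pod's logs to read when debugging controller behavior.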
253. How would you verify the NTP server configuration being used by an RHCOS node?
Use oc debug node and the chronyc command:
oc debug node/<node_name>
chroot /host chronyc sources -v
This queries the chronyd daemon running on the node. The output lists the configured
NTP sources (servers), their status (e.g., ^* indicates the current sync source), stratum, poll interval,
and offset/jitter details.
254. How do you check if the chronyd service is running and synchronized on a node?
Use oc debug node and systemctl / chronyc:
chroot /host systemctl status chronyd
chroot /host chronyc tracking
Look at Reference ID (should point to the sync source server), Stratum (should be reasonable, e.g., 2,
3, or 4), Last offset (should be small, close to zero), and Leap status (should be Normal).
255. How can you inspect the effective Kubelet configuration arguments being used on a
node?
The Kubelet configuration comes from multiple sources (files, MachineConfigs).
1. Check Kubelet Config File: The primary config file is often referenced by the systemd
unit.
oc debug node/<node_name>
chroot /host cat /etc/kubernetes/kubelet.conf
2. Check MachineConfig: Find the rendered MachineConfig applied to the node's pool
(oc get node <node_name> -o
jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfi
g}'). Then get the MachineConfig YAML (oc get mc <rendered_mc_name> -o yaml)
and look for the Kubelet configuration snippet within ignition.config.systemd.units or
related sections.
3. Check Running Process (less reliable): chroot /host ps aux | grep kubelet might
show some command-line arguments, but many settings are loaded from files.
256. How can you inspect the effective CRI-O configuration settings on a node?
CRI-O settings are primarily defined in configuration files on the node.
oc debug node/<node_name>
chroot /host cat /etc/crio/crio.conf
chroot /host ls /etc/crio/crio.conf.d/
Drop-in files under /etc/crio/crio.conf.d/ override values from the main crio.conf.
257. How do you check the configured maximum number of pods allowed to run on a
specific node?
1. Node Status: The node object reports its capacity
(oc get node <node_name> -o jsonpath='{.status.capacity.pods}').
2. Kubelet Configuration: The ultimate source is the Kubelet's --max-pods setting.
Check the Kubelet config file or effective arguments (see the Kubelet configuration
question above). If not explicitly set, Kubernetes calculates a default based on
resources or uses a platform default (often 110 or 250).
Look for the main operator deployment pod(s) and potentially pods related to specific scans or
remediations (e.g., ocp4-cis-scanner-*).
260. How do you list the results of compliance scans run by the Compliance Operator?
The operator uses several CRDs to manage scans and results, such as ComplianceSuite,
ComplianceScan, and ComplianceCheckResult (e.g., oc get compliancecheckresults
-n openshift-compliance for individual pass/fail results).
261. If using the File Integrity Operator, how do you check the status of its pods?
The File Integrity Operator usually runs in the openshift-file-integrity namespace.
Look for the operator deployment pod(s) and the aide-daemon-* DaemonSet pods (one per node),
which perform the integrity checks using AIDE (Advanced Intrusion Detection Environment).
262. How do you view the results of file integrity checks performed on nodes?
The operator stores results in the FileIntegrity Custom Resource, typically one per node pool.
Inspect the status field of the relevant FileIntegrity object. It shows the
overall status (Phase: Pending, Active, Re-initializing, Failed) and detailed
results, including counts of added/removed/changed files detected during
the last scan compared to the baseline database.
263. Describe a method to test network latency between two cluster nodes.
Use ping from within debug pods running on the source and target nodes.
1. Start a debug pod on each node (oc debug node/<node1_name>, oc debug node/<node2_name>).
2. Find the internal IP of node 2 (oc get node
<node2_name> -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}').
3. From the node 1 debug pod, run: ping <node2_internal_ip>
Observe the round-trip time (RTT) values. Consistent, low RTT (e.g., <1-2 ms within the same DC/AZ) is
expected. High or variable latency indicates network issues.
264. Describe a method to test network bandwidth between two cluster nodes.
Use the iperf3 tool within debug pods.
Start debug pods on both nodes, ensuring the image contains iperf3 (e.g., a custom image, or
potentially registry.redhat.io/rhel8/support-tools). Run iperf3 -s on one node and
iperf3 -c <server_ip> on the other.
The client reports the measured bandwidth between the two nodes. Run it multiple times for
consistency.
265. How do you inspect the certificate currently being used by the default Ingress
Controller?
The default Ingress Controller uses a certificate stored in a secret, typically named router-certs,
within the openshift-ingress namespace.
The certificate data is in .data."tls.crt", base64 encoded. Decode it and pipe it to
openssl to view the details.
This shows the Issuer, Subject (Common Name, SANs), Validity period (Not Before, Not After), etc.
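The decode-and-inspect pipeline can run anywhere. In this sketch a throwaway self-signed certificate (with an illustrative wildcard CN) stands in for the router's tls.crt, and its base64 encoding simulates how the secret stores it; on a cluster the base64 data would come from `oc get secret ... -o jsonpath='{.data.tls\.crt}'` instead.

```shell
# Generate a stand-in cert, base64-encode it like a secret would, then decode
# and inspect it with openssl - the same pipeline used against router-certs.
tmp=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=*.apps.example.com" \
  -keyout "$tmp/tls.key" -out "$tmp/tls.crt" 2>/dev/null

b64=$(base64 < "$tmp/tls.crt" | tr -d '\n')   # simulate secret storage
subject=$(echo "$b64" | base64 -d | openssl x509 -noout -subject)
enddate=$(echo "$b64" | base64 -d | openssl x509 -noout -enddate)
echo "$subject"
echo "$enddate"
rm -rf "$tmp"
```

Swapping `-subject`/`-enddate` for `-text` prints the full certificate, including the SAN list, which is what you need to confirm the wildcard apps domain is covered.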
266. What is the general process for replacing the default Ingress certificate with a custom
one?
1. Prepare Custom Certificate: Obtain your custom certificate and private key files
(PEM format). Ensure the certificate covers the necessary wildcard domain
(*.apps.<cluster_name>.<base_domain>) and potentially other specific hostnames.
Include any necessary intermediate CA certificates in the certificate file (server cert
first, then intermediates).
2. Create/Update Secret: Create a new TLS secret in the openshift-ingress namespace
containing your custom certificate and key.
# Or 'oc replace secret tls router-certs ...' if overwriting the default (less common)
3. Reference the Secret: Update the default IngressController to point its
spec.defaultCertificate at the new secret.
4. Rollout: The Ingress Operator will detect the change and roll out updates to the
router pods, which will start using the new certificate. Monitor the router pods (oc get
pods -n openshift-ingress -w).
267. How do you inspect the certificate authority used for signing the API server's serving
certificate?
The API server's serving certificate is typically signed by an internal CA managed by the
cluster. The CA certificate is often stored in secrets within operator namespaces. A common one to
check is the kube-apiserver's client CA, used for aggregation:
# Check the CA that signs the serving cert itself (often managed internally)
# Example: Check the secret referenced by the Kube API Server operator status
The exact secret name might vary slightly depending on the OCP version and configuration.
268. How are internal certificates for services typically managed in OpenShift 4, and how
could you check their validity?
Management: Internal service certificates (used for secure communication between pods within the
cluster) are primarily managed automatically by the Service CA Operator. When a Service is
annotated with service.beta.openshift.io/serving-cert-secret-name: <secret_name>, this operator
automatically generates a TLS certificate and key, signed by a cluster-internal CA, and stores them in
the specified secret <secret_name> within the service's namespace. Applications mount this secret
to use the certificate. The operator also handles automatic rotation of these certificates before they
expire.
Checking Validity:
1. Identify the secret name from the Service annotation (oc get svc
<service_name> -o yaml).
2. Get the secret from the service's namespace (oc get secret <secret_name> -
n <namespace> -o yaml).
3. Decode the certificate (.data."tls.crt") and check its validity period using
openssl:
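Step 3 can use `openssl x509 -checkend` to turn "is this cert about to expire?" into a yes/no exit code. Here a throwaway self-signed certificate (valid 90 days, illustrative service CN) stands in for the decoded tls.crt so the check is runnable offline.

```shell
# Check whether a serving cert expires within 30 days using -checkend.
# A generated stand-in cert replaces the real decoded tls.crt here.
tmp=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 90 \
  -subj "/CN=myservice.demo.svc" \
  -keyout "$tmp/tls.key" -out "$tmp/tls.crt" 2>/dev/null

# -checkend takes seconds: 30 days = 30*24*3600
if openssl x509 -in "$tmp/tls.crt" -noout -checkend $((30*24*3600)); then
  verdict="certificate valid for at least 30 more days"
else
  verdict="certificate expires within 30 days - rotation due"
fi
echo "$verdict"
rm -rf "$tmp"
```

Although the Service CA Operator rotates these certificates automatically, the same check is useful for certificates that were supplied manually or mirrored into other namespaces.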
269. How would you find the process ID (PID) of the kubelet process running on a node?
Use oc debug node and standard Linux process tools:
oc debug node/<node_name>
Inside the debug pod:
chroot /host pgrep -o kubelet # '-o' shows the oldest/original process if multiple match
# Or more detailed:
chroot /host ps aux | grep '/usr/bin/kubelet'
270. How would you find the process ID (PID) of the main crio process running on a node?
Use oc debug node and standard Linux process tools:
oc debug node/<node_name>
Inside the debug pod:
chroot /host pgrep -o crio # '-o' shows the oldest/original process
# Or more detailed:
chroot /host ps aux | grep '/usr/bin/crio'
271. How do you check the current SELinux enforcement mode (Enforcing, Permissive,
Disabled) on an RHCOS node?
Use oc debug node and SELinux tools:
oc debug node/<node_name>
Inside the debug pod:
chroot /host getenforce
# Or for more detail:
chroot /host sestatus
OpenShift nodes must run in Enforcing mode for proper operation and security.
272. How can you view the active firewall rules (iptables or nftables) on an RHCOS node?
Use oc debug node and the appropriate firewall command:
oc debug node/<node_name>
Inside the debug pod:
chroot /host nft list ruleset
# Or, on iptables-based setups:
chroot /host iptables-save
These commands display the complex rules managed by components like kube-proxy and the CNI
plugin to handle pod/service networking.
273. How do you display the IP routing table configured on an RHCOS node?
Use oc debug node and the ip command:
oc debug node/<node_name>
Inside the debug pod:
chroot /host ip route show
# Or 'ip r' for short
This shows how the node routes traffic to different destinations, including default gateways, pod
networks, and service networks.
274. How do you view the kernel ring buffer messages (dmesg) on an RHCOS node?
Use oc debug node and the dmesg command:
oc debug node/<node_name>
Inside the debug pod:
chroot /host dmesg -T
# '-T' adds human-readable timestamps
This is useful for diagnosing low-level hardware, driver, or kernel-related issues.
275. How might you check CPU affinity settings if performance tuning has been applied?
CPU affinity restricts processes to specific CPU cores.
1. Check Pod Spec: Some high-performance pods might have CPU manager policies set
(static) and request specific exclusive CPUs (resources.limits.cpu matching
resources.requests.cpu).
2. Check Process Affinity: Use taskset (if available in the container image):
taskset -cp <PID> shows the current CPU affinity mask for the process.
276. How do you check if Transparent Huge Pages (THP) are enabled or disabled on a
node?
Use oc debug node and check sysfs entries:
oc debug node/<node_name>
Inside the debug pod:
# Check if THP is enabled (always, madvise, never)
chroot /host cat /sys/kernel/mm/transparent_hugepage/enabled
# Check if background defragmentation for THP is enabled
chroot /host cat /sys/kernel/mm/transparent_hugepage/defrag
277. How can you determine the I/O scheduler being used for a specific block device on a
node?
Use oc debug node and check sysfs:
oc debug node/<node_name>
Identify the block device name (e.g., sda, nvme0n1) using chroot /host lsblk, then:
chroot /host cat /sys/block/<device>/queue/scheduler
The output shows the available schedulers, with the active one enclosed in square brackets (e.g.,
[mq-deadline] kyber bfq none). Common options include mq-deadline, bfq, kyber, and none.
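Extracting just the bracketed active scheduler from that sysfs output is a one-line sed. A temporary file stands in for /sys/block/<dev>/queue/scheduler here so the parsing is runnable anywhere.

```shell
# Parse the active scheduler (the bracketed entry) from sysfs-style output.
# A temp file simulates /sys/block/<device>/queue/scheduler.
f=$(mktemp)
echo '[mq-deadline] kyber bfq none' > "$f"

active=$(sed -n 's/.*\[\(.*\)\].*/\1/p' "$f")
echo "Active scheduler: $active"
rm -f "$f"
```

Looping this over every device listed by lsblk gives a quick per-disk scheduler inventory when validating a tuning profile.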
278. How do you check the disk space usage of the systemd journal on a node?
Use oc debug node and journalctl:
oc debug node/<node_name>
chroot /host journalctl --disk-usage
This reports the current disk space occupied by archived and active journal files. Configuration in
/etc/systemd/journald.conf (e.g., SystemMaxUse=) controls the size limits.
279. How do you check the status of the kube-proxy pods running on the cluster nodes?
kube-proxy runs as a DaemonSet managed by the cluster-network-operator in the openshift-kube-
proxy namespace.
Ensure a pod is running on each relevant node and is in the Running state with ready containers.
Check the logs (oc logs <kube-proxy-pod> -n openshift-kube-proxy) if issues are suspected (e.g., errors
applying firewall rules).
280. How do you check the status of the dns-operator and its pods?
Operator Status: oc get co dns, oc describe co dns
Operator Pods: oc get pods -n openshift-dns-operator
The dns-operator manages the CoreDNS deployment (dns-default) in the openshift-dns namespace.
281. How do you check the status of the authentication operator and its pods?
Operator Status: oc get co authentication, oc describe co authentication
Operator Pods: oc get pods -n openshift-authentication-operator
This operator manages authentication components like the internal OAuth server and the OAuth API
server.
283. How do you check the status of the internal oauth-openshift server pods?
This is the built-in OAuth server that handles token issuance and interaction with configured Identity
Providers (oc get pods -n openshift-authentication).
Look for pods named oauth-openshift-*. Check the deployment status and pod readiness/logs.
284. How do you check the status of the etcd operator and its pods?
Operator Pods: oc get pods -n openshift-etcd-operator
This operator manages the lifecycle (deployment, backups, scaling) of the etcd cluster itself, whose
pods run in openshift-etcd.
285. How do you check the status of the kube-storage-version-migrator operator and
pods?
This operator handles the migration of stored Kubernetes objects when their storage version changes
between Kubernetes releases.
Operator Pods: oc get pods -n openshift-kube-storage-version-migrator-operator
286. What command lists all MachineConfig objects (base and rendered)?
Use oc get machineconfig or its short name oc get mc.
oc get mc
This lists all MachineConfigs, including:
Base configs (e.g., 00-worker, 01-master-kubelet).
Custom configs created by administrators.
Rendered configs applied to pools (e.g., rendered-worker-<hash>).
288. How do you list all PodDisruptionBudgets (PDBs) configured across all projects?
Use oc get poddisruptionbudgets --all-namespaces or the short name oc get pdb -A.
oc get pdb -A
This lists all PDBs defined cluster-wide, showing the minimum available/maximum unavailable pods
allowed for the associated application during voluntary disruptions.
289. How can you determine if a PDB is currently preventing pods from being evicted (e.g.,
during a node drain)?
Use oc describe pdb <pdb_name> -n <project_name>.
Look at the Status: section, specifically the Allowed Disruptions field. If this value is 0, evicting
another pod covered by this PDB would violate the budget (minAvailable or
maxUnavailable), and therefore voluntary evictions (like those during a node drain) for these pods
are currently blocked. The drain process will wait until Allowed Disruptions becomes greater than 0.
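That check can be scripted ahead of a planned drain. In this sketch the `oc describe pdb` output is stubbed (PDB name and counts are illustrative) so the parsing runs offline; against a live cluster the stub goes away.

```shell
# Decide whether a drain would be blocked by a PDB's Allowed disruptions.
# 'oc' is stubbed with sample 'describe pdb' output for illustration.
oc() {
  cat <<'EOF'
Name:           myapp-pdb
Min available:  2
Status:
    Allowed disruptions:  0
    Current:              2
    Desired:              2
EOF
}

allowed=$(oc describe pdb myapp-pdb -n demo \
  | awk -F': *' '/Allowed disruptions/ {gsub(/ /,"",$2); print $2}')
if [ "$allowed" -eq 0 ]; then
  msg="drain blocked: evicting any covered pod would violate the PDB"
else
  msg="drain can proceed ($allowed disruptions allowed)"
fi
echo "$msg"
```

Running this across all PDBs in a namespace before maintenance flags the applications that need to be scaled up (or whose PDBs need adjusting) for the drain to complete.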
290. How would you get a rough estimate of the total CPU and memory resources
requested by all pods currently running in the cluster?
There isn't a single built-in oc command for this exact sum. Methods include:
Monitoring Dashboards: Grafana dashboards often have panels summarizing total cluster resource
requests and limits based on Prometheus metrics scraped from kube-state-metrics. This is usually
the easiest way. Look for cluster overview or capacity planning dashboards.
Scripting oc get pods: You can write a script that iterates through all pods in all namespaces, extracts
their container resource requests (spec.containers[*].resources.requests), and sums them up.
oc describe nodes: Summing the Allocated resources across all nodes (oc describe node <node>
shows allocated requests per node) gives an approximation, though it might temporarily include
terminated pod resources.
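The scripting approach reduces to unit normalization plus a sum. In this sketch the jsonpath expression is elided and `oc` is stubbed to emit one "cpu memory" request pair per container (sample values), so only the awk summation, converting cores to millicores and Gi to Mi, is demonstrated.

```shell
# Sum container requests in millicores and Mi.
# 'oc' is stubbed; the real jsonpath to emit cpu/memory pairs is elided here.
oc() {
  cat <<'EOF'
250m 512Mi
1 1Gi
100m 128Mi
EOF
}

summary=$(oc get pods -A -o jsonpath='...' | awk '
  {
    cpu = $1; mem = $2
    # normalize CPU: "250m" -> 250, "1" (core) -> 1000
    if (cpu ~ /m$/) { sub(/m$/, "", cpu); total_cpu += cpu } else { total_cpu += cpu * 1000 }
    # normalize memory: "1Gi" -> 1024, "512Mi" -> 512
    if (mem ~ /Gi$/) { sub(/Gi$/, "", mem); total_mem += mem * 1024 }
    else             { sub(/Mi$/, "", mem); total_mem += mem }
  }
  END { printf "requests: %dm CPU, %dMi memory\n", total_cpu, total_mem }')
echo "$summary"
```

A full version would also need to handle Ki/Ti and plain-byte quantities, which is why the kube-state-metrics dashboards are usually the easier route for an accurate total.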
291. What does the Service CA Operator do when a Service is annotated with
service.beta.openshift.io/serving-cert-secret-name?
1. The operator generates a TLS key and certificate for the service.
2. The certificate's Common Name (CN) is typically set to the service's internal DNS
name (<service_name>.<namespace>.svc).
3. The certificate is signed by a cluster-internal Certificate Authority (the "Service CA").
4. It stores the key (tls.key), certificate (tls.crt), and the CA certificate (ca.crt) in the
specified secret (<secret_name>) within the service's namespace.
This allows pods to easily mount these secrets and establish secure TLS communication with other
internal services, trusting the Service CA.
292. How does the Machine Config Operator (MCO) apply changes to nodes? Describe the
flow.
The MCO orchestrates node configuration updates using MachineConfigs:
1. Detection: The Machine Config Controller (part of the MCO) watches for changes to
MachineConfig objects.
2. Rendering: The Controller combines all MachineConfigs that apply to a pool into a
new rendered configuration (rendered-<pool>-<hash>).
3. Pool Update: The Controller updates the MachineConfigPool object for that pool,
pointing its spec.configuration.name to the new rendered config.
4. MCD Notification: The Machine Config Daemon (MCD) running on each node within
the pool watches its corresponding MachineConfigPool object and sees that the desired
configuration has changed.
5. Node Cordon & Drain: The MCO (often via the MCD coordinating) selects a node to
update (respecting maxUnavailable). It cordons the node (oc adm cordon) and then
drains it (oc adm drain), evicting pods gracefully (respecting PDBs).
6. Apply Config: Once drained, the MCD on the node applies the changes defined in
the new rendered MachineConfig (e.g., writes files, potentially runs rpm-ostree
commands for RHCOS updates).
7. Reboot: If the changes require it (e.g., kernel update, OS update), the MCD triggers a
node reboot.
8. Uncordon & Verify: After the node reboots and the Kubelet reports Ready, the MCD
verifies the update and the MCO uncordons the node (oc adm uncordon), making it
available for scheduling again.
9. Repeat: The process repeats for the next node in the pool until all nodes are updated
to the new rendered config.
293. What is the purpose of the oc adm must-gather command and when would you use
it?
Purpose: oc adm must-gather is a diagnostic tool designed to collect a comprehensive snapshot of
cluster state, configuration, and logs. It gathers information from various sources (Cluster Operators,
nodes, resource definitions, events) relevant to troubleshooting complex cluster issues.
Typical use: Engaging with Red Hat Support for a cluster problem. Support engineers will
often request must-gather output for analysis.
It packages the collected data into a compressed archive, making it easier to share for offline
analysis.
Scaling a MachineSet (IPI Clusters): This controls the number of Nodes (physical or virtual machines)
belonging to a specific pool (e.g., worker nodes in a particular availability zone). oc scale machineset
my-cluster-worker-us-east-1a --replicas=3 -n openshift-machine-api tells the Machine API Operator
(and underlying cloud provider) to ensure 3 actual machine instances matching the MachineSet's
template exist. It manages the cluster's infrastructure capacity itself. Scaling a MachineSet adds or
removes nodes from the cluster.
Operator Reconciliation: Operators continuously watch the resources they
manage and try to reconcile their state back to a desired configuration
defined by the operator logic or its Custom Resource.
296. How can you identify which nodes belong to the 'master' pool vs. a 'worker' pool?
Nodes have labels indicating their role.
Check Node Labels: Use oc get nodes --show-labels. Look for labels like
node-role.kubernetes.io/master= and node-role.kubernetes.io/worker=.
Filter by Label:
# List worker nodes (that aren't also masters, if masters have worker role)
# Or simply list all workers if masters don't have the worker role label
Check MachineConfigPools: The default MCPs are usually named master and worker. You can list
nodes associated with an MCP label:
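The label filters described above can be sketched as follows; the labels shown are the standard node-role labels, which also drive default MachineConfigPool membership.

```shell
# Control-plane nodes
oc get nodes -l node-role.kubernetes.io/master

# Worker nodes
oc get nodes -l node-role.kubernetes.io/worker

# Workers that are not also masters (relevant on compact clusters
# where control-plane nodes carry the worker role too)
oc get nodes -l 'node-role.kubernetes.io/worker,!node-role.kubernetes.io/master'
```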
297. What considerations are important when choosing a Persistent Volume Reclaim
Policy?
The persistentVolumeReclaimPolicy field in a PersistentVolume (PV) or StorageClass determines what
happens to the underlying storage volume when the corresponding PVC is deleted. Key
considerations:
Delete:
Pros: Automatically cleans up the underlying storage volume when the PVC
is deleted. Prevents orphaned volumes and associated costs. Simple
workflow for dynamically provisioned volumes where data persistence
beyond the PVC lifecycle isn't needed.
Cons: Data is permanently lost if a PVC is deleted accidentally.
Retain:
Pros: Protects against accidental data loss via PVC deletion. The
underlying storage volume persists even after the PVC is gone.
Allows data recovery or re-attachment to a new PV/PVC later.
Suitable for critical data.
Cons: Requires manual cleanup; released PVs and their backing
volumes accumulate (and keep incurring cost) until an administrator
reclaims them.
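A common day-to-day task is flipping an existing dynamically provisioned PV from Delete to Retain before deleting its PVC; a sketch, with my-pv as a placeholder name:

```shell
# Inspect the current reclaim policy of a PV
oc get pv my-pv -o jsonpath='{.spec.persistentVolumeReclaimPolicy}'

# Change it to Retain so the backing volume survives PVC deletion
oc patch pv my-pv -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
```

Note that after the PVC is deleted the PV enters the Released phase and must be manually cleaned and made Available (or deleted) before it can be reused.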
298. What are the risks of misconfigured Network Policies?
Blocking Legitimate Traffic: Overly restrictive ingress or egress rules can block
necessary communication between application tiers (e.g., frontend to backend),
connections to databases, access to cluster services (like DNS, API server,
monitoring), or outbound connections to external services. This leads to
application malfunction or complete failure.
Allowing Unintended Traffic: Overly permissive rules (or the absence of policies,
resulting in default-allow) can negate security segmentation. A compromised
pod could potentially access sensitive services or data in other pods/namespaces
that it shouldn't be able to reach, increasing the blast radius of a security breach.
DNS Failures: Incorrectly configured policies might block pods from reaching
CoreDNS (port 53 UDP/TCP) in the openshift-dns namespace, causing application
failures due to inability to resolve service names or external hosts.
Troubleshooting Difficulty: Debugging connectivity issues caused by complex or
incorrect Network Policies can be challenging, requiring careful examination of
selectors and rules across multiple policies.
Operator/Platform Issues: Blocking traffic needed by OpenShift operators or
platform components can lead to operator degradation or cluster instability.
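The DNS-failure risk above is the most common one in practice. A sketch of an egress rule that keeps DNS working in a locked-down namespace (the namespace myapp is a placeholder; note that OpenShift's DNS pods listen on port 5353, which the dns-default Service maps from port 53):

```shell
oc apply -n myapp -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
spec:
  podSelector: {}            # applies to all pods in the namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: openshift-dns
    ports:
    - protocol: UDP
      port: 5353
    - protocol: TCP
      port: 5353
EOF
```

Without a rule like this, any default-deny egress policy in the namespace silently breaks name resolution for every pod it selects.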
299. Why is running containers as root discouraged, and how do SCCs help enforce this?
Why Discouraged: Running container processes as the root user (UID 0) poses significant security
risks:
Principle of Least Privilege: Applications rarely need full root privileges to
function. Running as root violates the principle of granting only the minimum
necessary permissions.
Filesystem Permissions: Root processes can modify any file within the
container's writable layers, potentially damaging the container image or other
processes.
How SCCs Help: Security Context Constraints (SCCs) enforce restrictions on pods and containers,
including user ID control:
Default SCCs: OpenShift applies restrictive default SCCs (like restricted-v2) to standard users, which
typically enforce MustRunAsNonRoot or MustRunAsRange, preventing root execution unless
explicitly granted access to a more permissive SCC (like anyuid or privileged).
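Two related day-to-day commands, as a sketch: checking which SCC admitted a running pod, and (only when a workload genuinely needs it) granting a more permissive SCC to a service account. Pod, service account, and namespace names are placeholders.

```shell
# See which SCC was used to admit a pod (recorded in an annotation)
oc get pod my-pod -o jsonpath='{.metadata.annotations.openshift\.io/scc}'

# Grant the anyuid SCC to a specific service account -- use sparingly,
# and prefer fixing the image to run as non-root instead
oc adm policy add-scc-to-user anyuid -z my-sa -n my-ns
```

Granting SCCs to individual service accounts (rather than to groups or all users) keeps the blast radius of the exception as small as possible.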
300. What are some key differences in managing an OpenShift 4 cluster compared to
managing a standard Kubernetes cluster?
While OpenShift is built on Kubernetes, it adds layers of opinionation, automation, and integrated
components, leading to management differences:
Security Context Constraints (SCCs): OpenShift's SCCs provide a more granular
and restrictive security model by default compared to Kubernetes' Pod Security
Policies (deprecated) or Pod Security Admission (newer).
Machine API: IPI installations use the Machine API for declarative node
management, abstracting underlying infrastructure provisioning.
oc vs kubectl: While kubectl works, the oc CLI includes additional OpenShift-
specific commands for managing Routes, Builds, Projects, ImageStreams, oc adm
tasks, etc.
Update Process: Cluster updates are managed centrally via the Cluster Version
Operator (CVO) and update channels, providing a more automated and
controlled upgrade experience for the entire platform stack.
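A few concrete examples of oc commands with no direct kubectl equivalent (project and namespace names are placeholders; all require a live OpenShift cluster):

```shell
oc new-project demo          # create a project (namespace plus OpenShift annotations)
oc get routes -n demo        # Routes are an OpenShift-specific resource
oc whoami --show-console     # print the web console URL
oc adm upgrade               # show CVO update status and available versions
```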
Get one-to-one assistance for OpenShift hands-on labs (50 labs).
WhatsApp Dhinesh +91 9444410227 and get started today!