---
name: kubernetes
description: |
Comprehensive Kubernetes and OpenShift cluster management skill covering operations, troubleshooting, manifest generation, security, and GitOps. Use this skill when:
(1) Cluster operations: upgrades, backups, node management, scaling, monitoring setup
(2) Troubleshooting: pod failures, networking issues, storage problems, performance analysis
(3) Creating manifests: Deployments, StatefulSets, Services, Ingress, NetworkPolicies, RBAC
(4) Security: audits, Pod Security Standards, RBAC, secrets management, vulnerability scanning
(5) GitOps: ArgoCD, Flux, Kustomize, Helm, CI/CD pipelines, progressive delivery
(6) OpenShift-specific: SCCs, Routes, Operators, Builds, ImageStreams
(7) Multi-cloud: AKS, EKS, GKE, ARO, ROSA operations
metadata:
author: cluster-skills
version: "1.0.0"
---
# Kubernetes & OpenShift Cluster Management
Comprehensive skill for Kubernetes and OpenShift clusters covering operations, troubleshooting, manifests, security, and GitOps.
## Current Versions (January 2026)
| Platform | Version | Documentation |
|----------|---------|---------------|
| **Kubernetes** | 1.31.x | https://kubernetes.io/docs/ |
| **OpenShift** | 4.17.x | https://docs.openshift.com/ |
| **EKS** | 1.31 | https://docs.aws.amazon.com/eks/ |
| **AKS** | 1.31 | https://learn.microsoft.com/azure/aks/ |
| **GKE** | 1.31 | https://cloud.google.com/kubernetes-engine/docs |
### Key Tools
| Tool | Version | Purpose |
|------|---------|---------|
| **ArgoCD** | v2.13.x | GitOps deployments |
| **Flux** | v2.4.x | GitOps toolkit |
| **Kustomize** | v5.5.x | Manifest customization |
| **Helm** | v3.16.x | Package management |
| **Velero** | 1.15.x | Backup/restore |
| **Trivy** | 0.58.x | Security scanning |
| **Kyverno** | 1.13.x | Policy engine |
## Command Convention
**IMPORTANT**: Use `kubectl` for standard Kubernetes. Use `oc` for OpenShift/ARO.
---
## 1. CLUSTER OPERATIONS
### Node Management
```bash
# View nodes
kubectl get nodes -o wide
# Drain node for maintenance
kubectl drain ${NODE} --ignore-daemonsets --delete-emptydir-data --grace-period=60
# Uncordon after maintenance
kubectl uncordon ${NODE}
# View node resources
kubectl top nodes
```
### Cluster Upgrades
**AKS:**
```bash
az aks get-upgrades -g ${RG} -n ${CLUSTER} -o table
az aks upgrade -g ${RG} -n ${CLUSTER} --kubernetes-version ${VERSION}
```
**EKS:**
```bash
aws eks update-cluster-version --name ${CLUSTER} --kubernetes-version ${VERSION}
```
**GKE:**
```bash
gcloud container clusters upgrade ${CLUSTER} --master --cluster-version ${VERSION}
```
**OpenShift:**
```bash
oc adm upgrade --to=${VERSION}
oc get clusterversion
```
### Backup with Velero
```bash
# Install Velero
velero install --provider ${PROVIDER} --bucket ${BUCKET} --secret-file ${CREDS}
# Create backup
velero backup create ${BACKUP_NAME} --include-namespaces ${NS}
# Restore
velero restore create --from-backup ${BACKUP_NAME}
```
---
## 2. TROUBLESHOOTING
### Health Assessment
Run the bundled script for comprehensive health check:
```bash
bash scripts/cluster-health-check.sh
```
### Pod Status Interpretation
| Status | Meaning | Action |
|--------|---------|--------|
| `Pending` | Scheduling issue | Check resources, nodeSelector, tolerations |
| `CrashLoopBackOff` | Container crashing | Check logs: `kubectl logs ${POD} --previous` |
| `ImagePullBackOff` | Image unavailable | Verify image name, registry access |
| `OOMKilled` | Out of memory | Increase memory limits |
| `Evicted` | Node pressure | Check node resources |
### Debugging Commands
```bash
# Pod logs (current and previous)
kubectl logs ${POD} -c ${CONTAINER} --previous
# Multi-pod logs with stern
stern ${LABEL_SELECTOR} -n ${NS}
# Exec into pod
kubectl exec -it ${POD} -- /bin/sh
# Pod events
kubectl describe pod ${POD} | grep -A 20 Events
# Cluster events (sorted by time)
kubectl get events -A --sort-by='.lastTimestamp' | tail -50
```
### Network Troubleshooting
```bash
# Test DNS
kubectl run -it --rm debug --image=busybox -- nslookup kubernetes.default
# Test service connectivity
kubectl run -it --rm debug --image=curlimages/curl -- curl -v http://${SVC}.${NS}:${PORT}
# Check endpoints
kubectl get endpoints ${SVC}
```
---
## 3. MANIFEST GENERATION
### Production Deployment Template
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: ${APP_NAME}
namespace: ${NAMESPACE}
labels:
app.kubernetes.io/name: ${APP_NAME}
app.kubernetes.io/version: "${VERSION}"
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
selector:
matchLabels:
app.kubernetes.io/name: ${APP_NAME}
template:
metadata:
labels:
app.kubernetes.io/name: ${APP_NAME}
spec:
serviceAccountName: ${APP_NAME}
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
seccompProfile:
type: RuntimeDefault
containers:
- name: ${APP_NAME}
image: ${IMAGE}:${TAG}
ports:
- name: http
containerPort: 8080
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop: ["ALL"]
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
livenessProbe:
httpGet:
path: /healthz
port: http
initialDelaySeconds: 10
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: http
initialDelaySeconds: 5
periodSeconds: 5
volumeMounts:
- name: tmp
mountPath: /tmp
volumes:
- name: tmp
emptyDir: {}
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app.kubernetes.io/name: ${APP_NAME}
topologyKey: kubernetes.io/hostname
```
### Service & Ingress
```yaml
apiVersion: v1
kind: Service
metadata:
name: ${APP_NAME}
spec:
selector:
app.kubernetes.io/name: ${APP_NAME}
ports:
- name: http
port: 80
targetPort: http
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: ${APP_NAME}
annotations:
nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
ingressClassName: nginx
tls:
- hosts:
- ${HOST}
secretName: ${APP_NAME}-tls
rules:
- host: ${HOST}
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: ${APP_NAME}
port:
name: http
```
### OpenShift Route
```yaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
name: ${APP_NAME}
spec:
to:
kind: Service
name: ${APP_NAME}
port:
targetPort: http
tls:
termination: edge
insecureEdgeTerminationPolicy: Redirect
```
Use the bundled script for manifest generation:
```bash
bash scripts/generate-manifest.sh deployment myapp production
```
---
## 4. SECURITY
### Security Audit
Run the bundled script:
```bash
bash scripts/security-audit.sh [namespace]
```
### Pod Security Standards
```yaml
apiVersion: v1
kind: Namespace
metadata:
name: ${NAMESPACE}
labels:
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/audit: baseline
pod-security.kubernetes.io/warn: restricted
```
### NetworkPolicy (Zero Trust)
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: ${APP_NAME}-policy
spec:
podSelector:
matchLabels:
app.kubernetes.io/name: ${APP_NAME}
policyTypes:
- Ingress
- Egress
ingress:
- from:
- podSelector:
matchLabels:
app.kubernetes.io/name: frontend
ports:
- protocol: TCP
port: 8080
egress:
- to:
- podSelector:
matchLabels:
app.kubernetes.io/name: database
ports:
- protocol: TCP
port: 5432
# Allow DNS
- to:
- namespaceSelector: {}
podSelector:
matchLabels:
k8s-app: kube-dns
ports:
- protocol: UDP
port: 53
```
### RBAC Best Practices
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: ${APP_NAME}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: ${APP_NAME}-role
rules:
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: ${APP_NAME}-binding
subjects:
- kind: ServiceAccount
name: ${APP_NAME}
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: ${APP_NAME}-role
```
### Image Scanning
```bash
# Scan image with Trivy
trivy image ${IMAGE}:${TAG}
# Scan with severity filter
trivy image --severity HIGH,CRITICAL ${IMAGE}:${TAG}
# Generate SBOM
trivy image --format spdx-json -o sbom.json ${IMAGE}:${TAG}
```
---
## 5. GITOPS
### ArgoCD Application
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: ${APP_NAME}
namespace: argocd
finalizers:
- resources-finalizer.argocd.argoproj.io
spec:
project: default
source:
repoURL: ${GIT_REPO}
targetRevision: main
path: k8s/overlays/${ENV}
destination:
server: https://kubernetes.default.svc
namespace: ${NAMESPACE}
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
```
### Kustomize Structure
```
k8s/
āāā base/
ā āāā kustomization.yaml
ā āāā deployment.yaml
ā āāā service.yaml
āāā overlays/
āāā dev/
ā āāā kus
... (truncated)Create and manage Docker sandboxed VM environments
Agent continuity and cognitive health infrastructure.
Agent continuity and cognitive health infrastructure.
Enables the bot to manage Docker containers, images, and stacks.