[Session Report] My Pods aren’t responding! A Kubernetes troubleshooting journey #BOA205
This post covers the troubleshooting journey when Kubernetes pods aren't responding.
Understanding the technology matters. Creating a cluster and deploying an app is not the end of the job; a lot happens while you maintain a cluster. This matters because applications are no longer simple: failures come and go, and their impact can be costly.
In the Kubernetes context, there are many places where things can go wrong.
Context
- Kubernetes is an application platform, not just a plug-and-play mechanism.
- Applications are going global and need to scale.
- As a platform for applications, Kubernetes abstracts a lot away from developers.
- Abstraction works well, but abstraction is itself a form of process and tooling.
- In 2023, the verbosity of errors, and troubleshooting in general, is still a problem.
Why is verbosity important?
Applications come in different shapes and sizes. There are now cloud-native apps, and these have split into microservices as opposed to monoliths. As complexity has grown, so have the failure modes and their impact, which is why verbose, descriptive errors play an important role in troubleshooting.
How to start troubleshooting
![most-troubleshoot-component](https://devio2023-media.developers.io/wp-content/uploads/2023/12/WhatsApp-Image-2023-12-17-at-21.30.25.jpeg)
Before talking about any kind of Kubernetes troubleshooting, it is important to understand the common assumptions users hold about it and the lifecycle of a Kubernetes request.
Lifecycle of a Kubernetes request
The cycle of troubleshooting always revolves around four stages: observe, orient, decide and act.
Thinking about troubleshooting
- 5 major failure pillars
- Manifest files can throw errors during deployment.
- Kubernetes abstracts away a lot of network configuration, which is fine as long as it works.
- Misconfigurations: control plane not configured, DNS.
- Out-of-memory, storage and application code issues.
Troubleshoot different pillars
Best Practices when creating YAML
Schema/YAML validation
- Instead of writing schemas and YAML from scratch, use templating or package-management tools such as Kustomize, cdk8s or Helm.
- Use a validating admission controller to admit or deny specific schemas before the cluster starts deploying them.
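To illustrate the admit-or-deny idea, here is a minimal Python sketch of the decision a validating admission webhook might make. The two rules (require `serviceAccountName` and resource limits) and the function names `validate_pod_spec` and `review_response` are assumptions made for this example and were not shown in the session; only the `AdmissionReview` response shape follows the Kubernetes `admission.k8s.io/v1` API.

```python
from typing import Any


def validate_pod_spec(pod_spec: dict[str, Any]) -> list[str]:
    """Return a list of policy violations for a pod spec (empty means admit)."""
    problems = []
    # Hypothetical rule 1: every pod must name its service account explicitly.
    if not pod_spec.get("serviceAccountName"):
        problems.append("spec.serviceAccountName is not set")
    # Hypothetical rule 2: every container must declare resource limits.
    for container in pod_spec.get("containers", []):
        limits = container.get("resources", {}).get("limits")
        if not limits:
            name = container.get("name", "?")
            problems.append(f"container {name!r} has no resource limits")
    return problems


def review_response(review: dict[str, Any]) -> dict[str, Any]:
    """Build an AdmissionReview response for an incoming AdmissionReview request."""
    request = review["request"]
    problems = validate_pod_spec(request["object"].get("spec", {}))
    response: dict[str, Any] = {"uid": request["uid"], "allowed": not problems}
    if problems:
        # The message is what the user sees when the request is denied.
        response["status"] = {"message": "; ".join(problems)}
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

In a real cluster this logic would sit behind an HTTPS endpoint registered through a `ValidatingWebhookConfiguration`; the sketch leaves that wiring out and only shows how a denial with a readable message can be produced before anything is deployed.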
Things to remember for Kubernetes networking
When it comes to Kubernetes networking, four areas deserve special attention:
- CNIs
- Policy management
- Security groups
- VPC address space (IP exhaustion)
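IP exhaustion is easy to estimate up front. The sketch below is a rough calculation, assuming a CNI that gives each pod an address from the VPC subnet (as the Amazon VPC CNI does by default). The function names and the example CIDR are hypothetical; the five reserved addresses per subnet is standard AWS behaviour.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Number of addresses in a VPC subnet that can be assigned to nodes and pods."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - AWS_RESERVED_PER_SUBNET, 0)


def headroom(cidr: str, ips_in_use: int) -> int:
    """Addresses still free once the ones already in use are subtracted."""
    return usable_ips(cidr) - ips_in_use
```

For example, `usable_ips("10.0.0.0/24")` returns 251, so a `/24` subnet with 200 addresses already in use has room for only 51 more pods or nodes.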
K8s errors
- For many errors, kubectl by itself is enough to observe and troubleshoot them efficiently.
- Other tools help as well, for example stern, a CLI utility that tails multiple pods at the same time on the console.
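stern is a separate tool with its own options, which are not reproduced here. As a rough illustration of the same idea using plain kubectl, the following Python sketch builds a `kubectl logs` command that follows every pod matching a label selector. The helper name `build_logs_command` and the example selector are assumptions for this example; the flags are standard `kubectl logs` flags.

```python
def build_logs_command(selector: str, namespace: str = "default",
                       tail: int = 20, max_log_requests: int = 5) -> list[str]:
    """Build a kubectl command that follows logs from all pods matching a label selector."""
    return [
        "kubectl", "logs",
        "-n", namespace,
        "-l", selector,            # select pods by label instead of naming a single pod
        "--all-containers=true",   # include every container in each matching pod
        "--prefix",                # prefix each line with its pod and container name
        "-f",                      # keep streaming new log lines
        f"--tail={tail}",          # start from the last N lines of each container
        f"--max-log-requests={max_log_requests}",  # cap on concurrent log streams
    ]
```

The resulting list can be passed to `subprocess.run`, for instance `subprocess.run(build_logs_command("app=web", namespace="demo"))`, on a machine where kubectl is configured for the cluster.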
Observing Kubernetes with Kubectl
- The YAML looks fine but has issues.
- kubectl submits the job but does not report the error before submitting the request.
- The deployment failed because the service account named in the YAML did not exist in the namespace.
- Second error: ImagePullBackOff, because the image does not exist in ECR.
- Fix: correct the image name.
- Another error: the IRSA (IAM Roles for Service Accounts) role does not have DynamoDB permissions, and the DynamoDB table name the application expects does not exist.
- Next, the readiness and liveness probes fail.
- Fix: adjust limits and requests to give the pod more resources.
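The walkthrough above amounts to reading a pod's status and mapping each reason to a next step. Here is a minimal Python sketch of that mapping, operating on a pod object as printed by `kubectl get pod -o json`. The `HINTS` table and the `diagnose` function are hypothetical; the reason strings are real Kubernetes status values, and the suggested actions paraphrase the steps above.

```python
from typing import Any

IMAGE_HINT = "check the image name and tag, and that the image exists in the registry (e.g. ECR)"

# Hypothetical hint table keyed by container status reason.
HINTS = {
    "ImagePullBackOff": IMAGE_HINT,
    "ErrImagePull": IMAGE_HINT,
    "CrashLoopBackOff": "read the container logs; look for missing permissions or missing resources",
    "OOMKilled": "raise the memory request/limit or reduce the application's memory use",
}


def diagnose(pod: dict[str, Any]) -> list[str]:
    """Return human-readable hints for each container in a pod's status."""
    findings = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        name = status.get("name", "?")
        waiting = status.get("state", {}).get("waiting") or {}
        terminated = status.get("lastState", {}).get("terminated") or {}
        # A container may be waiting now and also have a previous termination reason.
        for reason in (waiting.get("reason"), terminated.get("reason")):
            if reason in HINTS:
                findings.append(f"{name}: {reason}: {HINTS[reason]}")
        # Not waiting, no previous termination recorded, but still not ready.
        if not status.get("ready", False) and not waiting and not terminated:
            findings.append(
                f"{name}: not ready: check readiness/liveness probes and resource limits"
            )
    return findings
```

The sketch only looks at container-level status; permission problems such as a missing DynamoDB policy would still have to be found in the application logs.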
Common network issues
The second part of the session talks about AWS-native ways to enable "verbosity"
These approaches increase observability and simplify operations around it: using EKS add-ons, collecting OpenTelemetry data with the ADOT collector, and using the observability accelerator module for Terraform.
Observability add-on for EKS
The latest add-on provides a one-stop solution for collecting metrics across the EKS cluster and visualising them in CloudWatch, significantly increasing observability for troubleshooting failures.
- OpenTelemetry with ADOT (AWS Distro for OpenTelemetry)
- Observability accelerator for Terraform