[Session Report] My Pods aren’t responding! A Kubernetes troubleshooting journey #BOA205

In this blog, we answer Kubernetes questions such as "What could possibly go wrong?", "What can we do to recover?", and "How do we mitigate common problems?", all of which help when troubleshooting pods that don't respond.

#Kubernetes

#Amazon EKS

Jatin Mehrotra

2023.12.19

This post covers the troubleshooting journey when Kubernetes pods aren't responding.

Understanding the technology matters. Creating a cluster and deploying an app is not the end of the work; a lot happens while you maintain a cluster. This is important because applications are no longer simple. Failures will come and go, and the impact of a failure can be costly.

In the Kubernetes context there are many places where things can go wrong.

Context

K8s is an application platform, not just a plug-and-play mechanism anymore.
Applications are going global and need to scale.
K8s abstracts a lot for developers as a platform for applications.
Abstraction works well, but the abstraction is itself a set of processes and tools.
In 2023, the verbosity of errors, and therefore troubleshooting, is still a problem.

Why is verbosity important?

Applications come in different shapes and sizes. There are now cloud-native apps, and these have split into microservices as opposed to the monolithic realm. As complexity has increased, so have the failure modes and their impact, and so verbosity plays an important role in troubleshooting such failures.

How to start troubleshooting

![most-troubleshoot-component](https://devio2023-media.developers.io/wp-content/uploads/2023/12/WhatsApp-Image-2023-12-17-at-21.30.25.jpeg)

Before talking about any kind of Kubernetes troubleshooting, it is important to understand the common notions users have about it and the lifecycle of a K8s request.

Lifecycle of a K8s request

The cycle of troubleshooting always revolves around four stages: observe, orient, decide, and act.

Thinking about troubleshooting

5 major failure pillars
- Manifests: manifest files can throw errors during deployment.
- Networking: K8s abstracts a lot of network configuration, which is fine as long as it works.
- Misconfigurations: control plane not configured, DNS.
- Out-of-memory, storage, and application code issues.

Troubleshoot different pillars

Best Practices when creating YAML

Schema/YAML validation

Instead of creating schemas and YAML from scratch, use templating or package management tools such as kustomize, cdk8s, or Helm.
Use a validating admission controller to admit or deny specific schemas before the cluster starts deploying them.
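To make the idea of validating before deploying concrete, here is a minimal sketch in Python of the kind of check such a validation step might run. It is not the API of any admission controller: the function name `validate_deployment`, the chosen rules, and the example manifest are all assumptions made for illustration, and the manifest is represented as a plain dictionary, as it would look after parsing the YAML.

```python
from typing import Any

# Hypothetical rule set for illustration; a real policy would be richer.
REQUIRED_TOP_LEVEL = ("apiVersion", "kind", "metadata", "spec")


def validate_deployment(manifest: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the manifest is admitted."""
    problems = [
        f"missing top-level field: {key}"
        for key in REQUIRED_TOP_LEVEL
        if key not in manifest
    ]
    if not manifest.get("metadata", {}).get("name"):
        problems.append("metadata.name is required")

    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    containers = pod_spec.get("containers", [])
    if not containers:
        problems.append("spec.template.spec.containers must not be empty")

    for index, container in enumerate(containers):
        label = container.get("name", f"#{index}")
        if not container.get("image"):
            problems.append(f"container {label}: image is required")
        resources = container.get("resources", {})
        for field in ("requests", "limits"):
            if field not in resources:
                problems.append(f"container {label}: resources.{field} is not set")
    return problems


# Example manifest (made up): requests are set, limits are forgotten.
example = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "demo"},
    "spec": {"template": {"spec": {"containers": [
        {"name": "app", "image": "example/app:1.0",
         "resources": {"requests": {"cpu": "100m"}}},
    ]}}},
}

print(validate_deployment(example))
```

Running a check like this in CI, or enforcing an equivalent policy at admission time, surfaces the problem before the cluster tries to schedule anything.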

Things to remember for Kubernetes networking

When it comes to K8s networking, four areas deserve special attention:

CNIs
Policy management
Security groups
VPC address space (IP exhaustion)
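IP exhaustion is easy to estimate with simple arithmetic. The sketch below uses Python's standard `ipaddress` module and relies on the fact that AWS reserves five addresses in every subnet. The function names and the example CIDR ranges are assumptions for illustration, and the estimate assumes each pod consumes one VPC IP address, which is how the Amazon VPC CNI behaves by default.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Number of addresses in the subnet that workloads can actually use."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - AWS_RESERVED_PER_SUBNET, 0)


def remaining_ips(cidr: str, ips_in_use: int) -> int:
    """Addresses left after nodes, pods, and other ENIs have taken theirs."""
    return usable_ips(cidr) - ips_in_use


# Hypothetical subnet with 200 addresses already assigned.
print(usable_ips("10.0.0.0/24"))
print(remaining_ips("10.0.0.0/24", 200))
```

A /24 subnet that looks roomy on paper leaves only a few dozen addresses for new pods once a couple of hundred are in use, which is why subnet sizing belongs on the networking checklist.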

K8s errors

There are many error codes, and kubectl by itself lets us observe and troubleshoot them efficiently.

Companion tools help too. For example, stern is a CLI utility that tails the logs of multiple pods at the same time in the console.

Observing Kubernetes with Kubectl

The YAML looks fine but has issues.
kubectl submits the job, but it won't tell you about the error before submitting the request.
- The YAML file failed because the service account name did not exist in the namespace.
Second error: ImagePullBackOff, because the image doesn't exist in ECR.
- Fix: correct the image name.
Another error: the IRSA role did not have DynamoDB permissions, and the DynamoDB table the application referenced did not exist.
Next, the readiness and liveness probes fail.
- Fix: adjust the resource requests and limits to give the pod more resources.
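The walkthrough above follows a pattern: read the status reason, then map it to a likely cause. The sketch below illustrates that pattern in Python. It assumes the pod is available as a dictionary shaped like the output of `kubectl get pod <name> -o json`; the `triage` function, the hint texts, and the example pod are made up for illustration. It only covers container-level reasons, so problems such as a missing service account, which show up elsewhere (for example in events), are out of scope.

```python
from typing import Any

# Status reasons reported by Kubernetes, mapped to a first thing to check.
# The hint texts are illustrative, not exhaustive.
HINTS = {
    "ErrImagePull": "check the image name, tag, and registry access (for example ECR)",
    "ImagePullBackOff": "check the image name, tag, and registry access (for example ECR)",
    "CrashLoopBackOff": "inspect the container logs; the app may fail at startup or fail its liveness probe",
    "OOMKilled": "raise the memory requests and limits, or reduce the app's memory use",
}


def triage(pod: dict[str, Any]) -> list[str]:
    """Return one finding per recognised reason in the pod's container statuses."""
    findings = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        name = status.get("name", "<unnamed>")
        reasons = [
            # Why the container is currently waiting, if it is.
            status.get("state", {}).get("waiting", {}).get("reason"),
            # Why the previous run of the container ended, if it did.
            status.get("lastState", {}).get("terminated", {}).get("reason"),
        ]
        for reason in reasons:
            if reason in HINTS:
                findings.append(f"{name}: {reason} -> {HINTS[reason]}")
    return findings


# Made-up pod: restarting in a loop after the last run was killed for memory.
pod = {"status": {"containerStatuses": [
    {"name": "app",
     "state": {"waiting": {"reason": "CrashLoopBackOff"}},
     "lastState": {"terminated": {"reason": "OOMKilled"}}},
]}}

for finding in triage(pod):
    print(finding)
```

Looking at both the current state and the last terminated state matters here, because a container that keeps restarting reports CrashLoopBackOff now while the real cause, such as OOMKilled, sits in its previous run.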

Common network issues

AWS-native ways to enable "verbosity"

The second part of the session covers AWS-native ways to increase observability and simplify the operations around it: using EKS add-ons, collecting OpenTelemetry data with the ADOT collector, and using the observability accelerator module with Terraform.

Observability add-on for EKS

The latest add-on provides a one-stop solution for collecting metrics across the EKS cluster and visualising them in Amazon CloudWatch, which significantly increases observability when troubleshooting failures.