[Session Report] My Pods aren’t responding! A Kubernetes troubleshooting journey #BOA205

In this blog, we answer Kubernetes questions such as "What could possibly go wrong?", "What can we do to recover?", and "How do we mitigate common problems?", all of which help with troubleshooting when pods don't respond.
2023.12.19

session start

This post walks through the troubleshooting journey for Kubernetes pods that aren't responding.

Understanding the technology matters. Creating a cluster and deploying an app is not the end of the job; a lot happens while you maintain a cluster. This is important because applications are no longer simple. Failures will come and go, and their impact can be costly.

In the Kubernetes context, there are many places where things can go wrong.

Context

first things first

  • Kubernetes (K8s) is an application platform, not just a plug-and-play mechanism.
  • Applications are going global and need to scale.
  • K8s abstracts a lot away from developers as a platform for applications.
  • Abstraction works well, but abstraction is itself made up of processes and tools.
  • In 2023, the verbosity of errors, and troubleshooting in general, is still a problem.

Why is verbosity important?

Applications come in different shapes and sizes. There are now cloud-native apps, and these have split into microservices as opposed to the monolithic realm. As complexity has increased, so have the failure modes and their impact, which is why verbosity plays an important role in troubleshooting such failures.

verbosity-is-important

How to start troubleshooting

![most-troubleshoot-component](https://devio2023-media.developers.io/wp-content/uploads/2023/12/WhatsApp-Image-2023-12-17-at-21.30.25.jpeg)

Before talking about any kind of Kubernetes troubleshooting, it is important to understand how users commonly think about Kubernetes and the lifecycle of a K8s request.

kubernetes-set context

Lifecycle of a K8s request

lifecycle of request

The troubleshooting cycle always revolves around four stages: observe, orient, decide, and act.

troubleshooting stages

Thinking about troubleshooting

  • 5 major failure pillars
    • Manifest files can throw errors during deployment.
    • K8s abstracts away a lot of network configuration, which is fine as long as it works.
    • Misconfigurations: control plane not configured, DNS.
    • Out-of-memory, storage, and application code issues.

Troubleshoot different pillars

Best Practices when creating YAML

Schema/YAML validation

schema yaml error

  • Instead of writing schemas and YAML from scratch, use templating or package management tools such as kustomize, cdk8s, or Helm.
  • Use a validating admission controller to admit or deny specific manifests before the cluster starts deploying them.
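The idea of checking a manifest before the cluster acts on it can be sketched in a few lines of Python. This is only an illustration of the admit-or-deny decision, not a real admission controller: the function name `validate_deployment`, the sample manifest, and the "every container must set limits" policy are all assumptions made for this example, and the manifest is given as JSON so the sketch needs only the standard library.

```python
import json

REQUIRED_TOP_LEVEL = ("apiVersion", "kind", "metadata", "spec")


def validate_deployment(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the manifest is admitted."""
    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in manifest:
            problems.append(f"missing top-level field: {key}")
    if manifest.get("kind") != "Deployment":
        problems.append("kind must be Deployment for this check")
    if not manifest.get("metadata", {}).get("name"):
        problems.append("metadata.name is required")

    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    containers = pod_spec.get("containers", [])
    if not containers:
        problems.append("spec.template.spec.containers must not be empty")
    for container in containers:
        name = container.get("name", "<unnamed>")
        if not container.get("image"):
            problems.append(f"container {name}: image is required")
        # Hypothetical team policy (an assumption, not a Kubernetes rule).
        if "limits" not in container.get("resources", {}):
            problems.append(f"container {name}: resources.limits is required by policy")
    return problems


# Hypothetical manifest used only for illustration.
manifest = json.loads("""
{
  "apiVersion": "apps/v1",
  "kind": "Deployment",
  "metadata": {"name": "demo-app"},
  "spec": {"template": {"spec": {"containers": [
    {"name": "web", "image": "example/web:1.0"}
  ]}}}
}
""")

for problem in validate_deployment(manifest):
    print("DENY:", problem)
```

In a real cluster this kind of check would usually live in a validating admission webhook or a policy engine rather than in a script, but the decision logic has the same shape: inspect the object, collect violations, and deny if any exist.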

BP schema yaml

Things to remember for Kubernetes networking

k8 networking concerns

When it comes to K8s networking, four areas deserve special attention:

  • CNIs
  • Policy management
  • Security groups
  • VPC address space (IP exhaustion)
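To make the IP exhaustion concern concrete, here is a rough back-of-the-envelope sketch. It assumes a CNI that gives every pod an address from the subnet (as the Amazon VPC CNI does by default), one address per node, and no prefix delegation or warm-IP pools; the CIDR ranges, node counts, and function names are made up for illustration. The only fixed fact used is that AWS reserves five addresses in every subnet.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Number of addresses in the subnet that workloads can actually use."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - AWS_RESERVED_PER_SUBNET, 0)


def remaining_after_pods(cidr: str, nodes: int, pods_per_node: int) -> int:
    """Addresses left over; a negative result means the subnet is exhausted.

    Simplifying assumption: one IP per node plus one IP per pod.
    """
    needed = nodes + nodes * pods_per_node
    return usable_ips(cidr) - needed


print(usable_ips("10.0.0.0/24"))                    # 251
print(remaining_after_pods("10.0.0.0/24", 10, 30))  # -59, the subnet is too small
print(remaining_after_pods("10.0.0.0/20", 10, 30))  # 3781
```

Even this crude estimate shows why a /24 that looks roomy on paper can run out of addresses with only a handful of busy nodes.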

K8s errors

error codes

summary of kubernetes error

  • There are many error codes, and kubectl by itself lets us observe and troubleshoot them efficiently.

kubectl

  • Additional tools help too, for example stern: a CLI utility that tails the logs of multiple pods at the same time on the console.

Observing Kubernetes with Kubectl

kubectl observation

  • The YAML looks fine but has issues.
  • kubectl submits the job but does not tell you about the error before submitting the request.
    • The YAML failed because the service account name did not exist within the namespace.
  • Second error: ImagePullBackOff, because the image doesn't exist in ECR.
    • Fix: correct the image name.
  • Another error: IRSA (IAM Roles for Service Accounts) does not have DynamoDB permission, and the DynamoDB table name used by the application does not exist.
  • Now the readiness and liveness probes fail.
    • Fix: adjust the resource requests and limits to give the pod more resources.
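The walkthrough above is essentially repeated observation of pod status. As a small sketch of that "observe" step, the snippet below pulls the waiting reason for each container out of pod JSON of the shape returned by `kubectl get pod <name> -o json`. The sample pod, its names, and the helper `waiting_reasons` are hypothetical and trimmed to the fields the sketch reads.

```python
import json


def waiting_reasons(pod: dict) -> dict[str, str]:
    """Map container name to waiting reason for containers stuck in 'waiting'."""
    reasons = {}
    for status in pod.get("status", {}).get("containerStatuses", []):
        waiting = status.get("state", {}).get("waiting")
        if waiting:
            reasons[status["name"]] = waiting.get("reason", "Unknown")
    return reasons


# Hypothetical, heavily trimmed pod object used only for illustration.
sample_pod = json.loads("""
{
  "metadata": {"name": "demo-pod", "namespace": "default"},
  "status": {
    "phase": "Pending",
    "containerStatuses": [
      {
        "name": "app",
        "ready": false,
        "state": {"waiting": {"reason": "ImagePullBackOff",
                              "message": "Back-off pulling image"}}
      },
      {
        "name": "sidecar",
        "ready": true,
        "state": {"running": {"startedAt": "2023-12-17T00:00:00Z"}}
      }
    ]
  }
}
""")

for container, reason in waiting_reasons(sample_pod).items():
    print(f"{container}: {reason}")
```

For the sample above this prints `app: ImagePullBackOff`, which points at the image name or registry access, the same conclusion reached manually in the session.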

Common network issues

network issues

AWS-native ways to enable "verbosity"

The second part of the session covers AWS-native ways to increase observability and simplify the operations around it: using EKS add-ons, collecting OpenTelemetry data with the ADOT collector, and using the observability accelerator module for Terraform.

Observability add-on for EKS

The latest add-on provides a one-stop solution for collecting metrics across the EKS cluster and visualising them in CloudWatch, which significantly increases observability when troubleshooting failures.

observability add on

  • OpenTelemetry with ADOT (AWS Distro for OpenTelemetry)

add on architecture

  • Observability accelerator for Terraform

observability accelerator