
But even more difficult was figuring out what was going on under the hood, and how to prevent it from happening again.

This is what brought us to think about troubleshooting in terms of three pillars. I’m going to dive into how we envision these three pillars, and how they helped us work out what is needed to properly troubleshoot real-world Kubernetes stacks, which are the hallmark of complex, distributed systems.

To understand what actually happened in the system to trigger this failure, developers start by analyzing recent changes to the system and asking which of them could have caused it.
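
As a rough illustration of that first step, here is a minimal Python sketch that narrows a list of recorded changes down to those made shortly before an incident. Everything in it is an assumption made for the example: the `ChangeEvent` shape, the `changes_before_incident` helper, the one-hour lookback window, and the sample data are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """One recorded change to the system (hypothetical shape for this sketch)."""
    timestamp: datetime
    resource: str
    description: str


def changes_before_incident(events, incident_time, lookback=timedelta(hours=1)):
    """Return the changes made within `lookback` before the incident, newest first."""
    window_start = incident_time - lookback
    candidates = [e for e in events if window_start <= e.timestamp <= incident_time]
    # The most recent changes are usually the first ones worth inspecting.
    return sorted(candidates, key=lambda e: e.timestamp, reverse=True)


# Made-up sample data, for illustration only.
incident = datetime(2024, 1, 1, 12, 0)
events = [
    ChangeEvent(datetime(2024, 1, 1, 9, 0), "configmap/app-settings", "edited"),
    ChangeEvent(datetime(2024, 1, 1, 11, 40), "deployment/checkout", "image updated"),
    ChangeEvent(datetime(2024, 1, 1, 11, 55), "deployment/checkout", "replicas scaled"),
]
suspects = changes_before_incident(events, incident)
for event in suspects:
    print(event.timestamp.isoformat(), event.resource, event.description)
```

A fixed time window is only a starting heuristic: a change made hours earlier can still be the cause, so in practice the window would be widened whenever the recent candidates do not explain the failure.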

We then turn to the fancy metrics, dashboards, and data that we created for just this moment, to work out what is going wrong based on tangible data sources.
