When I joined Honeycomb two years ago, we were entering a phase of growth where we could no longer expect to have the time to prevent or fix all issues before things got bad. All the early parts of the system needed to scale, but we wouldn’t have the bandwidth to tackle some of them gracefully.
Incidents are normal; they’re the rolled-up history of decisions made a long time ago, then thrown into a dynamic environment with rapidly shifting needs and people.
This vision influences how I try, with the help of many coworkers, to shape the way Honeycomb deals with incidents.
This list is in flux, but it represents a decent high-level snapshot in time of what I think is important today.