Approaching Reliability
Improving reliability can feel like a daunting task, and it is. If it is not a priority for the company, then don't sign up for improving reliability. It costs capital: compute, architecture, people, time.
Reliability will be your most expensive feature.
If you want to succeed at improving reliability, make sure the goal has executive sponsorship, executive visibility, a sense of urgency and priority, and enough capital to run the project for at least 12 months.
Let's talk about a simple framework to approach reliability.
What is reliability? Site reliability describes the stability and quality of an offering.
How is it measured? With the golden signals: latency, traffic, saturation, and errors. The corresponding metrics are p90/p95/p99 latency, expected traffic volume, how close the traffic is to the tested capacity (the green line below marks saturation), and error rate.
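To make those metrics concrete, here is a minimal sketch of how they could be computed from one window of request records. The `Request` type, the `golden_signals` function, and the parameters `window_seconds` and `tested_capacity_rps` are hypothetical names chosen for illustration, and the nearest-rank percentile is just one common definition. In practice a monitoring system would usually compute these for you.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical record shape for this sketch)."""
    latency_ms: float
    is_error: bool


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def golden_signals(requests, window_seconds, tested_capacity_rps):
    """Summarize one observation window into the four golden signals."""
    latencies = [r.latency_ms for r in requests]
    traffic_rps = len(requests) / window_seconds
    return {
        "latency_p90_ms": percentile(latencies, 90),
        "latency_p95_ms": percentile(latencies, 95),
        "latency_p99_ms": percentile(latencies, 99),
        "traffic_rps": traffic_rps,
        # Saturation here is assumed to mean traffic as a fraction of load-tested capacity.
        "saturation": traffic_rps / tested_capacity_rps,
        "error_rate": sum(r.is_error for r in requests) / len(requests),
    }


if __name__ == "__main__":
    # Synthetic data for illustration only: 100 requests over 10 seconds, 5 of them failing.
    sample = [Request(latency_ms=float(i), is_error=(i <= 5)) for i in range(1, 101)]
    print(golden_signals(sample, window_seconds=10, tested_capacity_rps=20))
```

Tracking saturation as a ratio against tested capacity, rather than as a raw request count, makes it easy to alert when traffic approaches the limit you have actually validated.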
How do we approach it?
Reliability work will bleed into your product architecture, but it always helps to draw a boundary between application changes and ecosystem changes.