Approaching Reliability

2 min read · Aug 11, 2023

Improving reliability can feel like a daunting task, and it is. If it is not a priority for the company, don't sign up for it. It costs capital: compute, architecture, people, time.

Reliability will be your most expensive feature.

If you want to succeed at improving reliability, make sure the goal has executive sponsorship, executive visibility, a sense of urgency and priority, and enough capital to run the project for at least 12 months.

Let's talk about a simple framework to approach reliability.

What is reliability? Site reliability describes the stability and quality of an offering.

How is it measured? By the golden signals: latency, traffic, saturation, and errors. The corresponding metrics are p90/p95/p99 latency, expected traffic volume, saturation (how close the traffic is to the tested capacity), and error rate.
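To make those metrics concrete, here is a minimal sketch of how they could be derived from one window of request logs. The `Request` record, the `golden_signals` function, and the `tested_capacity_rps` parameter are hypothetical names made up for illustration, and the percentile uses the simple nearest-rank method; a real setup would more likely pull these numbers from a monitoring system.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical log record: how long the request took and whether it succeeded.
    latency_ms: float
    ok: bool


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, tested_capacity_rps):
    """Summarize one observation window into the four golden-signal metrics."""
    latencies = [r.latency_ms for r in requests]
    traffic_rps = len(requests) / window_seconds
    failed = sum(1 for r in requests if not r.ok)
    return {
        "p90_ms": percentile(latencies, 90),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "traffic_rps": traffic_rps,
        # Saturation here is traffic as a fraction of the load-tested capacity (assumption).
        "saturation": traffic_rps / tested_capacity_rps,
        "error_rate": failed / len(requests),
    }
```

Calling `golden_signals(requests, window_seconds=60, tested_capacity_rps=500)` on a minute of logs returns a dictionary with all of these values; a saturation close to 1.0 means traffic is approaching the level the system was last load-tested at.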

How do we approach it?

Reliability work will bleed into your product architecture, but it is always good to draw a boundary between application changes and ecosystem changes.

Written by Rachit Lohani
