Roles and Responsibilities in a microservices world



Architects

  • The architects will be responsible for laying the foundation of STAs (Standard Template Architectures) that will be pre-approved and assessed. Teams won't have to go through the assessment process if they use the architecture as is. The architects will also work with the respective teams to guide them through the risk assessment and management process. The goal for the architects is to minimize the points of failure (POF) so that designs are optimized for operations.

Ops (DevOps model)

  • The ops engineer will be responsible for doing the risk assessment and making sure that resiliency is built in from the get-go. Resiliency covers not only outright failures but also detecting degradation of the service (either its own or that of a downstream service).
  • Keeping the risk management mindset in mind, the leads will make sure the code has the necessary intelligence to mitigate and recover gracefully from failures, adhering to the following principles:
  1. Isolate client network interaction using the bulkhead and circuit breaker patterns.
  2. Fallback and degrade gracefully when possible.
  3. Fail fast when fallbacks aren’t available and rapidly recover.
  4. Monitor, alert and push configuration changes with low latency (seconds).
  • The ops engineers are bound to the STA. If a design deviates from the standard, they need architecture approval before implementation. During the approval process they will have to present an FMEA (Failure Mode and Effects Analysis) and show that the design makes the infrastructure self-healing and recoverable.
  • The ops engineer will be responsible for making sure that the STA has monitoring and alerting configured.
  • The ops engineer is responsible for making sure that there is alerting on the trending KPIs defined by the managers.
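
As an illustration of principles 1 to 3 above, here is a minimal sketch of a circuit breaker with a fallback. It is not a production implementation and does not cover the bulkhead pattern. The class name `CircuitBreaker` and the parameters `max_failures`, `reset_after` and `clock` are hypothetical names chosen for this example.

```python
import time


class CircuitBreaker:
    """Sketch of a circuit breaker: fail fast while open, retry after a cool-down."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds (assumed value)
        self.clock = clock                # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            # Circuit is open: fail fast and degrade to the fallback
            # without touching the downstream service.
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            # Cool-down has passed: allow one trial call (half-open).
            self.opened_at = None
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        return result
```

A caller would wrap each downstream request, for example `breaker.call(fetch_profile, lambda: cached_profile)`, where both callables are placeholders for the real request and its degraded alternative.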


Managers

  • The managers are responsible for making sure that the risk management principles are upheld at all times. If there is an exception, it must be formally approved by the SLT and the Architecture team and communicated to the stakeholders (ops, consumers of the applications, PM, clients).
  • The managers are also responsible for defining the critical KPIs and making sure they are measured. They will make sure that every published metric has a direct measure of usability impact, thresholds, and an associated action item.
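
One way to keep the impact, thresholds and action item attached to each metric is to declare them together. The sketch below assumes a KPI where higher values are worse; `KpiRule`, `evaluate`, the field names and the sample numbers are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KpiRule:
    name: str
    warn_at: float     # threshold for a warning
    page_at: float     # threshold for paging someone
    user_impact: str   # what the user experiences when this KPI degrades
    action: str        # what the on-call engineer should do


def evaluate(rule, value):
    """Return 'ok', 'warn' or 'page' for a KPI where higher is worse."""
    if value >= rule.page_at:
        return "page"
    if value >= rule.warn_at:
        return "warn"
    return "ok"


# Illustrative rule only; the numbers are made up for the example.
checkout_latency = KpiRule(
    name="checkout_p99_latency_ms",
    warn_at=800,
    page_at=2000,
    user_impact="Checkout feels slow; users may abandon the cart",
    action="Check the payment gateway dependency and recent deploys",
)
```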

Product managers

  • The product managers will be responsible for defining the degraded state. There will always be a degraded state definition. We are going the microservices route; services will go down, and there is no exception to it.
  • The PM will be responsible for defining the degraded and down states with respect to the downstream services. Downstream services also include the infrastructure components.
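
A degraded state definition can be made explicit enough to test. The sketch below assumes one possible rule: the service is down if any dependency the PM marks as critical is down, and degraded if any other dependency is not fully up. `State`, `service_state` and the dependency names are hypothetical.

```python
from enum import Enum


class State(Enum):
    UP = "up"
    DEGRADED = "degraded"
    DOWN = "down"


def service_state(dependencies, critical):
    """Derive a service's state from its downstream dependencies.

    dependencies: dict mapping dependency name to State
    critical: set of dependency names the service cannot run without
    """
    # Assumed rule: losing a critical dependency takes the service down.
    if any(dependencies[name] is State.DOWN for name in critical):
        return State.DOWN
    # Any other impairment downstream leaves the service degraded.
    if any(state is not State.UP for state in dependencies.values()):
        return State.DEGRADED
    return State.UP
```

For example, with `critical={"orders-db"}`, a failed recommendations service yields `State.DEGRADED`, while a failed `orders-db` yields `State.DOWN`.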

Quality Engineers

  • The quality engineers will play devil's advocate for all of the above. They are responsible for making sure that all the failure modes and effects are tested and that the system behaves as expected.
  • They will make sure that the effects of the failures are documented, and will work with the dev and ops teams to come up with the actionable items and auto-recovery.

Process and Procedure

  1. Impact of the event on the user.
  2. Likelihood of the occurrence of the event.
  3. Velocity of the occurrence of the event.
  4. Vulnerability or susceptibility to the risk.
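
The four factors above could be combined into a single score to rank failure modes, in the spirit of an FMEA risk priority number. The sketch below assumes each factor is rated from 1 to 5 and that the factors are multiplied; the scale, the multiplication, and the names `FailureMode` and `rank` are assumptions for illustration, not something the process prescribes.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class FailureMode:
    description: str
    impact: int         # effect on the user, 1 (minor) to 5 (severe)
    likelihood: int     # how probable the event is, 1 to 5
    velocity: int       # how quickly the event unfolds, 1 to 5
    vulnerability: int  # how exposed the system is, 1 to 5

    def score(self):
        factors = (self.impact, self.likelihood, self.velocity, self.vulnerability)
        if any(not 1 <= f <= 5 for f in factors):
            raise ValueError("each factor must be between 1 and 5")
        # Assumed formula: plain product of the four ratings.
        return prod(factors)


def rank(failure_modes):
    """Return failure modes ordered from highest to lowest score."""
    return sorted(failure_modes, key=lambda fm: fm.score(), reverse=True)
```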

The benefits of the above system are:

  1. Helps in the development of more resilient systems.
  2. Pushes for self-healing, intelligent systems.
  3. Clients are well aware of the system they are onboarding to, which builds a sense of security because they know the flaws and have acknowledged the risks.
  4. Helps the team build a better monitoring and recovery system.
  5. Lays the foundation for predictive monitoring.
  6. Helps in building an interactive model for architecture and improves design and implementation: production failures that result in an RCA feed into risk management, so everyone using the architecture becomes aware of the edge cases encountered in production and can improve their system accordingly.




Rachit Lohani

Technology leader
