Roles and Responsibilities in a Microservices World
Microservices have changed how we view projects. Projects are no longer heavily staffed, which is great in many ways but worrying when you consider the risk exposure: teams are much smaller, and nobody can realistically keep control of 30 different projects solving 300 different problems. We adopted the microservices model two years ago and learned a lot from our successes and failures. One lesson was about risk management: how to streamline the process so that projects are delivered smoothly with all assumptions clearly spelled out.
The risk mindset was instilled in the teams from the very inception of each project. Teams were free to choose any technology or implementation they liked, but they had to do an FMEA (Failure Mode and Effects Analysis) on it and get buy-in from the clients and the product manager. We published STAs (Standard Template Architectures) that engineers could start using right away, because they came with a completed FMEA and accepted risks. This does not prevent engineers from taking a creative route. If they do not like the existing solution, they have to think their own solution through, which creates one more option for those who later face the same problem.
The most beautiful aspect of microservices is that one gets to wear multiple hats: you can be a manager on one project, an engineer on another, an architect on a third, and a guide for production support on a fourth. The role you play changes with the project, so we have outlined the function of each role. Risk is magnified because you are a service provider to some and a consumer of others. Your service relies on any number of other services, which makes it critical to analyze the risks and limitations and to communicate them to the downstream clients.
This document describes an approach to risk assessment and management. We broadly define risk, set out the expectations for each team member, delineate responsibilities, and provide a framework and guidelines on how and at what stage to conduct the assessment. The audience includes developers, QA, Ops, Architecture, and the project and product teams.
Outcome/Expectations
With a formal risk management process in place we expect more resilient applications, less downtime, good coverage of edge cases, and explicit acknowledgement of risks. The success or failure of the risk management implementation will be gauged by the FCI count, uptime, and the mean-time-to metrics, MTT {Resolve, Identify, Know, Fix, Verify}.
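As an illustration of how the mean-time-to and uptime figures could be derived, here is a minimal Python sketch. The incident record layout and the field names (`started`, `identified`, `resolved`) are assumptions made for this example and are not a schema from any particular tool.

```python
from datetime import datetime, timedelta


def mean_time_to(incidents, start_key, end_key):
    """Average duration between two timestamps across all incidents."""
    durations = [i[end_key] - i[start_key] for i in incidents]
    return sum(durations, timedelta()) / len(durations)


def uptime_percent(incidents, window):
    """Share of the observation window during which no incident was open.

    Assumes incidents do not overlap in time.
    """
    down = sum((i["resolved"] - i["started"] for i in incidents), timedelta())
    return 100.0 * (1 - down / window)


# Hypothetical incident log for one day.
incidents = [
    {"started": datetime(2024, 1, 1, 10, 0),
     "identified": datetime(2024, 1, 1, 10, 10),
     "resolved": datetime(2024, 1, 1, 11, 0)},
    {"started": datetime(2024, 1, 1, 12, 0),
     "identified": datetime(2024, 1, 1, 12, 20),
     "resolved": datetime(2024, 1, 1, 12, 30)},
]

mtt_identify = mean_time_to(incidents, "started", "identified")  # 15 minutes
mtt_resolve = mean_time_to(incidents, "started", "resolved")     # 45 minutes
uptime = uptime_percent(incidents, timedelta(hours=24))          # 93.75
```

The same helper covers the other MTT variants by passing a different pair of timestamp fields.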
Architects
- The architects will be responsible for laying the foundation of the STA (Standard Template Architectures), which will be assessed and pre-approved. Teams won't have to go through the assessment process if they use an architecture as is. The architects will also work with the respective teams to guide them through the risk assessment and management process. The goal for the architects is to minimize the POF (points of failure) so that we end up with operationally optimized designs.
Ops (DevOps model)
- The Ops engineer will be responsible for doing the risk assessment and making sure that resiliency is built in from the get-go. Resiliency covers not only failures but also detecting degradation of the service (either its own or that of a downstream service).
- With the risk management mindset in mind, the leads will make sure the code has the necessary intelligence to mitigate and recover gracefully from failures, adhering to the following principles:
- Isolate client network interaction using the bulkhead and circuit breaker patterns.
- Fallback and degrade gracefully when possible.
- Fail fast when fallbacks aren’t available and rapidly recover.
- Monitor, alert and push configuration changes with low latency (seconds).
- Ops engineers are bound to the STA. If a design is not standard, the engineer needs architecture approval before implementation. During the approval process the engineer will have to present an FMEA and make sure that the design makes the infrastructure auto-healing and recoverable.
- The ops engineer will be responsible for making sure that the STA has the monitoring and alerting configured.
- The engineer is responsible for making sure that there is alerting on the trends of the KPIs defined by the managers.
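The circuit breaker, fallback, and fail-fast principles listed in this section can be sketched as follows. This is a minimal, single-threaded illustration in Python, not a production implementation and not the API of any specific library. The class name, the parameters (`max_failures`, `reset_after`) and the injectable `clock` are all assumptions made for the example, and the bulkhead pattern is not covered here.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and no fallback was supplied."""


class CircuitBreaker:
    """Minimal single-threaded circuit breaker (illustrative only)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so the timeout can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, fallback=None):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: fail fast without touching the downstream service.
                return self._degrade(fallback, CircuitOpenError("circuit open"))
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception as exc:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            return self._degrade(fallback, exc)
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result

    @staticmethod
    def _degrade(fallback, exc):
        # Degrade gracefully when a fallback exists, otherwise fail fast.
        if fallback is None:
            raise exc
        return fallback()
```

A caller would wrap each downstream request, for example `breaker.call(fetch_profile, fallback=lambda: cached_profile)`, where `fetch_profile` and `cached_profile` are hypothetical names. One breaker per downstream dependency keeps a failing service from affecting calls to the others.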
Managers
- The managers are responsible for making sure that the risk management principles are upheld at all times and that any exception has been formally approved by the SLT and the Architecture team and has been communicated to the stakeholders (ops, consumers of the applications, PM, clients).
- The managers are also responsible for defining the critical KPIs and making sure they are measured. They will make sure that every published metric has a direct measure of usability impact, thresholds, and an actionable item associated with it.
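One way to keep the user impact, threshold, and actionable item attached to each KPI is to store them together in a single record. The sketch below assumes a simple in-memory rule list. The metric names, thresholds, and actions are hypothetical examples and not values from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KpiRule:
    name: str              # metric name as published by the service
    user_impact: str       # what the user experiences when the rule is breached
    threshold: float       # value at which the rule fires
    higher_is_worse: bool  # True for latency or error rate, False for success rate
    action: str            # what the on-call engineer is expected to do

    def breached(self, value: float) -> bool:
        if self.higher_is_worse:
            return value >= self.threshold
        return value <= self.threshold


def evaluate(rules, samples):
    """Return (rule name, action) for every rule breached by the latest samples."""
    return [
        (rule.name, rule.action)
        for rule in rules
        if rule.name in samples and rule.breached(samples[rule.name])
    ]


# Hypothetical rules and readings.
rules = [
    KpiRule("checkout_p99_latency_ms", "checkout feels slow", 800.0, True,
            "scale out the checkout service"),
    KpiRule("login_success_rate", "users cannot sign in", 0.98, False,
            "fail over to the secondary identity provider"),
]
samples = {"checkout_p99_latency_ms": 950.0, "login_success_rate": 0.995}
alerts = evaluate(rules, samples)
```

With these readings only the latency rule fires, so `alerts` holds a single entry naming that rule and its action.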
Product managers
- The product managers will be responsible for defining the degraded state. There will always be a degraded state definition. We are going the microservices route: services will go down, and there is no exception to that.
- The PM will be responsible for defining the degraded and down states of the downstream services. The downstream services also include the infrastructure components.
Quality Engineers
- The quality engineers will play devil's advocate for all of the above. They are responsible for making sure that all the failure modes and effects are tested and that the system behaves as expected.
- They will make sure that the effects of the failures are documented, and they will work with the dev and ops teams to come up with the actionable items and auto-recovery.
Process and Procedure
Risk assessment is initiated in the design phase. A team that is not following the STA templates has to do a risk assessment and go through a formal approval process. Once approved, the architecture becomes part of the STA, so that others can use it knowing the pros and cons of the design.
You can pick a technique such as FMEA for risk assessment, and there are plenty of Jira plugins for risk tracking and management.
At the very inception of the project you start with the risks involved. The risk we focus on most is the operational risk of the project or product. This is where you analyze the services you will be consuming and make sure you understand their systems, SLAs, support agreements, and data contracts. The risk management document that you publish will cover all the underlying services, software, and failure scenarios. For example, if you are using Redis: how would your system behave if the master is down, the replica is down, the whole Redis infrastructure is down, the master and replica are flapping, or you are getting intermittent timeouts from the master or replica?
Once the risks are identified they are tracked. The next step is to create a matrix showing the potential impact of each failure on clients, users, the system, logs, and upstream services. Once we know a good deal about the impact, we move on to prioritization.
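The Redis example can be turned into a simple coverage check: list the failure scenarios for a dependency and flag any that do not yet have a documented effect and response. The sketch below is only an illustration. The register layout and the sample effects and responses are assumptions, not a prescribed format.

```python
# Failure scenarios taken from the Redis example in the text.
REDIS_SCENARIOS = [
    "master down",
    "replica down",
    "whole Redis infrastructure down",
    "master and replica flapping",
    "intermittent timeouts",
]


def unanalyzed(scenarios, register):
    """Return the scenarios that lack a documented effect or response."""
    missing = []
    for scenario in scenarios:
        entry = register.get(scenario, {})
        if not entry.get("effect") or not entry.get("response"):
            missing.append(scenario)
    return missing


# Hypothetical, partially completed risk register for one service.
register = {
    "master down": {
        "effect": "writes fail, reads are served from the replica",
        "response": "promote the replica and queue writes",
    },
    "replica down": {
        "effect": "no read scaling, master takes all traffic",
        "response": "",  # effect is known, but no response agreed yet
    },
}

gaps = unanalyzed(REDIS_SCENARIOS, register)
```

Here `gaps` lists every scenario except "master down", which is the only one with both fields filled in. A check like this could run before the risk management document is submitted for approval.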
After the impact analysis we prioritize the risks according to the following assessment criteria:
- Impact of the event on the user
- Likelihood of the occurrence of the event
- Velocity of the occurrence of the event
- Vulnerability/susceptibility to the risk
After this step we know the impact on the overall system and can prioritize accordingly. Once the risks are prioritized, the team can work on mitigation, a workaround, or simply accepting the risk.
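A minimal sketch of how the four assessment criteria could be combined into a ranking is shown below. The 1 to 5 scale and the choice to multiply the ratings are assumptions made for illustration, loosely modelled on the risk priority number used in FMEA. The text does not prescribe a formula, and the sample risks and ratings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    impact: int         # effect on the user, 1 (negligible) to 5 (severe)
    likelihood: int     # how probable the event is, 1 to 5
    velocity: int       # how quickly the event unfolds, 1 to 5
    vulnerability: int  # how exposed the system is, 1 to 5

    def score(self) -> int:
        ratings = (self.impact, self.likelihood, self.velocity, self.vulnerability)
        if any(not 1 <= r <= 5 for r in ratings):
            raise ValueError(f"ratings for {self.name!r} must be between 1 and 5")
        # Assumed scheme: the product of the four ratings (range 1 to 625).
        return self.impact * self.likelihood * self.velocity * self.vulnerability


def prioritize(risks):
    """Return the risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score(), reverse=True)


# Hypothetical ratings for three of the Redis scenarios.
risks = [
    Risk("Redis infrastructure down", impact=5, likelihood=1, velocity=5, vulnerability=2),
    Risk("Redis master down", impact=4, likelihood=2, velocity=5, vulnerability=3),
    Risk("Intermittent Redis timeouts", impact=2, likelihood=4, velocity=3, vulnerability=3),
]
ranked = [(r.name, r.score()) for r in prioritize(risks)]
```

With these ratings the master outage scores 120, the intermittent timeouts 72, and the full infrastructure outage 50. The team would then decide for each item, from the top down, whether to mitigate it, work around it, or accept it.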
The benefits of the above system are:
- Helps in developing more resilient systems.
- Pushes for self-healing, intelligent systems.
- Clients are well aware of the system they are onboarding onto, which builds a sense of security because they know the flaws and have acknowledged the risks.
- Helps the team build a better monitoring and recovery system.
- Lays the foundation of predictive monitoring.
- Helps build an interactive model for architecture and improves the design and implementation: production failures that result in an RCA feed back into risk management, so whoever uses the architecture becomes aware of the edge cases encountered in production and can improve their system accordingly.