In this article I will describe the four essential things we need to take care of to implement fault tolerance in a microservice architecture.
Microservices are in fashion nowadays. They give immense flexibility to different teams working on different components of the same project. Each team can use the technology it likes to implement its own component, which gives the whole project a lot of freedom.
But with great flexibility comes great responsibility. In a microservice architecture, each service interacts with other services to achieve a particular end goal. For example, an online store can have services like Order, Payment, Special Offer, etc.
The Order service may call the Special Offer service to show a special offer when a user is placing an order. Different teams can work on these services and update and release them independently of each other. This increases flexibility, but each team has to be very cautious when it calls a service that does not belong to it.
For example, when the Order service calls the Special Offer service, the Order team has to keep in mind that the Special Offer service may be unavailable. Below we will discuss the things that may go wrong.
Special Offer Service is Down
The Order service wants to display special offers to the user, but unfortunately the Special Offer service is down. The Order service thread keeps waiting for a response, so the user is blocked. Another user comes in, and the Order service uses another thread from its thread pool to serve the second user.
The second thread also calls the Special Offer service and gets blocked as well. As more and more users come in, the Order service runs out of threads in its thread pool, because all of them are blocked waiting for a response from the Special Offer service.
Now the Order service can no longer serve any new users, and all the existing users are blocked. The failing Special Offer service has brought down the Order service as well, and this can have a cascading effect.
When you call a third-party service that you do not control, always set timeouts. This prevents your threads from waiting indefinitely for a response. When a call times out, you know that something is wrong with the other service, and it is not a big deal to skip the special offer this time.
The thread can then move on to its other work instead of waiting for a response from a dead service.
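As a minimal sketch of the idea, assuming the Special Offer service is reached over a plain TCP socket, the call below gives up after `timeout_seconds` instead of waiting forever. The function name `fetch_special_offers` and the `GET_OFFERS` request line are hypothetical and only serve the illustration; a real client would set the equivalent timeout option of whatever HTTP or RPC library it uses.

```python
import socket


def fetch_special_offers(host, port, timeout_seconds):
    """Return the raw reply, or None if the service does not answer in time.

    The wire protocol here (a single GET_OFFERS line) is made up for the example.
    """
    try:
        # The timeout applies to connecting and to every later send/recv.
        with socket.create_connection((host, port), timeout=timeout_seconds) as conn:
            conn.sendall(b"GET_OFFERS\n")
            return conn.recv(1024)
    except OSError:
        # Covers timeouts as well as refused or reset connections.
        # The caller can render the order page without special offers.
        return None
```

Returning `None` rather than raising keeps the caller simple: the Order page treats "no offers" and "offers unavailable" the same way.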
Now you have put timeouts on the calls from your Order service to the Special Offer service, and everything looks good. But after some days, your Order service has become unresponsive again. Users are waiting forever and the user experience has become very poor. What went wrong this time?
You find that the Special Offer service is down again, yet all your timeouts are in place. In the latest release, however, one of your team members put the Special Offer feature on every page of the Order service. This increased the number of calls to the Special Offer service, and your threads are now busy just waiting to time out.
The threads are doing no useful work except timing out. So even with timeouts in place, a burst of requests in a short period will still tie up your threads, which leads to a poor user experience.
Use a circuit breaker. When you find that your last X requests to a particular third-party service have failed, you stop calling that service for a set duration. This means you have opened the circuit.
After that duration is over, you let one request through and call the service. If it works this time, you close the circuit again and call the service as usual.
If it does not work, you leave the circuit open and try again after some time. This saves you from calling a dead service again and again and blocking your crucial resources. Netflix Hystrix is a very popular circuit breaker.
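The open, trial, and close steps can be sketched roughly as follows. This is a simplified, single-threaded illustration and not how Hystrix is implemented; the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `reset_timeout` are assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold, reset_timeout, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # the "X" consecutive failures
        self.reset_timeout = reset_timeout          # seconds to keep the circuit open
        self.clock = clock                          # injectable so time can be faked
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still inside the open period: do not touch the service.
                raise CircuitOpenError("circuit open, call skipped")
            # The open period is over: let this one trial request through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial request.
                self.opened_at = self.clock()
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each Special Offer request in `breaker.call(...)` and treat `CircuitOpenError` the same way as a timeout. A production version would also need locking so that only one thread sends the trial request.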
So we have put in timeouts as well as a circuit breaker, and everything looks green. But one fine day our Order service becomes unresponsive again. We investigate and find that it is the Special Offer service again.
This time the Special Offer service is not down, but it is very slow. Our requests are not timing out, so the circuit breaker never opens, but each request takes a long time. For example, if the timeout is set to 10 ms, each request takes 9.5 ms to process.
If a lot of requests arrive in a short period, your thread pool will not be able to handle them all, and the user experience will be very poor.
Use bulkheads. A bulkhead is like a filter or a wall that allows only N requests to the Special Offer service in a duration S and rejects all other requests to that service.
For example, if you think the Special Offer service should normally take at most 6 ms to process a request, you can set the bulkhead to allow only 2 requests every 15 ms and reject any other incoming request.
This way we can prevent the Special Offer service from tying up the entire Order service thread pool.
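Here is one possible sketch of the "N requests in a duration S" rule, using a fixed time window. The names `WindowedBulkhead` and `BulkheadFullError` are hypothetical. Note that bulkheads are also often implemented as a cap on the number of calls in flight at the same time (for example with a semaphore); this sketch follows the windowed description given above.

```python
import threading
import time


class BulkheadFullError(Exception):
    """Raised when a request is rejected by the bulkhead."""


class WindowedBulkhead:
    def __init__(self, max_calls, window_seconds, clock=time.monotonic):
        self.max_calls = max_calls            # the "N" requests
        self.window_seconds = window_seconds  # the duration "S"
        self.clock = clock                    # injectable so time can be faked
        self.lock = threading.Lock()
        self.window_start = clock()
        self.calls_in_window = 0

    def call(self, func, *args, **kwargs):
        with self.lock:
            current = self.clock()
            if current - self.window_start >= self.window_seconds:
                # A new window begins: reset the counter.
                self.window_start = current
                self.calls_in_window = 0
            if self.calls_in_window >= self.max_calls:
                raise BulkheadFullError("too many calls in this window")
            self.calls_in_window += 1
        # Run the call outside the lock so a slow service does not
        # block the bookkeeping for other threads.
        return func(*args, **kwargs)
```

With the numbers from the example above, the Order service would create `WindowedBulkhead(max_calls=2, window_seconds=0.015)` and skip the special offer whenever `BulkheadFullError` is raised.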
Now it definitely looks good. We have timeouts, a circuit breaker, and now bulkheads. Nothing can go wrong now. But one fine day our Order service becomes unresponsive again. And yes, it is the Special Offer service, which is down again.
But what happened to the timeouts? It seems that in the last release someone accidentally wiped out all the timeouts. Human error, it seems.
Use a service-level thread pool. Its basic purpose is to work as a facade: it intercepts the user request, delegates the task to a worker thread pool, and sets a timeout on that task.
This is a simple layer at the top that you would rarely, if ever, update or change. So even if a worker thread is blocked, you have the guarantee that the service-level thread will time out and send a meaningful response to the user instead of leaving the user blocked.
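A rough sketch of such a facade, assuming the worker pool is a standard `ThreadPoolExecutor` and the thread that calls `handle` plays the role of the service-level thread. The names `ServiceFacade`, `place_order`, and `FALLBACK` are made up for the illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

FALLBACK = "Service is busy, please try again later."


def place_order(delay_seconds):
    """Hypothetical worker task; the sleep simulates a blocked downstream call."""
    time.sleep(delay_seconds)
    return "order placed"


class ServiceFacade:
    def __init__(self, workers, timeout_seconds):
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.timeout_seconds = timeout_seconds

    def handle(self, task, *args):
        # Delegate the real work to the worker pool ...
        future = self.pool.submit(task, *args)
        try:
            # ... and wait for it only as long as the facade allows.
            return future.result(timeout=self.timeout_seconds)
        except FutureTimeout:
            future.cancel()  # only has an effect if the task has not started yet
            return FALLBACK
```

The worker thread may still be stuck after the timeout, so this layer protects the user's response time, not the worker pool itself. That is why it complements the other three measures rather than replacing them.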
So, use the four pillars (timeouts, circuit breaker, bulkheads, and service-level thread pool) to develop fault-tolerant microservices. Please leave your questions in the comments section.