Microservices are important. They can lead to some pretty big wins for our architecture and teams, but microservices also have a lot of costs. As microservices, serverless, and other distributed system architectures become more prevalent in the industry, it's critical that we internalize their problems and strategies for solving them. In this article, we'll examine one example of the many thorny problems that network boundaries can introduce: timeouts.

Before you dread the term "distributed system," remember that even a small React app with a Node backend, or a simple iOS client talking to AWS Lambda, represents a distributed system. As you read this blog post, you're already participating in a distributed system that includes your Web browser, content delivery network, and file storage system.

In terms of background, I'm going to assume that you understand how to make API calls in your language of choice and handle their success and failure. It doesn't matter whether those API calls are synchronous or asynchronous, HTTP or not. If you come across unfamiliar terms or ideas, don't worry! I'm happy to have more discussions on Twitter or elsewhere, and I've tried to add links where appropriate.

The problem we're going to explore is this: if we have a very, very slow API call that eventually times out, and we simply assume either that (a) it succeeded or (b) it failed, we have a bug. Timeouts (or worse, infinitely long waits) are a fundamental fact of distributed systems, and we need to know how to deal with them.

The problem

Let's start with a thought experiment: Have you ever emailed a co-worker to ask them for something?

[Tuesday, 9:58 AM] You: "Hey, can you add me to our company's list of potential mentors?"
Colleague: "..."
[Friday, 2:30 PM] You: [?]

What should you do?

If you want your request to be fulfilled, you will eventually have to decide that no reply is coming. Will you wait longer? How long are you willing to wait?

Then, once you've decided how long to wait, what action do you take? Do you try sending the email again? Do you try a different communication medium? Do you assume they're not going to do it?

OK, so what is actually going on here? We would like to see simple request-response behavior, where you send a request and a reply comes back.

But something went wrong. There are several possibilities:

They never got the message.

They got the message, processed it successfully, and sent back a reply that never reached you (or went to your spam folder).

They got the message, but they're still working on it, or they lost it, or [gasp!] they forgot.

Ultimately, we just don't know!

This is exactly the problem that arises with any communication in a distributed system.

Our requests, their processing, or the responses may be delayed, and those delays may be arbitrarily long. So, as with the email example, we need an answer to the question "how long are we willing to wait?", and we call that duration a timeout.

If you take only one lesson from this article, let it be this: use timeouts. Otherwise, you run the risk of waiting forever for an operation that will never complete.
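
To make that concrete, here is a minimal Python sketch of putting a deadline on a call. The names `Outcome` and `call_with_timeout` are my own, made up for illustration, and `make_call` stands in for whatever remote call your application makes. The only library call involved is `asyncio.wait_for`, which cancels the awaited call once the timeout elapses. Note that a timeout is reported as a third outcome, neither success nor failure.

```python
import asyncio
from enum import Enum


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    UNKNOWN = "unknown"  # timed out: the remote side may or may not have acted


async def call_with_timeout(make_call, timeout_s):
    """Await make_call(), but give up after timeout_s seconds.

    make_call is a hypothetical zero-argument coroutine function that
    represents the remote call. Returns an (Outcome, payload) pair.
    """
    try:
        result = await asyncio.wait_for(make_call(), timeout=timeout_s)
        return Outcome.SUCCEEDED, result
    except asyncio.TimeoutError:
        # We stopped waiting. We do NOT know what happened remotely.
        return Outcome.UNKNOWN, None
    except Exception as exc:
        # The call reported a definite failure of its own.
        return Outcome.FAILED, exc
```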

But what do we do once we hit the timeout, the upper limit of waiting?

Methods

There are several common ways people handle timeouts in remote system calls. I don't claim this list is exhaustive, but it does cover many of the most common scenarios I've seen.

Method #1

When you get a timeout, assume it succeeded and move on.

Please don't do this. [1] Unfortunately, I have to say that this is a common unintentional choice, even in production applications, with some pretty bad UX consequences. If we assume the operation succeeded, our poor consumers will reasonably assume things went well, only to be disappointed and confused later when they find out what actually happened.

Anytime you make a network call, handle both success and failure. For example, if you're using an asynchronous API in JavaScript via Promise.then(...), ask yourself where the corresponding .catch(...) is. If it's missing, you almost certainly have a bug.

In some very special cases, you might legitimately not care whether the request succeeds or fails. UDP is a very successful protocol with this property. Besides, plenty of software is broken and keeps making money! But don't let this be your default; exhaust your other options first.

Method #2

For read requests, use a cached or default value.

This might be a good choice if your request is a read and you don't intend it to have any effect on the remote side. In this case, you can use a cached value from a previous successful request. Alternatively, you can use a default if there has been no successful request yet or caching doesn't make sense in your case. This approach is relatively simple: it doesn't add much performance overhead or implementation complexity. But keep in mind that if you're using an out-of-process cache accessed over the network (e.g. memcached, Redis, etc.), then you're back in the same situation, where requests to the cache itself might time out.
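
A minimal sketch of this fallback in Python might look like the following. Everything here is hypothetical: `read_with_fallback` is a name I made up, `fetch` stands in for your remote read and is assumed to raise `TimeoutError` when it times out, and the cache is just an in-process dictionary.

```python
from typing import Callable, Dict, TypeVar

V = TypeVar("V")


def read_with_fallback(
    fetch: Callable[[str], V],  # hypothetical remote read; raises TimeoutError
    cache: Dict[str, V],        # in-process cache of the last good values
    key: str,
    default: V,
) -> V:
    """Return a fresh value if possible, else the last good one, else default."""
    try:
        value = fetch(key)
    except TimeoutError:
        # The read timed out: serve stale data rather than failing the caller.
        return cache.get(key, default)
    cache[key] = value  # remember the last successful read
    return value
```

Only reads get this treatment. A write that times out cannot be papered over with a cached value.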

Method #3

When you encounter a timeout, assume the remote operation failed and retry automatically.

This raises more questions:

What if it's not safe to retry? Is it merely annoying for the service on the other end of the network connection to get duplicates? Or are you double charging someone's credit card? (!)
Should you retry synchronously or asynchronously?
If you retry synchronously, those retries are going to slow you down from the consumer's point of view. Is there any chance that you'll fail to meet their expectations? This is especially important in services, as opposed to end-user applications.
If you retry asynchronously, what are you telling your consumers about the success of the operation? Do you retry requests one at a time, or do you retry in batches over a period of time?
How many times should you retry? (Once? Twice? 10 times? Until success?)
How long should you wait between retries? (Exponential backoff [e.g., 1s, 2s, 4s, 8s, 16s, ...] capped at a maximum delay? With jitter?)
If the remote server has performance issues due to overload, will retries make them worse?
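
To illustrate the "how many times" and "how long between retries" questions above, here is one possible sketch of a synchronous retry loop with capped exponential backoff and jitter. It assumes the operation is safe to retry and that it signals a timeout by raising `TimeoutError`. The function name, parameter names, and default values are all my own choices, not recommendations.

```python
import random
import time


def retry_on_timeout(
    operation,            # hypothetical zero-argument callable; raises TimeoutError
    max_attempts=5,       # bounded, so we never retry forever
    base_s=1.0,           # first backoff ceiling: 1s, then 2s, 4s, ...
    cap_s=16.0,           # upper bound on any single delay
    sleep=time.sleep,     # injectable so tests don't really sleep
    rng=random.random,    # injectable source of jitter in [0, 1)
):
    """Call operation(), retrying on TimeoutError with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            ceiling = min(cap_s, base_s * 2 ** attempt)
            # Random delay up to the ceiling, so that many clients
            # don't all retry at the same instant.
            sleep(rng() * ceiling)
```

The jitter and the cap are there for the overloaded-server case: retries that arrive in synchronized waves can make an overload worse.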

If a remote API can safely be retried, we call it idempotent. Without the idempotent property, you could create duplicate data (as in the case of credit card charges) or cause a race condition (e.g., if you try to change your email address twice and a retry of the first request lands after the second completes).

Making automatic retries safe can require significant architectural effort in many cases. However, if you can safely retry (for example, by sending a UUID with each request and having the remote side keep track of these), things become very, very simple. Check out the Stripe API for a good example of this in action.
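
Here is a toy sketch of the request-UUID idea. This is not Stripe's API; `PaymentService` and `new_idempotency_key` are hypothetical names, and the "remote side" is an in-memory class so the example can run on its own. A real service would need durable storage for the keys and would have to handle concurrent duplicates.

```python
import uuid


def new_idempotency_key() -> str:
    """The client generates one key per logical operation, not per attempt."""
    return str(uuid.uuid4())


class PaymentService:
    """In-memory stand-in for a remote service that deduplicates requests."""

    def __init__(self):
        self._results = {}  # idempotency key -> receipt already returned
        self.charges = []   # side effects actually performed

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        if idempotency_key in self._results:
            # A retry of a request we already processed: replay the result
            # instead of charging again.
            return self._results[idempotency_key]
        self.charges.append(amount_cents)
        receipt = {"charge_id": len(self.charges), "amount_cents": amount_cents}
        self._results[idempotency_key] = receipt
        return receipt
```

The important detail is on the client: the key is created once and reused for every retry of the same charge, so a retry after a timeout cannot charge twice.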

Method #4

Check if the request was successful, and try again if it is safe.

The idea here is that, in some cases, we can follow up a timed-out request with another request that asks about the status of the original. This approach obviously requires the existence of an endpoint that can give us the information we want. Given such an endpoint, if it says our request was successful, we know for certain that we don't need to retry.

But there's a serious problem here: if the endpoint doesn't confirm success, we still can't really know whether it's safe to retry. The remote service may have received the request and still be processing it, in which case the status endpoint we're checking won't yet be able to confirm success. And of course, the check itself may time out! It's possible that the remote server is completely unreachable for the same reasons as the initial failure, but even if that's true, we still have no way of knowing whether the problem occurred before or after the initial request was processed.
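
The decision logic this implies can be sketched as below. The status strings, the `Decision` enum, and `check_status` are all hypothetical; `check_status` stands in for a call to the status endpoint and is assumed to raise `TimeoutError` if it times out too. The point of the sketch is that only a confirmed success settles the matter. Anything else is still "unknown", so we retry only when the original request is idempotent.

```python
from enum import Enum


class Decision(Enum):
    DONE = "done"          # confirmed success, nothing more to do
    RETRY = "retry"        # outcome unknown, but retrying is safe
    ASK_USER = "ask_user"  # outcome unknown and retrying is not safe


def decide_after_timeout(check_status, request_id, is_idempotent):
    """Decide what to do about a request that timed out."""
    try:
        status = check_status(request_id)
    except TimeoutError:
        status = None  # the status check told us nothing
    if status == "succeeded":
        return Decision.DONE
    # "not found" or "in progress" does not prove failure: the original
    # request may still be queued or in flight on the remote side.
    if is_idempotent:
        return Decision.RETRY
    return Decision.ASK_USER
```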

Method #5

Give up and let the user figure it out.

This requires the least amount of effort and arguably prevents us from making bad decisions, so in many cases this may be the best option. But we also need to ask ourselves: can our users figure out the right thing to do? Do they have enough information and insight into other systems to determine how to move forward?

In some cases, it may be best to let our consumers know about the issue. With any method that involves retries, we may still end up falling back on this option, unless we want to allow an infinite number of retries!

In conclusion

So at this point, things may look bleak. Distributed systems are hard, and it seems we can't just pick one of these solutions as a panacea. If you feel defeated, take heart, and don't let the perfect be the enemy of the good.

Use timeouts.

Every network request should have some timeout, even if the timeout is long, like 5 seconds, 10 seconds, or [gulp!] even more. Choosing a timeout can be tricky: you don't want too many false positives, where you give up on requests that would eventually have succeeded, and you don't want to waste too much time waiting and risk an unhealthy application. You can determine a good value by looking at the distribution and trends of historical request latencies and at your application's own performance guarantees or risk profile.

Under no circumstances do we want our application server's queue, connection pool, ring buffer, or whatever the bottleneck is to be clogged with something that will wait forever. You can definitely research and add fancier stuff like circuit breakers and bulkheads depending on your production needs, but timeouts are cheap and well supported by libraries. Use them!
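
As one rough way to turn historical latencies into a starting value, you could take a high percentile and add some headroom. This is only a sketch under my own assumptions: the function name, the 99th percentile, and the 1.5x headroom are illustrative choices, and the result should still be checked against your application's performance guarantees.

```python
import math


def suggest_timeout(latencies_s, percentile=99.0, headroom=1.5):
    """Suggest a timeout from observed request latencies, in seconds.

    Uses the nearest-rank percentile of the samples, multiplied by a
    headroom factor so that ordinary slow requests still succeed.
    """
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    # Nearest-rank method: the smallest sample with at least
    # `percentile` percent of the samples at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1] * headroom
```

For example, with 98 requests at 0.1s, one at 0.5s, and one outlier at 2.0s, the 99th percentile is 0.5s, so the suggestion is 0.75s. The single 2.0s outlier would time out, which is the false-positive tradeoff described above.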

Make retries safe by default.

Besides making your code simpler and safer, this gives you an excuse to say "idempotency" a lot, which is fun.

Consider delegating work differently.

Asynchronous messaging has some attractive properties here, because your remote service no longer needs to be fast and available; only your message broker does. However, messaging/asynchronicity is not a panacea: you still need to make sure the broker receives the message. Unfortunately, this can be difficult! Message brokers also have tradeoffs. Your users will also have their own ideas about when they need to retry. For example, if there is a delay in message processing, they may decide to resubmit because their order has not yet shown up in the order history. Similar issues can arise with distributed logging/streaming platforms. If you're considering the messaging route (even if you're not!), take a close look at Enterprise Integration Patterns. Despite its age, the patterns in it are extremely relevant to today's architectures.

And at the risk of being a party pooper, don't forget that you may be able to move or remove that network boundary entirely! There's no shame in turning a difficult problem into an easy one. So maybe you can use one network request instead of five, or you can merge two services into one. Or maybe you take one of the approaches above to handle timeouts in a reliable and safe manner. Whichever way you choose, remember that your users don't care if you use microservices - they just want things to work.

This article: https://architect.pub/microservices-arent-magic-handling-timeouts

[Microservice architecture design] Microservices are not magic: processing timeout