Today, even small startups may have to process terabytes of data or build services that support hundreds of thousands of events per minute (or even a second!). The so-called "scale" usually refers to the large number of requests/data/events that the system should handle in a short period of time.
Attempting to handle that scale naively is doomed to fail at worst, or expensive at best.
This article will describe some principles and design patterns that enable systems to handle large scale. When we talk about large (and mostly distributed) systems, we usually judge their quality and stability by looking at three properties:
- Availability: The system should be as available as possible. Uptime percentages are key to customer experience, not to mention that an app is useless if no one can use it. Availability is usually measured in "nines" (e.g. 99.9% uptime).
- Performance: Even under heavy load, the system should continue to run and perform its tasks. Moreover, speed is critical to customer experience: experiments show that it's one of the most important factors in preventing churn!
- Reliability: The system should process data accurately and return correct results. A reliable system does not fail silently, return incorrect results, or create corrupted data. A reliable system is built in such a way that it strives to avoid failures, and when that's not possible, it detects, reports, and possibly even attempts to repair them automatically.
We can scale the system in two ways:
- Vertical scaling (scale up): Deploy the system on a more powerful server, i.e. a machine with a stronger CPU, more RAM, or both.
- Horizontal scaling (scale out): Deploy the system on more servers, i.e. launch more instances or containers, enabling the system to serve more traffic or process more data/events.
Scaling up is generally not advisable, mainly for two reasons:
- It usually requires some downtime.
- There are limits: we can't scale up "forever".
On the other hand, for a system to scale out, it must have certain properties that allow such expansion. For example, to scale horizontally, the system must be stateless (which is why most traditional databases cannot easily scale out).
The purpose of this article is to give you a taste of the many design patterns and principles that enable systems to scale out while maintaining reliability and resiliency. Given the breadth of the topic, I can't delve deeply into each one, but only provide an overview. That said, within each topic I try to add helpful links to more comprehensive resources.
So let's dig in!
Idempotency
The term is borrowed from mathematics, where it is defined as:
f(f(x)) = f(x)
This might seem a little scary at first, but the idea behind it is simple: no matter how many times we apply the function f to x, we get the same result. This property gives a system great stability, as it allows us to simplify our code and also makes our operational life easier: failed HTTP requests can be retried, and crashed processes can be restarted, without worrying about side effects.
Additionally, a long-running job can be split into multiple parts, each of which is itself idempotent, meaning that when the job crashes and restarts, all already-executed parts are skipped (recoverability).
Embrace Async
When we make a synchronous call, the execution path is blocked until a response is returned. This blocking has resource overhead, mainly the cost of memory and context switching. We can't always design our systems using only asynchronous calls, but when we can, we make the system more efficient. A good example of the efficiency/performance async can provide is Node.js, which has a single-threaded event loop yet competes successfully with many multithreaded languages and frameworks.
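To make the gain concrete, here is a small sketch (using Python's asyncio rather than Node.js) in which three simulated I/O calls run concurrently, so total latency is roughly that of the slowest call rather than the sum of all three:

```python
import asyncio

async def fetch(name: str, delay: float) -> str:
    """Stand-in for a network call that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return name

async def main() -> list[str]:
    # All three "requests" are in flight at the same time; gather
    # preserves the order of its arguments in the result list.
    return await asyncio.gather(fetch("a", 0.1), fetch("b", 0.1), fetch("c", 0.1))

results = asyncio.run(main())
```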
Health Checks
This pattern is specific to microservices: each service should expose a /health route that returns quickly when the service is up and running. It should return HTTP 200 if all is well, and 500 if the service is failing. Some bugs won't be caught by health checks, but assuming a system under stress behaves poorly and becomes latent, that will be reflected in the health checks as well (they too will become more latent), which helps us identify problems and automatically generate alerts that on-call engineers can respond to. We can also choose to temporarily remove the node from the fleet (see service discovery below) until it is stable again.
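A health handler can be as small as the sketch below; `check_db` is a hypothetical stand-in for real dependency probes, and the `(status, body)` tuple abstracts away any particular web framework:

```python
def check_db() -> bool:
    """Stand-in for a real connectivity probe against a dependency."""
    return True

def health() -> tuple[int, str]:
    """Return (HTTP status, body) for a /health route."""
    if check_db():
        return 200, "OK"       # all dependencies reachable
    return 500, "DB unreachable"  # load balancer will pull this node
```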
Circuit Breaker
A circuit breaker is a term borrowed from the world of electricity: when the circuit is closed, current is flowing, and when it is open, the flow stops.
When a dependency is unreachable, all requests to it will fail. Following the fail-fast principle, we want our system to fail fast when it makes such a call, rather than wait until the call times out. This is a good use case for the circuit breaker design pattern: by wrapping a call to a function with a circuit breaker, the breaker recognizes when calls to a specific destination (e.g. a specific IP) fail, and starts failing calls immediately without actually making them, causing the system to fail fast.
The circuit breaker will maintain a state (open/closed) and refresh its state by retrying the actual call every so often.
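The idea can be sketched as a small wrapper class. This is a toy illustration (parameter names and thresholds are my own assumptions, not Hystrix's API): after `max_failures` consecutive failures the circuit opens and calls fail fast; after `reset_after` seconds one trial call is let through again:

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # timestamp when the circuit opened

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure count
        return result
```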
The circuit breaker implementation was introduced and widely adopted in Netflix's Hystrix library, and is now common in other libraries as well.
Kill Switch / Feature Flag
Another common practice today is to perform a "silent deployment" of new functionality. This is done by guarding the functionality with an if condition that checks whether a feature flag is enabled (or, alternatively, whether the associated kill-switch flag is disabled). This practice doesn't guarantee 100% that our code is bug-free, but it does reduce the risk of deploying new bugs to production. Also, if the feature flag is enabled and we spot a new bug in the system, it's easy to disable the flag and "return to normal", which is a huge win from an operations standpoint.
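In code, the guard is just a lookup; in production the flag values would come from a config service rather than the hard-coded dict used in this sketch (the flag and function names are illustrative):

```python
# Feature-flag sketch: flipping the flag switches code paths without a
# redeploy; setting it back to False acts as the kill switch.
FLAGS = {"new_checkout": False}

def checkout(cart_total: int) -> str:
    if FLAGS.get("new_checkout", False):
        return f"new flow: {cart_total}"  # silently deployed code path
    return f"old flow: {cart_total}"      # known-good code path
```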
Bulkhead
A bulkhead is a dividing wall or barrier between compartments in a ship's hull. Its job is to isolate an area in case the hull is breached, preventing water from flooding the entire ship (it will only flood the compartment with the hole).
The same principle can be applied to software by building it with modularity and isolation in mind. One example is thread pools: we can create separate thread pools for different components to ensure that a bug that exhausts all the threads in one of them does not affect the other components.
Another good example is making sure different microservices don't share the same database. We also avoid shared configuration: different services should have their own configuration settings, even if that requires some duplication, to avoid situations where a configuration error in one service affects other services.
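The thread-pool bulkhead can be sketched in a few lines: each downstream dependency gets its own bounded pool, so a slow or hung dependency can only exhaust its own workers (the dependency names here are invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

# One bounded pool per dependency: if "payments" hangs and uses up its 4
# workers, "search" still has its own 4 workers and keeps serving.
payments_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="payments")
search_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="search")

def call_payments(x: int) -> int:
    return payments_pool.submit(lambda: x * 2).result()  # simulated work

def call_search(x: int) -> int:
    return search_pool.submit(lambda: x + 1).result()    # simulated work
```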
Service Discovery
In a dynamic microservices world, where instances/containers come and go, we need a way to know when nodes join or leave the fleet. Service discovery (also known as a service registry) is a mechanism that solves this problem by allowing nodes to register themselves in a central location, like the Yellow Pages. This way, when service B wants to call service A, it first calls service discovery to request a list of available nodes (IPs), which it caches and uses for a while.
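The client side of this can be sketched as a lookup with a TTL cache; `registry_lookup` stands in for the real registry call (e.g. an HTTP request to Eureka or Consul), and the TTL value is an arbitrary example:

```python
import time

def registry_lookup(service: str) -> list[str]:
    """Stand-in for a call to the real service registry."""
    return ["10.0.0.1", "10.0.0.2"]

class DiscoveryClient:
    def __init__(self, ttl: float = 30.0):
        self.ttl = ttl
        self._cache: dict[str, tuple[float, list[str]]] = {}

    def nodes(self, service: str) -> list[str]:
        hit = self._cache.get(service)
        if hit and time.time() - hit[0] < self.ttl:
            return hit[1]  # cache is fresh: skip the registry round-trip
        nodes = registry_lookup(service)
        self._cache[service] = (time.time(), nodes)
        return nodes
```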
Timeouts, sleeps and retries
Any network can suffer from transient errors, delays, and congestion issues. When service A calls service B, the request may fail, and if a retry is initiated, the second request may succeed. That said, it's important not to implement retries naively (in a tight loop), but rather to "bake in" a delay mechanism between retries (aka "sleep"). The reason is that we should be considerate of the called service: there may be multiple other services calling service B at the same time, and if they all keep retrying, the result will be a "retry storm": service B will be bombarded with requests, which may overwhelm it and bring it down. To avoid retry storms, it is common practice to use an exponential backoff retry mechanism, which introduces an exponentially growing delay between retries and eventually a "timeout" that stops any further retries.
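A sketch of exponential backoff with jitter is below; the attempt limit, base delay, and jitter scheme are illustrative defaults, not a library's API (in practice you would use something like `tenacity` or your RPC framework's built-in retries):

```python
import random
import time

def retry(fn, max_attempts: int = 5, base_delay: float = 0.1):
    """Call fn(), retrying with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the final failure
            delay = base_delay * (2 ** attempt)           # 0.1, 0.2, 0.4, ...
            time.sleep(delay + random.uniform(0, delay))  # jitter de-syncs callers
```

The jitter matters: without it, clients that failed at the same moment retry at the same moment, re-creating the retry storm in synchronized waves.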
Fallbacks
Sometimes we just need a "Plan B". Suppose we are using a recommendation service in order to get the best and most accurate recommendations for our customers. But what can we do when a service is down or temporarily inaccessible?
We could have another service as a fallback: that service might keep a snapshot of all our customers' recommendations, refresh itself every week, and, when called, simply return the relevant record for the particular customer. This information is static and easy to cache and serve. These fallback recommendations are admittedly a bit stale, but it's much better to serve recommendations that aren't completely up to date than nothing at all.
Good engineers consider these options when building systems!
Note that circuit breaker implementations may include the option to provide a fallback service!
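The recommendation example above can be sketched as a try/except wrapper; `live_recommendations`, the snapshot dict, and the simulated outage are all invented for illustration:

```python
# Weekly snapshot of per-customer recommendations, e.g. refreshed by a
# batch job; serves as the static "Plan B" data source.
SNAPSHOT = {"alice": ["book-1", "book-2"]}

def live_recommendations(user: str) -> list[str]:
    raise ConnectionError("recommendation service down")  # simulated outage

def recommendations(user: str) -> list[str]:
    try:
        return live_recommendations(user)  # best and freshest results
    except Exception:
        return SNAPSHOT.get(user, [])      # a bit stale, but better than nothing
```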
Metrics, Monitoring and Alerting
When running a large-scale system, it is not a question of if the system will fail, but when: due to the sheer scale, even a one-in-a-million event will eventually happen.
Now that we understand and accept that mistakes are "part of life," we must figure out the best way to deal with them.
In order to have a reliable, usable system, we need to be able to detect (MTTD, mean time to detect) and repair (MTTR, mean time to repair) bugs quickly, and for that we need observability into the system. This is accomplished by publishing metrics, monitoring those metrics, and raising alerts when our monitoring systems detect metrics that are "off".
Google defines four metrics as the "golden signals", but that doesn't mean we shouldn't publish other metrics. We can divide metrics into three buckets:
- Business metrics: metrics derived from the business context; for example, we may publish a metric every time an order is placed, approved, or canceled.
- Infrastructure metrics: metrics that measure the size/usage of parts of our infrastructure; for example, we can monitor the CPU usage, memory, and disk space used by our applications.
- Feature metrics: metrics that publish information about specific features in our system; an example could be a metric published by an A/B test we are running, providing insights on users assigned to the different cells of the experiment.
Small anecdote: During my days working at Netflix, one of the things my team and I did was develop Watson to enable teams to automatically remediate their services from known scenarios by creating programmatic runbooks!
Rate Limiting
Rate limiting, or throttling, is another useful pattern that can help reduce stress on the system.
There are three types of throttling:
- Client-side (user) rate limiting
- Server-side rate limiting, and
- Geographic rate limiting
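A common way to implement server-side rate limiting is a token bucket, sketched below: `rate` tokens per second accumulate up to `capacity`, and a request is allowed only if a token is available (the parameters are illustrative; production systems often keep the bucket state in something like Redis so all instances share one limit):

```python
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # burst size
        self.tokens = capacity      # start full
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # request admitted
        return False      # over the limit: reject (e.g. with HTTP 429)
```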
Backpressure
Backpressure is a technique used to handle situations where the request load from an upstream service is higher than it can handle. One way to deal with backpressure is to signal the upstream service that it should rate limit itself.
There is a dedicated HTTP response code 429 "Too Many Requests" which is intended to signal to the client that the server is not ready to accept more requests at the current rate. Such responses typically return a Retry-After header to indicate how many seconds the client should wait before retrying.
Two other ways to deal with backpressure are dropping requests (aka "throwing them on the floor") and buffering.
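Server-side, signaling backpressure can be sketched as an in-flight counter: once requests in flight exceed a limit, the server answers 429 with a Retry-After header instead of queueing unboundedly. The limit and the framework-free `(status, headers)` shape are assumptions for illustration:

```python
MAX_IN_FLIGHT = 100  # illustrative capacity limit
in_flight = 0

def handle_request() -> tuple[int, dict]:
    global in_flight
    if in_flight >= MAX_IN_FLIGHT:
        # Shed load and tell the client when to come back.
        return 429, {"Retry-After": "1"}
    in_flight += 1
    try:
        return 200, {}  # normal processing would happen here
    finally:
        in_flight -= 1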
Canary Release
A canary release is a technique for gradually rolling out changes to a production environment: the new version is first deployed to a small subset of nodes (the "canary") while being monitored closely. When the monitoring system finds a problem, the canary is rolled back automatically with minimal damage to production traffic.
Remember that in order to enable canary releases, we need to be able to monitor the canary cluster separately from the "normal" nodes. We can then use the fleet of "regular" nodes as a baseline and compare its metrics to those we receive from the canary. For example, we can compare the rate of 500 errors in both, and if the canary produces a higher error rate, roll it back.
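The rollback decision can be sketched as a simple rate comparison; the 1.5x tolerance threshold is an arbitrary assumption, and real canary analysis (e.g. Kayenta, mentioned below) uses statistical tests over many metrics rather than a single ratio:

```python
def should_rollback(canary_errors: int, canary_total: int,
                    baseline_errors: int, baseline_total: int,
                    tolerance: float = 1.5) -> bool:
    """Roll back if the canary's error rate exceeds baseline by `tolerance`x."""
    canary_rate = canary_errors / max(canary_total, 1)
    baseline_rate = baseline_errors / max(baseline_total, 1)
    return canary_rate > baseline_rate * tolerance
```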
A more conservative approach is to use shadow traffic in production as a canary.
A former colleague of mine, Chris Sanden, co-authored a good article on Kayenta: A Tool for Automated Canary Analysis Developed at Netflix.
That's all for today; I hope you learned something new!
If you think I've missed an important pattern/principle - please write a comment and I'll add it.
Original article: https://architect.pub/design-patterns-and-principles-support-large-scale-systems