From Zero to Millions of Users: The Process of Building a System That Supports Scale


Designing a system that supports millions of users is challenging, and it's a journey that requires constant refinement and continuous improvement. In this chapter, we'll build a system that supports a single user and scales up to serve millions of users. After reading this chapter, you will have some techniques to help you solve system design interview questions.

Single Server Setup

A journey of a thousand miles begins with a single step, and so does building a complex system. To start with something simple, we run everything on a single server. Figure 1-1 shows the schematic of a single-server setup, where everything runs on one server: web application, database, cache, etc.

[Figure 1-1: single server setup]

To understand this setup, it helps to investigate the request flow and traffic sources. Let's first look at the request flow (Figure 1-2).

[Figure 1-2: request flow]

  1. A user accesses a website through a domain name, such as api.mysite.com. Typically, Domain Name System (DNS) is a paid service provided by a third party and not hosted on our servers.
  2. The Internet Protocol (IP) address is returned to the browser or mobile application. In this example, the returned IP address is 15.125.23.214.
  3. Once an IP address is obtained, a Hypertext Transfer Protocol (HTTP)[1] request is sent directly to your web server.
  4. The web server returns an HTML page or a JSON response for rendering. The whole flow is sketched in code below.
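The same four steps can be sketched with Python's standard library. This is purely illustrative: api.mysite.com is the fictional domain from the steps above and will not actually resolve.

```python
import socket
import urllib.request

# Steps 1-2: the domain name is resolved through DNS and an IP address
# such as 15.125.23.214 is returned.
ip_address = socket.gethostbyname("api.mysite.com")

# Step 3: an HTTP request is sent directly to the web server.
response = urllib.request.urlopen("http://api.mysite.com/users/12")

# Step 4: the web server returns an HTML page or a JSON response.
body = response.read()
print(ip_address, body[:100])
```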

Next, let's look at traffic sources. Traffic to your web server comes from two sources: web applications and mobile applications.

  • Web application: it uses a combination of server-side languages (Java, Python, etc.) to handle business logic, storage, etc., and client-side languages (HTML and JavaScript) for presentation.
  • Mobile application: HTTP is the communication protocol between the mobile app and the web server. Because of its simplicity, JavaScript Object Notation (JSON) is commonly used as the API response format to transfer data. Here is an example of an API response in JSON format:

GET /users/12 – retrieves the user object with id 12

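The original figure showed the returned user object; a representative response body (the field names here are illustrative) might look like this:

```json
{
  "id": 12,
  "firstName": "John",
  "lastName": "Smith",
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "postalCode": 10021
  },
  "phoneNumbers": ["212 555-1234", "646 555-4567"]
}
```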

Database

With the growth of the user base, one server is no longer enough, and we need multiple servers: one for web/mobile traffic and one for the database (Figure 1-3). Separating web/mobile traffic (web tier) from the database (data tier) allows them to be scaled independently.

[Figure 1-3: separating the web tier and the data tier]

Which database to use?

You can choose a traditional relational database or a non-relational database. Let's see how they differ.

A relational database is also known as a relational database management system (RDBMS) or SQL database. The most popular ones are MySQL, Oracle database, PostgreSQL, etc. Relational databases use tables and rows to represent and store data. You can use SQL to perform join operations between different database tables.

Non-relational databases are also known as NoSQL databases. Popular ones include CouchDB, Neo4j, Cassandra, HBase, Amazon DynamoDB, etc. [2]. These databases fall into four categories: key-value stores, graph stores, column stores, and document stores. Non-relational databases generally do not support join operations.

For most developers, relational databases are the best choice because they have been around for over 40 years and have historically performed well. However, exploring beyond relational databases is critical if a relational database is not suitable for your particular use case. A non-relational database might be the right choice if:

  • Your application requires ultra-low latency.
  • Either your data is unstructured, or you don't have any relational data.
  • You just need to serialize and deserialize the data (JSON, XML, YAML, etc.).
  • You need to store a lot of data.

Vertical Scaling Versus Horizontal Scaling

Vertical scaling, also known as "scale-up", means adding more power (CPU, RAM, etc.) to your server. Horizontal scaling, also known as "scale-out", allows you to scale by adding more servers to your pool of resources.

Vertical scaling is a great option when traffic is low, and the simplicity of vertical scaling is its main advantage. Unfortunately, it also has some serious limitations.

  • There is a hard limit to vertical scaling. It is impossible to add unlimited CPU and memory to a single server.
  • Vertical scaling does not have failover and redundancy. If one server goes down, the website/app goes down with it completely.

Due to the limitations of vertical scaling, horizontal scaling is more desirable for large-scale applications.

In the previous design, users connected to the web server directly. Users will be unable to access the website if the web server is offline. In another scenario, if many users access the web server simultaneously and it reaches the server's load limit, users generally experience slower responses or fail to connect at all. A load balancer is the best technique to address these problems.

Load Balancer

A load balancer evenly distributes incoming traffic among the web servers defined in a load-balanced set. Figure 1-4 shows how a load balancer works.

[Figure 1-4: load balancer]

As shown in Figure 1-4, users directly connect to the public IP of the load balancer. With this setup, the web server can no longer be directly accessed by clients. For better security, private IPs are used for communication between servers. A private IP is an IP address that is only accessible between servers on the same network, not over the Internet. The load balancer communicates with the web server through the private IP.
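As a rough illustration of how the load balancer "evenly distributes" requests, here is a toy round-robin scheduler in Python. Real load balancers (NGINX, AWS ELB, etc.) also perform health checks and connection draining; the private IPs below are made up.

```python
import itertools

# Private IPs of the web servers behind the load balancer (made up).
servers = ["10.0.0.1", "10.0.0.2"]
rotation = itertools.cycle(servers)

def route_request() -> str:
    """Pick the next server in round-robin order."""
    return next(rotation)

for i in range(4):
    print(f"request {i} -> {route_request()}")
# request 0 -> 10.0.0.1, request 1 -> 10.0.0.2, and so on.
```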

In Figure 1-4, after a load balancer and a second web server are added, we have successfully solved the no-failover issue and improved the availability of the web tier. The details are as follows:

  • If server 1 goes offline, all traffic will be routed to server 2. This prevents website downtime. We can also add a new healthy web server to the server pool to balance the load.
  • If website traffic grows rapidly and two servers cannot handle the traffic, a load balancer can gracefully handle this. You just need to add more servers to the web server pool and the load balancer will automatically start sending requests to them.

Now the web tier looks fine, what about the data tier? The current design has only one database, so failover and redundancy are not supported. Database replication is a common technique for solving these problems. Let's take a look.

Database Replication

Quoting from Wikipedia: "Database replication can be used in many database management systems, typically to establish a master/slave relationship between an original database (the master database) and a copy (the slave database)." [3]

A master database generally only supports write operations. A slave database gets copies of the data from the master database and only supports read operations. All data-modifying commands like insert, delete, or update must be sent to the master database. Most applications require a much higher ratio of reads to writes; thus, the number of slave databases in a system is usually larger than the number of master databases. Figure 1-5 shows a master database with multiple slave databases.

[Figure 1-5: database replication with one master and multiple slaves]
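In application code, the read/write split can look roughly like the sketch below. The Connection class is a stand-in for a real database connection; in practice a driver or proxy usually handles this routing.

```python
import random

class Connection:
    """Stand-in for a real database connection (hypothetical)."""
    def __init__(self, host: str) -> None:
        self.host = host
    def execute(self, sql: str) -> str:
        return f"{self.host} ran: {sql}"

# One master for writes, several slaves for reads (host names made up).
master = Connection("master-db.internal")
slaves = [Connection("slave-db-1.internal"), Connection("slave-db-2.internal")]

WRITE_PREFIXES = ("insert", "update", "delete")

def run_query(sql: str) -> str:
    """Route data-modifying statements to the master, reads to a random slave."""
    if sql.lstrip().lower().startswith(WRITE_PREFIXES):
        return master.execute(sql)
    return random.choice(slaves).execute(sql)

print(run_query("SELECT * FROM users WHERE id = 12"))           # served by a slave
print(run_query("UPDATE users SET name = 'x' WHERE id = 12"))   # served by the master
```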

Advantages of database replication:

  • Better performance: In the master-slave model, all write operations and update operations occur on the master node, while read operations are distributed among the slave nodes. This model improves performance because it allows more queries to be processed in parallel.
  • Reliability: If one of your database servers is destroyed by a natural disaster such as a typhoon or earthquake, the data is still preserved. You don't need to worry about data loss because data is replicated to multiple locations.
  • High Availability: By replicating data in different locations, even if one database goes offline, your website can still function because you can access the data stored in another database server.

In the previous section, we discussed how load balancers can help increase the availability of a system. We ask the same question here: what if one of the databases goes offline? The architectural design discussed in Figure 1-5 handles this situation:

  • If only one slave database is available and it goes offline, read operations will be directed to the master database temporarily. As soon as the issue is found, a new slave database will replace the old one. If multiple slave databases are available, read operations are redirected to other healthy slave databases, and a new database server will replace the old one.
  • If the master database goes offline, a slave database will be promoted to be the new master. All database operations will temporarily be executed on the new master database. A new slave database will replace the old one for data replication immediately. In production systems, promoting a new master is more complicated, as the data in a slave database might not be up to date. The missing data needs to be updated by running data recovery scripts. Although some other replication methods like multi-master and circular replication could help, those setups are more complicated, and their discussion is beyond the scope of this book. Interested readers can refer to the listed references [4] [5].

Figure 1-6 shows the system design with the addition of a load balancer and database replication.

[Figure 1-6: design with a load balancer and database replication]

Let's take a look at the design:

  • The user obtains the IP address of the load balancer from DNS.
  • Users use this IP address to connect to the load balancer.
  • HTTP requests are routed to either Server 1 or Server 2.
  • The web server reads user data from the database.
  • The web server routes any data-modifying operation to the master database. This includes write, update, and delete operations.

Now that you have a solid understanding of the web tier and the data tier, it is time to improve the load/response time by adding a cache layer and shifting static content (JavaScript/CSS/image/video files) to a Content Delivery Network (CDN).

Cache

A cache is a temporary storage area used to store expensive response results or frequently accessed data in memory so that subsequent requests can be served more quickly. As shown in Figure 1-6, each time a new web page is loaded, one or more database calls are made to fetch data. Repeated calls to the database can severely impact application performance. Caching can alleviate this problem.

Cache Tier

The cache tier is a temporary data store layer, much faster than the database. The benefits of having a separate cache tier include better system performance, the ability to reduce database workloads, and the ability to scale the cache tier independently. Figure 1-7 shows a possible cache server setup:

[Figure 1-7: cache server setup]

After receiving a request, the web server first checks whether the cache has the available response. If it does, it sends the data back to the client. If not, it queries the database, stores the response in the cache, and sends it back to the client. This caching strategy is called a read-through cache. Other caching strategies are available depending on the data type, size, and access patterns. A previous study explains how different caching strategies work [6]. Interacting with cache servers is simple because most cache servers provide APIs for common programming languages. The following code snippet shows typical Memcached usage:

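The original figure showed a typical set/get snippet. Here is a minimal sketch of the same idea using the pymemcache client (one of several Python Memcached libraries); the read-through helper and the query_database stub are illustrative additions.

```python
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))  # assumes a Memcached server on localhost

SECONDS = 1
cache.set("myKey", "hi there", expire=3600 * SECONDS)  # cache for one hour
print(cache.get("myKey"))  # b'hi there' on a cache hit

def query_database(user_id: int) -> bytes:
    return b'{"id": 12}'  # stand-in for a real database call

def get_user(user_id: int) -> bytes:
    """Read-through pattern: check the cache first, fall back to the database."""
    key = f"user:{user_id}"
    value = cache.get(key)
    if value is None:                        # cache miss
        value = query_database(user_id)      # fetch from the database
        cache.set(key, value, expire=3600)   # store for subsequent requests
    return value
```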

Considerations for Using a Cache

Here are a few issues you should consider when using a caching system:

  • Decide when to use caching. Consider using caching when data is frequently read but rarely modified. Since cached data is stored in volatile memory, cache servers are not suitable for persistent data. For example, if the cache server is restarted, all data in memory will be lost. Therefore, important data should be stored in persistent data storage.

  • Expiration policy: It is a good practice to implement an expiration policy. Once cached data expires, it is removed from the cache. When there is no expiration policy, cached data will be stored in memory permanently. It is advisable not to make the expiration time too short, as this will cause the system to reload data from the database too frequently. Meanwhile, it is advisable not to make the expiration time too long, as the data can become stale.

  • Consistency: This involves keeping data stores and caches in sync. Inconsistencies can occur because data modification operations on the data store and cache are not within a single transaction. Maintaining consistency between data stores and caches is challenging when scaling across multiple regions. For details, please refer to the paper titled "Scaling Memcache at Facebook" published by Facebook [7].

  • Mitigating failures: A single cache server represents a potential single point of failure (SPOF), which Wikipedia defines as follows: "A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working" [8]. As a result, multiple cache servers across different data centers are recommended to avoid a single point of failure. Another recommended approach is to overprovision the required memory by a certain percentage. This provides a buffer as memory usage increases.

    [Figure 1-8: multiple cache servers]

  • Eviction policy: Once the cache is full, any request to add items to the cache might cause existing items to be removed. This is called cache eviction. Least Recently Used (LRU) is the most popular cache eviction policy. Other eviction policies, such as Least Frequently Used (LFU) or First In First Out (FIFO), can be adopted to satisfy different use cases. A toy LRU implementation is sketched after this list.
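Below is a toy LRU cache built on Python's OrderedDict, assuming a fixed capacity. Production systems rely on the eviction built into Memcached/Redis rather than rolling their own; this sketch only illustrates the policy.

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the least recently used entry once capacity is exceeded."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None                      # cache miss
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def set(self, key, value) -> None:
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used

cache = LRUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")         # "a" is now the most recently used
cache.set("c", 3)      # evicts "b"
print(cache.get("b"))  # None
```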

Content Delivery Network (CDN)

A CDN is a network of geographically distributed servers used to deliver static content. CDN servers cache static content like images, videos, CSS, JavaScript files, etc.

Dynamic content caching is a relatively new concept and is beyond the scope of this book. It enables caching of HTML pages based on request path, query string, cookies and request headers. For more information, please refer to the article mentioned in reference [9]. This book focuses on how to use a CDN to cache static content.

Here's how a CDN works at a high level: When a user visits a website, the CDN server closest to the user will deliver the static content. Intuitively, the farther the user is from the CDN server, the slower the website will load. For example, if the CDN server is located in San Francisco, users in Los Angeles will get content faster than users in Europe. Figure 1-9 is a good example of how a CDN can improve load times.

[Figure 1-9: how a CDN improves load time]

Figure 1-10 shows the CDN workflow.

[Figure 1-10: CDN workflow]

  1. User A tries to fetch image.png using the image URL. The domain name of the URL is provided by the CDN provider. The following two image URLs are used to demonstrate sample image URLs on Amazon and Akamai CDNs:
    • https://mysite.cloudfront.net/logo.jpg
    • https://mysite.akamai.com/image-manager/img/logo.jpg
  2. If the CDN server does not have a cache for image.png, the CDN server requests the file from the source (which can be a web server or online storage like Amazon S3).
  3. The origin returns image.png to the CDN server, including an optional HTTP header Time-to-Live (TTL), which describes how long the image should be cached.
  4. The CDN caches the image and returns it to user A. Images are cached in the CDN until the TTL expires.
  5. User B sends a request to get the same image.
  6. As long as the TTL has not expired, the image is returned from the cache. (A toy model of this edge-cache logic follows.)
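Here is a toy model of steps 2 through 6, with the TTL-based edge cache reduced to a dictionary; the origin fetch and the TTL value are made up.

```python
import time

edge_cache: dict = {}  # url -> (content, expires_at)

def fetch_from_origin(url: str) -> tuple[bytes, int]:
    """Stand-in for the origin server; returns content plus a TTL in seconds."""
    return b"<bytes of image.png>", 86400

def cdn_get(url: str) -> bytes:
    entry = edge_cache.get(url)
    if entry is not None and entry[1] > time.time():
        return entry[0]                        # steps 5-6: TTL valid, serve from cache
    content, ttl = fetch_from_origin(url)      # steps 2-3: miss, ask the origin
    edge_cache[url] = (content, time.time() + ttl)  # step 4: cache until TTL expires
    return content

cdn_get("https://mysite.cloudfront.net/logo.jpg")  # first call hits the origin
cdn_get("https://mysite.cloudfront.net/logo.jpg")  # second call is served from cache
```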

Considerations for CDN Usage

  • Cost: The CDN is operated by a third-party provider, and you will be charged for data transfer in and out of the CDN. For infrequently used resources, there is no significant benefit from caching, so you should consider moving them off the CDN.
  • Set an appropriate cache expiration time: For time-sensitive content, setting a cache expiration time is very important. The cache expiration time should be neither too long nor too short. If the time is too long, the content may no longer be fresh. If the time is too short, it may cause repeated reloading of content from the origin server to the CDN.
  • CDN fallback: You should consider how your website/application will handle CDN failures. In the event of a temporary CDN outage, clients should be able to detect the problem and fetch resources from the origin.
  • Invalidate files: You can remove files from the CDN before they expire in the following ways:
    • Invalidate CDN objects using the API provided by the CDN provider.
    • Use object versioning to serve different versions of objects. To version an object, you can add parameters to the URL, such as a version number. For example, add version number 2 to the query string: image.png?v=2.

Figure 1-11 shows the design after adding the CDN and caching.

[Figure 1-11: design after adding a CDN and a cache]

  1. Static resources (JS, CSS, images, etc.) are no longer served by the web server. They are fetched from CDN for better performance.
  2. The load on the database is reduced by caching data.

Stateless Web Tier

Now it is time to consider scaling the web tier horizontally. For this, we need to move state (for instance, user session data) out of the web tier. A good practice is to store session data in persistent storage such as a relational database or a NoSQL database. Each web server in the cluster can access state data from the database. This is called a stateless web tier.

Stateful Architecture

There are some key differences between stateful and stateless servers. A stateful server remembers client data (state) from one request to the next. A stateless server does not store any state information.

Figure 1-12 shows an example of a stateful architecture.

[Figure 1-12: stateful architecture]

In Figure 1-12, user A's session data and profile picture are stored in server 1. To authenticate User A, the HTTP request must be routed to Server 1. If the request is sent to some other server, say server 2, the authentication will fail because server 2 does not contain user A's session data. Likewise, all HTTP requests from User B must be routed to Server 2; all requests from User C must be sent to Server 3.

The problem is that every request from the same client must be routed to the same server. In most load balancers, this can be achieved with sticky sessions [10]; however, this adds overhead. Adding or removing servers is much more difficult with this approach, and handling server failures is also challenging.

Stateless Architecture

Figure 1-13 shows the stateless architecture.

[Figure 1-13: stateless architecture]

In this stateless architecture, a user's HTTP request can be sent to any web server, which fetches state data from a shared data store. State data is stored in a shared data store and is not kept in the web server. Stateless systems are simpler, more robust and scalable.

Figure 1-14 shows the updated design with a stateless web layer.

[Figure 1-14: design with a stateless web tier]

In Figure 1-14, we move the session data out of the web tier and store it in a persistent data store. The shared data store could be a relational database, Memcached/Redis, a NoSQL database, etc. The NoSQL data store is chosen here because it is easy to scale. After the state data is removed from the web servers, auto-scaling of the web tier is easily achieved by adding or removing servers based on traffic load.
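A minimal sketch of externalizing session state, assuming a Redis instance reachable by every web server; the host name and TTL are made up, and the redis-py client is just one of several options.

```python
import json
import redis

# Shared session store reachable from every web server in the pool.
sessions = redis.Redis(host="session-store.internal", port=6379)

def save_session(session_id: str, data: dict) -> None:
    # Any server can write the session; it expires after 30 minutes.
    sessions.setex(session_id, 1800, json.dumps(data))

def load_session(session_id: str) -> dict | None:
    # Any other server can read it back on the user's next request.
    raw = sessions.get(session_id)
    return json.loads(raw) if raw is not None else None
```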

Your website is growing rapidly and attracting a large number of international users. To increase availability and provide a better user experience across a wider geographic area, supporting multiple data centers is critical.

Data Centers

Figure 1-15 shows an example setup with two data centers. In normal operation, users are geo-routed to the closest data center, with x% of the traffic going to US-East and (100 - x)% to US-West. geoDNS is a DNS service that resolves a domain name to an IP address based on the user's location.

[Figure 1-15: two data centers with geoDNS routing]

In the event of any major data center failure, we will direct all traffic to a functioning data center. In Figure 1-16, Data Center 2 (US-West) is offline and 100% of traffic is routed to Data Center 1 (US-East).

[Figure 1-16: failover to a single data center]

To implement a multi-datacenter setup, several technical challenges need to be addressed:

  • Traffic Redirection: Effective tools are needed to direct traffic to the correct data center. Depending on the user's location, geoDNS can be used to direct traffic to the closest data center.
  • Data synchronization: Users from different regions may use different local databases or caches. In a failover situation, traffic may be routed to a data center whose data is unavailable. A common strategy is to replicate data across multiple data centers. A previous study showed how Netflix achieves asynchronous multi-datacenter replication [11].
  • Testing and Deployment: For multi-datacenter setups, it is important to test your website/application in different locations. Automated deployment tools are essential to keep services consistent across all data centers [11].

To further scale our system, we need to decouple the different components of the system so that they can scale independently. Message queues are a key strategy used by many practical distributed systems to solve this problem.

Message Queue

A message queue is a durable component, stored in memory, that supports asynchronous communication. It serves as a buffer and distributes asynchronous requests. The basic architecture of a message queue is simple: input services, called producers/publishers, create messages and publish them to the queue; other services or servers, called consumers/subscribers, connect to the queue and perform the actions defined by the messages. The model is shown in Figure 1-17.

[Figure 1-17: message queue model]

Decoupling makes message queues the architecture of choice for building scalable and reliable applications. With message queues, producers can post messages to the queue when consumers are unable to process them. Consumers can read messages from the queue even if the producer is unavailable.

Consider the following use case: Your application supports photo customization, including cropping, sharpening, blurring, and more. These customization tasks take time to complete. In Figure 1-18, the web server posts a photo-processing job to a message queue. Photo processing workers receive jobs from message queues and perform photo customization tasks asynchronously. Producers and consumers can scale independently. When the size of the queue grows, more workers can be added to reduce processing time. However, if the queue is empty most of the time, the number of workers can be reduced.

[Figure 1-18: photo processing with a message queue]
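The photo-processing flow can be sketched with Python's standard library, using an in-process queue.Queue as a stand-in for a real broker such as RabbitMQ or Kafka; the job fields and worker count are illustrative.

```python
import queue
import threading

photo_jobs: queue.Queue = queue.Queue()  # stand-in for a durable message queue

def publish_job(photo_id: str, operation: str) -> None:
    """Producer: the web server posts a job and returns to the user immediately."""
    photo_jobs.put({"photo_id": photo_id, "operation": operation})

def photo_worker() -> None:
    """Consumer: workers pull jobs off the queue and process them asynchronously."""
    while True:
        job = photo_jobs.get()
        print(f"{job['operation']} applied to {job['photo_id']}")  # crop/sharpen/blur
        photo_jobs.task_done()

# One worker thread; more can be added when the queue grows.
threading.Thread(target=photo_worker, daemon=True).start()

publish_job("photo-42", "crop")
publish_job("photo-42", "sharpen")
photo_jobs.join()  # in this demo, wait until both jobs are processed
```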

Logging, Metrics, Automation

Logging, metrics, and automation support are good practices but are not strictly required for a small website that runs on a few servers. However, now that your site has grown to serve a large business, investing in these tools is essential.

Logging: Monitoring error logs is very important as it helps to identify errors and problems in the system. You can monitor error logs at a per-server level, or use tools to aggregate them into a centralized service for easy searching and viewing.

Metrics: Collecting different types of metrics helps us gain business insights and understand the health of the system. Here are some useful metrics:

  • Host-level metrics: CPU, memory, disk I/O, etc.
  • Aggregated-level metrics: for example, the performance of the entire database tier, cache tier, etc.
  • Key business metrics: daily active users, retention, revenue, etc.

Automation: When systems become large and complex, we need to build or leverage automation tools to increase productivity. Continuous integration is a good practice that enables teams to detect problems early by automatically validating each code commit. Additionally, automating build, test, deployment processes, and more can significantly increase developer productivity.

Adding Message Queues and Other Tools

Figure 1-19 shows the updated design. Due to space constraints, only one data center is shown in the diagram.

  1. A message queue is included in the design to help make the system more loosely coupled and fault tolerant.
  2. Includes logging, monitoring, metrics, and automation tools.

[Figure 1-19: updated design with a message queue and supporting tools]

As the data grows every day, the load on your database gets heavier and heavier. It is time to scale the data tier.

Database Scaling

There are two main approaches to database scaling: vertical scaling and horizontal scaling.

Vertical Scaling

Vertical scaling, also known as scaling up, adds more power (CPU, RAM, DISK, etc.) to an existing machine. There are some powerful database servers out there. According to Amazon Relational Database Service (RDS) [12], you can get a database server with 24 TB of RAM. Such a powerful database server can store and process lots of data. For example, stackoverflow.com in 2013 had over 10 million unique monthly visitors, but it had only 1 master database [13]. However, vertical scaling also has some serious drawbacks:

  • You can add more CPU, RAM, etc. to the database server, but there are hardware limitations. If you have a large number of users, a single server is not enough.
  • Increased risk of single point of failure.
  • The overall cost of vertical scaling is higher. Powerful servers are more expensive.

Horizontal Scaling

Horizontal scaling, also known as sharding, is the practice of adding more servers. Figure 1-20 compares vertical scaling with horizontal scaling.

[Figure 1-20: vertical scaling vs. horizontal scaling]

Sharding divides a large database into smaller, more manageable parts, called shards. Each shard shares the same schema, although the actual data on each shard is unique.

Figure 1-21 shows an example of sharded databases. User data is allocated to a database server based on user ID. Anytime you access data, a hash function is used to find the corresponding shard. In our example, user_id % 4 is used as the hash function. If the result equals 0, shard 0 is used to store and fetch data. If the result equals 1, shard 1 is used. The same logic applies to the other shards. A short routing sketch follows Figure 1-21.

[Figure 1-21: sharded databases routed by user_id % 4]
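The routing logic from the example amounts to a few lines; the helper name is illustrative.

```python
NUM_SHARDS = 4

def shard_for(user_id: int) -> int:
    """Hash function from the example: user_id % 4 picks the shard."""
    return user_id % NUM_SHARDS

# user_id 659 -> shard 3, user_id 1024 -> shard 0, user_id 1158 -> shard 2
for user_id in (659, 1024, 1158):
    print(user_id, "-> shard", shard_for(user_id))
```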

Figure 1-22 shows the user table in a sharded database.

[Figure 1-22: user table in a sharded database]

When implementing a sharding strategy, the most important factor is choosing a shard key. A shard key (also known as a partition key) consists of one or more columns that determine how data is distributed. As shown in Figure 1-22, "user_id" is the shard key. Shard keys allow you to efficiently retrieve and modify data by routing database queries to the correct database. When choosing a shard key, one of the most important criteria is to choose a key that evenly distributes data.

Sharding is a great technique for scaling your database, but it's far from a perfect solution. It introduces complexity and new challenges to the system:

Resharding data: Resharding is needed when 1) a single shard can no longer hold more data due to rapid growth, or 2) certain shards experience shard exhaustion faster than others because of uneven data distribution. When shard exhaustion happens, the sharding function must be updated and data moved around. Consistent hashing, discussed in Chapter 5, is a commonly used technique to solve this problem.

Celebrity problem: Also known as the hotspot key problem, this occurs when excessive access to a specific shard overloads the server. Imagine that data for Katy Perry, Justin Bieber, and Lady Gaga all end up on the same shard. For social applications, that shard will be flooded with read operations. To solve this problem, we may need to allocate a shard for each celebrity, and each such shard might even require further partitioning.

Joins and denormalization : Once a database is sharded across multiple servers, it becomes difficult to perform join operations across database shards. A common workaround is to denormalize the database so that queries can be performed on a single table.

In Figure 1-23, we shard the database to support rapidly increasing data traffic. At the same time, some non-relational functionality is moved to a NoSQL data store to reduce the database load. Here is an article that covers many use cases of NoSQL [14].

[Figure 1-23: design with sharding and NoSQL]

Scaling Beyond Millions of Users

Scaling a system is an iterative process. Iterating on what we have learned in this chapter could get us far. More fine-tuning and new strategies are needed to scale beyond millions of users. For example, you might need to optimize your system and decouple it into even smaller services. All the techniques learned in this chapter should provide a good foundation for tackling new challenges. To conclude this chapter, here is a summary of how we scale our system to support millions of users:

  • Keep the web tier stateless
  • Build redundancy at every tier
  • Cache data as much as you can
  • Support multiple data centers
  • Host static resources in a CDN
  • Scale your data tier by sharding
  • Split tiers into individual services
  • Monitor your system and use automation tools

Congratulations on getting this far! Now give yourself a pat on the back. Well done!

References

[1] Hypertext Transfer Protocol: https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol

[2] Should you go beyond relational databases?: https://blog.teamtreehouse.com/should-you-go-beyond-relational-databases

[3] Replication (computing): https://en.wikipedia.org/wiki/Replication_(computing)

[4] Multi-master replication: https://en.wikipedia.org/wiki/Multi-master_replication

[5] NDB Cluster Replication: Multi-Master and Circular Replication: https://dev.mysql.com/doc/refman/5.7/en/mysql-cluster-replication-multi-master.html

[6] Caching strategies and how to choose the right one: https://codeahoy.com/2017/08/11/caching-strategies-and-how-to-choose-the-right-one/

[7] R. Nishtala et al., "Scaling Memcache at Facebook," 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI '13).

[8] Single point of failure: https://en.wikipedia.org/wiki/Single_point_of_failure

[9] Amazon CloudFront Dynamic Content Delivery: https://aws.amazon.com/cloudfront/dynamic-content/

[10] Configure sticky sessions for your Classic Load Balancer: https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-sticky-sessions.html

[11] Active-Active for Multi-Regional Resiliency: https://netflixtechblog.com/active-active-for-multi-regional-resiliency-c47719f6685b

[12] Amazon EC2 High Memory Instances: https://aws.amazon.com/ec2/instance-types/high-memory/

[13] What it takes to run Stack Overflow: http://nickcraver.com/blog/2013/11/22/what-it-takes-to-run-stack-overflow

[14] What the heck are you actually using NoSQL for?: http://highscalability.com/blog/2010/12/6/what-the-heck-are-you-actually-using-nosql-for.html

This article is a translation of Chapter 1 of "System Design Interview: An Insider's Guide". If there is any infringement, please contact the translator for removal.

