Full stack essential: 10 thought experiments for system architecture design

Experience in the architectural design of software systems is hard to come by. Even after many years of work, the opportunities to carry a system architecture design through from start to finish are limited. How, then, can you improve your design skills? Continuous practice is indispensable, and thought experiments can be an effective supplement.


In general, it is crucial to fully understand the problem and its requirements before diving into architectural design. Take the time to clear up any ambiguities and make sure you have a firm grasp of the scope and goals of the system. Don't hesitate to ask clarifying questions and to restate the problem in your own words to confirm your understanding. Then take a step-by-step approach: analyze the problem, identify the key components, and explore different design options before settling on a solution.

Keep scalability, reliability, and performance in mind throughout the design process, and be prepared to make trade-offs and optimizations along these dimensions, proactively discussing each trade-off and the rationale behind your design choices. Only by understanding these complexities can you make informed decisions.

This article lists 10 common knowledge points in system architecture design and works through each one as a thought experiment. This kind of deliberate practice can be a useful aid.


1. Cache

A cache is a high-speed storage layer that sits between an application and an original data source such as a database, file system, or remote web service. When an application requests data, it first checks the data in the cache. If the data is found in the cache, it is returned to the application. If the data is not found in the cache, it is retrieved from its original source, stored in the cache for future use, and returned to the application. In a distributed system, caching can be done in multiple places such as client, DNS, CDN, load balancer, API gateway, server, database, etc.
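
To make the read-through flow above concrete, here is a minimal cache-aside sketch in Python. The `fetch_from_source` callback and the TTL value are illustrative assumptions, not any particular library's API:

```python
import time

class CacheAside:
    """Minimal cache-aside wrapper: check the cache first, fall back to the source."""

    def __init__(self, fetch_from_source, ttl_seconds=60):
        self._fetch = fetch_from_source      # e.g. a database query function
        self._ttl = ttl_seconds
        self._store = {}                     # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None:
            value, expires_at = entry
            if time.time() < expires_at:     # cache hit, still fresh
                return value
            del self._store[key]             # expired entry: evict
        value = self._fetch(key)             # cache miss: go to the original source
        self._store[key] = (value, time.time() + self._ttl)
        return value

# Usage: wrap any slow lookup, here a stand-in for a database call.
cache = CacheAside(lambda user_id: {"id": user_id, "name": "demo"}, ttl_seconds=30)
print(cache.get(42))   # miss: fetched from the source and cached
print(cache.get(42))   # hit: served from the cache
```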

Experiment 1: Design a key-value store (such as Redis)

Key-value stores provide fast, scalable data storage and retrieval. Popular systems such as Redis are often used for caching, session management, and real-time analytics.

Experimental steps:

  1. Know your requirements: Determine the expected number of keys, value sizes, and access patterns.

  2. Design of data partitioning: Implement data partitioning techniques, such as consistent hashing or range partitioning, to distribute keys across multiple nodes.

  3. Implement data replication: Use quorum-based or master-slave replication strategies to ensure data durability and availability.

  4. Optimize data access: Implement caching and indexing strategies to improve read and write performance.

  5. Handle data eviction: Use eviction strategies such as least recently used (LRU) or time-to-live (TTL) to manage memory usage (see the sketch after this list).

  6. Ensure fault tolerance: Implement mechanisms to monitor and recover from node failures, such as heartbeat checks and automatic failover.
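
As a companion to step 5, here is a minimal LRU eviction sketch built on Python's `OrderedDict`; the capacity and keys are arbitrary examples:

```python
from collections import OrderedDict

class LRUCache:
    """Bounded key-value store that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()   # insertion/access order tracks recency

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # drop the least recently used entry

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")           # "a" becomes most recently used
cache.put("c", 3)        # capacity exceeded: "b" is evicted
print(cache.get("b"))    # None
```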

Questions like this build the ability to design scalable, efficient, and reliable systems. Understanding the key concepts and trade-offs involved in each problem is essential, and it is worth spelling out your own reasoning as you go.


2. CDN

A Content Delivery Network (CDN) is a distributed network of servers that are deployed in multiple locations around the world. These servers are designed to serve web content, such as images, videos and other static files, to users based on their geographic location. The main goal of a CDN is to improve the performance and availability of web content by caching it on servers closer to the users requesting it.

Using CDN technology can improve the performance of websites and applications as it can significantly reduce data transfer times. When a user requests content from a remote server, network latency and other factors can cause slower load times, which can negatively affect the user experience. However, CDNs can solve this problem by caching content on servers closer to the user, thereby providing faster response times and faster page loads.

In addition to providing faster page loads, CDNs can also improve the availability of websites and applications. When content is cached on multiple servers, if one of the servers fails or becomes overloaded, the others can continue to serve the content. This ensures that a website or application remains available in the face of high traffic or server failures.

Experiment 2: Design a CDN

The best-known CDN company is probably Akamai, and the major public cloud providers now offer CDN services as well, dedicated to caching and serving content from edge servers near end users to improve performance and reduce latency.

Experimental steps:

  1. Understand the needs: Determine the type of content to serve, the expected number of users, and their geographic distribution.

  2. Design CDN Architecture: Use a layered or flat architecture based on desired scalability and performance.

  3. Implement caching policies: Use cache eviction policies such as least recently used (LRU) or time-to-live (TTL) to manage content in edge servers.

  4. Optimize content delivery: Implement techniques such as request routing, prefetching, and compression to improve content delivery performance (a request-routing sketch follows this list).

  5. Manage cache consistency: Implement an update mechanism for the cache to ensure that the latest content is served to users.

  6. Monitor and analyze performance: Collect and analyze performance indicators to continuously optimize CDN performance and resource allocation.
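
One simplistic reading of the request routing in step 4: send each client to the geographically nearest edge PoP. Real CDNs usually rely on anycast or DNS-based routing, and the PoP locations below are made-up examples:

```python
import math

# Hypothetical edge locations (PoP -> latitude, longitude).
EDGE_POPS = {"frankfurt": (50.1, 8.7), "singapore": (1.35, 103.8), "virginia": (38.9, -77.0)}

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = math.sin((lat2 - lat1) / 2) ** 2 + \
        math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(h))

def route_request(client_location):
    """Return the edge PoP closest to the client."""
    return min(EDGE_POPS, key=lambda pop: haversine_km(client_location, EDGE_POPS[pop]))

print(route_request((48.8, 2.3)))   # a client near Paris -> 'frankfurt'
```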

These steps help us improve our ability to manage cache coherency and optimize content delivery, with a better balance in resource allocation.


3. Load balancing

A load balancer is a network device that distributes incoming network traffic across multiple backend servers or services to improve system performance and availability. A load balancer typically sits between clients and servers and uses various algorithms to distribute incoming requests among available servers to maximize performance and ensure that no single server is overwhelmed. This improves the overall reliability and responsiveness of the system because it distributes the workload more evenly and enables the system to handle a higher volume of requests.

A concept easily confused with load balancing is the network proxy, which comes in three flavors: forward proxy, reverse proxy, and transparent proxy. Transparent proxies are straightforward, so here is a brief introduction to the other two. A forward proxy sits in front of one or more client machines and acts as an intermediary between those clients and the Internet: when a client requests a resource, the request goes to the proxy first, which forwards it to the Internet on the client's behalf and returns the response to the client. A reverse proxy sits in front of one or more web servers and acts as an intermediary between those servers and the Internet: when a client requests a resource hosted on those servers, the request goes to the reverse proxy first, which forwards it to one of the web servers, receives the response, and returns it to the client.

Experiment 3: Designing a Load Balancer

Elastic Load Balancers (ELBs) on AWS, and their counterparts on other cloud platforms, are cloud-based load balancers that automatically distribute incoming traffic across multiple servers to ensure high availability and fault tolerance.

Experimental steps:

  1. Understanding of requirements: Define the expected number of clients, servers, and traffic patterns.

  2. Choose a load balancing algorithm: Implement algorithms such as round robin, least connections, or least response time based on the desired distribution behavior (see the sketch after this list).

  3. Architectural design of the load balancer: Use a hardware or software based load balancer depending on the required performance and flexibility.

  4. Handle session persistence: Implement mechanisms such as session affinity (sticky sessions) to ensure that a client maintains a consistent connection with a specific server.

  5. Manage health checks: Monitor the health of your servers and automatically remove unhealthy servers from your load balancer.

  6. Ensure Fault Tolerance: Implement redundant load balancers and automatic failover mechanisms to prevent single points of failure.
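
A toy sketch of the round robin and least-connections algorithms from step 2. The server names are hypothetical, and a real balancer would also fold in the health checks from step 5:

```python
import itertools

class LoadBalancer:
    """Two distribution algorithms over a fixed pool of backend servers."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.active = {s: 0 for s in self.servers}   # open connections per server
        self._rr = itertools.cycle(self.servers)

    def pick_round_robin(self):
        return next(self._rr)

    def pick_least_connections(self):
        return min(self.servers, key=lambda s: self.active[s])

    def handle_request(self, algorithm="least_connections"):
        server = (self.pick_least_connections() if algorithm == "least_connections"
                  else self.pick_round_robin())
        self.active[server] += 1          # connection opened
        return server

    def finish_request(self, server):
        self.active[server] -= 1          # connection closed

lb = LoadBalancer(["app-1", "app-2", "app-3"])
print([lb.handle_request("round_robin") for _ in range(4)])  # app-1, app-2, app-3, app-1
```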

These steps help us improve our ability to distribute traffic across a multi-server network while ensuring high availability and fault tolerance.


4. API Gateway

API gateways are an integral component of modern applications, providing a way to simplify and manage microservices architectures. An API gateway acts as the single entry point for the entire application: it receives client requests, forwards them to the appropriate microservices, and returns the services' responses to the client. This architecture makes applications more modular and scalable while also providing better performance and security.

In addition to providing request routing and distribution capabilities, API gateways can also be used to perform other important tasks such as authentication, rate limiting, and caching. Authentication is a method of protecting microservices from unauthorized access, preventing malicious users or attackers from accessing protected resources. Rate limiting is a method of controlling access rates that prevents an application's resources from being overused, thereby protecting the stability and reliability of the application. Caching is a way to improve application performance by avoiding frequent retrieval of data from backend services.

In modern applications, API Gateway has become an essential component. It not only provides a more modular and extensible way to build applications, but also provides better performance and security. Therefore, choosing an appropriate API gateway is very critical. There are many API gateways to choose from, such as Kong, Tyk, and Apigee, etc. Each of these API gateways has its own advantages and disadvantages, and needs to be selected according to the needs of the application.

Experiment 4: Designing a Scalable Flow Controller

Flow control is critical to protecting the system from a flood of requests. Services like Amazon API Gateway provide scalable rate limiting features that protect web applications and APIs from excessive requests and abuse.

Experimental steps:

  1. Know your needs: Determine a rate limiting policy, such as requests per minute or per second.

  2. Choose a rate-limiting algorithm: Use the token bucket or leaky bucket algorithm depending on the desired behavior.

  3. Design data storage: Store counters or tokens in memory, or in a distributed in-memory store such as Redis.

  4. Implement middleware: Create middleware to handle rate limiting logic before requests reach the main application.

  5. Dealing with distributed systems: Distribute tokens among multiple servers using a consistent hashing algorithm.

  6. Monitoring and Tuning: Continuously monitor system performance and adjust rate limits as needed.

These questions deepen our understanding of distributed systems and of techniques such as the token bucket algorithm, a common flow-control algorithm that limits the rate of requests to a service and protects it from crashing under excessive load. Beyond that, an understanding of distributed systems is extremely important: they have become an integral part of modern computing, and their importance will only grow with time.
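
Here is a minimal token bucket sketch, assuming a single process and one global bucket; a distributed deployment would keep the token state in a shared store such as Redis:

```python
import time

class TokenBucket:
    """Token bucket rate limiter: tokens refill at a fixed rate; each request spends one."""

    def __init__(self, capacity, refill_per_second):
        self.capacity = capacity
        self.refill_rate = refill_per_second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1    # spend a token for this request
            return True
        return False            # bucket empty: reject or queue the request

bucket = TokenBucket(capacity=5, refill_per_second=2)
print([bucket.allow() for _ in range(7)])   # first 5 True (the burst), then False
```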


5. DNS

DNS is a hierarchical distributed system consisting of multiple servers that work together to translate human-readable domain names (such as www.abc.com) into IP addresses (such as 192.168.1.128). Computers need to use these addresses to identify each other on the Internet or private network.

The main purpose of DNS is to make it easier for users to access websites and other network resources by using meaningful and easy-to-remember domain names, rather than having to remember numerical IP addresses. DNS also has some other functions, such as it can help network administrators diagnose and solve network problems, and provide security protection for DNS queries.

DNS can also realize load balancing and failover by mapping a domain name to multiple IP addresses to ensure high availability of network services. In addition, DNS also supports iterative and recursive queries to ensure that clients can get the most accurate and fastest response.

Experiment 5: Design a URL shortening service

bit.ly and goo.gl are popular URL shortening services: they generate unique short URLs, resolve them, and efficiently redirect users to the original URL.

Experimental steps:

  1. Identify requirements: Identify key features such as URL shortening, redirection, and analytics.

  2. Assumptions: Define expected number of users, number of requests, and storage capacity.

  3. Choose an encoding scheme: Use a hash function such as MD5 or an encoding such as Base62 to generate unique short URLs (a Base62 sketch follows this list).

  4. Design of the database: Use a key-value store or a relational database to store the mapping between original and shortened URLs.

  5. API development and design: Create a RESTful API that shortens URLs and redirects users to the original URL.

  6. Consider edge cases: Handle duplicate, conflicting, and expired URLs.

  7. Optimize performance: Use a caching mechanism, such as Redis or Memcached, to speed up redirection.
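
Here is a small Base62 sketch for step 3, mapping a database-assigned integer ID to a short, URL-safe token and back; the sample IDs are arbitrary:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n):
    """Encode a non-negative integer (e.g. an auto-increment row ID) as Base62."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, remainder = divmod(n, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))

def base62_decode(s):
    """Invert base62_encode: turn a short token back into the integer ID."""
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n

print(base62_encode(125))     # '21'  (125 = 2*62 + 1)
print(base62_decode("21"))    # 125
```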

These questions exercise the ability to design a service that generates short, unique URLs for longer web addresses. Key concepts include hashing, database design, and API development.


6. Data Partitioning and Replication

In databases, horizontal partitioning (also known as sharding) involves dividing the rows of a table into smaller tables and storing them on different servers or database instances. This is done to distribute the load of the database among multiple servers and to improve performance. Vertical partitioning involves dividing the columns of a table into separate tables. This is done to reduce the number of columns in the table and to improve the performance of queries that access only a small number of columns.

The goal of horizontal partitioning is to distribute data and workload among multiple servers so that each server can handle a smaller portion of the total data and workload. This helps improve the performance and scalability of the database because each server can handle queries and updates more efficiently when handling smaller amounts of data. The main partition methods are as follows:

  • Range-based sharding: In this approach, data is partitioned by ranges of a key value, such as user IDs or timestamps. For example, all user IDs in the range 1-1000 might be stored on one shard, while user IDs in the range 1001-2000 are stored on another.

  • Hash-based sharding: In this approach, a hash function applied to the key value determines which shard stores the data. For example, all data for user ID 123 might land on one shard, while data for user ID 456 lands on another (see the consistent hashing sketch after this list).

  • Directory-based sharding: In this approach, a central directory maps key values to the specific shards where the data is stored. The directory is consulted to determine which shard a piece of data belongs to, and the data is retrieved from that shard.

  • Custom Sharding: In some cases, it may be necessary to implement a custom sharding method specific to the database and the application that uses the database.
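
To illustrate the hash-based approach, here is a compact consistent hashing sketch; the shard names and virtual-node count are illustrative assumptions:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Keys map to the next node clockwise on a hash ring, so adding or removing
    a node only remaps a small fraction of the keys."""

    def __init__(self, nodes, vnodes=100):
        self._ring = []                       # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):           # virtual nodes smooth the distribution
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[idx][1]

ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
print(ring.node_for("user:123"))   # the same key always lands on the same shard
```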

Database replication is the process of copying and synchronizing data from one database to one or more other databases. This is often used in distributed systems where multiple replicas are required to ensure data availability, fault tolerance and scalability.

Experiment 6: Design a social media platform like Weibo

Weibo in China and Twitter and Facebook abroad are examples of large social media platforms. They handle user registration, relationships, posting, and timeline generation while managing large amounts of data and traffic.

Experimental steps:

  • Understanding of requirements: Identify key features such as user registration, follow/follower relationships, posting, and timeline generation.

  • Data Model Design: Define schema for users, microblog content, and relationships.

  • Choose the right database: Use a combination of databases, such as a relational database for user data and a NoSQL database for posts and relationships.

  • Implement the API: Develop RESTful APIs for user registration, tweets, and timeline generation.

  • Optimize timelines: Use fan-out-on-write or fan-out-on-read to efficiently generate user timelines (see the sketch after this list).

  • Handle scalability: use sharding, caching, and load balancing to ensure the system maintains performance under high load.

  • Ensure fault tolerance: Implement a data replication and backup strategy to prevent data loss.
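
A toy fan-out-on-write sketch for the timeline step: each new post is pushed into every follower's precomputed timeline at write time, making reads cheap. The in-memory structures are stand-ins for a real cache or queue:

```python
from collections import defaultdict, deque

followers = defaultdict(set)                        # author -> set of follower IDs
timelines = defaultdict(lambda: deque(maxlen=100))  # user -> newest-first posts

def follow(follower, author):
    followers[author].add(follower)

def post(author, text):
    """Fan-out-on-write: push the post into every follower's timeline now,
    so reading a timeline later is a cheap, precomputed lookup."""
    for follower in followers[author]:
        timelines[follower].appendleft((author, text))

follow("alice", "bob")
follow("carol", "bob")
post("bob", "hello, timeline")
print(list(timelines["alice"]))   # [('bob', 'hello, timeline')]
```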

This problem will test our skills in designing scalable and fault-tolerant systems.


7. Distributed file system

A distributed file system is a popular storage solution that manages and provides access to files and directories spread across multiple servers, nodes, or machines, usually connected over a network, so that users and applications can access and manipulate files as if they were stored on a local file system. Such storage solutions are increasingly important in modern computer systems, especially in large-scale or distributed computing environments, because they provide fault tolerance, high availability, and improved performance.

There are many different implementations of distributed file systems, such as Hadoop Distributed File System (HDFS), GlusterFS, Ceph, etc., each of which has its own unique advantages and limitations. HDFS, part of the Apache Hadoop project, is an open source, highly scalable distributed file system designed to provide high throughput and data access performance for large-scale data applications. GlusterFS is an open source, distributed file system that allows users to store and access files on different computing nodes, and is a highly scalable storage solution. Ceph is a distributed, unified, and scalable file system and object storage solution designed to provide fault tolerance, high availability, and good performance.

Experiment 7: Design a distributed file system (such as HDFS)

Distributed file systems are essential for storing and managing large amounts of data across multiple machines. HDFS and S3 are widely used distributed file systems designed to store and manage large amounts of data across multiple machines while providing high availability and fault tolerance.

Experimental steps:

  • Know your needs: Determine the expected number of files, file sizes, and access patterns.

  • Design the file system architecture: Use a master-slave architecture or a P2P architecture based on the required scalability and fault tolerance.

  • Handle file partitioning: Implement data partitioning techniques such as consistent hashing or range partitioning to distribute files across multiple nodes.

  • Implement data replication: Use quorum-based or eventual consistency-based replication strategies to ensure data durability and availability.

  • Optimize data access: Implement caching and prefetching strategies to improve read performance.

  • Manage Metadata: Use a centralized or distributed metadata store to maintain file metadata and directory structure.

  • Handle fault tolerance and recovery: Implement mechanisms to detect and recover from node failures, such as heartbeat checks and automatic failover (a heartbeat sketch follows this list).
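
A minimal heartbeat monitor for the last step, assuming nodes report in periodically; re-replicating a dead node's data is left out of the sketch:

```python
import time

class HeartbeatMonitor:
    """A node is presumed dead if no heartbeat arrives within the timeout."""

    def __init__(self, timeout_seconds=10):
        self.timeout = timeout_seconds
        self.last_seen = {}                 # node -> timestamp of last heartbeat

    def heartbeat(self, node):
        self.last_seen[node] = time.monotonic()

    def dead_nodes(self):
        now = time.monotonic()
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]

monitor = HeartbeatMonitor(timeout_seconds=10)
monitor.heartbeat("datanode-1")
monitor.heartbeat("datanode-2")
monitor.last_seen["datanode-2"] -= 11   # simulate a node that stopped reporting
print(monitor.dead_nodes())             # ['datanode-2']
```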

These questions build a deeper understanding of data replication and consistency models in distributed systems and their real-world applications. It is worth discussing how to handle possible data conflicts and failures, and how replication and consistency models might evolve to meet future needs.


8. Distributed coordination services

A distributed coordination service is a system for managing and coordinating the activities of distributed applications, services, or nodes in a reliable, efficient, and fault-tolerant manner. Such services help maintain consistency, handle distributed synchronization, and manage the configuration and state of components in a distributed environment. They can also provide additional features such as load balancing, failover, and security. In large-scale or complex systems, such as microservice architectures, distributed computing environments, or clustered databases, their importance grows by the day.

Experiment 8: Designing an API limiter

API rate limiting is critical to maintaining the stability and security of web services. Services such as the GitHub API and the Baidu Maps API enforce rate limits to stay stable and secure while still letting developers access resources within specified quotas.

Experimental steps:

  1. Understand the requirements: Define the rate limit policy, such as requests per minute or per second, and the scope of the rate limit (per user, IP address, or API endpoint).

  2. Design a rate limiting mechanism: Implement a fixed window, sliding window, or token bucket algorithm depending on the desired rate limiting behavior (a sliding-window sketch follows this list).

  3. Store rate limit data: Use an in-memory data structure or a distributed data store like Redis to store and manage rate limit information.

  4. Implement middleware: Create middleware to handle the rate limiting logic and enforce the rate limit before the request reaches the main application.

  5. Dealing with distributed systems: Use consistent hashing algorithms or distributed locks to synchronize rate limits across multiple servers.

  6. Monitoring and Tuning: Continuously monitor system performance and adjust rate limits as needed to balance user experience and system stability.
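
A minimal sliding-window sketch for step 2, keeping per-key request timestamps in process memory; per step 3, a multi-server deployment would move this state into a shared store such as Redis:

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """At most `limit` requests per `window` seconds for each key
    (user ID, IP address, or API endpoint)."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.hits = defaultdict(deque)      # key -> timestamps of recent requests

    def allow(self, key):
        now = time.monotonic()
        q = self.hits[key]
        while q and now - q[0] > self.window:   # drop timestamps outside the window
            q.popleft()
        if len(q) < self.limit:
            q.append(now)
            return True
        return False                            # over the limit: e.g. reply HTTP 429

limiter = SlidingWindowLimiter(limit=3, window_seconds=60)
print([limiter.allow("user:42") for _ in range(5)])   # [True, True, True, False, False]
```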

These questions can deepen our understanding of API design, token-based authentication, and rate limiting algorithms.


9. Distributed message system

Distributed messaging systems enable the exchange of messages between multiple applications, services, or components that may be geographically dispersed in a reliable, scalable, and fault-tolerant manner. They facilitate communication by decoupling sender and receiver components, allowing them to evolve and operate independently. Distributed messaging systems are especially useful in large-scale or complex systems, and as a special kind of distributed messaging system, notification systems are used to send notifications or alerts to users, such as emails, push notifications, or text messages.

Experiment 9: Design an online chat system

WeChat, Feishu, DingTalk, etc. are examples of online chat systems that support real-time messaging, group chat, and offline messaging while ensuring security and privacy through end-to-end encryption.

Experimental steps:

  1. Understand requirements: Identify key features such as one-to-one messaging, group chat, and offline messaging.

  2. Design data models: Define schemas for users, messages, and chat rooms.

  3. Choose the right database: Use a combination of databases, such as relational databases for user data, and NoSQL databases for messaging and chat rooms.

  4. Implementation of communication protocols: use WebSocket or long polling for real-time messaging, and HTTP for offline messaging.

  5. Design message storage: Store messages in a distributed database or message queue for scalability and fault tolerance.

  6. Handle data synchronization: Implement mechanisms to ensure messages are delivered and kept in sync across a user's devices (the store-and-forward sketch after this list shows the delivery side).

  7. Optimizing Performance: Use caching and indexing strategies to speed up message retrieval and searching.

  8. Ensure Security and Privacy: Implement end-to-end encryption and authentication to protect user data and communications.
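
A store-and-forward toy covering the delivery side: messages go out immediately when the recipient is connected and are queued otherwise. The in-memory broker and callbacks are illustrative stand-ins for WebSocket sessions and a persistent message store:

```python
from collections import defaultdict

online = {}                          # user -> callback that delivers a message
offline_queue = defaultdict(list)    # user -> messages awaiting delivery

def connect(user, deliver):
    online[user] = deliver
    for msg in offline_queue.pop(user, []):   # flush messages queued while offline
        deliver(msg)

def disconnect(user):
    online.pop(user, None)

def send(sender, recipient, text):
    msg = {"from": sender, "text": text}
    if recipient in online:
        online[recipient](msg)                # real-time path (e.g. over WebSocket)
    else:
        offline_queue[recipient].append(msg)  # offline path: store and forward

send("alice", "bob", "hi")                      # bob is offline: message is queued
connect("bob", lambda m: print("bob got:", m))  # prints the queued message on connect
```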

Key considerations for these issues include message storage, data synchronization, and efficient communication protocols.


10. Full text search

Full-text search is the ability to search for specific words or phrases within an application or website. When a user enters a query in the search box, the application or website returns the most relevant results to help the user find what they are looking for quickly. To do this efficiently, full-text search uses a data structure called an inverted index, which maps a word or phrase to the documents in which it appears. Elasticsearch is an example of a search engine that uses this technique, which provides powerful search capabilities and scalability to handle large amounts of data with ease.
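
A minimal inverted index sketch matching the description above; the three sample documents are made up:

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """AND-search: return the documents containing every query word."""
    sets = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*sets) if sets else set()

docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick dog"}
index = build_inverted_index(docs)
print(search(index, "quick dog"))   # {3}
```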

Experiment 10: Designing a web crawler

A web crawler extracts information from websites and indexes it for search engines. Search engines such as Google and Baidu use web crawlers to scrape, index, and rank websites based on factors such as relevance and popularity.

Experimental steps:

  1. Understanding of requirements: Define the scope of crawling, such as the number of websites, the depth of crawling, and the types of content to be indexed.

  2. Choose the right strategy: implement a breadth-first (BFS) or depth-first search (DFS) algorithm depending on the desired crawling behavior.

  3. Handle URLs: Use a URL frontier (a queue of pending URLs) to store and manage the URLs to crawl (see the sketch after this list).

  4. Parser Design: Create a parser that extracts relevant information from web pages, such as links, metadata, and text.

  5. Store data: Use a combination of databases, such as relational databases for structured data and NoSQL databases for unstructured data.

  6. Dealing with Parallel Processing: Parallel processing is achieved using multithreading or distributed computing frameworks such as Apache Spark or Hadoop.

  7. Manage politeness: Respect robots.txt and crawl-delay directives to avoid overloading the sites being crawled.
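
A BFS sketch over a URL frontier, tying steps 2 and 3 together. `fetch_links` is a hypothetical callback standing in for downloading and parsing a page, and politeness handling is omitted for brevity:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl(seed_url, fetch_links, max_pages=100):
    """Breadth-first crawl: visit pages level by level from the seed."""
    frontier = deque([seed_url])    # the URL frontier: pages waiting to be crawled
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in fetch_links(url):
            absolute = urljoin(url, link)       # resolve relative links
            if absolute not in visited and urlparse(absolute).scheme in ("http", "https"):
                frontier.append(absolute)
    return visited

# Toy link graph standing in for the web.
graph = {"http://a.test/": ["/b", "/c"], "http://a.test/b": ["/c"], "http://a.test/c": []}
print(sorted(crawl("http://a.test/", lambda u: graph.get(u, []))))
```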

These questions build a deeper understanding of web technologies, parallel processing, and data storage. Studying network protocols shows how the web actually works; parallel processing improves computational efficiency and the ability to handle large volumes of data; and data storage is an area worth understanding in depth, because it underpins fields from artificial intelligence and big data to cloud computing.

One-sentence summary

"Deliberate practice", this article introduces 10 thought experiments on system architecture design, including distributed file system, service coordination control, API gateway, distributed message system and full-text search, etc. Each experiment includes steps and key considerations, and involves technologies such as data partitioning, caching, persistent connections, web crawlers, and distributed computing frameworks.
