Understanding large distributed websites requires familiarity with the following concepts.

1. I/O optimization

  1. Increase caching to reduce the number of disk accesses.
  2. Optimize the disk management system: design optimal disk scheduling and disk addressing strategies. These are handled at the level of the underlying operating system.
  3. Design disk storage data blocks, and the strategies for accessing those blocks, sensibly at the application level. For example, we can build indexes for stored data and reduce disk accesses by looking up the index, or speed up disk access with asynchronous, non-blocking I/O.
  4. Apply a suitable RAID strategy to improve disk I/O.
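
The first point, caching to cut disk accesses, can be sketched with a minimal LRU cache; the class name and capacity below are illustrative, not from any particular system:

```python
from collections import OrderedDict

class LRUCache:
    """A bounded in-memory cache: hot data is served from RAM instead of
    going back to disk. On a miss the caller falls back to the disk read."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None              # cache miss: caller reads from disk
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used
```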

2. Web front-end tuning

  1. Reduce the number of network interactions (merge multiple requests).
  2. Reduce the amount of data transmitted over the network (compression).
  3. Minimize encoding overhead (convert characters to bytes in advance, or reduce the character-to-byte conversion process).
  4. Use the browser cache.
  5. Reduce cookie transmission.
  6. Lay out the page sensibly.
  7. Use page compression.
  8. Lazy-load page content.
  9. Put CSS at the top and JS at the bottom.
  10. Use a CDN.
  11. Use a reverse proxy.
  12. Make pages static.
  13. Deploy across multiple sites.
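
Point 2 (compressing the data transmitted) can be illustrated with a small sketch; `compress_response` is a hypothetical helper standing in for what a web server does when the client sends `Accept-Encoding: gzip`:

```python
import gzip

def compress_response(body: bytes) -> bytes:
    """Reduce transfer size by gzip-compressing the response body.
    HTML/CSS/JS are highly repetitive text, so they compress very well."""
    return gzip.compress(body)
```

The decompression on the browser side is lossless, so only the bytes on the wire shrink, not the content.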

3. Service degradation (automatic graceful degradation)

Two common forms are denial of service (rejecting requests, e.g. from low-priority callers) and shutdown of service (turning off non-core functions to protect core services).
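
A minimal sketch of graceful degradation, assuming a non-core call (here a hypothetical recommendation service) should fall back to a canned result rather than fail the whole page:

```python
def with_degradation(primary, fallback):
    """Wrap a non-core call: if it fails, return a degraded fallback
    result instead of propagating the error to the user."""
    def wrapped(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            return fallback(*args, **kwargs)
    return wrapped
```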

4. Idempotent Design

Some services are naturally idempotent. For example, setting a user's gender to male yields the same result no matter how many times it is done. But for operations such as transfer transactions the problem is more complicated: the validity of each service call must be verified using information such as a transaction number, and only valid operations may proceed.

(Note: idempotency is a promise of the system's interface to the outside world, not an implementation detail. It promises that as long as a call to the interface succeeds, repeated external calls have the same effect on the system. An interface declared idempotent treats the failure of an external call as the norm, and failures will necessarily be retried.)
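
The transaction-number check described above can be sketched as follows (the service, account names, and field names are illustrative; a real system would persist the applied-transaction set):

```python
class TransferService:
    """Idempotent transfer sketch: each request carries a client-generated
    transaction id; a replay of an already-applied id is a no-op."""
    def __init__(self):
        self.balances = {"alice": 100, "bob": 0}
        self._applied = set()  # transaction ids already executed

    def transfer(self, txn_id, src, dst, amount):
        if txn_id in self._applied:
            return "duplicate"  # retry of a call that already succeeded
        self.balances[src] -= amount
        self.balances[dst] += amount
        self._applied.add(txn_id)
        return "ok"
```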

5. Failover

If any server in the data server cluster goes down, all read and write operations of the application against this server need to be rerouted to other servers to ensure that data access will not fail. This process is called failover. 
Failover includes: failure confirmation (heartbeat detection and application access failure reporting), access transfer, and data recovery. 
Failover ensures that when a data copy is inaccessible, other copies of the data can be quickly switched to ensure the system is available.
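
The failure-confirmation and access-transfer steps can be sketched with a heartbeat monitor (the timeout value and node names are illustrative):

```python
import time

class HeartbeatMonitor:
    """Failure confirmation sketch: a node is considered down when no
    heartbeat has arrived within `timeout` seconds."""
    def __init__(self, timeout=5.0):
        self.timeout = timeout
        self.last_seen = {}

    def beat(self, node, now=None):
        """Record a heartbeat from a node."""
        self.last_seen[node] = time.monotonic() if now is None else now

    def alive(self, node, now=None):
        now = time.monotonic() if now is None else now
        seen = self.last_seen.get(node)
        return seen is not None and (now - seen) <= self.timeout

    def route(self, nodes, now=None):
        """Access transfer: return only replicas that still pass detection."""
        return [n for n in nodes if self.alive(n, now)]
```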

6. Performance optimization

According to the hierarchical structure of a website, performance optimization can be divided into web front-end performance optimization, application server performance optimization, and storage server performance optimization.

  1. Web front-end performance optimization 
    • Browser access optimization: reduce HTTP requests; use browser caching; enable compression; put CSS at the top of the page and JavaScript at the bottom; reduce cookie transmission
    • CDN acceleration
    • Reverse proxy
  2. Application server performance optimization 
    • Distributed cache (Redis, etc.)
    • Asynchronous operation (message queue)
    • Using a cluster (load balancing)
    • Code optimization
  3. Storage performance optimization 
    • HDD vs SSD
    • B+ tree vs LSM tree
    • RAID vs HDFS

7. Code optimization

  • Multithreading (Q: how do you ensure thread safety? What lock-free mechanisms are there?)
  • Resource reuse (singleton pattern, connection pools, thread pools)
  • Appropriate data structures
  • Garbage collection
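
Resource reuse via a thread pool can be sketched in a few lines (the pool size and handler function are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def handle(request_id):
    """Simulated request handler (illustrative)."""
    return f"handled:{request_id}"

# The pool is created once and its worker threads are reused across
# requests, instead of paying thread-creation cost per request.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle, range(8)))
```

Connection pools follow the same idea: a fixed set of expensive resources (database connections) is created up front and checked out per request.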

8. Load Balancing

  • HTTP redirection load balancing 
    When a user sends a request, the web server returns a new URL by setting the Location header in the HTTP response, and the browser then requests that new URL; this is in effect a page redirect, and the redirection achieves the goal of "load balancing". For example, when we download a PHP source package and click the download link, the site returns a download address close to us to account for download speeds in different countries and regions. The HTTP status code for the redirect is 302. 
    Advantages: relatively simple. 
    Disadvantages: the browser needs two requests to the server to complete one visit, so performance is poor; the processing capacity of the redirection service itself may become a bottleneck, limiting the scalability of the whole cluster; and using HTTP 302 redirects may cause search engines to judge it as SEO cheating and lower the search ranking.
  • DNS domain name resolution load balancing 
    DNS (Domain Name System) provides domain name resolution: the domain name is effectively an alias for a server, and what it actually maps to is an IP address; resolution is the process by which DNS completes the domain-name-to-IP mapping. A domain name can be configured to correspond to multiple IPs, so DNS can also serve as a load balancer. 
    In practice, large websites always use DNS resolution as only the first level of load balancing: the group of servers obtained from domain name resolution are not the physical servers that actually provide the web service, but internal servers that themselves provide load balancing. That group of internal load-balancing servers then distributes requests to the real web servers. 
    Advantages: the work of load balancing is handed to DNS, saving the site the trouble of managing and maintaining load-balancing servers. Many DNS services also support resolution based on geographic location, i.e. the domain name is resolved to the server address geographically closest to the user, which speeds up access and improves performance. 
    Disadvantages: rules cannot be defined freely; changing the mapped IP when a machine fails is troublesome; there is a delay before DNS changes take effect; and control of DNS load balancing lies with the domain name service provider, so the website cannot make further improvements or exercise stronger management.
  • Reverse Proxy Load Balancing 
    A reverse proxy server can cache resources to improve website performance. In terms of deployment, the reverse proxy sits in front of the web servers (which is what makes it possible to cache web responses and speed up access); this is also where a load-balancing server sits, so most reverse proxy servers provide load balancing at the same time: they manage a group of web servers and forward requests to different web servers according to a load-balancing algorithm. Responses from the web servers are likewise returned to the user through the reverse proxy. Since the web servers do not serve external traffic directly, they need no external IP addresses, while the reverse proxy server needs dual network cards and both internal and external IP addresses. 
    Advantages: integrated with the reverse proxy function, so deployment is simple. 
    Disadvantages: the reverse proxy server is the transit point for all requests and responses, and its performance can become a bottleneck.
  • LVS-NAT: rewrites the packet's IP address.
  • LVS-TUN: encapsulates one IP packet inside another IP packet (IP tunneling).
  • LVS-DR: rewrites the MAC address of the data frame to the MAC address of the selected server, then sends the modified frame on the LAN shared with the server group.
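
The HTTP-redirect approach from the first bullet can be sketched as follows (the server addresses are made up, and a real implementation would live in a web framework's request handler):

```python
import random

# Illustrative backend pool; a real redirect service would pick by
# geography or load rather than at random.
SERVERS = ["http://10.0.0.1", "http://10.0.0.2", "http://10.0.0.3"]

def redirect_response(path):
    """HTTP-redirect load balancing: answer 302 with a Location header;
    the browser then re-requests the chosen backend directly."""
    target = random.choice(SERVERS)
    return {"status": 302, "headers": {"Location": target + path}}
```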

9. Cache

Caching means storing data in the location closest to the computation, to speed up processing. Caching is the first means of improving software performance; one important reason CPUs keep getting faster is their use of more cache. In complex software design, caching is almost everywhere, and large-scale website architectures apply caching in many places.

  • CDN: a content delivery network, deployed at the network service provider closest to the end user. The user's network request always reaches their network provider first, and some of the website's static resources (data that changes little) are cached there, so they can be returned to the user at top speed from nearby. For example, video sites and portals cache their heavily accessed hot content in the CDN.
  • Reverse proxy: part of the website's front-end architecture, deployed at the front of the site. When user requests reach the website's data center, the reverse proxy server is accessed first. The website's static resources are cached there and can be returned to the user directly, without forwarding the request on to the application servers.
  • Local cache: Hotspot data is cached locally on the application server, and the application can directly access the data in the local memory without accessing the database.
  • Distributed cache: the amount of data in a large website is enormous. Even if only a small fraction of the data is cached, the memory required exceeds what a single machine can provide. So in addition to local caches, a distributed cache is needed: data is cached in a dedicated distributed cache cluster, and applications access the cached data over the network.

There are two prerequisites for using a cache. First, data access hotspots are unevenly distributed: some data is accessed much more frequently, and that data should be placed in the cache. Second, the data is valid for some period and will not expire soon; otherwise the cached data becomes invalid and causes dirty reads, affecting the correctness of results. In website applications, caching speeds up data access and reduces the load on back-end applications and data storage. This is crucial to website database architecture: almost all website databases are designed with a load capacity that assumes caching is in place.
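
The second prerequisite, a bounded validity period, is what TTL-based caches implement; a minimal sketch (the TTL value is illustrative):

```python
import time

class TTLCache:
    """Entries expire after `ttl` seconds, so stale (dirty) data is not
    served indefinitely; an expired read forces a reload from the backend."""
    def __init__(self, ttl):
        self.ttl = ttl
        self._store = {}  # key -> (value, stored_at)

    def put(self, key, value, now=None):
        self._store[key] = (value, time.monotonic() if now is None else now)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if now - stored_at > self.ttl:
            del self._store[key]  # expired: treat as a miss
            return None
        return value
```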

10. Load Balancing Algorithm

  • Round robin
  • Weighted round robin
  • Random
  • Weighted random
  • Least connections
  • Weighted least connections
  • Source address hash
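
Round robin and its weighted variant can be sketched as follows (the simple weight expansion below is illustrative; production balancers often use a "smooth" weighted round robin that interleaves more evenly):

```python
from itertools import cycle

def round_robin(servers):
    """Plain round robin: hand out servers in rotation."""
    return cycle(servers)

def weighted_round_robin(weights):
    """Weighted round robin: a server with weight w appears w times
    per rotation, so it receives proportionally more requests."""
    expanded = [server for server, w in weights for _ in range(w)]
    return cycle(expanded)
```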

Other algorithms (as implemented in products such as F5 BIG-IP):

  • Fastest Algorithm: Pass the connection to the server that responds the fastest. When one of the servers suffers from Layer 2 to Layer 7 failures, BIG-IP takes it out of the server queue and does not participate in the allocation of the next user request until it returns to normal.
  • Observed algorithm (Observed): selects a server for new requests based on the best balance of connection count and response time. When one of the servers suffers a Layer 2 to Layer 7 failure, BIG-IP takes it out of the server queue and does not include it in the allocation of subsequent user requests until it returns to normal.
  • Predictive algorithm (Predictive): BIG-IP uses the collected current performance indicators of the server to perform predictive analysis, and selects a server whose performance will reach the best performance in the next time slice to correspond to the user's request. (detected by BIG-IP)
  • Dynamic Performance Allocation Algorithm (Dynamic Ratio-APM): BIG-IP collects various performance parameters of applications and application servers to dynamically adjust traffic allocation.
  • Dynamic Server Replenishment Algorithm (Dynamic Server Act.): When the number of primary server farms decreases due to failures, backup servers are dynamically added to the primary server farm.
  • Quality of Service Algorithm (QoS): Allocate data flows according to different priorities.
  • Type of Service algorithm (ToS): load balancing distributes data streams according to different service types (identified in the Type of Service field).
  • Rule-based algorithm: guiding rules are set for different data flows, which users can define as needed.

11. The difference between extensibility and scalability

Extensibility: the ability to continuously extend or enhance system functionality with minimal impact on the existing system. It means the system infrastructure is stable and does not require frequent changes, applications have little dependence and coupling between them, and the system can respond quickly to changes in demand. It is the open-closed principle at the level of system architecture design (open for extension, closed for modification).

The main criterion for measuring the extensibility of a website architecture is whether new business products can be added transparently, with no impact on existing products: new products can be launched with no changes, or only few changes, to existing business functions; different products are loosely coupled; and changes to one product have no effect on the others.

Scalability: the ability of a website to expand or reduce its service processing capacity by changing the number of deployed servers, without changing the website's software or hardware design.

It refers to a system's ability to increase (or reduce) the scale of its own resources to enhance (or reduce) its capacity for computation and transaction processing. If the increase or decrease is proportional, it is called linear scalability. In website architecture, it usually means using clusters: adding servers to raise the overall transaction throughput of the system.

The main criteria for measuring the scalability of an architecture are whether multiple servers can be built into a cluster, whether it is easy to add new servers to that cluster, whether a newly added server provides service indistinguishable from the original servers, and whether there is a limit on the total number of servers the cluster can accommodate.

12. Consistent hashing of distributed caches

The algorithm: first construct an integer ring of length 2^32 (called the consistent hash ring). Based on the hash value of each node's name (distributed over [0, 2^32 - 1]), place the cache server nodes on this hash ring. Then compute the hash value of the key of the data to be cached (also distributed over [0, 2^32 - 1]) and search clockwise on the ring for the nearest cache server node at or after that hash value, completing the mapping lookup from key to server.

Optimization strategy: virtualize each physical server into a group of virtual cache servers and place the hash values of the virtual servers on the ring. When looking up a key, first find the virtual server node, then map it back to the physical server's information.

How many virtual server nodes per physical server is appropriate? A rule of thumb from practice: 150.
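
The ring construction and clockwise lookup described above, including virtual nodes, can be sketched as follows (MD5 as the hash and 150 replicas follow the rule of thumb above; server names are illustrative):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Each physical server maps to `replicas` virtual nodes on a 2^32
    ring; a key is served by the first node clockwise from its hash."""
    def __init__(self, servers, replicas=150):
        self._ring = []  # sorted list of (position, server)
        for server in servers:
            for i in range(replicas):
                pos = self._hash(f"{server}#{i}")
                self._ring.append((pos, server))
        self._ring.sort()

    @staticmethod
    def _hash(key):
        digest = hashlib.md5(key.encode()).digest()
        return int.from_bytes(digest[:4], "big")  # value in [0, 2^32 - 1]

    def lookup(self, key):
        pos = self._hash(key)
        idx = bisect.bisect(self._ring, (pos,))  # first node at/after pos
        if idx == len(self._ring):
            idx = 0                              # wrap around the ring
        return self._ring[idx][1]
```

The payoff is that removing one server only remaps the keys that hashed to it, instead of reshuffling the whole cache.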

13. Cybersecurity

  1. XSS attack 
    XSS (Cross-Site Scripting) is an attack in which hackers tamper with web pages and inject malicious HTML scripts, so that when users browse the page their browsers are made to perform malicious operations. 
    Prevention: sanitization (XSS attacks generally embed malicious scripts in the request; these characters are not needed in normal user input, so filter and escape dangerous HTML characters, e.g. ">" becomes "&gt;"); HttpOnly cookies (prevent XSS attackers from stealing cookies via script).
  2. Injection attack: SQL injection and OS injection 
    SQL prevention: Prepared statement PreparedStatement; ORM; avoid password storage in plain text; handle corresponding exceptions.
  3. CSRF (Cross-Site Request Forgery). It sounds similar to XSS, but the two are very different: XSS exploits the site's trust in the user, while CSRF exploits a trusted website by forging requests from a trusted user. 
    Prevention: HttpOnly cookies; adding an anti-CSRF token; checking the Referer header.
  4. File upload vulnerability
  5. DDos attack
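
The sanitization ("disinfection") defense from item 1 amounts to escaping HTML metacharacters so injected markup renders as inert text; a sketch using Python's standard library:

```python
import html

def sanitize(user_input):
    """Escape HTML metacharacters (&, <, >, quotes) so that an injected
    script tag is displayed as text instead of executing in the browser."""
    return html.escape(user_input)
```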

14. Cryptography

  1. Digest algorithms (one-way hashes): MD5, SHA
  2. Symmetric encryption: DES algorithm, RC algorithm, AES
  3. Asymmetric encryption: RSA 
    Asymmetric encryption is typically used for secure information transmission, digital signatures, and similar scenarios. 
    The digital certificate the browser uses in HTTPS transmission is essentially an asymmetric-encryption public key certified by an authority (a CA).
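
A digest is a one-way hash, not reversible encryption; a quick sketch with SHA-256 (the input string is illustrative, and real password storage should add salting and key stretching, e.g. bcrypt or argon2, rather than a bare digest):

```python
import hashlib

# The same input always yields the same fixed-length digest, and the
# digest cannot be reversed to recover the input.
digest = hashlib.sha256(b"password123").hexdigest()
```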

15. Flow Control (Flow Control)

  1. Traffic discarding: directly drop excess user requests, which is simple and crude.
  2. Bounded waiting through a single-machine in-memory queue. For an I/O-intensive application (including network I/O and disk I/O), the bottleneck is generally not the CPU or memory, so an appropriate wait can both preserve the user experience and improve resource utilization.
  3. Make user requests asynchronous through a distributed message queue.
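
The bounded-wait-then-drop behavior from point 2 can be sketched with a bounded in-memory queue (the capacity is illustrative):

```python
import queue

class RequestLimiter:
    """Flow control via a bounded queue: admitted requests wait their
    turn; when the queue is full, excess requests are shed."""
    def __init__(self, capacity):
        self._q = queue.Queue(maxsize=capacity)

    def admit(self, request):
        try:
            self._q.put_nowait(request)
            return True   # admitted: will be processed in order
        except queue.Full:
            return False  # shed load instead of overwhelming backends

    def take(self):
        """Pull the next waiting request for processing."""
        return self._q.get_nowait()
```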


 
