Distributed system problems and solutions

1. Introduction to Distributed Systems

1.1 Background of Distributed Systems

Single application architecture: when site traffic is small, all functions are packaged into one application, and only a single application needs to be deployed, which reduces deployment cost and the number of nodes.
Vertical application architecture: when traffic grows and a single application can no longer keep up, the system is split vertically into several unrelated applications to improve efficiency.
Distributed system architecture: as more vertical applications appear, interaction between them becomes unavoidable. Core business can then be extracted and deployed as independent services, gradually forming a stable service center so that applications can respond to business needs faster.

1.2 Introduction to Distributed Systems

"Distributed system is a collection of several independent computers, these computers for users like a single related systems" - "distributed systems principles and paradigms." From a process point of view, the two programs are running on two hosts of the process, they collaborate with each other to finalize the same service or function, then in theory, the system consisting of both programs can be called "distributed systems . " Of course, this program may be two different programs, the programs may be the same. If the same procedure has another name called "clusters", ie, the same scale program to improve the service capabilities of the way.
Distributed sounds and micro-service concept is similar to the difference between them is: micro-services architecture biased in favor of business, for example, can be micro-services by sub-business, database, interface split into different dimensions of micro services; distributed architecture biased in favor of the machine It can be said that the micro-services architecture is a distributed architecture, because most micro services are separate deployment.

1.3 Problems distributed systems need to solve

1.3.1 Distributed Session

Sticky session

With sticky sessions, a user's session is bound to one machine in the cluster, and subsequent requests from that user are always forwarded to that same machine.
Usage scenario: a moderate number of servers; stability requirements are not very strict
Advantages: simple, easy to configure, no additional network overhead
Disadvantages: if that machine goes down, the user's session is lost; it easily leads to a single point of failure
Solution: Nginx's ip_hash load-balancing algorithm

Session replication

Session replication means a server broadcasts its session data so that it is copied to the other machines in the cluster.
Usage scenario: few machines, little network traffic
Advantages: simple to implement, little configuration; if one machine goes down, user access is not affected
Disadvantages: broadcasting session copies introduces a delay and requires additional network overhead
Solution: the open-source tomcat-redis-session-manager

Centralized cache management

Session data is written to a distributed cache cluster, and when a user visits, the session is read from the cache first.
Usage scenario: many servers, complex network environment
Advantages: good reliability
Disadvantages: more complex to implement, requires additional network overhead; stability depends on the stability of the cache system
Solution: the open-source Spring Session, or a hand-rolled implementation whose core is wrapping the request with HttpServletRequestWrapper and overriding its getSession method (a minimal sketch follows)
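To make that idea concrete, here is a minimal sketch of the hand-rolled approach, not Spring Session's actual implementation: the CacheSessionStore interface and the class names are hypothetical, and a real version would also handle session creation, expiry, and cookie writing.

```java
import javax.servlet.*;
import javax.servlet.http.*;
import java.io.IOException;

// Hypothetical store backed by a distributed cache such as Redis.
interface CacheSessionStore {
    HttpSession loadOrCreate(String sessionId, HttpServletRequest request);
}

// A filter that wraps every request so that getSession() reads from the cache
// instead of the local servlet container (Servlet 4.0+, where Filter.init/destroy
// have default implementations).
public class CachedSessionFilter implements Filter {
    private final CacheSessionStore store;

    public CachedSessionFilter(CacheSessionStore store) {
        this.store = store;
    }

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest httpReq = (HttpServletRequest) req;
        HttpServletRequest wrapped = new HttpServletRequestWrapper(httpReq) {
            @Override
            public HttpSession getSession(boolean create) {
                // Resolve the session id from the request and look it up in the cache.
                String sessionId = httpReq.getRequestedSessionId();
                return store.loadOrCreate(sessionId, httpReq);
            }

            @Override
            public HttpSession getSession() {
                return getSession(true);
            }
        };
        chain.doFilter(wrapped, res);
    }
}
```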

1.3.2 Distributed Configuration Center

In a distributed system there may be thousands of service instances, and restarting them one by one to release a configuration change is complex and time-consuming. So how can the runtime behavior of a whole cluster be adjusted without stopping any service? This is what a distributed configuration center is for. Typical configuration changes in a distributed system include: thread-pool and connection-pool sizes; feature switches and rate-limiting settings; switching between primary and standby data sources; routing rules. There are many open-source solutions, for example (a minimal client-side sketch follows the list):
disconf: open-sourced by Baidu; integrates well with Spring and has a web administration interface; the client only supports Java.
diamond: open-sourced by Alibaba and widely used inside Alibaba; it consists of an HTTP server, the diamond server, and a web console. The diamond servers connect to a single MySQL instance and synchronize data through MySQL dump files; it supports publish/subscribe, and the client only supports Java.
etcd: open-sourced by CoreOS; a lightweight key-value store that can implement service discovery and registration in a cluster environment. It provides TTL-based expiration, watches on data changes, directory watching, atomic operations such as distributed locks, and other functions for managing node state.
ZooKeeper: a mature distributed configuration solution.
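As a rough illustration of what the client side of such a configuration center does, here is a hedged sketch using ZooKeeper through Apache Curator's NodeCache recipe (available in Curator 4.x); the connection string, node path, and the thread-pool example are made-up values for illustration.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.cache.NodeCache;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ConfigWatcher {
    public static void main(String[] args) throws Exception {
        // Connect to ZooKeeper (the address is an example value).
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "127.0.0.1:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Watch a single config node, e.g. the thread-pool size mentioned above.
        String path = "/config/order-service/threadPoolSize";
        NodeCache cache = new NodeCache(client, path);
        cache.getListenable().addListener(() -> {
            byte[] data = cache.getCurrentData().getData();
            int newSize = Integer.parseInt(new String(data));
            // Apply the new value to the running system without a restart.
            System.out.println("thread pool size changed to " + newSize);
        });
        cache.start();

        Thread.sleep(Long.MAX_VALUE); // keep the demo process alive to receive updates
    }
}
```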

1.3.3 Distributed Transactions

The most essential demand that distributed transactions address is data consistency. Current solutions roughly fall into: XA-based transaction schemes, flexible (BASE-style) transactions, eventual consistency based on messages, compensation, and manual correction.
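One common way to implement the message-based eventual-consistency option is the "local message table" (outbox) pattern. The sketch below only shows the shape of that pattern under assumed interfaces: OrderDao and MessageQueue are hypothetical, and a real implementation needs transaction management, retries, and idempotent consumers.

```java
import java.util.List;

// Hypothetical collaborators: a DAO whose two writes share one local transaction,
// and a message-queue client; both are illustrative interfaces, not a real library.
interface OrderDao {
    void insertOrder(String orderId);          // business row
    void insertPendingMessage(String orderId); // "outbox" row in the same database
    List<String> findPendingMessages();
    void markSent(String orderId);
}

interface MessageQueue {
    void send(String topic, String payload);
}

public class OutboxExample {
    private final OrderDao dao;
    private final MessageQueue mq;

    public OutboxExample(OrderDao dao, MessageQueue mq) {
        this.dao = dao;
        this.mq = mq;
    }

    // Step 1: write the business data and the outgoing message in ONE local
    // transaction, so either both exist or neither does.
    public void createOrder(String orderId) {
        dao.insertOrder(orderId);
        dao.insertPendingMessage(orderId);
    }

    // Step 2: a background task repeatedly publishes pending messages.
    // The consumer on the other service must be idempotent, because a message
    // may be delivered more than once before it is marked as sent.
    public void publishPending() {
        for (String orderId : dao.findPendingMessages()) {
            mq.send("order-created", orderId);
            dao.markSent(orderId);
        }
    }
}
```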

1.3.4 Distributed Lock

Nowadays almost all large websites and applications are deployed in a distributed way, and data consistency has always been an important topic in distributed scenarios. The CAP theorem (see 1.3.5) tells us that no distributed system can satisfy consistency, availability, and partition tolerance at the same time; at most two can be satisfied, typically CP or AP. In most scenarios AP is guaranteed, sacrificing strong consistency in favor of eventual consistency. Common implementations of distributed locks include MySQL, Redis, ZooKeeper, and so on.
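As an illustration of the Redis option, here is a minimal sketch of acquiring and releasing a lock with the SET NX PX pattern, assuming the Jedis 3.x client and a Redis instance at 127.0.0.1:6379; production code would add retries, and a more robust variant is the Redlock algorithm or a library such as Redisson.

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

import java.util.Collections;
import java.util.UUID;

public class RedisLockExample {
    // Release the lock only if we still own it: compare the stored value and
    // delete it in one atomic Lua script.
    private static final String UNLOCK_LUA =
            "if redis.call('get', KEYS[1]) == ARGV[1] then "
          + "return redis.call('del', KEYS[1]) else return 0 end";

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("127.0.0.1", 6379)) {
            String lockKey = "lock:order:1001";
            String token = UUID.randomUUID().toString(); // identifies this owner

            // SET key value NX PX 30000: acquire only if absent, auto-expire in 30s
            String result = jedis.set(lockKey, token, SetParams.setParams().nx().px(30_000));
            if ("OK".equals(result)) {
                try {
                    // Critical section: only one node in the cluster runs this at a time.
                    System.out.println("lock acquired, doing work");
                } finally {
                    jedis.eval(UNLOCK_LUA, Collections.singletonList(lockKey),
                            Collections.singletonList(token));
                }
            } else {
                System.out.println("lock is held by another node, retry later");
            }
        }
    }
}
```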

1.3.5 CAP theory

The CAP theorem was originally stated for distributed data stores. It says that in a distributed system, consistency (Consistency, C), availability (Availability, A), and partition tolerance (Partition Tolerance, P) cannot all be achieved at the same time.

  • Consistency (C)
    Consistency means that "all nodes see the same data at the same time", i.e. after an update operation succeeds, all nodes hold exactly the same data at the same moment. Consistency can be viewed from two perspectives, the client and the server. From the client's perspective, it is mainly about how concurrently accessing users obtain the updated data. From the server's perspective, it is about how updates are replicated across the whole system so that the data stays consistent. Consistency problems appear under concurrent reads and writes, so reasoning about consistency must always consider the concurrent read/write scenario. Consistency can further be divided into strong consistency (always consistent), weak consistency (may be inconsistent), and eventual consistency (consistency after some delay).
  • Availability (A)
    Availability means that "reads and writes always succeed", i.e. whenever a user accesses data, the system returns a normal response within a reasonable time. Good availability mainly means the system serves its users well, without failed operations or timeouts that degrade the user experience. In general, availability is closely related to data redundancy, load balancing, and similar mechanisms in a distributed system.
  • Partition tolerance (P)
    Partition tolerance means that "the system continues to operate despite arbitrary message loss or failure of part of the system", i.e. when a node fails or the network partitions, the distributed system can still provide services that satisfy consistency and availability. Partition tolerance is closely related to scalability. In a distributed application, part of the system may stop working for various reasons; good partition tolerance means that even when some nodes fail or messages are lost, the cluster can still provide service and complete data access. Partition tolerance can be regarded as the ability to keep serving from replicated data within the system.

The CAP theorem says a distributed system can take only two of these three properties into account, giving the three combinations CA, CP, and AP. In practice, the trade-off can be made according to the actual situation, or the software can expose configuration so that users decide which CAP policy to take. CAP can also be applied at different levels: local design strategies can be tailored to the CAP principle, e.g. in a distributed system each data node itself guarantees CA, while the system as a whole settles for AP or CP, as illustrated in the figure below.
[Figure: the three CAP combinations CA, CP, and AP]

  • CA without P
    If partition tolerance is not required, i.e. partitions are assumed never to happen, then strong consistency and availability can both be guaranteed. In practice partitions can always occur, so "CA" in a distributed system usually means that each subsystem remaining after a partition is still CA.
  • CP without A
    If availability is not required, for example when every request demands strong consistency between servers, a partition can make synchronization time grow without bound, so only CP is guaranteed. Many traditional distributed database transactions fall into this pattern.
  • AP without C
    If high availability is wanted and partitions must be tolerated, consistency has to be given up. Once a partition occurs, nodes may lose contact with each other; to stay highly available, each node can only serve requests with its local data, which can leave the global data inconsistent.

1.3.6 BASE theory

BASE was proposed by an architect at eBay. It is the result of trading off consistency against availability in CAP, summarized from practice with large-scale distributed Internet systems and gradually evolved on top of the CAP theorem. Its core idea is that even when strong consistency cannot be achieved, every application can, according to its own business characteristics, use an appropriate approach to bring the system to eventual consistency. BASE is the abbreviation of three phrases: Basically Available, Soft State, and Eventually Consistent.

  • Basically Available (basically available)
    Basically available means that when a distributed system fails, it is allowed to lose part of its availability (for example in response time or in functionality). Note that basically available is by no means the same as unavailable.
    Loss in response time: normally a search engine returns query results to the user within 0.5 seconds, but because of a failure (such as a power outage in part of the machine room) the response time grows to 1-2 seconds.
    Loss in functionality: during a shopping peak (such as Singles' Day on November 11), to protect the stability of the system, some consumers may be redirected to a degraded page.
  • Soft State (soft state)
    Soft state means the system is allowed to have intermediate states, and those intermediate states do not affect the overall availability of the system. Distributed storage generally keeps several replicas of each piece of data, and allowing replication between replicas to lag is one manifestation of soft state. MySQL asynchronous replication is another example.
  • Eventually Consistent (eventual consistency)
    Eventual consistency means that all replicas of the data in the system reach a consistent state after a certain period of time. Compared with strong consistency, eventual consistency is a special case of weak consistency. Several consistency algorithms and protocols grew out of this: two-phase commit, three-phase commit, the Paxos algorithm, and the ZAB protocol.
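To make the first of those protocols concrete, here is a toy sketch of the coordinator side of two-phase commit; the Participant interface is hypothetical, and a real coordinator must also persist its decisions and handle timeouts and crash recovery.

```java
import java.util.List;

// Hypothetical participant in a two-phase commit; a real one would hold locks,
// write undo/redo logs, and survive crashes.
interface Participant {
    boolean prepare(String txId); // vote yes (true) or no (false)
    void commit(String txId);
    void rollback(String txId);
}

public class TwoPhaseCommitCoordinator {
    // Phase 1: ask every participant to prepare.
    // Phase 2: commit only if all voted yes, otherwise roll everyone back.
    public boolean run(String txId, List<Participant> participants) {
        boolean allPrepared = true;
        for (Participant p : participants) {
            if (!p.prepare(txId)) {
                allPrepared = false;
                break;
            }
        }
        for (Participant p : participants) {
            if (allPrepared) {
                p.commit(txId);
            } else {
                p.rollback(txId);
            }
        }
        return allPrepared;
    }
}
```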

1.3.7 Distributed scheduled tasks

What is a distributed scheduled task?

First we need to understand what a scheduled task is: a task that runs periodically according to a plan or a program. The most common examples are Linux's crontab and Windows' Scheduled Tasks. So what is a distributed scheduled task? A personal summary: moving scattered, unreliable scheduled tasks onto a unified platform, managing them as clusters, and scheduling and deploying them in a distributed way; tasks scheduled this way are called distributed scheduled tasks.
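For contrast with the distributed variant, a plain single-machine scheduled task in Java can be as small as the sketch below, using the JDK's ScheduledExecutorService; the task body and the 5-minute interval are arbitrary examples.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class LocalScheduledTask {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
        // Runs on this one machine only: if the process dies the task stops,
        // which is exactly the availability problem distributed schedulers address.
        scheduler.scheduleAtFixedRate(
                () -> System.out.println("cleaning up expired orders"),
                0, 5, TimeUnit.MINUTES);
    }
}
```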

Why use distributed scheduled tasks?

- Disadvantages of ordinary scheduled tasks:
relatively simple functionality, poor interactivity, inefficient task deployment, and high development and maintenance costs; it is hard to manage and control the scheduled tasks of every system in a unified way, which becomes especially obvious in a multi-system environment; many tasks are deployed on a single machine, so availability is poor; task tracking and alerting are hard to implement.
- Advantages of distributed scheduled tasks:
cluster-based scheduling greatly reduces development and maintenance costs; distributed deployment guarantees high availability, scalability, load balancing, and better fault tolerance; scheduled tasks can be managed from a console, which is convenient, flexible, and efficient; tasks can be persisted to a database, avoiding the risk of losing them when a machine goes down, together with a sound retry mechanism for failed tasks and detailed task tracking and alerting strategies.

Popular distributed task frameworks

Quartz:
Quartz is the best-known open-source task scheduling tool in the Java world. It offers a very wide range of features such as task persistence, clustering, and distributed execution. Features (a minimal usage sketch follows the list):

written entirely in Java, and easily integrated with Spring and other Java frameworks;
powerful scheduling: supports a variety of scheduling methods to satisfy both common and special needs;
flexible to apply: supports many combinations of tasks and triggers, and several storage options for scheduling data;
distributed and clustering capabilities, load balancing, and high availability.
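A minimal usage sketch, assuming Quartz 2.x; the job name, group, and cron expression are example values, and this plain setup is single-node, since clustering additionally requires the JDBC job store configuration.

```java
import org.quartz.CronScheduleBuilder;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class QuartzExample {
    // A job is just a class implementing org.quartz.Job.
    public static class ReportJob implements Job {
        @Override
        public void execute(JobExecutionContext context) {
            System.out.println("generating report at " + context.getFireTime());
        }
    }

    public static void main(String[] args) throws SchedulerException {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

        JobDetail job = JobBuilder.newJob(ReportJob.class)
                .withIdentity("reportJob", "reports")
                .build();

        // Fire every day at 02:00 using a cron expression.
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("reportTrigger", "reports")
                .withSchedule(CronScheduleBuilder.cronSchedule("0 0 2 * * ?"))
                .build();

        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}
```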

Elastic-Job:
Elastic-Job is the distributed job framework separated out of the dd-job module of ddframe, with the monitoring and access-specification parts of dd-job removed. The project is a secondary development built on the mature open-source products Quartz and ZooKeeper (through its client Curator). Features (a minimal job sketch follows the list):

scheduled tasks: executes tasks on a schedule based on cron expressions, using the mature job framework Quartz;
job registry: a global job registration and control center built on ZooKeeper and its client Curator, used to register, control, and coordinate distributed job execution;
job sharding: splits one task into multiple small task items to be executed simultaneously on multiple servers;
elastic scaling: if a running job server crashes, or N job servers are added, the framework re-shards the job before the next execution without affecting the current execution;
multiple execution modes: supports the OneOff, Perpetual, and SequencePerpetual execution modes;
failover: a crashed job server only triggers re-sharding before the next execution starts; with failover enabled, other idle job servers can grab the orphaned, unfinished sharding items during the current execution;
runtime status collection: monitors the job while it runs, keeps statistics of recently processed successes and failures, and records the start time, end time, and next run time of the last execution;
job stop, resume, and disable: used to start and stop jobs, and to forbid a job from running (commonly used during releases);
re-trigger of missed jobs: automatically records missed executions and triggers them after the previous run completes;
fast multithreaded data processing: uses multithreading to fetch data and improve throughput;
idempotency: detects duplicate job task items so that items already running are not executed again;
fault tolerance: if a job server loses its connection to the ZooKeeper registry while running, it stops executing immediately, preventing the registry from reassigning its shards to other servers while it is still executing them, which would cause duplicate execution;
Spring support: supports the Spring container and custom namespaces, and works with placeholders;
operations platform: provides an operations console to manage jobs and registries.
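A minimal sketch of what a sharded Elastic-Job job looks like, assuming the elastic-job-lite 2.x API (package names differ across versions); the modulo-based sharding comment is just one common convention, not something the framework requires.

```java
import com.dangdang.ddframe.job.api.ShardingContext;
import com.dangdang.ddframe.job.api.simple.SimpleJob;

// Each server only processes the sharding items assigned to it, so the same job
// can run on several machines at once without doing duplicate work.
public class OrderScanJob implements SimpleJob {
    @Override
    public void execute(ShardingContext context) {
        int item = context.getShardingItem();        // e.g. 0, 1, 2, ...
        int total = context.getShardingTotalCount(); // total number of shards
        // A common convention: each shard handles the rows where id % total == item.
        System.out.println("processing shard " + item + " of " + total);
    }
}
```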

TBSchedule:
TBSchedule is an excellent, high-performance distributed scheduling framework that is widely used in the scheduling systems of Alibaba, Taobao, Alipay, JD.com, Meituan, Autohome, and many other Internet companies. Although tbschedule's time-based scheduling is not as powerful as Quartz's, it supports sharding. Unlike Quartz, tbschedule uses ZooKeeper to achieve highly available task scheduling and sharding.
Saturn:
Saturn is the distributed task scheduling product open-sourced by VIP.com on GitHub. It was developed on top of Dangdang's elastic-job, improving some of its features and adding new ones. At the time of writing it had been open source for about six months and had around 470 stars on GitHub. Saturn tasks can be developed in multiple languages such as Python, Go, Shell, Java, and PHP. Inside VIP.com it is deployed on more than 350 nodes and schedules over 40 million task executions per day. Its management and statistics features are also highlights.