Distributed Technology of Hadoop



1. Why do we need distributed systems?

1.1 Computational issues

Whether we are just learning to program at school or tackling our first practical problems at work, the programs we write are very simple, because the problems are simple. Take data processing as an example: the job may be to parse a file of a few tens of KB and produce a word-frequency report. That takes a dozen lines of code, or even fewer.

Then one day, 1,000 files are thrown at you, some of them hundreds of MB in size. You run the same program and find that it takes a long time, so you want to optimize it. The 1,000 files have no business relationship with one another, so you can use multi-threading: each thread processes one file, and the results are aggregated at the end. If multi-threading does not help enough (in Python, for example, threads cannot take full advantage of multiple cores), use multiple processes instead.
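As a rough illustration of the multi-process approach, here is a minimal Python sketch that counts words in one file per task and merges the partial results. The function names `count_words` and `count_words_parallel`, the word-splitting rule, and the demo files are assumptions made for this example, not part of any particular program.

```python
from collections import Counter
from multiprocessing import Pool
from pathlib import Path
import re


def count_words(path):
    """Count word frequencies in a single file."""
    text = Path(path).read_text(encoding="utf-8", errors="ignore")
    return Counter(re.findall(r"\w+", text.lower()))


def count_words_parallel(paths, processes=None):
    """Process one file per task in a pool of worker processes, then merge."""
    total = Counter()
    with Pool(processes=processes) as pool:
        # Results arrive in whatever order the workers finish.
        for partial in pool.imap_unordered(count_words, paths):
            total.update(partial)
    return total


if __name__ == "__main__":
    import tempfile

    # Hypothetical demo data: two tiny files standing in for the 1,000 real ones.
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i, text in enumerate(["a b a", "b c"]):
            p = Path(tmp) / f"part{i}.txt"
            p.write_text(text, encoding="utf-8")
            paths.append(str(p))
        report = count_words_parallel(paths, processes=2)
        for word, n in sorted(report.items()):
            print(word, n)
```

Because the files are independent, no coordination between workers is needed; the only shared step is the final merge.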

Whether you use threads or processes, the purpose is essentially the same: parallelize the computation so that it stops being slow. But what if the amount of computation is so large that it still cannot be finished even when the machine's computing power is exhausted?

If one machine is not enough, add more machines. So we evolve from the multi-thread/multi-process parallelization of computation to distributed computation (distribution is, of course, also a form of parallelization).

1.2 Storage issues

On the other hand, what if the data to be processed is 10 TB, and the machine you have only has a 500 GB hard disk?

  • One way is vertical scaling: build a machine with dozens of TB of disk;

  • The other is horizontal scaling: add more machines and spread the data across them.

The former quickly hits a bottleneck. Data keeps growing without limit, but the capacity of a single machine is limited, so for large data volumes only the latter is viable. Spreading data across multiple machines solves the problem of not being able to store it on one.
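To make "spread the data across machines" concrete, here is a small Python sketch that assigns each file to a machine by hashing its name. This is only one possible placement scheme, chosen for brevity; the node names, file names, and the functions `pick_node` and `place_files` are all hypothetical.

```python
import hashlib
from collections import defaultdict


def pick_node(key, nodes):
    """Map a key (e.g. a file name) to one node using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


def place_files(file_names, nodes):
    """Return {node: [file names]} describing where each file is stored."""
    placement = defaultdict(list)
    for name in file_names:
        placement[pick_node(name, nodes)].append(name)
    return dict(placement)


if __name__ == "__main__":
    nodes = ["node-1", "node-2", "node-3"]  # hypothetical machines
    files = [f"file-{i}.log" for i in range(1000)]
    for node, names in sorted(place_files(files, nodes).items()):
        print(node, len(names))
```

A plain hash-modulo scheme like this moves most keys when the number of nodes changes, so treat it as an illustration of the idea rather than a recommended design.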

At the same time, once the computation is distributed as described above, the programs cannot all read their data from the same machine; otherwise efficiency would be dragged down by that single machine's limits, such as disk I/O and network bandwidth. This also forces the data to be distributed across the machines.

For these two reasons, data storage is also distributed.

2. Overview of distributed systems

  • Baidu Encyclopedia

A network-based computer processing technique, as opposed to centralized processing. The great improvement in the performance of personal computers, together with their widespread use, has made it possible to distribute processing power across all the computers on a network. Distributed computing is the opposite of centralized computing, and its data can be spread over a large area.

**A distributed system is a system in which hardware or software components are distributed across different networked computers and communicate and coordinate with each other only through message passing.** In simple terms, a group of independent computers collectively provides services to the outside world, but to the system's users it looks like a single computer providing those services.

  • Distributed means that many ordinary computers (as opposed to expensive mainframes) can form a distributed cluster that provides services to the outside world. The more computers there are, the more CPU, memory, and storage resources are available, and the more concurrent access the cluster can handle.

From the definition of a distributed system, we know that hosts communicate and coordinate mainly over the network, so there is almost no limit on where the computers in a distributed system are located. They may sit in different cabinets, be deployed in different machine rooms, or even be in different cities. For large-scale websites, they may be spread across different countries and regions.

3. Distributed implementation scheme

4. Distributed system

Examples are as follows:

Zhang San's company has three systems: System A, System B, and System C. They handle different parts of the business and are deployed on three separate machines. They call one another (over the network, of course) and work together to complete the company's business processes.

[Figure: ./0.jpeg]

Different services, deployed in different places, make up a distributed system.

But now a problem arises. System A is the face of the entire distributed system, and users access it directly. When the number of visits is large, it either becomes extremely slow or goes down outright. What should we do?

Since there is only one copy of system A, it is a single point of failure.

Summary: a distributed system needs to meet the following basic requirements:

5. Cluster

If Zhang San's company is not short of money, it can buy more machines. Zhang San deploys several copies of system A at once (for example, the three servers in the figure below). Each one is an instance of system A and provides the same service, so if one of them breaks, the other two remain.

These three servers together form a cluster.

[Figure: ./1.jpeg]

But from the user's point of view, there are now several copies of system A, each with a different IP address. Which one should they visit?

If everyone visits server 1.1, server 1.1 will be overwhelmed while the other two sit idle, which is a waste of money.

This is where load balancing comes in.

6. Load balancing

Zhang San wants to spread the work of system A as evenly as possible across the three machines. For example, if there are 30,000 requests, each of the three servers ideally handles 10,000. This is load balancing.

Obviously, load balancing is best handled separately, on a dedicated server (for example, one running Nginx).
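The ideal 10,000 / 10,000 / 10,000 split can be sketched with a simple round-robin strategy in Python. This is a toy model of the idea, not how Nginx is implemented; the class `RoundRobinBalancer` and the function `simulate` are names invented for this example, and the server labels reuse the 1.1, 1.2, 1.3 naming from this article.

```python
from collections import Counter
from itertools import cycle


class RoundRobinBalancer:
    """Hand out servers in a fixed rotating order."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = cycle(list(servers))

    def next_server(self):
        """Return the server that should receive the next request."""
        return next(self._servers)


def simulate(servers, num_requests):
    """Count how many requests each server receives."""
    balancer = RoundRobinBalancer(servers)
    return Counter(balancer.next_server() for _ in range(num_requests))


if __name__ == "__main__":
    counts = simulate(["1.1", "1.2", "1.3"], 30_000)
    for server in sorted(counts):
        print(server, counts[server])  # each server gets 10000 requests
```

Round-robin ignores how busy each server actually is, which is why the even split is only the ideal case.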

[Figure: ./2.jpeg]

Later, Zhang San discovers that although the load-balancing server's job is simple (receive requests and distribute them), it can still go down, so a single point of failure remains. There is no other way: the load balancer has to become a cluster too. This cluster differs from the system A cluster in two ways:

  • We can use some mechanism to make the cluster expose only one IP address to the outside world, so that users see what looks like a single machine.

  • Only one load-balancing machine works at any time, while the other stands by. If the working one goes down, the standby takes over.
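The two points above (one shared address, one active machine with a standby) can be modeled in a few lines of Python. This is a simplified sketch: the class names, the `healthy` flag, and the address `10.0.0.100` are assumptions for illustration, and a real setup would detect failure through some health-check mechanism rather than a flag set by hand.

```python
class LoadBalancerNode:
    """One load-balancing machine; `healthy` is flipped by hand in this sketch."""

    def __init__(self, name):
        self.name = name
        self.healthy = True


class ActiveStandbyPair:
    """Two load balancers behind one shared address; only one is active."""

    def __init__(self, shared_ip, primary, standby):
        self.shared_ip = shared_ip
        self.active = primary
        self.standby = standby

    def check(self):
        """Promote the standby if the active node has failed."""
        if not self.active.healthy and self.standby.healthy:
            self.active, self.standby = self.standby, self.active
        return self.active

    def handle(self, request):
        """Route a request through whichever node is currently active."""
        node = self.check()
        if not node.healthy:
            raise RuntimeError("no healthy load balancer available")
        return f"{request} handled by {node.name} via {self.shared_ip}"


if __name__ == "__main__":
    lb1, lb2 = LoadBalancerNode("lb-1"), LoadBalancerNode("lb-2")
    pair = ActiveStandbyPair("10.0.0.100", lb1, lb2)
    print(pair.handle("GET /"))
    lb1.healthy = False  # the working machine goes down
    print(pair.handle("GET /"))
```

From the caller's side nothing changes after the failure: the request still goes to the same shared address.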

[Figure: ./3.jpeg]

7. Elasticity

Elasticity is also known as scalability.

If the three instances of system A still cannot handle a flood of requests, such as on Double Eleven, you can apply for additional servers. By creating and deleting virtual servers, you can easily add or remove capacity as the volume of user requests changes.
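One way to picture the scaling decision is a small Python function that turns the current request rate into a target number of instances. The function name `desired_instances`, the per-instance capacity, and the minimum and maximum bounds are all made-up values for this sketch; a real platform would have its own rules.

```python
import math


def desired_instances(requests_per_second, capacity_per_instance,
                      min_instances=1, max_instances=20):
    """Return how many instances are needed for the current load.

    The result is clamped to [min_instances, max_instances] so the
    cluster never shrinks to zero or grows without bound.
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    needed = math.ceil(requests_per_second / capacity_per_instance)
    return max(min_instances, min(max_instances, needed))


if __name__ == "__main__":
    # Assumed capacity: each instance handles 10,000 requests per second.
    for load in (0, 30_000, 95_000, 1_000_000):
        print(load, desired_instances(load, 10_000, min_instances=3))
```

Comparing the returned value with the number of running instances tells you whether to create or delete virtual servers.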

8. Failover

The system above looks great, but it makes an unrealistic assumption: that all services are stateless. In other words, it assumes that any two requests from the same user are unrelated to each other.

But the reality is that most services are stateful, such as shopping carts.

Example:

  • A user accesses the system, creates a shopping cart on the server, and adds several items to it. Then server 1.1 goes down, and the user's subsequent requests can no longer reach it. At this point a failover is required, so that the other servers take over and process the user's requests.

  • But here is the question: do servers 1.2 and 1.3 have the user's shopping cart? If not, the user will complain: where is the shopping cart I just created? Worse, suppose the user's login information was stored in the session on server 1.1. Now that the server is down, the session is gone, and the user is kicked back to the login screen and has to log in again!

If the state problem is not handled well, the cluster loses much of its value: real failover cannot be achieved, and the system may even become unusable. What can be done?

  • One way is to replicate the state information among the servers in the cluster, so that they all hold a consistent copy.

  • Another way is to store the state information centrally in one place that every server in the cluster can access.
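The second approach can be sketched in Python as follows. Here an in-memory `SessionStore` object stands in for the shared storage location, and `AppServer` and `first_available` are hypothetical names; in a real deployment the store would be a separate service that must itself stay available.

```python
class SessionStore:
    """Stand-in for a shared store that every server can reach."""

    def __init__(self):
        self._carts = {}

    def add_item(self, session_id, item):
        self._carts.setdefault(session_id, []).append(item)

    def get_cart(self, session_id):
        return list(self._carts.get(session_id, []))


class AppServer:
    """Keeps no per-user state of its own; everything lives in the store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.up = True

    def _require_up(self):
        if not self.up:
            raise ConnectionError(f"server {self.name} is down")

    def add_to_cart(self, session_id, item):
        self._require_up()
        self.store.add_item(session_id, item)

    def view_cart(self, session_id):
        self._require_up()
        return self.store.get_cart(session_id)


def first_available(servers):
    """Very simple failover: pick the first server that is still up."""
    for server in servers:
        if server.up:
            return server
    raise RuntimeError("no server available")


if __name__ == "__main__":
    store = SessionStore()
    servers = [AppServer(n, store) for n in ("1.1", "1.2", "1.3")]
    servers[0].add_to_cart("user-42", "book")
    servers[0].up = False  # server 1.1 goes down
    survivor = first_available(servers)
    print(survivor.name, survivor.view_cart("user-42"))
```

Because the cart lives outside server 1.1, the user still sees it after the failover to server 1.2.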

Finish!

Origin blog.csdn.net/m0_52735414/article/details/128880553