Personal understanding of how distributed systems solve the storage and computing problems of big data

  • Distributed: How to solve the calculation and storage problems of large amounts of data?

    • Question 1: Why not use MySQL for storage and analysis calculations?

      • The amount of data is too large for MySQL to store
      • Even if the data could be stored, processing performance would be very poor
        • The value of data gradually decreases over time, so how quickly it is processed matters
        • Offline architecture: process data in units of time
          • For example, yesterday's data is processed today, so latency is relatively high [minutes or more]
        • Real-time architecture: process data as it is generated
          • Each record is processed as soon as it is generated, so latency is relatively low [millisecond level]
      • The data comes in a variety of types
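The difference between the offline and real-time styles can be sketched in a few lines. This is only an illustration: the event list, the function `offline_daily_totals`, and the class `RealTimeTotal` are hypothetical names invented here, not part of any framework.

```python
from collections import defaultdict

# Hypothetical events: (day the record was generated, amount)
events = [("day1", 3), ("day1", 4), ("day2", 5)]


def offline_daily_totals(all_events):
    """Offline style: wait until a period's data is complete, then process it in one pass."""
    totals = defaultdict(int)
    for day, amount in all_events:
        totals[day] += amount
    return dict(totals)


class RealTimeTotal:
    """Real-time style: update the result every time a single record arrives."""

    def __init__(self):
        self.total = 0

    def on_event(self, amount):
        self.total += amount
        return self.total


daily = offline_daily_totals(events)
stream = RealTimeTotal()
running = [stream.on_event(amount) for _, amount in events]
```

The offline version produces nothing until the whole batch is available, while the real-time version has an up-to-date result after every record.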
    • Question 2: How to solve the problem that the data cannot be stored or calculated?

      • Distributed: divide and conquer
        • Split first
        • Compute and process
        • Merge afterwards
      • Definition: logically merge the resources of multiple machines [a cluster] into a single whole, and provide distributed services through distributed software
      • Process
        • step1: There is a big task: storage or computation
        • step2: Submit it to the distributed service, which performs the split
          • Divide this large task into several smaller tasks
        • step3: The distributed service assigns the small tasks to multiple machines for joint execution, and each machine handles different small tasks
        • step4: When the user needs the result, the results of all small tasks are combined and the final result is returned
      • Example
        • Storage
          • Machines: 3 × 8TB = 24TB
          • File: 15TB
          • Process
            • The user submits a 15TB file to the distributed storage service
            • The distributed service splits this file
              • Block1:5TB
              • Block2:5TB
              • Block3:5TB
            • The distributed service stores the three 5TB blocks on the three machines, one block per machine
              • Metadata: the service must record the file, its three blocks, and the machine on which each block is stored
            • When the user asks the distributed service to read the file, the service recombines the three blocks into the original file and returns it to the user
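The storage example can be sketched as a small in-memory model, scaled down so that 15 bytes stand in for the 15TB file and 5-byte blocks stand in for the 5TB blocks. `ToyStorageService` and everything in it are hypothetical names for illustration; round-robin placement is an assumption, and a real system would also handle replication and failures.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStorageService:
    """Toy stand-in for a distributed storage service (hypothetical, in-memory)."""

    node_names: list
    block_size: int
    # node name -> {block id -> bytes}; each inner dict plays the role of one machine's disk
    nodes: dict = field(default_factory=dict)
    # file name -> ordered list of (block id, node name); this is the metadata
    metadata: dict = field(default_factory=dict)

    def __post_init__(self):
        self.nodes = {name: {} for name in self.node_names}

    def put(self, file_name, data):
        """Split the data into blocks, spread them over the nodes, record where each went."""
        placements = []
        for index, start in enumerate(range(0, len(data), self.block_size)):
            block_id = f"{file_name}#block{index + 1}"
            node = self.node_names[index % len(self.node_names)]  # assumed round-robin placement
            self.nodes[node][block_id] = data[start:start + self.block_size]
            placements.append((block_id, node))
        self.metadata[file_name] = placements

    def get(self, file_name):
        """Use the metadata to fetch every block in order and merge them back together."""
        return b"".join(
            self.nodes[node][block_id] for block_id, node in self.metadata[file_name]
        )


service = ToyStorageService(node_names=["node1", "node2", "node3"], block_size=5)
original = b"ABCDEFGHIJKLMNO"  # 15 bytes standing in for the 15TB file
service.put("big_file", original)
restored = service.get("big_file")
```

Without the metadata, the service would have no way to know which blocks belong to the file or in what order to merge them.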
        • Computation
          • Machines: 3 × (2 cores, 4GB) => 6 cores, 12GB in total
          • File: 9GB => Accumulation: 1 + ... + 9
          • Process
            • The user submits the computation to the distributed computing service: 1 + ... + 9
            • The distributed service splits this computation
              • task1:1+2+3
              • task2:4+5+6
              • task3:7+8+9
              • task4: accumulate the results of other tasks
            • Assign the first three tasks to the three machines to run
              • node1:task1:6
              • node2:task2:15
              • node3:task3:24
            • Start task4 to merge the results from the three machines
              • node3:task4:45
            • Return the final result to the user
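The computation example maps directly onto a split, compute, merge sketch. Here three threads stand in for the three machines, which is an assumption made only to keep the example self-contained; `split` and `run_distributed_sum` are hypothetical helper names.

```python
from concurrent.futures import ThreadPoolExecutor


def split(numbers, parts):
    """Divide the big task into roughly equal smaller tasks."""
    size = max(1, -(-len(numbers) // parts))  # ceiling division, at least 1
    return [numbers[i:i + size] for i in range(0, len(numbers), size)]


def run_distributed_sum(numbers, workers=3):
    chunks = split(numbers, workers)                     # task1..task3
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, chunks))           # each worker sums its own chunk
    return partials, sum(partials)                       # task4: merge the partial results


partials, total = run_distributed_sum(list(range(1, 10)))
print(partials, total)  # [6, 15, 24] 45
```

The partial results match node1, node2, and node3 in the example, and the final merge gives the same 45 as task4.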
    • Question 3: What problems does a distributed system solve?

      • It solves the storage and computation of large amounts of data
      • Insufficient resources on a single machine
      • Poor performance of a single machine's resources [the main problem]
    • Question 4: What does a general distributed architecture look like? [excluding zookeeper]

      • Master-slave architecture: master and slave nodes [processes]
      • Master node: management node
        • Mainly responsible for managing the distributed service
          • Tracks whether each slave node is alive or dead
          • Manages the assignment of tasks
        • Request handling: accepts client requests
        • The master process has a different name in each distributed framework: Leader, NameNode, ResourceManager, Master...
      • Slave node: responsible for managing its own machine's resources
        • There are as many slave nodes as there are machines
        • The slave process has a different name in each distributed framework: Follower, DataNode, NodeManager, Worker...
        • Accepts the [small] tasks assigned by the master node and uses its own machine's resources to execute them
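A minimal sketch of the master's two management duties, tracking which slaves are alive and assigning tasks to them, might look like the following. The heartbeat mechanism, the timeout value, and the `Master` class are assumptions for illustration; the text above does not say how liveness is actually detected.

```python
class Master:
    """Toy master node: tracks slave liveness and hands out tasks (hypothetical)."""

    def __init__(self, heartbeat_timeout):
        self.heartbeat_timeout = heartbeat_timeout
        self.last_heartbeat = {}  # slave name -> time of its most recent heartbeat

    def heartbeat(self, slave, now):
        """Called by a slave to report that it is still alive."""
        self.last_heartbeat[slave] = now

    def live_slaves(self, now):
        """A slave counts as alive if it reported within the timeout window."""
        return sorted(
            slave
            for slave, seen in self.last_heartbeat.items()
            if now - seen <= self.heartbeat_timeout
        )

    def assign(self, tasks, now):
        """Spread the small tasks over the live slaves in round-robin order."""
        live = self.live_slaves(now)
        if not live:
            raise RuntimeError("no live slave nodes")
        plan = {slave: [] for slave in live}
        for index, task in enumerate(tasks):
            plan[live[index % len(live)]].append(task)
        return plan


master = Master(heartbeat_timeout=10)
for name in ("node1", "node2", "node3"):
    master.heartbeat(name, now=0)
master.heartbeat("node1", now=12)
master.heartbeat("node2", now=12)
# node3 has been silent since time 0, so at time 15 the master treats it as dead
plan = master.assign(["task1", "task2", "task3"], now=15)
```

Because node3 is considered dead, its share of the work is redistributed to the two remaining slaves.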
    • Question 5: What problems does the distributed architecture have?

      • Single point of failure: there is only one master node. If the master process or the machine it runs on fails, the entire distributed service becomes unavailable
      • Distributed data consistency: multiple machines need to share the same data, so how do we ensure they all read consistent data?
      • Solution: Zookeeper
    • Question 6: How does zookeeper solve these two distributed problems?

      • Problem: data consistency
        • Use ZK as consistent storage and store the shared data in ZK
        • All nodes read the data from ZK
      • Problem: single point of failure of the master node
        • Solution: the distributed framework can run multiple master nodes while ensuring that only one is working at any time
        • State
          • Active: working
          • Standby: backup
      • Problem: how is it decided who is working and who is the backup?
        • Solution: use zookeeper's ephemeral nodes [temporary nodes] to assist the election
        • Implementation: master nodes A and B both try to create an ephemeral node with the same name in ZK. Whoever creates it successfully becomes Active; the other fails to create it because the node already exists
          • Assume A creates it successfully, so A is Active
          • B acts as Standby and sets a watch on the node. If A fails, its session with ZK is disconnected and the ephemeral node is deleted. B receives the watch notification and switches to the Active state
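The election steps above can be sketched with an in-memory stand-in for ZK. `ToyCoordinator` is not the real ZooKeeper client API; it only imitates the three behaviours the election relies on: creating an ephemeral node fails if the node exists, a watch fires once when the node is deleted, and closing a session deletes that session's ephemeral nodes. The path `/active-master` and all class names are hypothetical.

```python
class ToyCoordinator:
    """In-memory imitation of the ZK features used for election (NOT the real API)."""

    def __init__(self):
        self.ephemeral = {}  # path -> session that owns the ephemeral node
        self.watchers = {}   # path -> callbacks to fire once when the node is deleted

    def create_ephemeral(self, path, session):
        if path in self.ephemeral:
            return False  # node already exists, creation fails
        self.ephemeral[path] = session
        return True

    def watch_delete(self, path, callback):
        self.watchers.setdefault(path, []).append(callback)

    def close_session(self, session):
        """Simulate a crashed client: its ephemeral nodes vanish and watches fire."""
        owned = [path for path, owner in self.ephemeral.items() if owner == session]
        for path in owned:
            del self.ephemeral[path]
            for callback in self.watchers.pop(path, []):  # one-shot watches
                callback()


class MasterNode:
    LOCK_PATH = "/active-master"  # hypothetical node name shared by all candidates

    def __init__(self, name, coordinator):
        self.name = name
        self.coordinator = coordinator
        self.state = "Standby"

    def campaign(self):
        if self.coordinator.create_ephemeral(self.LOCK_PATH, session=self.name):
            self.state = "Active"
        else:
            self.state = "Standby"
            # Try again as soon as the current Active master's node disappears
            self.coordinator.watch_delete(self.LOCK_PATH, self.campaign)


zk = ToyCoordinator()
a, b = MasterNode("A", zk), MasterNode("B", zk)
a.campaign()
b.campaign()
states_before = (a.state, b.state)  # A won the race, B is waiting
zk.close_session("A")               # A fails and its session is lost
state_of_b_after_failover = b.state
```

After A's session closes, B's watch fires, B creates the node itself, and it becomes the new Active master without any manual intervention.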
  • Zookeeper: Solving distributed problems

    • Function
      • Stores shared data: metadata, index data
      • Assists elections
    • All distributed frameworks either use ZK to solve these distributed problems or implement a ZK-like solution themselves
    • Question 7: Zookeeper itself is also distributed, so how does it solve its own problems?
      • Question: If one ZK node fails, is the service affected?
        • No, it is not affected
        • ZK nodes are peers: the content stored on every ZK node is consistent, and any ZK node can accept read and write requests
      • Question: How does ZK ensure that the content on every machine is consistent?
        • Only the leader is allowed to write, and the leader synchronizes the data to the other nodes
      • Question: What if the leader fails?
        • Peer nodes: every machine can be elected as the leader
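The three answers above (any node accepts requests, only the leader writes and synchronizes, any node can become leader) can be combined in one toy model. This is a deliberately simplified sketch and not ZooKeeper's actual protocol, which relies on majority acknowledgement of writes; `ToyEnsemble` is a hypothetical class, and choosing the lowest server name as leader is an assumption made only to keep the election deterministic.

```python
class ToyEnsemble:
    """Toy replicated store with a single writer (NOT the real ZooKeeper protocol)."""

    def __init__(self, server_names):
        self.stores = {name: {} for name in server_names}  # one full copy per server
        self.alive = set(server_names)
        self.leader = min(self.alive)  # assumption: lowest name wins the election

    def write(self, via, key, value):
        self._check_alive(via)
        # The contacted server forwards the write to the leader, which applies it
        # and then synchronizes it to every other live server.
        self.stores[self.leader][key] = value
        for name in self.alive - {self.leader}:
            self.stores[name][key] = value

    def read(self, via, key):
        self._check_alive(via)
        return self.stores[via][key]  # every live server holds the same content

    def fail(self, name):
        self.alive.discard(name)
        if name == self.leader and self.alive:
            self.leader = min(self.alive)  # any surviving server can take over

    def _check_alive(self, name):
        if name not in self.alive:
            raise ConnectionError(f"{name} is down")


zk = ToyEnsemble(["zk1", "zk2", "zk3"])
zk.write(via="zk3", key="/active-master", value="A")
first_leader = zk.leader
zk.fail("zk1")  # the leader itself fails
new_leader = zk.leader
value_after_failover = zk.read(via="zk2", key="/active-master")
```

Because every server already holds a full copy of the data, losing the leader costs nothing but a new election.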


  • Hadoop: big data storage and computing problems => design: a distributed solution

Origin blog.csdn.net/mitao666/article/details/110472613