Notes on Distributed Systems

 

I have a rough understanding of distributed systems and am recording some key points here for future reference.


1 Erasure Coding (EC) and the Reed-Solomon Algorithm


Distributed storage has to pay attention to efficiency. Normally a system replicates each piece of data to keep it safe, for example by storing two or three copies. Replication is simple and costs no CPU computation, but the storage cost is high.


Some companies, such as Qiniu, use erasure coding (EC) to reduce storage cost. I don't know their exact algorithm. It probably splits the data into 28 parts and computes 4 parts of redundant data, so that about 1.14 times the original data volume (32/28) gives protection comparable to storing three copies. Baidu Netdisk also uses EC to save storage, for example the Reed-Solomon algorithm: the data is split into 8 parts and 4 redundant parts are computed, so 1.5 times the data volume gives protection comparable to three copies. The drawback of erasure coding is, of course, a relatively high CPU load.
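
The storage arithmetic above can be checked with a small sketch. It assumes both layouts are MDS codes in the Reed-Solomon style, where any `parity_parts` fragments may be lost; the function names are made up for this note and the 28+4 figures are the guess quoted above, not a confirmed configuration.

```python
from fractions import Fraction


def ec_profile(data_parts: int, parity_parts: int) -> dict:
    """Storage overhead and loss tolerance of a (data + parity) erasure code.

    Assumes an MDS code such as Reed-Solomon, where any `parity_parts`
    fragments can be lost and the data can still be rebuilt.
    """
    total = data_parts + parity_parts
    return {
        "overhead": Fraction(total, data_parts),  # stored bytes / original bytes
        "tolerated_losses": parity_parts,
    }


def replication_profile(copies: int) -> dict:
    """Same two numbers for plain replication with `copies` full copies."""
    return {"overhead": Fraction(copies), "tolerated_losses": copies - 1}


if __name__ == "__main__":
    for name, profile in [
        ("3 replicas", replication_profile(3)),
        ("EC 28+4", ec_profile(28, 4)),
        ("EC 8+4", ec_profile(8, 4)),
    ]:
        print(
            f"{name}: {float(profile['overhead']):.2f}x storage, "
            f"survives {profile['tolerated_losses']} lost pieces"
        )
```

Both EC layouts survive more lost pieces than three replicas while storing far less, which is where the cost saving comes from; the price is the encoding and decoding work on the CPU.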


2 Multi-granularity storage


When data is stored on a SATA disk, space is allocated to files in 4 KB pages, so the granularity is 4 KB. Distributed storage has to choose a data granularity in the same way. To balance storage utilization against IO efficiency, distributed systems generally adopt several levels of granularity. For example, a large-file storage system such as HDFS can use 8 MB as the smallest granularity and 256 MB as the largest, with four sizes in between (16 MB, 32 MB, 64 MB and 128 MB), for six granularities in total. The system can partition each machine in the cluster evenly into units of one granularity. When an external write request arrives, the metaserver can place the file on a suitable machine according to the size of the file.
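
A minimal sketch of how a metaserver might map a file size to one of the six granularities. The policy used here (the smallest granularity that holds the whole file, otherwise the largest one split into several blocks) is my assumption; the text only says the choice depends on the file size. `pick_granularity` and `blocks_needed` are hypothetical names.

```python
from typing import Tuple

MB = 1024 * 1024
# The six granularities mentioned above, smallest to largest.
GRANULARITIES = [8 * MB, 16 * MB, 32 * MB, 64 * MB, 128 * MB, 256 * MB]


def pick_granularity(file_size: int) -> int:
    """Return the smallest granularity that holds the file, else the largest."""
    for granularity in GRANULARITIES:
        if file_size <= granularity:
            return granularity
    return GRANULARITIES[-1]


def blocks_needed(file_size: int) -> Tuple[int, int]:
    """Return (granularity, number of blocks) for a file of `file_size` bytes."""
    granularity = pick_granularity(file_size)
    count = max(1, -(-file_size // granularity))  # ceiling division, at least 1
    return granularity, count


if __name__ == "__main__":
    for size_mb in (5, 100, 600):
        granularity, count = blocks_needed(size_mb * MB)
        print(f"{size_mb} MB file -> {count} block(s) of {granularity // MB} MB")
```

Small files waste at most a few megabytes in an 8 MB unit, while large files are read in a few big sequential chunks, which is the utilization versus IO trade-off described above.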


Some in-memory databases (such as Tencent's CMEM) can be regarded as object file systems. CMEM's default object size is 84 B, although the client can choose the object size. Each data bucket in CMEM and its successor CKV (that is, the unit of data storage in the cluster) is 1 GB, which is why a search for CKV turns up the claim that its data scaling range is 1 GB to 1 PB.


3 Serial writes and multiple reads


As mentioned above, large-scale distributed systems such as HDFS generally store three copies of the data in order to guarantee the P in CAP. So the first question is which three machines in the cluster to choose. Distributed systems generally write faster than they read. This is a notable difference from stand-alone systems, because most distributed systems serve far more reads than writes. The metaserver can first pick four machines, test the network speed to each of them, and choose the three with the fastest network, one as the master and the other two as slaves.
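
The selection step can be sketched as follows. Two things here are assumptions of mine: the "network speed test" is represented by a measured round-trip latency per machine, and the fastest machine is made the master (the text only says one of the three is the master). The machine names are placeholders.

```python
from typing import Dict, List, Tuple


def choose_replicas(latency_ms: Dict[str, float]) -> Tuple[str, List[str]]:
    """Pick the three fastest candidates: one master and two slaves.

    `latency_ms` maps a machine name to its measured latency (lower is faster).
    """
    if len(latency_ms) < 3:
        raise ValueError("need at least three candidate machines")
    ranked = sorted(latency_ms, key=latency_ms.get)  # fastest first
    master, slaves = ranked[0], ranked[1:3]
    return master, slaves


if __name__ == "__main__":
    probes = {"node-a": 5.2, "node-b": 1.1, "node-c": 3.4, "node-d": 9.8}
    master, slaves = choose_replicas(probes)
    print("master:", master, "slaves:", slaves)
```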


When the metaserver of the file system receives a write request from a client, it tells the three machines to write serially in order to guarantee the C in CAP, that is, strict data consistency. The metaserver first sends the request packet to the master. On receiving it, the master places the file content in memory and, as soon as that succeeds, immediately forwards the request packet to the first slave. After the first slave writes successfully, it passes the data on to the second slave. After the second slave writes successfully, an ack packet is sent back to the master to tell it that both slaves have written the data. The master then checks whether its own write has succeeded, and only then returns an ok packet to the metaserver.

The master forwards the packet to the slave without waiting for its own write to finish, for two reasons. The first is to keep the serial write fast. The second is that until all three copies (master, slave, slave) have been written, a read request for this file that the metaserver forwards to the master from another client cannot read the data from disk, so other clients cannot see the file. This guarantees strict consistency of the data.
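
The write path can be simulated in a single thread. This is only a sketch under simplifying assumptions: the network is a direct method call, "disk" is a dictionary, and the `Node` class with its `fail_disk` switch is invented for illustration. It does not model rollback on slaves that already committed.

```python
class Node:
    """One replica in a master -> slave -> slave write chain (toy model)."""

    def __init__(self, name, next_node=None):
        self.name = name
        self.next_node = next_node  # downstream replica, None for the last slave
        self.buffer = {}            # received but not yet readable
        self.disk = {}              # committed data, the only thing reads see
        self.fail_disk = False      # test hook: simulate a failed local write

    def write(self, key, value) -> bool:
        # 1. Keep the content in memory first.
        self.buffer[key] = value
        # 2. Forward downstream before the local write is confirmed.
        if self.next_node is not None and not self.next_node.write(key, value):
            self.buffer.pop(key, None)
            return False
        # 3. Downstream acked: confirm the local write, then ack upstream.
        if self.fail_disk:
            self.buffer.pop(key, None)
            return False
        self.disk[key] = self.buffer.pop(key)
        return True

    def read(self, key):
        # Buffered data is invisible, so a half-finished write is never served.
        return self.disk.get(key)


if __name__ == "__main__":
    slave2 = Node("slave2")
    slave1 = Node("slave1", slave2)
    master = Node("master", slave1)
    print(master.write("file", b"hello"), master.read("file"))
    slave2.fail_disk = True
    print(master.write("other", b"data"), master.read("other"))
```

Because the master acknowledges only after the whole chain has succeeded, a failure anywhere downstream leaves nothing readable on the master, which is the consistency property described above.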


4 Command cancellation


In a system like HDFS that guarantees strict data consistency, data storage uses a master-slave-slave structure and only the master serves external reads. Some distributed systems use only a two-level master-slave structure for data backup, and both master and slave can serve external reads. If the client sets a read timeout of 10 ms, it can send the read request to the master first and, if no reply has arrived from the master when the 10 ms timeout expires, send a read request to the slave as well.


If the master's reply then reaches the client 2 ms after the request to the slave was sent, the client can send a cancel-read command to the slave. On receiving this command, the slave no longer executes the read request it received earlier.
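
A rough client-side sketch of this pattern using Python's `asyncio`. The replicas are simulated with `asyncio.sleep`, task cancellation stands in for the cancel-read command, and the delays are stretched beyond the 10 ms and 2 ms in the text so the demo is not sensitive to scheduler jitter. `replica_read` and `hedged_read` are hypothetical names.

```python
import asyncio


async def replica_read(name: str, delay_s: float, log: list) -> str:
    """Simulated replica: answers after `delay_s` unless it is cancelled."""
    try:
        await asyncio.sleep(delay_s)
        return f"value-from-{name}"
    except asyncio.CancelledError:
        log.append(f"{name} cancelled")  # the replica drops the pending read
        raise


async def hedged_read(master_delay_s, slave_delay_s, timeout_s, log) -> str:
    master = asyncio.ensure_future(replica_read("master", master_delay_s, log))
    done, _ = await asyncio.wait({master}, timeout=timeout_s)
    if done:
        return master.result()  # the master answered within the timeout

    # The master is late: also ask the slave, but keep waiting for the master.
    slave = asyncio.ensure_future(replica_read("slave", slave_delay_s, log))
    done, pending = await asyncio.wait(
        {master, slave}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # "cancel read" for whichever replica has not answered
    await asyncio.gather(*pending, return_exceptions=True)
    return done.pop().result()


if __name__ == "__main__":
    events = []
    print(asyncio.run(hedged_read(0.2, 2.0, 0.01, events)), events)
```

Cancelling the slower request keeps a slow master from doubling the read load on the cluster every time the timeout fires.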


5 Single thread and multi-thread


Among distributed system developers, some prefer multi-threading, as in memcached, while others prefer a single thread, as in Redis. Judging by the current trend, Redis seems to have more users than memcached. However, because Redis is built around a single thread, some puzzling problems can occur.


One scenario I know of: if the user backs up in-memory data to a SATA disk with AOF, then when the disk is full Redis clears the disk and writes all in-memory data to it again. If an external write request arrives at that moment, Redis alternates between handling the write request and writing the in-memory data to disk, and the data in the write request may be lost. If external read requests arrive at the same time, things get even worse.


6 Master-slave data synchronization
Many systems that use logs for master-slave synchronization, such as MySQL, have the concept of a checkpoint. Periodically, the system checks whether the checksum of the data written since the last checkpoint matches on both sides. If it does not, only the data from that checkpoint onward has to be resynchronized.
For detailed steps, please refer to ["Introduction to Tencent Game Data Self-Healing Service Solution"][1].
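
The checkpoint idea can be sketched with in-memory logs. This is an illustration only, not how MySQL implements replication: the logs are Python lists of strings, the checksum is SHA-256 over the entries since the checkpoint, and `resync_from_checkpoint` is a made-up helper.

```python
import hashlib


def checksum(entries) -> str:
    """Digest of a sequence of log entries; newline keeps entry boundaries distinct."""
    digest = hashlib.sha256()
    for entry in entries:
        digest.update(entry.encode("utf-8") + b"\n")
    return digest.hexdigest()


def resync_from_checkpoint(master_log: list, slave_log: list, checkpoint: int) -> int:
    """Compare the log tails since `checkpoint` and repair the slave if they differ.

    Returns the new checkpoint position (the current length of the master log).
    """
    if checksum(master_log[checkpoint:]) != checksum(slave_log[checkpoint:]):
        # Only the tail after the last good checkpoint is copied again.
        slave_log[checkpoint:] = master_log[checkpoint:]
    return len(master_log)


if __name__ == "__main__":
    master = ["set a 1", "set b 2", "set c 3", "set d 4"]
    slave = ["set a 1", "set b 2", "set c 9"]  # diverged after checkpoint 2
    new_checkpoint = resync_from_checkpoint(master, slave, checkpoint=2)
    print(new_checkpoint, slave == master)
```

The benefit is that a mismatch costs only a resend of the data since the last checkpoint instead of a full copy.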


7 Message queue
A clever aspect of the Kafka message queue is that one writer writes data directly to one disk and multiple writers write to separate disks, so consumers read directly from the corresponding disk. As a result, the bottleneck of the queue is the CPU.
  
8 Data consistency
A cache generally holds the hot data of the database. Most systems face concurrent reads and writes. How can a write keep the data seen by the front end consistent while guaranteeing eventual consistency of the back-end data?
A write first sends a freeze command to the cache. If the freeze fails, the write fails, because it means an earlier write request has already frozen the entry. Once the freeze succeeds, the write is applied to MySQL and the request is done. While an entry is frozen, the cache can only be read; writes and further freeze commands fail. When the freeze times out, the cached data is deleted. The next read then finds no data in the cache, reads it from MySQL and writes it back to the cache, and only then does the read count as successful. If the cache receives a freeze command for data it does not hold, it replies to the client that there is no data. The write request then sends a read command first (which loads the old data from MySQL into the cache) and sends the freeze command again.
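
A toy model of the freeze protocol, to make the state transitions concrete. Everything here is an assumption for illustration: MySQL is a plain dictionary, the freeze timeout is checked lazily on access, and the names `FreezeCache`, `freeze`, `write` and the status strings are invented. The clock is injectable so that the timeout can be exercised without sleeping.

```python
import time


class FreezeCache:
    """Cache in front of a database, with a per-key freeze that times out."""

    def __init__(self, db: dict, freeze_ttl_s: float = 0.05, clock=time.monotonic):
        self.db = db                # stands in for MySQL
        self.data = {}              # cached hot data
        self.frozen_until = {}      # key -> time at which the freeze expires
        self.freeze_ttl_s = freeze_ttl_s
        self.clock = clock

    def _expire(self, key) -> None:
        deadline = self.frozen_until.get(key)
        if deadline is not None and self.clock() >= deadline:
            # Freeze timed out: drop the stale entry so the next read reloads it.
            del self.frozen_until[key]
            self.data.pop(key, None)

    def read(self, key):
        self._expire(key)
        if key not in self.data:
            self.data[key] = self.db.get(key)  # cache miss: load from the database
        return self.data[key]

    def freeze(self, key) -> str:
        self._expire(key)
        if key in self.frozen_until:
            return "busy"       # an earlier write already froze this key
        if key not in self.data:
            return "no_data"    # the caller must read first, then freeze again
        self.frozen_until[key] = self.clock() + self.freeze_ttl_s
        return "ok"


def write(cache: FreezeCache, key, value) -> bool:
    """Write path: freeze the cached entry, then update the database."""
    status = cache.freeze(key)
    if status == "no_data":
        cache.read(key)             # pull the old value into the cache
        status = cache.freeze(key)
    if status != "ok":
        return False                # another write is in progress
    cache.db[key] = value
    return True


if __name__ == "__main__":
    cache = FreezeCache({"k": "old"})
    print(write(cache, "k", "new"), write(cache, "k", "newer"), cache.read("k"))
    time.sleep(0.1)
    print(cache.read("k"))
```

During the freeze window every reader sees the same old value, and after the timeout every reader sees the new one, so the front end never observes a mix of the two.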


That is all for this note. If I learn new items later, I will add them here. Errors are inevitable in an article like this, so corrections are welcome.
    


  [1]: http://blog.csdn.net/dba_waterbin/article/details/43608587

 


Reposted from: blog.csdn.net/menggucaoyuan/article/details/42928559