How rsync works

1) Software introduction

Rsync is a remote data synchronization tool that quickly synchronizes files between multiple hosts over a LAN or WAN. Rsync was originally written as a replacement for rcp, and it is currently maintained at rsync.samba.org. Rsync uses the so-called "Rsync algorithm" to synchronize files between the local and remote hosts. This algorithm transfers only the parts of the two files that differ instead of sending the entire file each time, so it is quite fast. The machine running the Rsync server is also called the backup server. One Rsync server can back up the data of multiple clients at the same time, and multiple Rsync servers can also back up the data of one client.

Rsync can be used over rsh or ssh, or in daemon mode. In daemon mode, the Rsync server listens on TCP port 873 and waits for Rsync connections. When a client connects, the Rsync server checks whether the password matches; if it does, the file transfer can start. On the first synchronization the entire file is transmitted once, and on later runs only the differences between the two files are transmitted.

Rsync supports most Unix-like systems and has been well tested on Linux, Solaris and BSD. Versions are also available for the Windows platform; the better-known ones are cwRsync and Sync2NAS.

The basic features of Rsync are as follows:

  1. It can mirror and save an entire directory tree and file system;
  2. It easily preserves the original file permissions, timestamps, soft and hard links, etc.;
  3. It can be installed without special permissions;
  4. Its optimized process gives high file transfer efficiency;
  5. It can transfer files over rsh, ssh, etc., or over a direct socket connection;
  6. It supports anonymous transmission.


2) Core algorithm

Suppose two similar files A and B are to be synchronized between two computers named α and β, where α has access to file A and β has access to file B, and suppose the network bandwidth between hosts α and β is very low. The Rsync algorithm then proceeds through the following five steps:

  1. β splits file B into a set of non-overlapping data blocks with a fixed size of S bytes. The last block may be smaller than S;
  2. β computes two checksums for each block: a 32-bit rolling weak checksum and a 128-bit MD4 strong checksum;
  3. β sends these checksums to α;
  4. α searches file A for all data blocks of size S (at any offset, not necessarily a multiple of S) whose weak and strong checksums are the same as those of some block of file B. The rolling property of the weak checksum lets this search be done quickly;
  5. α sends β a series of instructions for building a copy of file A on β. Each instruction is either a reference to a block that file B already has, which does not need to be retransmitted, or a piece of literal data that did not match any block of file B.
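
The five steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not rsync's actual implementation: the function names (`weak_parts`, `roll`, `signatures`, `delta`, `patch`) are made up for this example, the weak checksum follows the two-sum form from the published algorithm description, and `hashlib.md5` stands in for MD4 because many Python builds no longer expose MD4.

```python
import hashlib

M = 1 << 16  # each half of the 32-bit weak checksum is a 16-bit sum


def weak_parts(block):
    """Return the two 16-bit sums (a, b) that make up the weak checksum."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a, b, out, inp, n):
    """Slide an n-byte window by one byte: drop `out`, append `inp`."""
    a = (a - out + inp) % M
    b = (b - n * out + a) % M
    return a, b


def signatures(old, S):
    """Steps 1-2 (on β): weak checksum -> [(strong checksum, block index)]."""
    table = {}
    for idx, off in enumerate(range(0, len(old), S)):
        block = old[off:off + S]
        # MD5 is an assumption here, standing in for the MD4 strong checksum.
        entry = (hashlib.md5(block).digest(), idx)
        table.setdefault(weak_parts(block), []).append(entry)
    return table


def delta(new, table, S):
    """Step 4 (on α): build ("copy", block_index) / ("data", bytes) instructions."""
    ops, literal = [], bytearray()
    i, parts = 0, None
    while i < len(new):
        window = new[i:i + S]
        if parts is None:
            parts = weak_parts(window)
        match = None
        for strong, idx in table.get(parts, []):
            if strong == hashlib.md5(window).digest():
                match = idx
                break
        if match is not None:
            if literal:
                ops.append(("data", bytes(literal)))
                literal.clear()
            ops.append(("copy", match))
            i += len(window)
            parts = None
        else:
            literal.append(new[i])
            if i + S < len(new):
                parts = roll(parts[0], parts[1], new[i], new[i + S], S)
            else:
                parts = None  # the window shrinks near the end; recompute
            i += 1
    if literal:
        ops.append(("data", bytes(literal)))
    return ops


def patch(old, ops, S):
    """Step 5 (on β): rebuild file A from blocks of file B plus literal data."""
    out = bytearray()
    for kind, arg in ops:
        out += old[arg * S:(arg + 1) * S] if kind == "copy" else arg
    return bytes(out)


file_b = b"the quick brown fox jumps over the lazy dog"   # held by β
file_a = b"the quick red fox jumps over the lazy dog!"    # held by α
ops = delta(file_a, signatures(file_b, 8), 8)
assert patch(file_b, ops, 8) == file_a
```

Only the `("data", ...)` instructions carry file content over the network; every `("copy", ...)` instruction is just a block number, which is where the bandwidth saving comes from.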

 

3) Working process of file-level Rsync (transmit only changed files; my understanding)

* Machine A constructs FileList, which contains a name->id pair for every file that needs to be synchronized with machine B (the id uniquely identifies the file content, for example its MD5);
* Machine A sends FileList to machine B;
* A background program on machine B processes FileList and constructs NewFileList: by comparing MD5 values, it deletes the pairs for files that already exist unchanged on machine B and keeps only the files that are missing or have changed on machine B;
* Machine A gets NewFileList and transfers the files listed in it to machine B.
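
The exchange above might look roughly like the following. This is a sketch under stated assumptions: the two machines are modeled as in-memory dictionaries, and the helper names `build_file_list` and `build_new_file_list` are hypothetical, chosen to mirror FileList and NewFileList.

```python
import hashlib


def build_file_list(files):
    """Machine A: map each file name to the MD5 digest of its content."""
    return {name: hashlib.md5(data).hexdigest() for name, data in files.items()}


def build_new_file_list(file_list, local_files):
    """Machine B: keep only the entries that are missing or differ locally."""
    local = build_file_list(local_files)
    return {name: digest for name, digest in file_list.items()
            if local.get(name) != digest}


# Hypothetical in-memory "file systems" of the two machines.
machine_a = {"a.txt": b"alpha", "b.txt": b"beta v2", "c.txt": b"gamma"}
machine_b = {"a.txt": b"alpha", "b.txt": b"beta v1"}

new_file_list = build_new_file_list(build_file_list(machine_a), machine_b)
for name in new_file_list:
    machine_b[name] = machine_a[name]  # machine A sends only these files
```

Here `a.txt` is never sent because its digest already matches on machine B.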

 

4) Further optimization of storage and transmission

File-level Rsync combined with Rsync's block-by-block comparison and transfer of individual files gives efficient file transfer.

If the server also keeps a database that indexes all files by MD5 and uses hard links, it can deduplicate its storage and keep only one copy of each distinct file.

With such single-copy storage (an MD5 database of all files), only the files that the Rsync server does not yet have are transferred during synchronization; if the Rsync server already has a file, it uses its existing copy directly.
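
One way to picture the MD5 index plus hard links is the sketch below. It is an assumption-laden illustration rather than a feature of rsync itself: the index is a plain dictionary instead of a database, the `store` helper and the file names are hypothetical, and the data is passed in directly, whereas a real server would compare digests before any content is sent.

```python
import hashlib
import os
import tempfile


def store(root, index, name, data):
    """Save `data` as root/name; return True only if new bytes were written."""
    digest = hashlib.md5(data).hexdigest()
    path = os.path.join(root, name)
    if digest in index:
        os.link(index[digest], path)  # hard link to the copy already stored
        return False
    with open(path, "wb") as f:
        f.write(data)
    index[digest] = path
    return True


root = tempfile.mkdtemp()
index = {}  # MD5 digest -> path of the single stored copy
first = store(root, index, "client1_report.txt", b"same content")
second = store(root, index, "client2_report.txt", b"same content")
same_inode = os.path.samefile(os.path.join(root, "client1_report.txt"),
                              os.path.join(root, "client2_report.txt"))
```

Both names exist on the server afterwards, but they point at the same data on disk, so the second client's file costs neither transfer nor extra storage.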

 

5) rsync has six different working modes

 

  1. Copy local files. This mode is activated when neither the SRC nor the DST path contains a single colon ":" separator.
  2. Use a remote shell program (such as rsh or ssh) to copy the contents of the local machine to the remote machine. This mode is activated when the DST path contains a single colon ":" separator.
  3. Use a remote shell program (such as rsh or ssh) to copy the contents of the remote machine to the local machine. This mode is activated when the SRC path contains a single colon ":" separator.
  4. Copy files from the remote rsync server to the local machine. This mode is activated when the SRC path contains the "::" separator.
  5. Copy files from the local machine to the remote rsync server. This mode is activated when the DST path contains the "::" separator.
  6. List the files on the remote machine. This is similar to an rsync transfer, except that the local machine's path is omitted from the command.
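
The colon rules in the list above can be expressed as a small classifier. This is a simplified sketch: `remote_kind` and `working_mode` are hypothetical helpers, not part of rsync, and the sketch only looks at ":" and "::" as described here, ignoring other path forms.

```python
def remote_kind(path):
    """Return "daemon" for host::module, "shell" for host:path, else None."""
    if "::" in path:
        return "daemon"
    if ":" in path:
        return "shell"
    return None


# (kind of SRC, kind of DST) -> working mode, numbered as in the list above
MODES = {
    (None, None): "1: local copy",
    (None, "shell"): "2: push via remote shell",
    ("shell", None): "3: pull via remote shell",
    ("daemon", None): "4: pull from rsync server",
    (None, "daemon"): "5: push to rsync server",
}


def working_mode(src, dst=None):
    """Name the working mode implied by the SRC and DST arguments."""
    if dst is None:
        return "6: list files"
    key = (remote_kind(src), remote_kind(dst))
    if key not in MODES:
        raise ValueError("SRC and DST cannot both be remote")
    return MODES[key]


print(working_mode("/data/", "/backup/"))
print(working_mode("/data/", "user@host:/backup/"))
print(working_mode("host::module/path", "/restore/"))
print(working_mode("host::module"))
```

The host and module names in the calls are placeholders; any strings with the same separators would be classified the same way.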

Origin blog.csdn.net/JineD/article/details/111871170