Large file transfer scheme (scp and file cutting)


First of all, we need to clarify what counts as a large file. The usual threshold is 4 GB, and the reason is that many USB flash drives are partitioned in FAT32 format, which only supports files smaller than 4 GB. Many people have failed to create a bootable Windows 7 or Windows 10 USB drive for exactly this reason: the system installation ISO is larger than 4 GB, so it cannot be written to a FAT32-formatted USB drive.

Secondly, three factors determine whether a file transfer succeeds: memory, network bandwidth (which can loosely be understood as network speed), and CPU performance. File transfer is generally point to point.

The above is only the big picture; actual practice is more complicated. It usually divides into two cases: file transfer over a network, and file transfer between different disks within one operating system.

File transfer over a network:

File transfer over a network uses the UDP or TCP protocol to read and copy file data.

For example, the most direct method: after a network request locates the file's path on disk, if the file is relatively large, say 320 MB, you can allocate a 32 KB buffer in memory and divide the file into 10,000 pieces, each 32 KB. Read 32 KB from the file into the buffer, send those 32 KB to the client through the network API, and repeat 10,000 times until the complete file has been sent.

Then, the detailed operations experienced in this process are as follows:

Context switch:
First, there are at least 40,000 context switches between user mode and kernel mode, because processing each 32 KB chunk requires one read call and one write call. Every system call first switches from user mode to kernel mode, then switches back to user mode after the kernel finishes its work. So each 32 KB chunk costs 4 context switches, and after 10,000 repetitions that is 40,000 switches.
The cost of context switching is not small. Although a single switch consumes only tens of nanoseconds to a few microseconds, high-concurrency services amplify this kind of overhead. (CPU performance may become a bottleneck.)

Memory copy:
Secondly, the program performs 40,000 memory copies: each 32 KB chunk is copied four times on its way from the disk to the network card, so the bytes moved for a 320 MB file are quadrupled to 1,280 MB. Obviously, excessive memory copying uselessly consumes CPU resources and reduces the system's concurrent processing capability. (Memory size may become a bottleneck.)

We can conclude that file transfer over a network stresses the memory and CPU of the servers on both ends, and also their network cards. (If you are curious, watch the network card LED during a transfer; it flickers furiously.)
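The chunked read-and-send loop described above can be mimicked locally with dd, whose bs option sets the buffer size. A minimal sketch, assuming a hypothetical 320 KB test file (scaled down from 320 MB so it runs instantly) named bigfile.bin:

```shell
# Create a 320 KB stand-in file (hypothetical name bigfile.bin).
head -c 327680 /dev/urandom > bigfile.bin

# Copy it in 32 KB chunks: dd performs one read() and one write()
# per block, i.e. 10 read/write pairs for 10 blocks, the same loop
# structure as the buffered transfer described above.
dd if=bigfile.bin of=copy.bin bs=32K

# Confirm the chunked copy is byte-identical to the original.
cmp bigfile.bin copy.bin && echo "copies match"
```

dd reports "10+0 records in / 10+0 records out", which directly shows the number of read/write pairs, the same counting that gives 10,000 iterations for 320 MB.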

 

File transfer between different disks or partitions in the operating system:

"Different disks" here usually means, for example, the disk holding the Windows installation versus an independent filesystem such as a USB flash drive. The difference from network transfer is that no network protocol is involved; the copying is handled entirely by the kernel's file scheduling. With no protocol in the way, network bandwidth is not a deciding factor; CPU and memory performance are.

1. The principle of copying is that a file's data is read into a buffer in memory (what we commonly call the clipboard). The buffer's capacity is limited, so large files are copied in batches, although we never notice. Meanwhile, the data in the buffer is continuously written to the target location until the copy completes. Cutting adds one extra step: after the copy finishes, the source file is deleted, that is, it is marked as deleted in the file allocation table.
2. Speed depends on the number of files. Cutting is generally a little slower than copying, by exactly the time it takes to delete the source files; the more source files there are, the larger the gap. For ordinary deletion, the speed depends only on the number of files, not on their sizes.

The above applies across partitions. Within the same partition it works differently:

3. Because source and target are in the same partition, cutting simply updates the file allocation table to link the existing data to another file name, so it is extremely fast. Copying within the same partition, however, is no faster than copying across partitions, because the data still has to be duplicated.
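The same-partition case can be observed directly: mv within one filesystem only rewrites metadata, so the file keeps its inode (and therefore its data blocks), while cp duplicates the data into a new inode. A small sketch, assuming a Linux filesystem such as ext4 and hypothetical file names:

```shell
# Create a small test file (hypothetical name source.txt).
echo "hello" > source.txt
before=$(stat -c %i source.txt)   # inode number before the move

# A same-partition "cut" is just a rename: a metadata change only.
mv source.txt moved.txt
after=$(stat -c %i moved.txt)
[ "$before" -eq "$after" ] && echo "mv kept the same inode"

# A copy allocates a new inode and duplicates the data blocks.
cp moved.txt copied.txt
[ "$(stat -c %i copied.txt)" -ne "$after" ] && echo "cp created a new inode"
```

This is why cutting a multi-gigabyte file within one partition finishes instantly, while copying it takes as long as a cross-partition copy.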

The above touches on the concept of a file stream. If you have programming experience, I think it will be easy to grasp. In other words, there are two concepts at play: the file stream and the memory cache. This is why, on an ancient system like Windows ME, we suffered the pain of blue screens: the old operating system read the entire file into the memory cache without intelligently splitting it, so it frequently exhausted system memory. Fortunately, things keep improving, and operating systems are constantly getting better.

In fact, every kind of file transfer is a divide-and-conquer scheme: the file is split into suitably sized pieces on the source side, and then reassembled into a complete file on the target side.

One more thing: file transfer is a very complicated process, closely tied to the CPU, the CPU's L1 and L2 caches, memory, memory buffers, the file system (the differing characteristics of xfs, ext3, ext4, fat, fat32, ntfs, hdfs, and so on), network card throughput, switch performance, and other factors. A thorough treatment would fill a book. Just remember: wherever there is transfer, the idea of divide and conquer is never far away.




The above is fairly abstract conceptual material, so let's get practical. The first question: how do we achieve efficient, fast, and secure file transfer?

There are several techniques. First, merge small files into one whole file, that is, package them, compressing appropriately while packaging. Second, use the TCP protocol for network transmission. Third, break large files into small ones, that is, cut them into pieces, reducing the size of each piece and making effective use of network bandwidth. Simply put: compression and cutting.
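The "compress, then cut" idea can be expressed as a single pipeline with tar, gzip, and split. A sketch under assumed names (a hypothetical ./data directory and a 50 KB piece size):

```shell
# Prepare a hypothetical directory with ~200 KB of sample data.
mkdir -p data && head -c 204800 /dev/urandom > data/sample.bin

# Package and compress the directory in one stream, cutting the
# stream into 50 KB numbered pieces (archive.tgz.00, .01, ...).
tar czf - data | split -b 50k -d - archive.tgz.

# On the target side: merge the pieces and list the archive's
# contents to verify the reassembled stream is a valid tar archive.
cat archive.tgz.* | tar tzf -
```

Because split reads from standard input here ("-"), the full compressed archive never needs to exist on disk as a single file before cutting.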

Regarding the choice of compression tool: Windows generally uses WinRAR, while Linux offers many options, such as the commonly used tar, bzip2, and gzip. For a recommended tool with an ultra-high compression ratio, see the blog: https://blog.csdn.net/alwaysbefine/article/details/110424989

Regarding tools for cutting large files: on Windows the hjsplit utility is recommended, and on Linux, rar. The download link is: https://pan.baidu.com/s/1kIj9diSEorAr0pT70434ag , extraction code: rara (it includes hjsplit and rar for Linux).

The most commonly used tool for cutting large files on Linux is split, which ships with the system and is very easy to use.




After cutting the files on the source end and transmitting them to the target end, they of course need to be merged. Here are a few examples:

Windows source -> Linux target: after cutting on the Windows side with hjsplit, use the cat command on Linux to merge the pieces via output redirection.

The example file is 14 KB; cut into 2 KB pieces, it becomes 7 files.

After cutting, upload the pieces to the Linux server and run cat Oracle* > oracle.sql, and the complete file is back!
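That Windows-to-Linux workflow can be re-enacted entirely on Linux, using split as a stand-in for hjsplit. A sketch with hypothetical names (oracle.sql and Oracle.* pieces) matching the sizes mentioned above:

```shell
# A 14 KB stand-in for the original file.
head -c 14336 /dev/urandom > oracle.sql

# Cut into 2 KB pieces: 14 KB / 2 KB = 7 files (Oracle.00 .. Oracle.06).
split -b 2k -d oracle.sql Oracle.

# Merge the pieces back (shell globs expand in sorted order, so the
# pieces are concatenated in the right sequence) and verify.
cat Oracle.* > merged.sql
cmp oracle.sql merged.sql && echo "the complete file is back"
```

The only difference from the hjsplit case is the piece-naming convention; the cat-and-redirect merge step is identical.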

Cutting files on a Linux system:

On the Linux server, I have a file containing the help text for mailx, 140 KB in size. Suppose we plan to cut it into 50 KB pieces, which gives three files. How do we cut it?

[root@centos7 ~]# split -b 50k -d --verbose mail.man mail.txt
creating file ‘mail.txt00’
creating file ‘mail.txt01’
creating file ‘mail.txt02’

In the command above, -b cuts by file size (50 KB per piece), -d numbers the output files with incrementing numeric suffixes, and --verbose prints the details: the file being cut and the names of the resulting pieces.

[root@centos7 ~]# ll
total 296
-rw-------. 1 root root   1587 Jan 23 22:17 anaconda-ks.cfg
-rw-------  1 root root    631 Feb 16 09:27 dead.letter
-rw-r--r--  1 root root   1635 Jan 23 23:05 initial-setup-ks.cfg
-rw-r--r--  1 root root 143091 Feb 16 10:22 mail.man
-rw-r--r--  1 root root  51200 Feb 21 00:37 mail.txt00
-rw-r--r--  1 root root  51200 Feb 21 00:37 mail.txt01
-rw-r--r--  1 root root  40691 Feb 21 00:37 mail.txt02

As you can see, files 00 and 01 are 50 KB each, and the remainder of the original file went into file 02. Now, suppose the original mail.man is deleted: how do we merge? (The original file has 2,199 lines. After deleting it and merging the previously cut pieces, will we still have 2,199 lines?)

[root@centos7 ~]# cat mail.man |wc -l
2199
[root@centos7 ~]# cat mail.txt0* > mail.bak
[root@centos7 ~]# cat mail.bak |wc -l
2199

After merging, the result is named mail.bak, and its content and line count match the original file.
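Comparing line counts is a quick sanity check; a checksum verifies the merge byte for byte. A small sketch that rebuilds the split-and-merge demo, using a generated stand-in for mail.man:

```shell
# Generate a 143091-byte stand-in, matching the size shown by ll above.
head -c 143091 /dev/urandom > mail.man

# Cut and merge exactly as in the demo.
split -b 50k -d mail.man mail.txt
cat mail.txt0* > mail.bak

# Identical md5 hashes mean identical bytes, a stricter check than wc -l.
md5sum mail.man mail.bak
cmp mail.man mail.bak && echo "merge verified"
```

The same md5sum check is also worth running after a network transfer, to catch corruption in transit.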

Linux is so powerful!




If the operating system holding the source file or folder is Linux, and the target operating system is also Linux, then the convenient and quick scp command comes into play. You only need the target machine's IP address and the port its ssh service listens on (usually the default; if it is not the default port, just specify the target machine's port).

If ssh uses the default port, omit the -P parameter; otherwise add -P with the port number. If the source is a folder, add the -r parameter to transfer files recursively.

Example: to copy machine A's /etc directory to remote machine B's /mnt directory, run this command on machine A:

scp -r /etc/ <IP of machine B>:/mnt/ . If machine B's ssh listens on port 11111 instead, run this on machine A:

scp -r -P 11111 /etc/ <IP of machine B>:/mnt/ . After the command runs, you will be asked for the root password of the target machine, machine B (root login is used by default here, because root has the broadest permissions and avoids the permission problems that make scp fail. To log in as an ordinary user of machine B instead, the command is: scp -r -P 11111 /etc/ <username>@<IP of machine B>:/mnt/).


Origin blog.csdn.net/alwaysbefine/article/details/113915285