Implementation techniques for uploading and downloading large files (above 100 MB) in SpringBoot

1  Background

The user has a local txt or csv file, whether exported from a business database or obtained some other way. To process, mine, and co-create applications on that data with Ant's big-data analysis tools, the local file must first be uploaded to ODPS. An ordinary small file can be uploaded to the server through the browser and relayed with a single transfer layer, but when the file grows to the 10 GB level we need to consider a different kind of technical solution, which is what this article explains.

The technical requirements mainly include the following aspects:

  • Scale: support very large data volumes, at the 10 GB level and above

  • Stability: 100% success except for network abnormalities

  • Accuracy: no data loss; reads and writes are 100% accurate

  • Efficiency: a 1 GB file in minutes, a 10 GB file within hours

  • Experience: real-time progress feedback, resumable upload after a network interruption, special handling of customized characters

2  File upload selection

The basic idea of uploading a file to ODPS is to first upload it to a staging area and then synchronize it to ODPS. By storage medium there are two choices: the application server's own disk, or an intermediate medium. Alibaba Cloud recommends OSS, and with its massive, secure, low-cost cloud storage and rich API support, OSS is the natural choice of intermediate medium. Uploading to OSS can in turn be done either directly from the web page or through the SDK, which gives the three upload schemes below; their detailed advantages and disadvantages are as follows:

Both the first and the second scheme were tried during the evolution of Ant's text upload feature. Their shortcomings are obvious, as the table above shows, and they do not meet the business requirements. The final scheme for uploading large files is therefore scheme 3.

3  Overall plan

The following is a schematic diagram of the overall flow of scheme 3.

 

The request steps are as follows (a minimal Spring Boot sketch of step 2 follows the list):

  1. The user requests the upload policy and callback settings from the application server.

  2. The application server returns the upload policy and callback.

  3. The user sends the file upload request directly to OSS. After the file data is uploaded, and before OSS responds to the user, OSS calls the application server according to the user's callback settings. If the application server returns success, OSS returns success to the user; if the application server returns failure, OSS returns failure as well. This guarantees that whenever the user sees a successful upload, the application server has already been notified.

  4. The application server returns its callback response to OSS.

  5. OSS returns the content returned by the application server to the user.

  6. The background synchronization engine is started to synchronize the data from OSS to ODPS.

  7. The synchronization progress is reported to the application server in real time and displayed to the user.
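For concreteness, here is one possible shape of step 2 on a Spring Boot application server: a minimal sketch, assuming the OSS Java SDK is used to sign a post policy and bundle the callback settings. The endpoint, keys, upload directory, size cap, and callback URL are placeholders, not values from this article.

```java
import com.aliyun.oss.OSS;
import com.aliyun.oss.OSSClientBuilder;
import com.aliyun.oss.common.utils.BinaryUtil;
import com.aliyun.oss.model.MatchMode;
import com.aliyun.oss.model.PolicyConditions;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

import java.nio.charset.StandardCharsets;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

@RestController
public class UploadPolicyController {

    // Placeholder endpoint and credentials; in practice these come from configuration.
    private final OSS ossClient =
            new OSSClientBuilder().build("<oss-endpoint>", "<accessKeyId>", "<accessKeySecret>");

    @GetMapping("/upload/policy")
    public Map<String, String> policy() {
        long expireSeconds = 30 * 60;                                   // policy valid for 30 minutes
        Date expiration = new Date(System.currentTimeMillis() + expireSeconds * 1000);

        // Constrain what the browser may upload: a size range and a key prefix.
        PolicyConditions conditions = new PolicyConditions();
        conditions.addConditionItem(PolicyConditions.COND_CONTENT_LENGTH_RANGE, 0, 10L * 1024 * 1024 * 1024);
        conditions.addConditionItem(MatchMode.StartWith, PolicyConditions.COND_KEY, "upload/");

        String postPolicy = ossClient.generatePostPolicy(expiration, conditions);
        String encodedPolicy = BinaryUtil.toBase64String(postPolicy.getBytes(StandardCharsets.UTF_8));
        String signature = ossClient.calculatePostSignature(postPolicy);

        // Callback settings: OSS POSTs to this URL before answering the client (steps 3 to 5).
        String callbackJson = "{\"callbackUrl\":\"https://<app-server>/oss/callback\","
                + "\"callbackBody\":\"filename=${object}&size=${size}\","
                + "\"callbackBodyType\":\"application/x-www-form-urlencoded\"}";

        Map<String, String> result = new HashMap<>();
        result.put("accessKeyId", "<accessKeyId>");
        result.put("policy", encodedPolicy);
        result.put("signature", signature);
        result.put("dir", "upload/");
        result.put("callback", BinaryUtil.toBase64String(callbackJson.getBytes(StandardCharsets.UTF_8)));
        return result;
    }
}
```

The front end then sends the returned policy, signature, and callback directly to OSS in step 3.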

4  Technical solution

4.1  Upload

OSS provides a rich set of upload SDKs: simple upload, form upload, resumable upload, and so on. For very large files the resumable upload is the recommended choice: a large file is split into parts that can be uploaded in parallel, an interrupted upload can continue from where it stopped, and the impact of the network environment is kept to a minimum.
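As an illustration of this recommendation, here is a hedged sketch of resumable upload with the OSS Java SDK's uploadFile interface; the endpoint, credentials, bucket, object key, and local path are placeholders.

```java
import com.aliyun.oss.OSS;
import com.aliyun.oss.OSSClientBuilder;
import com.aliyun.oss.model.UploadFileRequest;

public class ResumableUploadDemo {
    public static void main(String[] args) throws Throwable {
        OSS ossClient = new OSSClientBuilder().build("<oss-endpoint>", "<accessKeyId>", "<accessKeySecret>");
        try {
            UploadFileRequest request = new UploadFileRequest("<bucket>", "<object-key>");
            request.setUploadFile("/path/to/local/bigfile.csv"); // local file to upload
            request.setPartSize(1 * 1024 * 1024);                // 1 MB parts, matching the part size used in this article
            request.setTaskNum(5);                               // number of parallel upload threads
            request.setEnableCheckpoint(true);                   // resume from the last checkpoint after an interruption
            ossClient.uploadFile(request);                       // blocks until the whole file is uploaded
        } finally {
            ossClient.shutdown();
        }
    }
}
```

With the checkpoint enabled, the SDK records upload progress locally, so an interrupted upload resumes from the last completed part instead of starting over.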

4.2  Download

OSS likewise offers several ways to download a file: simple download, streaming download, resumable download, range download, and so on. If the goal were simply to save the file locally, resumable download would again be recommended. Our requirement, however, is not local storage but reading the file in order to synchronize its data from OSS to ODPS, so there is no intermediate storage: the data is read and written directly. On the OSS side we use streaming reads, on the ODPS side the Tunnel upload, and a multi-threaded read/write pattern raises the synchronization rate.
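The following is a minimal sketch of the range-read building block behind such multi-threaded streaming, using the OSS Java SDK; in the real pipeline each range could be handled by its own thread and the stream fed to the ODPS tunnel writer rather than discarded. All names are placeholders.

```java
import com.aliyun.oss.OSS;
import com.aliyun.oss.OSSClientBuilder;
import com.aliyun.oss.model.GetObjectRequest;

import java.io.InputStream;

public class RangeReadDemo {
    public static void main(String[] args) throws Exception {
        OSS ossClient = new OSSClientBuilder().build("<oss-endpoint>", "<accessKeyId>", "<accessKeySecret>");
        long fileSize = ossClient.getObjectMetadata("<bucket>", "<object-key>").getContentLength();
        long chunk = 8L * 1024 * 1024;                            // read 8 MB per request
        long total = 0;
        for (long start = 0; start < fileSize; start += chunk) {
            long end = Math.min(start + chunk, fileSize) - 1;     // inclusive end of this byte range
            GetObjectRequest request = new GetObjectRequest("<bucket>", "<object-key>");
            request.setRange(start, end);
            try (InputStream in = ossClient.getObject(request).getObjectContent()) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    total += n;  // in the real flow, buf[0..n) would be fed to the ODPS tunnel writer
                }
            }
        }
        System.out.println("bytes read: " + total);
        ossClient.shutdown();
    }
}
```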

4.3  Two-stage data transfer

Moving a file from the local machine to ODPS is divided into two stages. In the first stage, the front end uploads the local file to OSS using multipart, resumable uploads. In the second stage, the back end streams the data from OSS and writes it to ODPS, as shown in the following figure:

 

Technical points involved:

4.3.1  Front end: secure upload with the JS SDK and an STS token

When the file to be uploaded is large, it can be uploaded in parts through the multipartUpload interface. The advantage of multipart upload is that one large request is split into many small requests, so if some of them fail, only the failed parts need to be re-uploaded rather than the whole file. For files larger than 100 MB, multipart upload is generally recommended, and it is recommended to create a new OSS client instance for each multipart upload.

The Alibaba Cloud multipart upload flow mainly calls three APIs (a Java sketch follows the list):

  1. InitiateMultipartUpload: initializes the multipart upload task.

  2. UploadPart: uploads a single part.

  3. CompleteMultipartUpload: completes the task once all parts have been uploaded.
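The article's front end drives these three calls through the OSS JS SDK's multipartUpload. Purely as an illustration of the same flow, here is a hedged server-side sketch using the OSS Java SDK, with placeholder names and a sequential (non-parallel) part loop.

```java
import com.aliyun.oss.OSS;
import com.aliyun.oss.OSSClientBuilder;
import com.aliyun.oss.model.*;

import java.io.File;
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.List;

public class MultipartUploadDemo {
    public static void main(String[] args) throws Exception {
        OSS ossClient = new OSSClientBuilder().build("<oss-endpoint>", "<accessKeyId>", "<accessKeySecret>");
        String bucket = "<bucket>", key = "<object-key>";
        File file = new File("/path/to/local/bigfile.csv");

        // 1. InitiateMultipartUpload: get an uploadId for this multipart task
        String uploadId = ossClient.initiateMultipartUpload(
                new InitiateMultipartUploadRequest(bucket, key)).getUploadId();

        // 2. UploadPart: upload each part (could be parallelized; a failed part is simply retried)
        long partSize = 1 * 1024 * 1024;                       // 1 MB parts, matching the article's setting
        long fileLength = file.length();
        int partCount = (int) ((fileLength + partSize - 1) / partSize);
        List<PartETag> partETags = new ArrayList<>();
        for (int i = 0; i < partCount; i++) {
            long start = i * partSize;
            try (FileInputStream in = new FileInputStream(file)) {
                in.skip(start);                                // position the stream at this part's offset
                UploadPartRequest part = new UploadPartRequest();
                part.setBucketName(bucket);
                part.setKey(key);
                part.setUploadId(uploadId);
                part.setPartNumber(i + 1);                     // part numbers start at 1
                part.setPartSize(Math.min(partSize, fileLength - start));
                part.setInputStream(in);
                partETags.add(ossClient.uploadPart(part).getPartETag());
            }
        }

        // 3. CompleteMultipartUpload: tell OSS all parts are in place
        ossClient.completeMultipartUpload(
                new CompleteMultipartUploadRequest(bucket, key, uploadId, partETags));
        ossClient.shutdown();
    }
}
```

In production the part loop would run in a thread pool and only failed parts would be retried, which is exactly the benefit described above.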

Temporary access credentials are a way of granting authorization through Alibaba Cloud Security Token Service (STS); for the implementation, refer to the STS Java SDK. The flow for temporary access credentials is as follows (a server-side sketch follows the list):

  1. The client asks the application server for authorization. The server first verifies the client's legitimacy; if the client is legitimate, the server uses its own AccessKey to request temporary credentials from STS. For details, refer to Access Control.

  2. After obtaining the temporary credentials, the server returns them to the client.

  3. The client uses the temporary credentials to send the upload request to OSS. For the detailed request structure, refer to Temporary Authorized Access. The client can cache the credentials for uploading and only requests new ones from the server once they expire.
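Here is a hedged sketch of the server side of steps 1 and 2, using the STS Java SDK's AssumeRole call; the region, AccessKey pair, role ARN, and session name are placeholders.

```java
import com.aliyuncs.DefaultAcsClient;
import com.aliyuncs.IAcsClient;
import com.aliyuncs.auth.sts.AssumeRoleRequest;
import com.aliyuncs.auth.sts.AssumeRoleResponse;
import com.aliyuncs.profile.DefaultProfile;

public class StsTokenDemo {
    public static void main(String[] args) throws Exception {
        DefaultProfile profile = DefaultProfile.getProfile("cn-hangzhou", "<accessKeyId>", "<accessKeySecret>");
        IAcsClient client = new DefaultAcsClient(profile);

        AssumeRoleRequest request = new AssumeRoleRequest();
        request.setRoleArn("acs:ram::<account-id>:role/<oss-upload-role>");
        request.setRoleSessionName("web-upload");      // any session name used to tag the credentials
        request.setDurationSeconds(3600L);             // credentials valid for one hour

        AssumeRoleResponse response = client.getAcsResponse(request);
        AssumeRoleResponse.Credentials c = response.getCredentials();
        // Return these values (plus the expiration) to the front end,
        // which passes them to the OSS JS SDK for multipart upload.
        System.out.println(c.getAccessKeyId());
        System.out.println(c.getAccessKeySecret());
        System.out.println(c.getSecurityToken());
        System.out.println(c.getExpiration());
    }
}
```

The front end uses the returned AccessKeyId, AccessKeySecret, and SecurityToken with the OSS JS SDK and requests a fresh set only after the expiration time.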

4.3.2  Back end: multi-threaded streaming read and write

OSS side: if the file to be downloaded is too large, or a single download would take too long, it can be read as a multi-threaded stream, processing part of the content at a time until the whole file has been handled.
ODPS side: the Tunnel SDK writes the OSS stream directly. A complete data write usually includes the following steps (see the sketch after the list):

  1. Divide the data into blocks.

  2. Assign a block id to each block, that is, call openRecordWriter(id).

  3. Upload the blocks with one or more threads; if a block upload fails, the entire block must be retransmitted.

  4. After all blocks are uploaded, give the list of successfully uploaded block ids to the server for verification, that is, call session.commit([1,2,3,...]).

Because of server-side limits on block management, connection timeouts, and so on, the upload logic can become quite involved. To simplify it, the SDK provides a higher-level RecordWriter: TunnelBufferedWriter.
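Below is a minimal sketch of the back-end synchronization described here: stream an object from OSS and write it to ODPS through the Tunnel SDK's buffered writer, which hides block management. It assumes a single-column STRING table, line-oriented content, and placeholder endpoints and credentials; it is a sketch under those assumptions, not the article's production code.

```java
import com.aliyun.odps.Odps;
import com.aliyun.odps.account.AliyunAccount;
import com.aliyun.odps.data.Record;
import com.aliyun.odps.data.RecordWriter;
import com.aliyun.odps.tunnel.TableTunnel;
import com.aliyun.oss.OSS;
import com.aliyun.oss.OSSClientBuilder;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class OssToOdpsSync {
    public static void main(String[] args) throws Exception {
        // OSS streaming read
        OSS ossClient = new OSSClientBuilder().build("<oss-endpoint>", "<accessKeyId>", "<accessKeySecret>");
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                ossClient.getObject("<bucket>", "<object.csv>").getObjectContent(), StandardCharsets.UTF_8));

        // ODPS tunnel write; the buffered writer manages blocks internally
        AliyunAccount account = new AliyunAccount("<accessKeyId>", "<accessKeySecret>");
        Odps odps = new Odps(account);
        odps.setEndpoint("<odps-endpoint>");
        odps.setDefaultProject("<project>");
        TableTunnel tunnel = new TableTunnel(odps);
        TableTunnel.UploadSession session = tunnel.createUploadSession("<project>", "<upload_target_table>");
        RecordWriter writer = session.openBufferedWriter();   // no explicit block ids needed

        String line;
        while ((line = reader.readLine()) != null) {
            Record record = session.newRecord();
            record.setString(0, line);                         // assumes a single STRING column
            writer.write(record);
        }
        writer.close();
        session.commit();                                      // buffered writer: commit without a block id list
        reader.close();
        ossClient.shutdown();
    }
}
```

For multi-threaded writes with explicit blocks, each thread would instead call openRecordWriter(blockId) with its own id and the main thread would commit the id list, as in the steps above.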

5  Implementation and load testing

There is too much to cover here; refer to this article I wrote: http://blog.ncmem.com/wordpress/2019/08/09/%e5%a4%a7%e6%96%87%e4%bb%b6%e4%b8%8a%e4%bc%a0%e8%a7%a3%e5%86%b3%e6%96%b9%e6%a1%88/

6  Summary

The actual test results show that the upload scheme in this article meets the technical requirements listed in Section 1:

  • Scale: data volumes above the 10 GB level are handled without strain, mainly thanks to the front-end multipart quota (at most 10,000 parts, at most 100 GB per part); the part size is currently set to 1 MB, which covers the 10 GB requirement.

  • Stability: network abnormalities are rare in practice, and under normal conditions uploads succeed 100% of the time.

  • Accuracy: no data loss was observed in testing; reads and writes are 100% accurate.

  • Efficiency: with an office network bandwidth of 1.5 M/s, a 1 GB file takes minutes and a 10 GB file takes hours; the actual speed depends on the client's current bandwidth.

  • Experience: real-time progress, resumable upload after a network interruption, and special handling of customized characters all improve the user experience.


Welcome to join the group to discuss together: 374992201

 

 
