A Look at Large File Upload Solutions

File upload is a very common feature. Depending on the business scenario, it comes in several forms: single-file upload, sliced (chunked) upload, resumable upload, and instant upload.

A small file can be uploaded quickly over a single HTTP connection, and if the upload fails, retrying costs little. Large files are different. Imagine uploading a 10 GB file in one request. If the uploader has a fast connection and the server's bandwidth is limited, this single upload can saturate the server's bandwidth and leave none for other requests. If the uploader's network is poor and the connection drops just before the upload completes, the whole file has to be uploaded again, which is maddening.

So how should large files be uploaded? This is where sliced upload comes in. It brings the following advantages:

  • Does not monopolize the server's network bandwidth: only one slice is uploaded at a time, and each slice finishes quickly.
  • Resumable upload: slices are independent of each other, so after a network interruption the upload can resume where it left off; slices that were uploaded successfully do not need to be uploaded again.
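The splitting step on the client can be sketched as follows. This is a minimal sketch under stated assumptions, not the article's implementation: the class `SlicePlanner`, the `Slice` type, and the 1-based slice index are illustrative names and choices.

```java
import java.util.ArrayList;
import java.util.List;

final class SlicePlanner {

    /** One slice of the file: 1-based index, byte offset in the file, and length in bytes. */
    static final class Slice {
        final int index;
        final long offset;
        final long size;

        Slice(int index, long offset, long size) {
            this.index = index;
            this.offset = offset;
            this.size = size;
        }
    }

    /** Splits a file of fileSize bytes into slices of at most sliceSize bytes. */
    static List<Slice> plan(long fileSize, long sliceSize) {
        if (fileSize < 0 || sliceSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and sliceSize > 0");
        }
        List<Slice> slices = new ArrayList<>();
        int index = 1;
        for (long offset = 0; offset < fileSize; offset += sliceSize) {
            // The last slice may be shorter than sliceSize.
            long size = Math.min(sliceSize, fileSize - offset);
            slices.add(new Slice(index++, offset, size));
        }
        return slices;
    }

    public static void main(String[] args) {
        // A 10 MiB file cut into 4 MiB slices gives slices of 4, 4 and 2 MiB.
        long mib = 1024L * 1024L;
        for (Slice s : plan(10 * mib, 4 * mib)) {
            System.out.println(s.index + " offset=" + s.offset + " size=" + s.size);
        }
    }
}
```

Because every slice carries its own offset and size, the client can retry or resume from any slice without touching the ones before it.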

So how is instant upload achieved? A hash algorithm such as MD5 can produce a digest of each file. Hashing files with exactly the same content using the same algorithm always yields the same value. Before starting a sliced upload, the client can use this value to check whether the server already has the same file.
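Computing such a value might look like the sketch below, which uses the JDK's `MessageDigest`. The class name `Md5Fingerprint` and the method `md5Hex` are illustrative assumptions; the input is streamed so that a large file never has to fit in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5Fingerprint {

    /** Streams the input through MD5 and returns the digest as 32 lowercase hex characters. */
    static String md5Hex(InputStream in) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md5.update(buffer, 0, read);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] a = "same content".getBytes(StandardCharsets.UTF_8);
        byte[] b = "same content".getBytes(StandardCharsets.UTF_8);
        // Identical content yields an identical value.
        String first = md5Hex(new ByteArrayInputStream(a));
        String second = md5Hex(new ByteArrayInputStream(b));
        System.out.println(first.equals(second));
    }
}
```

MD5 collisions can be constructed deliberately, so a service that accepts untrusted uploads may prefer a stronger hash such as SHA-256; the approach stays the same.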

Upload schemes

Now for the specifics. First, the whole process from starting an upload to completing it:

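[Sequence diagram: the user uploads a file through the client; the client splits the file and then loops while complete=false, uploading a slice, having the server process it, and receiving the response; when the loop ends, the client reports success to the user.]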

The important part is the closed loop in which the client uploads a slice and the server processes it. This loop can be implemented in several ways.

Single-slice upload

In single-slice upload, the next slice is uploaded only after the previous slice has completed its closed loop. The key point is that each slice is merged into the file as soon as it is uploaded, so the server only ever keeps one copy of the file's data. When the last slice has been uploaded, the whole file upload is complete. A single closed loop looks like this:

[Sequence diagram: the client uploads slice index=n; the server verifies the slice, merges it, and responds with success.]

The advantage of this approach is that the flow is simple, easy to control, and relatively easy to implement. The disadvantage is that the total upload time is not reduced, and a file split into K slices costs K merge operations.
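The merge-on-arrival step could be sketched like this. It is one possible approach, not the article's code: the class `AppendingMerger` is a hypothetical name, and the rule that a slice's offset must equal the current file length is an assumption that fits strictly sequential uploads.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class AppendingMerger {

    /**
     * Appends one slice to the target file and returns the file's new length.
     * The slice is rejected unless its offset equals the current file length,
     * which catches out-of-order and duplicate slices.
     */
    static long appendSlice(Path target, long sliceOffset, byte[] slice) {
        try (FileChannel channel = FileChannel.open(target,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            long currentLength = channel.size();
            if (sliceOffset != currentLength) {
                throw new IllegalArgumentException(
                        "expected offset " + currentLength + " but got " + sliceOffset);
            }
            ByteBuffer buffer = ByteBuffer.wrap(slice);
            long position = sliceOffset;
            // A single write may be partial, so loop until the buffer is drained.
            while (buffer.hasRemaining()) {
                position += channel.write(buffer, position);
            }
            return channel.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Each call does one merge, so K slices mean K open-write-close cycles on the target file, which is the cost described above.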

Parallel upload

In parallel upload, the client first splits the whole file and then sends slices concurrently, limiting the number of in-flight requests to what the server can handle, until all slices have been uploaded. The closed loop for parallel upload looks like this:

[Sequence diagram: the client uploads slices index=1, index=2, ..., index=n concurrently and the server acknowledges each one; the client then sends "upload complete", and the server verifies the slices, merges them into the file, and responds with success.]

The advantage of parallel upload is speed: for large files it can save a lot of upload time. Its disadvantages are just as clear: the processing logic is more complex than single-slice upload, and the server has to store and track many slices and merge them all at the end.
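On the client side, bounded concurrency can be sketched with a fixed-size thread pool. Everything named here is an assumption for illustration: `ParallelUploader` is a hypothetical class, and `uploadSlice` stands in for the real HTTP call, returning true when the server acknowledges the slice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntPredicate;

final class ParallelUploader {

    /**
     * Uploads slices 1..sliceCount with at most maxConcurrency uploads in flight.
     * Returns the indexes of the slices that failed, so only those need retrying.
     */
    static List<Integer> uploadAll(int sliceCount, int maxConcurrency, IntPredicate uploadSlice) {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrency);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (int i = 1; i <= sliceCount; i++) {
                final int index = i;
                results.add(pool.submit(() -> uploadSlice.test(index)));
            }
            List<Integer> failed = new ArrayList<>();
            for (int i = 0; i < results.size(); i++) {
                try {
                    if (!results.get(i).get()) {
                        failed.add(i + 1);
                    }
                } catch (ExecutionException e) {
                    failed.add(i + 1); // the upload threw: treat the slice as failed
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while uploading", e);
                }
            }
            return failed;
        } finally {
            pool.shutdown();
        }
    }
}
```

Only after `uploadAll` returns an empty list would the client send the "upload complete" request that triggers the server-side merge.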

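Single-slice upload with final merge

Single-slice upload with final merge is a compromise between the two schemes above: slices are uploaded one at a time, and the merge happens when the last slice is uploaded. This removes the cost of K merge operations in single-slice upload, while keeping control over slices relatively simple. The closed loop looks like this: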

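[Sequence diagram: the client uploads slice index=n, and the server verifies the slice and responds with success; after the last slice the client sends "upload complete", and the server merges the slices into the file and responds with success.]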
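The final merge, which applies to both parallel upload and this scheme, can be sketched as below. To stay self-contained the sketch keeps slices in memory as byte arrays; a real server would more likely stream each stored slice file into the target. `FinalMerger` and its method are hypothetical names.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Map;

final class FinalMerger {

    /**
     * Writes slices 1..sliceCount to out in index order and returns the bytes written.
     * Fails before writing anything if a slice is missing.
     */
    static long merge(Map<Integer, byte[]> slicesByIndex, int sliceCount, OutputStream out) {
        for (int index = 1; index <= sliceCount; index++) {
            if (!slicesByIndex.containsKey(index)) {
                throw new IllegalStateException("missing slice " + index);
            }
        }
        long written = 0;
        try {
            for (int index = 1; index <= sliceCount; index++) {
                byte[] slice = slicesByIndex.get(index);
                out.write(slice);
                written += slice.length;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return written;
    }
}
```

Checking completeness before the first write keeps a half-merged file from being produced when a slice never arrived.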

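Choosing when to merge slices

Of the schemes above, the first merges while uploading, and the other two upload first and merge at the end, so in all of them the merge is part of the upload flow. Can the upload flow upload slices only and skip the merge? Yes, provided the business scenario allows it. For parallel upload and single-slice upload with final merge, dropping the final merge and merging only when the file is actually needed removes one time cost from the upload flow.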

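Example implementation

Using "a user uploads a file" as the scenario, this section describes a simple implementation of the single-slice upload scheme to make it easier to understand. The implementations of parallel upload and single-slice upload with final merge are not covered.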

Entities

First, define the data entities used by single-slice upload:

Here the file's hash is represented as an MD5 value.

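  • Request parameters of the single-slice upload interface
import org.springframework.web.multipart.MultipartFile; // assuming Spring's MultipartFile

public class UploadFileReq {

    private Long userId;            // user ID
    private String taskId;          // upload task ID (omitted for the first slice, required for all later slices)
    private boolean complete;       // whether the upload is complete (i.e. whether this is the last slice)

    private String fileMd5;         // MD5 value of the whole file
    private String filename;        // file name

    private MultipartFile slice;    // the slice itself
    private String sliceMd5;        // MD5 value of the slice
    private Integer sliceIndex;     // slice index (which slice this is)
    private Long sliceOffset;       // offset of the slice within the whole file
}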
  • Upload record entity
import java.util.Date;

public class UploadRecord {

    private Long userId;            // user ID
    private String taskId;          // upload task ID
    private boolean complete;       // whether the upload is complete (i.e. whether the last slice has arrived)

    private String fileMd5;         // MD5 value of the whole file
    private String filename;        // file name
    private String filePath;        // file path
    private long fileSize;          // file size

    private String sliceMd5;        // MD5 value of the slice
    private Integer sliceIndex;     // slice index (which slice this is)
    private Long sliceOffset;       // offset of the slice within the whole file
    private long sliceSize;         // slice size

    private Date createTime;        // time the upload task was created
    private Date updateTime;        // time the task was last updated
}
  • Response VO of the interface
public class UploadFileVo {

    private String taskId;          // upload task ID
    private String filePath;        // file path (returned only once the whole file has been uploaded)
}

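In the persistent object (UploadRecord), userId identifies the owner of the upload record; complete shows at a glance whether the upload has finished; the slice fields show which slice the upload has reached and its details; the file fields likewise describe the file.

With these three entities defined, you can already picture the closed loop of input, storage, and output for the interface. Next, let's look at the other core piece: slice processing.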

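Slice processing

How a slice is processed in single-slice upload was mentioned earlier. The flowchart below describes in detail how the server handles one slice: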

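[Flowchart: server-side processing of one slice]
  • Start processing the slice and verify the slice's MD5 value; continue only if the check passes.
  • Check whether this is the first upload (sliceIndex == 1).
  • First-slice processing: save the slice (locally or on a file server), persist the task record, and wrap the return information: the task's taskId.
  • Incremental slice processing: check that the task exists, that the file MD5 matches, and that the slice offset is correct; then merge the file by appending the slice and update the task record.
  • If complete == true, wrap the return information: the file path filePath.
  • End of processing.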
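The flow above might be sketched in memory as follows. This is an illustration under stated assumptions rather than the article's implementation: `SliceProcessor`, `Task`, and `Result` are hypothetical names, tasks live in a `HashMap` instead of a database, the merged file is a byte buffer, and the `/files/<md5>` path is made up.

```java
import java.io.ByteArrayOutputStream;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

final class SliceProcessor {

    /** In-memory stand-in for a persisted UploadRecord plus the merged file. */
    static final class Task {
        final String fileMd5;
        final ByteArrayOutputStream merged = new ByteArrayOutputStream();
        boolean complete;

        Task(String fileMd5) {
            this.fileMd5 = fileMd5;
        }
    }

    /** Mirrors UploadFileVo: filePath stays null until the last slice arrives. */
    static final class Result {
        final String taskId;
        final String filePath;

        Result(String taskId, String filePath) {
            this.taskId = taskId;
            this.filePath = filePath;
        }
    }

    private final Map<String, Task> tasks = new HashMap<>();

    synchronized Result process(String taskId, String fileMd5, int sliceIndex, long sliceOffset,
                                byte[] slice, String sliceMd5, boolean complete) {
        // Verify the slice's MD5 value first.
        if (!md5Hex(slice).equalsIgnoreCase(sliceMd5)) {
            throw new IllegalArgumentException("slice MD5 mismatch");
        }
        Task task;
        if (sliceIndex == 1) {
            // First slice: it must start at offset 0; create the task.
            if (sliceOffset != 0) {
                throw new IllegalArgumentException("first slice must start at offset 0");
            }
            taskId = UUID.randomUUID().toString();
            task = new Task(fileMd5);
            tasks.put(taskId, task);
        } else {
            // Later slices: the task exists, the file MD5 matches, the offset is correct.
            task = tasks.get(taskId);
            if (task == null || task.complete) {
                throw new IllegalArgumentException("unknown or finished task");
            }
            if (!task.fileMd5.equals(fileMd5)) {
                throw new IllegalArgumentException("file MD5 does not match the task");
            }
            if (sliceOffset != task.merged.size()) {
                throw new IllegalArgumentException("unexpected slice offset");
            }
        }
        // Merge by appending the slice, then update the record.
        task.merged.write(slice, 0, slice.length);
        task.complete = complete;
        // "/files/<md5>" is a made-up path scheme for illustration only.
        return new Result(taskId, complete ? "/files/" + fileMd5 : null);
    }

    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }
}
```

Rejecting a slice before any state changes means a failed request leaves the task exactly as it was, so the client can simply resend the same slice.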

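Resuming an upload

When an upload needs to be resumed, the client calls an interface to fetch the upload status of the given file for the current user. If the client's local slices still exist, it can infer which slice to resume from based on the slices already uploaded (UploadRecord.sliceIndex); if the local slices have been lost, it can re-slice and upload the part of the file not yet uploaded, based on the size already uploaded (UploadRecord.fileSize).

Of course, the client can still track the upload state and progress of a file itself, but this is not recommended; the server's data should be treated as the source of truth.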
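The re-slicing case can be sketched as a small planning function. `ResumePlanner` is a hypothetical name, and the sketch assumes the server reports the last uploaded slice index and the number of bytes it already holds (taken here to correspond to UploadRecord.sliceIndex and UploadRecord.fileSize).

```java
import java.util.ArrayList;
import java.util.List;

final class ResumePlanner {

    /**
     * Plans the slices still to be uploaded, as {sliceIndex, offset, size} triples.
     * Slicing restarts at uploadedBytes and numbering continues after lastSliceIndex.
     */
    static List<long[]> remaining(int lastSliceIndex, long uploadedBytes,
                                  long totalBytes, long sliceSize) {
        if (sliceSize <= 0 || uploadedBytes < 0 || uploadedBytes > totalBytes) {
            throw new IllegalArgumentException("inconsistent sizes");
        }
        List<long[]> plan = new ArrayList<>();
        long index = lastSliceIndex + 1L;
        for (long offset = uploadedBytes; offset < totalBytes; offset += sliceSize) {
            long size = Math.min(sliceSize, totalBytes - offset);
            plan.add(new long[] {index++, offset, size});
        }
        return plan;
    }

    public static void main(String[] args) {
        // The server holds 2 slices (8 bytes) of a 20-byte file; re-slice the rest in 5-byte slices.
        for (long[] s : remaining(2, 8, 20, 5)) {
            System.out.println("index=" + s[0] + " offset=" + s[1] + " size=" + s[2]);
        }
    }
}
```

Because the plan starts from the server-reported byte count, the new slice size does not have to match the one used before the interruption.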

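Instant upload

Instant upload can be implemented with the file's MD5 value. When uploading a file, the client asks the server whether a file with this MD5 value already exists; the server looks it up among the completed records and, if it exists, simply creates an upload record for the current user.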
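The lookup can be sketched with two in-memory maps standing in for the record table. `InstantUploadIndex` and its methods are hypothetical names chosen for this illustration.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class InstantUploadIndex {

    // fileMd5 -> path of a file whose upload has completed
    private final Map<String, String> completedPaths = new ConcurrentHashMap<>();
    // userId -> MD5 values of the files this user has an upload record for
    private final Map<Long, Set<String>> userFiles = new ConcurrentHashMap<>();

    /** Called when a normal sliced upload finishes. */
    void registerCompleted(long userId, String fileMd5, String filePath) {
        completedPaths.put(fileMd5, filePath);
        link(userId, fileMd5);
    }

    /**
     * If a completed file with this MD5 exists, records it for the user and returns its path;
     * otherwise returns empty and the client falls back to a sliced upload.
     */
    Optional<String> tryInstantUpload(long userId, String fileMd5) {
        String path = completedPaths.get(fileMd5);
        if (path == null) {
            return Optional.empty();
        }
        link(userId, fileMd5);
        return Optional.of(path);
    }

    boolean hasRecord(long userId, String fileMd5) {
        Set<String> files = userFiles.get(userId);
        return files != null && files.contains(fileMd5);
    }

    private void link(long userId, String fileMd5) {
        userFiles.computeIfAbsent(userId, id -> ConcurrentHashMap.newKeySet()).add(fileMd5);
    }
}
```

No file bytes travel in the instant path: the second user only gains a record pointing at the file that is already stored.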

Origin blog.csdn.net/qq_28851503/article/details/103866955