A web server receiving concurrent uploads of the same file from multiple users, while ensuring only one copy of the file is stored (Go implementation included)

Background

For a file server, only one copy of any given file should be stored on the server. Everything in this article follows from that principle.

This article only discusses the concurrency problem of a file server receiving the same file at the same time. This is uncommon for small-scale services, but the extreme case is worth handling.

Implementation principle

The common flow: the database records the file's basic attributes (file name, size, hash value, file path, and so on), with the hash value as the unique identifier. When a user uploads a new file, the database is queried first. If a record with the same hash value already exists (the hash is computed on the client and sent to the server; spark-md5 is the most common choice in the browser), the file is not saved again: the upload is marked successful immediately and the existing copy is reused. This is commonly known as "instant transfer".

What the flow above misses: when multiple identical files are uploaded at the same time and none of them exist in the database yet, the server will end up with multiple copies unless something is done. So when a user starts uploading a file, mark that file as locked; other users uploading the same file must check the locked status and wait until the lock is lifted before proceeding.


Code

Words alone express this poorly, so let's look at code. The server in this example uses the Go web framework gin and simply simulates concurrent uploads of the same file.

File directory structure

	- go.mod
	- go.sum
	- hash_cache.go
	- main.go
	- spark-md5.min.js
	- upload.html

js client

upload.html: after selecting a file, you can click the upload button repeatedly to test concurrency, or modify the script yourself.


<!DOCTYPE html>
<html>
<head>
    <title>File Upload</title>
    <script src="spark-md5.min.js"></script>
</head>
<body>
<h1>File Upload</h1>
<input id="file" type="file" name="file"/>
<button onclick="upload();">Upload</button>

<script>
    var file_md5 = {};

    function upload() {
        if (!file_md5.md5) {
            alert("Please select a file first");
            return;
        }

        var form = new FormData();
        form.append("md5", file_md5.md5); // the md5 field must come before the file
        form.append("file", file_md5.file);
        var xhr = new XMLHttpRequest();
        var action = "/upload"; // the upload endpoint
        xhr.open("POST", action);
        xhr.send(form); // send the form data
        xhr.onreadystatechange = function () {
            if (xhr.readyState == 4 && xhr.status == 200) {
                var resultObj = JSON.parse(xhr.responseText);
                // handle the response......
                console.log(resultObj);
            }
        };
    }

    document.getElementById('file').addEventListener('change', function (event) {
        var blobSlice = File.prototype.slice || File.prototype.mozSlice || File.prototype.webkitSlice,
            file = this.files[0],
            chunkSize = 2097152,                             // read in chunks of 2MB
            chunks = Math.ceil(file.size / chunkSize),
            currentChunk = 0,
            spark = new SparkMD5.ArrayBuffer(),
            fileReader = new FileReader();

        fileReader.onload = function (e) {
            console.log('read chunk nr', currentChunk + 1, 'of', chunks);
            spark.append(e.target.result);                   // append array buffer
            currentChunk++;

            if (currentChunk < chunks) {
                loadNext();
            } else {
                console.log('finished loading');
                var md5 = spark.end();
                console.info('computed hash', md5);          // the computed hash
                file_md5 = {file: file, md5: md5};
            }
        };

        fileReader.onerror = function () {
            console.warn('oops, something went wrong.');
        };

        function loadNext() {
            var start = currentChunk * chunkSize,
                end = ((start + chunkSize) >= file.size) ? file.size : start + chunkSize;

            fileReader.readAsArrayBuffer(blobSlice.call(file, start, end));
        }

        loadNext();
    });
</script>
</body>
</html>


go gin server

main.go starts the service on port 4780; open http://127.0.0.1:4780/client/upload.html in a browser to access the page.

To simulate a concurrency scenario, the /upload handler deliberately sleeps for 30 seconds.

package main

import (
	"crypto/md5"
	"embed"
	"encoding/hex"
	"errors"
	"fmt"
	"github.com/gin-gonic/gin"
	"io"
	"net/http"
	"os"
	"path/filepath"
	"runtime"
	"time"
)

//go:embed upload.html spark-md5.min.js
var client embed.FS

var hashCache = NewHashCache()

func main() {
	engine := gin.New()
	engine.StaticFS("/client", http.FS(client))
	engine.POST("/upload", doUpload)

	engine.Run(":4780")
}

func doUpload(c *gin.Context) {
	printMem("start")
	clientMd5 := c.PostForm("md5")

	// Check whether another request is uploading the same file.
	// If so, wait for it to finish and decide based on its result.
	if hashCache.Has(clientMd5) {
		info, er := hashCache.Wait(clientMd5)
		if er != nil {
			c.String(http.StatusInternalServerError, er.Error())
			return
		}
		if info.Err == nil {
			c.String(http.StatusOK, "upload succeeded: "+info.SavedPath)
			return
		}
		// The other upload failed, so fall through and receive this one.
	}

	hashCache.Set(clientMd5)

	// Sleep a while to simulate concurrency.
	time.Sleep(time.Second * 30)

	savedPath, err := doSaveFile(c, clientMd5)
	if err != nil {
		hashCache.SetDone(clientMd5, "", err)
		c.String(http.StatusInternalServerError, err.Error())
		return
	}
	hashCache.SetDone(clientMd5, savedPath, nil)

	c.String(http.StatusOK, "upload succeeded: "+savedPath)
}

func doSaveFile(c *gin.Context, clientMd5 string) (savedPath string, err error) {
	fh, err := c.FormFile("file")
	if err != nil {
		return
	}

	fn := fmt.Sprintf("%s_%d", fh.Filename, time.Now().UnixMilli())
	savedPath = filepath.Join("uploaded", fn)
	err = c.SaveUploadedFile(fh, savedPath)
	if err != nil {
		return
	}

	md5Str, err := getFileMd5(savedPath)
	if err != nil {
		return
	}

	if clientMd5 != md5Str {
		os.Remove(savedPath)
		err = errors.New("hash mismatch")
		return
	}

	return
}

func getFileMd5(p string) (md5Str string, err error) {
	f, err := os.Open(p)
	if err != nil {
		return
	}
	defer f.Close()

	h := md5.New()
	_, err = io.Copy(h, f)
	if err != nil {
		return
	}
	md5Str = hex.EncodeToString(h.Sum(nil))
	return
}

func printMem(prefix string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("%s: %d Kb\n", prefix, m.Alloc/1024)
}

hash_cache.go maintains a map used to decide whether the same file is currently being uploaded.

package main

import (
	"errors"
	"sync"
)

type HashCache struct {
	mutex sync.RWMutex
	m     map[string]*HashCacheInfo
}

func NewHashCache() *HashCache {
	return &HashCache{
		m: make(map[string]*HashCacheInfo),
	}
}

type HashCacheInfo struct {
	Done      chan struct{}
	SavedPath string
	Err       error
}

func (hc *HashCache) Set(md5Hash string) {
	hc.mutex.Lock()
	defer hc.mutex.Unlock()
	hc.m[md5Hash] = &HashCacheInfo{
		Done: make(chan struct{}),
	}
}

func (hc *HashCache) SetDone(md5Hash, savedPath string, err error) error {
	hc.mutex.Lock()
	defer hc.mutex.Unlock()
	data, ok := hc.m[md5Hash]
	if !ok {
		return errors.New("no hash: " + md5Hash)
	}

	data.SavedPath = savedPath
	data.Err = err
	close(data.Done)

	delete(hc.m, md5Hash)

	// data must not be freed here; Wait still needs it.
	// The garbage collector will reclaim it eventually.

	return nil
}

func (hc *HashCache) Has(md5Hash string) bool {
	hc.mutex.RLock()
	defer hc.mutex.RUnlock()
	_, has := hc.m[md5Hash]
	return has
}

func (hc *HashCache) Wait(md5Hash string) (info HashCacheInfo, err error) {
	hc.mutex.RLock()
	data, ok := hc.m[md5Hash]
	if !ok {
		hc.mutex.RUnlock()
		err = errors.New("no hash: " + md5Hash)
		return
	}
	hc.mutex.RUnlock()
	<-data.Done

	info = *data
	return
}

The server log output shows that memory usage increases with every submitted request.

In this service, if a large number of identical files are uploaded at the same time, memory usage soars (when c.PostForm parses the form data, the data is read into memory). To avoid this, read the multipart data yourself, as follows:

The change is the new doUpload1 handler (full file below):

package main

import (
	"crypto/md5"
	"embed"
	"encoding/hex"
	"errors"
	"fmt"
	"github.com/gin-gonic/gin"
	"io"
	"mime/multipart"
	"net/http"
	"os"
	"path/filepath"
	"runtime"
	"time"
)

//go:embed upload.html spark-md5.min.js
var client embed.FS

var hashCache = NewHashCache()

func main() {
	engine := gin.New()
	engine.StaticFS("/client", http.FS(client))
	engine.POST("/upload", doUpload1)

	engine.Run(":4780")
}

func doUpload(c *gin.Context) {
	printMem("start")
	clientMd5 := c.PostForm("md5")

	// Check whether another request is uploading the same file.
	// If so, wait for it to finish and decide based on its result.
	if hashCache.Has(clientMd5) {
		info, er := hashCache.Wait(clientMd5)
		if er != nil {
			c.String(http.StatusInternalServerError, er.Error())
			return
		}
		if info.Err == nil {
			c.String(http.StatusOK, "upload succeeded: "+info.SavedPath)
			return
		}
		// The other upload failed, so fall through and receive this one.
	}

	hashCache.Set(clientMd5)

	// Sleep a while to simulate concurrency.
	//time.Sleep(time.Second * 30)

	savedPath, err := doSaveFile(c, clientMd5)
	if err != nil {
		hashCache.SetDone(clientMd5, "", err)
		c.String(http.StatusInternalServerError, err.Error())
		return
	}
	hashCache.SetDone(clientMd5, savedPath, nil)

	c.String(http.StatusOK, "upload succeeded: "+savedPath)
}

func doSaveFile(c *gin.Context, clientMd5 string) (savedPath string, err error) {
	fh, err := c.FormFile("file")
	if err != nil {
		return
	}

	fn := fmt.Sprintf("%s_%d", fh.Filename, time.Now().UnixMilli())
	savedPath = filepath.Join("uploaded", fn)
	err = c.SaveUploadedFile(fh, savedPath)
	if err != nil {
		return
	}

	md5Str, err := getFileMd5(savedPath)
	if err != nil {
		return
	}

	if clientMd5 != md5Str {
		os.Remove(savedPath)
		err = errors.New("hash mismatch")
		return
	}

	return
}

func getFileMd5(p string) (md5Str string, err error) {
	f, err := os.Open(p)
	if err != nil {
		return
	}
	defer f.Close()

	h := md5.New()
	_, err = io.Copy(h, f)
	if err != nil {
		return
	}
	md5Str = hex.EncodeToString(h.Sum(nil))
	return
}

func printMem(prefix string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("%s: %d Kb\n", prefix, m.Alloc/1024)
}

func doUpload1(c *gin.Context) {
	printMem("start")
	reader, err := c.Request.MultipartReader()
	if err != nil {
		c.String(http.StatusBadRequest, err.Error())
		return
	}
	clientMd5, err := readMd5(reader) // read the md5 part
	if err != nil {
		c.String(http.StatusBadRequest, err.Error())
		return
	}

	// Check whether another request is uploading the same file.
	// If so, wait for it to finish and decide based on its result.
	if hashCache.Has(clientMd5) {
		info, er := hashCache.Wait(clientMd5)
		if er != nil {
			c.String(http.StatusInternalServerError, er.Error())
			return
		}
		if info.Err == nil {
			er = closeReaderParts(reader)
			if er != nil {
				c.String(http.StatusInternalServerError, er.Error())
			} else {
				c.String(http.StatusOK, "upload succeeded: "+info.SavedPath)
			}
			return
		}
	}

	hashCache.Set(clientMd5)

	// Sleep a while to simulate concurrency.
	time.Sleep(time.Second * 30)

	savedPath, err := saveFilePart(reader, clientMd5)
	hashCache.SetDone(clientMd5, savedPath, err)
	if err != nil {
		c.String(http.StatusInternalServerError, err.Error())
		return
	}

	c.String(http.StatusOK, "upload succeeded: "+savedPath)
}

func readMd5(reader *multipart.Reader) (md5Hash string, err error) {
	part, err := reader.NextPart() // the md5 field must be the first part
	if err != nil {
		return
	}
	name := part.FormName()
	if name != "md5" {
		err = errors.New("first key is not md5")
		return
	}
	buf, err := io.ReadAll(part)
	if err != nil {
		return
	}
	md5Hash = string(buf)
	return
}

func closeReaderParts(reader *multipart.Reader) (err error) {
	// Drain the remaining parts so the whole request body is consumed
	// before responding.
	for {
		p, er := reader.NextPart()
		if er == io.EOF {
			break
		}
		if er != nil {
			err = er
			return
		}
		p.Close()
	}
	return
}

func saveFilePart(reader *multipart.Reader, clientMd5 string) (fp string, err error) {
	part, err := reader.NextPart() // read the file part
	if err != nil {
		return
	}
	name := part.FormName()
	if name != "file" {
		err = errors.New("key is not file")
		return
	}
	fn := fmt.Sprintf("%s_%d", part.FileName(), time.Now().UnixMilli())
	fp = filepath.Join("uploaded", fn)
	f, err := os.Create(fp)
	if err != nil {
		return
	}
	defer f.Close()
	_, err = io.Copy(f, part)
	if err != nil {
		return
	}

	md5Str, err := getFileMd5(fp)
	if err != nil {
		return
	}

	if clientMd5 != md5Str {
		os.Remove(fp)
		err = errors.New("hash mismatch")
		return
	}

	return
}

The server log output shows that memory consumption is much lower than before.

Two things to note here. First, when the client appends data to the FormData, the md5 field must be appended before the file, otherwise the server-side logic fails. Second, if the server returns a response without finishing reading the request body (that is what closeReaderParts is for), the client-side JS request will also error out. So far I have only seen this with relatively large uploads, and a Go client in my tests does not hit it, so it is probably down to the browser implementation; comments from anyone who knows are welcome.

Alternatively, the md5 can be placed in the request URL (http://127.0.0.1:4780/upload?md5=xxx) and matched there. The same caveat as above applies: if the request body is not fully read, the client will report an error.

Summary

This article presents one approach, showing how to implement concurrent-upload control on both the client and the server. The sample code ensures that only one copy of a file exists on the server during concurrent uploads.

In a real production environment, this code would need further optimization and hardening to meet performance, security, and reliability requirements.


Origin blog.csdn.net/DisMisPres/article/details/131551631