Go in practice: sparse tar file extraction, a 4x improvement, and several techniques!


Some time ago I wrote a small tool in Go for extracting tar files. It is much faster than the tar that ships with Linux (a 4-5x difference), and it also supports sparse files. For background on sparse files, see my earlier article, Golang and Sparse Files.

Fair warning: this gets a bit dense, but there is plenty to learn:

  1. The tar standard package does not support sparse files; how to add support yourself

  2. How to call a private method

  3. How to improve decompression performance

Go tar decompression

First, let's walk through a simple tar extraction example and analyze its implementation step by step, to see why Go's tar package does not support sparse files.

Take demo.tar in the current folder as an example and extract it to the current directory. The Go code looks like this:

func main() {
   unTarDir := "." // extract to the current directory
   tarFile, err := os.Open("demo.tar")
   if err != nil {
      log.Fatalln(err)
   }
   defer tarFile.Close()
   tr := tar.NewReader(tarFile)
   for {
      hdr, err := tr.Next()
      if err == io.EOF {
         break // end of archive
      }
      if err != nil {
         log.Fatalln(err)
      }
      if hdr.Typeflag == tar.TypeDir {
         // create the directory to extract into
      } else if hdr.Typeflag == tar.TypeReg || hdr.Typeflag == tar.TypeGNUSparse {
         tarFile := path.Join(unTarDir, hdr.Name)
         file, err := os.OpenFile(tarFile, os.O_RDWR|os.O_CREATE|os.O_TRUNC, os.FileMode(hdr.Mode))
         if err != nil {
            log.Fatalln(err)
         }
         err = file.Truncate(hdr.Size)
         if err != nil {
            log.Fatalln(err)
         }
         if _, err = io.Copy(file, tr); err != nil {
            log.Fatalln(err)
         }
         file.Close()
      }
   }
}

The code above handles regular files and sparse files the same way: the content is written to the file with io.Copy.

func Copy(dst Writer, src Reader) (written int64, err error) {
   return copyBuffer(dst, src, nil)
}

The Copy function follows these rules:

  1. If src implements the io.WriterTo interface, its WriteTo method is used directly.

  2. Otherwise, if dst implements the io.ReaderFrom interface, its ReadFrom method is used.

  3. Failing both, Copy falls back to calling src's Read method and writing the content with dst's Write method.

Tar Reader source code analysis

Specifically for tar.Reader: since it does not implement io.WriterTo, io.Copy ends up calling its Read method.

// If the current file is sparse, then the regions marked as a hole
// are read back as NUL-bytes.
func (tr *Reader) Read(b []byte) (int, error) {
   if tr.err != nil {
      return 0, tr.err
   }
   n, err := tr.curr.Read(b)
   if err != nil && err != io.EOF {
      tr.err = err
   }
   return n, err
}

Judging from the comment, Go's tar package does consider the sparse file scenario: regions marked as holes are read back as NUL (zero) bytes.

func (sr *sparseFileReader) Read(b []byte) (n int, err error) {
   // irrelevant code omitted
   for endPos > sr.pos && err == nil {
      var nf int // Bytes read in fragment
      holeStart, holeEnd := sr.sp[0].Offset, sr.sp[0].endOffset()
      if sr.pos < holeStart { // In a data fragment
         bf := b[:min(int64(len(b)), holeStart-sr.pos)]
         nf, err = tryReadFull(sr.fr, bf)
      } else { // In a hole fragment
         bf := b[:min(int64(len(b)), holeEnd-sr.pos)]
         nf, err = tryReadFull(zeroReader{}, bf)
      }
      b = b[nf:]
      sr.pos += int64(nf)
      if sr.pos >= holeEnd && len(sr.sp) > 1 {
         sr.sp = sr.sp[1:] // Ensure last fragment always remains
      }
   }
   // irrelevant code omitted
}

Note the zeroReader in the source above: it is a filler that produces only zero bytes, yielding all-zero byte slices. In other words, the Read method "handles" sparse files by writing zero-valued data into the file.
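The zeroReader helper is tiny. A sketch of roughly what it looks like (the real one is unexported in archive/tar):

```go
package main

import "fmt"

// zeroReader: every Read fills the buffer entirely with NUL bytes
// and never returns an error or io.EOF.
type zeroReader struct{}

func (zeroReader) Read(b []byte) (int, error) {
	for i := range b {
		b[i] = 0
	}
	return len(b), nil
}

func main() {
	buf := []byte{1, 2, 3, 4}
	n, err := zeroReader{}.Read(buf)
	fmt.Println(n, err, buf)
}
```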

The tar package does not support sparse files

Unfortunately, in my actual tests this does not preserve the sparse property: the disk usage ends up the same as the file size.

At first I suspected my macOS file system, but my machine uses APFS, which does support sparse files.

So I tried ext4 and XFS as well: writing zero bytes does not produce sparse files on those either.

This approach has a further drawback: even the zeros must actually be written to the file, and every write means disk IO, which wastes performance. Here is the timing with the zero-filling approach:

➜  time go run cmd/main.go
go run cmd/main.go  1.13s user 7.19s system 54% cpu 15.317 total

The test above used a 1.43 GB tar file and took about 15 seconds.

At this point I was at an impasse. But in my earlier article, Golang and Sparse Files, I had actually verified that my machine supports sparse files. The only difference: that article used the File.Seek method, while tar extraction writes zero bytes.

As just mentioned, the poor performance comes from having to write the zeros every time. If we know a region is all zeros, couldn't we skip it with File.Seek and save the IO? That would solve the sparse file problem at the same time.

private method writeTo

Nice idea, and it's actually doable. Continuing this line of thought: to use File.Seek to skip the all-zero regions, the file variable must take the leading role, so we need a way to get its Seek method called.

_, err = io.Copy(file, tr)

This is how the data currently gets written into the file. We have already analyzed the Copy method: it first checks whether src implements io.WriterTo, and if so, prefers that path to write the data.

// If the reader has a WriteTo method, use it to do the copy.
// Avoids an allocation and a copy.
if wt, ok := src.(WriterTo); ok {
   return wt.WriteTo(dst)
}

But unfortunately, tar.Reader does not implement io.WriterTo, so this io.Copy fast path is a dead end.

It did give me an idea, though. Combing through the tar.Reader source, I found a private method, writeTo. Judging from its comment, it supports sparse files and uses Seek to skip over holes. The actual work is done by the WriteTo method of the internal sparseFileReader struct; see the comments in the code below.

// If the current file is sparse and w is an io.WriteSeeker,
// then writeTo uses Seek to skip past holes defined in Header.SparseHoles,
// assuming that skipped regions are filled with NULs.
func (tr *Reader) writeTo(w io.Writer) (int64, error) {
   if tr.err != nil {
      return 0, tr.err
   }
   n, err := tr.curr.WriteTo(w)
   if err != nil {
      tr.err = err
   }
   return n, err
}

func (sr *sparseFileReader) WriteTo(w io.Writer) (n int64, err error) {
   // key: first check whether w implements io.WriteSeeker; only then is Seek available
   ws, ok := w.(io.WriteSeeker)
   if ok {
      if _, err := ws.Seek(0, io.SeekCurrent); err != nil {
         ok = false // Not all io.Seeker can really seek
      }
   }
   // if Seek is not supported, fall back to the Copy path
   if !ok {
      return io.Copy(w, struct{ io.Reader }{sr})
   }
   var writeLastByte bool
   pos0 := sr.pos
   for sr.logicalRemaining() > 0 && !writeLastByte && err == nil {
      var nf int64 // Size of fragment
      holeStart, holeEnd := sr.sp[0].Offset, sr.sp[0].endOffset()
      if sr.pos < holeStart { // In a data fragment
         nf = holeStart - sr.pos
         nf, err = io.CopyN(ws, sr.fr, nf)
      } else { // In a hole fragment
         nf = holeEnd - sr.pos
         if sr.physicalRemaining() == 0 {
            writeLastByte = true
            nf--
         }
         // key: use Seek to skip over the hole
         _, err = ws.Seek(nf, io.SeekCurrent)
      }
      sr.pos += nf
      if sr.pos >= holeEnd && len(sr.sp) > 1 {
         sr.sp = sr.sp[1:] // Ensure last fragment always remains
      }
   }
   // final bookkeeping and return omitted
}

Surprisingly, this is exactly the method we want. But it is private; how do we call it?

Enter go:linkname

To call a private method, is reflection the first idea that comes to mind? Go has a better way: the go:linkname compiler directive, a hidden bit of Go "black magic" that the standard library itself uses.

To use a private method, first declare a function with the same signature in your own source file, with no function body!

func writeTo(tr *tar.Reader, w io.Writer) (int64, error)

Because we are declaring a plain function, the first parameter is the *tar.Reader receiver itself; written this way, the declaration mirrors tar.Reader's method signature.

Next comes go:linkname: add the directive on the line above our declared function.

//go:linkname writeTo archive/tar.(*Reader).writeTo
func writeTo(tr *tar.Reader, w io.Writer) (int64, error)

The directive takes two parts, separated by a space:

  1. The first part, writeTo, refers to the function in our current source file.

  2. The second part, archive/tar.(*Reader).writeTo, is the private method being linked to. Note that it must be written in full: the import path (archive/tar), then the receiver type ((*Reader)), and finally the private method name (writeTo). None of these pieces may be omitted or misspelled.

Because this approach is non-standard, we must also import the unsafe package, telling the compiler that we know what we are doing. (Note: starting with Go 1.23, the linker by default rejects go:linkname references to unexported standard library symbols; building with -ldflags=-checklinkname=0 restores the old behavior.)

import (
   "archive/tar"
   _ "unsafe"
)

With the preparation done, it's time to use it: in our sample code, simply replace the io.Copy call with our writeTo function.

tarFile := path.Join(unTarDir, hdr.Name)
file, err := os.OpenFile(tarFile, os.O_RDWR|os.O_CREATE|os.O_TRUNC, os.FileMode(hdr.Mode))
if err != nil {
   log.Fatalln(err)
}
err = file.Truncate(hdr.Size)
if err != nil {
   log.Fatalln(err)
}
_, err = writeTo(tr, file)
if err != nil {
   log.Fatalln(err)
}

Now let's run the test again and see how much performance improves:

➜  time go run cmd/main.go
go run cmd/main.go  0.55s user 3.65s system 86% cpu 4.844 total

The run time dropped from about 15 seconds to under 5 seconds: exactly the improvement we were after!

The more holes a sparse file has, the more the Seek-based approach helps, because all of them are skipped outright.

Now for the key point: we finally achieved sparse file extraction (verified on APFS, XFS, and ext4). Checked with the du command, the disk usage is far smaller than the file's logical size.

## disk usage: 1.3G
$ du -h spare_file
1.3G    spare_file

## logical file size: 12G
$ ls -lh spare_file
12G     spare_file

Refactor the code

OK, now let's refactor the code to distinguish regular files from sparse files and handle them differently. Regular files are still handled with io.Copy.

func writeFile(root string, hdr *tar.Header, tr *tar.Reader, sparseFile bool) {
   tarFile := path.Join(root, hdr.Name)
   file, err := os.OpenFile(tarFile, os.O_RDWR|os.O_CREATE|os.O_TRUNC, os.FileMode(hdr.Mode))
   if err != nil {
      log.Fatalln(err)
   }
   defer file.Close()
   err = file.Truncate(hdr.Size)
   if err != nil {
      log.Fatalln(err)
   }
   if sparseFile {
      _, err = writeTo(tr, file)
   } else {
      _, err = io.Copy(file, tr)
   }
   if err != nil {
      log.Fatalln(err)
   }
}

We extract a common function, writeFile: sparse files go through writeTo, regular files through io.Copy. Then use it in the extraction loop:

else if hdr.Typeflag == tar.TypeReg {
   writeFile(unTarDir, hdr, tr, false)
} else if hdr.Typeflag == tar.TypeGNUSparse {
   writeFile(unTarDir, hdr, tr, true)
}
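The plumbing can be sanity-checked end to end without the go:linkname part (which needs a real GNU-sparse archive), by building a small tar in memory and extracting it with the same Next/Typeflag loop. The roundTrip helper, file name, and contents below are illustrative:

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
	"log"
	"os"
	"path"
)

// roundTrip writes a one-file tar archive into memory, then extracts it
// with the same Next/Typeflag loop as the tool above (regular-file path
// only, so io.Copy stands in for writeTo) and returns the extracted body.
func roundTrip() (string, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	body := []byte("hello tar")
	hdr := &tar.Header{Name: "hello.txt", Mode: 0o644, Size: int64(len(body)), Typeflag: tar.TypeReg}
	if err := tw.WriteHeader(hdr); err != nil {
		return "", err
	}
	if _, err := tw.Write(body); err != nil {
		return "", err
	}
	if err := tw.Close(); err != nil {
		return "", err
	}

	dir, err := os.MkdirTemp("", "untar")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	tr := tar.NewReader(&buf)
	for {
		h, err := tr.Next()
		if err == io.EOF {
			break // end of archive
		}
		if err != nil {
			return "", err
		}
		if h.Typeflag == tar.TypeReg {
			out := path.Join(dir, h.Name)
			f, err := os.OpenFile(out, os.O_RDWR|os.O_CREATE|os.O_TRUNC, os.FileMode(h.Mode))
			if err != nil {
				return "", err
			}
			if _, err := io.Copy(f, tr); err != nil {
				f.Close()
				return "", err
			}
			f.Close()
		}
	}
	got, err := os.ReadFile(path.Join(dir, "hello.txt"))
	return string(got), err
}

func main() {
	got, err := roundTrip()
	if err != nil {
		log.Fatalln(err)
	}
	fmt.Println(got)
}
```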

Summary

That wraps up the whole exercise. If the process has one takeaway, it is: don't give up. Had I quit when the tar package first appeared not to support sparse files, none of the interesting parts would have followed.

Second, it took a deep understanding of the source code and of sparse files; only with that can you find better approaches and reach the best performance.



Origin blog.csdn.net/flysnow_org/article/details/126615709