Some time ago I wrote a small tool in Go for extracting tar files. It is 4-5x faster than the tar that ships with Linux, and it also supports sparse files. For background on sparse files, see my earlier article, Golang and Sparse Files.
Heads up: this part is a bit dry, but there is plenty to learn:

- The tar standard package does not write sparse files; how to support them yourself
- How to call a private method
- How to improve extraction performance
Extracting tar files in Go
First, let's walk through a simple tar extraction example and analyze its implementation step by step, to see why Go's tar package does not produce sparse files. Take extracting demo.tar from the current folder into the current directory as an example. The Go code looks like this:
package main

import (
	"archive/tar"
	"io"
	"log"
	"os"
	"path"
)

func main() {
	unTarDir := "." // extract into the current directory
	tarFile, err := os.Open("demo.tar")
	if err != nil {
		log.Fatalln(err)
	}
	defer tarFile.Close()
	tr := tar.NewReader(tarFile)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break // end of archive
		}
		if err != nil {
			log.Fatalln(err)
		}
		if hdr.Typeflag == tar.TypeDir {
			// create the target directory
			if err := os.MkdirAll(path.Join(unTarDir, hdr.Name), os.FileMode(hdr.Mode)); err != nil {
				log.Fatalln(err)
			}
		} else if hdr.Typeflag == tar.TypeReg || hdr.Typeflag == tar.TypeGNUSparse {
			tarFile := path.Join(unTarDir, hdr.Name)
			file, err := os.OpenFile(tarFile, os.O_RDWR|os.O_CREATE|os.O_TRUNC, os.FileMode(hdr.Mode))
			if err != nil {
				log.Fatalln(err)
			}
			err = file.Truncate(hdr.Size)
			if err != nil {
				log.Fatalln(err)
			}
			_, err = io.Copy(file, tr)
			if err != nil {
				log.Fatalln(err)
			}
			file.Close()
		}
	}
}
The code above handles regular files and sparse files the same way: it writes the content to the file with io.Copy.
func Copy(dst Writer, src Reader) (written int64, err error) {
	return copyBuffer(dst, src, nil)
}
The rules Copy follows are:

1. If src implements io.WriterTo, call src.WriteTo(dst) directly.
2. Otherwise, check whether dst implements io.ReaderFrom; if so, call dst.ReadFrom(src).
3. Otherwise, fall back to repeatedly calling src's Read method and writing the result with dst's Write method.
Tar Reader source code analysis
Back to tar.Reader specifically: since it does not implement io.WriterTo, io.Copy eventually falls through to calling its Read method.
// If the current file is sparse, then the regions marked as a hole
// are read back as NUL-bytes.
func (tr *Reader) Read(b []byte) (int, error) {
	if tr.err != nil {
		return 0, tr.err
	}
	n, err := tr.curr.Read(b)
	if err != nil && err != io.EOF {
		tr.err = err
	}
	return n, err
}
Judging from the comment, Go's tar package is aware of sparse files: hole regions are read back as NUL bytes, so the package clearly considered the sparse-file scenario. The reading itself happens in the internal sparseFileReader:
func (sr *sparseFileReader) Read(b []byte) (n int, err error) {
	// irrelevant code omitted
	for endPos > sr.pos && err == nil {
		var nf int // Bytes read in fragment
		holeStart, holeEnd := sr.sp[0].Offset, sr.sp[0].endOffset()
		if sr.pos < holeStart { // In a data fragment
			bf := b[:min(int64(len(b)), holeStart-sr.pos)]
			nf, err = tryReadFull(sr.fr, bf)
		} else { // In a hole fragment
			bf := b[:min(int64(len(b)), holeEnd-sr.pos)]
			nf, err = tryReadFull(zeroReader{}, bf)
		}
		b = b[nf:]
		sr.pos += int64(nf)
		if sr.pos >= holeEnd && len(sr.sp) > 1 {
			sr.sp = sr.sp[1:] // Ensure last fragment always remains
		}
	}
	// irrelevant code omitted
}
Note zeroReader in the source above: it is a reader that yields nothing but zero bytes, filling whatever buffer it is handed with zeros. In other words, the Read path realizes a sparse file by writing zero-valued data into the output file.
The tar package does not support sparse files
Unfortunately, my actual tests show this does not produce sparse files: the disk usage still equals the file size. At first I suspected my macOS filesystem, but APFS supports sparse files. So I tried Ext4 and XFS as well; on those too, writing zero bytes does not yield a sparse file.
This approach has another drawback: every zero byte must actually be written to the file, and writing means disk IO, which wastes performance. Here is the timing with the zero-filling approach:
➜ time go run cmd/main.go
go run cmd/main.go 1.13s user 7.19s system 54% cpu 15.317 total
That test used a 1.43 GB tar file and took about 15 seconds.
I was stuck for a moment, but in the earlier article Golang and Sparse Files I had already verified that my machine supports sparse files. The only difference is that there I used the File.Seek method, while tar extraction writes blank zero bytes. As just noted, the poor performance comes from writing every zero. If we already know a region is all zeros, could skipping it with File.Seek cut the IO? It might solve the sparse-file problem at the same time.
The private writeTo method
A nice idea, and it turns out to be doable. Continuing along this line of thought: if we want File.Seek to skip the zero regions, then the file variable has to play the lead role, and we need some way to get its Seek method called.
_, err = io.Copy(file, tr)
This is how we currently write data into the file. We have already analyzed Copy: it first checks whether src implements io.WriterTo, and if so, uses that to write the data.
// If the reader has a WriteTo method, use it to do the copy.
// Avoids an allocation and a copy.
if wt, ok := src.(WriterTo); ok {
	return wt.WriteTo(dst)
}
But unfortunately, tar.Reader does not implement io.WriterTo, so this path through io.Copy is closed.
But it gave me an idea. Going through the tar.Reader source, I found a private method, writeTo. Judging from its comment, it supports sparse files, and it uses Seek to skip hole data. The real work happens in the WriteTo method of the internal sparseFileReader struct; see the comments in the code below.
// If the current file is sparse and w is an io.WriteSeeker,
// then writeTo uses Seek to skip past holes defined in Header.SparseHoles,
// assuming that skipped regions are filled with NULs.
func (tr *Reader) writeTo(w io.Writer) (int64, error) {
	if tr.err != nil {
		return 0, tr.err
	}
	n, err := tr.curr.WriteTo(w)
	if err != nil {
		tr.err = err
	}
	return n, err
}
func (sr *sparseFileReader) WriteTo(w io.Writer) (n int64, err error) {
	// Key: first check whether w implements io.WriteSeeker;
	// only then is there a Seek method to call.
	ws, ok := w.(io.WriteSeeker)
	if ok {
		if _, err := ws.Seek(0, io.SeekCurrent); err != nil {
			ok = false // Not all io.Seeker can really seek
		}
	}
	// If Seek is not supported, fall back to the Copy path.
	if !ok {
		return io.Copy(w, struct{ io.Reader }{sr})
	}
	var writeLastByte bool
	pos0 := sr.pos
	for sr.logicalRemaining() > 0 && !writeLastByte && err == nil {
		var nf int64 // Size of fragment
		holeStart, holeEnd := sr.sp[0].Offset, sr.sp[0].endOffset()
		if sr.pos < holeStart { // In a data fragment
			nf = holeStart - sr.pos
			nf, err = io.CopyN(ws, sr.fr, nf)
		} else { // In a hole fragment
			nf = holeEnd - sr.pos
			if sr.physicalRemaining() == 0 {
				writeLastByte = true
				nf--
			}
			// Key: skip the hole with Seek
			_, err = ws.Seek(nf, io.SeekCurrent)
		}
		sr.pos += nf
		if sr.pos >= holeEnd && len(sr.sp) > 1 {
			sr.sp = sr.sp[1:] // Ensure last fragment always remains
		}
	}
	// remainder omitted
}
This is exactly the method we want. But it is private; how do we call it?
Enter go:linkname
The first idea for calling a private method might be reflection, but Go offers a better way: the go:linkname compiler directive, a hidden bit of Go black magic that the standard library itself relies on. To use a private method, first declare a function with the same signature in your own source file, with no function body!
func writeTo(tr *tar.Reader, w io.Writer) (int64, error)
Because we are declaring a plain function, its first parameter is the *tar.Reader receiver itself; that is how the declaration expresses a method of tar.Reader.
Then comes go:linkname itself: add the directive above the function we declared.
//go:linkname writeTo archive/tar.(*Reader).writeTo
func writeTo(tr *tar.Reader, w io.Writer) (int64, error)
The directive takes two parts, separated by a space:

- The first part, writeTo, is the name of the function in our current source file.
- The second part, archive/tar.(*Reader).writeTo, names the private method it links to. This part must be written in full: the complete import path, then the receiver type, then the private method name. Nothing can be omitted or misspelled.
Because this approach is non-standard, we also import the unsafe package, telling the compiler we know it is unsafe. (Note: a body-less function declaration can also make the compiler complain about a missing function body; adding an empty .s file to the package is the usual workaround.)
import (
	"archive/tar"
	_ "unsafe"
)
With the preparations done, it is time to use it: just replace io.Copy in our sample code with our writeTo function.
tarFile := path.Join(unTarDir, hdr.Name)
file, err := os.OpenFile(tarFile, os.O_RDWR|os.O_CREATE|os.O_TRUNC, os.FileMode(hdr.Mode))
if err != nil {
	log.Fatalln(err)
}
err = file.Truncate(hdr.Size)
if err != nil {
	log.Fatalln(err)
}
_, err = writeTo(tr, file)
if err != nil {
	log.Fatalln(err)
}
Now let's test it and see how much the performance has improved:
➜ time go run cmd/main.go
go run cmd/main.go 0.55s user 3.65s system 86% cpu 4.844 total
The run time drops from about 15 seconds to under 5, roughly a threefold improvement! And the Seek-based approach gets better the more holes a sparse file has, since holes are skipped entirely.
The key point: we finally achieve true sparse extraction (APFS, XFS, and Ext4 all work). Viewed with the du command, the disk usage is far smaller than the file's actual size:
## disk usage: 1.3G
$ du -h spare_file
1.3G    spare_file
## actual file size: 12G
$ ls -lh spare_file
12G     spare_file
Refactor the code
Ok, now let's refactor the code to distinguish regular files from sparse files and treat them differently. Regular files are still handled by io.Copy.
func writeFile(root string, hdr *tar.Header, tr *tar.Reader, sparseFile bool) {
	tarFile := path.Join(root, hdr.Name)
	file, err := os.OpenFile(tarFile, os.O_RDWR|os.O_CREATE|os.O_TRUNC, os.FileMode(hdr.Mode))
	if err != nil {
		log.Fatalln(err)
	}
	defer file.Close()
	err = file.Truncate(hdr.Size)
	if err != nil {
		log.Fatalln(err)
	}
	if sparseFile {
		_, err = writeTo(tr, file)
	} else {
		_, err = io.Copy(file, tr)
	}
	if err != nil {
		log.Fatalln(err)
	}
}
We extract a common function, writeFile: sparse files go through writeTo, regular files go through io.Copy, and we call it during extraction:
else if hdr.Typeflag == tar.TypeReg {
	writeFile(unTarDir, hdr, tr, false)
} else if hdr.Typeflag == tar.TypeGNUSparse {
	writeFile(unTarDir, hdr, tr, true)
}
Summary
That wraps up this hands-on exercise. Looking back, the first requirement was simply not giving up: if I had stopped when the tar package seemed not to support sparse files on first use, none of the interesting parts would have followed. Second, we dug into the source code and into how sparse files actually work; only with that understanding could we find the better method and reach the best performance.
This is an original article; please credit the source when reprinting: https://www.flysnow.org/