Talk about Git storage principle and related implementation

The following content is reproduced from https://www.toutiao.com/i6938300453767676455/

Flashing gene 2021-03-11 15:53:07


[Editor's note] Git is currently the most popular version control system. From local development to production deployment, we use Git for version control every day. Beyond the everyday commands, studying Git's underlying storage principles is a great help if you want to understand Git and its usage more deeply. Even if you are not a Git developer, it is worth understanding these principles: you will see Git in a new light and use it with more confidence in your daily work.

This article is aimed at readers who already have some understanding of Git. It does not introduce what Git is for or how to use it, nor does it compare Git with other version control systems such as Subversion. It mainly introduces the essence of Git and the principles behind its storage implementation, so that Git users have a clearer picture of what happens internally when they use Git for version control.

What is the essence of Git

Git is essentially a content-addressed key-value database. We can insert any kind of content into a Git repository, and Git returns a unique key that we can later use to retrieve that content. We can try this with the low-level command git hash-object:

➜  Zoker git:(master) ✗ cat testfile

Hello Git

➜  Zoker git:(master) ✗ git hash-object testfile -w

9f4d96d5b00d98959ea9960f069585ce42b1349a

You can see that there is a file named testfile in our directory whose content is Hello Git. We use the git hash-object command to write the content of this file into the Git repository. The -w option tells Git to write the content into the .git/objects object database directory. Git returns a SHA value, which is the key we will use to retrieve the content later:

➜  Zoker git:(master) ✗ git cat-file -p 9f4d96d5b00d98959ea9960f069585ce42b1349a

Hello Git

We used the git cat-file command to retrieve the content we had just saved in the Git repository. It is not as intuitive as the get and set commands of Redis, but it is indeed a KV database, isn't it?
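The key-value behaviour can be mimicked in a few lines of Python. This is a minimal in-memory sketch, not Git's implementation: the names ObjectStore, put, and get are made up for illustration. The only thing borrowed from Git is that the key of a blob is the SHA-1 of a small header (blob, a space, the content length, a null byte) followed by the content.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store: the key is derived from the value itself."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        # Git hashes a small header plus the content, not the content alone.
        header = b"blob %d\x00" % len(content)
        key = hashlib.sha1(header + content).hexdigest()
        self._objects[key] = content
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]


store = ObjectStore()
key = store.put(b"Hello Git\n")
print(key, store.get(key))
```

Storing the same bytes twice returns the same key, which is what "content-addressed" means: the address is a function of the content and nothing else.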

The data we just inserted is a basic blob object. Git also has other object types such as tree and commit. These object types are related to each other in specific ways, and only when objects are linked together through these relationships can we do version control and check out different versions. We will expand on the object types later. Let's first look at the directory structure of Git and see how data is stored.

Git directory structure

From the previous section we know that Git is essentially a KV database, and that content is written to the .git/objects object directory. So where is this directory located, and how does Git store this data? In this section we focus on the Git storage directory structure and look at how Git stores different types of data.

For a more detailed introduction, please refer to:
https://github.com/git/git/blo ... t.txt

With git init we can initialize an empty Git repository in the current directory. Git automatically generates a .git directory, which is the storage center for all subsequent Git data. Let's take a look at its directory structure:

➜  Zoker git init

Initialized empty Git repository in /Users/zoker/tmp/Zoker/.git/

➜  Zoker git:(master) ✗ tree .git

.git

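├── HEAD              // A symbolic reference that records which reference the working directory is currently on; running checkout changes the content of HEAD

├── config             // Settings for this repository, such as proxy, user information, and references; these take precedence over the global configuration

├── description      // Repository description

├── hooks             // Hook directory: callback scripts run around Git commands; some sample templates are present by default

│   ├── update.sample

│   ├── pre-receive.sample

│   └── ...

├── info                // Additional repository information such as refs, exclude, attributes

│   └── exclude

├── objects           // Data storage center (the object database)

│   ├── info

│   └── pack

└── refs               // Stores references, that is, branches and tags

    ├── heads

    └── tags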

A freshly initialized Git repository contains only these files. There are other files and directories, such as packed-refs, modules, and logs, which have specific uses and appear only after particular operations or configuration. Here we focus on the core storage implementation and introduce only the core files; you can look up the role and usage of the others in the documentation.

hooks directory

The hooks directory stores Git hooks. Hooks can be triggered before or after many events, which gives us a very flexible extension mechanism. By default every hook has the .sample suffix. To enable one, remove the suffix and make the file executable. Here are some commonly used hooks and their typical uses:

Client hook:

  • pre-commit: triggered before a commit is created; can be used to check, for example, whether the tests pass and whether the code format meets the requirements (the commit message itself is checked by the separate commit-msg hook)
  • post-commit: triggered after the whole commit has completed; can be used to send notifications

Server hook:

  • pre-receive: the first script called when the server receives a push request; it can check whether the pushed references meet the requirements
  • update: similar to pre-receive, but pre-receive runs only once per push, while update runs once for each pushed branch
  • post-receive: triggered after the whole push has completed; can be used to send notifications, trigger a build system, and so on
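As an illustration of a server hook: a pre-receive script receives one line per pushed reference on standard input, in the form "old-sha new-sha ref-name". The Python sketch below rejects branch deletions. The policy and the function names find_deleted_branches and main are invented for this example; they are not part of Git.

```python
import sys


def find_deleted_branches(lines):
    """Return the branch refs that a push is trying to delete."""
    deleted = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # ignore blank or malformed lines
        old_sha, new_sha, ref_name = parts
        # Git reports a deletion by sending an all-zero new SHA.
        is_deletion = set(new_sha) == {"0"}
        if is_deletion and ref_name.startswith("refs/heads/"):
            deleted.append(ref_name)
    return deleted


def main():
    deleted = find_deleted_branches(sys.stdin)
    for ref_name in deleted:
        print("deleting %s is not allowed" % ref_name, file=sys.stderr)
    # A non-zero exit status makes Git reject the whole push.
    return 1 if deleted else 0
```

To use this as a real hook, one would save it as hooks/pre-receive in the server-side repository, end the script with sys.exit(main()), and make the file executable.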

objects directory

As mentioned in the previous section, Git turns all content it receives into object files and stores them in this directory. We generated an object with git hash-object and wrote it to the Git repository; its key is 9f4d96d5b00d98959ea9960f069585ce42b1349a. Let's now look at the structure of the objects directory:

➜  Zoker git:(master) ✗ git hash-object testfile -w

9f4d96d5b00d98959ea9960f069585ce42b1349a

➜  Zoker git:(master) ✗ tree .git/objects

.git/objects

├── 9f

│   └── 4d96d5b00d98959ea9960f069585ce42b1349a

├── info

└── pack

You can see that the objects directory has new content: a 9f folder with a file in it. This file is the object we inserted into the Git repository. Git uses the first two characters of the key as the folder name and the remaining characters as the file name of the object file. Objects stored this way (that is, under objects/[0-9a-f][0-9a-f]) are generally called loose objects or unpacked objects.
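The mapping from key to file path is simple enough to write down. A small Python sketch (loose_object_path is a hypothetical helper name, and the default .git directory is an assumption):

```python
import posixpath


def loose_object_path(sha: str, git_dir: str = ".git") -> str:
    """Map a 40-character object key to the path of its loose object file."""
    if len(sha) != 40:
        raise ValueError("expected a full 40-character SHA-1")
    # The first two hex characters name the directory, the other 38 the file.
    return posixpath.join(git_dir, "objects", sha[:2], sha[2:])


print(loose_object_path("9f4d96d5b00d98959ea9960f069585ce42b1349a"))
# .git/objects/9f/4d96d5b00d98959ea9960f069585ce42b1349a
```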

Besides the object folders, careful readers will have noticed the objects/pack folder, which holds packed files. To save space and improve efficiency, Git packs loose objects into pack files when there are too many loose objects, when git gc is run manually, or during the transfers involved in push and pull. Here is what these packed files look like:

➜  objects git:(master) git gc

...

Compressing objects: 100% (75/75), done.

...

➜  objects git:(master) tree

.

├── pack

│   ├── pack-fe24a22b0313342a6732cff4759bedb25c2ea55d.idx

│   └── pack-fe24a22b0313342a6732cff4759bedb25c2ea55d.pack

└── ...

You can see that there are no longer any loose objects in the objects directory. Instead, there are two files in the pack directory: one is the pack file itself, and the other is the idx file, an index of the pack's content that makes it easy to check whether a given object is in the corresponding pack.

Note that running GC in a repository that contains only the blob object we created by hand has no effect, because nothing in the repository references this object yet. Such an object is said to be dangling. Next, let's look at the directory where references are stored.

refs directory

The refs directory stores our references. A reference can be regarded as an alias for a version: what it actually stores is the SHA value of a commit. The repository we used for testing above has no commits yet, so it contains only an empty directory structure.

└── refs

    ├── heads

    └── tags

Let's pick any repository that contains commits and look at its default branch, master.

➜  .git git:(master) cat refs/heads/master

87e917616712189ecac8c4890fe7d2dc2d554ac6

You can see that the master reference stores nothing but a commit SHA value. The advantage is that we don't need to remember the long SHA string; the alias master is enough to get this version. Likewise, the tags directory stores our tags. Unlike branches, the value recorded by a tag generally does not change, whereas a branch moves as new versions are committed. You may also see directories such as refs/remotes and refs/fetch, which store references in specific namespaces.

There is one more case, related to the GC mechanism mentioned above. When GC runs in a repository, not only are the loose objects in the objects directory packed, but the references under refs are packed as well. They are stored in the packed-refs file at the root of the bare repository, that is, .git/packed-refs:

➜  .git git:(master) cat packed-refs

# pack-refs with: peeled fully-peeled sorted

87e917616712189ecac8c4890fe7d2dc2d554ac6 refs/heads/master

When we access the branch master, Git first looks in refs/heads. If it cannot find the reference there, it looks in .git/packed-refs. Packing all references into one file improves efficiency considerably. Note that if we now add some commits to the master branch, Git does not modify the .git/packed-refs file directly. Instead, it recreates a master reference under refs/heads/ containing the SHA value of the latest commit. Because of the lookup order just described, Git finds this file in refs/heads/ first and never reaches the stale entry in .git/packed-refs.
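The lookup order can be sketched in Python. This is a simplified illustration under stated assumptions: resolve_ref is a hypothetical name, and the sketch ignores symbolic references (files containing "ref: ...") and the reftable backend; it only shows "loose file first, then packed-refs".

```python
import os
import tempfile


def resolve_ref(git_dir, ref_name):
    """Return the SHA a ref points to: loose file first, then packed-refs."""
    loose_path = os.path.join(git_dir, *ref_name.split("/"))
    if os.path.isfile(loose_path):
        with open(loose_path) as f:
            return f.read().strip()

    packed_path = os.path.join(git_dir, "packed-refs")
    if os.path.isfile(packed_path):
        with open(packed_path) as f:
            for line in f:
                # Skip the header comment and "^..." peeled-tag lines.
                if line.startswith(("#", "^")):
                    continue
                sha, _, name = line.strip().partition(" ")
                if name == ref_name:
                    return sha
    return None


# Demo in a throwaway directory standing in for .git
with tempfile.TemporaryDirectory() as git_dir:
    with open(os.path.join(git_dir, "packed-refs"), "w") as f:
        f.write("# pack-refs with: peeled fully-peeled sorted\n")
        f.write("87e917616712189ecac8c4890fe7d2dc2d554ac6 refs/heads/master\n")
    print(resolve_ref(git_dir, "refs/heads/master"))  # found in packed-refs
```

If a file refs/heads/master is later created in the same directory, resolve_ref returns its content instead, which mirrors how a new commit shadows the packed entry.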

So what does the commit SHA value stored in the reference point to? We can use the cat-file command to view the content of this object:

➜  .git git:(master) git cat-file -p 87e917616712189ecac8c4890fe7d2dc2d554ac6

tree aab1a9217aa6896ef46d3e1a90bc64e8178e1662 // the tree object it points to

parent 7d000309cb780fa27898b4d103afcfa95a8c04db // parent commit

author Zoker <[email protected]> 1607958804 +0800 // author information

committer Zoker <[email protected]> 1607958804 +0800 // committer information



test ssh // commit message

It is a commit object. Its main attributes are the tree object it points to, its parent commit (the first commit of a repository has no parent line at all), the author and committer information, and the commit message.
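Because the text of a commit object is just header lines, a blank line, and the message, it is easy to take apart. A minimal Python sketch follows; parse_commit is a hypothetical name, the identity in the sample is a placeholder, and multi-line headers such as gpgsig are deliberately not handled.

```python
def parse_commit(text):
    """Split the text of a commit object into header fields and the message."""
    header_part, _, message = text.partition("\n\n")
    fields = {"parent": []}
    for line in header_part.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            fields["parent"].append(value)  # a merge commit has several
        else:
            fields[key] = value
    return fields, message


# Sample text in the layout printed by "git cat-file -p <commit>";
# the author and committer identity is a placeholder.
sample = (
    "tree aab1a9217aa6896ef46d3e1a90bc64e8178e1662\n"
    "parent 7d000309cb780fa27898b4d103afcfa95a8c04db\n"
    "author A U Thor <author@example.com> 1607958804 +0800\n"
    "committer A U Thor <author@example.com> 1607958804 +0800\n"
    "\n"
    "test ssh\n"
)
fields, message = parse_commit(sample)
print(fields["tree"], fields["parent"], message)
```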

So what exactly is a commit object? What is the tree object it points to? How do they differ from the blob object we created by hand earlier? Let's now talk about Git's storage objects.

Git storage objects

In the Git world there are four types of storage objects: blob (file), tree, commit, and tag. Here we mainly discuss the first three, because they are the most basic Git data. A tag object is only used for a tag that carries additional attribute information, that is, an annotated tag, so we won't go into it here.

Introduction to lightweight and annotated tags:
https://git-scm.com/book/zh/v2 ... %25BE

Blob object

When introducing the essence of Git, in order to demonstrate that Git is a KV database based on content addressing, we inserted the content of a file into the Git repository:

➜  Zoker git:(master) ✗ cat testfile

Hello Git

➜  Zoker git:(master) ✗ git hash-object testfile -w

9f4d96d5b00d98959ea9960f069585ce42b1349a

The Git object whose key is 9f4d96d5b00d98959ea9960f069585ce42b1349a is actually a Blob object. It stores the content of the testfile file. We can use the cat-file command to view it:

➜  Zoker git:(master) ✗ git cat-file -p 9f4d96d5b00d98959ea9960f069585ce42b1349a

Hello Git

Every time we modify a file, Git saves a complete snapshot of the file rather than recording the differences. So if we modify testfile and save it to the Git repository again, Git generates a new key from the current content. Note that as long as the content does not change, the key stays the same. After all, as we said before, Git is a KV database based on content addressing.

The Blob object here stores text content, but a Blob can also hold binary content. However, using Git to version binary files is not recommended. The most common problem our Gitee platform runs into in daily operation is oversized user repositories, usually caused by users committing large binary files. Because every change to a file is recorded as a snapshot, a binary file that changes frequently multiplies the space it occupies. For text blobs, Git can store just the differences between versions when packing during GC, which saves space, but this generally works much less well for binary blobs. So try not to store frequently changing binary content in a Git repository; you can use LFS for that instead. If a large number of binary files already exist, you can use filter-branch to slim the repository down. New colleagues will definitely appreciate it when they clone the repository for the first time.

The use of LFS:
https://gitee.com/help/articles/4235
Slimming down a large repository:
https://gitee.com/help/articles/4232
filter-branch:
https://github.com/git/git/blo ... h.txt

Have you noticed that something is missing? That's right: a Blob object stores only the content of a file, not its name. So how do we know which file the content belongs to? The answer is another important Git object: the Tree object.

Tree object

In Git, the main function of a Tree object is to organize multiple Blobs or child Tree objects together; all content is stored as Tree and Blob objects. A Tree object contains one or more tree entries, and each entry holds the SHA value of a Blob or sub-Tree together with the corresponding file name and other information such as the file mode. This can be understood as similar to the relationship between inodes and blocks in a file system. A Tree object can be pictured as in the figure below:

[Figure: structure of a Tree object]

The directory structure corresponding to this Tree object is as follows:

.

├── LICENSE

├── readme.md

└── src

    ├── libssl.so

    └── logo.png

In this way we can store the content of our repository in a structured manner, much like directories are organized under Linux: think of a Tree as a directory and a Blob as the content of a specific file.

So how do we create a Tree object? In Git, a Tree object is created from the state of the staging area. This is the same staging area (index) we know from everyday Git use: we normally use git add to add files to it before committing. In an empty repository without any commits, the state of the staging area is simply the set of files you have added with git add, for example:

➜  Zoker git:(master) ✗ git status

On branch master



No commits yet



Changes to be committed:

(use "git rm --cached <file>..." to unstage)

new file:   LICENSE

new file:   readme.md



Untracked files:

(use "git add <file>..." to include in what will be committed)

src/

Here the staging area currently contains two files in the root directory. The state of the staging area is saved in the .git/index file. Let's use the file command to see what it is:

➜  Zoker git:(master) ✗ file .git/index

.git/index: Git index, version 2, 2 entries

You can see that the index file has two entries: the two files LICENSE and readme.md in the root directory. In a repository that already has commits, if nothing is staged, the index represents the directory tree of the current version. If a file is modified or deleted and the change is added to the staging area, the index changes: the entry for that file now points to the SHA value of the file's new Blob object.
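The "version 2, 2 entries" reported by the file command comes from the first 12 bytes of the index: a 4-byte signature DIRC, a 4-byte version number, and a 4-byte entry count, all big-endian. A Python sketch that decodes just this header (read_index_header is a hypothetical name, and the demo uses hand-built bytes rather than a real index file):

```python
import struct


def read_index_header(data: bytes):
    """Decode the fixed 12-byte header at the start of a .git/index file."""
    signature, version, entry_count = struct.unpack(">4sII", data[:12])
    if signature != b"DIRC":
        raise ValueError("not a Git index file")
    return version, entry_count


# Hand-built header bytes standing in for the start of a real index file.
fake_header = b"DIRC" + struct.pack(">II", 2, 2)
print(read_index_header(fake_header))  # (2, 2)
```

On a real repository the same function could be fed the bytes of .git/index read in binary mode.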

So to create a Tree object, we need to put something in the staging area. Besides git add, we can also use the low-level command update-index to populate the staging area. Next, we create a Tree object based on the testfile file created above. The first step is to add testfile to the staging area:

➜  Zoker git:(master) ✗ git update-index --add testfile // same as git add testfile

➜  Zoker git:(master) ✗ git status

On branch master



No commits yet



Changes to be committed:

(use "git rm --cached <file>..." to unstage)

new file:   testfile

In this process, Git inserts the content of testfile into the Git repository as a Blob and then records the SHA value of that Blob in the index, telling the staging area which content the file currently has.

➜  Zoker git:(master) ✗ tree .git/objects

.git/objects

├── 9f

│   └── 4d96d5b00d98959ea9960f069585ce42b1349a

├── info

└── pack



3 directories, 1 file

➜  Zoker git:(master) ✗ git cat-file -p 9f4d96d5b00d98959ea9960f069585ce42b1349a

Hello Git

When Git executes the update-index command, it stores the content of the specified file as a Blob object and records it in the index. We had already inserted this content with git hash-object, and because the content is unchanged, the SHA value of the generated Blob object is the same. Since the Blob already exists, the following command is equivalent:

git update-index --add --cacheinfo 100644 9f4d96d5b00d98959ea9960f069585ce42b1349a testfile

This command puts the previously generated Blob object into the staging area and gives it the file name testfile. Now that the staging area contains the file testfile, we can use the git write-tree command to create a Tree object from the current state of the staging area:

➜  Zoker git:(master) ✗ git write-tree

aa406ee8804971cf8edfd8c89ff431b0462e250c

➜  Zoker git:(master) ✗ tree .git/objects

.git/objects

├── 9f

│   └── 4d96d5b00d98959ea9960f069585ce42b1349a

├── aa

│   └── 406ee8804971cf8edfd8c89ff431b0462e250c

├── info

└── pack

After the command runs, Git generates a Tree object with the SHA value aa406ee8804971cf8edfd8c89ff431b0462e250c based on the current state of the staging area, and stores it in the .git/objects directory just like a Blob object.

➜  Zoker git:(master) ✗ git cat-file -p aa406ee8804971cf8edfd8c89ff431b0462e250c

100644 blob 9f4d96d5b00d98959ea9960f069585ce42b1349a    testfile

Viewing this Tree object with the cat-file command shows that it contains only one file, named testfile.
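What cat-file -p prints is a readable rendering; on disk a tree entry is the mode, a space, the name, a null byte, and then the 20 raw bytes of the SHA-1. The Python sketch below builds a tree body in that layout and hashes it. The names build_tree and tree_entry_sort_key are hypothetical, and the sketch assumes SHA-1 object names.

```python
import hashlib


def tree_entry_sort_key(entry):
    mode, name, _sha = entry
    # Git sorts entries by name, comparing directories as if they ended in "/".
    raw = name.encode()
    return raw + b"/" if mode == "40000" else raw


def build_tree(entries):
    """entries: list of (mode, name, hex_sha) tuples; returns (key, raw_body)."""
    body = b""
    for mode, name, hex_sha in sorted(entries, key=tree_entry_sort_key):
        # Each entry is "<mode> <name>\0" followed by the 20 raw SHA-1 bytes.
        body += mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(hex_sha)
    header = b"tree %d\x00" % len(body)
    return hashlib.sha1(header + body).hexdigest(), body


key, body = build_tree(
    [("100644", "testfile", "9f4d96d5b00d98959ea9960f069585ce42b1349a")]
)
print(key, len(body))
```

The printed key can be compared with the output of git write-tree for a staging area that contains only this one entry. Note that a sub-tree entry uses the mode 40000 in the raw data, which cat-file -p displays as 040000.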


Let's create a second Tree object. It should contain the modified testfile, a new file testfile2, and the first Tree object mounted as a directory named duplicate. First, we add the modified testfile and the new testfile2 to the staging area:

➜  Zoker git:(master) ✗ git update-index testfile

➜  Zoker git:(master) ✗ git update-index --add testfile2

➜  Zoker git:(master) ✗ git status

On branch master



No commits yet



Changes to be committed:

(use "git rm --cached <file>..." to unstage)

new file:   testfile

new file:   testfile2

Then we need to attach the first Tree object under the duplicate directory, which we can do with the read-tree command:

➜  Zoker git:(master) ✗ git read-tree --prefix=duplicate aa406ee8804971cf8edfd8c89ff431b0462e250c 

➜  Zoker git:(master) ✗ git status

On branch master



No commits yet



Changes to be committed:

(use "git rm --cached <file>..." to unstage)

new file:   duplicate/testfile

new file:   testfile

new file:   testfile2

Then we execute write-tree and view the second Tree object through cat-file:

➜  Zoker git:(master) ✗ git write-tree

64d62cef754e6cc995ed8d34f0d0e233e1dfd5d1

➜  Zoker git:(master) ✗ git cat-file -p 64d62cef754e6cc995ed8d34f0d0e233e1dfd5d1

040000 tree aa406ee8804971cf8edfd8c89ff431b0462e250c    duplicate

100644 blob 106287c47fd25ad9a0874670a0d5c6eacf1bfe4e    testfile

100644 blob 098ffe6f84559f4899edf119c25d276dc70607cf    testfile2

Done. We not only modified the content of testfile, but also added a new file testfile2 and attached the first Tree object as the duplicate directory of the second Tree object. At this point, the Tree object looks like this:

[Figure: the second Tree object, with the first Tree object as its duplicate directory]

So far we know how to create a Tree object by hand. But what if we need the snapshots of these two different Trees later? Do we have to remember the SHA values of the Tree objects? That would take a lot of effort. More importantly, we don't know who created a snapshot, when, or why. The Commit object solves this problem.

Commit object

A Commit object records additional information about a snapshot and maintains the relationship between successive snapshots. We can create a commit with the git commit-tree command. As the name suggests, this command turns a Tree object into a Commit object:

➜  Zoker git:(master) ✗ git commit-tree -h

usage: git commit-tree [(-p <parent>)...] [-S[<keyid>]] [(-m <message>)...] [(-F <file>)...] <tree>



-p <parent>           id of a parent commit object

-m <message>          commit message

-F <file>             read commit log message from file

-S, --gpg-sign[=<key-id>]

                      GPG sign commit

The two key parameters are -p and -m. -p specifies the parent commit of this commit; for the very first commit it can be omitted. -m specifies the commit message, which mainly describes the reason for the commit. Let's use the first Tree object for our initial commit:

➜  Zoker git:(master) ✗ git commit-tree -m "init commit" aa406ee8804971cf8edfd8c89ff431b0462e250c

17ae181bd6c3e703df7851c0f7ea01d9e33a675b

Use cat-file to view this commit:

tree aa406ee8804971cf8edfd8c89ff431b0462e250c

author Zoker <[email protected]> 1613225370 +0800

committer Zoker <[email protected]> 1613225370 +0800



init commit

A Commit stores a reference to a Tree object and records the author, the committer, the commit time, and the commit message. Based on this commit, we now create a second one that points to the second Tree object:

➜  Zoker git:(master) ✗ git commit-tree -p 17ae181bd -m "add dir" 64d62cef754e6cc995ed8d34f0d0e233e1dfd5d1

de96a74725dd72c10693c4896cb74e8967859e58

➜  Zoker git:(master) ✗ git cat-file -p de96a74725dd72c10693c4896cb74e8967859e58

tree 64d62cef754e6cc995ed8d34f0d0e233e1dfd5d1

parent 17ae181bd6c3e703df7851c0f7ea01d9e33a675b

author Zoker <[email protected]> 1613225850 +0800

committer Zoker <[email protected]> 1613225850 +0800



add dir
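The layout printed above is also, byte for byte, what gets hashed to produce the commit key. A Python sketch that assembles a commit object the same way follows; build_commit is a hypothetical name, the identity string is a placeholder, and the sketch assumes a plain UTF-8 message without signatures or extra headers.

```python
import hashlib


def build_commit(tree, parents, author, committer, message):
    """Return (key, raw_body) for a commit object; all arguments are strings."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]  # empty for the first commit
    lines.append("author " + author)
    lines.append("committer " + committer)
    # Headers, one blank line, then the message with a trailing newline.
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode()
    header = b"commit %d\x00" % len(body)
    return hashlib.sha1(header + body).hexdigest(), body


identity = "A U Thor <author@example.com> 1613225370 +0800"  # placeholder
first_key, _ = build_commit(
    "aa406ee8804971cf8edfd8c89ff431b0462e250c", [], identity, identity, "init commit"
)
second_key, second_body = build_commit(
    "64d62cef754e6cc995ed8d34f0d0e233e1dfd5d1", [first_key], identity, identity, "add dir"
)
print(second_key)
print(second_body.decode())
```

Because the parent key is part of the hashed content, changing anything in the first commit changes the key of every commit built on top of it.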

We can use git log to view these two commits. Here we add the --stat parameter to see which files changed:

commit de96a74725dd72c10693c4896cb74e8967859e58

Author: Zoker <[email protected]>

Date:   Sat Feb 13 22:17:30 2021 +0800



add dir



duplicate/testfile | 1 +

testfile           | 2 +-

testfile2          | 1 +

3 files changed, 3 insertions(+), 1 deletion(-)



commit 17ae181bd6c3e703df7851c0f7ea01d9e33a675b

Author: Zoker <[email protected]>

Date:   Sat Feb 13 22:09:30 2021 +0800



init commit



testfile | 1 +

1 file changed, 1 insertion(+)

At this time, the structure of the entire object is as follows:

[Figure: the two Commit objects and the Tree and Blob objects they reference]

Exercise: Use low-level commands to create a commit

Using only the low-level commands mentioned above, such as hash-object, write-tree, read-tree, and commit-tree, create a commit, and think about which steps correspond to git add and which to git commit.

Object storage method

From the previous sections we know that Git organizes data into different object types and computes a SHA value from the content to use as the address. So how is it computed? Taking the Blob object as an example, Git performs the following steps:

  • Identify the type of the object and construct the header: the type, a space, the content length in bytes, and a null byte, for example blob 151\u0000
  • Concatenate the header with the content and compute the SHA-1 checksum of the result
  • Compress the header plus content with zlib
  • Store the compressed data under the objects directory, at the path derived from the SHA value

That is the whole process. Tree and Commit objects are handled the same way, except that the type in the header differs, so we won't go into details here. The Git internals chapter of Pro Git (2nd edition) shows how to implement the same logic in Ruby; you can read it yourself if you are interested.
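The four steps can be sketched end to end in Python. This is an illustrative sketch, not Git's code: write_object and read_object are hypothetical names, the objects directory is a temporary directory rather than a real .git/objects, and pack files are not considered.

```python
import hashlib
import os
import tempfile
import zlib


def write_object(objects_dir, obj_type, content):
    """Store content as a loose object and return its key."""
    header = ("%s %d" % (obj_type, len(content))).encode() + b"\x00"
    store = header + content
    key = hashlib.sha1(store).hexdigest()
    path = os.path.join(objects_dir, key[:2], key[2:])
    if not os.path.exists(path):  # same content, same key: nothing to do
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(zlib.compress(store))
    return key


def read_object(objects_dir, key):
    """Load a loose object and return (type, content)."""
    path = os.path.join(objects_dir, key[:2], key[2:])
    with open(path, "rb") as f:
        store = zlib.decompress(f.read())
    header, _, content = store.partition(b"\x00")
    obj_type, _, size = header.partition(b" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content")
    return obj_type.decode(), content


with tempfile.TemporaryDirectory() as objects_dir:
    key = write_object(objects_dir, "blob", b"Hello Git\n")
    print(key, read_object(objects_dir, key))
```

The read side is roughly what git cat-file does for a loose object: decompress, split off the header at the first null byte, and return the rest.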

Git-Internal Principle:
https://git-scm.com/book/zh/v2 ... %25A1

Git references

We can view the first version with git log --stat 17ae181b as above, and we can get the content of this snapshot through its SHA value. But this is still cumbersome, because we have to remember a meaningless string. This is where Git references come in handy. In the section on the Git directory structure we introduced the refs directory, and we know that a reference stores the key of a Commit object, that is, its SHA value. In this way we can give the current version a meaningful name. Usually master is used as the default branch reference:

➜  Zoker git:(master) ✗ echo "17ae181bd6c3e703df7851c0f7ea01d9e33a675b" > .git/refs/heads/master

➜  Zoker git:(master) ✗ tree .git/refs

.git/refs

├── heads

│   └── master

└── tags

Now master stores the SHA value of our first Commit, and we can use master in place of the meaningless string 17ae181b.

➜  Zoker git:(master) ✗ git cat-file -p master

tree aa406ee8804971cf8edfd8c89ff431b0462e250c

author Zoker <[email protected]> 1613225370 +0800

committer Zoker <[email protected]> 1613225370 +0800



init commit

However, this is not our latest version. Our latest version is the second commit, de96a74725dd72c10693c4896cb74e8967859e58. We could likewise change the content of refs/heads/master to the SHA value of this commit by hand, but this time we use the low-level command git update-ref instead.

➜  Zoker git:(master) ✗ git update-ref refs/heads/master de96a74725dd72c10693c4896cb74e8967859e58

➜  Zoker git:(master) ✗ cat .git/refs/heads/master

de96a74725dd72c10693c4896cb74e8967859e58

At this time, the branch master points to our latest version.


Summary

This article has covered the basic storage principles of Git and some of their implementation. Other topics, such as pack files, the transfer negotiation mechanism, and the storage format, are not covered here due to space limitations. We will discuss them later in the context of specific scenarios.

Author: Zoker

https://zoker.io/blog/talk-about-git-internals

Origin blog.csdn.net/pyf09/article/details/115095221