From basic Git commands to the underlying principles: implementing a simple Git by hand

At first I worried that Git's internals would be hard to understand, but after reading the official documentation I found they are not. It even seemed possible to implement a simple Git by hand, which is how this learning record came about.

The narrative of this article follows the internals part of the official Git book and stops at certain points to discuss the code implementation. Official document: link .

After reading this article, you will: 1. Understand the design ideas behind git. 2. Maybe have a little fun?

I chose Go as the programming language because I had only just started learning it and wanted more practice.

This is my repository address, but if you are a beginner like me, you may not get far by reading the code directly. I recommend following the article.

Mini git implementation - link

If the article is hard to follow, read the internals part of the official documentation first and then come back. It may be easier to understand then.

1. init

Before learning the principles of git, let's forget about the cool git commands like commit, branch, and tag that we usually use. We will find out their essence later.

As you may know, git was written by Linus Torvalds while he was developing Linux, to manage versions of the Linux source. Recording how the files of a project change from version to version is therefore the core function of git.

Experts always introduce abstractions when they design software. To understand their design, we have to think in terms of those abstractions. That may sound a bit mysterious, but the abstractions all end up as concrete code, so don't worry, they are easy to understand.

First of all, we need the concept of an object, which is the lowest-level abstraction in git. You can think of git as an object database.

Without further ado, follow the commands below and you will see git in a new light. First, we create a git repository in any directory:

My operating environment is win10 + git bash

$ git init git-test
Initialized empty Git repository in C:/git-test/.git/

You can see that git created an empty git repository for us. It contains a .git directory, whose structure is as follows:

$ ls
config  description  HEAD  hooks/  info/  objects/  refs/

Inside the .git directory, let's focus on .git/objects. We said at the start that git is an object database; this directory is where git stores its objects.

Inside .git/objects we can see two directories, info and pack, but they have nothing to do with the core functionality. All we need to know is that, apart from these two, .git/objects is empty.

Let's stop here. Let's implement this part first. The logic is very simple. We only need to write an entry function, parse the command line parameters, and create the corresponding directories and files in the specified directory after getting the init command.
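The logic just described can be sketched in Go roughly as follows. This is a minimal illustration, not the code from my repository: the names initRepo and isRepo are made up for this example, only a subset of the files that the real git init creates is written, and the HEAD content assumes master as the default branch.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// initRepo creates a minimal .git skeleton under root.
// Hypothetical helper: the real `git init` also writes config, description, and hook samples.
func initRepo(root string) error {
	gitDir := filepath.Join(root, ".git")
	dirs := []string{"hooks", "info", "objects/info", "objects/pack", "refs/heads", "refs/tags"}
	for _, d := range dirs {
		if err := os.MkdirAll(filepath.Join(gitDir, filepath.FromSlash(d)), 0o755); err != nil {
			return err
		}
	}
	// HEAD starts out pointing at a branch that has no commits yet.
	head := []byte("ref: refs/heads/master\n")
	return os.WriteFile(filepath.Join(gitDir, "HEAD"), head, 0o644)
}

// isRepo reports whether root contains the objects directory created by initRepo.
func isRepo(root string) bool {
	info, err := os.Stat(filepath.Join(root, ".git", "objects"))
	return err == nil && info.IsDir()
}

func main() {
	// Entry function: parse the command line and dispatch on the init command.
	if len(os.Args) != 3 || os.Args[1] != "init" {
		fmt.Fprintln(os.Stderr, "usage: jun init <directory>")
		return
	}
	if err := initRepo(os.Args[2]); err != nil {
		fmt.Fprintln(os.Stderr, "init failed:", err)
		return
	}
	fmt.Println("Initialized empty repository in", filepath.Join(os.Args[2], ".git"))
}
```

Unlike the linked implementation, this sketch does check the errors returned when creating files and directories.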

Here is my implementation: init

For the sake of readability, there is no error handling for creating files/directories.

I gave it a rather down-to-earth name, jun. In fact, you can call it anything (⊙ˍ⊙).

2. object

Next we enter the git repository directory and add a file:

$ echo "version1" > file.txt

Next we add a record of this file to git. Note that we will not use the add command for now. Although that is what we would normally do, this article is about revealing the principles, so here we introduce a git command you may not have heard of: git hash-object.

$ git hash-object -w file.txt
5bdcfc19f119febc749eef9a9551bc335cb965e2

After the instruction is executed, a hash value is returned. In fact, this instruction has added the content of file.txt into the object database in the form of an object, and this hash value corresponds to the object.

To verify that git wrote this object into the database (saved as a file), let's check the .git/objects directory:

$ find .git/objects/ -type f    # -type selects the entry type; f means regular file
.git/objects/5b/dcfc19f119febc749eef9a9551bc335cb965e2

We find a new folder named 5b containing a file named dcfc19f119febc749eef9a9551bc335cb965e2. In other words, git uses the first 2 characters of the object's hash value as the directory name and the remaining 38 characters as the file name when storing the object in the object database.
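The mapping from a hash value to a storage location is simple enough to show in a few lines. The function name objectPath is chosen for this example only.

```go
package main

import (
	"fmt"
	"path"
)

// objectPath maps a 40-character hex object ID to its location
// relative to the repository root: 2 characters of directory, 38 of file name.
func objectPath(hash string) (string, error) {
	if len(hash) != 40 {
		return "", fmt.Errorf("expected 40 hex characters, got %d", len(hash))
	}
	return path.Join(".git", "objects", hash[:2], hash[2:]), nil
}

func main() {
	p, err := objectPath("5bdcfc19f119febc749eef9a9551bc335cb965e2")
	if err != nil {
		panic(err)
	}
	fmt.Println(p) // .git/objects/5b/dcfc19f119febc749eef9a9551bc335cb965e2
}
```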

According to the official description, git hash-object computes the ID of an object. -w is an optional flag that writes the object into the object database; another option, -t, specifies the object type. If no type is given, the default is blob.

Now you may be curious about what is stored in the object. Let's use the git cat-file command to check:

$ git cat-file -p 5bdc  # -p: show the content of the object; a hash prefix is enough
version1

$ git cat-file -t 5bdc  # -t: show the type of the object
blob

With this groundwork in place, we can now uncover the secret of git version control!

We change the content of file.txt and rewrite it into the object database:

$ echo "version2" > file.txt
$ git hash-object -w file.txt
df7af2c382e49245443687973ceb711b2b74cb4a

The console returned a new hash value, let's check the object database again:

$ find .git/objects -type f
.git/objects/5b/dcfc19f119febc749eef9a9551bc335cb965e2
.git/objects/df/7af2c382e49245443687973ceb711b2b74cb4a

(゚Д゚)Found one more object! Let's check the contents of the new object:

$ git cat-file -p df7a
version2

$ git cat-file -t df7a
blob

Seeing this, you may have a better understanding of the concept that git is an object database: git saves the contents of each version of a file in an object.

If you want to restore file.txt to the state of the first version, just do this:

$ git cat-file -p 5bdc > file.txt

Then view the contents of file.txt:

$ cat file.txt
version1

With that, a version control system that can record file versions and restore a file to any recorded version is complete (ง •_•)ง!

Not so difficult, is it? You can think of git as a key-value database in which each hash value maps to an object.

Let's stop here and implement this part.



At first I was a little curious: why can't we just use the cat command to view an object, and why does git provide its own git cat-file command? On reflection, git certainly does not store the file content in the object as is. It is compressed, so it has to be decompressed before it can be read.

Both commands are implemented following the official description. Let's start with git hash-object. The content of an object is built like this:

  1. First, construct the header, which consists of the object type, a space, the number of bytes of content, and a null byte (written here as \u0000). For file.txt the content is 9 bytes: version1 plus the trailing newline. The format is as follows:
blob 9\u0000
  2. Then concatenate the header with the original data. The SHA-1 hash of this concatenated data is the ID of the object. The result looks like this:
blob 9\u0000version1
  3. Finally, compress the concatenated data with zlib and store it in the object file.
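The three steps above can be sketched with Go's standard library. The function name encodeObject is made up for this example and is not necessarily how the linked implementation is organized.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// encodeObject builds "<type> <size>\x00<content>" and returns the object ID
// (SHA-1 of the uncompressed bytes) plus the zlib-compressed bytes to store.
func encodeObject(objType string, content []byte) (string, []byte, error) {
	header := fmt.Sprintf("%s %d\x00", objType, len(content))
	raw := append([]byte(header), content...)

	sum := sha1.Sum(raw)
	id := hex.EncodeToString(sum[:])

	var buf bytes.Buffer
	zw := zlib.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		return "", nil, err
	}
	if err := zw.Close(); err != nil { // Close flushes the remaining compressed data
		return "", nil, err
	}
	return id, buf.Bytes(), nil
}

func main() {
	id, compressed, err := encodeObject("blob", []byte("version1\n"))
	if err != nil {
		panic(err)
	}
	fmt.Println(id, len(compressed))
}
```

If file.txt holds version1 followed by a single newline, the printed ID should match the one git hash-object returned earlier.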

The implementation of git cat-file is the reverse. First decompress the data stored in the object file with zlib, split the decompressed data at the space and the null byte, and then return the type or the content of the object depending on whether -t or -p was given.
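A sketch of that reverse direction follows. decodeObject and the demo helper compress are names invented for this example; a real cat-file would read the compressed bytes from the object file instead of producing them in memory.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"errors"
	"fmt"
	"io"
)

// decodeObject inflates a stored object and splits it into type and content.
func decodeObject(compressed []byte) (objType string, content []byte, err error) {
	zr, err := zlib.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return "", nil, err
	}
	defer zr.Close()
	raw, err := io.ReadAll(zr)
	if err != nil {
		return "", nil, err
	}
	// The header ends at the first null byte: "<type> <size>\x00".
	nul := bytes.IndexByte(raw, 0)
	if nul < 0 {
		return "", nil, errors.New("malformed object: no null byte")
	}
	header, body := raw[:nul], raw[nul+1:]
	space := bytes.IndexByte(header, ' ')
	if space < 0 {
		return "", nil, errors.New("malformed object: no space in header")
	}
	return string(header[:space]), body, nil
}

// compress is a demo-only helper that deflates raw bytes with zlib.
func compress(raw []byte) []byte {
	var buf bytes.Buffer
	zw := zlib.NewWriter(&buf)
	zw.Write(raw) // writes to a bytes.Buffer do not fail
	zw.Close()
	return buf.Bytes()
}

func main() {
	stored := compress([]byte("blob 9\x00version1\n"))
	objType, content, err := decodeObject(stored)
	if err != nil {
		panic(err)
	}
	fmt.Println("-t:", objType)
	fmt.Printf("-p: %s", content)
}
```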

Here is my implementation: hash-object and cat-file

I went with a plain, brute-force procedural implementation, but I already have a vague feeling that many functions will be reused later, so I wrote unit tests first to make later refactoring easier.

3. tree object

In the previous chapter, careful readers may have noticed that git saves the content of our file as an object of type blob. These blob objects seem to store only the content of the file, not its name.

And when we develop a project, it never consists of a single file. Normally we want version management for a whole project, which contains multiple files and folders.

So the basic blob object is no longer enough. We need to introduce a new kind of object, the tree object, which can both store file names and organize multiple files together.

But here comes the problem: introducing a concept is easy, but how do we write it in code? (T_T) My first idea was to create a tree object in memory and then add content to that tree object. But this seems cumbersome: every time you add something, you have to supply the hash value of the tree object. It would also make the tree object mutable, and a mutable object defeats the purpose of storing fixed version information.

Let's see how git approaches this problem. When creating a tree object, git introduces a concept called the staging area, which is a neat idea! Think about it: a tree object should capture the version information of the whole project, and a project has many files. So we first put the files into the staging area, and git then creates the tree object in one go from its contents. That way the version information is recorded!

Let's first work with the staging area hands-on to deepen our understanding. We introduce a new command, git update-index, which adds a file to the staging area by hand. The --add option is needed because the file did not exist in the staging area before.

$ git update-index --add file.txt

Then we look at what changed in the .git directory:

$ ls
config  description  HEAD  hooks/  index  info/  objects/  refs/

$ find .git/objects/ -type f
objects/5b/dcfc19f119febc749eef9a9551bc335cb965e2
objects/df/7af2c382e49245443687973ceb711b2b74cb4a

We find a new file named index in the .git directory; this is presumably our staging area. Nothing has changed in the objects directory.

Let's check the contents of the staging area with a new command: git ls-files --stage

$ git ls-files --stage
100644 df7af2c382e49245443687973ceb711b2b74cb4a 0       file.txt

We find that the staging area stores a record for each file we add: the file mode, the hash of the blob object holding the file content, a stage number (0 here), and the file name.
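One way to model such a record in Go is shown below. The type IndexEntry and the helpers addEntry and formatEntry are names invented for this sketch; the real index file is a binary format with more fields, which this example does not attempt to reproduce.

```go
package main

import (
	"fmt"
	"sort"
)

// IndexEntry mirrors one line of `git ls-files --stage`.
type IndexEntry struct {
	Mode  string // file mode, e.g. "100644" for a regular file
	Hash  string // ID of the blob holding the file content
	Stage int    // 0 for normally staged files
	Path  string // slash-separated path relative to the repository root
}

// addEntry inserts or replaces the entry for e.Path and keeps the list sorted by path.
func addEntry(entries []IndexEntry, e IndexEntry) []IndexEntry {
	for i := range entries {
		if entries[i].Path == e.Path {
			entries[i] = e // the same file was staged again with new content
			return entries
		}
	}
	entries = append(entries, e)
	sort.Slice(entries, func(i, j int) bool { return entries[i].Path < entries[j].Path })
	return entries
}

// formatEntry renders an entry the way `git ls-files --stage` prints it:
// mode, hash, and stage separated by spaces, then a tab and the path.
func formatEntry(e IndexEntry) string {
	return fmt.Sprintf("%s %s %d\t%s", e.Mode, e.Hash, e.Stage, e.Path)
}

func main() {
	var entries []IndexEntry
	entries = addEntry(entries, IndexEntry{"100644", "df7af2c382e49245443687973ceb711b2b74cb4a", 0, "file.txt"})
	for _, e := range entries {
		fmt.Println(formatEntry(e))
	}
}
```

Keeping the entries sorted by path matches the order in which git ls-files --stage lists them and makes later processing deterministic.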

Then we save the contents of the current buffer in the form of a tree object. Introduce a new instruction: git write-tree

$ git write-tree
907aa76a1e4644e31ae63ad932c99411d0dd9417

After entering the command, we get the hash value of the newly generated tree object, let's verify whether it exists, and see its content:

$ find .git/objects/ -type f
.git/objects/5b/dcfc19f119febc749eef9a9551bc335cb965e2 # blob object whose content is version1
.git/objects/90/7aa76a1e4644e31ae63ad932c99411d0dd9417 # the new tree object
.git/objects/df/7af2c382e49245443687973ceb711b2b74cb4a # blob object whose content is version2

$ git cat-file -p 907a
100644 blob df7af2c382e49245443687973ceb711b2b74cb4a    file.txt

By now you probably have a basic idea of how the staging area and the tree object relate.

Next, let's look at two further points: how to record a file whose content git has not stored yet, and how to record a folder.

Step by step: first create a new file and add it to the staging area:

$ echo abc > new.txt

$ git update-index --add new.txt

$ git ls-files --stage
100644 df7af2c382e49245443687973ceb711b2b74cb4a 0       file.txt
100644 8baef1b4abc478178b004d62031cf7fe6db6f903 0       new.txt

Checking the staging area, we see that a record for the new file has been appended, and that it also has a hash value. Let's look at the object behind that hash:

$ find .git/objects/ -type f
.git/objects/5b/dcfc19f119febc749eef9a9551bc335cb965e2 # blob object whose content is version1
.git/objects/8b/aef1b4abc478178b004d62031cf7fe6db6f903 # the new object
.git/objects/90/7aa76a1e4644e31ae63ad932c99411d0dd9417 # tree object
.git/objects/df/7af2c382e49245443687973ceb711b2b74cb4a # blob object whose content is version2

$ git cat-file -p 8bae
abc

$ git cat-file -t 8bae
blob

We find that when new.txt was added to the staging area, git automatically created a blob object for its content.

Let's try to create a folder and add it to the staging area:

$ mkdir dir

$ git update-index --add dir
error: dir: is a directory - add files inside instead
fatal: Unable to process path dir

As a result, git tells us that a directory itself cannot be added; we have to add the files inside it. So we create a file in the folder and add that file to the staging area:

$ echo 123 > dir/dirFile.txt

$ git update-index --add dir/dirFile.txt

Success~ Then check the contents of the staging area:

$ git ls-files --stage
100644 190a18037c64c43e6b11489df4bf0b9eb6d2c9bf 0       dir/dirFile.txt
100644 df7af2c382e49245443687973ceb711b2b74cb4a 0       file.txt
100644 8baef1b4abc478178b004d62031cf7fe6db6f903 0       new.txt

$ git cat-file -t 190a
blob

As in the previous demo, a blob object is automatically created for the file content.

Next, we save the current staging area as a tree object:

$ git write-tree
dee1f9349126a50a52a4fdb01ba6f573fa309e8f

$ git cat-file -p dee1
040000 tree 374e190215e27511116812dc3d2be4c69c90dbb0    dir
100644 blob df7af2c382e49245443687973ceb711b2b74cb4a    file.txt
100644 blob 8baef1b4abc478178b004d62031cf7fe6db6f903    new.txt

The new tree object holds the current state of the staging area. Note that the staging area records dir/dirFile.txt as a single blob entry, and that while saving the tree object git creates a separate tree object for the directory dir. Let's verify it:

$ git cat-file -p 374e
100644 blob 190a18037c64c43e6b11489df4bf0b9eb6d2c9bf    dirFile.txt

$ git cat-file -t 374e
tree

So git created a tree object for the directory dir, and that tree stores the information about dirFile.txt. Doesn't this look familiar? A tree object is a model of a file directory!

Let's stop here and start coding!

This time we need to implement the three commands above:

  1. git update-index --add

git update-index updates the staging area. The official command has many options; we only implement --add, which adds a file to the staging area. The overall flow: the first time a file is added, we create the index file; if the index file already exists, we read the contents of the staging area from it (note that in my implementation this involves a decompression step). We then add the new file's information to the staging area, compress the contents, and save them back to the index file.

This involves serialization and deserialization; please allow me to be lazy and simulate it with JSON ψ(._. )>.

  2. git ls-files --stage

git ls-files shows information about files in the staging area and the working tree. It also has many options; we only implement --stage, which shows the contents of the staging area (ls-files without options lists the staged files under the current directory, including those in subdirectories). Implementation: read the contents of the staging area from the index file, decompress them, and print them to standard output in the expected format.

  3. git write-tree

git write-tree converts the contents of the staging area into a tree object. As in the earlier demonstration, a tree object has to be built recursively for each folder, which is probably the hardest part of this chapter.
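The recursive part of write-tree can be sketched as follows. This is not the code from my repository: Entry, store, and writeTree are names invented for the example, the object database is replaced by an in-memory map, every file is assumed to have mode 100644, and the sort order is simplified. The entry layout (mode, space, name, null byte, 20 raw hash bytes) follows real Git's tree format as far as I understand it.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// Entry is one staged file: a slash-separated path and the hex ID of its blob.
type Entry struct {
	Path string
	Hash string
}

// store stands in for the object database: object ID -> uncompressed object bytes.
type store map[string][]byte

// put hashes "<type> <size>\x00<body>", saves it, and returns the hex ID.
func (s store) put(objType string, body []byte) string {
	raw := append([]byte(fmt.Sprintf("%s %d\x00", objType, len(body))), body...)
	sum := sha1.Sum(raw)
	id := hex.EncodeToString(sum[:])
	s[id] = raw
	return id
}

// writeTree writes one tree object per directory level and returns the root tree ID.
func writeTree(s store, entries []Entry) (string, error) {
	type item struct{ mode, name, hash string }
	var items []item
	subdirs := map[string][]Entry{}
	for _, e := range entries {
		if dir, rest, nested := strings.Cut(e.Path, "/"); nested {
			// Group files by their first path component; handled recursively below.
			subdirs[dir] = append(subdirs[dir], Entry{Path: rest, Hash: e.Hash})
		} else {
			items = append(items, item{"100644", e.Path, e.Hash}) // assumption: regular files only
		}
	}
	for dir, children := range subdirs {
		id, err := writeTree(s, children)
		if err != nil {
			return "", err
		}
		items = append(items, item{"40000", dir, id}) // stored without the leading zero
	}
	// Simplification: plain name order. Real Git compares directory names as if they ended in "/".
	sort.Slice(items, func(i, j int) bool { return items[i].name < items[j].name })

	var body []byte
	for _, it := range items {
		rawHash, err := hex.DecodeString(it.hash)
		if err != nil {
			return "", err
		}
		body = append(body, []byte(it.mode+" "+it.name+"\x00")...)
		body = append(body, rawHash...)
	}
	return s.put("tree", body), nil
}

func main() {
	s := store{}
	id, err := writeTree(s, []Entry{
		{"dir/dirFile.txt", "190a18037c64c43e6b11489df4bf0b9eb6d2c9bf"},
		{"file.txt", "df7af2c382e49245443687973ceb711b2b74cb4a"},
		{"new.txt", "8baef1b4abc478178b004d62031cf7fe6db6f903"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(id, "objects written:", len(s))
}
```

Because each subdirectory becomes its own tree object before the parent is serialized, the parent only ever needs the child's hash, which is what keeps every tree immutable.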

The code is as follows: update-index --add, ls-files --stage, write-tree

I feel that object can be abstracted, so I refactored the code related to object : refactor object part

When this part is completed, we already have a system capable of version management of folders (ง •_•)ง.

4. commit object

Although we can already use a tree object to represent the version of the entire project, something still seems to be missing:

The tree object only records the version information of the files. Who made this version? Why was it changed? What was the previous version? None of this information is saved.

Now it's time for the commit object to make its appearance! How does it feel to explore upward from the bottom layer?

Let's try it with git first and then consider how to implement it. We use the commit-tree command to create a commit object that points to the tree object generated at the end of Chapter 3.

$ git commit-tree dee1 -m 'first commit'
893fba19d63b401ae458c1fc140f1a48c23e4873

Because the timestamp and the author will differ, the hash value you get will be different. Let's take a look at this newly generated commit object:

$ git cat-file -p 893f
tree dee1f9349126a50a52a4fdb01ba6f573fa309e8f
author liuyj24 <[email protected]> 1608981484 +0800
committer liuyj24 <[email protected]> 1608981484 +0800

first commit

As you can see, this commit object points to a tree object. The second and third lines carry the author and committer information, and the text after the blank line is the commit message.
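The text of a commit object can be assembled like this. Signature and commitBody are names invented for the sketch, and the e-mail address in the example is a placeholder. The result still has to be wrapped in a "commit <size>" header with a null byte, hashed, and compressed like any other object.

```go
package main

import (
	"fmt"
	"strings"
)

// Signature is a name, e-mail, and timestamp as stored in a commit.
type Signature struct {
	Name, Email string
	Unix        int64  // seconds since the epoch
	Zone        string // UTC offset such as "+0800"
}

// String renders the signature in the form used on author and committer lines.
func (s Signature) String() string {
	return fmt.Sprintf("%s <%s> %d %s", s.Name, s.Email, s.Unix, s.Zone)
}

// commitBody renders the content of a commit object, without the object header.
func commitBody(tree string, parents []string, author, committer Signature, message string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "tree %s\n", tree)
	for _, p := range parents { // a root commit has no parent line
		fmt.Fprintf(&b, "parent %s\n", p)
	}
	fmt.Fprintf(&b, "author %s\n", author)
	fmt.Fprintf(&b, "committer %s\n", committer)
	b.WriteString("\n" + message + "\n")
	return b.String()
}

func main() {
	// Placeholder identity; a real tool would read it from configuration.
	me := Signature{Name: "jun", Email: "jun@example.com", Unix: 1608981484, Zone: "+0800"}
	fmt.Print(commitBody("dee1f9349126a50a52a4fdb01ba6f573fa309e8f", nil, me, me, "first commit"))
}
```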

Next we modify our project to simulate a version change:

$ echo version3 > file.txt

$ git update-index --add file.txt

$ git write-tree
ff998d076c02acaf1551e35d76368f10e78af140

Then we create a new commit object and set its parent to the first commit object:

$ git commit-tree ff99 -m 'second commit' -p 893f
b05c65b6fdd7e13a51aaf1abb8ff3e795835bfb0

We modify the project once more and create a third commit object:

$ echo version4 >file.txt

$ git update-index --add file.txt

$ git write-tree
1403e859154aee76360e0082c4b272e5d145e13e

$ git commit-tree 1403 -m 'third commit' -p b05c
fe2544fb26a26f0412ce32f7418515a66b31b22d

Then we execute the git log command to view our commit history:

$ git log fe25
commit fe2544fb26a26f0412ce32f7418515a66b31b22d
Author: liuyj24 <[email protected]>
Date:   Sat Dec 26 19:36:31 2020 +0800

    third commit

commit b05c65b6fdd7e13a51aaf1abb8ff3e795835bfb0
Author: liuyj24 <[email protected]>
Date:   Sat Dec 26 19:34:25 2020 +0800

    second commit

commit 893fba19d63b401ae458c1fc140f1a48c23e4873
Author: liuyj24 <[email protected]>
Date:   Sat Dec 26 19:18:04 2020 +0800

    first commit

How about that? Does it feel like a sudden enlightenment?

Let's stop here and implement this part.

There are two commands in total:

  1. commit-tree

Create a commit object that points to a tree object, add the author information, committer information, and commit message, and optionally add a parent (the parent may be omitted). The author and committer information is hard-coded for now. In git it can be set with the git config command and inspected in .git/config; it is really just reading and writing a configuration file.

  2. log

Starting from the hash value of the given commit object, look up its parent and print the information; this is quick to implement with recursion.
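The parent lookup behind log can be sketched like this. The text above suggests recursion; this example uses an equivalent loop. Commit, parseCommit, and history are names invented for the sketch, the load function stands in for reading and decoding an object from the database, and only the first parent is followed.

```go
package main

import (
	"fmt"
	"strings"
)

// Commit holds the fields that log needs.
type Commit struct {
	Parent  string // empty for the root commit
	Message string
}

// parseCommit extracts the first parent and the message from a commit body.
func parseCommit(body string) Commit {
	// Header lines and message are separated by the first blank line.
	headers, message, _ := strings.Cut(body, "\n\n")
	var c Commit
	for _, line := range strings.Split(headers, "\n") {
		if strings.HasPrefix(line, "parent ") && c.Parent == "" {
			c.Parent = strings.TrimPrefix(line, "parent ")
		}
	}
	c.Message = strings.TrimSpace(message)
	return c
}

// history follows parent links from start and returns the IDs visited, newest first.
func history(load func(id string) (string, bool), start string) []string {
	var ids []string
	for id := start; id != ""; {
		body, ok := load(id)
		if !ok {
			break // missing object: stop instead of looping forever
		}
		ids = append(ids, id)
		id = parseCommit(body).Parent
	}
	return ids
}

func main() {
	// Shortened IDs and trees, for illustration only.
	objects := map[string]string{
		"fe25": "tree 1403\nparent b05c\n\nthird commit\n",
		"b05c": "tree ff99\nparent 893f\n\nsecond commit\n",
		"893f": "tree dee1\n\nfirst commit\n",
	}
	load := func(id string) (string, bool) { b, ok := objects[id]; return b, ok }
	for _, id := range history(load, "fe25") {
		body, _ := load(id)
		fmt.Printf("commit %s\n\n    %s\n\n", id, parseCommit(body).Message)
	}
}
```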

Here is my implementation: commit-tree, log

5. references

In the previous four chapters we covered a lot of low-level git commands. Starting with this chapter we explain git's everyday features, and with that groundwork the rest should go smoothly.

Although our commit object can already record version information completely, it has a fatal disadvantage: we have to locate a version through a long SHA1 hash value. Imagine saying to a colleague during development:

Hey! Can you review the code of 32h52342 for me?

He will definitely reply: Huh... which version is that? (+_+)?

So we need a way to name our commit objects, for example master.

Let's try it in git and name our latest commit object master:

$ git update-ref refs/heads/master fe25

Then view the commit history under the new name:

$ git log master
commit fe2544fb26a26f0412ce32f7418515a66b31b22d (HEAD -> master)
Author: liuyj24 <[email protected]>
Date:   Sat Dec 26 19:36:31 2020 +0800

    third commit

commit b05c65b6fdd7e13a51aaf1abb8ff3e795835bfb0
Author: liuyj24 <[email protected]>
Date:   Sat Dec 26 19:34:25 2020 +0800

    second commit

commit 893fba19d63b401ae458c1fc140f1a48c23e4873
Author: liuyj24 <[email protected]>
Date:   Sat Dec 26 19:18:04 2020 +0800

    first commit

Nice (→_→). Why don't we give this feature a cool name: let's call it a branch!

At this point you may be thinking about how we usually commit on the master branch with git commit -m. The principle behind it now seems clear:

  1. First, write the records in the staging area to a tree object with write-tree and obtain the SHA1 value of the tree object.
  2. Then create a new commit object with commit-tree.

The question is: the commit-tree command has the SHA1 value of the tree object and the -m commit message at hand, but how do we obtain the SHA1 value of the parent commit for -p?

This is where the HEAD reference comes in! You will find a HEAD file in the .git directory; let's check its contents:

$ ls
config  description  HEAD  hooks/  index  info/  logs/  objects/  refs/

$ cat HEAD
ref: refs/heads/master

So when we commit, git reads the current reference from the HEAD file and uses the SHA1 value of the commit it points to as the parent of the new commit object, which links the whole commit history together!
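Reading the parent through HEAD takes two file reads, as the sketch below shows. resolveHead and the demo helper writeRepoFile are names invented for this example, and packed refs, which real Git also supports, are ignored.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// resolveHead returns the commit ID that HEAD currently points to.
// An empty string means the current branch has no commit yet.
func resolveHead(gitDir string) (string, error) {
	data, err := os.ReadFile(filepath.Join(gitDir, "HEAD"))
	if err != nil {
		return "", err
	}
	head := strings.TrimSpace(string(data))
	const prefix = "ref: "
	if !strings.HasPrefix(head, prefix) {
		return head, nil // HEAD holds a commit ID directly
	}
	refPath := filepath.Join(gitDir, filepath.FromSlash(strings.TrimPrefix(head, prefix)))
	data, err = os.ReadFile(refPath)
	if os.IsNotExist(err) {
		return "", nil // first commit on this branch: no parent
	}
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(data)), nil
}

// writeRepoFile is a demo helper that writes content to gitDir/name, creating parent directories.
func writeRepoFile(gitDir, name, content string) error {
	p := filepath.Join(gitDir, filepath.FromSlash(name))
	if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
		return err
	}
	return os.WriteFile(p, []byte(content), 0o644)
}

func main() {
	gitDir, err := os.MkdirTemp("", "jun-head")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(gitDir)
	writeRepoFile(gitDir, "HEAD", "ref: refs/heads/master\n")
	writeRepoFile(gitDir, "refs/heads/master", "fe2544fb26a26f0412ce32f7418515a66b31b22d\n")
	id, err := resolveHead(gitDir)
	fmt.Println(id, err)
}
```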

Seeing this, do you start to get a feel for how git branch creates a branch and how git checkout switches branches?

We now have three commit objects. Let's create a branch at the second commit object, again using low-level commands: git update-ref creates a reference to the second commit:

$ git update-ref refs/heads/bugfix b05c

$ git log bugfix
commit b05c65b6fdd7e13a51aaf1abb8ff3e795835bfb0 (bugfix)
Author: liuyj24 <[email protected]>
Date:   Sat Dec 26 19:34:25 2020 +0800

    second commit

commit 893fba19d63b401ae458c1fc140f1a48c23e4873
Author: liuyj24 <[email protected]>
Date:   Sat Dec 26 19:18:04 2020 +0800

    first commit

Then we change the branch we are currently on, that is, modify the contents of the .git/HEAD file, using the git symbolic-ref command:

$ git symbolic-ref HEAD refs/heads/bugfix

We check the log again through the log command. If no parameters are added, the default is to check the current branch:

$ git log
commit b05c65b6fdd7e13a51aaf1abb8ff3e795835bfb0 (HEAD -> bugfix)
Author: liuyj24 <[email protected]>
Date:   Sat Dec 26 19:34:25 2020 +0800

    second commit

commit 893fba19d63b401ae458c1fc140f1a48c23e4873
Author: liuyj24 <[email protected]>
Date:   Sat Dec 26 19:18:04 2020 +0800

    first commit

The current branch is switched to bugfix!

Let's stop here and implement this part; it is basically simple file reading and writing.

  1. update-ref

Write the hash value of the commit object to the specified file under .git/refs/heads. Since the earlier log command was not complete, we need to refactor it here to support looking up ref names.

  2. symbolic-ref

This command modifies refs; we implement it simply, only modifying the HEAD file.

  3. commit

With the foundation laid by the two commands above, we can implement the commit command. To repeat the process: first, write the records in the staging area to a tree object with write-tree and obtain the SHA1 value of the tree object. Then create a new commit object with commit-tree; the parent of the new commit object is obtained from the HEAD file. Finally, update the commit recorded for the corresponding branch.
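The two ref-writing commands come down to the following file operations. updateRef, symbolicRef, and the demo helper readRepoFile are names invented for this sketch, which skips the validation and locking that real Git performs.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// updateRef stores a commit ID in a ref file such as refs/heads/master.
func updateRef(gitDir, ref, commitID string) error {
	p := filepath.Join(gitDir, filepath.FromSlash(ref))
	if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
		return err
	}
	return os.WriteFile(p, []byte(commitID+"\n"), 0o644)
}

// symbolicRef points HEAD at a ref, which is how the current branch is switched.
func symbolicRef(gitDir, ref string) error {
	if err := os.MkdirAll(gitDir, 0o755); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(gitDir, "HEAD"), []byte("ref: "+ref+"\n"), 0o644)
}

// readRepoFile is a demo helper returning the content of gitDir/name as a string.
func readRepoFile(gitDir, name string) (string, error) {
	data, err := os.ReadFile(filepath.Join(gitDir, filepath.FromSlash(name)))
	return string(data), err
}

func main() {
	gitDir, err := os.MkdirTemp("", "jun-refs")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(gitDir)

	// Equivalent of: update-ref refs/heads/bugfix b05c..., then symbolic-ref HEAD refs/heads/bugfix
	if err := updateRef(gitDir, "refs/heads/bugfix", "b05c65b6fdd7e13a51aaf1abb8ff3e795835bfb0"); err != nil {
		panic(err)
	}
	if err := symbolicRef(gitDir, "refs/heads/bugfix"); err != nil {
		panic(err)
	}
	head, _ := readRepoFile(gitDir, "HEAD")
	fmt.Print(head)
}
```

A commit command would then chain the pieces: write the tree, read the parent through HEAD, create the commit object, and call updateRef on the branch that HEAD names.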

This is my implementation: update-ref, symbolic-ref, commit

At this point, commands such as checkout and branch probably hold little mystery for you. At its core, checkout wraps symbolic-ref, and branch wraps update-ref.

To make the commands more flexible, git gives them many optional parameters, but underneath they all call these low-level commands. With the low-level commands in hand, you will find other extended features easy to implement, so I won't expand on them here (ง •_•)ง.

6. tag

With the features above in place, you probably understand git much better, but you may have noticed a small problem:

Once we have branches, we manage versions on them. But every new commit moves the branch to point at the new commit object, so a branch keeps changing. We always have some important versions that must be recorded, and for that we need something that does not move.

And because a bare SHA1 value is not a pleasant way to refer to a commit, we give these important versions a name and store them as tags. When implementing references you probably noticed that .git/refs/ contains a tags directory in addition to heads. The principle is the same as for references: a tag also records the hash value of a commit object. Let's try it with git and tag the first commit object of the current branch:

$ git log
commit b05c65b6fdd7e13a51aaf1abb8ff3e795835bfb0 (HEAD -> bugfix)
Author: liuyj24 <[email protected]>
Date:   Sat Dec 26 19:34:25 2020 +0800

    second commit

commit 893fba19d63b401ae458c1fc140f1a48c23e4873
Author: liuyj24 <[email protected]>
Date:   Sat Dec 26 19:18:04 2020 +0800

    first commit

$ git tag v1.0 893f

Then check this tag

$ git show v1.0
commit 893fba19d63b401ae458c1fc140f1a48c23e4873 (tag: v1.0)
Author: liuyj24 <[email protected]>
Date:   Sat Dec 26 19:18:04 2020 +0800

    first commit

······

In this way, we can locate a certain version through the v1.0 tag.

I won't implement this one, hehe (→_→).

7. more

I wrote this article while reading the official documentation. The overall outline of git is actually very clear. Because git itself is already good enough, we don't need to rewrite it. The point of reinventing a small wheel in this article is to learn the core idea of git: how to build an object database for version management.

We can also think ahead about other features of git (on paper only (→_→)):

  1. add command: essentially a wrapper around our update-index command. We usually run git add . to add all modified files to the staging area. To implement this, traverse the directory recursively and, with the help of a diff tool, run update-index on each modified file.
  2. merge command: I think this one is harder to implement. My current idea: recursively, with the help of a diff tool, add what is extra in the project being merged to the project being merged into. If the diff indicates a conflict, let the user resolve it.
  3. rebase command: it essentially changes the order of commit objects, which comes down to changing their parent values (since commit objects are immutable, this means creating new commit objects with different parents). It is like inserting a node, or another linked list, into the middle of a linked list: you adjust the links.
  4. ······

Beyond these, git also has the concept of remote repositories. A remote repository is essentially the same as a local one, but it brings many synchronization and collaboration issues. Continuing to learn other features of git feels easier now, and I feel more confident!

Finally, some reviews about my mini git

To wrap up, here is a summary of the parts I implemented and what could be improved compared with the real open-source code:

  1. No path resolution function is implemented. Git can work from any directory inside the repository, but mine only works from the repository root. I should implement a function that finds the .git directory of the current repository, so that the whole system has a single entry point when resolving file paths.
  2. The abstraction of object is incomplete. The mini project only adds versions to the object database and cannot restore a version from it. To restore versions, each kind of object needs a corresponding deserialization method, that is, objects should implement an interface like this:
type obj interface {
	serialize(Object) []byte
	deserialize([]byte) Object
}
  3. The directory separator problem: because I develop and test with git bash on Windows, all separators are hard-coded as /, which is not good.
  4. At present you can keep committing indefinitely. A commit should first check whether the staging area has changed; if it has not, the commit should be refused.
  5. The handling of command-line arguments is a bit ugly, and I haven't found a good way yet. ······
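Two of the points above, locating the repository from a subdirectory and avoiding hard-coded separators, could be approached as in this sketch. findGitDir, indexPath, and the demo helper makeDirs are names invented here; this is one possible approach, not something the mini project already contains.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// findGitDir walks from start up to the filesystem root and returns
// the absolute path of the first .git directory it finds.
func findGitDir(start string) (string, error) {
	dir, err := filepath.Abs(start)
	if err != nil {
		return "", err
	}
	for {
		candidate := filepath.Join(dir, ".git")
		if info, err := os.Stat(candidate); err == nil && info.IsDir() {
			return candidate, nil
		}
		parent := filepath.Dir(dir)
		if parent == dir { // reached the root without finding a repository
			return "", errors.New("not inside a repository")
		}
		dir = parent
	}
}

// indexPath converts an OS-specific relative path to the slash-separated
// form, so that stored paths look the same on every platform.
func indexPath(rel string) string {
	return filepath.ToSlash(rel)
}

// makeDirs is a demo helper that creates a directory and its parents.
func makeDirs(path string) error {
	return os.MkdirAll(path, 0o755)
}

func main() {
	root, err := os.MkdirTemp("", "jun-find")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)
	makeDirs(filepath.Join(root, ".git"))
	makeDirs(filepath.Join(root, "dir", "sub"))

	gitDir, err := findGitDir(filepath.Join(root, "dir", "sub"))
	fmt.Println(gitDir, err)
	fmt.Println(indexPath(filepath.Join("dir", "dirFile.txt"))) // dir/dirFile.txt
}
```

Using filepath.Join for filesystem access and converting to slashes only for stored paths keeps the on-disk records portable while letting the operating system choose its own separator.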

8. end

At last!

Thanks for reading this far, and if it helped, please give me a thumbs up!

Origin blog.csdn.net/linuxguitu/article/details/111884001