Git: A code commit that changed the world

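Abstract: If you had to pick the greatest code commit in the history of the Linux community, it would have to be the first commit of the Git project itself.
"My poem is finished. Neither the wrath of the gods nor the collapse of mountains and the splitting of the earth can reduce it to nothing!"
-- Ovid, Metamorphoses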

Background

As the largest and most successful open source project, Linux has attracted contributions from programmers all over the world. So far, more than 20,000 developers have submitted code to the Linux kernel.

Surprisingly, for the first ten years of the project (1991 to 2002), Linus, as project maintainer, used no configuration management tool at all and merged everyone's code by hand from patches. This was not because Linus enjoyed manual work, but because he was very picky about software configuration management (SCM) tools: whether commercial ClearCase or open source CVS, SVN and the like, none of them met his standards.

In his view, a version control system fit for Linux kernel development had to meet several conditions: 1) fast; 2) support for heavy branching (thousands of branches developed in parallel); 3) distributed; 4) able to handle large projects. It was not until 2002 that Linus found a tool that basically met these requirements: BitKeeper. BitKeeper was a commercial product. Its vendor was willing to let the Linux community use it free of charge, on condition that users did not reverse-engineer it. The default interface BitKeeper provided could not meet every need of the community, however. A community developer reverse-engineered BitKeeper to use its undisclosed interface, which led the vendor to withdraw the free license. As a last resort, Linus spent a ten-day break implementing a DVCS of his own, Git, and rolled it out to the community's developers.

Design

Git has become standard equipment for software developers worldwide. Its introduction and usage need no retelling, so today I want to talk about its internal implementation. Before reading on, though, let me ask you a question: if you were to design Git (or redesign it), how would you go about it? Which features would you implement in the first version? After reading this article, compare it with your own ideas. You are welcome to leave a comment to discuss.

The best way to learn Git's internals is to read Linus's initial commit: check out the first commit of the git project (see the blog post "Tips for reading open source code"). You will find only a handful of files in the code base: a README, a Makefile build script, and a few C source files. The commit message is also rather special: Initial revision of "git", the information manager from hell.

commit e83c5163316f89bfbde7d9ab23ca2e25604af290
Author: Linus Torvalds <[email protected]>
Date:   Thu Apr 7 15:13:13 2005 -0700

    Initial revision of "git", the information manager from hell

In the README, Linus describes Git's design ideas in detail. For all that Git seems to do, Linus's design rests on only two abstractions: 1) the "object database"; 2) the "current directory cache".

At its core, Git is a collection of file objects. Code files are objects, directory trees are objects, and commits are objects too. Each object is named by the SHA1 hash of its content; a SHA1 value is 160 bits, written as 40 hexadecimal digits. Linus uses the first two digits as a directory name and the remaining 38 as the file name. Look inside the objects directory under .git and you will see many directories with two-character names, each holding files with 38-character hash names. That is all of Git's data.
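To make the naming scheme concrete, here is a minimal sketch in Python (not Git's own C code). The helper name `object_path` is hypothetical, and the `.git/objects` default follows present-day Git; the first commit used `.dircache/objects` instead.

```python
import hashlib
import posixpath


def object_path(sha1_hex: str, root: str = ".git/objects") -> str:
    """Map a 40-digit hex object name to its two-level path (hypothetical helper)."""
    if len(sha1_hex) != 40:
        raise ValueError("a SHA1 name is 40 hex digits (20 bytes, 160 bits)")
    # First two digits name the directory, the remaining 38 name the file.
    return posixpath.join(root, sha1_hex[:2], sha1_hex[2:])


name = hashlib.sha1(b"any object content").hexdigest()
print(len(name), object_path(name))
```

Splitting on the first two digits keeps any single directory from accumulating every object in the repository.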

Linus defines the object layout as <tag in ASCII> (blob/tree/commit) + <space> + <length in ASCII> + <\0> + <binary content>. After decompressing an object file from the objects directory with zlib, you can inspect it with the xxd command. For example, the content of a tree object file looks like this:

00000000: 7472 6565 2033 3700 3130 3036 3434 2068  tree 37.100644 h
00000010: 656c 6c6f 2e74 7874 0027 0c61 1ee7 2c56  ello.txt.'.a..,V
00000020: 7bc1 b2ab ec4c bc34 5bab 9f15 ba         {....L.4[....
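The dump can be decoded by hand, and a short Python sketch makes the layout explicit. The bytes below are transcribed from the dump above; the function names `parse_object` and `parse_tree` are hypothetical and are not taken from Git's source.

```python
import binascii

# Bytes transcribed from the xxd dump above (already decompressed).
RAW = (
    b"tree 37\x00"
    b"100644 hello.txt\x00"
    + binascii.unhexlify("270c611ee72c567bc1b2abec4cbc345bab9f15ba")
)


def parse_object(raw: bytes):
    """Split '<type> <length>\\0<body>' into (type, body), checking the length."""
    header, body = raw.split(b"\x00", 1)
    kind, length = header.split(b" ")
    if int(length) != len(body):
        raise ValueError("length field does not match body size")
    return kind.decode(), body


def parse_tree(body: bytes):
    """Return a list of (mode, name, sha1_hex), one per '<mode> <name>\\0<20 bytes>' entry."""
    entries = []
    pos = 0
    while pos < len(body):
        nul = body.index(b"\x00", pos)
        mode, name = body[pos:nul].split(b" ", 1)
        sha = body[nul + 1 : nul + 21]      # raw 20-byte SHA1, not hex text
        entries.append((mode.decode(), name.decode(), sha.hex()))
        pos = nul + 21
    return entries


kind, body = parse_object(RAW)
print(kind, parse_tree(body))
```

The length 37 in the header is the 17 bytes of "100644 hello.txt" plus its terminating \0, followed by the 20-byte binary SHA1.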

There are three types of objects: BLOB, TREE, and CHANGESET.

BLOB: Binary object. This is a file as stored by Git. Unlike some VCSs (such as SVN), Git does not store deltas; it stores the complete content of every version of a file. For example, when you commit hello.c to a Git repository, a BLOB is created that records the full contents of hello.c; after you modify hello.c and commit again, a new BLOB is created that records the full modified contents. In Linus's design, a BLOB holds only the file's content, with no metadata such as the file name or file attributes. That information is recorded in the second kind of object, the TREE.

TREE: Directory tree object. In Linus's design, a TREE object is a snapshot of the directory tree at a point in time. It contains file names, file attributes and the SHA1 values of BLOB objects, but no history. The advantage of this design is that the TREE objects of two revisions can be compared quickly: without reading any file content, the SHA1 values alone show which files are identical and which differ.

In addition, because file names and attributes are recorded in the TREE, a BLOB can be reused when only a file's attributes or name change, or when it moves to another directory, without any change to its content, which saves storage. As Git evolved, the TREE design was refined into a snapshot of a single directory at a point in time: a TREE contains the SHA1 of the TREE object of each of its subdirectories. This saves storage for repositories with complex or deeply nested directory structures. History is recorded in the third kind of object, the CHANGESET.
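A minimal sketch of why this works, in Python: the blob's name is computed from a small header plus the content, and the file name never enters the hash. The rule shown (SHA1 over "blob <length>\0<content>") is the one used by present-day Git and is assumed here for illustration; the very first revision may have differed in detail. `blob_id` is a hypothetical helper name.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """SHA1 over 'blob <len>\\0<content>' (present-day Git's rule, assumed here)."""
    header = b"blob " + str(len(content)).encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


v1 = blob_id(b"int main(void) { return 0; }\n")
v2 = blob_id(b"int main(void) { return 1; }\n")
print(v1 != v2)        # two versions of hello.c give two separate blobs
print(blob_id(b""))    # name of the empty blob
```

Because only the bytes are hashed, renaming hello.c without editing it produces the same name and therefore no new blob.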
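The quick comparison described above can be sketched as follows, assuming each snapshot has already been read into a plain mapping from path to SHA1. The function `diff_trees` and the shortened placeholder hashes are made up for illustration.

```python
def diff_trees(old: dict, new: dict) -> dict:
    """Compare two snapshots given as {path: sha1_hex}; no file content is read."""
    return {
        "added": sorted(p for p in new if p not in old),
        "removed": sorted(p for p in old if p not in new),
        "modified": sorted(p for p in new if p in old and old[p] != new[p]),
    }


# Placeholder hashes, shortened for readability.
old = {"README": "aaa1", "hello.c": "bbb2"}
new = {"README": "aaa1", "hello.c": "ccc3", "Makefile": "ddd4"}
print(diff_trees(old, new))
```

Equal hashes mean equal content, so README is skipped without ever opening the file.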

Picture taken from Pro Git  1

CHANGESET: Commit object. A CHANGESET object records the SHA1 of the TREE object being committed, along with information such as the committer and the commit message. Unlike other SCM (software configuration management) tools, Git's CHANGESET object records neither file renames and attribute changes nor deltas of file modifications. Instead, a CHANGESET records the SHA1 of its parent CHANGESET object, and the differences are obtained by comparing the TREE of this node with the TREE of the parent.

In designing the parent links, Linus allowed a CHANGESET to have up to 16 parents. Merging more than two parents may look strange, but Git does support multi-head merges of more than two branches.

Linus emphasized trust (TRUST) after explaining the three objects: although trust itself is outside the scope of Git's design, Git can be trusted as a configuration management tool. The reason is that every object is identified by its SHA1 hash (Google's demonstration of a SHA1 collision attack came later, and the Git community is preparing to move to the more robust SHA256), while the process of checking objects in can be vouched for by signing tools such as GPG.
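As a rough sketch, a commit object can be assembled like this in Python. The text layout (tree, parent, author, committer, blank line, message) follows present-day Git and is simplified: real author lines also carry a timestamp. `build_commit` and `MAX_PARENTS` are hypothetical names; only the limit of 16 comes from the description above.

```python
MAX_PARENTS = 16  # limit described in the text above


def build_commit(tree: str, parents: list, author: str, message: str) -> bytes:
    """Assemble 'commit <len>\\0<body>' (simplified; no timestamps)."""
    if len(parents) > MAX_PARENTS:
        raise ValueError("too many parents")
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]   # zero parents for a root commit
    lines += ["author " + author, "committer " + author, "", message, ""]
    body = "\n".join(lines).encode()
    return b"commit " + str(len(body)).encode() + b"\x00" + body


raw = build_commit("0" * 40, ["1" * 40, "2" * 40], "A Hacker <a@example.com>", "merge two heads")
print(raw.split(b"\x00", 1)[1].decode())
```

Note that nothing in the object describes what changed; a diff is derived later by comparing this commit's tree with each parent's tree.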

Once you understand Git's three basic objects, the two abstractions Linus originally designed for Git, the "object database" and the "current directory cache", are easy to follow. Git works with three layers, as shown in the figure below. The first is the working directory, where we view and write code. The second is the Git repository, the object database Linus describes: the content stored in the .git folder, which Linus named .dircache in the first version. Between these two sits the staging area, the information stored in .git/index; when we run git add, we add the current modifications to this cache.

Linus also explained the design of the "current directory cache". The cache is a binary file whose structure resembles a TREE object. The difference is that the index contains no nested index objects: the entire directory tree being modified lives in a single index file. This design has two advantages: 1. the complete content of the cache can be restored quickly, so even if files in the working directory are deleted by accident, they can all be recovered from the cache; 2. files whose content differs between the cache and the working directory can be found quickly.

Picture taken from Things About Git and Github You Need to Know as Developer 2
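The difference between nested TREE objects and the flat cache can be illustrated with a small Python sketch. The nested dictionary stands in for trees that reference subtrees, and `flatten` is a hypothetical helper; the real index is a binary file that also stores stat data for each entry.

```python
def flatten(tree: dict, prefix: str = "") -> dict:
    """Turn nested {name: sha-or-subtree} into a flat {full/path: sha} mapping."""
    flat = {}
    for name, value in sorted(tree.items()):
        path = prefix + name
        if isinstance(value, dict):          # a subdirectory
            flat.update(flatten(value, path + "/"))
        else:                                # a file: value is its blob name
            flat[path] = value
    return flat


# Placeholder hashes, shortened for readability.
nested = {"README": "aaa1", "src": {"main.c": "bbb2", "lib": {"util.c": "ccc3"}}}
print(flatten(nested))
```

With every path in one flat table, a single pass over the table is enough to check the whole working directory.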

Implementation

In Git's first commit, Linus had already implemented the most basic functions, and the result can be compiled and used. The code is extremely concise: including the Makefile, it comes to only 848 lines in total. Interested readers can check out Git's earliest commit as described above, then build and play with it; all you need is a Linux environment.

Because of library version dependencies, the original Makefile needs a few minor changes. The first version of Git depends on two libraries, openssl and zlib, which must be installed manually. On Ubuntu, run: sudo apt install libssl-dev libz-dev; then, in the Makefile's LIBS= -lssl line, change -lssl to -lcrypto and add -lz; finally run make and ignore the compiler warnings. You will find that seven executables have been built: init-db, update-cache, write-tree, commit-tree, cat-file, show-diff and read-tree.

The following briefly introduces the implementation of these executable programs:

(1) init-db: Initializes a local git repository, equivalent to the git init command we type today to create a repository. The only difference is that the folder Linus originally created for the repository and cache was called .dircache, not the .git folder we are familiar with now.

(2) update-cache: Takes one or more file paths and adds the files to the cache. The implementation verifies that the path is legal, computes the SHA1 of the file, prepends the blob header to the file content and writes the result, zlib-compressed, to the object database (.dircache/objects); finally, it records the file path, file attributes and blob SHA1 in the .dircache/index cache file.

(3) write-tree: Generates a TREE object from the cached directory tree information and writes it to the object database. The layout of a TREE object is: 'tree' + length + \0 + file list. Each entry in the file list is stored as file attributes + file name + \0 + SHA1. Once the object is written successfully, the SHA1 of the TREE object is returned.

(4) commit-tree: Generates a commit object from a TREE object and adds it to the version history. It takes the SHA1 of the TREE to be committed and, optionally, the parent commits (up to 16). The commit object contains the TREE, the parents, and the name, email and date of the committer and author. Finally, it writes the new commit object file and returns the SHA1 of the commit.

(5) cat-file: Since all object files are zlib-compressed, this tool is needed to view their contents: it decompresses an object into a temporary file that you can then read.

(6) show-diff: Quickly compares the current cache with the working directory. Because file attributes (including modification time, length and so on) are also stored in the cache, it can quickly tell whether a file has been modified and then show the differences.

(7) read-tree: Takes the SHA1 of a TREE object and prints the contents of that TREE.
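The steps above can be sketched in Python as follows. This is an illustration rather than a port of Linus's C: the index here is an in-memory dictionary instead of a binary file, path validation is omitted, and the hash is taken over the uncompressed header plus content as in present-day Git. The function name `update_cache` simply mirrors the command.

```python
import hashlib
import os
import tempfile
import zlib


def update_cache(root: str, index: dict, path: str) -> str:
    """Store the file at `path` as a blob under root/objects and record it in `index`."""
    with open(path, "rb") as f:
        content = f.read()
    raw = b"blob " + str(len(content)).encode() + b"\x00" + content
    sha = hashlib.sha1(raw).hexdigest()
    obj_dir = os.path.join(root, "objects", sha[:2])
    os.makedirs(obj_dir, exist_ok=True)
    with open(os.path.join(obj_dir, sha[2:]), "wb") as f:
        f.write(zlib.compress(raw))          # objects are stored zlib-compressed
    st = os.stat(path)
    index[path] = {"mode": st.st_mode, "size": st.st_size, "sha1": sha}
    return sha


with tempfile.TemporaryDirectory() as work:
    src = os.path.join(work, "hello.txt")
    with open(src, "wb") as f:
        f.write(b"hello\n")
    index = {}
    sha = update_cache(os.path.join(work, ".dircache"), index, src)
    stored = os.path.join(work, ".dircache", "objects", sha[:2], sha[2:])
    with open(stored, "rb") as f:
        roundtrip = zlib.decompress(f.read())
    print(sha, roundtrip)
```

Decompressing the stored file gives back the header and the original bytes, which is essentially what cat-file does.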
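Here is a minimal Python sketch of that serialization, fed with the hello.txt entry from the xxd dump earlier in this article. Sorting entries by name is a simplification of Git's real ordering rules, and the function returns the bytes instead of writing them to the object database; `write_tree` is named after the command for readability only.

```python
import hashlib


def write_tree(entries) -> tuple:
    """Serialize (mode, name, sha1_hex) entries into a tree object; return (sha1, raw)."""
    body = b"".join(
        mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(sha)
        for mode, name, sha in sorted(entries, key=lambda e: e[1])
    )
    raw = b"tree " + str(len(body)).encode() + b"\x00" + body
    return hashlib.sha1(raw).hexdigest(), raw


sha, raw = write_tree([("100644", "hello.txt", "270c611ee72c567bc1b2abec4cbc345bab9f15ba")])
print(sha, len(raw))
```

The resulting bytes start with "tree 37" followed by \0, matching the header seen in the dump.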
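The idea of using cached attributes as a cheap change detector can be sketched in Python like this. Only size and modification time are compared, and the helper names `snapshot` and `changed` are hypothetical; the real cache stores more stat fields and show-diff goes on to print the actual differences.

```python
import os
import tempfile


def snapshot(path: str) -> tuple:
    """Record the stat fields that this sketch uses as a cheap change detector."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def changed(cache: dict) -> list:
    """Return cached paths whose current stat data no longer matches the cache."""
    out = []
    for path, cached in sorted(cache.items()):
        if not os.path.exists(path) or snapshot(path) != cached:
            out.append(path)
    return out


with tempfile.TemporaryDirectory() as work:
    a, b = os.path.join(work, "a.txt"), os.path.join(work, "b.txt")
    for p in (a, b):
        with open(p, "w") as f:
            f.write("one\n")
    cache = {p: snapshot(p) for p in (a, b)}
    with open(b, "w") as f:
        f.write("changed contents\n")    # different size, so the stat data differs
    result = [os.path.basename(p) for p in changed(cache)]
    print(result)
```

Files whose stat data still matches are never opened, which is what makes the comparison fast on a large tree.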

These are all seven programs in the first usable version of Git. Readers who have used Git may ask: why is this so different from the Git commands I use every day? Where are git add and git commit? Indeed, the original Git design has none of the git commands we normally use.

Git's design distinguishes two kinds of commands: low-level commands (plumbing commands) and high-level commands (porcelain commands). At the beginning, Linus designed these commands, which follow the Unix KISS principle, for the hackers of the open source community. Since hackers are experts who roll up their sleeves and fix the pipes themselves, these commands are called plumbing commands. Junio Hamano, who later took over Git, felt that these commands were not very friendly to ordinary users, so on top of them he wrapped high-level commands that are easier to use and have a more polished interface: the git add and git commit we use every day.

git add wraps the update-cache command, and git commit wraps the write-tree and commit-tree commands. If you are interested in a more detailed introduction to the low-level commands, see the Git Internals chapter in Pro Git.

The code itself will not be covered in detail here. Linus's coding style is extremely terse: he never writes two lines where one will do. His command of the Linux API is, naturally, hard to match. What impressed me most is how often mmap is used to map files into memory, which avoids explicit memory allocation and file read/write calls and improves performance. As a colleague put it: Linus's code does not follow the coding standards, yet there seems to be nothing wrong with it. By the way, Linus indents with tabs (for the backstory, see "Tab or space character, this is a problem").
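For readers who have not used mmap, the following Python sketch shows the general technique using the standard `mmap` module. It is an analogy to the approach described above, not a translation of Linus's C, and `sha1_of_file` is a hypothetical helper.

```python
import hashlib
import mmap
import os
import tempfile


def sha1_of_file(path: str) -> str:
    """Hash a file through a read-only memory mapping instead of read() calls."""
    with open(path, "rb") as f:
        if os.fstat(f.fileno()).st_size == 0:
            return hashlib.sha1(b"").hexdigest()   # empty files cannot be mapped
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return hashlib.sha1(view).hexdigest()


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello\n")
digest = sha1_of_file(tmp.name)
os.unlink(tmp.name)
print(digest == hashlib.sha1(b"hello\n").hexdigest())
```

The mapping lets the hash function read the file's pages directly, without first copying the whole file into a separately allocated buffer.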

Enlightenment

After Linus made the first git commit, he released the tool to the community. A developer in the community named Junio Hamano found it very interesting and downloaded the code. He discovered that it came to only 1244 lines in total, which surprised him and sparked his interest. Junio talked with Linus on the mailing list, helped add features such as merge, and went on polishing git. Eventually Junio took over the maintenance of Git completely, and Linus went back to maintaining the Linux kernel.

If you had to pick the greatest code commit in Git's history, it would have to be the first commit of the Git project itself. That commit was undoubtedly groundbreaking. If the Linux project made open source software a success and rewrote the landscape of the software industry, then Git changed the way developers around the world work and collaborate. Two years after Git was born, three young programmers sat in a bar in San Francisco and decided to build something with Git. A few months later, GitHub went live.

To return to the question posed at the beginning of the article: if I had designed Git, I would probably have extended the design from my experience with existing tools (such as SVN). When I first encountered Git, I even thought of it as SVN plus distribution. Only after understanding Git's internals, and reading its initial code, did I come to admire how elegant the design is. The initial design and implementation of Git offers roughly the following lessons for (open source) software products:

1. Solve a real pain point: Git grew out of the needs of Linus himself and of the Linux community, and those needs generalize to the common needs of collaborative project development (especially across regions). Linus solved his own pain point and achieved something great along the way.

2. Minimal design: When designing Git, Linus did not let himself be bound by traditional SCM tools and their preoccupation with file differences, version comparison and so on. Instead, he abstracted a few basic objects that make Git's design ideas clear.

3. MVP (minimum viable product): Everyone understands the concept, but it is not easy to practice. What features does an MVP of a configuration management tool need? Most people would think of committing code, browsing history, comparing versions, merging branches and so on. Linus broke the problem down and quickly implemented only the underlying basics, so simple that only the hackers of the open source community could use them. But that was enough: the hackers saw its value and kept contributing to it.

4. Release early and iterate quickly: This too comes from the development experience of the Linux kernel. Once Linus had implemented the Git MVP, he announced it on the Linux community mailing list, asked for feedback, and improved it iteratively.

5. Find the right successor: The Cathedral and the Bazaar makes a similar point: "When you lose interest in a program, your last duty to it is to hand it off to a competent successor." Linus, however, handed Git to Junio not because he had lost interest, but because he saw that, once Git's foundations were in place, Junio was better than he was at building richer and more user-friendly features, so he handed Git over with confidence. Finding a more suitable successor for an open source project takes both courage and wisdom.

  1. Pro Git, 10.2 Git Internals - Git Objects: https://git-scm.com/book/en/v2/Git-Internals-Git-Objects 

  2. Things About Git and Github You Need to Know as Developer: https://medium.com/swlh/things-about-git-and-github-you-need-to-know-as-developer-907baa0bed79 

 


Origin blog.csdn.net/devcloud/article/details/108768685