Introduction to the Distributed Version Control System Git

Git is an open source distributed version control system that offers lightweight branching, powerful collaboration capabilities, and version management for projects large and small. This article briefly introduces Git's features, its internal objects, and its branch management, to give a deeper understanding of the tool.


1. Introduction to version control systems

Version control is the management of changes to files such as program code, configuration files, and documentation during software development, and is one of the core ideas of software configuration management. Its main purpose is to track and record every version produced during development, including the modification history, version number, and release date, so that code and documents are easier to manage and developers can collaborate more effectively.

Through version control, developers can roll back to the previous stable version at any time, recover from errors or perform operations such as branch development, so as to better manage changes and risks in the software development process.

1.1 Distributed version control system

Version control is usually carried out with a version control system (VCS), such as Git, Mercurial, or SVN. These systems store code, manage version history, and support branching and merging, helping developers collaborate and manage changes during software development.

1.1.1 Development of Version Control System

The earliest version control systems worked locally, using a simple database to record the differences between successive revisions of files. As the need for collaborative development grew, centralized version control systems (CVCS) emerged: developers connect as clients to a centrally managed server to submit or fetch the latest files. The commonly used SVN works this way.

In a centralized version control system, administrators can easily control the permissions of each developer, and the maintenance cost is low. However, there is an obvious single point of failure in the centralized server under this architecture. If the version control server goes down, the service cannot be provided. If server data is lost and cannot be recovered, the entire project's data and change history are lost with it.

To solve these problems, distributed version control systems (DVCS) emerged. The basic idea is that a client does not just check out the latest snapshot of the files but fully mirrors the code repository, so every client holds a complete local copy of the data. If the central server fails, the data can be restored from any client's mirror. A client can also work with several different remote repositories and set up different collaboration workflows.

1.1.2 Common version control systems

Commonly used version control tools, both distributed and centralized, include the following:

  1. Git: An open source distributed version control system, widely used in software development and project version management. Git excels at handling large projects and team collaboration, and is the tool of choice for many businesses and development teams.
  2. Mercurial: Another popular distributed version control system, similar to Git. It also supports collaborative development, branch management, version rollback and other functions, and has good cross-platform support.
  3. SVN: Subversion (SVN) is a centralized version control system that is still widely used in some teams and projects. SVN provides an easy-to-use interface and powerful version management functions, especially for small teams and individual project management.
  4. Perforce: A high-performance centralized version control system, suitable for version management of large teams and large projects. Perforce provides a wealth of features, including version control, workspace management, branching and merging, and supports multiple languages and platforms.
  5. Bazaar: An open source distributed version control system that provides functionality similar to Git and Mercurial. Bazaar emphasizes ease of use and scalability, and supports multiple platforms and development environments, including Python, C++, Java, etc.

The following mainly introduces the function and use of the Git version control tool.

1.2 Introduction to the features of the Git version control system
1.2.1 Record snapshots directly

Git differs from other VCSs in how it treats data. Other systems, such as Mercurial and SVN, store information as a list of file-based changes: a set of base files plus the changes made to each file over time. This is often called delta-based (diff-based) version control.

Git instead stores snapshots of the file system: each version records the complete state of the project's files at that moment. Every time you commit or save the state of the project, Git essentially takes a snapshot of all the files and stores a reference to that snapshot. For efficiency, if a file has not changed, Git does not store it again but keeps a link to the identical file it has already stored.
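The idea of snapshots that share unchanged files can be sketched in a few lines of Python. This is only a toy model under stated assumptions: store_snapshot and the in-memory dictionary are hypothetical names, and the key is a plain SHA-1 of the content rather than Git's real object format.

```python
import hashlib

def store_snapshot(object_store, files):
    """Store each file's content under its hash and return the snapshot.

    object_store: dict mapping content hash -> bytes (hypothetical in-memory store)
    files:        dict mapping file name -> bytes
    """
    snapshot = {}
    for name, content in files.items():
        key = hashlib.sha1(content).hexdigest()
        # Unchanged content hashes to an existing key, so nothing new is stored.
        object_store.setdefault(key, content)
        snapshot[name] = key
    return snapshot

store = {}
v1 = store_snapshot(store, {"a.txt": b"alpha\n", "b.txt": b"beta\n"})
v2 = store_snapshot(store, {"a.txt": b"alpha\n", "b.txt": b"beta v2\n"})

print(v1["a.txt"] == v2["a.txt"])  # True: both snapshots link to the same stored object
print(len(store))                  # 3: four file entries, but only three stored objects
```

Each snapshot is a complete name-to-content mapping, yet the unchanged file costs no extra storage in the second version.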

1.2.2 Executing operations locally

Most operations in Git need only local files and resources. Because the complete history of the project is on the local disk, most operations do not have to contact a remote server, and queries and commits finish almost instantly. For example, to browse the project history, Git reads the local database directly; to compare the current version with an older one, it reads the older files locally and computes the difference. Git therefore also works normally offline, which is very convenient.

1.2.3 Integrity Guarantee

Git computes a checksum of all data before storing it and then refers to the data by that checksum, so it can detect when information is lost or a file is corrupted in transit. The checksum is a SHA-1 hash: a string of 40 hexadecimal characters calculated from the contents of a file or directory structure in Git.
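As a rough illustration of how a checksum exposes corruption, the following Python sketch hashes some content, then hashes an altered copy and compares the two. The checksum helper and the sample data are assumptions made for this example; Git's real object names also cover a small header, which is ignored here.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Return the SHA-1 digest of data as a hexadecimal string."""
    return hashlib.sha1(data).hexdigest()

original = b"version 01\n"
key = checksum(original)          # recorded when the data is stored

corrupted = b"version 02\n"       # simulates damage during transfer

print(len(key))                    # 40 hexadecimal characters
print(checksum(original) == key)   # True: content is intact
print(checksum(corrupted) == key)  # False: the change is detected
```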

1.2.4 Only add data operations

Most Git operations only add data to the Git database, and it is hard to make Git do anything that cannot be undone. Changes that have not been committed can still be lost, but once a snapshot has been committed it is very difficult to lose, because every clone keeps a complete local copy of the committed history.

1.3 Git local installation and deployment

1) Install Git on a CentOS system using the following commands:

# yum install git -y
# git --version
git version 1.8.3.1

2) Configure the user name and email address

# git config --global user.name "Tango"
# git config --global user.email tango@com
# git config --list
user.name=Tango
user.email=tango@com
core.repositoryformatversion=0
core.filemode=true
core.bare=false
core.logallrefupdates=true

2. Analysis of Git tool principle

2.1 Three states in Git
2.1.1 Three states of Git files

A file in Git can be in one of three states: modified, staged, or committed.

  • Modified: Indicates that the file has been modified but has not been saved to the database.
  • Staged: Indicates that the current version of a modified file is marked for inclusion in the next committed snapshot.
  • Committed: Indicates that the data has been safely saved in the local database.
2.1.2 Three areas of a Git project

Based on these three states, a Git project is divided into three areas: the workspace (working directory), the staging area (also translated as the temporary storage area), and the Git directory.

  • The workspace is a single checkout of one version of the project. The files are extracted from the compressed database in the Git directory and placed on disk for you to use or modify.
  • The staging area is a file, usually located in the Git directory, that records what will go into the next commit.
  • The Git directory is where Git keeps the project's metadata and object database.
2.1.3 Git basic workflow

The basic workflow of Git is as follows:

  • Modify files in the workspace.
  • Selectively stage the changes you want in the next commit, so that only those changes are added to the staging area.
  • Commit, which takes the files as they are in the staging area and stores that snapshot permanently in the Git directory.

If a particular version of a file is in the Git directory, it is in the "committed" state. If it has been modified and added to the staging area, it is "staged". If it has been changed since it was checked out but has not been staged, it is "modified".
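The three states can be expressed as a small decision function. The sketch below is a simplification: file_state is a hypothetical helper, its arguments stand for content hashes of one file in the three areas, and it reports a single state even though in real Git a file can be staged and modified at the same time.

```python
def file_state(committed, staged, working):
    """Classify one file by comparing its content hash in the three areas.

    committed: hash recorded in the last commit (Git directory)
    staged:    hash recorded in the staging area
    working:   hash of the file currently in the workspace
    """
    if working != staged:
        return "modified"   # changed on disk but not yet staged
    if staged != committed:
        return "staged"     # staged for the next commit
    return "committed"      # identical in all three areas

print(file_state("aaa", "aaa", "bbb"))  # modified
print(file_state("aaa", "bbb", "bbb"))  # staged
print(file_state("aaa", "aaa", "aaa"))  # committed
```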

2.2 Objects in Git
2.2.1 Git directory structure

When executing git init in a new directory or an existing directory, Git will create a .git directory. The typical structure of the .git directory is as follows:

drwxr-xr-x. 2 root root   6 Jul  1 19:08 branches
-rw-r--r--. 1 root root  92 Jul  1 19:08 config
-rw-r--r--. 1 root root  73 Jul  1 19:08 description
-rw-r--r--. 1 root root  23 Jul  1 19:08 HEAD
drwxr-xr-x. 2 root root 242 Jul  1 19:08 hooks
drwxr-xr-x. 2 root root  21 Jul  1 19:08 info
drwxr-xr-x. 5 root root  40 Jul  1 19:13 objects
drwxr-xr-x. 4 root root  31 Jul  1 19:08 refs

Four entries are particularly important and form the core of Git:

  • HEAD file: points to the currently checked-out branch
  • index file: stores the staging area information (it is created when content is first staged, so it does not appear in the listing above)
  • objects directory: stores all data content
  • refs directory: stores pointers to commit objects (branches, remote-tracking branches, tags, etc.)

Of the remaining entries, the description file is used only by the GitWeb program; the config file contains project-specific configuration options; the info directory contains a global exclude file for ignore patterns that you do not want to record in a .gitignore file; and the hooks directory contains client-side or server-side hook scripts.

2.2.2 Data Objects

The core part of Git is a simple key-value pair database. Inserting any type of content into the Git warehouse will return a unique key value, through which the content can be retrieved again at any time.

1) Initialize the Git repository

# git init tango
Initialized empty Git repository in /usr/local/git/tango/.git/
# cd tango
# find .git/objects/
.git/objects/
.git/objects/pack
.git/objects/info

You can see that Git has initialized the objects directory and created the pack and info subdirectories, but no objects are stored yet.

2) Use git hash-object to create a new data object and manually store it in the Git database

# echo 'test tango' | git hash-object -w --stdin
ef9624cb2770e61445fdba38dacd4af9c9240c19

The above command stores the data in the Git repository and returns a 40-character hash value, which is also a unique key value

3) View the data stored in Git

# find .git/objects -type f
.git/objects/ef/9624cb2770e61445fdba38dacd4af9c9240c19

Each piece of content is stored as one file under the objects directory. The file name is the SHA-1 hash of the content plus a header: the first two characters name the subdirectory, and the remaining 38 characters name the file. The saved content can be displayed with the git cat-file command:

# git cat-file -p ef9624cb2770e61445fdba38dacd4af9c9240c19
test tango

An object stored this way is called a data object (blob object).
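To make the naming rule concrete, the following Python sketch recomputes a blob's key outside Git. The header layout ("blob", a space, the content length in bytes, and a null byte) is Git's documented blob format; blob_hash and object_path are hypothetical helper names. The printed key can be compared with the one returned by git hash-object for the same input.

```python
import hashlib

def blob_hash(content: bytes) -> str:
    """Return the object name Git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()

def object_path(sha: str) -> str:
    """Loose objects live under .git/objects/<first 2 chars>/<remaining 38 chars>."""
    return ".git/objects/" + sha[:2] + "/" + sha[2:]

sha = blob_hash(b"test tango\n")  # echo appends the trailing newline
print(sha)
print(object_path(sha))
```

Because the key depends only on the content, storing the same content twice always yields the same object.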

2.2.3 Tree Objects

Tree objects solve the problem of preserving file names and allow multiple files to be organized together. Git stores content in a way similar to a UNIX file system, but simplified: all content is stored as tree objects and data objects, where a tree object corresponds to a UNIX directory entry and a data object roughly corresponds to an inode or file content. A tree object contains one or more entries (tree entries), each holding a SHA-1 pointer to a data object or subtree, together with the corresponding mode, type, and file name.

1) Use the git update-index command to stage a single file, such as test01, which creates the staging area (index)

# echo "version 01" > test01
# git hash-object -w test01
df1c656f9be822fcb40701da71a32b10dc56cc50
# git update-index --add --cacheinfo 100644 \
> df1c656f9be822fcb40701da71a32b10dc56cc50 test01

2) Write the content of the staging area into a tree object with the git write-tree command

# git write-tree
f2e68037cac9dcebccfdf85ad36dff4c04b0f19d
# git cat-file -p f2e68037cac9dcebccfdf85ad36dff4c04b0f19d
100644 blob df1c656f9be822fcb40701da71a32b10dc56cc50    test01
# git cat-file -t f2e68037cac9dcebccfdf85ad36dff4c04b0f19d
tree

3) Create a new tree object

# echo "new version" > test02
# git update-index --add test02
# git write-tree
f94cde1983fe96dec7cb0e83d09f66f63289d765
# git cat-file -p f94cde1983fe96dec7cb0e83d09f66f63289d765
100644 blob df1c656f9be822fcb40701da71a32b10dc56cc50    test01
100644 blob 2777791d6de7aac03f38b1b8403ad0fae82e28ac    test02

4) The git read-tree command reads a tree object into the staging area; with the --prefix option, an existing tree object is read in as a subtree

# git read-tree --prefix=bak f2e68037cac9dcebccfdf85ad36dff4c04b0f19d
# git write-tree
79fdfbd5a6a32829908c40babd8781a1c143784e
# git cat-file -p 79fdfbd5a6a32829908c40babd8781a1c143784e
040000 tree f2e68037cac9dcebccfdf85ad36dff4c04b0f19d    bak
100644 blob df1c656f9be822fcb40701da71a32b10dc56cc50    test01
100644 blob 2777791d6de7aac03f38b1b8403ad0fae82e28ac    test02

If a working directory were checked out from this new tree, its root would contain the two files test01 and test02 plus a subdirectory called bak holding the copy of test01 captured by the first tree.
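A tree object's key can be recomputed in the same way as a blob's. The sketch below assumes Git's raw tree layout: each entry is the mode, a space, the name, a null byte, and the 20 raw bytes of the entry's SHA-1, all preceded by a "tree <length>" header. tree_hash is a hypothetical helper, and sorting by plain name is a simplification that is only guaranteed for flat trees of files.

```python
import hashlib

def tree_hash(entries):
    """Return the object name of a tree built from (mode, name, sha_hex) tuples."""
    body = b""
    # Simplification: entries are sorted by name bytes. Real Git compares
    # directory names as if they ended in "/", and the raw mode of a
    # subtree is "40000" (shown as 040000 by git cat-file -p).
    for mode, name, sha_hex in sorted(entries, key=lambda e: e[1].encode()):
        body += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(sha_hex)
    header = b"tree " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

# The single entry written by the first git write-tree call above.
print(tree_hash([("100644", "test01", "df1c656f9be822fcb40701da71a32b10dc56cc50")]))
```

The printed value can be compared with the key that git write-tree returned for the tree containing only test01.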

2.2.4 Commit objects

A commit object stores the information about a snapshot: who saved it, when, and why. Create one with the git commit-tree command, passing the hash of a tree object and, if there is one, the parent commit object.

# echo 'first commit' | git commit-tree f2e68
411366ee5bd910edf128778fbf991fb5a2209959

View the new commit object with the git cat-file command

# git cat-file -p 411366
tree f2e68037cac9dcebccfdf85ad36dff4c04b0f19d
author Tango <tango@com> 1688222097 +0800
committer Tango <tango@com> 1688222097 +0800

first commit

The format of a commit object is simple: it specifies the top-level tree object for the project snapshot at that point; the parent commits, if any (the commit object above has none); the author and committer information; a blank line; and then the commit message.
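The layout just described is regular enough to parse by hand. The following Python sketch splits the text printed by git cat-file -p into its fields; parse_commit is a hypothetical helper written for this article, and it ignores any header lines other than tree, parent, author, and committer.

```python
def parse_commit(text):
    """Split the printed form of a commit object into its fields."""
    head, _, message = text.partition("\n\n")   # headers end at the first blank line
    commit = {"tree": None, "parents": [], "author": None,
              "committer": None, "message": message.strip()}
    for line in head.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)     # a merge commit has several parents
        elif key in ("tree", "author", "committer"):
            commit[key] = value
    return commit

example = (
    "tree f2e68037cac9dcebccfdf85ad36dff4c04b0f19d\n"
    "author Tango <tango@com> 1688222097 +0800\n"
    "committer Tango <tango@com> 1688222097 +0800\n"
    "\n"
    "first commit\n"
)
info = parse_commit(example)
print(info["tree"])     # f2e68037cac9dcebccfdf85ad36dff4c04b0f19d
print(info["parents"])  # []: the first commit has no parent
print(info["message"])  # first commit
```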

2.2 Git branches

1) How Git stores data when committing

When you commit, Git stores a commit object that contains a pointer to the snapshot of the staged content, the author information, the message entered at commit time, and a pointer to its parent commit (or several pointers, for a commit that results from a merge).

If you modify files and commit again, the new commit object contains a pointer to the commit that came immediately before it (its parent).
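Because every commit points to its parent, the history of a project can be recovered by following those pointers backwards. A minimal Python sketch, using made-up commit ids and a hypothetical history helper, with linear history only:

```python
def history(commits, start):
    """Follow parent pointers from start back to the first commit.

    commits: dict mapping commit id -> parent id (None for the first commit)
    """
    chain = []
    current = start
    while current is not None:
        chain.append(current)
        current = commits[current]
    return chain

commits = {"c1": None, "c2": "c1", "c3": "c2"}  # hypothetical ids, oldest first
print(history(commits, "c3"))  # ['c3', 'c2', 'c1']
```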

2) Git branch

A Git branch is essentially a movable pointer to a commit object. Git's default branch is named master. As you make commits, master points to the last commit you made, and it moves forward automatically with every commit.

3) Branch creation

Use the git branch command to create a branch. This creates a new pointer to the commit you are currently on.

4) HEAD pointer and branch switching

In Git, the HEAD pointer points to the local branch you are currently on, such as master. To switch to an existing branch, such as a new branch named testing, use the git checkout command.

# git checkout testing

If you commit on testing and then switch back to master and commit again, the new commit is recorded on master only and testing is left untouched, so the project history diverges. You can switch back and forth between the branches and merge them when you are ready. Creating and deleting branches is very cheap, because a branch is essentially a small file containing the 40-character SHA-1 checksum of the commit it points to.
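The pointer behaviour described in this section can be modelled in a few lines. The Repo class below is a toy written for this article, not Git's implementation: branches is a name-to-commit mapping, head holds the name of the current branch, and commit ids are made up.

```python
class Repo:
    """Toy model: branches map names to commit ids, head names the current branch."""

    def __init__(self):
        self.branches = {"master": None}
        self.head = "master"
        self.parents = {}
        self.counter = 0

    def commit(self):
        self.counter += 1
        new_id = "c" + str(self.counter)
        self.parents[new_id] = self.branches[self.head]
        self.branches[self.head] = new_id   # only the current branch moves
        return new_id

    def branch(self, name):
        self.branches[name] = self.branches[self.head]  # a new pointer, nothing is copied

    def checkout(self, name):
        self.head = name

repo = Repo()
repo.commit()              # c1 on master
repo.branch("testing")
repo.checkout("testing")
repo.commit()              # c2 on testing
repo.checkout("master")
repo.commit()              # c3 on master: history has diverged
print(repo.branches)       # {'master': 'c3', 'testing': 'c2'}
print(repo.parents["c2"], repo.parents["c3"])  # c1 c1
```

Both c2 and c3 have c1 as their parent, which is the divergence that a later merge would bring back together.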

2.3 Distributed Git

Compared with a traditional centralized VCS, Git's distributed nature makes collaboration between developers more flexible. Centralized systems usually follow a single collaboration model: the central server acts as the code repository, and every developer synchronizes with it as a client. When two developers clone from the same repository and both make changes, the first developer can push without problems, but the second must merge in the first developer's work before pushing, so that the first developer's changes are not overwritten. Git also supports this traditional workflow, with access to the shared repository managed through permission control.

2.3.1 Integration Manager Workflow

Git supports distributed workflows with multiple repositories, in which each developer has write access to their own public repository and read access to everyone else's. To contribute, a developer clones the project, pushes changes to their own public repository, and asks the maintainer of the official repository to pull them in. The maintainer adds the contributor's repository as a remote, tests the changes locally, merges them, and pushes the result to the official repository. The main process is as follows:

  1. The project maintainer pushes to the main repository.
  2. A contributor clones this repository and makes changes.
  3. The contributor pushes to their own public repository.
  4. The contributor sends the maintainer an email asking them to pull the changes.
  5. The maintainer adds the contributor's repository as a remote and merges the changes locally.
  6. The maintainer pushes the merged changes to the main repository.

This is the most common workflow for hub-based tools like GitHub and GitLab.

2.3.2 Dictator and Lieutenants Workflow

This is a variant of the multi-repository workflow, suitable for very large projects such as the Linux kernel. Each lieutenant (deputy integration manager) is responsible for integrating a specific part of the project, and a single overall integration manager, known as the dictator, coordinates the lieutenants. The repository maintained by the dictator serves as the reference repository from which all collaborators pull the project code. The whole process is as follows:

  1. Regular developers work on their own topic branches and rebase their work on top of master. This master branch is the one in the reference repository, to which the dictator pushes.
  2. Lieutenants merge the developers' topic branches into their own master branches.
  3. The dictator merges the master branches of all lieutenants into the dictator's own master branch.
  4. Finally, the dictator pushes the integrated master branch to the reference repository so that all other developers can rebase on it.

3. Summary

Git, as an open source distributed version control system, is widely used in version management of development projects. To sum up, it has the following characteristics:

  1. Distributed: Git is a distributed version control system. Each developer keeps a complete version history, and does not need to be connected to the Internet for code comparison or check-in. This allows developers to develop on devices in different locations without network constraints.
  2. Lightweight branch: Git supports lightweight branch, which enables team members to collaborate and develop conveniently. The creation and switching of branches is very fast, and the merging of branches is also very easy to operate, which makes it easy for teams to carry out multiple parallel development tasks.
  3. Efficient handling of large projects: Git excels in handling version management of large projects. It efficiently manages file and version history for large projects, and provides fast code retrieval, comparison, and rollback.
  4. Powerful collaboration features: Git provides rich collaboration features, such as branching, merging, tags, etc., to help team members work better together. Through remote warehouses, developers can easily share code with other members for code merging and collaborative development.
  5. Customizability: Git provides a wealth of configuration options and hooks, allowing developers to customize it to their own needs and to the team's workflow. Personal and team preferences can be set, and custom actions defined, through configuration files such as .gitconfig.
  6. Community support: Git is an open source project with a large community support and active developers. This enables Git to continuously improve and develop, and integrate with other tools and libraries to meet the needs of different scenarios.

This article briefly introduced the features of Git, its internal objects, branch management, and distributed workflows.


References:

  1. https://git-scm.com/book/zh/v2

Origin blog.csdn.net/solihawk/article/details/131508827