Clearing large-file records from a Git repository

reference:

https://blog.csdn.net/Y0W1as5eg37urFdS/article/details/123539994

https://www.manongdao.com/article-2342370.html

overview

git overview

Git is distributed version control software, originally created by Linus Torvalds and released in 2005 to better manage Linux kernel development. Git keeps the complete history of the current project on the local disk, so it is fast: most Git operations only need local files and resources and do not require a network connection.

Git LFS (Large File Storage) is a small tool that stores any specified files, such as music, pictures, or videos, outside the Git repository and replaces them in the repository with text pointers of less than 1 KB each. Storing large files outside the repository keeps the repository itself small, speeds up cloning, and prevents Git from slowing down because the repository is full of large files.

With Git LFS, by default only the versions of LFS objects needed by the currently checked-out commit are downloaded. It can also be configured to fetch the actual content of only certain LFS-managed files and keep just the pointers for the rest, which saves bandwidth and speeds up cloning. In addition, it can be configured to fetch recent versions of large files in one go, which makes it convenient to review their recent changes.
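
As a rough sketch of the selective-fetch behaviour described above, the following sets the Git LFS configuration keys lfs.fetchinclude, lfs.fetchexclude, and lfs.fetchrecentrefsdays. The directory names textures and media/raw are hypothetical, and the throwaway repository exists only so that the commands have somewhere to run; the settings take effect the next time Git LFS fetches content.

```shell
#!/bin/sh
# Sketch: limit which LFS objects get downloaded (all paths are hypothetical).
set -eu

repo=$(mktemp -d)          # throwaway repository for the demo
cd "$repo"
git init -q .

# Only download LFS content under textures/ (hypothetical directory).
git config lfs.fetchinclude "textures"
# Never download LFS content under media/raw (hypothetical directory);
# those files stay as small pointer files in the working tree.
git config lfs.fetchexclude "media/raw"
# For "git lfs fetch --recent": also fetch LFS objects for refs that
# have commits within the last 7 days.
git config lfs.fetchrecentrefsdays 7

# Show the LFS-related settings that were just written.
git config --get-regexp '^lfs\.'
```

These are ordinary Git configuration values, so they can be set per repository as here or with --global for every repository.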


.git directory

Although the hidden .git directory is not counted as part of the code size, it is downloaded together with the code when the repository is cloned, because it contains the commit history and related information. A large .git directory therefore makes the download very slow.

├── HEAD
├── branches
├── index
├── logs
│   ├── HEAD
│   └── refs
│       └── heads
│           └── master
├── objects
│   ├── 88
│   │   └── 23efd7fa394844ef4af3c649823fa4aedefec5
│   ├── 91
│   │   └── 0fc16f5cc5a91e6712c33aed4aad2cfffccb73
│   ├── 9f
│   │   └── 4d96d5b00d98959ea9960f069585ce42b1349a
│   ├── info
│   └── pack
└── refs
    ├── heads
    │   └── master
    └── tags
  • description: used by the GitWeb program
  • config: configuration settings specific to this repository
  • hooks: client-side or server-side hook scripts
  • HEAD: points to the branch that is currently checked out
  • objects: the Git object store
  • refs: the Git reference store (branches under refs/heads, tags under refs/tags)
  • branches: a deprecated location for URL shorthands used by git fetch, git pull, and git push; branch references themselves live under refs/heads

Why the .git directory grows

  • git add stores the file's content as a blob object in the objects directory and updates the index. git commit then creates tree objects from the index and, finally, a commit object, which points to the top-level tree object and to its parent commit(s).
  • All of these objects are saved as files under .git/objects. So when a particularly large file is committed, Git tracks it and stores a copy under .git/objects.
  • Even if the large file is deleted later, Git only records the deletion in a new commit. The blob stays in the .git directory because earlier commits still reference it, so the .git directory does not shrink at all.
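
The behaviour in the last bullet can be observed in a throwaway repository. The sketch below is only an illustration: the file name big.bin, the placeholder identity, and the temporary directory are assumptions made for the demo. It commits 1 MiB of random data, deletes the file in a second commit, and then shows that the blob is still stored.

```shell
#!/bin/sh
# Sketch: a deleted file's blob stays in .git/objects.
set -eu

repo=$(mktemp -d)                          # throwaway repository
cd "$repo"
git init -q .
git config user.email "demo@example.com"   # placeholder identity
git config user.name "Demo"

# Commit 1 MiB of random data (random, so it does not compress away).
dd if=/dev/urandom of=big.bin bs=1024 count=1024 2>/dev/null
git add big.bin
git commit -q -m "add big file"
blob=$(git rev-parse HEAD:big.bin)         # object ID of the file's blob

# Delete the file in a later commit.
git rm -q big.bin
git commit -q -m "remove big file"

# The blob is still stored and still reachable from the first commit.
git cat-file -t "$blob"    # object type
git cat-file -s "$blob"    # object size in bytes
git count-objects -vH      # overall size of the object store
```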

solution

Solution 1: Rebuild the repository

Rebuilding the repository from scratch is a once-and-for-all and relatively simple approach. However, it is generally not feasible unless the project is your own local project.


Solution 2: Delete large files

Find the large file in the repository history, remove it from every commit, and then force-push the result to the remote repository. The precondition is to delete all other branches and keep only the master or main branch, clear all branches of the project on the Git server, and push again. Note that this operation rewrites history and is risky; you perform it at your own risk.

  1. In the root directory of the Git project, right-click and choose Git Bash Here to open a Git command window

  2. Find the five largest objects in the packed history of the project

    git verify-pack -v .git/objects/pack/pack-*.idx | sort -k 3 -g | tail -5
    
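    # Command notes:
    # 	verify-pack	lists the objects stored in a pack file (used here to find large files)
    # 	sort -k 3 -g	sorts numerically by the third column (object size)
    # 	tail -5		keeps the last five lines, i.e. the five largest objects
    

    Result (the rest of this walkthrough is based on these five largest objects):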

    dbad6eb20d31a5aefe132b74b2137cd10105c574 blob   16684712 7287784 86721282
    6c858bc93421b2db41dafc2bfd4eb82c77c50266 blob   17504576 8257957 47764046
    416088453a2514ada98ba639af3ff298510b4246 blob   22216104 10854126 75719527
    f0d8d3b476526af42b4e06f390a1b4925580e99b blob   22435760 10589160 16678516
    553ba826b92c9d42acb1e586774ac697661588c9 blob   29814552 12998052 60871184
    
  3. The first column is the object ID (SHA-1 hash) of each file. Use the following command to find the file name that corresponds to an ID

    git rev-list --objects --all | grep dbad6eb20d31a5aefe132b74b2137cd10105c574
    
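    # Command notes:
    # 	rev-list   	lists the commits in the Git repository
    # 	--objects  	also lists the IDs of all objects (files and directories) those commits reference
    # 	--all	    includes commits from all branches (every ref under refs/)
    
    

    Result: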

    dbad6eb20d31a5aefe132b74b2137cd10105c574 Pods/UMengUShare/UShareSDK/SocialLibraries/WeChat/WechatSDK/libWeChatSDK.a
    
  4. Delete large file records

    git filter-branch --force --prune-empty --index-filter 'git rm -rf --cached --ignore-unmatch Pods/UMengUShare/UShareSDK/SocialLibraries/WeChat/WechatSDK/libWeChatSDK.a' --tag-name-filter cat -- --all
    
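    # Command notes:
    # 	filter-branch	rewrites commits in the Git repository
    # 	--index-filter	runs the given command on the index of every commit (here, to remove the file)
    # 	--all			rewrites commits on all branches (every ref under refs/)
    

    After running the command above, the following error may be reported: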

    WARNING: git-filter-branch has a glut of gotchas generating mangled history
         rewrites.  Hit Ctrl-C before proceeding to abort, then use an
         alternative filtering tool such as 'git filter-repo'
         (https://github.com/newren/git-filter-repo/) instead.  See the
         filter-branch manual page for more details; to squelch this warning,
         set FILTER_BRANCH_SQUELCH_WARNING=1.
    Proceeding with filter-branch...
    
    Cannot rewrite branches: You have unstaged changes.
    

    If the above error occurs, stash the uncommitted changes first (skip this step if no error is reported):

    git stash
    

    Then run the removal command again.

    Result:

    WARNING: git-filter-branch has a glut of gotchas generating mangled history
         rewrites.  Hit Ctrl-C before proceeding to abort, then use an
         alternative filtering tool such as 'git filter-repo'
         (https://github.com/newren/git-filter-repo/) instead.  See the
         filter-branch manual page for more details; to squelch this warning,
         set FILTER_BRANCH_SQUELCH_WARNING=1.
    Proceeding with filter-branch...
    
    Rewrite 967ff01a8b2bc7d7f6c90630ede38a2135ba5813 (1/41) (0 seconds passed, remai
    Rewrite 9674de7f9f8ed029785421f2246bb9d41786cc49 (2/41) (0 seconds passed, remai
    Rewrite e04ab331167f24b7d26d69e169386e690081af74 (3/41) (0 seconds passed, remai
    Rewrite 997e93fdaff75162eda8e5c6ee6d3265ce61fe96 (4/41) (0 seconds passed, 
    Rewrite 429d321e2ee4c54d8d96b35a8e35d021bd8930f1 (35/41) (3 seconds passed, remaining 0 predicted)
    rm 'Pods/UMengUShare/UShareSDK/SocialLibraries/WeChat/WechatSDK/libWeChatSDK.a'
    
    Ref 'refs/heads/master' was rewritten
    Ref 'refs/remotes/origin/master' was rewritten
    WARNING: Ref 'refs/remotes/origin/master' is unchanged
    Ref 'refs/stash' was rewritten
    
  5. Clean up the backup refs, the reflog, and unreachable objects

    rm -rf .git/refs/original/
    git reflog expire --expire=now --all
    git gc --prune=now
    
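  6. Force-push to the remote

    git push --force --all
    # --all covers branches only; push the rewritten tags as well
    git push --force --tags
    
    # Remove stale remote-tracking refs from the local repository. This does not
    # shrink the remote by itself: the server frees the space when it runs its
    # own garbage collection.
    git remote prune origin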
    
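Steps 2 through 5 can be rehearsed in a throwaway repository before touching a real one. The sketch below is a minimal illustration under stated assumptions: the file name big.bin, the placeholder identity, and the temporary directory are made up for the demo. It lists blobs by size with git cat-file --batch-check (a one-pass alternative to verify-pack plus grep that also sees objects that are not yet packed), removes the file with the same filter-branch invocation used above, and then checks whether the blob is gone after cleanup.

```shell
#!/bin/sh
# Sketch: find a large blob, purge it from history, and verify the result.
set -eu

repo=$(mktemp -d)                          # throwaway repository
cd "$repo"
git init -q .
git config user.email "demo@example.com"   # placeholder identity
git config user.name "Demo"

echo "hello" > README
git add README
git commit -q -m "initial commit"
dd if=/dev/urandom of=big.bin bs=1024 count=1024 2>/dev/null
git add big.bin
git commit -q -m "add big file"
blob=$(git rev-parse HEAD:big.bin)         # object ID of the large blob

# Steps 2-3 in one pass: print "type id size path" for every blob,
# sorted by size, largest last.
git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    awk '$1 == "blob"' | sort -k 3 -n | tail -5

# Step 4: remove big.bin from every commit on every ref.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force --prune-empty \
    --index-filter 'git rm -rf --cached --ignore-unmatch big.bin' \
    --tag-name-filter cat -- --all >/dev/null

# Step 5: drop the backup refs, expire the reflog, and prune.
rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc --quiet --prune=now

# Verify that the blob no longer exists in the object store.
if git cat-file -e "$blob" 2>/dev/null; then
    echo "blob still present"
else
    echo "blob removed"
fi
```

The filter-branch warning shown earlier points to git filter-repo as the maintained alternative; the same rehearsal approach applies if you switch tools.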

Solution 3: Use the git repo-clean tool (personally tested, recommended)

git repo-clean is a Git extension written in Go that can scan a Git repository for large files, remove them, and rewrite the commit history.

For the source code and for download and installation details, see: https://gitee.com/oschina/git-repo-clean

Installing from the binary package is recommended, as it is simple and convenient.


Usage:

The tool can be used in two ways: from the command line or interactively.

The current options are as follows:

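  -v, --verbose		show detailed processing output
  -V, --version		show the git-repo-clean version
  -h, --help		show usage information
  -p, --path		path to the Git repository; defaults to the current directory, '.'
  -s, --scan		scan the repository data; scans all branches by default
  -f, --file		directly specify files or directories in the repository; incompatible with '--scan'
  -b, --branch		branch to delete files from; defaults to all branches
  -l, --limit		size threshold for scanned files, e.g. '--limit=10m'
  -n, --number		number of scan results to display
  -t, --type		file extension (file type) to scan for
  -i, --interactive 	enable interactive mode
  -d, --delete		delete the files and rewrite history
  -L, --lfs		convert large files into Git LFS pointer files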

Command-line usage:

git repo-clean --verbose --scan --limit=1G --type=tar.gz --number=5

With the --delete option, the scanned files are deleted in batches and the related commit history (including HEAD) is rewritten:

git repo-clean --verbose --scan --limit=1G --type=tar.gz --number=5 --delete 

Notes:

  • If the repo-clean command is not found when you open a Bash window directly in the root directory of the Git project, open a cmd window from the [Start] menu in the lower-left corner instead, change to the root directory of the Git project, and run the repo-clean command there.
  • Currently, both scanning and deleting run on all branches by default. The --branch option only selects the branch for deletion; it cannot restrict the scan to a branch. So if this option is used, the scan results may include files that exist only in other branches, in which case no files are actually deleted.

Solution 4: Use the git lfs migrate command to optimize the .git directory

Migrate the existing Git repository so that its large files are managed by Git LFS. The migration rewrites history, so afterwards you need to run git push --force. Confirm that the local result is correct before pushing. If other clones of the repository existed before the migration, each of them may need to run git reset --hard origin/master to reset its local branch. Note that git reset --hard discards local changes.

git lfs migrate: converts files already stored in the Git repository's history (.git) into LFS files

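# Rewrite the master branch:
# manage every *.zip in its commit history (that is, in the .git directory) with LFS
$ git lfs migrate import --include-ref=master --include="*.zip"
 
# Rewrite all branches and tags:
# manage every *.rar and *.zip in the commit history (that is, in the .git directory) with LFS
$ git lfs migrate import --everything --include="*.rar,*.zip"
 
# After the migration, the rewritten local branches must be pushed manually
# to update each branch in the remote repository
$ git push --force --all
 
# After a successful migration, the size of the Git repository may not have changed,
# mainly because the earlier commits are still present, so some cleanup is needed.
# Unless the history of the repository is very important, consider creating
# a new repository instead of doing the above.
$ git reflog expire --expire-unreachable=now --all
$ git gc --prune=now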

further tips

Manage large files with lfs

A better way to avoid the problems above is to use lfs to track, record, and manage large files from the start. That way large files do not pollute the .git directory, and they are also more convenient to work with.

# 1. Enable LFS
$ git lfs install
 
# 2. Track all files with the ".iso" extension
$ git lfs track "*.iso"
 
# 3. Track a single file
$ git lfs track "logo.png"
 
# 4. Stage the file that stores the tracking rules
$ git add .gitattributes
 
# 5. Commit and push to the GitHub repository
$ git add .
$ git commit -m "Add some files"
$ git push origin master
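
git lfs track records each pattern in .gitattributes, which is why that file has to be committed. As a small sketch, the following writes by hand the line that git lfs track "*.iso" is expected to produce, so it runs even without Git LFS installed, and then uses git check-attr to confirm which paths would pass through the LFS filter. The file names disk.iso and notes.txt are hypothetical and do not need to exist.

```shell
#!/bin/sh
# Sketch: inspect the attribute rule behind LFS tracking.
set -eu

repo=$(mktemp -d)          # throwaway repository
cd "$repo"
git init -q .

# Hand-written equivalent of: git lfs track "*.iso"
printf '%s\n' '*.iso filter=lfs diff=lfs merge=lfs -text' > .gitattributes

# Ask Git which filter applies to each path.
git check-attr filter -- disk.iso notes.txt
# prints:
#   disk.iso: filter: lfs
#   notes.txt: filter: unspecified
```

Running git check-attr like this in a real repository is a quick way to confirm that a new file will be stored through LFS before it is committed.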

Another approach is to make good use of the .gitignore file to exclude directories or files that the repository does not need, so that files that should not be there never enter the code repository:

.DS_Store
node_modules
/dist
 
*.zip
*.tar.gz
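
Whether a given path is covered by these patterns can be checked with git check-ignore. The sketch below is only an illustration: it recreates the .gitignore above in a throwaway repository and tests a few hypothetical paths, which do not need to exist. Note that /dist is anchored to the directory containing the .gitignore, so it is not expected to match a dist directory nested deeper in the tree.

```shell
#!/bin/sh
# Sketch: test paths against the .gitignore patterns shown above.
set -eu

repo=$(mktemp -d)          # throwaway repository
cd "$repo"
git init -q .

printf '%s\n' '.DS_Store' 'node_modules' '/dist' '' '*.zip' '*.tar.gz' > .gitignore

# git check-ignore -q exits with 0 when the path is ignored, 1 when it is not.
for path in release.zip dist/app.js src/dist/app.js src/main.c; do
    if git check-ignore -q "$path"; then
        echo "ignored:     $path"
    else
        echo "not ignored: $path"
    fi
done
```

Using git check-ignore -v instead of -q also prints the file, line number, and pattern responsible for each match.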

Origin blog.csdn.net/footless_bird/article/details/125686432