How to identify duplicate files on Linux

Copies of files sometimes represent a big waste of disk space and can cause confusion when you want to update a file. Here are six commands to help you identify these files.

In a recent post, we looked at how to identify and locate files that are hard links (i.e., that point to the same disk content and share an inode). In this article, we'll look at commands for finding files that have the same content but are not otherwise connected.

Hard links are useful because they allow files to exist in multiple places in the file system without taking up any additional disk space. Copies of files, on the other hand, can represent a big waste of disk space and run some risk of causing confusion when you want to update a file. In this article, we'll look at several ways to identify these files.

Compare files with the diff command

Probably the easiest way to compare two files is to use the diff command. The output will show you the differences between the two files. The < and > symbols indicate whether the extra lines of text are in the first (<) or second (>) file. In this example, the extra lines are in backup.html.

$ diff index.html backup.html
2438a2439,2441
> <pre>
> That's all there is to report.
> </pre>

If diff produces no output, the two files are the same.

$ diff home.html index.html
$

The only drawbacks of diff are that it can only compare two files and that you have to identify the files to compare. Some of the commands covered in this post can find multiple duplicate files for you.
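
If you just need a quick yes/no check of one file against several others, you could wrap diff -q (which only reports whether files differ) in a small shell loop. This is a minimal sketch, not a command from the original article, and the file names are illustrative:

$ for f in *.html; do [ "$f" != index.html ] && diff -q index.html "$f" > /dev/null && echo "$f matches index.html"; done
home.html matches index.html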

Use checksums

The cksum (checksum) command computes checksums for files. Checksums are a mathematical reduction of a file's contents into a lengthy number (e.g. 2819078353). While checksums are not entirely unique, the chance that files with different contents end up with the same checksum is extremely small.

$ cksum *.html
2819078353 228029 backup.html
4073570409 227985 home.html
4073570409 227985 index.html

In the example above, you can see that the second and third files yield the same checksum, so you can assume they are identical.
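
With more than a handful of files, you can let the shell do the matching for you. The sketch below is not from the original article; the sums.txt file name is just an illustration. It sorts the checksums and prints only the files whose checksum shows up more than once:

$ cksum *.html | sort -n > sums.txt
$ awk '{print $1}' sums.txt | uniq -d | while read sum; do grep "^$sum " sums.txt; done
4073570409 227985 home.html
4073570409 227985 index.html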

Use the find command

While the find command doesn't have an option for finding duplicate files, it can still be used to locate files by name or type and to run the cksum command on them. For example:

$ find . -name "*.html" -exec cksum {} \;
4073570409 227985 ./home.html
2819078353 228029 ./backup.html
4073570409 227985 ./index.html
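
You can also let a pipeline spot the matches for you. Because md5sum produces fixed-width (32-character) hashes, its output is easy to group with uniq. This is a sketch that assumes GNU coreutils, not a command from the original article:

$ find . -name "*.html" -exec md5sum {} + | sort | uniq -w 32 --all-repeated=separate

Each group of lines in the output shares the same hash, so the files within a group have identical content.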

Use the fslint command

The fslint command can be used to specifically find duplicate files. Note that we give it a starting location. The command can take a while to run if it needs to work through a large number of files. Notice how it lists the duplicate files and also looks for other problems, such as empty directories and bad IDs.

$ fslint .
-----------------------------------file name lint
-------------------------------Invalid utf8 names
-----------------------------------file case lint
----------------------------------DUPlicate files   <==
home.html
index.html
-----------------------------------Dangling links
--------------------redundant characters in links
------------------------------------suspect links
--------------------------------Empty Directories
./.gnupg
----------------------------------Temporary Files
----------------------duplicate/conflicting Names
------------------------------------------Bad ids
-------------------------Non Stripped executables

You may need to install fslint on your system. You will probably also need to add it to your command search path:

$ export PATH=$PATH:/usr/share/fslint/fslint
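
Installing it is usually just a matter of pulling in your distribution's package. On a Debian- or Ubuntu-based system, for example, that might look like the line below, though the package name can vary and recent releases may no longer ship fslint at all:

$ sudo apt-get install fslint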

Use the rdfind command

The rdfind command will also look for duplicate (same content) files. The name stands for "redundant data find", and the command can determine, based on file dates, which files are the originals. This is useful because, if you choose to delete the duplicates, it will remove the newer files.

$ rdfind ~
Now scanning "/home/shark", found 12 files.
Now have 12 files in total.
Removed 1 files due to nonunique device and inode.
Total size is 699498 bytes or 683 KiB
Removed 9 files due to unique sizes from list.2 files left.
Now eliminating candidates based on first bytes:removed 0 files from list.2 files left.
Now eliminating candidates based on last bytes:removed 0 files from list.2 files left.
Now eliminating candidates based on sha1 checksum:removed 0 files from list.2 files left.
It seems like you have 2 files that are not unique
Totally, 223 KiB can be reduced.
Now making results file results.txt

You can also run this command in dryrun mode (in other words, it only reports the changes that would otherwise be made).

$ rdfind -dryrun true ~
(DRYRUN MODE) Now scanning "/home/shark", found 12 files.
(DRYRUN MODE) Now have 12 files in total.
(DRYRUN MODE) Removed 1 files due to nonunique device and inode.
(DRYRUN MODE) Total size is 699352 bytes or 683 KiB
Removed 9 files due to unique sizes from list.2 files left.
(DRYRUN MODE) Now eliminating candidates based on first bytes:removed 0 files from list.2 files left.
(DRYRUN MODE) Now eliminating candidates based on last bytes:removed 0 files from list.2 files left.
(DRYRUN MODE) Now eliminating candidates based on sha1 checksum:removed 0 files from list.2 files left.
(DRYRUN MODE) It seems like you have 2 files that are not unique
(DRYRUN MODE) Totally, 223 KiB can be reduced.
(DRYRUN MODE) Now making results file results.txt

The rdfind command also provides options such as ignoring empty files (-ignoreempty) and following symbolic links (-followsymlinks). Check the man page for explanations.

-ignoreempty       ignore empty files
-minsize        ignore files smaller than specified size
-followsymlinks     follow symbolic links
-removeidentinode   remove files referring to identical inode
-checksum       identify checksum type to be used
-deterministic      determines how to sort files
-makesymlinks       turn duplicate files into symbolic links
-makehardlinks      replace duplicate files with hard links
-makeresultsfile    create a results file in the current directory
-outputname     provide name for results file
-deleteduplicates   delete/unlink duplicate files
-sleep          set sleep time between reading files (milliseconds)
-n, -dryrun     display what would have been done, but don't do it
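
For example, based on the -makehardlinks option listed above, you could reclaim the wasted space without losing any file names by turning the duplicates into hard links instead of deleting them. A sketch, not from the original article (it's worth trying it with -dryrun true first):

$ rdfind -makehardlinks true .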

Note that the rdfind command offers the option to delete duplicate files with the -deleteduplicates true setting. I hope the command's quirky syntax doesn't annoy you. ;-)

$ rdfind -deleteduplicates true .
...
Deleted 1 files.    <==

You will likely have to install rdfind on your system. It is probably a good idea to experiment with it to become familiar with how to use it.

Use the fdupes command

The fdupes command also makes it easy to identify duplicate files and provides a number of useful options, such as -r for recursion. In its simplest form, it groups the duplicate files together like this:

$ fdupes ~
/home/shs/UPGRADE
/home/shs/mytwin

/home/shs/lp.txt
/home/shs/lp.man

/home/shs/penguin.png
/home/shs/penguin0.png
/home/shs/hideme.png

Here's an example using recursion. Note that many of the duplicate files are important (users' .bashrc and .profile files) and should clearly not be deleted.

# fdupes -r /home
/home/shark/home.html
/home/shark/index.html

/home/dory/.bashrc
/home/eel/.bashrc

/home/nemo/.profile
/home/dory/.profile
/home/shark/.profile

/home/nemo/tryme
/home/shs/tryme

/home/shs/arrow.png
/home/shs/PNGs/arrow.png

/home/shs/11/files_11.zip
/home/shs/ERIC/file_11.zip

/home/shs/penguin0.jpg
/home/shs/PNGs/penguin.jpg
/home/shs/PNGs/penguin0.jpg

/home/shs/Sandra_rotated.png
/home/shs/PNGs/Sandra_rotated.png

The fdupes command's many options are listed below. Use the fdupes -h command, or read the man page, for more details.

-r --recurse     recurse
-R --recurse:    recurse through specified directories
-s --symlinks    follow symlinked directories
-H --hardlinks   treat hard links as duplicates
-n --noempty     ignore empty files
-f --omitfirst   omit the first file in each set of matches
-A --nohidden    ignore hidden files
-1 --sameline    list matches on a single line
-S --size        show size of duplicate files
-m --summarize   summarize duplicate files information
-q --quiet       hide progress indicator
-d --delete      prompt user for files to preserve
-N --noprompt    when used with --delete, preserve the first file in set
-I --immediate   delete duplicates as they are encountered
-p --permissions don't consider files with different owner/group or
                 permission bits as duplicates
-o --order=WORD  order files according to specification
-i --reverse     reverse order while sorting
-v --version     display fdupes version
-h --help        displays help
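
Based on the options above, a couple of combinations you might find handy are sketched below. Treat the second one with care, since it deletes files without prompting; these are illustrations, not commands from the original article:

$ fdupes -r -m /home        # summarize how much space the duplicates are using
$ fdupes -r -d -N /home     # delete duplicates, keeping the first file in each set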

fdupes is another command that you may need to install and work with for a while before you become familiar with its many options.

Summary

Linux systems provide a good range of tools for locating and (potentially) removing duplicate files, along with options for where to search and for what to do with duplicate files when you find them.


via: www.networkworld.com/article/339…

Author: Sandra Henry-Stocker | Topic selection: lujun9972 | Translator: tomjlw | Proofreader: wxy

This article was originally translated by LCTT and is proudly presented by Linux China.

Reproduced from: https://juejin.im/post/5cfe74985188254ee433c032
