shell learning 3

Table of contents

1. Find and delete duplicate files

2. List file type statistics

3. List only the directory


1. Find and delete duplicate files

Duplicate files are multiple copies of the same file. Sometimes we need to delete the duplicates and keep only one copy. Several shell tools can be combined to identify duplicates by their contents. This section shows how to find duplicate files and act on the results.

Duplicates are identified by comparing file contents. A checksum is computed from the content of a file, so files with the same content always produce the same checksum. We can therefore find duplicates by comparing checksums, and then delete all but one copy.
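The idea can be tried on a few throwaway files before running the full script. This is a minimal sketch; the file names a.txt, b.txt and c.txt are hypothetical and are created in a temporary directory.

```shell
#!/bin/bash
# Sketch: identical content gives identical MD5 checksums.
# a.txt, b.txt and c.txt are hypothetical names used only for this demo.
tmpdir=$(mktemp -d)
printf 'hello\n' > "$tmpdir/a.txt"
printf 'hello\n' > "$tmpdir/b.txt"   # same content as a.txt
printf 'world\n' > "$tmpdir/c.txt"   # same size, different content

# md5sum prints "<checksum>  <file>"; keep only the checksum.
sum_a=$(md5sum "$tmpdir/a.txt" | cut -d' ' -f1)
sum_b=$(md5sum "$tmpdir/b.txt" | cut -d' ' -f1)
sum_c=$(md5sum "$tmpdir/c.txt" | cut -d' ' -f1)

if [ "$sum_a" = "$sum_b" ]; then echo "a.txt and b.txt are duplicates"; fi
if [ "$sum_a" != "$sum_c" ]; then echo "a.txt and c.txt differ"; fi
rm -rf "$tmpdir"
```

Note that c.txt has the same size as a.txt, which is why size alone is not enough and the checksum is needed.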

 

Script:

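#!/bin/bash
# Finds duplicate files in the current directory and keeps one copy of each.
# Assumes the directory holds only regular files whose names contain no whitespace.
ls -lS --time-style=long-iso | awk 'BEGIN {
    getline; getline;
    name1=$8; size=$5
}
{
    # Save the fields first: "cmd | getline" below overwrites $0
    name2=$8; size2=$5;
    if (size==size2)
    {
        # close() lets the same command be run again for a later pair
        cmd="md5sum "name1; cmd | getline; close(cmd); csum1=$1;
        cmd="md5sum "name2; cmd | getline; close(cmd); csum2=$1;
        if ( csum1==csum2 )
        {
            print name1; print name2
        }
    };
    size=size2; name1=name2;
}' | sort -u > duplicate_files

# Keep one file name per checksum as the copy to preserve
cat duplicate_files | xargs -I {} md5sum {} | sort | uniq -w 32 | awk '{ print $2 }' | sort -u > duplicate_sample

echo Removing..
# Delete every duplicate that is not the preserved copy
comm -2 -3 duplicate_files duplicate_sample | tee /dev/stderr | xargs rm
echo Removed duplicate files successfully.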


Code understanding:

1. ls -lS

ls -lS lists the files in the current directory in long format, sorted by size (largest first).

--time-style=long-iso prints the timestamp as yyyy-mm-dd hh:mm. The date and time then always occupy exactly two columns, so the size is column 5 and the file name is column 8.
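To check the column positions, the listing can be fed to awk on a small test directory. This is a sketch that assumes GNU ls; the file names small and large are hypothetical.

```shell
#!/bin/bash
# Sketch: with --time-style=long-iso, size is field 5 and name is field 8.
# "small" and "large" are hypothetical file names created for this demo.
tmpdir=$(mktemp -d)
printf 'aaaa' > "$tmpdir/small"       # 4 bytes
printf 'aaaaaaaa' > "$tmpdir/large"   # 8 bytes

# NR > 1 skips the "total N" summary line that ls -l prints first.
listing=$(cd "$tmpdir" && ls -lS --time-style=long-iso | awk 'NR > 1 { print $5, $8 }')
echo "$listing"
rm -rf "$tmpdir"
```

The larger file is listed first because -S sorts by size in descending order.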

 

2. awk

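awk 'BEGIN {
    getline; getline;
    name1=$8; size=$5
}
{
    # Save the fields first: "cmd | getline" below overwrites $0
    name2=$8; size2=$5;
    if (size==size2)
    {
        # close() lets the same command be run again for a later pair
        cmd="md5sum "name1; cmd | getline; close(cmd); csum1=$1;
        cmd="md5sum "name2; cmd | getline; close(cmd); csum2=$1;
        if ( csum1==csum2 )
        {
            print name1; print name2
        }
    };
    size=size2; name1=name2;
}'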

The first getline reads line 1 of the ls output, the summary line (for example, total 16). It is discarded when the next getline overwrites it.

The second getline reads line 2, the first file entry. Column 8 of that line is assigned to name1 and column 5 to size.

The main block then runs once for every remaining line, starting with line 3. Column 8 of the current line is assigned to name2, and column 5 is compared with size. If the sizes are equal, md5sum is run on name1 and the first column of its output is stored in csum1 (md5sum prints two columns: the MD5 value and the file name). In the same way, csum2 receives the MD5 value of the name2 file. If the two MD5 values match, name1 and name2 are printed. In every case, the size and name of the current line then become size and name1 for the next comparison. That is, line 2 is compared with line 3, line 3 with line 4, and so on. Because the listing is sorted by size, files of equal size are always adjacent.

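In awk, the output of an external command can be read with: "cmd" | getline. This form reads one line of the command's output into $0, replacing the current input line and its fields, so any field that is still needed must be saved in a variable first.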
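A small sketch of this behaviour, using echo as a stand-in for md5sum (the input line and the command are made up for illustration):

```shell
#!/bin/bash
# Sketch: "cmd | getline" replaces $0, so fields are saved beforehand.
result=$(printf 'one two\n' | awk '{
    saved = $2                  # keep field 2 of the input line
    cmd = "echo hello world"    # stand-in for an external command
    cmd | getline               # $0 is now the first line of the command output
    close(cmd)                  # allow the same command to be run again later
    print $1, saved
}')
echo "$result"   # prints: hello two
```

Without close(cmd), a second read from the same command string would hit end of input instead of running the command again.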

3. sort

sort arranges lines in ascending order by default.

sort -u sorts the lines and removes duplicates.
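For example, on a short made-up list of lines:

```shell
#!/bin/bash
# Sketch: sort -u sorts the input and keeps one copy of each line.
sorted=$(printf 'b\na\nb\nc\na\n' | sort -u)
echo "$sorted"
```

In the script this is what stops a file name from appearing twice in duplicate_files when a file matches both its predecessor and its successor.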

4. xargs

xargs -I {} sets {} as the replacement string: the command is run once per input line, with {} replaced by that line.

cat duplicate_files | xargs -I {} md5sum {} therefore runs md5sum once for each listed file, for example md5sum test and then md5sum test_copy1.
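The substitution can be made visible by putting echo in front of the command, so that the command lines are printed instead of executed. A minimal sketch:

```shell
#!/bin/bash
# Sketch: xargs -I {} builds one command per input line.
# echo prints the command that would run instead of running md5sum.
commands=$(printf 'test\ntest_copy1\n' | xargs -I {} echo md5sum {})
echo "$commands"
```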

5. uniq -w 32

Compares only the first 32 characters of each line and, among adjacent lines that match, keeps the first and drops the rest. Each line of md5sum output consists of a 32-character hash followed by the file name, so this keeps one file per checksum.
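A sketch of the effect on lines shaped like md5sum output. The hashes are computed from made-up strings and the file names are hypothetical; -w is assumed to be available, as in GNU uniq.

```shell
#!/bin/bash
# Sketch: uniq -w 32 keeps one line per 32-character checksum.
h_same=$(printf 'same content' | md5sum | cut -d' ' -f1)
h_other=$(printf 'other content' | md5sum | cut -d' ' -f1)

# Three lines in md5sum format; the first two share a checksum.
kept=$(printf '%s  %s\n' "$h_same" test "$h_same" test_copy1 "$h_other" unrelated \
    | sort | uniq -w 32)
echo "$kept"
```

The sort before uniq matters: uniq only compares adjacent lines, so equal checksums must sit next to each other.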

 

6. tee /dev/stderr

tee has a neat trick here: it prints the file names while also passing them on to the rm command.
tee writes the lines it reads from stdin to a file and copies them to stdout at the same time. Here the file is /dev/stderr, the device corresponding to stderr (standard error), so each name appears in the terminal through standard error while the same text continues down the pipe to xargs rm.
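The last pipeline of the script can be rehearsed safely by replacing rm with echo rm. This is a sketch with hypothetical list contents; comm -2 -3 prints the lines that occur only in the first of two sorted files.

```shell
#!/bin/bash
# Sketch: comm selects the names to delete, tee shows them on stderr.
tmpdir=$(mktemp -d)
printf 'test\ntest_copy1\ntest_copy2\n' > "$tmpdir/duplicate_files"
printf 'test\n' > "$tmpdir/duplicate_sample"   # the copy to keep

# "echo rm" stands in for the real rm, so nothing is deleted here.
to_delete=$(comm -2 -3 "$tmpdir/duplicate_files" "$tmpdir/duplicate_sample" \
    | tee /dev/stderr | xargs echo rm)
echo "$to_delete"
rm -rf "$tmpdir"
```

The names appear twice in the terminal: once from tee through stderr, and once inside the rm command line captured from stdout.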


2. List file type statistics

Write a script that iterates through all the files in a directory and generates a report detailing the file types and the number of each file type.

Script:

#!/bin/bash
# Filename: filestat.sh
if [ $# -ne 1 ];
then
    echo "Usage is $0 basepath";
    exit 1
fi

path=$1
declare -A statarray;

# IFS= and -r keep leading spaces and backslashes in file names intact
while IFS= read -r line;
do
    # Keep the text before the first comma as the file type
    ftype=$(file -b "$line" | cut -d, -f1)
    let statarray["$ftype"]++;
done < <(find "$path" -type f -print)

echo ============ File types and counts =============
for ftype in "${!statarray[@]}";
do
    echo "$ftype : ${statarray[$ftype]}"
done


 Code understanding:

1. declare -A statarray;

Declares an associative array named statarray.

2. find $path -type f -print

Lists every regular file in the directory and its subdirectories, one path per line. Directories themselves are not listed.
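A quick sketch on a temporary directory tree (the names sub, a.txt and b.txt are hypothetical):

```shell
#!/bin/bash
# Sketch: find -type f descends into subdirectories and lists files only.
tmpdir=$(mktemp -d)
mkdir "$tmpdir/sub"
touch "$tmpdir/a.txt" "$tmpdir/sub/b.txt"

found=$(find "$tmpdir" -type f -print | sort)
echo "$found"
file_count=$(printf '%s\n' "$found" | wc -l)
rm -rf "$tmpdir"
```

The directory sub does not appear in the output, only the two files.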

 

3. file -b

Identifies the file type. With -b (brief), the file name is left out of the output.

 

4. cut -d, -f1

Splits the line at commas and keeps the first field.
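For instance, applied to a description in the style that file -b prints for an image. The sample text is illustrative only, not captured output.

```shell
#!/bin/bash
# Sketch: keep only the part of a type description before the first comma.
# The description below is a made-up sample in the style of file -b output.
desc='PNG image data, 640 x 480, 8-bit/color RGB, non-interlaced'
ftype=$(printf '%s\n' "$desc" | cut -d, -f1)
echo "$ftype"   # prints: PNG image data
```

Dropping the details after the comma is what lets all files of one kind share a single key in the report.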

 

5. let statarray["$ftype"]++;

Uses the file type as the array index and stores the number of files of that type as the value. Each time a file type is encountered, let increments its count.

6. ${!statarray[@]}

Expands to the list of all indices (keys) of the associative array.
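The counting pattern can be tried without any files. This sketch needs bash 4 or later; the type strings are made up, and it increments with an arithmetic expansion rather than let, a variant that returns a zero exit status even on the first increment.

```shell
#!/bin/bash
# Sketch: count occurrences with an associative array, then list the keys.
declare -A count

for t in "ASCII text" "PNG image data" "ASCII text"; do
    # Variant of let count["$t"]++ that treats a missing entry as 0.
    count["$t"]=$(( ${count["$t"]:-0} + 1 ))
done

# "${!count[@]}" expands to the keys; the order is not guaranteed.
for key in "${!count[@]}"; do
    echo "$key : ${count[$key]}"
done
```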

7. <(find $path -type f -print)

<(find $path -type f -print) stands in for a file name: the shell replaces it with the name of a file from which the output of the subprocess can be read.
Note that the first < is input redirection, and the second < is part of the process substitution <( ). The space between the two < keeps the shell from interpreting them as the << operator.
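One practical reason to feed the loop this way rather than through a pipe, not stated above but standard bash behaviour: with default settings, a loop at the end of a pipeline runs in a subshell, so variables it sets are lost when it ends. A minimal sketch comparing the two forms:

```shell
#!/bin/bash
# Sketch: a piped loop runs in a subshell; a redirected loop does not.
count_pipe=0
printf 'a\nb\nc\n' | while read -r line; do
    count_pipe=$((count_pipe + 1))    # changes a copy inside the subshell
done

count_subst=0
while read -r line; do
    count_subst=$((count_subst + 1))  # changes the variable in this shell
done < <(printf 'a\nb\nc\n')

echo "pipe: $count_pipe, process substitution: $count_subst"
```

In filestat.sh the same applies to statarray: with a pipe, the counts would be gone by the time the report loop runs.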


3. List only the directory


Origin blog.csdn.net/weixin_53308294/article/details/129227967