Detailed explanation of Linux text processing

1. Text processing

This section introduces the tools most commonly used when processing text with the shell under Linux: find, grep, xargs, sort, uniq, tr, cut, paste, wc, sed, and awk. The examples and parameters given are the common ones. A rule of thumb for shell scripting: keep each command on a single line and try not to exceed two lines; for more complex tasks, consider Python.

1.1. find file search

Find txt and pdf files:

find . \( -name "*.txt" -o -name "*.pdf" \) -print

Find .txt and .pdf files using a regex (find's default regex dialect requires \| for alternation):

find . -regex ".*\(\.txt\|\.pdf\)$"

-iregex: Regex that ignores case

Negate a condition and find all files that are not .txt:

find . ! -name "*.txt" -print

Specify the search depth and print out the files in the current directory (the depth is 1):

find . -maxdepth 1 -type f

Custom search

  • search by type
find . -type d -print  // list only directories

-type f regular file / l symbolic link / d directory

find's type test can distinguish regular files, symbolic links, directories, and so on, but it cannot directly distinguish binary files from text files;

The file command can check the specific type of the file (binary or text):

$file redis-cli  # a binary file
redis-cli: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.9, not stripped
$file redis.pid  # a text file
redis.pid: ASCII text

Therefore, the following command combinations can be used to find all binary files in the local directory:

ls -lrt | awk '{print $9}'|xargs file|grep  ELF| awk '{print $1}'|tr -d ':'
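An alternative sketch that avoids parsing ls output, assuming GNU find and file are available (the -exec ... + form batches the file names):

find . -maxdepth 1 -type f -exec file {} + | grep ELF | cut -d: -f1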
  • search by time

      -atime access time (in days; the minute-level version is -amin, and similarly below)
      -mtime modification time (file content modified)
      -ctime change time (metadata or permissions changed)

All files accessed exactly 7 days ago:

find . -atime 7 -type f -print

All files accessed within the last 7 days:

find . -atime -7 -type f -print

All files accessed more than 7 days ago:

find . -atime +7 -type f -print
  • Search by size:

Units: c (bytes), w (two-byte words), b (512-byte blocks), k, M, G. Find files larger than 2k:

find . -type f -size +2k
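The +/- prefixes work the same way as with -atime, for example:

find . -type f -size -2k   # files smaller than 2k
find . -type f -size +100M # files larger than 100M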

Search by permissions:

find . -type f -perm 644 -print // find all files with 644 permissions

Find by user:

find . -type f -user weber -print // find files owned by user weber

Follow-up actions after finding

  • delete

Delete all swp files in the current directory:

find . -type f -name "*.swp" -delete

Another syntax:

find . -type f -name "*.swp" | xargs rm
  • Execute actions (mighty exec)

Change the ownership of the current directory to weber:

find . -type f -user root -exec chown weber {} \;

Note: {} is a special string, for each matching file, {} will be replaced with the corresponding file name;

Copy all found files to another directory:

find . -type f -mtime +10 -name "*.txt" -exec cp {} OLD \;
  • combine multiple commands

If several commands need to run on each matched file, write them into a script and have -exec invoke that script:

-exec ./commands.sh {} \;
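A minimal sketch of what such a commands.sh might look like (the backup action here is hypothetical):

#!/bin/bash
# commands.sh: back up the file passed as $1, then report its size
cp "$1" "$1.bak"
du -h "$1"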

-print delimiter

By default, '\n' is used as the delimiter between file names;

-print0 uses '\0' as the delimiter instead, so that file names containing spaces can be handled;
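For example, deleting files whose names contain spaces by pairing -print0 with xargs -0:

find . -type f -name "*.swp" -print0 | xargs -0 rm -f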

1.2. grep text search

grep match_pattern file // print matching lines (the default behavior)

Common parameters

  • -o output only the matched portion of the text vs -v output only non-matching lines

  • -c count the number of matching lines in the file

    grep -c "text" filename

  • -n print the line numbers of matches

  • -i ignore case when searching

  • -l print only the matching file names

Recursive search for text in multi-level directories (programmers' favorite code search):

grep "class" . -R -n

Match multiple patterns:

grep -e "class" -e "virtual" file

grep can output NUL-terminated file names (-Z), which pairs with xargs -0:

grep "test" file* -lZ | xargs -0 rm

Combined application: find all the SQL statements with a WHERE condition in the logs:

cat LOG.* | tr a-z A-Z | grep "FROM " | grep "WHERE" > b

Searching for Chinese text: suppose the project directory contains files in both utf-8 and gb2312 encodings, and the word to search for is "中文";

  1. Look up its encodings: the utf-8 encoding of "中文" is E4B8ADE69687 and its gb2312 encoding is D6D0CEC4

  2. Query:

    grep -rnP "\xE4\xB8\xAD\xE6\x96\x87|\xD6\xD0\xCE\xC4" *
    

Chinese character code query: http://bm.kdd.cc/

1.3. xargs command line argument conversion

xargs converts its input into command-line arguments for a given command; in this way it can be combined with many other commands, such as grep and find.

  • Convert multi-line output into a single line:

cat file.txt | xargs

'\n' is the delimiter between the lines of text;

  • Convert single line to multiline output
cat single.txt | xargs -n 3

-n: specifies the number of arguments per output line

xargs parameter description

  • -d defines the input delimiter (the default is whitespace; '\n' is the delimiter between lines)
  • -n specifies how many arguments go into each invocation, splitting the input across multiple command lines
  • -I {} specifies a replacement string that is substituted with each argument when xargs expands; used when the target command needs its arguments in specific positions
  • -0 uses '\0' (NUL) as the input delimiter

Example:

cat file.txt | xargs -I {} ./command.sh -p {} -1

# count the lines of source code
find source_dir/ -type f -name "*.cpp" -print0 |xargs -0 wc -l

# redis stores values as strings and indexes as sets; query all values via the index:
./redis-cli smembers $1  | awk '{print $1}'|xargs -I {} ./redis-cli get {}

1.4. sort sorting

Option description

  • -n sorts numerically vs -d sorts lexicographically
  • -r sort in reverse order
  • -k N specifies to sort by the Nth column

Example:

sort -nrk 1 data.txt
sort -bd data // ignore leading blanks such as spaces and sort in dictionary order
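A quick sketch of -nrk with made-up data:

$cat data.txt
3 apple
10 banana
2 cherry
$sort -nrk 1 data.txt
10 banana
3 apple
2 cherry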

1.5. uniq eliminates duplicate lines

  • Eliminate duplicate rows
sort unsort.txt | uniq
  • Count the number of times each line occurs in a file
sort unsort.txt | uniq -c
  • find duplicate rows
sort unsort.txt | uniq -d

You can restrict which part of each line is compared for duplicates: -s skips the first N characters, -w compares at most the first N characters (after any skip).
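For example, a sketch that skips the first character of each line and compares only the next 3 when detecting duplicates:

sort unsort.txt | uniq -s 1 -w 3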

1.6. Transformation with tr

  • General usage
echo 12345 | tr '0-9' '9876543210' // toy encryption: replace each digit with its counterpart
cat text| tr '\t' ' '  // convert tabs to spaces
  • tr deletes characters
cat file | tr -d '0-9' // delete all digits

-c complement set

cat file | tr -cd '0-9' // keep only the digits in the file (delete the complement of '0-9')
cat file | tr -d -c '0-9 \n'  // delete everything that is not a digit, space, or newline
  • tr squeezes characters

tr -s squeezes runs of repeated characters in the input; it is most commonly used to squeeze redundant whitespace:

cat file | tr -s ' '
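For example:

$echo "a    b  c" | tr -s ' '
a b c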
  • character classes

Various character classes are available in tr:

    alnum: letters and digits
    alpha: letters
    digit: digits
    space: whitespace characters
    lower: lowercase letters
    upper: uppercase letters
    cntrl: control (non-printing) characters
    print: printable characters

Usage: tr '[:class:]' '[:class:]'

tr '[:lower:]' '[:upper:]'
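For example, converting to uppercase:

$echo "hello world" | tr '[:lower:]' '[:upper:]'
HELLO WORLD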

1.7. cut Split text by columns

  • Capture the 2nd and 4th columns of the file
cut -f2,4 filename
  • Print all columns of the file except the 3rd
cut -f3 --complement filename
  • -d specifies the delimiter
cut -f2 -d";" filename
    • Ranges accepted by cut

      N-   from the Nth field to the end
      -M   from the first field to the Mth
      N-M  from the Nth to the Mth field

    • Units of cut

      -b in bytes  -c in characters  -f in fields (split by a delimiter)

Example:

cut -c1-5 file // print characters 1 through 5
cut -c-2 file  // print the first 2 characters

Extract the 5th to 7th characters of a string:

$echo string | cut -c5-7
ing

1.8. paste Concatenate text by column

Stitch two texts together column by column;

cat file1
1
2

cat file2
colin
book

paste file1 file2
1 colin
2 book

The default delimiter is tab, you can use -d to specify the delimiter:

paste file1 file2 -d ","
1,colin
2,book

1.9. wc tool for counting lines and characters

$wc -l file // count lines

$wc -w file // count words

$wc -c file // count characters

1.10. sed text replacement tool

  • first replacement
sed 's/text/replace_text/' file   // replace the first match of text on each line
  • global replacement
sed 's/text/replace_text/g' file

By default the result of the substitution is printed to stdout; to modify the original file in place, use -i:

sed -i 's/text/replace_text/g' file
  • remove blank lines
sed '/^$/d' file
  • matched string token (&)

The matched string is referenced with the token &:

echo this is an example | sed 's/\w\+/[&]/g'
$>[this] [is] [an] [example]
  • substring match token

The content of the first matched pair of parentheses is referenced with the token \1:

sed 's/hello\([0-9]\)/\1/'
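A worked example with a made-up input string:

$echo hello7world | sed 's/hello\([0-9]\)/\1/'
7world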
  • double quote evaluation

sed expressions are usually written in single quotes; double quotes can also be used, in which case the shell expands variables inside them before sed runs:

sed 's/$var/HELLO/'   // with single quotes, $var here is matched literally

When using double quotes, we can use shell variables in the sed pattern and replacement strings;

eg:
p=pattern
r=replaced
echo "line contains a pattern" | sed "s/$p/$r/g"
$>line contains a replaced
  • other examples

String insertion character: Convert each line of text (ABCDEF) to ABC/DEF:

sed 's/^.\{3\}/&\//g' file
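For example:

$echo ABCDEF | sed 's/^.\{3\}/&\//g'
ABC/DEF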

1.11. awk data stream processing tool

  • awk script structure
awk ' BEGIN{ statements } statements2 END{ statements } '
  • Way of working

1. Execute the BEGIN{ statements } block;

2. Read a line from the file or stdin and execute statements2; repeat until the entire input has been read;

3. Execute the END{ statements } block;

print: prints the current line

  • When using print without parameters, the current line is printed
echo -e "line1\nline2" | awk 'BEGIN{print "start"} {print } END{ print "End" }'
  • When the arguments to print are separated by commas, they are printed delimited by spaces;
echo | awk ' {var1 = "v1" ; var2 = "V2"; var3="v3"; \
print var1, var2 , var3; }'
$>v1 V2 v3
  • Strings placed next to each other are concatenated (here "-" is used as the joining string);
echo | awk ' {var1 = "v1" ; var2 = "V2"; var3="v3"; \
print var1"-"var2"-"var3; }'
$>v1-V2-v3

Special variables: NR NF $0 $1 $2

NR: Indicates the number of records, corresponding to the current line number during execution;

NF: Indicates the number of fields, which always corresponds to the number of fields in the current row during execution;

$0: This variable contains the text content of the current line during execution;

$1: the text content of the first field;

$2: the text content of the second field;

echo -e "line1 f2 f3\n line2 \n line 3" | awk '{print NR":"$0"-"$1"-"$2}'
  • print the second and third fields of each line
awk '{print $2, $3}' file
  • Count the number of lines in the file
awk ' END {print NR}' file
  • Accumulates the first field of each row
echo -e "1\n 2\n 3\n 4\n" | awk 'BEGIN{num = 0 ;
print "begin";} {sum += $1;} END {print "=="; print sum }'

pass external variables

var=1000
echo | awk '{print vara}' vara=$var  # input from stdin
awk '{print vara}' vara=$var file    # input from a file
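Modern awk also supports -v for the same purpose; unlike the trailing-assignment form, the variable is already visible in the BEGIN block:

awk -v vara="$var" 'BEGIN{print vara}'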

Use patterns to filter lines processed by awk

awk 'NR < 5'                   # lines whose line number is less than 5
awk 'NR==1,NR==4 {print}' file # print lines 1 through 4 (a range pattern)
awk '/linux/'                  # lines containing the text linux (a regex can be used here, very powerful)
awk '!/linux/'                 # lines not containing the text linux

set delimiter

Use -F to set the delimiter (space is the default):

awk -F: '{print $NF}' /etc/passwd

read command output

Using getline, read the output of an external shell command into the variable cmdout:

echo | awk '{"grep root /etc/passwd" | getline cmdout; print cmdout }'

Using loops in awk

for(i=0;i<10;i++){print $i;}
for(i in array){print array[i];}

eg: given the following string, print out the time portion:

2015_04_02 20:20:08: mysqli connect failed, please check connect info
$echo '2015_04_02 20:20:08: mysqli connect failed, please check connect info'|awk -F ":" '{ for(i=1;i<=3;i++) printf("%s:",$i)}'
>2015_04_02 20:20:08:  # this prints the trailing colon as well
$echo '2015_04_02 20:20:08: mysqli connect failed, please check connect info'|awk -F':' '{print $1 ":" $2 ":" $3; }'
>2015_04_02 20:20:08   # this meets the requirement

If the remainder needs to be printed as well (the time part printed separately from the rest):

$echo '2015_04_02 20:20:08: mysqli connect failed, please check connect info'|awk -F':' '{print $1 ":" $2 ":" $3; print $4;}'
>2015_04_02 20:20:08
>mysqli connect failed, please check connect info

Print lines in reverse order: (implementation of the tac command):

seq 9| \
awk '{lifo[NR] = $0; lno=NR} \
END{ for(;lno>0;lno--){print lifo[lno];}
} '

Combining grep and awk to find the processes of a given service and kill them:

ps -fe| grep msv8 | grep -v MFORWARD | awk '{print $2}' | xargs kill -9;

Awk implements head and tail commands

  • head
awk 'NR<=10{print}' filename
  • tail
awk '{buffer[NR%10] = $0;} END{ start = NR>10 ? NR-9 : 1; \
for(i=start; i<=NR; i++){ print buffer[i%10] } }' filename

print the specified column

  • Awk way to achieve
ls -lrt | awk '{print $6}'
  • cut way to achieve (cut splits on single delimiter characters, so squeeze the padding spaces first)
ls -lrt | tr -s ' ' | cut -d' ' -f6

Print the specified text area

  • Determine the line number
seq 100| awk 'NR==4,NR==6{print}'
  • Determine by text pattern

Print the text between start_pattern and end_pattern:

awk '/start_pattern/, /end_pattern/' filename

Example:

seq 100 | awk '/13/,/15/'
cat /etc/passwd| awk '/mai.*mail/,/news.*news/'

Awk commonly used built-in functions

index(string, search_string): returns the position where search_string appears in string

sub(regex, replacement_str, string): Replace the first content matched by the regular expression with replacement_str;

match(string, regex): checks whether the regular expression matches the string; it returns the position of the match (0 if none) and sets RSTART and RLENGTH;

length(string): returns the length of the string

echo | awk '{"grep root /etc/passwd" | getline cmdout; print length(cmdout) }'

printf is similar to printf in the C language and formats the output:

seq 10 | awk '{printf "->%4s\n", $1}'
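%4s right-aligns each value in a 4-character field:

->   1
->   2
...
->  10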

1.12. Iterating over lines, words, and characters in a file

1. Iterate over each line in the file

  • while loop method
while read line;
do
echo $line;
done < file.txt

As a subshell instead:
cat file.txt | (while read line;do echo $line;done)
  • awk method
cat file.txt| awk '{print}'

2. Iterate over each word in a line

for word in $line;
do
echo $word;
done
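Putting the two loops together, a minimal sketch that echoes every word of every line in file.txt:

while read -r line; do
    for word in $line; do
        echo "$word"
    done
done < file.txt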

3. Iterate over each character

${string:start_pos:num_of_chars}: extract a substring from a string (bash text slicing);

${#word}: returns the length of the variable word

for((i=0;i<${#word};i++))
do
echo ${word:i:1};
done

Display files in ASCII characters:

$od -c filename


Origin blog.csdn.net/wgzblog/article/details/107863228