Linux provides a large number of convenient command-line tools through the shell, such as awk, grep, sort, more, less, and tail, which let programmers and data analysts quickly analyze small files. Mastering these tools can greatly improve the efficiency of simple data analysis.
1. Common skills and methods of awk
-
Deduplicate the lines of a file by the second column, and print each distinct value of the second column together with its number of occurrences:
awk -F"\t" '{a[$2]+=1}END{for(x in a) print x"\t"a[x]}' a
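As a quick check, the sketch below runs the same command on a small made-up sample. The file contents and the temporary path are assumptions for illustration, and the output is piped through sort because awk's for-in loop visits keys in an unspecified order.

```shell
# Build a small tab-separated sample file (contents are made up for illustration).
tmp=$(mktemp)
printf '1\tbeijing\n2\tshanghai\n3\tbeijing\n4\tshenzhen\n5\tbeijing\n' > "$tmp"

# Count how often each value of column 2 occurs; sort makes the output order stable.
counts=$(awk -F"\t" '{a[$2]+=1}END{for(x in a) print x"\t"a[x]}' "$tmp" | sort)
printf '%s\n' "$counts"

rm -f "$tmp"
```

Each output line holds a distinct second-column value, a tab, and its count.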
-
Find the intersection of the first column of two files:
Suppose the files are named a and b, and the fields on each line are separated by \t
awk -F"\t" 'ARGIND==1{a[$1]=1}ARGIND==2{if($1 in a) print}' a b
Where:
-F specifies the field delimiter for each line
ARGIND==1 is true while the first file is being processed (ARGIND is a gawk extension, so this command requires gawk)
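For awk implementations other than gawk, the same intersection can be sketched with the NR==FNR idiom. This is a swapped-in variant, not the command above; the sample files are hypothetical, and the idiom assumes the first file is not empty.

```shell
# Two hypothetical tab-separated files; only the first column matters here.
dir=$(mktemp -d)
printf 'k1\tx\nk2\ty\nk3\tz\n' > "$dir/a"
printf 'k2\tp\nk4\tq\nk3\tr\n' > "$dir/b"

# NR==FNR holds only while the first file is being read:
# remember its keys, then print the lines of the second file whose key was seen.
common=$(awk -F"\t" 'NR==FNR{seen[$1]=1; next} ($1 in seen)' "$dir/a" "$dir/b")
printf '%s\n' "$common"

rm -rf "$dir"
```

The printed lines come from the second file, in the order they appear there.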
2. Common skills and methods of grep
-
Print out the lines in the file that contain china:
grep "china" a
Print out the lines in the file that start with the china string:
grep "^china" a
-
Print out the strings in the file that appear between abc and xx:
grep -o -P "(?<=abc).*?(?=xx)" a
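A small sketch of how the lookbehind and lookahead behave, assuming GNU grep built with PCRE support (-P); the sample line is invented for illustration.

```shell
# Sample line invented for illustration; it contains two abc...xx spans.
line='abcHELLOxx and abcWORLDxx'

# -o prints only the matched part; -P enables Perl-style lookarounds (GNU grep).
# The non-greedy .*? stops at the first xx, so each span is reported separately.
matches=$(printf '%s\n' "$line" | grep -o -P "(?<=abc).*?(?=xx)")
printf '%s\n' "$matches"
```

Because the lookarounds are zero-width, abc and xx themselves are not part of the printed text.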
3. Common skills and methods of sort command
-
Sort the file first by the second column, then by the third column, with columns separated by a space (the old +POS -POS key syntax is obsolete and is rejected by current versions of sort, so use -k):
sort -t' ' -k2,2 -k3,3 file
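The two-key ordering can be tried on a small sample with sort's -k option, where -kN,N restricts a key to column N. The data below is made up for illustration.

```shell
# Space-separated sample rows (made up for illustration).
tmp=$(mktemp)
printf 'a 2 9\nb 1 5\nc 2 3\nd 1 7\n' > "$tmp"

# Primary key: column 2; ties are broken by column 3.
sorted=$(sort -t' ' -k2,2 -k3,3 "$tmp")
printf '%s\n' "$sorted"

rm -f "$tmp"
```

Both keys are compared as text here; appending n to a key (for example -k2,2n) compares it numerically, which matters once values have more than one digit.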
4. Combination usage
-
Count the number of occurrences of each IP in the access log, and sort them by count in descending order (most frequent first):
awk '{a[$1]+=1}END{for(x in a) print x,a[x]}' file | sort -k2,2nr > file.output
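A minimal sketch of this pipeline on a hypothetical access log whose first field is the client IP. The log lines are invented, and the sort key -k2,2nr puts the highest count first; dropping the r would give ascending order instead.

```shell
# Hypothetical access log; only the first field (the IP) is used.
tmp=$(mktemp)
printf '%s\n' \
  '10.0.0.1 - - "GET /"' \
  '10.0.0.2 - - "GET /a"' \
  '10.0.0.1 - - "GET /b"' \
  '10.0.0.3 - - "GET /"' \
  '10.0.0.1 - - "GET /c"' \
  '10.0.0.2 - - "GET /d"' > "$tmp"

# Count requests per IP, then sort numerically on the count column, largest first.
top=$(awk '{a[$1]+=1}END{for(x in a) print x,a[x]}' "$tmp" | sort -k2,2nr)
printf '%s\n' "$top"

rm -f "$tmp"
```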
-
Count the number of lines in the file that contain china:
grep "china" file | wc -l
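grep can also produce this count by itself with -c. The sketch below compares the two forms on an invented sample; note that both count matching lines, not individual matches.

```shell
# Invented sample; the last line contains the word twice but is one line.
tmp=$(mktemp)
printf 'china east\nusa west\nsouth china\nchina china\n' > "$tmp"

via_wc=$(grep "china" "$tmp" | wc -l)   # pipeline form
via_wc=$((via_wc))                      # drop padding some wc versions add
via_c=$(grep -c "china" "$tmp")         # grep's built-in line count
echo "$via_wc $via_c"

rm -f "$tmp"
```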
-
Watch a continuously updating online log in real time and show only the lines containing warning:
tail -f file.log | grep "warning"
5. Other commonly used shell commands
-
Randomly sample 100 lines from a file A with 1,000,000 lines (shuf can read the file directly, so cat is not needed):
shuf -n 100 A
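A short sketch of the sampling step, assuming GNU coreutils (shuf and seq) are available. The input file is generated here as a stand-in for A.

```shell
# Generate a 1,000,000-line stand-in for file A.
tmp=$(mktemp)
seq 1 1000000 > "$tmp"

# Draw 100 lines at random, without repetition.
sample=$(shuf -n 100 "$tmp")
count=$(printf '%s\n' "$sample" | wc -l)
count=$((count))   # drop padding some wc versions add
echo "sampled $count lines"

rm -f "$tmp"
```

The selected lines differ from run to run, so only the line count is predictable.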
-
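Find the files ending in .log under the /home directory and add up their sizes as reported by du (the pattern is quoted so that the shell does not expand it, and -print0 with xargs -0 keeps file names containing spaces intact):
find /home -name "*.log" -print0 | xargs -0 du -s | awk '{print;sum+=$1}END{print sum}'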
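du reports disk usage in blocks rather than exact byte counts. When a byte total is wanted, one possible alternative (a swapped-in technique, not the du command above) is to concatenate the matched files and count the bytes. The directory tree below is hypothetical.

```shell
# Hypothetical tree with two .log files and one file that should be ignored.
dir=$(mktemp -d)
mkdir -p "$dir/app one"
printf 'hello\n' > "$dir/app one/a.log"      # 6 bytes
printf 'warning: x\n' > "$dir/b.log"         # 11 bytes
printf 'ignored\n' > "$dir/notes.txt"        # does not match *.log

# Quote the pattern so find, not the shell, expands it; -exec ... + handles
# file names with spaces without needing xargs.
total=$(find "$dir" -type f -name '*.log' -exec cat {} + | wc -c)
total=$((total))   # drop padding some wc versions add
echo "$total"

rm -rf "$dir"
```

Reading every file is slower than du on large trees, so this form is best suited to small sets of files.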
Delete all .svn directories under the /home directory:
find /home -name .svn -print0 | xargs -0 rm -rf
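Because rm -rf cannot be undone, it can help to list the matches first. The sketch below does a dry run and then the deletion on a hypothetical tree; the -type d and -prune tests are additions that restrict the match to directories and stop find from descending into them.

```shell
# Hypothetical project tree with nested .svn directories.
dir=$(mktemp -d)
mkdir -p "$dir/proj/.svn" "$dir/proj/src/.svn"
touch "$dir/proj/.svn/entries"
printf 'keep\n' > "$dir/proj/src/main.c"

# Dry run: print what would be removed.
find "$dir" -type d -name .svn -prune -print

# Actual deletion of the same set of directories.
find "$dir" -type d -name .svn -prune -exec rm -rf {} +

# Check that no .svn entries remain and ordinary files are untouched.
remaining=$(find "$dir" -name .svn | wc -l)
remaining=$((remaining))
kept=$(cat "$dir/proj/src/main.c")
echo "$remaining $kept"

rm -rf "$dir"
```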