Using the awk tool

AWK Introduction

AWK is a "pattern scanning and processing language". It lets you write short programs that read input files, sort and process data, perform calculations on the input, and generate reports. Its name comes from the initials of the surnames of its creators: Alfred Aho, Peter Weinberger, and Brian Kernighan.

The awk command discussed in this article mainly refers to the built-in program /bin/gawk widely included in Linux distributions, which is the GNU version of the Unix awk program. This command reads and runs programs written in the AWK language. On the Windows platform, you can use Cygwin to run awk in an emulated environment.

Basically, awk scans the input (standard input, or one or more files) for records (that is, lines of text) that match a specified pattern. Each time a match is found, the associated action (such as writing to standard output or an external file) is performed.

AWK language basics

To understand AWK programs, let us outline the basics of the language. An AWK program consists of one or more lines of text, the core of which is a combination of a pattern and an action:

 pattern { action }

The pattern is used to match each line of input text. For each matching line, awk executes the corresponding action. Braces separate the pattern from the action. Awk scans the input line by line, using a record separator (usually a newline character) to read each line as a record, and a field separator (usually a space or tab) to split each record into fields. The fields can be referenced as $1, $2, … $n: $1 is the first field, $2 the second, and $n the nth; $0 is the entire record. Either the pattern or the action may be omitted. If the pattern is omitted, every line matches; if the action is omitted, the default action {print} is executed, printing the entire record.
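For example, a minimal sketch of the pattern/action model (the file name data.txt and its contents are hypothetical):

     awk '$3 > 100 { print $1, $3 }' data.txt   # print fields 1 and 3 of lines whose third field exceeds 100
     awk '/error/' data.txt                     # action omitted: matching lines are printed in full
     awk '{ print $1 }' data.txt                # pattern omitted: the first field of every line is printed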

Use awk to break down the information in the log

Take the following log entry as an example:

 202.189.63.115 - - [31/Aug/2012:15:42:31 +0800] "GET / HTTP/1.1" 200 1365 "-" 
 "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0.1"
    • $0 is the entire record (line)
    • $1 is the client IP address, "202.189.63.115"
    • $4 is the first half of the request time, "[31/Aug/2012:15:42:31"
    • $5 is the second half of the request time, "+0800]"

And so on ...

Using the default field separator, we can parse the following kinds of information from the log:

     awk '{print $1}' access.log     # IP address (%h)
     awk '{print $2}' access.log     # RFC 1413 identity (%l)
     awk '{print $3}' access.log     # user ID (%u)
     awk '{print $4,$5}' access.log  # date and time (%t)
     awk '{print $7}' access.log     # request URI (from %r)
     awk '{print $9}' access.log     # status code (%>s)
     awk '{print $10}' access.log    # response size (%b)

It is not hard to see that with the default field separator alone, it is inconvenient to parse out other information such as the request line, the referring page, and the browser type, because those values contain a variable number of spaces. Therefore, we need to change the field separator to the double-quote character (") to read this information easily.

     awk -F\" '{print $2}' access.log   # request line (%r)
     awk -F\" '{print $4}' access.log   # referring page
     awk -F\" '{print $6}' access.log   # browser

Note: the double quote is written as \" with a backslash escape, so that the Unix/Linux shell does not interpret it as the start of a quoted string.
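Applied to the example log line above, splitting on the double quote gives:

    • $1 is everything before the request line (IP address, identity, user ID, and timestamp)
    • $2 is the request line, "GET / HTTP/1.1"
    • $4 is the referring page ("-" in this example)
    • $6 is the browser identification string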

Examples of awk usage scenarios

  • Counting browser types

If we want to know which types of browsers have visited the website, arranged in descending order of occurrence, we can use the following command:

     awk -F\" '{print $6}' access.log | sort | uniq -c | sort -fr

This command first extracts the browser field, then pipes the output to the first sort command. The first sort groups identical lines so that the uniq command can count the occurrences of each browser type. The final sort outputs those counts in descending order.
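The same statistic can also be computed inside awk itself with an associative array, skipping the first sort (a sketch equivalent to the pipeline above, not taken from the original article):

     awk -F\" '{count[$6]++} END {for (b in count) print count[b], b}' access.log | sort -nr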

  • Discover system problems

We can use the following command line to count the status codes returned by the server and find possible problems with the system.

     awk '{print $9}' access.log | sort | uniq -c | sort

Under normal circumstances, 200 and 30x status codes should be the most frequent; 40x generally indicates a client-side problem, while 50x generally indicates a server-side problem.
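To focus directly on problem responses, a variant of the command above that keeps only 4xx and 5xx status codes (a sketch building on the same $9 field):

     awk '($9 ~ /^[45]/) {print $9}' access.log | sort | uniq -c | sort -nr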

Here are some common status codes:

    • 200 - The request succeeded; the requested headers and data body are returned with this response.
    • 206 - The server successfully processed a partial GET request.
    • 301 - The requested resource has been permanently moved to a new location.
    • 302 - The requested resource temporarily responds to the request from a different URI.
    • 400 - Bad request; the server cannot understand the current request.
    • 401 - Unauthorized; the current request requires user authentication.
    • 403 - Forbidden; the server understood the request but refuses to execute it.
    • 404 - Not found; the requested resource does not exist on the server.
    • 500 - The server encountered an unexpected condition that prevented it from completing the request.
    • 503 - The server is temporarily unable to handle the request due to maintenance or overload.

For the definitions of HTTP status codes, please refer to: Hypertext Transfer Protocol - HTTP/1.1

Examples of awk commands for status codes:

1. Find and display all requests with status code 404

     awk '($9 ~ /404/)' access.log

2. Count all requests with status code 404

     awk '($9 ~ /404/)' access.log | awk '{print $9,$7}' | sort | uniq -c

3. Now suppose a request (for example, the URI /path/to/notfound) generates a large number of 404 errors. We can use the following command to find out which referring pages this request comes from, and from which browsers:

     awk -F\" '($2 ~ "^GET /path/to/notfound "){print $4,$6}' access.log
  • Track who is hotlinking your images

System administrators sometimes find that other websites are using images hosted on their own site. If you want to know who is using your website's images without authorization, you can use the following command:

     awk -F\" '($2 ~ /\.(jpg|gif|png)/ && $4 !~ /^http:\/\/www\.aaa\.com/) {print $4}' access.log \ | sort | uniq -c | sort

Note: before running the command, change www.aaa.com to your own website's domain name.

This command works as follows:

    • split each line into fields on the " character;
    • the request line must contain ".jpg", ".gif", or ".png";
    • the referring page must not start with your website's domain name (in this case, www.aaa.com);
    • show all referring pages and count their occurrences.
  • Using awk to filter IP address information from access.log

Count how many distinct IP addresses accessed the site:
  awk '{print $1}' access.log | sort | uniq | wc -l

Count how many pages each IP visited:
  awk '{++s[$1]} END {for (a in s) print a, s[a]}' access.log

Sort the number of pages visited by each IP in ascending order:
  awk '{++s[$1]} END {for (a in s) print s[a], a}' access.log | sort -n
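A variation on the same associative-array idea reports only heavy hitters, here IPs with more than 100 requests (the threshold of 100 is an arbitrary example):

  awk '{++s[$1]} END {for (a in s) if (s[a] > 100) print s[a], a}' access.log | sort -nr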

Check which pages a given IP (for example, 192.168.1.1) visited:
  grep ^192.168.1.1 access.log | awk '{print $1,$7}'

Count how many distinct IPs visited during a given hour (here 02:00 on 02/Feb/2020):
  awk '{print $4,$1}' access.log | grep "02/Feb/2020:02" | awk '{print $2}' | sort | uniq | wc -l
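The same count can be done entirely within awk by matching the timestamp field, avoiding the extra grep (the date string is the illustrative one from above):

  awk '$4 ~ /02\/Feb\/2020:02/ {print $1}' access.log | sort | uniq | wc -l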

Count the top ten most-visited IP addresses:
  awk '{print $1}' access.log | sort | uniq -c | sort -nr | head -10

  • Using awk to filter response size information from access.log

List the files with the largest transfer sizes:

     cat access.log |awk '{print $10 " " $1 " " $4 " " $7}'|sort -nr|head -100

List pages larger than 204800 bytes (200 KB) and the number of occurrences of each:

     cat access.log |awk '($10 > 204800){print $7}'|sort -n|uniq -c|sort -nr|head -100
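A related one-liner sums the total bytes transferred across all requests; awk's numeric coercion treats the "-" that %b writes for zero-byte responses as 0 (a sketch, assuming $10 is the size column as above):

     awk '{sum += $10} END {print sum}' access.log
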
  • Using awk to filter response-time information from access.log

If the last column of the log records the page transfer time (%T), for example by customizing the log format as:

     LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %T" combined

You can use the following command to list all log records with a response time of more than 3 seconds.

     awk '($NF > 3){print $0}' access.log

Note: NF is the number of fields in the current record, and $NF is the last field.
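Building on $NF, a sketch that computes the average response time across all requests (assuming %T is the last field, as in the LogFormat above):

     awk '{sum += $NF; n++} END {if (n > 0) print sum / n}' access.log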

List requests that exceed 5 seconds

     awk '($NF > 5){print $0}' access.log | awk -F\" '{print $2}' | sort -n | uniq -c | sort -nr | head -20




Reproduced from: https://blog.csdn.net/jack_shuai/article/details/102835569


