awk overview

VARIABLES, RECORDS AND FIELDS

    AWK variables are dynamic; they come into existence when they are first used. Their values are either floating-point numbers or strings, or both, depending upon how they are used. AWK also has one dimensional arrays; arrays with multiple dimensions may be simulated. Several pre-defined variables are set as a program runs; these are described as needed and summarized below.
 
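    awk variables need no declaration; they are created the first time they are used. Typing is dynamic: a value is a floating-point number, a string, or both. One-dimensional arrays are also supported. The pre-defined variables are covered later.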
 
   Records
    Normally, records are separated by newline characters. You can control how records are separated by assigning values to the built-in variable RS. If RS is any single character, that character separates records. Otherwise, RS is a regular expression. Text in the input that matches this regular expression separates the record. However, in compatibility mode, only the first character of its string value is used for separating records. If RS is set to the null string, then records are separated by blank lines. When RS is set to the null string, the newline character always acts as a field separator, in addition to whatever value FS may have.
 
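    awk is a line-oriented language. Each line of input is called a record. RS controls how records are separated; by default RS is the newline character. If RS is not a single character, it is treated as a regular expression, and text matching that expression separates records. In compatibility mode, only the first character of the string is used as the record separator. If RS is the null string, blank lines separate records; in addition, the newline character always acts as a field separator, whatever the value of FS.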
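
    The two RS behaviours described above can be tried with a short sketch. The function names and the sample input are made up for illustration; only features common to POSIX awk implementations are used, so the regular-expression form of RS is not shown.

```shell
# RS set to a single character: that character separates records.
split_on_semicolon() {
  printf 'x;y;z' | awk 'BEGIN { RS = ";" } { print NR, $0 }'
}

# RS set to the null string: blank lines separate records, and a
# newline inside a record also acts as a field separator.
count_paragraph_fields() {
  printf 'a b\nc\n\nd e\n' | awk 'BEGIN { RS = "" } { print NR ": " NF }'
}

split_on_semicolon
# prints:
# 1 x
# 2 y
# 3 z

count_paragraph_fields
# prints:
# 1: 3
# 2: 2
```

    In the second call the first record is "a b" followed by "c" on the next line, which is why it has three fields even though FS keeps its default value.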
 
   Fields
       As each input record is read, gawk splits the record into fields, using the value of the FS variable as the field separator. If FS is a single character, fields are separated by that character. If FS is the null string, then each individual character becomes a separate field. Otherwise, FS is expected to be a full regular expression. In the special case that FS is a single space, fields are separated by runs of spaces and/or tabs and/or newlines. (But see the section POSIX COMPATIBILITY, below). NOTE: The value of IGNORECASE (see below) also affects how fields are split when FS is a regular expression, and how records are separated when RS is a regular expression.
       If the FIELDWIDTHS variable is set to a space separated list of numbers, each field is expected to have fixed width, and gawk splits up the record using the specified widths. The value of FS is ignored. Assigning a new value to FS or FPAT overrides the use of FIELDWIDTHS.
       Similarly, if the FPAT variable is set to a string representing a regular expression, each field is made up of text that matches that regular expression. In this case, the regular expression describes the fields themselves, instead of the text that separates the fields. Assigning a new value to FS or FIELDWIDTHS overrides the use of FPAT.
       Each field in the input record may be referenced by its position, $1, $2, and so on.  $0 is the whole record.  Fields need not be referenced by constants:
              n = 5
              print $n
       prints the fifth field in the input record.
 
       The variable NF is set to the total number of fields in the input record.
       References to non-existent fields (i.e. fields after $NF) produce the null-string. However, assigning to a non-existent field (e.g., $(NF+2) = 5) increases the value of NF, creates any intervening fields with the null string as their value, and causes the value of $0 to be recomputed, with the fields being separated by the value of OFS. References to negative numbered fields cause a fatal error. Decrementing NF causes the values of fields past the new value to be lost, and the value of $0 to be recomputed, with the fields being separated by the value of OFS.
       Assigning a value to an existing field causes the whole record to be rebuilt when $0 is referenced. Similarly, assigning a value to $0 causes the record to be resplit, creating new values for the fields.
 
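    awk splits each line (record) into parts using the field separator (FS). If FS is a single character, that character is the field separator. If FS is the null string, every character becomes a field of its own. If FS is a single space, runs of spaces and/or tabs and/or newlines separate the fields.
    Each field in a record is referenced as $1, $2, and so on. $0 is the whole line.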
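
    A small sketch of field references and of assigning past $NF, as described above. The function names and input lines are hypothetical, chosen only for illustration.

```shell
# Split on ":" and print the first and the last field.
first_and_last() {
  printf 'root:x:0:0\n' | awk -F: '{ print $1, $NF }'
}

# Assigning to $(NF + 2) grows NF, creates an empty intervening
# field, and rebuilds $0 using OFS as the separator.
extend_record() {
  printf 'a b\n' | awk 'BEGIN { OFS = "-" } { $(NF + 2) = "z"; print NF; print $0 }'
}

first_and_last
# prints: root 0

extend_record
# prints:
# 4
# a-b--z
```

    The doubled "-" in the last line is the empty third field that the assignment created.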
 
   Built-in Variables
       Gawk's built-in variables are:
       ARGC             The number of command line arguments (does not include options to gawk, or the program source).
       ARGIND         The index in ARGV of the current file being processed.
       ARGV             Array of command line arguments.  The array is indexed from 0 to ARGC - 1.  Dynamically changing the contents of ARGV can control the files used for data.
       BINMODE     On non-POSIX systems, specifies use of “binary” mode for all file I/O. Numeric values of 1, 2, or 3, specify that input files, output files, or all files, respectively, should use binary I/O. String values of "r", or "w" specify that input files, or output files, respectively, should use binary I/O. String values of "rw" or "wr" specify that all files should use binary I/O. Any other string value is treated as "rw", but generates a warning message.
       CONVFMT     The conversion format for numbers, "%.6g", by default.
       ENVIRON       An array containing the values of the current environment.  The array is indexed by the environment variables, each element being the value of that variable (e.g., ENVIRON["HOME"] might be /home/arnold).  Changing this array does not affect the environment seen by programs which gawk spawns via redirection or the system() function.
       ERRNO           If a system error occurs either doing a redirection for getline, during a read for getline, or during a close(), then ERRNO will contain a string describing the error. The value is subject to translation in non-English locales.
       FIELDWIDTHS A whitespace separated list of field widths. When set, gawk parses the input into fields of fixed width, instead of using the value of the FS variable as the field separator. See Fields, above.
       FILENAME       The name of the current input file. If no files are specified on the command line, the value of FILENAME is “-”. However, FILENAME is undefined inside the BEGIN block (unless set by getline).
       FNR                 The input record number in the current input file. (In an END rule it gives the number of lines in the last file read.)
       FPAT                A regular expression describing the contents of the fields in a record. When set, gawk parses the input into fields, where the fields match the regular expression, instead of using the value of the FS variable as the field separator. See Fields, above.
       FS                   The input field separator, a space by default.  See Fields, above.
 
       IGNORECASE  Controls the case-sensitivity of all regular expression and string operations. If IGNORECASE has a non-zero value, then string comparisons and pattern matching in rules, field splitting with FS and FPAT, record separating with RS, regular expression matching with ~ and !~, and the gensub(), gsub(), index(), match(), patsplit(), split(), and sub() built-in functions all ignore case when doing regular expression operations. NOTE: Array subscripting is not affected. However, the asort() and asorti() functions are affected. Thus, if IGNORECASE is not equal to zero, /aB/ matches all of the strings "ab", "aB", "Ab", and "AB". As with all AWK variables, the initial value of IGNORECASE is zero, so all regular expression and string operations are normally case-sensitive.
       LINT        Provides dynamic control of the --lint option from within an AWK program. When true, gawk prints lint warnings. When false, it does not. When assigned the string value "fatal", lint warnings become fatal errors, exactly like --lint=fatal. Any other true value just prints warnings.
       NF          The number of fields in the current input record. (Use it to get the field count of each line.)
       NR          The total number of input records seen so far.
       OFMT      The output format for numbers, "%.6g", by default.
       OFS         The output field separator, a space by default.
       ORS         The output record separator, by default a newline.
 
       PROCINFO    The elements of this array provide access to information about the running AWK program.  On some systems, there may be elements in the array, "group1" through "groupn" for some n, which is the number of supplementary groups that the process has.  Use the in operator to test for these elements.  The following elements are guaranteed to be available:
                   PROCINFO["egid"]    the value of the getegid(2) system call.
                   PROCINFO["strftime"]
                                       The default time format string for strftime().
                   PROCINFO["euid"]    the value of the geteuid(2) system call.
                   PROCINFO["FS"]      "FS" if field splitting with FS is in effect, "FPAT" if field splitting with FPAT is in effect, or "FIELDWIDTHS" if field splitting with FIELDWIDTHS is in effect.
                   PROCINFO["gid"]     the value of the getgid(2) system call.
                   PROCINFO["pgrpid"]  the process group ID of the current process.
                   PROCINFO["pid"]     the process ID of the current process.
                   PROCINFO["ppid"]    the parent process ID of the current process.
                   PROCINFO["uid"]     the value of the getuid(2) system call.
                   PROCINFO["sorted_in"]
                                       If this element exists in PROCINFO, then its value controls the order in which array elements are traversed in for loops. Supported values are "@ind_str_asc", "@ind_num_asc", "@val_type_asc", "@val_str_asc", "@val_num_asc", "@ind_str_desc", "@ind_num_desc", "@val_type_desc", "@val_str_desc", "@val_num_desc", and "@unsorted". The value can also be the name of any comparison function defined as follows:
                                              function cmp_func(i1, v1, i2, v2)
                                       where i1 and i2 are the indices, and v1 and v2 are the corresponding values of the two elements being compared. It should return a number less than, equal to, or greater than 0, depending on how the elements of the array are to be ordered.
                   PROCINFO["version"]
                          the version of gawk.
 
       RS          The input record separator, by default a newline.
       RT          The record terminator.  Gawk sets RT to the input text that matched the character or regular expression specified by RS.
       RSTART      The index of the first character matched by match(); 0 if no match.  (This implies that character indices start at one.)
       RLENGTH     The length of the string matched by match(); -1 if no match.
       SUBSEP      The character used to separate multiple subscripts in array elements, by default "\034".
       TEXTDOMAIN  The text domain of the AWK program; used to find the localized translations for the program's strings.
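
       A brief sketch of a few of the variables listed above (NR, OFS, ORS, RSTART, RLENGTH). It sticks to variables that also exist in POSIX awk; the gawk-only ones such as PROCINFO, FPAT and RT are not exercised. The function names and inputs are illustrative only.

```shell
# OFS joins the items of a print statement and the fields of a
# rebuilt record; ORS ends each print. "$1 = $1" forces $0 to be
# rebuilt with the new OFS.
rejoin_records() {
  printf 'a b\nc d\n' | awk 'BEGIN { OFS = ","; ORS = ";" } { $1 = $1; print NR, $0 }'
}

# match() sets RSTART and RLENGTH; character indices start at 1.
locate_match() {
  awk 'BEGIN { if (match("foobar", /o+/)) print RSTART, RLENGTH }'
}

rejoin_records
# prints (no trailing newline): 1,a,b;2,c,d;

locate_match
# prints: 2 2
```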
 
   Arrays
       Arrays are subscripted with an expression between square brackets ([ and ]). If the expression is an expression list (expr, expr ...) then the array subscript is a string consisting of the concatenation of the (string) value of each expression, separated by the value of the SUBSEP variable. This facility is used to simulate multiply dimensioned arrays. For example:
              i = "A"; j = "B"; k = "C"
              x[i, j, k] = "hello, world\n"
       assigns the string "hello, world\n" to the element of the array x which is indexed by the string "A\034B\034C".  All arrays in AWK are associative, i.e. indexed by string values.
       The special operator in may be used to test if an array has an index consisting of a particular value:
              if (val in array)
                   print array[val]
       If the array has multiple subscripts, use (i, j) in array.
       The in construct may also be used in a for loop to iterate over all the elements of an array.
       An element may be deleted from an array using the delete statement.  The delete statement may also be used to delete the entire contents of an array, just by specifying the array name without a subscript.
 
       gawk supports true multidimensional arrays. It does not require that such arrays be ``rectangular'' as in C or C++.  For example:
              a[1] = 5
              a[2][1] = 6
              a[2][2] = 7
 
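    An array is indexed by an expression placed between a pair of square brackets. If that expression is an expression list (expr, expr ...), the index is the single string formed by joining the value of each expr with SUBSEP. Standard awk supports only one-dimensional arrays, so a subscript with one expression gives an ordinary one-dimensional array, while a subscript with an expression list simulates a multidimensional one. In C, array indices are numbers, so awk arrays are hard to understand from a C point of view. Thinking of them as key-value arrays, as in PHP, makes them easy to understand.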
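
    The SUBSEP mechanism, the in operator and delete can be seen together in a short sketch. The function name and the array contents are made up; only the simulated (SUBSEP) form is shown, not gawk's true arrays of arrays.

```shell
subsep_demo() {
  awk 'BEGIN {
    x["A", "B"] = "hello"                  # the key is "A" SUBSEP "B"
    if (("A", "B") in x) print "found"     # membership test, two subscripts
    for (k in x) {                         # k is the joined key string
      split(k, part, SUBSEP)               # recover the two subscripts
      print part[1], part[2]
    }
    delete x["A", "B"]                     # remove the only element
    n = 0
    for (k in x) n++
    print n                                # number of elements left
  }'
}

subsep_demo
# prints:
# found
# A B
# 0
```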
 
   Variable Typing And Conversion
       Variables and fields may be (floating point) numbers, or strings, or both. How the value of a variable is interpreted depends upon its context. If used in a numeric expression, it will be treated as a number; if used as a string it will be treated as a string.
       To force a variable to be treated as a number, add 0 to it; to force it to be treated as a string, concatenate it with the null string.
       When a string must be converted to a number, the conversion is accomplished using strtod(3). A number is converted to a string by using the value of CONVFMT as a format string for sprintf(3), with the numeric value of the variable as the argument. However, even though all numbers in AWK are floating-point, integral values are always converted as integers. Thus, given
              CONVFMT = "%2.2f"
              a = 12
              b = a ""
       the variable b has a string value of "12" and not "12.00".
       NOTE: When operating in POSIX mode (such as with the --posix command line option), beware that locale settings may interfere with the way decimal numbers are treated: the decimal separator of the numbers you are feeding to gawk must conform to what your locale would expect, be it a comma (,) or a period (.).
       Gawk performs comparisons as follows: If two variables are numeric, they are compared numerically. If one value is numeric and the other has a string value that is a “numeric string,” then comparisons are also done numerically. Otherwise, the numeric value is converted to a string and a string comparison is performed. Two strings are compared, of course, as strings.
       Note that string constants, such as "57", are not numeric strings, they are string constants. The idea of “numeric string” only applies to fields, getline input, FILENAME, ARGV elements, ENVIRON elements and the elements of an array created by split() or patsplit() that are numeric strings. The basic idea is that user input, and only user input, that looks numeric, should be treated that way.
       Uninitialized variables have the numeric value 0 and the string value "" (the null, or empty, string).
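
       The conversion and comparison rules above can be checked with a small sketch. The function names and sample values are hypothetical; the second function repeats the CONVFMT example from the text and adds a non-integral value for contrast.

```shell
# Fields that look numeric are "numeric strings" and compare as
# numbers; string constants always compare as strings.
compare_demo() {
  printf '10 9\n' | awk '{ print ($1 > $2), ("10" > "9") }'
}

# Integral values ignore CONVFMT when converted to a string;
# non-integral values are formatted with it.
convfmt_demo() {
  awk 'BEGIN { CONVFMT = "%2.2f"; a = 12; b = a ""; c = 3.14159; d = c ""; print b; print d }'
}

compare_demo
# prints: 1 0

convfmt_demo
# prints:
# 12
# 3.14
```

       In compare_demo the parentheses matter: without them, ">" inside a print statement would be read as output redirection.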

Reposted from www.cnblogs.com/black-mamba/p/8881915.html