Array in awk

20130110

As a scripting language, awk supports two main kinds of variables: simple (scalar) variables and arrays. An awk array is different from the traditional arrays of C and Java; it is closer to the map in the C++ STL or the dict in Python. It is an associative array, which ties each key to a value. Neither keys nor values are restricted to one type, so numbers and strings can be mixed in the same array (although this is not often done). Variables in awk do not need to be declared before use. Whether a variable is a scalar or an array is fixed the first time it is used and cannot change afterwards, so if arr is first used as an array, it cannot later be used as a simple variable. Assuming array is an array variable, array[key] returns the stored value if key is in the array and an empty string if it is not. Note that merely referencing array[key] creates the element, so use the keyword in, as in (key in array), to test whether a key is present without adding it. The elements of an array can be traversed with for (key in array).
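A minimal sketch of the behaviour described above, which should run with any POSIX awk. The array name age and its contents are made up for illustration.

```shell
out=$(awk 'BEGIN {
    age["wang"] = 15            # string key, numeric value
    age[7] = "seven"            # numeric key, string value in the same array

    # "in" tests membership without creating the element
    if (!("lim" in age)) print "lim: absent"

    # Referencing a missing key yields "" and also creates the element
    if (age["lim"] == "") print "lim: empty string"
    if ("lim" in age) print "lim: now present"

    n = 0
    for (key in age) n++        # traversal order is unspecified
    print "elements: " n
}')
printf '%s\n' "$out"
```

The last line reports three elements, because the lookup of age["lim"] added an empty entry. This is why the in test is the safer way to check for a key.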

Sorting of arrays in awk

Because awk arrays are associative and implemented as hash tables, traversing one with for (key in array) does not visit the elements in key order; the order is unspecified. When the elements must be output in a particular order, the asort and asorti functions can help. Both are gawk extensions and are not available in every awk.

The prototype of asort is:
asort(array1 [, array2])
asort sorts the values of the array and returns the number of elements. With the simple form asort(array), the sorted data is stored back in array itself, but the association is lost: the original keys are replaced by the integers 1 to n. The following is a simple example:
"wang"   => 15                  1 => 11
"lim"    => 14                  2 => 14
"zhuang" => 19    asort() ->    3 => 15
"han"    => 33                  4 => 19
"feng"   => 11                  5 => 33
After sorting, the original association is gone: the keys are replaced by integer indexes, and the values appear in ascending order.
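The two-argument form avoids losing the original array. A small sketch, assuming gawk is installed (asort is a gawk extension); the destination array name sorted is chosen here for illustration.

```shell
out=$(gawk 'BEGIN {
    age["wang"] = 15; age["lim"] = 14; age["zhuang"] = 19
    age["han"] = 33;  age["feng"] = 11

    n = asort(age, sorted)      # sorted[1..n] receives the values; age is left intact
    for (i = 1; i <= n; i++)
        printf "%d => %d\n", i, sorted[i]
    print "wang is still " age["wang"]
}')
printf '%s\n' "$out"
```

The loop runs over the integer indexes 1 to n rather than using for (i in sorted), since only the former guarantees the sorted order.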

If you need to sort by key, use the asorti function. Its prototype is:
asorti(array1 [, array2])
asorti sorts the keys of the array and returns the number of elements. As with asort, in the simple form asorti(array) the sorted keys become the values and the original association is destroyed. Here is a simple example:
"wang"   => 15                   1 => feng
"lim"    => 14                   2 => han
"zhuang" => 19    asorti() ->    3 => lim
"han"    => 33                   4 => wang
"feng"   => 11                   5 => zhuang

To keep the original association, use the form asorti(array1, array2): the sorted keys are stored in array2 and array1 is left unchanged.
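As a short illustration of the two-argument form, the sketch below walks the sorted keys and looks each one up in the untouched source array. It assumes gawk; the names age and names are illustrative.

```shell
out=$(gawk 'BEGIN {
    age["wang"] = 15; age["lim"] = 14; age["zhuang"] = 19
    age["han"] = 33;  age["feng"] = 11

    n = asorti(age, names)      # names[1..n] = keys of age, sorted as strings
    for (i = 1; i <= n; i++)
        printf "%s => %d\n", names[i], age[names[i]]
}')
printf '%s\n' "$out"
```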
Consider the following problem. A file contains information about a group of people, in this format:
Name Age Height Weight
wanger 18 173 74
liusan 20 177 80
zhaojun 24 167 49
tianjing 30 179 75
haobo 28 171 65
We want to sort the entries by one of the fields, such as height. This can of course be done with the sort tool, but here we do it in awk. The following code sorts all entries by height:
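# Index each record by height (field 3); rows whose third field is not a number, such as a header, are skipped
$3 ~ /^[0-9]+$/ {
    array[$3] = $0          # two people with the same height would overwrite each other
}
END {
    n = asorti(array, height)       # height[1..n] = the keys of array in sorted order (gawk)
    for (i = 1; i <= n; i++) {      # walk the indexes in order; for (h in height) gives no guaranteed order
        print array[height[i]]
    }
}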
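One caveat: by default asorti compares keys as strings, which only matches numeric order when all the numbers have the same number of digits, as the heights above do. The sketch below shows the difference, assuming gawk 4.0 or later, where asorti accepts a third argument naming the comparison. The records, including the name xiaoming, are made up for illustration.

```shell
out=$(gawk 'BEGIN {
    rec[173] = "wanger"; rec[98] = "xiaoming"; rec[167] = "zhaojun"

    n = asorti(rec, by_str)                     # default: keys compared as strings
    for (i = 1; i <= n; i++) s = s " " by_str[i]

    n = asorti(rec, by_num, "@ind_num_asc")     # keys compared as numbers
    for (i = 1; i <= n; i++) t = t " " by_num[i]

    print "string order:" s
    print "numeric order:" t
}')
printf '%s\n' "$out"
```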

awk hash

Requirements:
There are two lists of files, a[n] and b[n]. Pick out the lines that appear in only one of the two lists and save them in separate files.

Method 1: read each line of a and grep for it in each file of b[n]. The idea is simple, but it is maddeningly slow.
Method 2: comm. I am not sure how it works internally, but it is also very slow...
Method 3: awk + hash (a bubble-style algorithm?)
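For comparison, Method 2 can be sketched as follows. comm walks two sorted files in parallel and prints three columns: lines only in the first file, lines only in the second, and lines in both. The options -1, -2 and -3 suppress the corresponding column. The file names and contents here are invented for the demonstration.

```shell
work=$(mktemp -d)
printf 'pear\napple\nfig\n' > "$work/a.txt"
printf 'fig\nkiwi\n'        > "$work/b.txt"

# comm requires both inputs to be sorted in the same collation order
LC_ALL=C sort "$work/a.txt" > "$work/a.sorted"
LC_ALL=C sort "$work/b.txt" > "$work/b.sorted"

only_a=$(LC_ALL=C comm -23 "$work/a.sorted" "$work/b.sorted")   # lines only in a
only_b=$(LC_ALL=C comm -13 "$work/a.sorted" "$work/b.sorted")   # lines only in b

printf 'only in a:\n%s\n' "$only_a"
printf 'only in b:\n%s\n' "$only_b"
rm -rf "$work"
```

The sorting step is part of the cost of this method: every file has to be sorted before comm can compare it.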

  Taking the lines that exist only in a[n] as an example: an outer loop takes each file $file_a in a[n], and an inner loop compares $file_a with each file $file_b in b[n]. On each pass, the lines that exist in $file_a but not in $file_b are written to $tmp, and $tmp then takes the place of $file_a for the search in the next $file_b. As more $file_b files are filtered, fewer and fewer lines remain in $tmp. Once every $file_b has been processed, all lines of $file_a that also appear in any $file_b are guaranteed to have been removed.

  Alas, real applications are not like programming at school. At school, any algorithm that produces the right result is a good algorithm; in practice, efficiency also has to be considered. With large amounts of data, the first version of my program took a week to run, while the third, optimized version finishes in a few hours.

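# a and b are assumed to be bash arrays holding the two lists of file names
tmp=tmp.$$                      # working copy of the lines still unique to a (example path)
new_in_a=new_in_a.$$            # output of each filtering pass (example path)
final_result=only_in_a.txt      # accumulated result (example path)

for file_a in "${a[@]}"
do
    cp "$file_a" "$tmp"
    for file_b in "${b[@]}"
    do
        [ -s "$tmp" ] || break  # nothing left to filter; an empty first file would also confuse the FNR counter
        awk '
        BEGIN { file = 0 }
        FNR == 1 { file++ }                         # a new input file starts
        file == 1 { hash[$0] = 1 }                  # first file: remember every line
        file == 2 { if ($0 in hash) hash[$0] = 0 }  # second file: mark lines seen in both
        END {
            for (key in hash)
                if (hash[key] == 1) print key       # keep lines found only in the first file
        }
        ' "$tmp" "$file_b" > "$new_in_a"
        mv "$new_in_a" "$tmp"
    done
    cat "$tmp" >> "$final_result"
done
rm -f "$tmp"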
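The awk step in the inner loop can be tried on its own with two small files. This is only a demonstration of that one filtering pass; the file contents are invented, and the result is piped through sort because for (key in hash) returns keys in no particular order.

```shell
work=$(mktemp -d)
printf 'pear\napple\nfig\n' > "$work/tmp"
printf 'fig\nkiwi\n'        > "$work/file_b"

remaining=$(awk '
    FNR == 1 { file++ }                         # count input files as they start
    file == 1 { hash[$0] = 1 }                  # lines of the first file
    file == 2 { if ($0 in hash) hash[$0] = 0 }  # cancel lines also found in the second
    END { for (key in hash) if (hash[key] == 1) print key }
' "$work/tmp" "$work/file_b" | LC_ALL=C sort)

printf '%s\n' "$remaining"
rm -rf "$work"
```

Note that the in test keeps kiwi out of the hash: a line that occurs only in the second file is never added, so it cannot appear in the output.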

Origin www.cnblogs.com/chanix/p/12738228.html