Detailed explanation of the sort command on Linux

Original address: http://www.cnblogs.com/51linux/archive/2012/05/23/2515299.html

 

sort is a very commonly used command in Linux, and its job is sorting. Concentrate, and you will have sort figured out in five minutes. Start now!

1 How sort works

 

sort treats each line of the file as a unit and compares the lines with one another. The comparison starts at the first character and moves rightward, comparing character codes (plain ASCII values in the C locale; other locales apply their own collation rules), and the lines are finally output in ascending order.

[rocrocket@rocrocket programming]$ cat seq.txt
banana
apple
pear
orange
[rocrocket@rocrocket programming]$ sort seq.txt
apple
banana
orange
pear
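
The character-by-character comparison is easiest to see when the locale is pinned down. The following is a minimal sketch with made-up input (not the article's seq.txt); it assumes a sort that honors LC_ALL, and forces the C locale so that lines are compared by raw byte value:

```shell
# In the C locale, uppercase "B" (byte 66) sorts before lowercase "a" (byte 97).
# The three words are illustrative input only.
sorted=$(printf 'apple\nBanana\ncherry\n' | LC_ALL=C sort)
printf '%s\n' "$sorted"
```

In many other locales the same input would come out with apple first, which is why scripts that need reproducible output often set LC_ALL=C.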

2 The -u option of sort

Its role is very simple: it removes duplicate lines from the output.

[rocrocket@rocrocket programming]$ cat seq.txt
banana
apple
pear
orange
pear
[rocrocket@rocrocket programming]$ sort seq.txt
apple
banana
orange
pear
pear
[rocrocket@rocrocket programming]$ sort -u seq.txt
apple
banana
orange
pear

The duplicate pear was mercilessly removed by the -u option.
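
When no -k keys are involved, sort -u should give the same lines as piping sort into uniq. A small sketch that checks this on the same fruit list (the variable names are just for illustration):

```shell
# Same five lines as seq.txt above, with pear duplicated.
input=$(printf 'banana\napple\npear\norange\npear\n')

with_u=$(printf '%s\n' "$input" | LC_ALL=C sort -u)
with_uniq=$(printf '%s\n' "$input" | LC_ALL=C sort | uniq)

printf '%s\n' "$with_u"
[ "$with_u" = "$with_uniq" ] && echo "same result"
```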

3 The -r option of sort

The default sorting method of sort is ascending order. If you want to change it to descending order, just add -r and you will be done.

[rocrocket@rocrocket programming]$ cat number.txt
1
3
5
2
4
[rocrocket@rocrocket programming]$ sort number.txt
1
2
3
4
5
[rocrocket@rocrocket programming]$ sort -r number.txt
5
4
3
2
1

4 The -o option of sort

By default sort writes its result to standard output, so you need redirection to save the result in a file, for example sort filename > newfile.

However, if you want to write the sorted result back to the original file, redirection will not work.

[rocrocket@rocrocket programming]$ sort -r number.txt > number.txt
[rocrocket@rocrocket programming]$ cat number.txt
[rocrocket@rocrocket programming]$
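Look, number.txt has actually been emptied. The shell truncates the redirection target before sort gets a chance to read it.

This is where the -o option comes in: it solves the problem and lets you write the result back to the original file with confidence, because sort reads all of its input before it opens the output file. This is perhaps the only advantage of -o over redirection.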

[rocrocket@rocrocket programming]$ cat number.txt
1
3
5
2
4
[rocrocket@rocrocket programming]$ sort -r number.txt -o number.txt
[rocrocket@rocrocket programming]$ cat number.txt
5
4
3
2
1
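
To try this without risking a real file, the experiment can be run on a temporary copy. This is only a sketch: it assumes mktemp is available, and the temporary file stands in for number.txt.

```shell
# Work on a throwaway file so nothing real is overwritten.
tmpfile=$(mktemp)
printf '1\n3\n5\n2\n4\n' > "$tmpfile"

# -o names the output file; sort finishes reading its input before it
# opens that file, so input and output may be the same path.
LC_ALL=C sort -r -o "$tmpfile" "$tmpfile"

in_place=$(cat "$tmpfile")
printf '%s\n' "$in_place"
rm -f "$tmpfile"
```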

5 The -n option of sort

Have you ever run into a situation where 10 comes before 2? I have. This happens because sort compares these numbers as strings of characters. It compares the first characters, 1 and 2, and since 1 is obviously smaller, 10 is placed before 2. This is sort's consistent style.

If we want to change this, we use the -n option to tell sort, "sort by numeric value"!

[rocrocket@rocrocket programming]$ cat number.txt
1
10
19
11
2
5
[rocrocket@rocrocket programming]$ sort number.txt
1
10
11
19
2
5
[rocrocket@rocrocket programming]$ sort -n number.txt
1
2
5
10
11
19
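
The -n option also understands a leading minus sign and a decimal point. A short sketch with made-up numbers, run in the C locale so that the decimal point is "." and the text comparison is by byte value:

```shell
numbers=$(printf '10\n-3\n2.5\n2\n')

# Compared as text: "-" sorts before the digits, and "10" before "2".
as_text=$(printf '%s\n' "$numbers" | LC_ALL=C sort)

# Compared as numbers: -3 < 2 < 2.5 < 10.
as_numbers=$(printf '%s\n' "$numbers" | LC_ALL=C sort -n)

printf 'text order:\n%s\n' "$as_text"
printf 'numeric order:\n%s\n' "$as_numbers"
```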

6 The -t option and -k option of sort

If there is a file with content like this:

[rocrocket@rocrocket programming]$ cat facebook.txt
banana:30:5.5
apple:10:2.5
pear:90:2.3
orange:20:3.4

This file has three columns separated by colons. The first column represents the type of fruit, the second column represents the quantity of fruit, and the third column represents the price of the fruit.

Now I want to sort by the quantity of fruit, that is, by the second column. How can sort do this?

Fortunately, sort provides the -t option, which is followed by the separator character. (Remember the -d option of cut and paste? Same idea.)

Once the delimiter is specified, you can use -k to specify which column to sort on.

[rocrocket@rocrocket programming]$ sort -n -k 2 -t : facebook.txt
apple:10:2.5
orange:20:3.4
banana:30:5.5
pear:90:2.3

We use a colon as a delimiter and sort the second column in ascending numerical order, and the results are satisfactory.
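
The same idea works for any column. As a sketch, here is the fruit list sorted by price (the third column); the key is written as -k 3,3n so that it starts and ends at field 3 and is compared numerically. The data is the facebook.txt content shown above, fed in through printf instead of a file.

```shell
data=$(printf 'banana:30:5.5\napple:10:2.5\npear:90:2.3\norange:20:3.4\n')

# -t : sets the separator, -k 3,3n restricts the key to the third field.
by_price=$(printf '%s\n' "$data" | LC_ALL=C sort -t : -k 3,3n)
printf '%s\n' "$by_price"
```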

7 Other common options for sort

-f folds lowercase letters to uppercase before comparing, that is, it ignores case

-c checks whether the file is already sorted; if it is not, it reports the first out-of-order line and returns 1

-C also checks whether the file is sorted, but if it is not, it prints nothing and just returns 1

-M sorts by month name, so JAN is less than FEB, and so on

-b ignores the leading blanks on each line and starts the comparison at the first visible character
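
Two of these options are easy to try from the command line. The sketch below uses made-up input; the status variables are only there to capture the exit codes.

```shell
# -C is silent: it only sets the exit status (0 = sorted, 1 = not sorted).
printf '1\n2\n3\n' | sort -n -C && sorted_status=0 || sorted_status=$?
printf '1\n3\n2\n' | sort -n -C && unsorted_status=0 || unsorted_status=$?
echo "sorted input: $sorted_status, unsorted input: $unsorted_status"

# -M compares three-letter month abbreviations in calendar order.
months=$(printf 'MAR\nJAN\nFEB\n' | LC_ALL=C sort -M)
printf '%s\n' "$months"
```

The exit status makes -C convenient in scripts, for example as the condition of an if statement.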

Sometimes when you study scripts, you will find the sort command followed by a bunch of stuff like -k1,2 or -k1.2 -k3.4, which is rather baffling. Today, let's get it sorted out: the -k option!

1 Prepare materials

$ cat facebook.txt
google 110 5000
baidu 100 5000
guge 50 3000
sohu 100 4500

 

The first field is the company name, the second field is the number of employees, and the third field is the average employee salary. (Apart from the company names, don't believe any of it; the numbers are all made up ^_^)

2 I want this file sorted alphabetically by company name, i.e. by the first field (this facebook.txt file has three fields):

$ sort -t ' ' -k 1 facebook.txt
baidu 100 5000
google 110 5000
guge 50 3000
sohu 100 4500

See, -k 1 is all it takes. (Strictly speaking this is not quite precise, as you will see later.)

3 I want facebook.txt sorted by number of employees:

$ sort -n -t ' ' -k 2 facebook.txt
guge 50 3000
baidu 100 5000
sohu 100 4500
google 110 5000

No need to explain, I'm sure you understand.

However, there is a problem here: baidu and sohu both have 100 employees. What happens then? By default, when the keys compare equal, sort falls back to comparing the whole lines in ascending order, starting from the first field, so baidu is in front of sohu.

4 I want facebook.txt sorted by number of employees, and companies with the same number of employees sorted in ascending order by average salary:

$ sort -n -t ' ' -k 2 -k 3 facebook.txt
guge 50 3000
sohu 100 4500
baidu 100 5000
google 110 5000

See, we added -k 2 -k 3 and that solved the problem. Yes, sort supports this: it sets the priority of the sort fields, sorting first by the second field and, where those are equal, by the third field. (If you want, you can keep writing keys like this and set as many sorting priorities as you like.)

5 I want facebook.txt sorted in descending order by salary, and companies with the same salary sorted in ascending order by number of employees: (this one is a bit harder)

$ sort -n -t ' ' -k 3r -k 2 facebook.txt
baidu 100 5000
google 110 5000
sohu 100 4500
guge 50 3000

A little trick is used here: if you look closely, a lowercase r has been slipped in after -k 3. Think about it, combined with our previous article, can you work out the answer? Revealed: the r modifier has the same effect as the -r option, that is, reverse order. Because sort sorts in ascending order by default, r is needed here to sort the third field (average salary) in descending order.

You can also add n here, which means this field is compared by numeric value. It is worth doing: a key that carries its own modifier such as r no longer inherits the global -n, so -k 3r on its own compares the salaries as text, which only happens to give the right order for these four-digit values. For example:

$ sort -t ' ' -k 3nr -k 2n facebook.txt
baidu 100 5000
google 110 5000
sohu 100 4500
guge 50 3000

See, we dropped the leading -n option and added n to each -k option instead.
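
One detail is worth checking when mixing a global -n with per-key modifiers: with GNU sort (and as POSIX describes it), a key that has its own ordering modifier does not inherit the global options. The sketch below uses two made-up lines where text order and numeric order disagree, so the difference becomes visible.

```shell
data=$(printf 'a 9\nb 10\n')

# The key has its own modifier (r), so the global -n is not applied to it:
# field 2 is compared as text, reversed, and "9" sorts after "10" as text.
text_reverse=$(printf '%s\n' "$data" | LC_ALL=C sort -n -t ' ' -k 2,2r)

# Writing n on the key itself gives a true numeric descending sort.
numeric_reverse=$(printf '%s\n' "$data" | LC_ALL=C sort -t ' ' -k 2,2nr)

printf 'with -n -k 2,2r:\n%s\n' "$text_reverse"
printf 'with -k 2,2nr:\n%s\n' "$numeric_reverse"
```

This is why putting n on every key that needs it tends to be the safer habit.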

6 The specific syntax format of the -k option

If you want to go deeper, you need a little theory. You need to understand the syntax of the -k option, which is as follows:

[ FStart [ .CStart ] ] [ Modifier ] [ , [ FEnd [ .CEnd ] ][ Modifier ] ]

The comma (",") divides this syntax into two parts, the Start part and the End part.

First, plant this idea in your mind: "if you don't set the End part, End is taken to be the end of the line". This concept is important, but people often don't take it seriously.

The Start part is itself made up of three parts. The Modifier part is the option-like part, such as the n and r we used earlier. Here we focus on FStart and .CStart.

In FStart.CStart, FStart is the field to use, and CStart is the position, counted from the first character of the FStart field, of the first character of the sort key.

.CStart can be omitted, in which case the key starts at the beginning of the field. The -k 2 and -k 3 in the previous examples both omit .CStart.

Similarly, in the End part you can set FEnd.CEnd. If you omit .CEnd, the key ends at the "end of the field", that is, the last character of the field. Setting CEnd to 0 (zero) also means "end of the field".

7 On a whim, sort starting from the second letter of the company name:

$ sort -t ' ' -k 1.2 facebook.txt
baidu 100 5000
sohu 100 4500
google 110 5000
guge 50 3000

See, we used -k 1.2, which means the sort key starts at the second character of the first field and, because the End part is omitted, runs to the end of the line. You will find that baidu tops the list because its second letter is a. The second character of both sohu and google is o, but the h of sohu comes before the second o of google, so they rank second and third respectively. guge can only rank fourth.

8 On a whim, sort on only the second letter of the company name, and where that is the same, sort in descending order by salary:

$ sort -t ' ' -k 1.2,1.2 -k 3,3nr facebook.txt
baidu 100 5000
google 110 5000
sohu 100 4500
guge 50 3000

Since we sort on only the second letter, we use the notation -k 1.2,1.2, which means "only" the second letter is compared. (If you ask "why can't I just use -k 1.2?", the answer is that omitting the End part makes the key run from the second letter all the way to the end of the line.) For the salary we likewise use -k 3,3, which is the most precise way to write it: it means "only" this field is compared. If you omit the second 3, the key becomes "everything from the start of the third field to the end of the line".
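
The difference between an open-ended key and a one-character key can be made visible with the -s (stable) option, which switches off sort's fallback comparison of whole lines so that lines with equal keys stay in input order. A sketch, using the facebook.txt content fed in through printf:

```shell
data=$(printf 'google 110 5000\nbaidu 100 5000\nguge 50 3000\nsohu 100 4500\n')

# Key runs from the 2nd character of field 1 to the end of the line,
# so "ohu ..." sorts before "oogle ...".
open_ended=$(printf '%s\n' "$data" | LC_ALL=C sort -s -t ' ' -k 1.2)

# Key is exactly one character. google and sohu tie on "o" and, because
# of -s, keep their input order (google first).
one_char=$(printf '%s\n' "$data" | LC_ALL=C sort -s -t ' ' -k 1.2,1.2)

printf -- '-k 1.2:\n%s\n' "$open_ended"
printf -- '-k 1.2,1.2:\n%s\n' "$one_char"
```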

9 What other options are available in the modifier section?

You can use b, d, f, i, n or r.

You must already be familiar with n and r.

b means ignore the leading blanks in this field.

d means sort this field in dictionary order (that is, only blanks and alphanumeric characters are considered).

f means sort this field ignoring case.

i means ignore "non-printable characters" and sort only on the printable ones. (Some ASCII characters are non-printable, such as \a for alert, \b for backspace, \n for newline, \r for carriage return, and so on.)
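
Two of these modifiers can be tried on tiny made-up inputs. The sketch below assumes GNU sort and pins the C locale so that the "plain" results are predictable; note that b is attached to the start position of the key.

```shell
# f: fold case. As raw bytes "B" (66) sorts before "a" (97).
plain_case=$(printf 'a\nB\nc\n' | LC_ALL=C sort -k 1,1)
folded_case=$(printf 'a\nB\nc\n' | LC_ALL=C sort -k 1,1f)

# b: ignore leading blanks. A space (32) sorts before any letter,
# so without b the indented line comes first.
plain_blank=$(printf '  b\na\n' | LC_ALL=C sort -k 1,1)
skip_blank=$(printf '  b\na\n' | LC_ALL=C sort -k 1b,1)

printf 'plain: %s\n' "$(echo $plain_case)"
printf 'with f: %s\n' "$(echo $folded_case)"
```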

10 An example of combining -k and -u:

$ cat facebook.txt
google 110 5000
baidu 100 5000
guge 50 3000
sohu 100 4500

This is the original facebook.txt file.

$ sort -n -k 2 facebook.txt
guge 50 3000
baidu 100 5000
sohu 100 4500
google 110 5000

$ sort -n -k 2 -u facebook.txt
guge 50 3000
baidu 100 5000
google 110 5000

When we sort numerically by the employee-count field and then add -u, the sohu line is deleted! It turns out that -u looks only at the field set with -k: when two lines have the same key, it keeps the first and deletes all the later ones.

$ sort  -k 1 -u facebook.txt
baidu 100 5000
google 110 5000
guge 50 3000
sohu 100 4500

$ sort  -k 1.1,1.1 -u facebook.txt
baidu 100 5000
google 110 5000
sohu 100 4500

The same is true in this example: guge, which also starts with g, is not spared.

$ sort -n -k 2 -k 3 -u facebook.txt
guge 50 3000
sohu 100 4500
baidu 100 5000
google 110 5000

What! With two levels of sorting priority set here, -u does not delete any lines. It turns out that -u weighs all the -k options together and deletes a line only if every key is the same. As long as they differ at one level, lines are not deleted so easily :) (If you don't believe me, add a line sina 100 4500 and try it.)
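
The suggested experiment can be sketched as follows. The sina line is appended to the same data, so sohu and sina now agree on both keys and only one of them should survive; the counting variables are just for illustration.

```shell
data=$(printf 'google 110 5000\nbaidu 100 5000\nguge 50 3000\nsohu 100 4500\nsina 100 4500\n')

# Five lines in, but sohu and sina have identical keys (100 and 4500).
deduped=$(printf '%s\n' "$data" | LC_ALL=C sort -n -k 2 -k 3 -u)
printf '%s\n' "$deduped"

line_count=$(printf '%s\n' "$deduped" | wc -l)
survivors=$(printf '%s\n' "$deduped" | grep -c -E '^(sohu|sina) ')
echo "lines: $line_count, sohu/sina lines left: $survivors"
```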

11 The weirdest sort:

$ sort -n -k 2.2,3.1 facebook.txt
guge 50 3000
baidu 100 5000
sohu 100 4500
google 110 5000

This sorts on a key that starts at the second character of the second field and ends at the first character of the third field.

You might expect the first line of the output to yield 0 3, the second 00 5, the third 00 4, and the fourth 10 5.

But under -n the first three of those keys would all have the value 0 (sort does not think that 0 is less than 00), and they would tie.

So why is guge alone at the top, and why is baidu in front of sohu? (You can experiment and think for yourself.)

The answer is revealed: it turns out that "the second character of the second field" is not what it seems. When no -t is given, GNU sort treats the blanks in front of a field as part of that field, so the second field of the first line is " 50", with a leading space, and its second character is the 5. The keys really start at the first digit, and -n reads them as 50, 100, 100 and 110. Because the keys of baidu and sohu are both 100, sort falls back to comparing the whole lines, and of course baidu is in front of sohu. The tie can be demonstrated by adding a second key:

$ sort -n -k 2.2,3.1 -k 1,1r facebook.txt
guge 50 3000
sohu 100 4500
baidu 100 5000
google 110 5000
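
One way to probe which characters a key like 2.2,3.1 really covers is to run the same command with and without an explicit -t ' '. The sketch below assumes GNU sort, where by default the blank in front of a field counts as part of that field, while with -t the separator belongs to no field.

```shell
data=$(printf 'google 110 5000\nbaidu 100 5000\nguge 50 3000\nsohu 100 4500\n')

# Default splitting: position 2.2 is the FIRST digit of the number,
# so the numeric keys are 110, 100, 50 and 100.
default_split=$(printf '%s\n' "$data" | LC_ALL=C sort -n -k 2.2,3.1)

# With -t ' ': position 2.2 is the SECOND digit, so the keys are
# "10 5", "00 5", "0 3" and "00 4"; three of them have the value 0
# and are ordered by the whole-line fallback comparison.
explicit_sep=$(printf '%s\n' "$data" | LC_ALL=C sort -n -t ' ' -k 2.2,3.1)

printf 'default:\n%s\n' "$default_split"
printf 'with -t:\n%s\n' "$explicit_sep"
```

Adding b to the start position (for example -k 2.2b,3.1) is another way to make the character count start at the first non-blank character.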

12 Sometimes I see things like +1 -2 after the sort command. What are they?

The documentation of current versions of sort explains this syntax as follows:

On older systems, `sort' supports an obsolete origin-zero syntax `+POS1 [-POS2]' for specifying sort keys.  POSIX 1003.1-2001 (*note Standards conformance::) does not allow this; use `-k' instead.

It turns out that this ancient notation has been retired, and from now on you can rightfully look down on scripts that use it!

(In case you run into ancient scripts, here is how the notation works. The plus sign introduces the Start part and the minus sign introduces the End part. The most important point is that this notation counts from 0: what we have been calling the first field is the 0th field here, and what we called the second character is the 1st character. Got it?)
