Use Linux to count the lines in a file, split it, sort the IDs, and remove duplicates (wc, head, sort, uniq)!

Suppose we run into a log file that is over 2 GB.

Every text editor just shakes its head at a file this size:

At this point my heart sinks a little, but first let's see how many lines the file has.

You can count the number of lines in a file with: wc -l filename

$ wc -l lesson_20201205.log
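Its output is the line count followed by the file name, like this (the number below is an illustrative placeholder; the post only says "more than 12 million"):

12583409 lesson_20201205.log    # illustrative: <line count> <filename>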

The file has more than 12 million lines of data.

Then use head -n count filename > newfile to take the first N lines into a new file (head -1000000 below is shorthand for head -n 1000000):

$ head -1000000 lesson_20201205.log > lesson_20201205_100.log

 

This gives us a 163 MB file containing 1 million lines of data.
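As a quick sanity check (my addition, not in the original post), you can confirm the new file's line count and size with standard commands:

$ wc -l lesson_20201205_100.log     # should print 1000000
$ ls -lh lesson_20201205_100.log    # human-readable size, roughly 163M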

Next, we pull the user IDs out of the log and find that many of them are duplicates.
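The post jumps straight to a file of extracted IDs (lesson_id_100.log) without showing the extraction step. A minimal sketch, assuming the user ID is the third whitespace-separated field of each log line; adjust the field number to your actual log format:

$ awk '{print $3}' lesson_20201205_100.log > lesson_id_100.log    # $3 is an assumed field position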

At this point we obviously can't paste these IDs into Excel and deduplicate them there.

We have to solve it the programmer's way.

We use cat filename | sort | uniq > deduplicated_filename:

$ cat lesson_id_100.log | sort | uniq > lesson_id_100_uniq.log

This gives us a file of IDs, deduplicated and sorted in ascending order.
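As an aside (not in the original post), sort -u collapses the sort-and-uniq pipeline into a single command, and uniq -c is useful when you also want each ID's occurrence count:

$ sort -u lesson_id_100.log > lesson_id_100_uniq.log    # same result as sort | uniq
$ sort lesson_id_100.log | uniq -c | sort -rn | head    # most frequent IDs first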

That's it!

Source: blog.csdn.net/zhangyupeng0528/article/details/111071501