Linux log processing

First: intersection and union of two files
Precondition: neither file may contain duplicate lines.
1. Union of the two files (duplicated lines are kept only once)
2. Intersection of the two files (keep only the lines that exist in both files)
3. Remove the intersection, leaving only the remaining lines
1. cat file1 file2 | sort | uniq > file3
2. cat file1 file2 | sort | uniq -d > file3
3. cat file1 file2 | sort | uniq -u > file3
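A quick illustration (the file names are reused from above; the contents are invented for this example):
# printf 'a\nb\nc\n' > file1
# printf 'b\nc\nd\n' > file2
# cat file1 file2 | sort | uniq        // union
a
b
c
d
# cat file1 file2 | sort | uniq -d     // intersection
b
c
# cat file1 file2 | sort | uniq -u     // lines unique to one of the files
a
d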

Second: merging two files
One file appended below the other:
cat file1 file2 > file3
One file on the left, the other on the right:
paste file1 file2 > file3
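For example (contents again invented for illustration), paste joins corresponding lines with a tab:
# printf 'a\nb\n' > file1
# printf '1\n2\n' > file2
# paste file1 file2 > file3
# cat file3
a       1
b       2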

Third: removing duplicate lines from a single file
sort file | uniq
Note: repeated lines are collapsed into one; the duplicated lines are still represented, but only a single copy of each is kept.
sort file | uniq -u
The command above removes all repeated lines entirely, leaving only the lines that never repeat in the file.
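To see the difference, assume a file containing the lines a, a, b:
# printf 'a\na\nb\n' > file
# sort file | uniq         // the repeated line is kept once
a
b
# sort file | uniq -u      // the repeated line is dropped entirely
b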

For details, see the man pages of cat, sort, uniq, and related commands.

Fourth: splitting a large file into multiple small files

A log file of about 50M is used for testing.
Log file name: log.txt.gz
Number of lines: 208363

Method 1: (split)
Syntax: split [-<lines>] [-b <bytes>] [-C <bytes>] [-l <lines>] [file to split] [output file prefix]
# gunzip log.txt.gz            // the file must be decompressed first, otherwise it can be neither split nor displayed with cat/zcat

# wc -l log.txt                // count the number of lines in the file

208363 log.txt
# split -l 120000 log.txt newlog    // split the log into two files by specifying the line count
# du -sh *
50M     log.txt
29M newlogaa
22M newlogab
# file *                       // the split files keep the same attributes as the original file
log.txt:  ASCII text, with very long lines, with CRLF line terminators
newlogaa: ASCII text, with very long lines, with CRLF line terminators
newlogab: ASCII text, with very long lines, with CRLF line terminators
# gzip newlogaa newlogab       // compress the split files for transfer
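As the syntax above shows, split can also cut by size instead of by line count; a small sketch (the 20M chunk size and the sizelog output prefix are arbitrary choices here):
# split -b 20M log.txt sizelog     // cut every 20 MB, possibly in the middle of a line
# split -C 20M log.txt sizelog     // cut at most 20 MB per piece, but keep whole lines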
 
Method 2: (dd)
# gunzip log.txt.gz            // decompress first; otherwise the file can be neither split nor displayed with cat/zcat

# dd bs=20480 count=1500 if=log.txt of=newlogaa              // generate the first file by size

# dd bs=20480 count=1500 if=log.txt of=newlogab skip=1500    // skip the first 1500 blocks and generate the second file
# file *

log.txt: ASCII text, with very long lines, with CRLF line terminators
newlogaa: ASCII text, with very long lines, with CRLF line terminators
newlogab: ASCII text, with very long lines, with CRLF line terminators
 
The split itself works, but a single line can end up broken across the two files, so this is only usable if the log-analysis system can "tolerate" that.
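One way to check that nothing was lost by the dd split (a sanity check added here, assuming the second dd's count covers the rest of the file) is to compare checksums:
# md5sum log.txt
# cat newlogaa newlogab | md5sum    // should print the same hash as above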
Method 3: (head + tail)
# gunzip log.txt.gz            // if the file is left compressed, use zcat instead in the commands below
# wc -l log.txt                // count the number of lines
208363 log.txt
# head -n `echo $((208363/2+1))` log.txt > newloga.txt          // redirect the first half of the lines to one file

# tail -n `echo $((208363-208363/2-1))` log.txt > newlogb.txt   // redirect the remaining lines to another file

# gzip newloga.txt newlogb.txt    // compress the two files
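The same idea without hard-coding the line count (a small sketch; the variable names are my own):
# lines=`wc -l < log.txt`            // total number of lines
# half=$((lines / 2 + 1))            // lines that go into the first piece
# head -n $half log.txt > newloga.txt
# tail -n $((lines - half)) log.txt > newlogb.txt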

Method 4: (awk)
# gunzip log.txt.gz
# awk '{if (NR<120000) print $0}' log.txt > newloga.txt
# awk '{if (NR>=120000) print $0}' log.txt > newlogb.txt
 
These two commands each traverse the entire file, so for efficiency they should be combined into one:

# awk '{if (NR<120000) print $0 >"newloga.txt"; if (NR>=120000) print $0 >"newlogb.txt"}' log.txt
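The same single-pass idea extends to any number of pieces; a sketch (the 120000 chunk size and the newlogN.txt names are assumptions carried over from the example above):
# awk 'BEGIN{n=120000} {print > ("newlog" int((NR-1)/n) ".txt")}' log.txt    // lines 1-120000 go to newlog0.txt, the next 120000 to newlog1.txt, and so on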

 
Of the four methods above, all except dd split the log file on whole-line boundaries. When splitting, try to finish in a single read of the file; otherwise, splitting like this:
# cat log.txt | head -12000 > newloga.txt
# cat log.txt | tail -23000 > newlogb.txt
means that after the first command has produced its part of the file, the second command still reads back through the first x lines for nothing; execution is inefficient, and if the file is very large it may even run short of memory.
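To see the difference in practice, both approaches can be timed on the same file (a rough check; the numbers will vary by machine):
# time awk '{if (NR<120000) print $0 >"newloga.txt"; if (NR>=120000) print $0 >"newlogb.txt"}' log.txt
# time sh -c 'head -n 120000 log.txt > newloga.txt; tail -n +120001 log.txt > newlogb.txt'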

Origin www.cnblogs.com/hxxxs/p/11366764.html