MIT 2020 Crash Course (4): Data Wrangling

Converting data from one format to another.

In other words, whether the data is in text or binary format, we keep processing it until we end up with exactly the data we need.

We have already seen some basic data wrangling in past lectures. Pretty much any time you use the | operator, you are performing some kind of data wrangling. Consider a command like journalctl | grep -i intel. It finds all system log entries that mention Intel (case insensitive). You may not think of it as wrangling data, but it is going from one format (your entire system log) to a format that is more useful to you (just the Intel log entries). Most data wrangling is about knowing which tools you have at your disposal, and how to combine them.

Let's start from the beginning. To wrangle data, we need two things: data to wrangle, and something to do with it. Logs are often a good use case, because you frequently want to investigate things about them, and reading the whole thing isn't feasible. Let's figure out who is trying to log in to my server by looking at my server's log:

ssh myserver journalctl

That is far too much output, though. Let's restrict it to ssh:

ssh myserver journalctl | grep sshd

Note that we are using a pipe to stream a remote file through grep on our local computer! ssh is magical, and we will talk more about it in the next lecture on the command-line environment. But this is still way more content than we wanted. And it's pretty hard to read. Let's do better:

ssh myserver 'journalctl | grep sshd | grep "Disconnected from"' | less

Why the extra quoting? Well, our logs may be quite large, and it's wasteful to stream the whole thing to our computer and then do the filtering. Instead, we can do the filtering on the remote server, and then massage the data locally. less gives us a "pager" that allows us to scroll up and down through long output. To save some additional traffic while we debug our command line, we can even stick the current filtered logs into a file so that we don't have to access the network while developing:

$ ssh myserver 'journalctl | grep sshd | grep "Disconnected from"' > ssh.log
$ less ssh.log

There is still a lot of noise here. There are many ways to solve this problem, but let's look at one of the most powerful tools in the toolbox: sed.

sed is a "stream editor" that builds on top of the old ed editor. In it, you basically give short commands for how to modify the file, rather than manipulating its contents directly (although you can do that too). There are tons of commands, but one of the most common ones is s: substitution. For example, we can write:

ssh myserver journalctl
 | grep sshd
 | grep "Disconnected from"
 | sed 's/.*Disconnected from //'

What we just wrote is a simple regular expression, a powerful construct that lets you match text against patterns. The format of the s command is s/REGEX/SUBSTITUTION/, where REGEX is the regular expression you want to search for, and SUBSTITUTION is the text you want to substitute matching text with.
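To see the s command in isolation, here is a quick toy example (not tied to the log we are working with), which replaces the first match on each line of its input:

$ echo 'hello world' | sed 's/world/sed/'
hello sed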

Regular expressions

Regular expressions are common and useful enough that it's worth taking some time to understand how they work. Let's start by looking at the one we used above: /.*Disconnected from /. Regular expressions are usually (though not always) surrounded by /. Most ASCII characters just carry their normal meaning, but some characters have "special" matching behavior. Exactly which characters do what varies between different implementations of regular expressions, which is a source of great frustration. Very common patterns are listed below (a short demo follows the list):

. Means "any single character" except newline

* Zero or more of the preceding match

+ One or more of the preceding match

[abc] Any one of the characters a, b, and c

(RX1|RX2) Anything that matches either RX1 or RX2

^ Beginning of line

$ End of line
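Here is the short demo promised above, using grep -E (which prints a line only if it matches) on toy inputs:

$ echo 'cab' | grep -E '^c[ab]+$'    # starts with c, then one or more of a or b, then end of line
cab
$ echo 'cab' | grep -E 'x|b$'        # matches because the line ends with b
cab
$ echo 'cd' | grep -E '^c.$'         # . matches the single character d
cd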

sed's regular expressions are somewhat strange, and will require you to put a \ before most of these characters to give them their special meaning. Or you can pass -E.
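For example, with GNU sed (a sketch on a toy input; BSD sed may behave differently), the two forms below are equivalent:

$ echo 'aaabbb' | sed 's/ab\+/X/'     # without -E, + must be escaped to be special
aaX
$ echo 'aaabbb' | sed -E 's/ab+/X/'   # with -E, + is special on its own
aaX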

So, looking back at /.*Disconnected from /, we see that it matches any text that starts with any number of characters, followed by the literal string "Disconnected from ". Which is what we wanted. But beware, regular expressions are tricky. What if someone tried to log in with the username "Disconnected from"? We would have:

Jan 17 03:13:00 thesquareplanet.com sshd[2631]: Disconnected from invalid user Disconnected from 46.97.239.16 port 55920 [preauth]

What would we end up with? Well, * and + are "greedy" by default. They will match as much text as they can. So in the line above, we would end up with just:

46.97.239.16 port 55920 [preauth]

This may not be what we wanted. In some regular expression implementations, you can suffix * or + with ? to make them non-greedy, but sadly sed does not support that. We could switch to perl's command-line mode though, which does support that construct:

perl -pe 's/.*?Disconnected from //'
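To see the difference on a tiny input (just a sketch, unrelated to the log):

$ echo 'aXbXc' | sed 's/.*X//'        # greedy: .* eats everything up to the last X
c
$ echo 'aXbXc' | perl -pe 's/.*?X//'  # non-greedy: .*? stops at the first X
bXc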

We'll stick with sed for the rest of this, because it's by far the more common tool for these kinds of jobs. sed can also do other handy things, like printing lines following a given match, doing multiple substitutions per invocation, searching for things, and so on, but we won't cover that too much here. sed is basically an entire topic in and of itself, but there are often better tools.

Okay, so we also have a suffix we'd like to get rid of. How might we do that? It's a little tricky to match just the text that follows the username, especially if the username can have spaces and such! What we need to do is match the whole line:

| sed -E 's/.*Disconnected from (invalid |authenticating )?user .* [^ ]+ port [0-9]+( \[preauth\])?$//'

Let's look at what's going on with a regex debugger. Okay, so the start is the same as before. Then, we match any of the "user" variants (there are two prefixes in the logs). Then we match on any string of characters where the username is. Then we match on any single word ([^ ]+; any non-empty sequence of non-space characters). Then the word "port" followed by a sequence of digits. Then possibly the suffix [preauth], and then the end of the line.

Notice that with this technique, a username of "Disconnected from" won't confuse us any more. Can you see why?

There is one problem with this though: the entire log becomes empty. We want to keep the username, after all. For this, we can use "capture groups". Any text matched by a part of the regex surrounded by parentheses is stored in a numbered capture group. These are available in the substitution (and in some engines, even in the pattern itself!) as \1, \2, \3, etc.:

| sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
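As a quick check, running that substitution over the tricky log line from above leaves only the (malicious) username, as intended:

$ echo 'Jan 17 03:13:00 thesquareplanet.com sshd[2631]: Disconnected from invalid user Disconnected from 46.97.239.16 port 55920 [preauth]' | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
Disconnected from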

Back to data wrangling

Okay, now we have

ssh myserver journalctl
 | grep sshd
 | grep "Disconnected from"
 | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'

sed can do all sorts of other interesting things, like injecting text (with the i command), explicitly printing lines (with the p command), selecting lines by index, and lots more. Check man sed!
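A couple of quick sketches of those commands (the one-line form of i below is a GNU sed extension):

$ printf 'one\ntwo\n' | sed '1i zero'    # insert a line before line 1
zero
one
two
$ printf 'one\ntwo\n' | sed -n '/two/p'  # print only the lines that match
two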

Anyway. What we have now gives us a list of all the usernames that have attempted to log in. But this is pretty unhelpful. Let's look for the common ones:

ssh myserver journalctl
 | grep sshd
 | grep "Disconnected from"
 | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
 | sort | uniq -c

sort will, well, sort its input. uniq -c collapses consecutive identical lines into a single line, prefixed with a count of the number of occurrences. We probably also want to sort that and only keep the most common usernames:
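On a toy input, that pair of commands looks like this:

$ printf 'b\na\nb\nb\na\n' | sort | uniq -c
      2 a
      3 b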

ssh myserver journalctl
 | grep sshd
 | grep "Disconnected from"
 | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
 | sort | uniq -c
 | sort -nk1,1 | tail -n10

sort -n sorts in numeric (rather than lexicographic) order. -k1,1 means "sort by only the first whitespace-separated column". The ,n part says "sort until the nth field, where the default is the end of the line". In this particular example, sorting by the whole line wouldn't have mattered, but we're here to learn!
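The difference between lexicographic and numeric sorting is easy to see on a small input:

$ printf '10 x\n2 y\n' | sort -k1,1
10 x
2 y
$ printf '10 x\n2 y\n' | sort -nk1,1
2 y
10 x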

If we wanted the least common ones, we could use head instead of tail. There's also sort -r, which sorts in reverse order.

Okay, that's pretty cool, but what if we only want the usernames, and perhaps not one per line?

ssh myserver journalctl
 | grep sshd
 | grep "Disconnected from"
 | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
 | sort | uniq -c
 | sort -nk1,1 | tail -n10
 | awk '{print $2}' | paste -sd,

Let's start with paste: it lets you combine lines (-s) by a given single-character delimiter (-d). But what is this awk business?
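A tiny example of paste on its own:

$ printf 'a\nb\nc\n' | paste -sd,
a,b,c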

awk - another editor

awk is a programming language that just happens to be really good at processing text streams.

First, what does {print $2} do? Well, awk programs take the form of an optional pattern plus a block saying what to do if the pattern matches a given line. The default pattern (which we used above) matches all lines. Inside the block, $0 is set to the entire line's contents, and $1 through $n are set to the nth field of that line, when separated by the awk field separator (whitespace by default, changed with -F). In this case, we're saying that, for every line, we should print the contents of the second field, which happens to be the username!
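Field selection in isolation, on a couple of toy inputs:

$ echo 'one two three' | awk '{print $2}'
two
$ echo 'a:b:c' | awk -F: '{print $3}'   # -F changes the field separator
c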

Let's see if we can do something fancier. Let's compute the number of single-use usernames that start with c and end with e:

| awk '$1 == 1 && $2 ~ /^c[^ ]*e$/ { print $2 }' | wc -l

There's a lot to unpack here. First, notice that we now have a pattern (the stuff that goes before {...}). The pattern says that the first field of the line should be equal to 1 (that's the count from uniq -c), and that the second field should match the given regular expression. The block just says to print the username. We then count the number of lines in the output with wc -l.
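For example, on a hand-made toy input standing in for the real pipeline output:

$ printf '1 cafe\n2 code\n1 cheese\n' | awk '$1 == 1 && $2 ~ /^c[^ ]*e$/ { print $2 }' | wc -l
2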

However, awk is a programming language, remember?

BEGIN { rows = 0 }
$1 == 1 && $2 ~ /^c[^ ]*e$/ { rows += $1 }
END { print rows }

BEGIN is a pattern that matches the start of the input (and END matches the end). Now, the per-line block just adds the count from the first field (although it's always 1 in this example), and then we print it out at the end. In fact, we could get rid of grep and sed entirely, because awk can do it all.
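To run that program as a one-liner on the same toy input as above:

$ printf '1 cafe\n2 code\n1 cheese\n' | awk 'BEGIN { rows = 0 } $1 == 1 && $2 ~ /^c[^ ]*e$/ { rows += $1 } END { print rows }'
2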

Analyzing data

You can do math! For example, add the numbers on each line together:

| paste -sd+ | bc -l

Or produce more elaborate expressions:

echo "2*($(data | paste -sd+))" | bc -l
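With some concrete numbers in place of data, the pieces look like this:

$ printf '1\n2\n3\n' | paste -sd+            # paste builds the expression
1+2+3
$ printf '1\n2\n3\n' | paste -sd+ | bc -l    # and bc evaluates it
6
$ echo "2*($(printf '1\n2\n3\n' | paste -sd+))" | bc -l
12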

You can get stats in a variety of ways. st is pretty neat, but if you already have R:

ssh myserver journalctl
 | grep sshd
 | grep "Disconnected from"
 | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
 | sort | uniq -c
 | awk '{print $1}' | R --slave -e 'x <- scan(file="stdin", quiet=TRUE); summary(x)'

R is another programming language that's great at data analysis and plotting. We won't go into too much detail, but suffice to say that summary prints summary statistics for a vector, and we created a vector from the input stream of numbers, so R gives us the statistics we wanted!
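A self-contained check of just the R part, on some made-up numbers (it prints the minimum, quartiles, median, mean, and maximum):

$ printf '1\n2\n3\n4\n' | R --slave -e 'x <- scan(file="stdin", quiet=TRUE); summary(x)'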

If you just want some simple plotting, gnuplot is your friend:

ssh myserver journalctl
 | grep sshd
 | grep "Disconnected from"
 | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
 | sort | uniq -c
 | sort -nk1,1 | tail -n10
 | gnuplot -p -e 'set boxwidth 0.5; plot "-" using 1:xtic(2) with boxes'

Sometimes you want to do data wrangling to find things to install or remove based on some longer list. The data wrangling we've talked about so far plus xargs can be a powerful combination:

rustup toolchain list | grep nightly | grep -vE "nightly-x86" | sed 's/-x86.*//' | xargs rustup toolchain uninstall
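To see what xargs does on its own, here is a harmless sketch (echo stands in for the real uninstall command, and the toolchain names are made up):

$ printf 'nightly-2020-01-01\nnightly-2020-02-01\n' | xargs echo rustup toolchain uninstall
rustup toolchain uninstall nightly-2020-01-01 nightly-2020-02-01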

Wrangling binary data

So far, we have mostly talked about wrangling textual data, but pipes are just as useful for binary data. For example, we can use ffmpeg to capture an image from our camera, convert it to grayscale, compress it, send it to a remote machine over SSH, decompress it there, make a copy, and then display it.

ffmpeg -loglevel panic -i /dev/video0 -frames 1 -f image2 -
 | convert - -colorspace gray -
 | gzip
 | ssh mymachine 'gzip -d | tee copy.jpg | env DISPLAY=:0 feh -'
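The same idea works for any binary stream. Here is a smaller sketch in the same spirit (somefile.bin and mymachine are placeholders): compress locally, decompress remotely, and keep a copy on the remote side with tee:

$ gzip -c somefile.bin | ssh mymachine 'gzip -d | tee copy.bin > /dev/null'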