[Repost] Grok regular expression capture

Grok is the most important filter plugin in Logstash. With grok you can predefine named regular expressions and reference them later (in grok parameters or inside other regular expressions).

Regular Expression Syntax

Most operations engineers know at least a little regular-expression syntax, and you can write a standard regular expression directly in grok, like this:

\s+(?<request_time>\d+(?:\.\d+)?)\s+

Tip: This named-capture syntax should be familiar to Perl and Ruby programmers. Python programmers may be more used to writing (?P<name>pattern); that form is not available here, so you will have to get used to the Ruby style.

Now add our first filter section to the configuration file. It should go between the input and output sections (Logstash does not depend on the order of the sections when it runs, but keeping them in order makes the file easier to read):

input {stdin{}}
filter {
    grok {
        match => {
            "message" => "\s+(?<request_time>\d+(?:\.\d+)?)\s+"
        }
    }
}
output {stdout{}}
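
If you save the configuration above to a file, say logstash.conf (the file name here is just an example), you can start the process with something like:

bin/logstash -f logstash.conf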

Start the Logstash process, type "begin 123.456 end", and you should see output similar to the following:

{
         "message" => "begin 123.456 end",
        "@version" => "1",
      "@timestamp" => "2014-08-09T11:55:38.186Z",
            "host" => "raochenlindeMacBook-Air.local",
    "request_time" => "123.456"
}

Nice! But the data type is not quite what we want... request_time should be a number, not a string.

We mentioned that we will cover LogStash::Filters::Mutate for converting field value types later, but grok actually has its own magic for doing this!
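
For comparison, a mutate-based version would look roughly like this (just a sketch of the convert option, which is covered in detail in a later chapter):

filter {
    mutate {
        # convert the string captured by grok into a floating-point number
        convert => { "request_time" => "float" }
    }
}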

Grok Expression Syntax

Grok supports writing predefined grok expressions into files. For the official predefined grok expressions, see https://github.com/logstash/logstash/tree/v1.4.2/patterns.

Note: In newer versions of Logstash this patterns directory is empty; the last commit there explains that the core patterns are now provided by the logstash-patterns-core gem, and the directory is left for users to store their own custom patterns.

The following is the simplest but sufficient example of usage, excerpted from the official documentation:

USERNAME [a-zA-Z0-9._-]+
USER %{USERNAME}

The first line defines a grok expression using a plain regular expression; the second line defines another grok expression by referencing the first one with the %{...} interpolation syntax.

The full syntax of a grok expression reference is:

%{PATTERN_NAME:capture_name:data_type}

Tip: data_type currently supports only two values: int and float.

So we can improve our configuration to look like this:

filter {
    grok {
        match => {
            "message" => "%{WORD} %{NUMBER:request_time:float} %{WORD}"
        }
    }
}

Rerun the process and get the following result:

{
         "message" => "begin 123.456 end",
        "@version" => "1",
      "@timestamp" => "2014-08-09T12:23:36.634Z",
            "host" => "raochenlindeMacBook-Air.local",
    "request_time" => 123.456
}

This time request_time has become a numeric type.

Best Practices

In practice we have to deal with all kinds of log files, and writing ad-hoc expressions directly in the configuration file quickly becomes unmanageable. Our suggestion is therefore to keep all grok expressions together in one place, and point to that location with the filter/grok plugin's patterns_dir option.
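
For example, you might keep a file such as /path/to/your/own/patterns/extra (the file name and the REQUESTTIME name below are made up for illustration) containing:

REQUESTTIME \d+(?:\.\d+)?

It could then be referenced in a match option as %{REQUESTTIME:request_time:float}, exactly like the official patterns.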

If you grok everything in "message" out into separate fields, the data is essentially stored twice. You can therefore use the remove_field option to delete the message field, or use the overwrite option to overwrite the default message field and keep only the most important part (a remove_field sketch follows the overwrite example below).

An example using the overwrite option is as follows:

filter {
    grok {
        patterns_dir => "/path/to/your/own/patterns"
        match => {
            "message" => "%{SYSLOGBASE} %{DATA:message}"
        }
        overwrite => ["message"]
    }
}
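
And here is a minimal sketch of the remove_field variant, reusing the match from our earlier example (the choice of pattern is only illustrative):

filter {
    grok {
        match => {
            "message" => "%{WORD} %{NUMBER:request_time:float} %{WORD}"
        }
        # drop the raw message once the interesting part has been captured
        remove_field => ["message"]
    }
}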

Tips

Multiline match

When using grok together with codec/multiline, there is one thing to watch out for: grok regular expressions, like ordinary regular expressions, do not match carriage returns and line feeds by default. Just as you would add the m modifier to =~ // in Ruby, you need to enable this explicitly here, by adding the (?m) flag at the beginning of the expression. For example:

match => {
    "message" => "(?m)\s+(?<request_time>\d+(?:\.\d+)?)\s+"
}

Multiple choice

Sometimes a single log source has several possible formats. Writing one regular expression that covers them all is hard, and chaining the alternatives with | quickly gets ugly. Logstash's syntax offers an interesting solution here.

The documentation states that the match parameter of the logstash/filters/grok plugin should accept a Hash value. But because hash values in early Logstash syntax were also written with [], it is still fine to pass an Array to match. So we can in fact pass multiple regular expressions here to match against the same field:

match => [
    "message", "(?<request_time>\d+(?:\.\d+)?)",
    "message", "%{SYSLOGBASE} %{DATA:message}",
    "message", "(?m)%{WORD}"
]

Logstash tries them in the order they are defined and stops at the first successful match. The effect is the same as writing one big regular expression joined with |, but the readability is much better.

Last but not least, I strongly recommend that everyone use the Grok Debugger to debug their grok expressions.

Reprinted from: http://udn.yyuap.com/doc/logstash-best-practice-cn/filter/grok.html
