Grok is the most important plugin for Logstash. You can predefine a named regular expression in grok and refer to it later (in grok parameters or other regular expressions).
Regular Expression Syntax
Most operations engineers already know at least some regular expressions, and you can write standard regex directly in grok, like this:
\s+(?<request_time>\d+(?:\.\d+)?)\s+
Tip: this named-capture syntax should look familiar to Perl and Ruby programmers. Python programmers may be more used to writing (?P<name>pattern); grok uses the Ruby form, so you will have to get used to it.
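Since grok patterns run on Ruby's regex engine, you can sanity-check a named capture in plain Ruby before putting it into a config. This is a quick illustrative check, not part of any Logstash configuration:

```ruby
# Ruby named captures use the same (?<name>pattern) syntax as grok.
line = "begin 123.456 end"
m = line.match(/\s+(?<request_time>\d+(?:\.\d+)?)\s+/)
puts m[:request_time]  # prints "123.456"
```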
Now add the first filter section to our configuration file. It should go between the input and output sections (Logstash does not care about the order of the sections when executing, but writing them in order makes the file easier to read):
input { stdin {} }
filter {
    grok {
        match => {
            "message" => "\s+(?<request_time>\d+(?:\.\d+)?)\s+"
        }
    }
}
output { stdout {} }
Run the logstash process and type "begin 123.456 end", you should see output similar to the following:
{
    "message" => "begin 123.456 end",
    "@version" => "1",
    "@timestamp" => "2014-08-09T11:55:38.186Z",
    "host" => "raochenlindeMacBook-Air.local",
    "request_time" => "123.456"
}
Pretty! But the data type is not quite what we want: request_time should be a number, not a string. We will cover LogStash::Filters::Mutate, which converts field value types, later on; but grok actually has its own magic for this!
Grok Expression Syntax
Grok supports writing predefined grok expressions into files. For the official predefined grok expressions, see: https://github.com/logstash/logstash/tree/v1.4.2/patterns .
Note: in newer versions of logstash, the patterns directory is empty; the last commit there notes that the core patterns are now provided by the logstash-patterns-core gem, and the directory is left for users to store custom patterns.
The following is the simplest but sufficient example of usage, excerpted from the official documentation:
USERNAME [a-zA-Z0-9._-]+
USER %{USERNAME}
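To make the reference mechanism concrete, here is a minimal Ruby sketch (not the actual grok implementation) of how %{NAME} references expand recursively into a single plain regular expression:

```ruby
# Hypothetical pattern table, mirroring the two-line example above.
PATTERNS = {
  "USERNAME" => "[a-zA-Z0-9._-]+",
  "USER"     => "%{USERNAME}",
}

# Recursively replace every %{NAME} reference with its definition.
def expand(pattern, patterns)
  pattern.gsub(/%\{(\w+)\}/) { expand(patterns.fetch($1), patterns) }
end

puts expand("%{USER}", PATTERNS)  # prints "[a-zA-Z0-9._-]+"
```

Real grok also handles the :capture_name and :data_type suffixes, but the core idea is this textual expansion.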
The first line defines a grok expression with an ordinary regular expression; the second line defines another grok expression by referencing the previous one in sprintf-style format. The full sprintf-style syntax of a grok expression is as follows:
%{PATTERN_NAME:capture_name:data_type}
Tip: data_type currently supports only two values: int and float.
So we can improve our configuration to look like this:
filter {
    grok {
        match => {
            "message" => "%{WORD} %{NUMBER:request_time:float} %{WORD}"
        }
    }
}
Rerun the process and get the following result:
{
    "message" => "begin 123.456 end",
    "@version" => "1",
    "@timestamp" => "2014-08-09T12:23:36.634Z",
    "host" => "raochenlindeMacBook-Air.local",
    "request_time" => 123.456
}
This time request_time has become a numeric type.
Best Practices
In practice, we need to deal with all kinds of log files. If you write expressions ad hoc in each configuration file, they quickly become unmanageable. Our suggestion is therefore to keep all grok expressions in one place, and point to it with the patterns_dir option of the grok filter.
If you grok all the information in "message" out into separate fields, the data is essentially stored twice. You can use the remove_field parameter to delete the message field, or use the overwrite parameter to overwrite the default message field and keep only the most important part.
An example using the overwrite parameter:
filter {
    grok {
        patterns_dir => "/path/to/your/own/patterns"
        match => {
            "message" => "%{SYSLOGBASE} %{DATA:message}"
        }
        overwrite => ["message"]
    }
}
Tips
multiline match
When using grok together with codec/multiline, there is one thing to watch out for. A grok regex, like an ordinary regex, does not match across newlines by default. When you do need that, you cannot append a /m modifier the way you would in Ruby (=~ //m); instead, add the (?m) flag at the beginning of the expression.
As follows:
match => {
    "message" => "(?m)\s+(?<request_time>\d+(?:\.\d+)?)\s+"
}
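A quick Ruby illustration of what (?m) changes (grok uses Ruby's regex engine, where "." does not match a newline unless (?m) is set); this is a standalone demo, not Logstash config:

```ruby
# A multiline event, as codec/multiline would assemble it.
log = "begin\n123.456\nend"

puts log.match?(/begin.*end/)      # prints "false": "." stops at the newline
puts log.match?(/(?m)begin.*end/)  # prints "true": (?m) lets "." cross newlines
```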
multiple choice
Sometimes we run into a situation where a log has several possible formats. A single regex covering all of them is hard to write, and joining them with | is ugly. Here, logstash's syntax offers us an interesting solution.
The documentation states that the match parameter of the grok filter should accept a Hash value. But because hashes in early logstash syntax were also written with [], it is still fine today to pass an Array value to match. So we can actually pass multiple regexes here to match the same field:
match => [
    "message", "(?<request_time>\d+(?:\.\d+)?)",
    "message", "%{SYSLOGBASE} %{DATA:message}",
    "message", "(?m)%{WORD}"
]
Logstash tries the patterns in the order they are defined, stopping at the first successful match. The effect is the same as one big regex joined with |, but the readability is much better.
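The first-match-wins behavior can be sketched in a few lines of Ruby (the two patterns here are hypothetical, purely for illustration):

```ruby
# Candidate patterns, tried in order against the same field.
patterns = [
  /(?<request_time>\d+\.\d+)/,  # a float, e.g. a request duration
  /(?<status>[A-Z]+)/,          # an uppercase word, e.g. a log level
]

# Return the first MatchData, or nil if nothing matches.
def first_match(line, patterns)
  patterns.each do |re|
    m = line.match(re)
    return m if m
  end
  nil
end

puts first_match("ERROR something", patterns)[:status]     # prints "ERROR"
puts first_match("took 12.5 ms", patterns)[:request_time]  # prints "12.5"
```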
Last but not least, I strongly recommend that everyone use the Grok Debugger to debug their grok expressions.
Reprinted from: http://udn.yyuap.com/doc/logstash-best-practice-cn/filter/grok.html