Logstash is an important member of the ELK technology stack. This article introduces some common uses of Logstash.
Official website: Logstash: Collect, parse and transform logs | Elastic, which links to the Logstash documentation, forums, and downloads.
1. Introduction
Logstash is an open-source data collection engine. It has real-time data transmission capabilities and can collect, parse, and store data according to our custom specifications. In other words, Logstash has three core components: data collection, data parsing, and data output. These three parts form a pipeline-like data flow: the input end collects data, the pipeline itself filters and parses it, and the output end writes the filtered and parsed data to the target datastore.
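As a rough illustration of this three-stage flow (this is not Logstash code; the stage names and sample data are made up for the sketch), the pipeline behaves like a chain of functions:

```python
# Illustrative analogy of Logstash's pipeline: input -> filter -> output.
# The stage names and data below are ours, not a Logstash API.

def input_stage():
    # Collect raw events (a hard-coded list standing in for TCP/file/Kafka input).
    return ["event one", "event two"]

def filter_stage(events):
    # Parse/transform each event (wrap each line in a dict with an extra field).
    return [{"message": e, "length": len(e)} for e in events]

def output_stage(events):
    # Ship the processed events to a sink (return them instead of Elasticsearch).
    return events

processed = output_stage(filter_stage(input_stage()))
print(processed[0])  # {'message': 'event one', 'length': 9}
```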
A quick look at the official manual (Logstash Reference [8.4] | Elastic) shows that its functionality is quite powerful.
Many plugins are available, such as elasticsearch/redis/file/http/kafka/tcp/udp/stdout/websocket/mongodb, etc.
Input plugins:
azure_event_hubs
beats
cloudwatch
couchdb_changes
dead_letter_queue
elastic_agent
elasticsearch
exec
file
ganglia
gelf
generator
github
google_cloud_storage
google_pubsub
graphite
heartbeat
http
http_poller
imap
irc
java_generator
java_stdin
jdbc
jms
jmx
kafka
kinesis
log4j
lumberjack
meetup
pipe
puppet_facter
rabbitmq
redis
relp
rss
s3
s3-sns-sqs
salesforce
snmp
snmptrap
sqlite
sqs
stdin
stomp
syslog
tcp
twitter
udp
unix
varnishlog
websocket
wmi
xmpp
Output plugins:
boundary
circonus
cloudwatch
csv
datadog
datadog_metrics
dynatrace
elastic_app_search
elastic_workplace_search
elasticsearch
email
exec
file
ganglia
gelf
google_bigquery
google_cloud_storage
google_pubsub
graphite
graphtastic
http
influxdb
irc
java_stdout
juggernaut
kafka
librato
loggly
lumberjack
metriccatcher
mongodb
nagios
nagios_nsca
opentsdb
pagerduty
pipe
rabbitmq
redis
redmine
riak
riemann
s3
sink
sns
solr_http
sqs
statsd
stdout
stomp
syslog
tcp
timber
udp
webhdfs
websocket
xmpp
zabbix
https://www.elastic.co/guide/en/logstash/current/index.html
2. Logstash installation
1. Install the Java environment
Not much to say here; just make sure the Java environment works, e.g. by checking that java -version runs successfully.
2. Download and install
curl -L -O https://artifacts.elastic.co/downloads/logstash/logstash-7.3.0.tar.gz
tar -xzvf logstash-7.3.0.tar.gz
3. Startup and verification
cd logstash-7.3.0
# Start Logstash; the -e option specifies the input and output. Here stdin is used as input and stdout as output.
./bin/logstash -e 'input { stdin { } } output { stdout {} }'
After startup completes, type a string and you should see the corresponding output; if so, the installation succeeded.
For more installation methods, see "How to install Logstash in the Elastic stack" on the Elastic China Community official blog (CSDN).
3. Using Logstash
A Logstash pipeline has two required elements, inputs and outputs, and an optional element filters.
3.1. Configure config/logstash.yml
Change the config.reload.automatic option to true. The advantage is that Logstash does not need to be restarted every time the configuration file changes; the modified configuration file is loaded automatically.
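For reference, the relevant lines in config/logstash.yml look like this (config.reload.automatic and config.reload.interval are documented Logstash settings; the interval shown is the default):

```yaml
# Reload pipeline configuration files automatically when they change,
# so Logstash does not need a restart after editing the conf file.
config.reload.automatic: true
# How often Logstash checks the config files for changes (default: 3s)
config.reload.interval: 3s
```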
3.2. Practice - receiving data on a TCP port
In the config directory (actually, any directory is fine), create a test.conf file with the following content.
input: listens for TCP data on port 9900
output: prints to standard output
input {
tcp {
port => 9900
}
}
output {
stdout {
codec => rubydebug # output to the console in rubydebug format
#codec => json # output to the console in JSON format
}
}
Execute Logstash:
./bin/logstash -f ./config/test.conf
Then open another terminal and send data to port 9900 with the nc command.
echo 'hello logstash!!!!!!!' | nc localhost 9900
# Note: sending from another machine also works; just replace localhost with the IP of the host running Logstash.
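If nc is not available, the same line can be sent with a few lines of Python. The snippet below is self-contained: it spins up a throwaway TCP server standing in for the Logstash tcp input (against a real Logstash you would simply connect to port 9900 and skip the server part):

```python
import socket
import threading

# Stand-in for the Logstash tcp input: accept one connection and record the data.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))   # 0 = pick a free port; use 9900 against a real Logstash
srv.listen(1)
port = srv.getsockname()[1]
received = []

def accept_once():
    conn, _ = srv.accept()
    received.append(conn.recv(1024).decode())
    conn.close()

t = threading.Thread(target=accept_once)
t.start()

# Equivalent of: echo 'hello logstash!!!!!!!' | nc localhost <port>
with socket.create_connection(("127.0.0.1", port)) as c:
    c.sendall(b"hello logstash!!!!!!!\n")

t.join()
srv.close()
print(received[0].strip())  # hello logstash!!!!!!!
```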
The effect is as follows: the upper part of the output shows the configuration file being reloaded automatically after modification, and the lower part shows the output after switching to the JSON format.
3.3. Practice - processing data with the Grok filter
input {
tcp {
port => 9900
}
}
filter {
grok {
match => { "message" => "%{COMBINEDAPACHELOG}" }
}
}
output {
stdout {
codec => rubydebug # output to the console in rubydebug format
}
}
Create a text file test.log with the following content:
14.49.42.25 - - [12/May/2019:01:24:44 +0000] "GET /articles/ppp-over-ssh/ HTTP/1.1" 200 18586 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2b1) Gecko/20091014 Firefox/3.6b1 GTB5"
14.49.42.25 - - [12/May/2019:01:24:15 +0000] "GET /articles/openldap-with-saslauthd/ HTTP/1.1" 200 12700 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2b1) Gecko/20091014 Firefox/3.6b1 GTB5"
14.49.42.25 - - [12/May/2019:01:24:06 +0000] "GET /articles/dynamic-dns-with-dhcp/ HTTP/1.1" 200 18848 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2b1) Gecko/20091014 Firefox/3.6b1 GTB5"
14.49.42.25 - - [12/May/2019:01:24:54 +0000] "GET /articles/ssh-security/ HTTP/1.1" 200 16543 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2b1) Gecko/20091014 Firefox/3.6b1 GTB5"
14.49.42.25 - - [12/May/2019:01:25:25 +0000] "GET /articles/week-of-unix-tools/ HTTP/1.1" 200 9313 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2b1) Gecko/20091014 Firefox/3.6b1 GTB5"
14.49.42.25 - - [12/May/2019:01:25:33 +0000] "GET /blog/geekery/headless-wrapper-for-ephemeral-xservers.html HTTP/1.1" 200 11902 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2b1) Gecko/20091014 Firefox/3.6b1 GTB5"
66.249.73.135 - - [12/May/2019:01:25:58 +0000] "GET /misc/nmh/replcomps HTTP/1.1" 200 891 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
114.80.81.51 - - [12/May/2019:01:26:10 +0000] "GET /blog/geekery/xvfb-firefox.html HTTP/1.1" 200 10975 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2b1) Gecko/20091014 Firefox/3.6b1 GTB5"
46.105.14.53 - - [12/May/2019:01:26:18 +0000] "GET /blog/tags/puppet?flav=rss20 HTTP/1.1" 200 14872 "-" "UniversalFeedParser/4.2-pre-314-svn +http://feedparser.org/"
61.55.141.10 - - [12/May/2019:01:26:17 +0000] "GET /blog/tags/boredom-induced-research HTTP/1.0" 200 17808 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2b1) Gecko/20091014 Firefox/3.6b1 GTB5"
Execute the command:
head -n 1 test.log | nc localhost 9900
The effect is as follows:
In other words, the Grok filter uses regular expression matching to turn our unstructured input into structured data. From the output above you can see that it has extracted fields such as request, port, host, and clientip.
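Under the hood, grok patterns such as %{COMBINEDAPACHELOG} expand into named regular expressions. The Python sketch below is a hand-written approximation of that idea (the regex is simplified and is not the real grok pattern library):

```python
import re

# Hand-written approximation of the Apache combined log format;
# the real %{COMBINEDAPACHELOG} grok pattern is more thorough.
LOG_RE = re.compile(
    r'(?P<clientip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) HTTP/(?P<httpversion>\S+)" '
    r'(?P<response>\d+) (?P<bytes>\d+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('14.49.42.25 - - [12/May/2019:01:24:44 +0000] '
        '"GET /articles/ppp-over-ssh/ HTTP/1.1" 200 18586 "-" '
        '"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2b1) '
        'Gecko/20091014 Firefox/3.6b1 GTB5"')

# Named groups become structured fields, just as grok turns the raw
# message into clientip, verb, request, response, bytes, agent, etc.
fields = LOG_RE.match(line).groupdict()
print(fields["clientip"], fields["verb"], fields["bytes"])  # 14.49.42.25 GET 18586
```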
For more information about Grok filters, see "Logstash: Getting started with Grok filters" on the Elastic China Community official blog (CSDN).
3.4. Practice - Geoip filter
We now know the clientip, but not where that IP comes from, i.e. the specific country, longitude and latitude, and other geographic information. The Geoip filter can be used for this.
input {
tcp {
port => 9900
}
}
filter {
grok {
match => { "message" => "%{COMBINEDAPACHELOG}" }
}
geoip {
source => "clientip"
}
}
output {
stdout { }
}
Execute:
head -n 1 test.log | nc localhost 9900
The effect is as follows
3.5. Practice - Useragent filter
Notice that the agent field is quite long, while details such as the browser and operating system are not separated into distinct fields. The useragent filter can enrich this further.
3.6. Practice - mutate/convert filter
Note that bytes is a string, but we would actually expect it to be a number. The mutate/convert filter can be used for this.
input {
tcp {
port => 9900
}
}
filter {
grok {
match => { "message" => "%{COMBINEDAPACHELOG}" }
}
mutate {
convert => {
"bytes" => "integer"
}
}
geoip {
source => "clientip"
}
useragent {
source => "agent"
target => "useragent"
}
}
output {
stdout { }
}
However, the useragent filter does not seem to extract the corresponding fields here; a likely cause is that the field name emitted by grok does not match the configured source => "agent" (with ECS compatibility enabled, newer versions put the user agent under [user_agent][original]).
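What convert does is roughly the following (a Python analogy; the event dict and the helper function are made up for illustration, with the bytes field taken from the grok output above):

```python
# Rough analogy of: mutate { convert => { "bytes" => "integer" } }
event = {"clientip": "14.49.42.25", "bytes": "18586"}  # grok emits strings

def convert_field(event, field, target_type):
    # Convert one scalar field in place, mirroring mutate/convert semantics.
    casts = {"integer": int, "float": float, "string": str}
    if field in event:
        event[field] = casts[target_type](event[field])
    return event

convert_field(event, "bytes", "integer")
print(event["bytes"], type(event["bytes"]).__name__)  # 18586 int
```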
3.7. Practice - importing data into Elasticsearch
All previous output went to stdout, i.e. the console where Logstash runs. The following demonstrates outputting data to Elasticsearch.
Note: Elasticsearch and Kibana are already deployed on this machine, started successfully, and available.
input {
tcp {
port => 9900
}
}
filter {
grok {
match => { "message" => "%{COMBINEDAPACHELOG}" }
}
mutate {
convert => {
"bytes" => "integer"
}
}
geoip {
source => "clientip"
}
useragent {
source => "agent"
target => "useragent"
}
date {
match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
}
}
output {
stdout { }
elasticsearch {
hosts => ["localhost:9200"]
#user => "elastic"
#password => "changeme"
}
}
For the output we keep both stdout and elasticsearch; the former is mainly for debugging convenience.
Execute the command:
head -n 1 test.log | nc localhost 9900
Execute the following commands in Kibana to view the data written to ES.
# Count the number of documents
GET logstash/_count
# Retrieve the data
GET logstash/_search
# You can also see the corresponding logstash-named index
GET _cat/aliases
3.8. Practice - migrating ES data
Note: some ES instances may have ping disabled. Use curl to check whether the machine running Logstash can reach the ES instance; see the commonly used curl commands for ES (Elasticsearch).
For configuration details, find the corresponding elasticsearch plugin under input, output, etc. in the official documentation and follow the examples.
Elasticsearch input plugin | Logstash Reference [8.4] | Elastic
(1) Using elasticsearch as input
Some commonly used parameters, per the official manual:
1. hosts: one or more ES hosts to query. Each host can be an IP, HOST, IP:port, or HOST:port. The default port is 9200.
2. index: the index to act on. Use "*" for all indices.
3. query: the query statement to execute.
4. proxy: a forward HTTP proxy. Empty (the default) means no proxy is set.
5. request_timeout_seconds: the maximum time in seconds for a single request to ES; timeouts occur easily when a single request is very large. The default is 60s.
6. schedule: as the name suggests, run periodically on a cron-format schedule, e.g. "* * * * *" executes the query once per minute. By default there is no schedule, in which case the query is executed only once.
7. scroll: controls the keep-alive time of the scroll request and starts the scroll process. The timeout applies to each round trip, i.e. from one scroll request to the next. The default is 1m.
8. size: the maximum number of hits returned per scroll. The default is 1000.
9. docinfo: if set, document metadata such as index, type, and docid are included in the event. A boolean, false by default.
10. docinfo_fields: if metadata storage is enabled by setting docinfo to true, this lists which metadata fields to keep in the event. The default is ["_index", "_type", "_id"].
11. docinfo_target: if metadata storage is enabled by setting docinfo to true, this option names the field under which the metadata fields are stored as subfields.
Create the es_sync.conf file:
# This works:
input {
elasticsearch {
hosts => ["http://11.168.176.227:9200"]
index => "es_qidian_flow_oa_20220906"
query => '{"query":{"bool":{ "must":[{"term":{"session_id": "webim_2852199659_240062447027410_1662447030899"}}]}}}'
}
}
output {
stdout { }
}
# Indentation is preferable, as it is easier to read. This also works:
input {
elasticsearch {
hosts => ["http://11.168.176.227:9200"]
index => "es_qidian_flow_oa_20220906"
query => '{
"query":{
"bool":{
"must":[
{
"term":{"session_id": "webim_2852199659_240062447027410_1662447030899"}
}
]
}
}
}'
}
}
output {
stdout { }
}
# The type does not seem to need to be specified either.
Note: if matching data keeps being written to the source ES at this point, it is not synchronized incrementally (the query runs only once).
Set up a scheduled task. The following configuration queries once per minute and prints the result to the console's standard output.
input {
elasticsearch {
hosts => ["http://11.168.xxx.227:9200"]
index => "es_qidian_flow_oa_20220906"
query => '{
"query":{
"bool":{
"must":[
{
"term":{"session_id": "webim_2852199659_240062447027410_1662447030899"}
}
]
}
}
}'
scroll => "1m"
docinfo => true
size => 2000
schedule => "* * * * *" # scheduled task, once per minute
}
}
filter {
mutate {
remove_field => ["flow_type", "source", "@version", "@timestamp"]
}
}
output {
stdout { }
}
Q: How to achieve incremental synchronization?
A: use schedule for the timed task and deduplicate by docid on the output side. That alone may not be enough: the query should also filter for data within the last t minutes, where t matches the schedule interval. The configuration below syncs the data of the last 3 minutes once per minute, so data should not be lost; in addition, deduplication is achieved via the docid.
Note: judging from other write-ups on exporting data from Elasticsearch with Logstash, incremental synchronization does appear to be implemented this way.
input {
elasticsearch {
hosts => "1.1.1.1:9200"
index => "es-runlog-2019.11.20"
query => '{"query":{"range":{"@timestamp":{"gte":"now-3m","lte":"now/m"}}}}'
size => 5000
scroll => "1m"
docinfo => true
schedule => "* * * * *" # scheduled task, runs every minute
}
}
filter {
mutate {
remove_field => ["source", "@version"]
}
}
output {
stdout {}
}
(2) Filter plugins
There are many of these as well; a few are covered here. See Mutate filter plugin | Logstash Reference [8.4] | Elastic.
The date, grok, and geoip plugins were briefly demonstrated earlier. Here is a closer look at the mutate plugin.
Common options applicable to all filter plugins:
1. add_field: add an arbitrary field to the event. Field names can be dynamic and include parts of the event using %{field}.
filter {
mutate {
add_field => { "foo_%{somefield}" => "Hello world, from %{host}" }
}
}
2. remove_field: remove an arbitrary field from the event. Field names can be dynamic and include parts of the event using %{field}, as in the example.
filter {
mutate {
remove_field => [ "foo_%{somefield}" ]
}
}
3. add_tag:
4. remove_tag:
Some options of the mutate plugin:
1. convert: convert a field's value to a different type, e.g. string to integer. If the field value is an array, all members are converted. If the field is a hash, no action is taken.
filter {
mutate {
convert => {
"fieldname" => "integer"
"booleanfield" => "boolean"
}
}
}
2. copy: copy an existing field to another field. An existing destination field is overwritten.
filter {
mutate {
copy => { "source_field" => "dest_field" }
}
}
3. merge:
4. rename:
5. replace:
6. update:
After adding mutate/remove_field to the conf above, the output no longer includes the fields listed under remove_field.
(3) Using elasticsearch as output
See the official documentation: Output plugins | Logstash Reference [8.4] | Elastic.
Import data from one ES into another ES, keeping the index, type, and docid unchanged; a schedule timed task is also configured, as follows.
input {
elasticsearch {
hosts => ["http://11.168.xxx.227:9200"]
index => "es_qidian_flow_oa_20220906"
query => '{
"query":{
"bool":{
"must":[
{
"term":{"session_id": "webim_2852199659_240062447027410_1662447030899"}
}
]
}
}
}'
scroll => "1m"
docinfo => true
size => 2000
schedule => "* * * * *"
}
}
filter {
mutate {
remove_field => ["flow_type", "source", "@version", "@timestamp"]
}
}
output {
elasticsearch {
hosts => ["http://10.101.xxx.15:9200"]
index => "%{[@metadata][_index]}"
document_type => "%{[@metadata][_type]}"
document_id => "%{[@metadata][_id]}"
}
stdout { }
}
After testing, the effect fully matches expectations.
What if no field in the event contains the prefix of the target index?
In that case, use the mutate filter and conditionals to add a [@metadata] field that sets the target index for each event. [@metadata] fields are not sent to Elasticsearch.
input {
elasticsearch {
hosts => ["http://11.168.176.227:9200"]
index => "es_qidian_flow_oa_20220906"
query => '{
"query":{
"bool":{
"must":[
{
"term":{"session_id": "webim_2852199659_240062447027410_1662447030899"}
}
]
}
}
}'
scroll => "1m"
docinfo => true
size => 2000
schedule => "* * * * *"
}
}
filter {
mutate {
remove_field => ["flow_type", "source", "@version", "@timestamp"]
add_field => { "[@metadata][new_index]" => "es_qidian_flow_oa_zs_%{+YYYY_MM}"}
}
}
output {
elasticsearch {
hosts => ["http://10.101.203.15:9200"]
index => "%{[@metadata][new_index]}"
document_type => "%{[@metadata][_type]}"
document_id => "%{[@metadata][_id]}"
}
stdout { }
}
The resulting index name is as follows.
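For reference, %{+YYYY_MM} is Logstash's date-format sprintf syntax applied to the event's @timestamp. A rough Python analogy for how the target index name is computed (the prefix and timestamp are taken from the example above; the helper function is hypothetical):

```python
from datetime import datetime, timezone

def target_index(event_timestamp, prefix="es_qidian_flow_oa_zs_"):
    # Rough equivalent of "prefix%{+YYYY_MM}": format the *event's*
    # timestamp, not the wall clock, into the index name.
    return prefix + event_timestamp.strftime("%Y_%m")

ts = datetime(2022, 9, 6, tzinfo=timezone.utc)
print(target_index(ts))  # es_qidian_flow_oa_zs_2022_09
```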
Here is another example. In short, all kinds of concatenation work.
filter {
if [log_type] in [ "test", "staging" ] {
mutate { add_field => { "[@metadata][target_index]" => "test-%{+YYYY.MM}" } }
} else if [log_type] == "production" {
mutate { add_field => { "[@metadata][target_index]" => "prod-%{+YYYY.MM.dd}" } }
} else {
mutate { add_field => { "[@metadata][target_index]" => "unknown-%{+YYYY}" } }
}
}
output {
elasticsearch {
index => "%{[@metadata][target_index]}"
}
}
For incremental migration, you can refer to this practice: