Logstash (ELK) installation and use

Logstash is an important member of the ELK technology stack. This article introduces some uses of Logstash.

Official website: Logstash: Collect, parse and transform logs | Elastic

It links to the Logstash documentation, the Logstash forums, and the downloads page.

1. Introduction

Logstash is an open source data collection engine. It has real-time data transmission capabilities and can collect, analyze, and store data according to our customized specifications. In other words, Logstash has three core components: data collection, data analysis (filtering), and data output. These three parts form a pipeline-like data flow: the input end collects data, the pipeline itself filters and analyzes it, and the output end writes the processed data to the target store.

A quick look at the official manual (Logstash Reference [8.4] | Elastic) shows that it is very feature-rich.

Many input and output plugins are available, such as elasticsearch, redis, file, http, kafka, tcp, udp, stdout, websocket, mongodb, and so on.

# The input plugins are as follows

Input plugins
	azure_event_hubs
	beats
	cloudwatch
	couchdb_changes
	dead_letter_queue
	elastic_agent
	elasticsearch
	exec
	file
	ganglia
	gelf
	generator
	github
	google_cloud_storage
	google_pubsub
	graphite
	heartbeat
	http
	http_poller
	imap
	irc
	java_generator
	java_stdin
	jdbc
	jms
	jmx
	kafka
	kinesis
	log4j
	lumberjack
	meetup
	pipe
	puppet_facter
	rabbitmq
	redis
	relp
	rss
	s3
	s3-sns-sqs
	salesforce
	snmp
	snmptrap
	sqlite
	sqs
	stdin
	stomp
	syslog
	tcp
	twitter
	udp
	unix
	varnishlog
	websocket
	wmi
	xmpp
# The output plugins are as follows

Output plugins
	boundary
	circonus
	cloudwatch
	csv
	datadog
	datadog_metrics
	dynatrace
	elastic_app_search
	elastic_workplace_search
	elasticsearch
	email
	exec
	file
	ganglia
	gelf
	google_bigquery
	google_cloud_storage
	google_pubsub
	graphite
	graphtastic
	http
	influxdb
	irc
	java_stdout
	juggernaut
	kafka
	librato
	loggly
	lumberjack
	metriccatcher
	mongodb
	nagios
	nagios_nsca
	opentsdb
	pagerduty
	pipe
	rabbitmq
	redis
	redmine
	riak
	riemann
	s3
	sink
	sns
	solr_http
	sqs
	statsd
	stdout
	stomp
	syslog
	tcp
	timber
	udp
	webhdfs
	websocket
	xmpp
	zabbix

https://www.elastic.co/guide/en/logstash/current/index.html

2. Logstash installation

1. Install the Java environment

Not much to say here; just make sure the Java environment is OK, for example by checking the Java version as shown below.
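A quick check from the shell (the version printed will of course depend on what is installed on your system):

# verify that Java is installed and on the PATH
java -version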

2. Download and install

curl -L -O https://artifacts.elastic.co/downloads/logstash/logstash-7.3.0.tar.gz
tar -xzvf logstash-7.3.0.tar.gz

3. Startup and verification

cd logstash-7.3.0
# Start Logstash; the -e option specifies the input and output inline. Here stdin is used as the input and stdout as the output.
./bin/logstash -e 'input { stdin { } } output { stdout {} }'

After startup completes, type a string and you will get the corresponding event as output, roughly as shown below; this confirms the installation succeeded.
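For illustration, typing a line produces an event roughly like the following (the first line is the typed input; values such as host and @timestamp will differ on your machine):

hello logstash
{
          "host" => "localhost",
      "@version" => "1",
    "@timestamp" => 2022-09-27T08:30:00.000Z,
       "message" => "hello logstash"
}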

For more installation methods, see "How to install Logstash in the Elastic Stack" on the Elastic China Community official blog (CSDN).

3. Use of Logstash

A Logstash pipeline has two required elements, inputs and outputs, and an optional element filters.
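As a minimal sketch, a pipeline configuration file therefore always has the following overall shape (the filter block can be omitted):

input {
  # one or more input plugins, e.g. stdin, file, tcp, beats ...
  stdin { }
}

filter {
  # optional: one or more filter plugins, e.g. grok, mutate, geoip ...
}

output {
  # one or more output plugins, e.g. stdout, elasticsearch ...
  stdout { }
}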

3.1. Configure config/logstash.yml

Set the config.reload.automatic option to true. The advantage is that Logstash does not need to be restarted every time the pipeline configuration file changes; the modified configuration is reloaded automatically.
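In config/logstash.yml this is a one-line change (the reload interval shown below is the default and can be adjusted):

# config/logstash.yml
config.reload.automatic: true
config.reload.interval: 3s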

3.2. Practice - receiving data on a TCP port

In the config directory (in fact, any directory is fine), create a test.conf file with the following content.

input: listen for TCP data on port 9900
output: print to standard output

input {
  tcp {
    port => 9900
  }
}
 
output {
  stdout {
     codec => rubydebug  # print to the console in rubydebug format
     #codec => json   # print to the console in JSON format
  }
}




Execute Logstash:

./bin/logstash -f ./config/test.conf

Then open another terminal and send data to port 9900 with the nc command.

echo 'hello logstash!!!!!!!' | nc localhost 9900

# Note: sending from another machine also works; just replace localhost with the IP of the host running Logstash.

The effect is as follows:

In the output, the upper part is the log record showing the configuration being reloaded automatically after modification, and the lower part is the event printed after switching the codec to json.

3.3. Practice - processing data with the Grok filter

input {
  tcp {
    port => 9900
  }
}
 
filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}
 
output {
  stdout {
    codec => rubydebug  # print to the console in rubydebug format
  }
}

Create a text file test.log with the following content:

14.49.42.25 - - [12/May/2019:01:24:44 +0000] "GET /articles/ppp-over-ssh/ HTTP/1.1" 200 18586 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2b1) Gecko/20091014 Firefox/3.6b1 GTB5"
14.49.42.25 - - [12/May/2019:01:24:15 +0000] "GET /articles/openldap-with-saslauthd/ HTTP/1.1" 200 12700 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2b1) Gecko/20091014 Firefox/3.6b1 GTB5"
14.49.42.25 - - [12/May/2019:01:24:06 +0000] "GET /articles/dynamic-dns-with-dhcp/ HTTP/1.1" 200 18848 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2b1) Gecko/20091014 Firefox/3.6b1 GTB5"
14.49.42.25 - - [12/May/2019:01:24:54 +0000] "GET /articles/ssh-security/ HTTP/1.1" 200 16543 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2b1) Gecko/20091014 Firefox/3.6b1 GTB5"
14.49.42.25 - - [12/May/2019:01:25:25 +0000] "GET /articles/week-of-unix-tools/ HTTP/1.1" 200 9313 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2b1) Gecko/20091014 Firefox/3.6b1 GTB5"
14.49.42.25 - - [12/May/2019:01:25:33 +0000] "GET /blog/geekery/headless-wrapper-for-ephemeral-xservers.html HTTP/1.1" 200 11902 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2b1) Gecko/20091014 Firefox/3.6b1 GTB5"
66.249.73.135 - - [12/May/2019:01:25:58 +0000] "GET /misc/nmh/replcomps HTTP/1.1" 200 891 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
114.80.81.51 - - [12/May/2019:01:26:10 +0000] "GET /blog/geekery/xvfb-firefox.html HTTP/1.1" 200 10975 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2b1) Gecko/20091014 Firefox/3.6b1 GTB5"
46.105.14.53 - - [12/May/2019:01:26:18 +0000] "GET /blog/tags/puppet?flav=rss20 HTTP/1.1" 200 14872 "-" "UniversalFeedParser/4.2-pre-314-svn +http://feedparser.org/"
61.55.141.10 - - [12/May/2019:01:26:17 +0000] "GET /blog/tags/boredom-induced-research HTTP/1.0" 200 17808 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2b1) Gecko/20091014 Firefox/3.6b1 GTB5"

Execute the command:

head -n 1 test.log | nc localhost 9900

The effect is as follows:

In other words, through regular-expression matching the Grok filter turns the unstructured input into structured data. From the output you can see that it has extracted fields such as request, port, host, and clientip.
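For reference, the structured event produced for the first log line looks roughly like the following (abridged; the exact fields depend on the Logstash version and the grok pattern):

{
        "message" => "14.49.42.25 - - [12/May/2019:01:24:44 +0000] \"GET /articles/ppp-over-ssh/ HTTP/1.1\" 200 18586 ...",
       "clientip" => "14.49.42.25",
      "timestamp" => "12/May/2019:01:24:44 +0000",
           "verb" => "GET",
        "request" => "/articles/ppp-over-ssh/",
    "httpversion" => "1.1",
       "response" => "200",
          "bytes" => "18586",
       "referrer" => "\"-\"",
           "host" => "localhost",
           "port" => 51234
}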

For more information about Grok filters, see "Logstash: Getting Started with Grok filters" on the Elastic China Community official blog (CSDN).

3.4. Practice - Geoip filter

So far we have the clientip field, but we do not know where that IP comes from, i.e. its country and its longitude/latitude. The Geoip filter can be used for this.

input {
  tcp {
    port => 9900
  }
}
 
filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
 
  geoip {
    source => "clientip"
  }
}
 
output {
  stdout { }
}

Execute:

head -n 1 test.log | nc localhost 9900

The effect is as follows
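Roughly speaking, the geoip filter adds a geoip field to each event; the sketch below only shows its shape, with placeholders instead of real lookup results (actual values depend on the GeoIP database shipped with Logstash):

"geoip" => {
     "country_name" => "...",
    "country_code2" => "...",
        "city_name" => "...",
         "location" => { "lat" => ..., "lon" => ... }
}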

3.5. Practice - Useragent filter

Notice that the agent field is relatively long, but details such as the browser name and operating system are not separated into their own fields. The event can be further enriched with the useragent filter.
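A minimal sketch of the filter section for this step, reusing the grok filter from above and assuming the raw user agent string is in the agent field it produces:

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }

  useragent {
    source => "agent"       # field holding the raw user agent string
    target => "useragent"   # parsed browser/OS/device info goes here
  }
}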

3.6. Practice - mutate/convert filter

Note that bytes is parsed as a string, but you would normally expect it to be a number. The mutate filter's convert option handles this.

input {
  tcp {
    port => 9900
  }
}
 
filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
 
  mutate {
    convert => {
      "bytes" => "integer"
    }
  }
 
  geoip {
    source => "clientip"
  }
 
  useragent {
    source => "agent"
    target => "useragent"
  }
 
}
 
output {
  stdout { }
}

However, the useragent filter does not seem to extract the corresponding fields here.
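One possible cause, stated here only as an assumption to verify against your own rubydebug output: depending on the Logstash version and its ECS compatibility setting, the grok pattern may place the raw user agent string under a different field name (for example [user_agent][original] rather than agent). In that case, point source at whatever field actually holds the string:

  useragent {
    # assumption: the raw user agent string ended up in [user_agent][original];
    # check your own event output and adjust source accordingly
    source => "[user_agent][original]"
    target => "useragent"
  }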

3.7. Practice - importing data into Elasticsearch

All of the previous examples output to stdout, i.e. to the console where Logstash runs. The following demonstrates outputting data to Elasticsearch.

Note: Elasticsearch and Kibana have already been deployed on this machine and are up and available.

input {
  tcp {
    port => 9900
  }
}
 
filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
 
  mutate {
    convert => {
      "bytes" => "integer"
    }
  }
 
  geoip {
    source => "clientip"
  }
 
  useragent {
    source => "agent"
    target => "useragent"
  }
 
  date {
    match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
  }
}
 
output {
  stdout { }
 
  elasticsearch {
    hosts => ["localhost:9200"]
    #user => "elastic"
    #password => "changeme"
  }
}


For the output we keep both stdout and elasticsearch; the former is mainly there to make debugging easier.

Execute the command:

head -n 1 test.log | nc localhost 9900

Execute the following commands in Kibana (Dev Tools) to view the data written to ES.

# count the number of documents
GET logstash/_count

# view the data
GET logstash/_search

# the logstash-named index/alias can also be seen here
GET _cat/aliases
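If Kibana is not at hand, the same checks can be run with curl directly against Elasticsearch (assuming ES on localhost:9200 and the default logstash index/alias created by the elasticsearch output):

# count the documents
curl -s 'http://localhost:9200/logstash/_count?pretty'

# view the data
curl -s 'http://localhost:9200/logstash/_search?pretty'

# list aliases / indices
curl -s 'http://localhost:9200/_cat/aliases?v'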

3.8. Practice - migrating ES data

Note: some ES instances may block ping, so use curl to check whether the machine running Logstash can actually reach the ES instance. See the article on curl commands commonly used with Elasticsearch.
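A minimal connectivity check from the Logstash host, assuming ES listens on port 9200 (replace the placeholder address, and add credentials only if security is enabled):

# should return a JSON banner with the cluster name and version
curl -s 'http://<es_host>:9200'

# with authentication enabled
curl -s -u elastic:<password> 'http://<es_host>:9200'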

As for how to configure it, find the corresponding elasticsearch plugins for input, output, etc. in the official documentation and look at the examples.

Elasticsearch input plugin | Logstash Reference [8.4] | Elastic

(1) input plugin: elasticsearch

According to the official manual, some commonly used parameters are listed below.

1. hosts: one or more ES hosts to query. Each host can be an IP, a hostname, IP:port, or HOST:port. The default port is 9200.
2. index: the index (or indices) to read from. Use "*" for all indices.
3. query: the query to execute.
4. proxy: a forward HTTP proxy. Empty (the default) means no proxy is used.
5. request_timeout_seconds: the maximum time, in seconds, for a single request to ES; timeouts are easy to hit when a single request returns a very large amount of data. Defaults to 60s.
6. schedule: run the query periodically on a cron-style schedule, e.g. "* * * * *" runs it once a minute. By default there is no schedule and the query is executed only once.
7. scroll: controls the keep-alive time of the scroll request (e.g. "1m") and launches the scroll process. The timeout applies to each round trip, i.e. from one scroll request to the next. Defaults to 1m.
8. size: the maximum number of hits returned per scroll. Defaults to 1000.
9. docinfo: if true, document metadata such as the index, type, and document id is included in the event. Defaults to false.
10. docinfo_fields: when docinfo is true, lists which metadata fields to keep in the event. Defaults to ["_index", "_type", "_id"].
11. docinfo_target: when docinfo is true, the name of the field under which the metadata fields are stored as sub-fields.

Create the es_sync.conf file:

# a working example
input {
  elasticsearch {
    hosts => ["http://11.168.176.227:9200"]
    index => "es_qidian_flow_oa_20220906"
    query => '{"query":{"bool":{ "must":[{"term":{"session_id": "webim_2852199659_240062447027410_1662447030899"}}]}}}'
  }
}

output {
  stdout { }
}


# Indentation is recommended for readability. This variant also works.
input {
  elasticsearch {
    hosts => ["http://11.168.176.227:9200"]
    index => "es_qidian_flow_oa_20220906"
    query => '{
      "query":{
        "bool":{
          "must":[
            {
              "term":{"session_id": "webim_2852199659_240062447027410_1662447030899"}
            }
          ]
        }
      }
    }'
  }
}

output {
  stdout { }
}



# It seems fine not to specify the type here.

Note: if you keep writing matching data to the source ES at this point, it will not be synchronized incrementally (the query is executed only once).

Set up a scheduled task. The following configuration queries once per minute and writes the result to the console's standard output.

input {
  elasticsearch {
    hosts => ["http://11.168.xxx.227:9200"]
    index => "es_qidian_flow_oa_20220906"
    query => '{
      "query":{
        "bool":{
          "must":[
            {
              "term":{"session_id": "webim_2852199659_240062447027410_1662447030899"}
            }
          ]
        }
      }
    }'
    scroll => "1m"
    docinfo => true 
    size => 2000
    schedule => "* * * * *"  # scheduled task: runs once per minute
  }
}

filter {
  mutate {
    remove_field => ["flow_type", "source", "@version", "@timestamp"]
  }
}

output {
  stdout { }
}

Q: How can incremental synchronization be achieved?

A: Use schedule to run the query periodically and configure the document id on the output side for deduplication. That alone may not be enough: the query should also restrict synchronization to the data of the last t minutes, where t should be at least as large as the schedule interval. The configuration below synchronizes the data of the last 3 minutes once every minute, so data should not be lost, and duplicates are avoided by reusing the document id.

Note: other write-ups on exporting data from Elasticsearch with Logstash appear to implement incremental synchronization in the same way.

input {
    elasticsearch {
        hosts => "1.1.1.1:9200"
        index => "es-runlog-2019.11.20"
        query => '{"query":{"range":{"@timestamp":{"gte":"now-3m","lte":"now/m"}}}}'
        size => 5000
        scroll => "1m"
        docinfo => true
        schedule => "* * * * *" # scheduled task: executed once per minute
      }
}
filter {
  mutate {
    remove_field => ["source", "@version"]
  }
}
output {
    stdout {}
}

(2) filter plugins

There are also many filter plugins; here are a few. Mutate filter plugin | Logstash Reference [8.4] | Elastic

The date, grok, and geoip plugins were briefly demonstrated above. Here the mutate plugin is introduced in more detail.

 Common options applicable to all filter plugins:

1. add_field: add arbitrary fields to the event. Field names can be dynamic and include parts of the event using %{field}.
    filter {
      mutate {
        add_field => { "foo_%{somefield}" => "Hello world, from %{host}" }
      }
    }
2. remove_field: remove arbitrary fields from the event. Field names can be dynamic and include parts of the event using %{field}. Example:
    filter {
      mutate {
        remove_field => [ "foo_%{somefield}" ]
      }
    }
3. add_tag: add arbitrary tags to the event.
4. remove_tag: remove tags from the event.

Some options of the mutate plugin:

1. convert: convert a field's value to a different type, e.g. a string to an integer. If the field value is an array, all members are converted. If the field is a hash, no action is taken.
    filter {
      mutate {
        convert => {
          "fieldname" => "integer"
          "booleanfield" => "boolean"
        }
      }
    }
2. copy: copy an existing field to another field. An existing destination field is overwritten.
    filter {
      mutate {
         copy => { "source_field" => "dest_field" }
      }
    }

3. merge: merge two fields of arrays or hashes.
4. rename: rename one or more fields.
5. replace: replace the value of a field with a new value.
6. update: update the value of an existing field (nothing happens if the field does not exist).
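A small sketch of the last few options, following the same pattern as above (the field names here are purely illustrative):

    filter {
      mutate {
        rename  => { "shortname" => "long_name" }          # rename a field
        replace => { "status_text" => "code=%{response}" } # replace a value (sprintf allowed)
        update  => { "level" => "info" }                   # update only if the field already exists
      }
    }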

After adding mutate remove_field to the conf above, you can see that the output no longer includes the fields listed in remove_field.

(3) output plugin: elasticsearch

See the official documentation: Output plugins | Logstash Reference [8.4] | Elastic

The following imports data from one ES into another ES, keeping the index, type, and document id unchanged; a schedule is also configured. As follows:

input {
  elasticsearch {
    hosts => ["http://11.168.xxx.227:9200"]
    index => "es_qidian_flow_oa_20220906"
    query => '{
      "query":{
        "bool":{
          "must":[
            {
              "term":{"session_id": "webim_2852199659_240062447027410_1662447030899"}
            }
          ]
        }
      }
    }'
    scroll => "1m"
    docinfo => true 
    size => 2000
    schedule => "* * * * *"
  }
}

filter {
  mutate {
    remove_field => ["flow_type", "source", "@version", "@timestamp"]
  }
}

output {
  elasticsearch {
        hosts => ["http://10.101.xxx.15:9200"]
        index => "%{[@metadata][_index]}"
        document_type => "%{[@metadata][_type]}"
        document_id => "%{[@metadata][_id]}"
  }
  stdout { }
}

After testing, the effect is completely in line with expectations.

What if no field in the event provides the target index name (or its prefix)?

In that case, you can use the mutate filter (together with conditionals) to add a [@metadata] field that sets the target index for each event. [@metadata] fields are not sent to Elasticsearch.

input {
  elasticsearch {
    hosts => ["http://11.168.176.227:9200"]
    index => "es_qidian_flow_oa_20220906"
    query => '{
      "query":{
        "bool":{
          "must":[
            {
              "term":{"session_id": "webim_2852199659_240062447027410_1662447030899"}
            }
          ]
        }
      }
    }'
    scroll => "1m"
    docinfo => true 
    size => 2000
    schedule => "* * * * *"
  }
}

filter {
  mutate {
    remove_field => ["flow_type", "source", "@version", "@timestamp"]
    add_field => { "[@metadata][new_index]" => "es_qidian_flow_oa_zs_%{+YYYY_MM}" }
  }
}

output {
  elasticsearch {
        hosts => ["http://10.101.203.15:9200"]
        index => "%{[@metadata][new_index]}"
        document_type => "%{[@metadata][_type]}"
        document_id => "%{[@metadata][_id]}"
  }
  stdout { }
}

The resulting index name follows the es_qidian_flow_oa_zs_%{+YYYY_MM} pattern configured above.

Here is another example; in short, all kinds of name splicing work.

    filter {
      if [log_type] in [ "test", "staging" ] {
        mutate { add_field => { "[@metadata][target_index]" => "test-%{+YYYY.MM}" } }
      } else if [log_type] == "production" {
        mutate { add_field => { "[@metadata][target_index]" => "prod-%{+YYYY.MM.dd}" } }
      } else {
        mutate { add_field => { "[@metadata][target_index]" => "unknown-%{+YYYY}" } }
      }
    }
    output {
      elasticsearch {
        index => "%{[@metadata][target_index]}"
      }
    }

For incremental migration, you can refer to this practice:

A record of an online cross-cluster ES data migration - Tencent Cloud Developer Community - Tencent Cloud


Origin blog.csdn.net/mijichui2153/article/details/127113364