[elasticsearch topic]: Logstash from getting started to synchronizing MySQL data

1 Introduction

  Elasticsearch is an open-source distributed search and analytics engine in the data processing ecosystem, designed for storing, retrieving, and analyzing large amounts of data. Alongside it sits Kibana, an open-source data visualization platform that elegantly presents the data held in Elasticsearch and lets users build dashboards, charts, and reports. A complete data flow does not stop there, however: Logstash takes on the data processing role and rounds out the pipeline. Together, these three components make up what we usually call the ELK stack. Let's start with a detailed introduction to Logstash.

1.1 What is Logstash?

As an open-source data collection engine with real-time pipelining capabilities, Logstash is very powerful. It can collect data from different sources, dynamically unify it, and then transform it according to the specifications we define or deliver it to the destinations we specify.

1.2 Main features of Logstash

By cleansing and diversifying data, Logstash makes it suitable for a wide range of advanced downstream analytics and visualization use cases. In addition, Logstash offers a broad selection of input, filter, and output plugins, and many native codecs further simplify data ingestion. Whether for data preparation or for feeding downstream applications, Logstash is a powerful and flexible solution.
The architecture diagram on the official Logstash website illustrates its role and function well.

2. Download and configuration

  This article uses 7.17.12, the latest release in the 7.17 line. The 7.x series is also the last to support JDK 8; 8.0 and later default to JDK 11.

2.1 download

Go to the Elastic downloads page, select Logstash as the product and 7.17.12 as the version.
  Regarding the operating system: for Linux on x86 hardware, download LINUX X86_64; for Windows, choose WINDOWS. There is no functional difference between the two; startup and configuration are essentially the same, and the Windows version is convenient for learning and testing.

2.2 File structure

The decompressed file directory is as follows:

  • bin : Contains the executable scripts used to start Logstash; running the startup script launches Logstash.

  • config : Store the configuration file of Logstash.

  • data : The data directory is used to store Logstash's persistent data, such as internal state information, temporary files, etc.

  • jdk : Contains the Java Development Kit (JDK) version required by Logstash. It is a specific JDK version that comes with Logstash to ensure that Logstash has the required Java environment at runtime.

  • lib : This directory usually contains Logstash's dependent libraries and plugins.

  • logstash-core and logstash-core-plugin-api : These directories contain the code for the Logstash core functionality and the plugin API. The Logstash core implements the core logic of the data processing pipeline, and the plug-in API allows developers to create custom plug-ins to extend the functionality of Logstash.

  • modules : The modules directory contains some predefined Logstash modules for processing specific types of data such as logs, network traffic, etc. These modules simplify configuration and provide some default settings.

  • tools : This directory contains some tools for Logstash, such as performance analysis tools, which can be used to diagnose and optimize the performance of Logstash.

  • vendor : This directory contains dependent libraries and plugins required by Logstash, as well as some other tools.

  • x-pack : An extension suite for Elasticsearch, Kibana, Logstash, and Beats that provides security, monitoring, alerting, machine learning, and other advanced features designed to enhance the Elastic Stack.

2.3 Environment configuration

  Logstash recommends setting the LS_JAVA_HOME environment variable to point to the JDK directory it should use. We can simply point it at the jdk directory bundled with Logstash, which avoids a separate download and any version-compatibility problems. (Version 7.17.12 only supports JDK 8, JDK 11, and JDK 15.)

  In versions up to and including 7.17.12, the JAVA_HOME environment variable we may already have configured is still honored, but support for it will be removed in later releases.
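A minimal sketch (the installation paths below are hypothetical; substitute your own):

# Linux: point LS_JAVA_HOME at the JDK bundled with Logstash
export LS_JAVA_HOME=/usr/local/logstash/jdk

# Windows (cmd): the same idea with set
set LS_JAVA_HOME=D:\logstash\jdk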

3. Three core components of Logstash

A Logstash pipeline has two required elements, input (collects the source data) and output (emits the processed data), plus one optional element, filter (transforms the data). Between the source data and Elasticsearch, the three plugin stages are chained in that order: input reads events from the source, filter reshapes them, and output writes them to the destination.

3.1 Input

input is the plugin/configuration section that collects data from different data sources. Input plugins let you define where data comes from and send it into Logstash for subsequent processing. Logstash supports many types of input plugins, each suited to a different kind of data source. Commonly used ones include:

  • file : read source data from a file
  • github : read events pushed from a GitHub webhook
  • http : Receive data via http/https
  • jdbc : read data through jdbc driver

The full list of input plugins is documented on the official Logstash website.
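As an illustrative sketch (the log file path here is hypothetical, not from the original article), a minimal pipeline that reads a file with the file input and prints each event could look like this:

input {
  file {
    # hypothetical path; point this at an existing log file
    path => "/var/log/app/app.log"
    start_position => "beginning"
  }
}
output {
  stdout { }
}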

3.2 Filter

filter is the part that processes, transforms, and filters data. Filter plugins let you operate on events after they enter Logstash (after the input stage) to meet specific needs such as data cleansing, parsing, and normalization. Logstash supports many types of filter plugins, each suited to a different processing need. Here are some common ones:

  • csv : Parses comma-separated value data into a single field
  • clone : duplicate events
  • date : parse a date from a field and use it as the Logstash timestamp of the event
  • grok : Parse unstructured event data into fields.
    The full list of filter plugins is documented on the official Logstash website.
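As a hedged sketch (the log format, field names, and grok pattern are assumptions for illustration), a filter section that parses a line such as "2023-08-13 15:47:01 INFO something happened" and promotes its timestamp could look like this:

filter {
  # split the raw message into structured fields
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:log_time} %{LOGLEVEL:level} %{GREEDYDATA:msg}" }
  }
  # use the parsed time as the event's @timestamp
  date {
    match => [ "log_time", "yyyy-MM-dd HH:mm:ss" ]
  }
}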

3.3 Output

output is the plugin/configuration section that sends processed data to different destinations. Output plugins let you define where data goes and deliver the events Logstash has processed to those destinations. Logstash supports many types of output plugins, each suited to different storage, transport, or processing needs. Here are some common ones:

  • elasticsearch : store to elasticsearch
  • email : send an email to the specified address when output is received
  • file : store to file
  • mongodb : write data to mongodb.
    The full list of output plugins is documented on the official Logstash website.
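As a small sketch (the index name and file path are illustrative assumptions), an output section that writes each event to Elasticsearch and also keeps a copy in a local file could look like this:

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    # daily index name; an assumption for this example
    index => "app-log-%{+YYYY.MM.dd}"
  }
  file {
    # hypothetical path for an on-disk copy of each event
    path => "/tmp/logstash-out.log"
  }
}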

4. Hands-on practice: Hello World example

4.1 How to start Logstash

Starting Logstash is straightforward: simply run logstash (Linux) or logstash.bat (Windows) from the bin directory.

# Linux startup command
./bin/logstash

# Windows startup command
.\bin\logstash.bat

4.2 Detailed Explanation of Commonly Used Configuration Files

Before writing an example, you should understand a few important configuration files.

  • logstash.yml : the main Logstash configuration file, containing global settings and options. In this file you can configure network settings, paths, logging, and other global parameters that affect Logstash's overall behavior (a minimal sketch follows this list).

Some common configuration items include:
pipeline.batch.size: specifies the number of events processed per batch.
pipeline.batch.delay: Specifies the delay between each batch.
path.data: Specifies the storage path of Logstash data.
http.host: Specifies the host name for HTTP listening.
http.port: Specify the port number for HTTP listening.
pipeline.workers: Specifies the number of worker threads to process events in parallel.
queue.type: Specifies the queue type, either memory (in-memory) or persisted (durable, disk-based).

  • pipelines.yml : A configuration file for configuring and managing Logstash data processing pipelines. Logstash can run multiple data processing pipelines concurrently, each with its own configuration of inputs, filters, and outputs.
  • jvm.options : It is a file used to configure Logstash JVM (Java Virtual Machine) options. This file affects Logstash performance and resource allocation. You can configure the heap memory size, garbage collection options, etc. in this file.

Some common configuration items include:
-Xmx: Specifies the maximum value of Java heap memory.
-Xms: Specifies the initial value of the Java heap memory.
-XX:+UseConcMarkSweepGC: Specifies to use the CMS (Concurrent Mark-Sweep) garbage collector.
-Djava.io.tmpdir: Specifies the storage path for temporary files.

  • logstash-sample.conf : a sample Logstash configuration file that demonstrates how to configure data input, filtering, and output. It contains example plugin configurations that help us understand how to build a complete Logstash data processing pipeline.
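To make the logstash.yml options listed above concrete, here is a minimal sketch; the values are illustrative assumptions (most match the defaults), not tuning recommendations:

# config/logstash.yml -- illustrative values only
pipeline.workers: 2
pipeline.batch.size: 125
pipeline.batch.delay: 50
# hypothetical data path; defaults to the data directory under the Logstash home
path.data: /usr/local/logstash/data
http.host: "127.0.0.1"
http.port: 9600
queue.type: memory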

4.3 Writing and running the "Hello World" example

Run the following from the Logstash root directory:

# Run on Windows
.\bin\logstash.bat -e "input { stdin { } } output { stdout {} }"

# Run on Linux
bin/logstash -e 'input { stdin { } } output { stdout {} }'

After starting Logstash, wait until you see [main] Pipeline started {"pipeline.id"=>"main"}, then type hello world at the prompt.
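Logstash echoes the event back through the stdout output. With the default rubydebug codec the printed event looks roughly like the following (the host name and timestamp are placeholders, and the exact fields can vary by version and codec):

{
       "message" => "hello world",
      "@version" => "1",
    "@timestamp" => 2023-08-13T15:47:01.008Z,
          "host" => "my-hostname"
}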

4.4 Use the -f parameter to specify the configuration file to start

In the example above we started Logstash with the -e parameter followed by the pipeline configuration. We can also use the -f parameter to point to a configuration file. Create a file named hello.conf in the Logstash root directory with the following content:

input { stdin { } } 
output { stdout { } }

Then execute the start command

# Run on Windows
.\bin\logstash.bat -f hello.conf

# Run on Linux
bin/logstash -f hello.conf

4.5 Configuring startup via pipelines.yml

Open the pipelines.yml file in the config directory and add:

- pipeline.id: hello
  pipeline.workers: 1
  pipeline.batch.size: 1
  config.string: "input { stdin { } } output { stdout {} }"

Instead of embedding the pipeline definition as a string, we can also point to the hello.conf file we just wrote:

- pipeline.id: hello
  pipeline.workers: 1
  pipeline.batch.size: 1
  path.config: "/usr/local/logstash/hello.conf"

After saving the file execute:

# Run on Windows
.\bin\logstash.bat

# Run on Linux
bin/logstash

5. Hands-on practice: scheduled incremental synchronization of MySQL data

5.1 Environment and data preparation

5.1.1 Database preparation

Prepare the MySQL table structure and test data in advance:

# Create table statement
CREATE TABLE `test` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `content` varchar(255) DEFAULT NULL,
  `status` int(11) DEFAULT NULL,
  `update_time` bigint(20) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
# Insert test data
insert into test (content,status,update_time) VALUES ("aaa",1,UNIX_TIMESTAMP());
insert into test (content,status,update_time) VALUES ("bbb",1,UNIX_TIMESTAMP());
insert into test (content,status,update_time) VALUES ("ccc",2,UNIX_TIMESTAMP());
insert into test (content,status,update_time) VALUES ("ddd",1,UNIX_TIMESTAMP());

5.1.2 Start elasticsearch and kibana

You need to start elasticsearch and kibana, and elasticsearch must allow automatic index creation; if it is not allowed, create the index in advance.
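If automatic index creation is disabled in your cluster, two possible ways to prepare (shown as a hedged sketch from the Kibana console; the index name matches the month used later in this example) are:

# Option 1: allow automatic index creation cluster-wide
PUT _cluster/settings
{
  "persistent": { "action.auto_create_index": "true" }
}

# Option 2: create the target index ahead of time
PUT test-by-id-2023.08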

5.1.3 Import the MySQL driver jar

Create a mylib directory under the Logstash root directory to hold the JDBC driver jar used to connect to MySQL, for example mysql-connector-java-8.0.27.jar.

5.2 Writing scripts

5.2.1 Incrementally synchronizing data by id

Requirement: every minute, read rows from the test table in ascending id order where status equals 1 and save them to elasticsearch, processing 2 rows per run. Create the mysql-by-id-to-es.conf file in the Logstash root directory with the following content:

input {
  jdbc {
    jdbc_driver_library => "./mylib/mysql-connector-java-8.0.27.jar"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/test"
    jdbc_user => "root"
    jdbc_password => "123456"
    parameters => { "myStatus" => 1 }
    schedule => "* * * * *"
    statement => "SELECT id,content,status,update_time FROM test WHERE status = :myStatus AND id > :sql_last_value ORDER BY id ASC LIMIT 2"
    last_run_metadata_path => "mysql-by-id-to-es.index"
    tracking_column => "id"
    use_column_value => true
    tracking_column_type => "numeric"
  }
}
filter {
	mutate { add_field => { "from" => "logstash" } }
}
output {
  elasticsearch {
        index => "test-by-id-%{+YYYY.MM}"
  }
  stdout {
	
  }
}

Save the file and, from the Logstash root directory, run the startup command with the -f parameter:

# Run on Windows
.\bin\logstash.bat -f mysql-by-id-to-es.conf
# Run on Linux
./bin/logstash -f mysql-by-id-to-es.conf

Once the schedule fires, the console prints the synchronized events through the stdout output.

Go to kibana to execute and view the data:
1. First execute GET _cat/indices?v to check whether an index starting with test-by-id has been created.

2. If the index exists, execute the following query to view the data:

GET test-by-id-2023.08/_search
{ "query": { "match_all": {} } }

You will find that the record with status equal to 2 was not ingested. The hits portion of the response (abbreviated) looks like this:

"hits" : [
  {
    "_index" : "test-by-id-2023.08",
    "_type" : "_doc",
    "_id" : "FsWU74kBjZ5FwUtCTy7a",
    "_score" : 1.0,
    "_source" : {
      "update_time" : 1691939803,
      "@version" : "1",
      "content" : "aaa",
      "@timestamp" : "2023-08-13T15:47:01.008Z",
      "status" : 1,
      "id" : 1,
      "from" : "logstash"
    }
  },
  {
    "_index" : "test-by-id-2023.08",
    "_type" : "_doc",
    "_id" : "FcWU74kBjZ5FwUtCTy7a",
    "_score" : 1.0,
    "_source" : {
      "update_time" : 1691939803,
      "@version" : "1",
      "content" : "bbb",
      "@timestamp" : "2023-08-13T15:47:01.020Z",
      "status" : 1,
      "id" : 2,
      "from" : "logstash"
    }
  },
  {
    "_index" : "test-by-id-2023.08",
    "_type" : "_doc",
    "_id" : "F8WV74kBjZ5FwUtCNS6W",
    "_score" : 1.0,
    "_source" : {
      "update_time" : 1691939803,
      "@version" : "1",
      "content" : "ddd",
      "@timestamp" : "2023-08-13T15:48:00.413Z",
      "status" : 1,
      "id" : 4,
      "from" : "logstash"
    }
  }
]

5.2.2 Incrementally synchronizing data by update time

If you want to ingest based on the row's update time instead, create a new mysql-by-uptime-to-es.conf file with the following content:

input {
  jdbc {
    jdbc_driver_library => ",/mylib/mysql-connector-java-8.0.27.jar"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/test"
    jdbc_user => "root"
    jdbc_password => "123456"
    parameters => { "myStatus" => 1 }
    schedule => "* * * * *"
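    # Note (assumption, not from the original article): update_time stores UNIX epoch seconds,
    # while :sql_last_value here defaults to the last run time as a datetime; depending on your
    # schema you may need UNIX_TIMESTAMP(:sql_last_value) in the comparison below.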
    statement => "SELECT id,content,status,update_time FROM test WHERE status = :myStatus AND update_time > :sql_last_value"
    last_run_metadata_path => "mysql-by-uptime-to-es.index"
  }
}

filter {
	mutate { add_field => { "from" => "logstash" } }
}

output {
  elasticsearch {
        index => "test-by-uptime-%{+YYYY.MM}"
  }
  stdout {
	
  }
}

Save the file, then start Logstash and view the data as described in 5.2.1.

5.3 Detailed explanation of configuration parameters

  • jdbc_driver_library : Specifies the path to the JDBC driver. A JDBC driver is a library for communicating with a specific type of database. You need to provide the path to the JDBC driver so that Logstash can load and use it.
  • jdbc_driver_class : Specifies the Java class name of the JDBC driver. This class name tells Logstash which specific JDBC driver to use to connect to the database.
  • jdbc_connection_string : Specify the connection string to establish a connection with the database. This string includes information such as the location of the database, port, database name, etc.
  • jdbc_user : Specify the username required to connect to the database.
  • jdbc_password : Specifies the password required to connect to the database.
  • parameters : Defines a hash of named query parameters that can be referenced in the SQL statement by name (for example :myStatus in the statement above).
  • schedule : Specifies the scheduling schedule for extracting data from the database. Use cron expressions to define intervals for data extraction.

Examples:

  • * 5 * 1-3 * will execute every minute of 5 AM, every day, January through March.
  • 0 * * * * will execute on the 0th minute of every hour, every day.
  • 0 6 * * * America/Chicago will execute at 6:00 AM America/Chicago time every day.
  • statement : This configuration defines the SQL query to extract data from the database. You can write custom SQL queries here to select the required data.
  • last_run_metadata_path : Specifies a file path for storing metadata about the last run. This lets Logstash track when (or from which value) data was last pulled so that it can continue from that point.
  • tracking_column and use_column_value : These two options work together to identify a column for incremental extraction.
    tracking_column specifies the name of the column to track, and use_column_value indicates whether to use that column's value as the tracking marker.
  • tracking_column_type : Specifies the data type of the tracking column, so that different column types (numbers, timestamps) can be tracked. Currently only numeric and timestamp are supported; the default is numeric.
  • add_field : Adds a field with the specified value to each event.
  • elasticsearch : output to elasticsearch, the commonly used configuration is as follows:
elasticsearch {
  # Elasticsearch addresses; separate multiple hosts with commas. Defaults to localhost:9200 if omitted
  hosts => ["localhost:9200"]
  # Target index
  index => "test-by-uptime-%{+YYYY.MM}"
  # Username and password; omitted by default
  user => "elastic"
  password => "123456"
}

6. Summary

  This article introduced Logstash's core components and functionality in detail, covering the complete process from downloading and installation to writing a first Hello World example and finally synchronizing MySQL data to Elasticsearch. I hope this tutorial serves as a useful reference for learning and mastering Logstash.


Source: blog.csdn.net/dougsu/article/details/132261486