logstash_output_kafka: syncing MySQL to Kafka, explained in depth

0. Preface

In real business scenarios, you will often have base data stored in MySQL while a large volume of data is written in real time. Migrating that real-time write path to Kafka is usually the better choice.


Candidate approaches for writing MySQL data into Kafka include:

  1. Option 1: the logstash_output_kafka plugin.
  2. Option 2: kafka_connector.
  3. Option 3: the debezium plugin.
  4. Option 4: flume.
  5. Option 5: other similar approaches.

Among them, debezium and flume are implemented on top of the MySQL binlog.

If you need to synchronize the full amount of historical data and then keep it updated in real time, logstash is recommended.

1. Logstash synchronization principle

The commonly used companion plugin is logstash_input_jdbc, which synchronizes relational databases to Elasticsearch.

In fact, mastering the core synchronization principle of logstash helps you understand synchronization between similar data stores.

The core principle of logstash: input generates events, filters modify them, and output sends them to other places.

The core of logstash consists of three parts: input, filter, and output.


input { }
filter { }
output { }

1.1 input

Including but not limited to (a minimal sketch follows the list):

  1. jdbc: relational databases such as MySQL, Oracle, etc.
  2. file: read from a file on the file system.
  3. syslog: listen for syslog messages on the well-known port 514.
  4. redis: read messages from Redis.
  5. beats: process events sent by Beats.
  6. kafka: consume a Kafka real-time data stream.
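
As a minimal sketch only (the file path, broker address, and topic name below are hypothetical, not from the original article), several of these inputs can be declared side by side:

input {
  # Tail an application log file on disk (hypothetical path).
  file {
    path => "/var/log/app/*.log"
    start_position => "beginning"
  }
  # Consume an existing Kafka topic (hypothetical broker address and topic).
  kafka {
    bootstrap_servers => "192.168.1.13:9092"
    topics => ["some_topic"]
  }
}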

1.2 filter

Filters are intermediate processing devices in the Logstash pipeline. You can combine filters with conditions to perform actions on events when certain conditions are met.

You can think of this as the transform step of an ETL pipeline.

Some useful filters include:

  1. grok: parse and structure arbitrary text. Grok is currently the best way in Logstash to parse unstructured log data into something structured and queryable. With 120 patterns built into Logstash, you are likely to find one that meets your needs! (A short sketch follows this list.)
  2. mutate: perform general transformations on event fields. You can rename, delete, replace and modify fields in an event.
  3. drop: drop an event entirely, for example debug events.
  4. clone: make a copy of an event, possibly adding or removing fields.
  5. geoip: add information about the geographic location of an IP address.
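
As a brief illustration only (the log format and field names here are assumptions, not part of the original article), a filter block combining several of these plugins might look like:

filter {
  # Parse an Apache/Nginx combined access log line into structured fields.
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  # Rename the parsed client address field and drop a field not needed downstream.
  mutate {
    rename => { "clientip" => "client_ip" }
    remove_field => [ "host" ]
  }
  # Enrich the event with geographic information for the client IP.
  geoip {
    source => "client_ip"
  }
}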

1.3 output

The output is the final stage of the Logstash pipeline. Some commonly used outputs include:

  1. elasticsearch: send event data to Elasticsearch.
  2. file: write event data to a file on disk.
  3. kafka: write events to Kafka.

Detailed filter demo reference: http://t.cn/EaAt4zP

2. MySQL-to-Kafka synchronization configuration reference

input {
    jdbc {
      jdbc_connection_string => "jdbc:mysql://192.168.1.12:3306/news_base"
      jdbc_user => "root"
      jdbc_password => "xxxxxxx"
      jdbc_driver_library => "/home/logstash-6.4.0/lib/mysql-connector-java-5.1.47.jar"
      jdbc_driver_class => "com.mysql.jdbc.Driver"
      # Uncomment to poll MySQL every minute (cron syntax); left commented, the statement runs only once.
      #schedule => "* * * * *"
      # Incremental query: only fetch rows whose id is greater than the last recorded value.
      statement => "SELECT * from news_info WHERE id > :sql_last_value  order by id"
      # Track the id column value rather than the query run time.
      use_column_value => true
      tracking_column => "id"
      tracking_column_type => "numeric"
      # Persist the last tracked value between runs.
      record_last_run => true
      last_run_metadata_path => "/home/logstash-6.4.0/sync_data/news_last_run"
    }
}

filter {
    # Convert the MySQL datetime fields into Unix timestamps in milliseconds.
    ruby {
        code => "event.set('gather_time_unix',event.get('gather_time').to_i*1000)"
    }
    ruby {
        code => "event.set('publish_time_unix',event.get('publish_time').to_i*1000)"
    }
    # Drop logstash metadata fields and the original datetime fields.
    mutate {
        remove_field => [ "@version" ]
        remove_field => [ "@timestamp" ]
        remove_field => [ "gather_time" ]
        remove_field => [ "publish_time" ]
    }
}

output {
    # Main destination: publish each event to the Kafka topic as a JSON line.
    kafka {
        bootstrap_servers => "192.168.1.13:9092"
        codec => json_lines
        topic_id => "mytopic"
    }
    # Secondary destination: also write events to a local file for verification.
    file {
        codec => json_lines
        path => "/tmp/output_a.log"
    }
}
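
Usage note (the file name here is just an example): save the configuration as something like mysql_to_kafka.conf and start it with bin/logstash -f mysql_to_kafka.conf. With the schedule line commented out the query runs only once; uncommenting it makes logstash re-run the incremental query every minute on a cron-like schedule.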

The configuration above is not complicated, so it is not explained line by line here.

Note: after logstash pulls the data from MySQL, a datetime value such as "2019-04-20 13:55:53" is already recognized as a date type.

The line

code => "event.set('gather_time_unix',event.get('gather_time').to_i*1000)"

converts the MySQL datetime into a Unix timestamp in milliseconds.

3. Pitfall summary

3.1 Pitfall 1: field name case

From a community member: when using logstash to synchronize MySQL data, the option lowercase_column_names => "false" was not set in jdbc.conf, so logstash converted the column names of the query result to lowercase by default before synchronizing to ES. As a result, all field names seen in ES were lowercase.

Summary: ES itself supports uppercase field names; the issue is logstash's default behavior. Add lowercase_column_names => "false" to the synchronization configuration. Recorded here in the hope that it helps others.
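
A minimal sketch of where this option lives, following the note above (connection details are shortened placeholders):

input {
    jdbc {
      jdbc_connection_string => "jdbc:mysql://192.168.1.12:3306/news_base"
      jdbc_user => "root"
      jdbc_password => "xxxxxxx"
      # Keep the original case of the MySQL column names.
      lowercase_column_names => "false"
      statement => "SELECT * from news_info"
    }
}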

3.2 Will data synchronized to ES be duplicated?

Question: if relational database data is synchronized to ES and logstash is started on multiple servers of the cluster at the same time, will the data end up duplicated?

Interpretation: in real projects, a random id is not used. A specified, deterministic id is used as the _id in ES, for example the MD5 of the URL. That way the same record is simply updated and overwritten rather than duplicated.
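
A minimal sketch of this idea in an elasticsearch output (the host, index name, and use of the id column are assumptions; an MD5 of the URL, as suggested above, could be used the same way):

output {
    elasticsearch {
        hosts => ["192.168.1.14:9200"]
        index => "news_info"
        # A deterministic _id means re-synced rows overwrite the same document
        # instead of creating duplicates.
        document_id => "%{id}"
    }
}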

3.3 With the same logstash configuration, data can no longer be synchronized after upgrading to 6.3.

Interpretation: newer versions optimize incremental synchronization based on time.

If the tracking column is a time column, tracking_column_type => "timestamp" must be specified explicitly; the default type is "numeric".
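
A minimal sketch of an incremental query tracked by a time column (the update_time column and file paths are hypothetical examples):

input {
    jdbc {
      jdbc_connection_string => "jdbc:mysql://192.168.1.12:3306/news_base"
      jdbc_user => "root"
      jdbc_password => "xxxxxxx"
      statement => "SELECT * from news_info WHERE update_time > :sql_last_value order by update_time"
      use_column_value => true
      tracking_column => "update_time"
      # Declare the tracking column as a time type; the default is numeric.
      tracking_column_type => "timestamp"
      record_last_run => true
      last_run_metadata_path => "/home/logstash-6.4.0/sync_data/news_time_last_run"
    }
}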

3.4 Where should ETL field transformations be handled?

Interpretation: they can be handled at the SQL query stage when logstash reads from MySQL, for example: select a_value as avalue ...
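
For example (the column aliases below are hypothetical), the renaming can live directly in the jdbc statement:

input {
    jdbc {
      # (connection and tracking settings omitted)
      # Rename fields in SQL before they enter the pipeline.
      statement => "SELECT a_value AS avalue, b_value AS bvalue FROM news_info"
    }
}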

Or they can be handled at the filter stage, for example with mutate rename:

mutate {
        rename => { "shortHostname" => "hostname" }
    }

Or they can be handled downstream after Kafka, for example with Kafka Streams.

4. Summary

  • The configuration and synchronization themselves are not complicated; the complexity usually lies in filter-stage parsing and in logstash performance.
  • In-depth tuning and performance analysis need to be combined with the actual business scenario.
  • If you have any questions, please leave a comment to discuss.

Recommended reading:
1. Hands-on | canal for real-time incremental synchronization from MySQL to Elasticsearch
2. In-depth | Debezium for efficient real-time synchronization from MySQL to Elasticsearch
3. One diagram to clarify synchronizing relational databases with Elasticsearch: http://t.cn/EaAceD3
4. New implementation: http://t.cn/EaAt60O
5. mysql2mysql: http://t.cn/EaAtK7r
6. Recommended open source implementation: http://t.cn/EaAtjqN

Origin blog.51cto.com/15050720/2562057