Real-time synchronization of database (MySQL) data to an Elasticsearch cluster

Help documents:
jdbc input plugin: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html
Real-time synchronization:
https://www.elastic.co/cn/blog/logstash-jdbc-input-plugin
https://segmentfault.com/a/1190000014387486
Output plugins: https://www.elastic.co/guide/en/logstash/current/output-plugins.html#output-plugins

The goal is low-latency retrieval of the data in es, as well as other data analysis and processing.

Logstash can synchronize MySQL to Elasticsearch in near real time.
MySQL is a mature and stable data persistence solution and is widely used across many fields, but it falls somewhat short for data analysis. Elasticsearch, as a leader in data analysis, makes up for exactly that shortcoming. All we need to do is synchronize the data in MySQL to Elasticsearch, and Logstash supports just that; all you have to write is a configuration file.

Install Logstash:
The es cluster used here is 6.0.1, so Logstash 6.0.1 is used as well.

wget https://artifacts.elastic.co/downloads/logstash/logstash-6.0.1.zip
unzip logstash-6.0.1.zip && cd logstash-6.0.1

Install the jdbc input and elasticsearch output plugins:

bin/logstash-plugin install logstash-input-jdbc
bin/logstash-plugin install logstash-output-elasticsearch
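Depending on the distribution, one or both of these plugins may already be bundled with Logstash; a quick check with the standard plugin tool confirms they are available:

# list installed plugins and confirm the jdbc input and elasticsearch output are present
bin/logstash-plugin list | grep -E 'jdbc|elasticsearch'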

Get the MySQL JDBC driver.
Download address: https://dev.mysql.com/downloads/connector/j/3.1.html

wget https://cdn.mysql.com//Downloads/Connector-J/mysql-connector-java-5.1.46.zip
unzip mysql-connector-java-5.1.46.zip
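The configuration below points jdbc_driver_library at a path under /usr/local; as a small sketch (the target directory is an assumption), move the extracted driver there so the path matches:

# place the driver where the Logstash config expects it (assumed location)
mv mysql-connector-java-5.1.46 /usr/local/
ls /usr/local/mysql-connector-java-5.1.46/mysql-connector-java-5.1.46-bin.jar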

Edit the configuration file (e.g. sync_table.cfg) in Logstash's config directory, using the logstash-input-jdbc plugin to read MySQL data.

The working principle of this plugin is fairly simple: it executes a SQL statement on a schedule and writes the results to the event stream. Incremental data is not obtained from the binlog; instead, an incremental field is used as a query condition, and the current query position is recorded after each run. Because the field is incremental, only records greater than the last recorded value need to be queried to pick up whatever changed in the meantime. There are two common incremental fields: an AUTO_INCREMENT primary key id and an update_time field defined with ON UPDATE CURRENT_TIMESTAMP. The id field only works for tables that are insert-only; update_time is more general, so it is recommended to add an update_time field to every MySQL table.

input {
  jdbc {
    jdbc_driver_library => "/usr/local/mysql-connector-java-5.1.46/mysql-connector-java-5.1.46-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://192.168.2.16:3306/user?useSSL=false"
    jdbc_user => "root"
    jdbc_password => "Zksw@2019"
    schedule => "* * * * *"
    statement => "SELECT * FROM contacts"
    #statement => "SELECT * FROM contacts WHERE update_time >= :sql_last_value"
    jdbc_validate_connection => true
    #use_column_value => true
    #tracking_column_type => "timestamp"
    #tracking_column => "update_time"
    #last_run_metadata_path => "syncpoint_table"
  }
}
output {
  #stdout { codec => json_lines }
  elasticsearch {
    hosts => "192.168.10.181:9200"
    document_type => "contact"
    #user => "elasticsearch"
    #password => "123456"
    index => "contacts"
    document_id => "%{uid}"
  }
}

Field explanation:
jdbc_driver_library: path to the MySQL JDBC driver jar.
jdbc_driver_class: the driver class name; for MySQL this is com.mysql.jdbc.Driver.
jdbc_connection_string: the MySQL connection address.
jdbc_user: MySQL user.
jdbc_password: MySQL password.
schedule: when to execute the SQL, in a crontab-like syntax.
statement: the SQL to execute. Identifiers starting with ":" are variables that can be set via parameters; sql_last_value is a built-in variable holding the value of the tracking column (e.g. update_time) from the last execution.
use_column_value: track the value of the incremental column rather than the time of the last run.
tracking_column_type: the type of the incremental column; numeric for numbers, timestamp for timestamps.
tracking_column: the name of the incremental column; here that would be the update_time column, whose type is timestamp.
last_run_metadata_path: the sync-point file. It records the last synchronization point, is read again on restart, and can be edited by hand.
jdbc_validate_connection: connection pool setting; validate the connection before using it.
Output:
hosts: the es cluster address.
user: es username.
password: es password.
index: the name of the index to write to in es.
document_id: the id of the document written to es. Set it to the table's primary key, otherwise updating a record in MySQL will produce two documents in es. %{uid} refers to the value of the uid column in the MySQL table.
document_type: sets the type of the output document in es.
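For reference, here is a minimal sketch of sync_table.cfg with the incremental options enabled, assuming the contacts table has an update_time column of type timestamp (see the note above) and that the paths and addresses match the ones used earlier; adjust them for your environment.

input {
  jdbc {
    jdbc_driver_library => "/usr/local/mysql-connector-java-5.1.46/mysql-connector-java-5.1.46-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://192.168.2.16:3306/user?useSSL=false"
    jdbc_user => "root"
    jdbc_password => "Zksw@2019"
    schedule => "* * * * *"                      # run the query every minute
    statement => "SELECT * FROM contacts WHERE update_time >= :sql_last_value"
    use_column_value => true                     # track a column value instead of the run time
    tracking_column => "update_time"             # assumed incremental column
    tracking_column_type => "timestamp"
    last_run_metadata_path => "syncpoint_contacts"   # sync-point file, relative path (assumption)
    jdbc_validate_connection => true
  }
}
output {
  elasticsearch {
    hosts => "192.168.10.181:9200"
    index => "contacts"
    document_type => "contact"
    document_id => "%{uid}"                      # primary key, so updates overwrite instead of duplicating
  }
}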

The created database is called user and the table is called contacts:

create table contacts (
    uid serial,
    email VARCHAR(80) not null,
    first_name VARCHAR(80) NOT NULL,
    last_name VARCHAR(80) NOT NULL
);
INSERT INTO contacts(email, first_name, last_name) VALUES('[email protected]', 'Jim', 'Smith');
INSERT INTO contacts(email, first_name, last_name) VALUES(null, 'John', 'Smith');
INSERT INTO contacts(email, first_name, last_name) VALUES('[email protected]', 'Carol', 'Smith');
INSERT INTO contacts(email, first_name, last_name) VALUES('[email protected]', 'Sam', null);
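Since incremental synchronization by update_time is recommended above, the following is a hedged SQL sketch (the column is an assumption, not part of the original schema) that adds such a column to contacts:

-- add an update_time column that MySQL keeps current on every insert/update
ALTER TABLE contacts
    ADD COLUMN update_time TIMESTAMP NOT NULL
    DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;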

Run:
Note: do not run Logstash as the root user.

../bin/logstash -f ./sync_table.cfg
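Optionally, the configuration can be checked before starting; a small sketch using Logstash's standard command-line flags:

# verify the configuration syntax without starting the pipeline
../bin/logstash -f ./sync_table.cfg --config.test_and_exit
# or start with automatic config reloading, so edits to sync_table.cfg are picked up
../bin/logstash -f ./sync_table.cfg --config.reload.automatic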

Check the data in es:
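A quick way to check is to query Elasticsearch directly; a sketch assuming the es address used in the output section above:

# count the synchronized documents and fetch a few of them
curl 'http://192.168.10.181:9200/contacts/_count?pretty'
curl 'http://192.168.10.181:9200/contacts/_search?q=last_name:Smith&pretty'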
Test an update and an insert to see whether es synchronizes the database in near real time:

UPDATE contacts SET last_name = 'Smith' WHERE email = '[email protected]';
UPDATE contacts SET email = '[email protected]' WHERE uid = 3;
INSERT INTO contacts(email, first_name, last_name) VALUES('[email protected]', 'New', 'Smith');

Multi-table synchronization:
A single Logstash instance can synchronize multiple tables with the help of the pipeline mechanism; just write one configuration file per table. Suppose we have two tables, table1 and table2, with corresponding configuration files sync_table1.cfg and sync_table2.cfg.

Configure them in config/pipelines.yml:
- pipeline.id: table1
  path.config: "config/sync_table1.cfg"
- pipeline.id: table2
  path.config: "config/sync_table2.cfg"
Then start Logstash with just bin/logstash (no -f), and both pipelines will run.

@timestamp field
By default the @timestamp field is added by logstash-input-jdbc and is set to the current time. The field is very useful for data analysis, but sometimes we want to fill it from a field in the data itself. In that case the date filter can be used; this plugin exists specifically to set the @timestamp field.
For example, suppose there is a field timeslice that should become @timestamp, and timeslice is a string in the format %Y%m%d%H%M:

filter {
  date {
    match => [ "timeslice", "yyyyMMddHHmm" ]
    timezone => "Asia/Shanghai"
  }
}

Add this filter block to sync_table.cfg, and @timestamp will now be consistent with timeslice; a skeleton showing where the filter sits is sketched below.
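To show where the filter block sits, here is a hedged skeleton of the combined sync_table.cfg (jdbc and elasticsearch settings elided, see the full sections above; the timeslice column is an assumption):

input {
  jdbc {
    # ... jdbc settings as above ...
    schedule => "* * * * *"
    statement => "SELECT * FROM contacts"
  }
}
filter {
  date {
    match => [ "timeslice", "yyyyMMddHHmm" ]   # parse timeslice into @timestamp
    timezone => "Asia/Shanghai"
  }
}
output {
  elasticsearch {
    # ... elasticsearch settings as above ...
    index => "contacts"
    document_id => "%{uid}"
  }
}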

Searching for data in Kibana:
An important benefit of moving part of the data into Elasticsearch is the ability to use Kibana to build insightful visualizations on top of it.
Installation help document: https://www.elastic.co/guide/cn/kibana/current/targz.html#targz-configuring
Configuration help document: https://www.elastic.co/guide/cn/kibana/current/settings.html
Kibana visualization help document: https://www.elastic.co/guide/cn/kibana/current/createvis.html
Download and install:

wget https://artifacts.elastic.co/downloads/kibana/kibana-6.0.1-linux-x86_64.tar.gz
sha1sum kibana-6.0.1-linux-x86_64.tar.gz
tar -zxf kibana-6.0.1-linux-x86_64.tar.gz 
cd kibana-6.0.1-linux-x86_64/

Or download the darwin (macOS) package:
wget https://artifacts.elastic.co/downloads/kibana/kibana-6.0.1-darwin-x86_64.tar.gz
tar -zxf kibana-6.0.1-darwin-x86_64.tar.gz
cd kibana-6.0.1-darwin-x86_64/

Start kibana from the command line

./bin/kibana

By default, Kibana starts in the foreground and prints its log to standard output (stdout); it can be terminated with Ctrl-C.
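To keep Kibana running after the terminal closes, one common approach (an addition here, not part of the original setup) is:

# run Kibana in the background and capture its output in a log file (assumed location)
nohup ./bin/kibana > kibana.log 2>&1 &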

.tar.gz directory layout
The .tar.gz package is entirely self-contained. By default all files and directories live in $KIBANA_HOME, the directory created when the package is unzipped. This is convenient because you do not need to create any directories to use Kibana, and uninstalling Kibana is simply a matter of deleting the $KIBANA_HOME directory. It is still recommended, however, to change the configuration and data directories so that important data does not get deleted by accident.

home
The Kibana home directory, $KIBANA_HOME; the directory created when the package is unzipped.
bin
Binary scripts, including kibana to start the Kibana server and kibana-plugin to install plugins. Location: $KIBANA_HOME/bin
config
Configuration files, including kibana.yml. Location: $KIBANA_HOME/config
data
The location where Kibana and its plugins write data files to disk. Location: $KIBANA_HOME/data
optimize
Transpiled source code; certain administrative actions (such as installing a plugin) cause the source to be re-transpiled at runtime. Location: $KIBANA_HOME/optimize
plugins
Plugin files; each plugin has its own subdirectory. Location: $KIBANA_HOME/plugins

Configure the kibana configuration file:
Kibana loads the configuration file from $KIBANA_HOME/config/kibana.yml by default.
server.port: 5601    # the port the Kibana server runs on
server.host: "0.0.0.0"    # the address Kibana listens on
elasticsearch.url: "http://192.168.10.181:9200"    # the Elasticsearch instance Kibana connects to
kibana.index: ".kibana"    # the .kibana index Kibana creates in Elasticsearch

Log in from a browser: http://<ip address>:5601
On first login you need to define an Elasticsearch index pattern (for this example, contacts).
