Introduction to Alibaba canal: deployment, principles, and usage


Getting started with canal

What is canal

In Alibaba's B2B business, sellers are concentrated in China while buyers are concentrated abroad, which created a need for synchronization between remote data centers in Hangzhou and the United States. Starting in 2010, Alibaba began experimenting with parsing database logs to obtain incremental changes for synchronization, and from this its incremental subscription & consumption business was born.

Canal is middleware written in Java that provides incremental data subscription & consumption by parsing a database's incremental logs. At present, canal mainly supports parsing MySQL's binlog; after parsing, a canal client processes the resulting change data. (Full database synchronization requires Alibaba's otter middleware, which is built on canal.)

Here we can simply think of canal as a tool for synchronizing incremental data.


canal obtains the change data by reading the binlog, then delivers it to a storage destination such as MySQL, Kafka, or Elasticsearch, enabling synchronization to multiple targets.

canal usage scenarios

Scenario 1: The original use case, as part of Alibaba's otter middleware


Scenario 2: Update cache


Scenario 3: Capture changes from business tables into a change table used to build zipper tables. (Zipper table: a table that records the life cycle of each record; when a record's life cycle ends, a new record is opened with the current date as its effective start date.)

Scenario 4: Capture incremental changes from business tables to produce real-time statistics.
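The zipper-table bookkeeping from scenario 3 can be sketched in a few lines of Python. This is purely illustrative: the field names `start_date`/`end_date` and the open-record sentinel `9999-12-31` are assumptions of this sketch, not anything canal prescribes.

```python
# Minimal zipper-table (record life cycle) sketch -- illustrative only.
# An open record uses the sentinel end date '9999-12-31'.
OPEN = "9999-12-31"

def apply_change(history, key, new_value, today):
    """Close the current open record for `key` (if any) and open a new one."""
    for rec in history:
        if rec["key"] == key and rec["end_date"] == OPEN:
            rec["end_date"] = today          # end the old life cycle
    history.append({"key": key, "value": new_value,
                    "start_date": today, "end_date": OPEN})

history = []
apply_change(history, "user1", "Hangzhou", "2021-06-01")
apply_change(history, "user1", "Beijing", "2021-08-17")
print(history)
```

After the second change, the first record's life cycle is closed with end date 2021-08-17 and a new open record begins on the same date.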

canal operating principle


The replication process has three steps:

  1. The master writes change records to its binary log (binlog).

  2. The slave sends a dump request to the MySQL master and copies the master's binary log events to its relay log.

  3. The slave replays the events in its relay log, applying the changes to its own database.

canal's working principle is very simple: it disguises itself as a MySQL slave, sends the dump protocol to the master, and receives the binlog just as a real slave would.
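The three replication steps above can be modeled as a toy simulation. The data structures here (plain Python lists standing in for the binlog and relay log) are illustrative only, not the real binlog format or dump protocol:

```python
# Toy model of the three-step MySQL replication flow that canal piggybacks on.
master_binlog = []                      # step 1: master appends change events

def master_write(event):
    master_binlog.append(event)

relay_log = []                          # step 2: slave copies events via "dump"

def slave_dump(from_pos):
    relay_log.extend(master_binlog[from_pos:])
    return len(master_binlog)           # new replication position

slave_db = {}                           # step 3: slave replays relay-log events

def slave_replay():
    for table, row in relay_log:
        slave_db.setdefault(table, []).append(row)
    relay_log.clear()

master_write(("canal_test", {"文章": "4"}))
master_write(("canal_test", {"文章": "8"}))
pos = slave_dump(0)
slave_replay()
print(slave_db)   # the slave now mirrors the master's changes
```

canal implements step 2 of this flow: it requests the binlog like a slave would, but instead of replaying events into its own database, it parses them and hands them to downstream consumers.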


Introduction to MySQL binlog

What is binlog

MySQL's binary log is arguably its most important log. It records all DDL and DML statements (excluding data query statements) in the form of events, along with how long each statement took to execute. The binary log is transaction-safe.

Generally speaking, enabling the binary log costs roughly 1% in performance. The binary log has two most important use cases:

First: MySQL replication. The master enables binlog and passes its binary log to its slaves to keep master and slave data consistent.

Second: data recovery, using the mysqlbinlog tool.

The binary log consists of two kinds of files: the binary log index file (file name suffix .index), which records the names of all binary log files, and the binary log files themselves (file name suffix .00000*), which record the database's DDL and DML statement events (excluding data query statements).

Enable MySQL binlog

Enable binlog in the MySQL configuration file and restart MySQL for it to take effect. On Linux systems the MySQL configuration file is usually /etc/my.cnf; add: log-bin=mysql-bin

This makes mysql-bin the prefix of the binlog files, so the generated files will be named mysql-bin.000001, mysql-bin.000002, and so on, numbered sequentially. Each time MySQL restarts, or a single file reaches the size threshold, a new sequentially numbered file is created.

Binlog classification settings

MySQL's binlog has three formats: STATEMENT, MIXED, and ROW. The format is specified in the configuration file with the option: binlog_format=

statement [statement level]

At the statement level, binlog records every statement that performs a write operation.

Compared with row mode it saves space, but it may cause inconsistency, for example: update table_name set create_date=now();

If the binlog is used for recovery, the replayed data may differ because the statement executes at a different time (create_date might be 2021-08-08 11:10:30 when the master wrote the row, but when the slave replays the statement from the binlog, create_date may become 2021-08-08 11:11:23, because statement execution on the slave is delayed).

Advantages: saves space

Disadvantages: may cause data inconsistency

row [row level]

At the row level, binlog records how each row changes after every operation.

Advantages: absolute data consistency, because no matter what the SQL is or which functions it calls, only the post-execution effect is recorded.

Disadvantages: takes up a lot of space.

mixed [combined statement level and row level]

An upgraded version of statement that, to some extent, solves the inconsistency problems caused by certain statements in statement mode.

In cases such as:

​ ○ when the statement uses UUID();

​ ○ when a table containing an AUTO_INCREMENT column is updated;

​ ○ when an INSERT DELAYED statement is executed;

​ ○ when a UDF is used;

binlog falls back to the ROW format.

Advantages: saves space while providing a degree of consistency.

Disadvantages: a few rare cases can still cause inconsistency. In addition, the statement and mixed formats are inconvenient when the binlog needs to be monitored (parsed) downstream.

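The practical difference between statement and row replay can be made concrete with a tiny simulation (purely illustrative, not real binlog parsing): replaying logged SQL re-evaluates now() at a later time, while replaying a logged row copies the stored value.

```python
from datetime import datetime, timedelta

# STATEMENT format logs the SQL text; replaying it re-evaluates now()
def replay_statement(replay_time):
    return {"create_date": replay_time}           # now() at replay time

# ROW format logs the resulting row; replaying copies the stored value
def replay_row(logged_row):
    return dict(logged_row)

master_time = datetime(2021, 8, 8, 11, 10, 30)
slave_time = master_time + timedelta(seconds=53)  # slave replays 53s later

row_on_master = {"create_date": master_time}      # effect of the original UPDATE
print(replay_statement(slave_time))   # differs from the master's row
print(replay_row(row_on_master))      # identical to the master's row
```

This is also why canal setups typically require binlog_format=row: only the row format gives a parser the exact before/after values of each change.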
Environmental preparation

Machine planning

Four machines are used here:

Machine planning: ops01, ops02, and ops03 host the kafka + zookeeper + canal cluster; ops04 hosts the MySQL service. (For testing, MySQL could also be deployed on one of the three cluster nodes.)

11.8.37.50 ops01

11.8.36.63 ops02

11.8.36.76 ops03

11.8.36.86 ops04

All four machines are configured with hostname hosts resolution in /etc/hosts.

Install and configure MySQL

Create a new database and table to simulate a business workload. The MySQL installation steps are not covered here; if you have not installed MySQL, see the previous article "Introduction to Hive and Introduction to Hive Deployment, Principles and Usage" for detailed MySQL installation steps.

After installing MySQL, make basic settings and configurations

# Log in to MySQL
root@ops04:/root #mysql -uroot -p123456
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 442523
Server version: 5.7.29 MySQL Community Server (GPL)

Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
# Create the canal user and grant replication privileges
mysql> GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'canal'@'%' IDENTIFIED BY 'canal';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> quit;
Bye
# Edit the MySQL configuration file and add binlog-related settings
root@ops04:/root #vim /etc/my.cnf
# binlog
server-id=1
log-bin=mysql-bin
binlog_format=row
binlog-do-db=gmall

Create the gmall database. In fact, any database name works, as long as it matches the binlog-do-db setting in the configuration above.

Restart MySQL:

root@ops04:/root #mysql -V
mysql  Ver 14.14 Distrib 5.7.29, for Linux (x86_64) using  EditLine wrapper
root@ops04:/root #systemctl status mysqld
● mysqld.service - MySQL Server
   Loaded: loaded (/usr/lib/systemd/system/mysqld.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2021-05-26 09:30:25 CST; 2 months 22 days ago
     Docs: man:mysqld(8)
           http://dev.mysql.com/doc/refman/en/using-systemd.html
 Main PID: 32911 (mysqld)
   Memory: 530.6M
   CGroup: /system.slice/mysqld.service
           └─32911 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid

May 26 09:30:18 ops04 systemd[1]: Starting MySQL Server...
May 26 09:30:25 ops04 systemd[1]: Started MySQL Server.
root@ops04:/root #
root@ops04:/root #systemctl restart mysqld
root@ops04:/root #

[Note]: After adding the binlog configuration and restarting the MySQL service, binlog files will appear in the data directory, in the following format:

root@ops04:/var/lib/mysql #ll | grep mysql-bin
-rw-r----- 1 mysql mysql     1741 Aug 17 14:27 mysql-bin.000001
-rw-r----- 1 mysql mysql       19 Aug 17 11:18 mysql-bin.index

Verify canal user login:

root@ops04:/root #mysql -ucanal -pcanal -e "show databases"
mysql: [Warning] Using a password on the command line interface can be insecure.
+--------------------+
| Database           |
+--------------------+
| information_schema |
| gmall              |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
root@ops04:/root #

Create a new table in the gmall library and insert some sample data for testing:

CREATE TABLE `canal_test` (
  `体温` varchar(255) DEFAULT NULL,
  `身高` varchar(255) DEFAULT NULL,
  `体重` varchar(255) DEFAULT NULL,
  `文章` varchar(255) DEFAULT NULL,
  `日期` date DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

INSERT INTO `canal_test`(`体温`, `身高`, `体重`, `文章`, `日期`) VALUES ('36.5', '1.70', '180', '4', '2021-06-01');
INSERT INTO `canal_test`(`体温`, `身高`, `体重`, `文章`, `日期`) VALUES ('36.4', '1.70', '160', '8', '2021-06-02');
INSERT INTO `canal_test`(`体温`, `身高`, `体重`, `文章`, `日期`) VALUES ('36.1', '1.90', '134', '1', '2021-06-03');
INSERT INTO `canal_test`(`体温`, `身高`, `体重`, `文章`, `日期`) VALUES ('37.3', '1.70', '110', '14', '2021-06-04');
INSERT INTO `canal_test`(`体温`, `身高`, `体重`, `文章`, `日期`) VALUES ('35.7', '1.70', '133', '0', '2021-06-05');
INSERT INTO `canal_test`(`体温`, `身高`, `体重`, `文章`, `日期`) VALUES ('36.8', '1.90', '200', '6', '2021-06-06');
INSERT INTO `canal_test`(`体温`, `身高`, `体重`, `文章`, `日期`) VALUES ('37.5', '1.70', '132', '25', '2021-06-07');
INSERT INTO `canal_test`(`体温`, `身高`, `体重`, `文章`, `日期`) VALUES ('35.7', '1.70', '160', '2', '2021-06-08');
INSERT INTO `canal_test`(`体温`, `身高`, `体重`, `文章`, `日期`) VALUES ('36.3', '1.80', '131.4', '9', '2021-06-09');
INSERT INTO `canal_test`(`体温`, `身高`, `体重`, `文章`, `日期`) VALUES ('37.3', '1.70', '98.8', '4', '2021-06-10');

Install kafka + zookeeper

kafka and zookeeper are used to make canal highly available. The installation steps are omitted here for brevity; see the previous article "Introduction to Kafka and Introduction to Kafka Deployment, Principles and Usage" for detailed kafka installation steps.

Query the cluster running status of each port of kafka and zookeeper:

wangting@ops03:/opt/module >ssh ops01 'sudo netstat -tnlpu| grep -E "9092|2181"'
tcp6       0      0 :::9092                 :::*                    LISTEN      42305/java          
tcp6       0      0 :::2181                 :::*                    LISTEN      41773/java          
wangting@ops03:/opt/module >ssh ops02 'sudo netstat -tnlpu| grep -E "9092|2181"'
tcp6       0      0 :::9092                 :::*                    LISTEN      33518/java          
tcp6       0      0 :::2181                 :::*                    LISTEN      33012/java          
wangting@ops03:/opt/module >ssh ops03 'sudo netstat -tnlpu| grep -E "9092|2181"'
tcp6       0      0 :::9092                 :::*                    LISTEN      102886/java         
tcp6       0      0 :::2181                 :::*                    LISTEN      102422/java   

Install and deploy canal

Alibaba's canal project lives at https://github.com/alibaba/canal. For downloads, click Releases on the right side of the GitHub page to see each version. If you have the time, Alibaba's other popular open-source projects on its GitHub homepage are also worth a look.

Download the installation package

# Download the installation package
wangting@ops03:/opt/software >wget https://github.com/alibaba/canal/releases/download/canal-1.1.5/canal.deployer-1.1.5.tar.gz
wangting@ops03:/opt/software >ll | grep canal
-rw-r--r-- 1 wangting wangting  60205298 Aug 17 11:23 canal.deployer-1.1.5.tar.gz

Unzip and install

# Create a directory to extract canal into. [Note]: the official tarball has no top-level canal directory, so create one first
wangting@ops03:/opt/software >mkdir -p /opt/module/canal
wangting@ops03:/opt/software >tar -xf canal.deployer-1.1.5.tar.gz -C /opt/module/canal/

Modify canal main configuration

# Edit the main canal configuration file
wangting@ops03:/opt/module/canal >cd conf/
wangting@ops03:/opt/module/canal/conf >ll
total 28
-rwxrwxr-x 1 wangting wangting  319 Apr 19 15:48 canal_local.properties
-rwxrwxr-x 1 wangting wangting 6277 Apr 19 15:48 canal.properties
drwxrwxr-x 2 wangting wangting 4096 Aug 17 13:49 example
-rwxrwxr-x 1 wangting wangting 3437 Apr 19 15:48 logback.xml
drwxrwxr-x 2 wangting wangting 4096 Aug 17 13:49 metrics
drwxrwxr-x 3 wangting wangting 4096 Aug 17 13:49 spring
# Change the following settings: zookeeper servers | server (sink) mode | kafka brokers
wangting@ops03:/opt/module/canal/conf >vim canal.properties 
canal.zkServers =ops01:2181,ops02:2181,ops03:2181
canal.serverMode = kafka
kafka.bootstrap.servers = ops01:9092,ops02:9092,ops03:9092

Modify the instance configuration of canal - (mysql to kafka)

# Instance configuration: canal can run multiple instances, one instance per config directory. For example, copy the example directory to xxx, edit the config under xxx, and start it -- that is a new instance
wangting@ops03:/opt/module/canal/conf >cd example/
wangting@ops03:/opt/module/canal/conf/example >ll
total 4
-rwxrwxr-x 1 wangting wangting 2106 Apr 19 15:48 instance.properties
# Note: change canal.instance.master.address below to your own MySQL address and port, set the username and password for your environment, and choose your own topic name
wangting@ops03:/opt/module/canal/conf/example >vim instance.properties 
canal.instance.master.address=11.8.36.86:3306
canal.instance.dbUsername=canal
canal.instance.dbPassword=canal
canal.mq.topic=wangting_test_canal
canal.mq.partitionsNum=12
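Beyond a single fixed topic, canal's MQ integration can also route messages dynamically. The property names below come from the canal MQ documentation, but treat this as a hedged sketch and verify them against your canal version:

```properties
# Illustrative only -- verify against your canal version's MQ docs.
# Route each table matching the pattern to its own topic instead of one fixed topic:
canal.mq.dynamicTopic=gmall\\..*
# Hash rows to kafka partitions by primary key ($pk$ = auto-detect), so all
# changes to one row land in the same partition and stay ordered:
canal.mq.partitionHash=.*\\..*:$pk$
```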

Distribute the installation directory

# Distribute the modified canal directory to the other two servers:
wangting@ops03:/opt/module >scp -r /opt/module/canal ops01:/opt/module/
wangting@ops03:/opt/module >scp -r /opt/module/canal ops02:/opt/module/

Start the canal cluster

# Start canal on each server in turn
wangting@ops03:/opt/module >cd /opt/module/canal/bin/
wangting@ops03:/opt/module/canal/bin >./startup.sh 

wangting@ops02:/home/wangting >cd /opt/module/canal/bin/
wangting@ops02:/opt/module/canal/bin >./startup.sh 

wangting@ops01:/home/wangting >cd /opt/module/canal/bin/
wangting@ops01:/opt/module/canal/bin >./startup.sh 

Validation results

# Monitor kafka from one of the servers
wangting@ops03:/opt/module/canal/bin >kafka-console-consumer.sh --bootstrap-server ops01:9092,ops02:9092,ops03:9092 --topic wangting_test_canal
[2021-08-17 14:21:29,924] WARN [Consumer clientId=consumer-console-consumer-17754-1, groupId=console-consumer-17754] Error while fetching metadata with correlation id 2 : {wangting_test_canal=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)

As expected, the gmall database in MySQL on ops04 is now being monitored: whenever data in a gmall table changes, the console will print the change in real time. (The LEADER_NOT_AVAILABLE warning above is harmless; the topic did not exist yet and was auto-created on first use.)

At this point the table holds the ten sample rows inserted earlier.

Change the data in the table and observe the console output:


1. Change a date from 2021-06-10 -> 2021-08-17

2. Insert a new row

3. Change a value in the 文章 column from 1 -> 1111

wangting@ops03:/opt/module/canal/bin >kafka-console-consumer.sh --bootstrap-server ops01:9092,ops02:9092,ops03:9092 --topic wangting_test_canal
[2021-08-17 14:21:29,924] WARN [Consumer clientId=consumer-console-consumer-17754-1, groupId=console-consumer-17754] Error while fetching metadata with correlation id 2 : {wangting_test_canal=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)

{"data":[{"体温":"37.3","身高":"1.70","体重":"98.8","文章":"4","日期":"2021-08-17"}],"database":"gmall","es":1629185045000,"id":6,"isDdl":false,"mysqlType":{"体温":"varchar(255)","身高":"varchar(255)","体重":"varchar(255)","文章":"varchar(255)","日期":"date"},"old":[{"日期":"2021-06-10"}],"pkNames":null,"sql":"","sqlType":{"体温":12,"身高":12,"体重":12,"文章":12,"日期":91},"table":"canal_test","ts":1629185063194,"type":"UPDATE"}

{"data":[{"体温":"35.55","身高":"1.999","体重":"99.99","文章":"999","日期":"2021-08-17"}],"database":"gmall","es":1629185086000,"id":7,"isDdl":false,"mysqlType":{"体温":"varchar(255)","身高":"varchar(255)","体重":"varchar(255)","文章":"varchar(255)","日期":"date"},"old":null,"pkNames":null,"sql":"","sqlType":{"体温":12,"身高":12,"体重":12,"文章":12,"日期":91},"table":"canal_test","ts":1629185104967,"type":"INSERT"}

{"data":[{"体温":"36.1","身高":"1.90","体重":"134","文章":"1111","日期":"2021-06-03"}],"database":"gmall","es":1629185104000,"id":8,"isDdl":false,"mysqlType":{"体温":"varchar(255)","身高":"varchar(255)","体重":"varchar(255)","文章":"varchar(255)","日期":"date"},"old":[{"文章":"1"}],"pkNames":null,"sql":"","sqlType":{"体温":12,"身高":12,"体重":12,"文章":12,"日期":91},"table":"canal_test","ts":1629185122499,"type":"UPDATE"}

It is clear that every change shows up as a record, and the old values in the "old" field correspond one-to-one with the new values in "data". At this point the entire canal pipeline is working end to end; synchronizing canal to other storage backends works in much the same way.
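A downstream consumer typically deserializes these messages and reconstructs the change. Here is a minimal sketch of handling one canal JSON record (the field names match the output above; the `handle` function itself is an illustrative assumption, not a canal API):

```python
import json

# One UPDATE message as emitted by canal (abridged from the output above)
raw = '''{"data":[{"文章":"1111","日期":"2021-06-03"}],
"database":"gmall","table":"canal_test","type":"UPDATE",
"old":[{"文章":"1"}],"es":1629185104000,"ts":1629185122499,
"isDdl":false,"pkNames":null,"sql":""}'''

def handle(message):
    """Return (table, change_type, list of {column: (old, new)} diffs)."""
    msg = json.loads(message)
    # "old" is null for INSERTs; pair each new row with its old values
    old_rows = msg["old"] or [{}] * len(msg["data"])
    diffs = []
    for new_row, old_row in zip(msg["data"], old_rows):
        diffs.append({col: (old_row[col], new_row[col])
                      for col in new_row if col in old_row})
    return msg["table"], msg["type"], diffs

print(handle(raw))
```

For the UPDATE above, this yields the table name, the change type, and the single changed column 文章 with its old and new values, which is exactly the information a cache-invalidation or statistics consumer needs.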

Extensions:

Canal information can be viewed in the zookeeper command line:

wangting@ops01:/opt/module/canal/bin >zkCli.sh
Connecting to localhost:2181
[zk: localhost:2181(CONNECTED) 0] ls -w /
[hbase, kafka, otter, wangting, zookeeper]
[zk: localhost:2181(CONNECTED) 1] ls -w /otter
[canal]
[zk: localhost:2181(CONNECTED) 2] ls -w /otter/canal
[cluster, destinations]


Origin blog.csdn.net/wt334502157/article/details/119763273