Big Data Technology: Introduction to Canal
Preface
- Canal version: Canal-1.1.2
- Official website: https://github.com/alibaba/canal/
- Official documentation: https://github.com/alibaba/canal/wiki
Chapter 1 Getting Started with Canal
1.1 What is Canal
Because of the nature of its business, Alibaba's B2B company has sellers concentrated mainly in China and buyers concentrated mainly abroad, which gave rise to the need to synchronize the remote data centers in Hangzhou and the United States.
Since 2010, Alibaba has gradually moved to obtaining incremental changes for synchronization by parsing database logs, and the incremental subscription & consumption business grew out of this.
Canal is middleware, developed in Java, that provides incremental data subscription & consumption based on parsing a database's incremental log. At present, Canal mainly supports parsing the MySQL Binlog; once parsing is complete, a Canal Client is used to process the resulting data. (Database synchronization requires Alibaba's Otter middleware, which is built on Canal.)
1.2 MySQL Binlog
1.2.1 What is Binlog
MySQL's binary log is arguably MySQL's most important log. It records all DDL and DML statements (except data query statements) in the form of events, and it also includes the time each statement took to execute. The binary log is transaction-safe.
Generally speaking, enabling the binary log costs about 1% in performance. The binary log has two main usage scenarios:
- First: MySQL Replication. Binlog is enabled on the Master, and the Master passes its binary log to the Slaves so that Master and Slave data stay consistent.
- Second: data recovery. Data is restored by using the mysqlbinlog tool.
Binary logs consist of two types of files: the binary log index file (file name suffix .index), which records all binary log files, and the binary log files (file name suffix .00000*), which record all DDL and DML statement events (except data query statements) of the database.
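As a rough illustration of how the two file types relate, the sketch below reads an index file and checks whether each file it lists starts with the 4-byte binlog magic number (0xFE followed by "bin"). The function name list_binlog_files is made up for this example, and the sketch assumes the index holds one file name per line; it only inspects the header and is no substitute for mysqlbinlog.

```python
import os

# Every MySQL binlog file begins with these four bytes.
BINLOG_MAGIC = b"\xfebin"


def list_binlog_files(index_path):
    """Return (path, looks_like_binlog) for each entry in a binlog index file.

    The index file is plain text with one binlog file name per line,
    usually relative to the directory that holds the index.
    """
    base_dir = os.path.dirname(os.path.abspath(index_path))
    results = []
    with open(index_path, "r", encoding="utf-8") as index_file:
        for line in index_file:
            name = line.strip()
            if not name:
                continue
            path = os.path.normpath(os.path.join(base_dir, name))
            is_binlog = False
            if os.path.isfile(path):
                with open(path, "rb") as binlog:
                    is_binlog = binlog.read(4) == BINLOG_MAGIC
            results.append((path, is_binlog))
    return results
```

Called on /var/lib/mysql/mysql-bin.index (readable only with sufficient privileges), it would return one tuple per binlog file known to the server.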
1.2.2 Binlog classification
MySQL Binlog has three formats: STATEMENT, MIXED and ROW. The format is selected in the configuration file with binlog_format=statement|mixed|row. The three formats differ as follows:
1) statement: Statement level. The binlog records every statement that performs a write operation. Compared with row mode it saves space, but it may cause inconsistency. For example, update tt set create_date=sysdate() is evaluated again when the binlog is replayed for recovery, so the data may differ because the execution time differs. (now() is less of a concern in practice, because the binlog also records the timestamp at which the statement originally ran.)
- Advantages: saves space.
- Disadvantages: may cause data inconsistency.
2) row: Row level. The binlog records the change to each row after every operation.
- Advantages: keeps data absolutely consistent, because whatever the SQL statement is and whatever functions it calls, only the result of execution is recorded.
- Disadvantages: takes up a lot of space.
3) mixed: An upgraded version of statement that resolves, to some extent, the inconsistencies that statement mode causes in certain situations. The default is still statement, but the following cases are handled in ROW mode:
- when a function contains UUID();
- when a table containing an AUTO_INCREMENT field is updated;
- when an INSERT DELAYED statement is executed;
- when a UDF is used.
Advantages and disadvantages:
- Advantages: saves space while still providing a certain degree of consistency.
- Disadvantages: a few rare cases can still cause inconsistency. In addition, statement and mixed binlogs are inconvenient for anything that needs to monitor the binlog.
Based on the comparison above, and because Canal is used for monitoring and analysis, the row format is the more appropriate choice.
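The difference between statement and row logging can be illustrated with a toy model. The sketch below is not MySQL code: tables are plain Python dicts, and uuid.uuid4() stands in for a nondeterministic SQL function such as UUID(). Replaying a logged statement evaluates the function again, whereas replaying logged row images copies the values that were actually written. All function names are invented for the illustration.

```python
import uuid


def run_statement(table):
    """Statement-style execution: evaluate the function again for every row."""
    for row in table.values():
        row["token"] = str(uuid.uuid4())


def apply_row_images(table, row_images):
    """Row-style replay: copy the logged after-images without re-evaluating."""
    for row_id, after in row_images.items():
        table[row_id] = dict(after)


def fresh_table():
    """Two rows whose token column has not been filled in yet."""
    return {1: {"token": None}, 2: {"token": None}}


# Original execution on the master; the row log keeps the resulting values.
master = fresh_table()
run_statement(master)
row_log = {row_id: dict(row) for row_id, row in master.items()}

# Replaying the statement generates new values; replaying rows copies them.
replica_statement = fresh_table()
run_statement(replica_statement)

replica_row = fresh_table()
apply_row_images(replica_row, row_log)

print(replica_statement == master)  # False
print(replica_row == master)        # True
```

The row log is larger (one image per changed row instead of one statement), which is the space cost listed above.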
1.3 How Canal works
1.3.1 MySQL master-slave replication process
- The Master writes data changes to its binary log (Binary Log);
- The Slave sends a dump request to the MySQL Master and copies the Master's binary log events into its relay log (Relay Log);
- The Slave reads and replays the events in the relay log, applying the changes to its own database.
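The three steps above can be sketched as a small in-memory model. This is only an illustration under simplifying assumptions: the classes Master and Slave, the event dictionaries and the list-based logs are all invented here, and nothing in it speaks the real MySQL replication protocol.

```python
class Master:
    """Keeps an append-only list that stands in for the binary log."""

    def __init__(self):
        self.binlog = []

    def write(self, key, value):
        # Step 1: every change is recorded as an event in the binlog.
        self.binlog.append({"key": key, "value": value})

    def dump(self, from_position):
        """Return the events at or after the requested position."""
        return list(self.binlog[from_position:])


class Slave:
    def __init__(self):
        self.relay_log = []  # local copy of the master's events
        self.applied = 0     # number of relay-log events already replayed
        self.data = {}

    def fetch(self, master):
        """Step 2: request a dump and append the events to the relay log."""
        self.relay_log.extend(master.dump(len(self.relay_log)))

    def replay(self):
        """Step 3: redo the relay-log events that were not applied yet."""
        for event in self.relay_log[self.applied:]:
            self.data[event["key"]] = event["value"]
        self.applied = len(self.relay_log)


master = Master()
master.write("1001", "zhangsan")
master.write("1002", "lisi")

slave = Slave()
slave.fetch(master)
slave.replay()
print(slave.data)  # {'1001': 'zhangsan', '1002': 'lisi'}
```

Tracking a position on the slave side is what lets it resume from where it stopped instead of copying the whole log again.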
1.3.2 How Canal works
It is very simple: Canal disguises itself as a Slave and pretends to replicate data from the Master.
1.4 Usage Scenarios
- Original scenario: part of Alibaba's Otter middleware. Otter is Alibaba's framework for synchronizing databases across remote sites, and Canal is one part of it.
- Common scenario 1: updating a cache.
- Common scenario 2: capturing newly changed data from business tables and using it for real-time statistics (this is our scenario).
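To make the two common scenarios more concrete, here is a minimal sketch of a consumer that keeps a cache up to date and counts changes per table. The event dictionaries (keys type, table, id, row) are a hypothetical shape chosen for this example and are not the data structure the Canal client API returns.

```python
from collections import Counter


def apply_change(cache, stats, event):
    """Apply one row-change event to the cache and count it.

    `event` is a hypothetical dict with the keys: type, table, id, row.
    """
    key = (event["table"], event["id"])
    if event["type"] in ("INSERT", "UPDATE"):
        cache[key] = event["row"]
    elif event["type"] == "DELETE":
        cache.pop(key, None)
    else:
        raise ValueError("unsupported event type: " + event["type"])
    stats[(event["table"], event["type"])] += 1


cache, stats = {}, Counter()
events = [
    {"type": "INSERT", "table": "user_info", "id": "1001",
     "row": {"name": "zhangsan", "sex": "male"}},
    {"type": "UPDATE", "table": "user_info", "id": "1001",
     "row": {"name": "zhangsan", "sex": "female"}},
    {"type": "DELETE", "table": "user_info", "id": "1001", "row": None},
]
for event in events:
    apply_change(cache, stats, event)

print(cache)                            # {}
print(stats[("user_info", "INSERT")])   # 1
```

In a real deployment the cache would be an external store and the events would come from a Canal client, but the update logic has the same shape.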
Chapter 2 MySQL Preparation
2.1 Create a database
2.2 Create a data table
CREATE TABLE user_info(
`id` VARCHAR(255),
`name` VARCHAR(255),
`sex` VARCHAR(255)
);
2.3 Modify the configuration file to enable Binlog
[zhangsan@node01 module]$ sudo vim /etc/my.cnf
server-id=1 # required for MySQL replication; must not be the same as Canal's slaveId
log-bin=mysql-bin
binlog_format=row
binlog-do-db=gmall-2021
Note: set binlog-do-db according to your own situation. It names the specific database to be synchronized; if it is not configured, Binlog is enabled for all databases.
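Before restarting MySQL, it can be handy to confirm that the settings Canal relies on are really in the file. The following is a minimal sketch, not an official tool: the helper names are invented, section headers such as [mysqld] are simply skipped, and option names are normalized to dashes because MySQL accepts dashes and underscores interchangeably in option names.

```python
REQUIRED = {"server-id", "log-bin", "binlog-format"}


def parse_mysqld_options(text):
    """Collect key=value options from the text of a my.cnf-style file.

    Simplification: section headers are ignored and '#' starts a comment,
    which is enough for the handful of options discussed here.
    """
    options = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or line.startswith("[") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        options[key.strip().replace("_", "-")] = value.strip()
    return options


def binlog_problems(options):
    """Return a list of readable problems; an empty list means it looks fine."""
    problems = [
        "missing option: " + name for name in sorted(REQUIRED - set(options))
    ]
    fmt = options.get("binlog-format", "").lower()
    if fmt and fmt != "row":
        problems.append("binlog-format is " + fmt + ", expected row")
    return problems
```

For example, binlog_problems(parse_mysqld_options(open("/etc/my.cnf").read())) should return an empty list for the configuration above.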
2.4 Restart MySQL to make the configuration take effect
sudo systemctl restart mysqld
Go to the /var/lib/mysql directory and check the initial size of the binlog file (154 bytes):
[zhangsan@node01 lib]$ pwd
/var/lib
[zhangsan@node01 lib]$ sudo ls -l mysql
total 474152
-rw-r-----. 1 mysql mysql 56 Aug 7 2020 auto.cnf
drwxr-x---. 2 mysql mysql 4096 Sep 25 2020 azkaban
-rw-------. 1 mysql mysql 1680 Aug 7 2020 ca-key.pem
-rw-r--r--. 1 mysql mysql 1112 Aug 7 2020 ca.pem
drwxr-x--- 2 mysql mysql 4096 Aug 18 16:56 cdc_test
-rw-r--r--. 1 mysql mysql 1112 Aug 7 2020 client-cert.pem
-rw-------. 1 mysql mysql 1676 Aug 7 2020 client-key.pem
drwxr-x---. 2 mysql mysql 4096 Sep 25 2020 gmall_report
-rw-r----- 1 mysql mysql 1085 Dec 1 09:12 ib_buffer_pool
-rw-r-----. 1 mysql mysql 79691776 Dec 13 08:45 ibdata1
-rw-r-----. 1 mysql mysql 50331648 Dec 13 08:45 ib_logfile0
-rw-r-----. 1 mysql mysql 50331648 Dec 13 08:45 ib_logfile1
-rw-r----- 1 mysql mysql 12582912 Dec 13 08:45 ibtmp1
drwxr-x--- 2 mysql mysql 4096 Sep 22 15:30 maxwell
drwxr-x---. 2 mysql mysql 4096 Aug 12 2020 metastore
drwxr-x---. 2 mysql mysql 4096 Sep 22 15:43 mysql
-rw-r-----. 1 mysql mysql 154 Dec 13 08:45 mysql-bin.000001
-rw-r----- 1 mysql mysql 19 Dec 13 08:45 mysql-bin.index
srwxrwxrwx 1 mysql mysql 0 Dec 13 08:45 mysql.sock
-rw------- 1 mysql mysql 5 Dec 13 08:45 mysql.sock.lock
drwxr-x---. 2 mysql mysql 4096 Aug 7 2020 performance_schema
-rw-------. 1 mysql mysql 1680 Aug 7 2020 private_key.pem
-rw-r--r--. 1 mysql mysql 452 Aug 7 2020 public_key.pem
-rw-r--r--. 1 mysql mysql 1112 Aug 7 2020 server-cert.pem
-rw-------. 1 mysql mysql 1680 Aug 7 2020 server-key.pem
drwxr-x---. 2 mysql mysql 12288 Aug 7 2020 sys
drwxr-x--- 2 mysql mysql 4096 Feb 2 2021 test
[zhangsan@node01 lib]$
As you can see, the size of mysql-bin.000001 is 154 bytes.
2.5 Test whether Binlog is enabled
- Insert data
INSERT INTO user_info VALUES('1001','zhangsan','male');
- Go to the /var/lib/mysql directory again and check the size of the binlog file
-rw-------. 1 mysql mysql 1680 Aug 7 2020 ca-key.pem
-rw-r--r--. 1 mysql mysql 1112 Aug 7 2020 ca.pem
drwxr-x--- 2 mysql mysql 4096 Aug 18 16:56 cdc_test
-rw-r--r--. 1 mysql mysql 1112 Aug 7 2020 client-cert.pem
-rw-------. 1 mysql mysql 1676 Aug 7 2020 client-key.pem
drwxr-x---. 2 mysql mysql 4096 Sep 25 2020 gmall_report
-rw-r----- 1 mysql mysql 1085 Dec 1 09:12 ib_buffer_pool
-rw-r-----. 1 mysql mysql 79691776 Dec 13 08:45 ibdata1
-rw-r-----. 1 mysql mysql 50331648 Dec 13 08:45 ib_logfile0
-rw-r-----. 1 mysql mysql 50331648 Dec 13 08:45 ib_logfile1
-rw-r----- 1 mysql mysql 12582912 Dec 13 08:45 ibtmp1
drwxr-x--- 2 mysql mysql 4096 Sep 22 15:30 maxwell
drwxr-x---. 2 mysql mysql 4096 Aug 12 2020 metastore
drwxr-x---. 2 mysql mysql 4096 Sep 22 15:43 mysql
-rw-r-----. 1 mysql mysql 452 Dec 13 08:45 mysql-bin.000001
-rw-r----- 1 mysql mysql 19 Dec 13 08:45 mysql-bin.index
srwxrwxrwx 1 mysql mysql 0 Dec 13 08:45 mysql.sock
-rw------- 1 mysql mysql 5 Dec 13 08:45 mysql.sock.lock
drwxr-x---. 2 mysql mysql 4096 Aug 7 2020 performance_schema
-rw-------. 1 mysql mysql 1680 Aug 7 2020 private_key.pem
-rw-r--r--. 1 mysql mysql 452 Aug 7 2020 public_key.pem
-rw-r--r--. 1 mysql mysql 1112 Aug 7 2020 server-cert.pem
-rw-------. 1 mysql mysql 1680 Aug 7 2020 server-key.pem
drwxr-x---. 2 mysql mysql 12288 Aug 7 2020 sys
drwxr-x--- 2 mysql mysql 4096 Feb 2 2021 test
[zhangsan@node01 lib]$
It can be seen that mysql-bin.000001 has grown to 452 bytes, which confirms that the insert was written to the Binlog.
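The same before-and-after comparison can be scripted instead of reading directory listings by eye. The helper below is a small sketch with an invented name; it assumes the script runs with enough privileges to read the MySQL data directory.

```python
import os


def binlog_grew(path, previous_size):
    """Return (grew, current_size) for the file at `path`.

    `previous_size` is the size in bytes noted before the write.
    """
    current_size = os.path.getsize(path)
    return current_size > previous_size, current_size
```

With the numbers from this section, binlog_grew("/var/lib/mysql/mysql-bin.000001", 154) would report growth once the INSERT has been executed.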
2.6 Authorization
Execute the following in MySQL: relax the password validation rules, then grant the canal user the SELECT, REPLICATION SLAVE and REPLICATION CLIENT privileges
mysql> set global validate_password_length=4;
mysql> set global validate_password_policy=0;
mysql> GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO
'canal'@'%' IDENTIFIED BY 'canal' ;
Then view the user table in the mysql database to confirm that the canal user exists.
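Besides looking at the user table, the output of SHOW GRANTS FOR 'canal'@'%' can be checked for the three privileges Canal needs. The sketch below only parses a line of that output as text; the function name is hypothetical and no database connection is made.

```python
import re

NEEDED = {"SELECT", "REPLICATION SLAVE", "REPLICATION CLIENT"}


def missing_privileges(grant_line, needed=NEEDED):
    """Return the needed privileges that a SHOW GRANTS line does not list."""
    match = re.match(
        r"\s*GRANT\s+(.*?)\s+ON\s+", grant_line, re.IGNORECASE | re.DOTALL
    )
    if not match:
        raise ValueError("not a GRANT statement: " + grant_line)
    granted = {part.strip().upper() for part in match.group(1).split(",")}
    if "ALL PRIVILEGES" in granted or "ALL" in granted:
        return set()
    return set(needed) - granted
```

An empty result means that the line covers everything Canal needs; anything else names the privileges still to be granted.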
Chapter 3 Canal Download and Installation
3.1 Download and extract the installation package
https://github.com/alibaba/canal/releases
After downloading, copy canal.deployer-1.1.2.tar.gz to the /opt/software directory, and then extract it into the /opt/module/canal-1.1.2 directory.
Note: the archive has no top-level directory, so its contents are scattered when it is extracted. Create the target canal directory first and specify it as the extraction directory.
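The note above amounts to "create the directory first, then extract into it", which on the command line is mkdir followed by tar with the -C option. A minimal Python sketch of the same idea follows; the function name is invented, and it should only be used on archives you trust.

```python
import os
import tarfile


def extract_into(archive_path, target_dir):
    """Create target_dir if needed and extract the archive inside it.

    Returns the sorted names of the entries now present in target_dir.
    """
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target_dir)
    return sorted(os.listdir(target_dir))
```

Calling extract_into with the downloaded canal.deployer-1.1.2.tar.gz and the chosen installation directory keeps all of Canal's files together in that one directory.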
3.2 Modify the configuration of canal.properties
[zhangsan@node01 conf]$ pwd
/opt/module/canal/conf
[zhangsan@node01 conf]$ vim canal.properties
#################################################
######### common argument #############
#################################################
canal.id = 1
canal.ip =
canal.port = 11111
canal.metrics.pull.port = 11112
canal.zkServers =
# flush data to zk
canal.zookeeper.flush.period = 1000
canal.withoutNetty = false
# tcp, kafka, RocketMQ
canal.serverMode = tcp
# flush meta cursor/parse position to file
Description: this file holds Canal's general configuration. The default port of Canal is 11111. canal.serverMode sets Canal's output mode: the default is tcp, and it can be changed to kafka to send the output to Kafka.
Multi-instance configuration: as the Canal architecture shows, one Canal service can host multiple instances. Each example directory under conf/ is one instance, and each instance has its own configuration file. By default there is only one instance, example. If multiple instances are needed to process data from different MySQL servers, copy the example directory several times and rename the copies so that the names match those specified in the configuration file, and then set canal.destinations=instance1,instance2,instance3 in canal.properties.
#################################################
######### destinations #############
#################################################
canal.destinations = example
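Since every name in canal.destinations must have a matching directory under conf/, the relationship can be checked with a few lines of code. This is a rough sketch with invented helper names; it handles only simple key = value lines and not the full Java properties syntax.

```python
import os


def parse_properties(text):
    """Parse simple key = value lines; '#' lines and blanks are skipped."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def instance_config_paths(props, conf_dir):
    """Map each destination to the instance.properties path it should have."""
    raw = props.get("canal.destinations", "")
    names = [name.strip() for name in raw.split(",")]
    return {
        name: os.path.join(conf_dir, name, "instance.properties")
        for name in names
        if name
    }
```

For the default configuration this yields a single entry, example, pointing at conf/example/instance.properties; with three destinations it yields three paths whose existence can then be tested with os.path.isfile.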
3.3 Modify instance.properties
We only read data from one MySQL server here, so there is only one instance, and its configuration file is in the conf/example directory.
[zhangsan@node01 example]$ pwd
/opt/module/canal/conf/example
[zhangsan@node01 example]$ vim instance.properties
- Configure the MySQL server address
Note: the value of canal.instance.mysql.slaveId must not be the same as the server-id in /etc/my.cnf. Canal acts as a slave node, and in master-slave replication every node must have a different server-id.
#################################################
## mysql serverId , v1.0.26+ will autoGen
canal.instance.mysql.slaveId=20
# enable gtid use true/false
canal.instance.gtidon=false
# position info
canal.instance.master.address=node01:3306
- Configure the username and password used to connect to MySQL; the defaults match the canal user we authorized earlier
# username/password
canal.instance.dbUsername=canal
canal.instance.dbPassword=canal
canal.instance.connectionCharset = UTF-8
canal.instance.defaultDatabaseName =test
# enable druid Decrypt database password
canal.instance.enableDruid=false
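Because a clash between canal.instance.mysql.slaveId and MySQL's server-id is easy to overlook, a small check can compare the two files. The sketch below uses invented helper names and a simplified key=value reader in which '#' starts a comment.

```python
def read_option(text, name):
    """Return the value of `name` from key=value text, or None if absent."""
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == name:
            return value.strip()
    return None


def ids_conflict(my_cnf_text, instance_text):
    """Return True when Canal's slaveId equals the MySQL server-id."""
    server_id = read_option(my_cnf_text, "server-id")
    slave_id = read_option(instance_text, "canal.instance.mysql.slaveId")
    if server_id is None or slave_id is None:
        raise ValueError("server-id or canal.instance.mysql.slaveId not found")
    return int(server_id) == int(slave_id)
```

With server-id=1 in /etc/my.cnf and canal.instance.mysql.slaveId=20 as configured above, ids_conflict returns False, which is the desired state.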
Errors Encountered During Testing
Error message: check the log at logs/canal/canal.log under the Canal installation directory. This assumes that canal.destinations = example in canal.properties has not been changed; if it is changed to [test_xxx], the log is at logs/test_xxx/test_xxx.log under the Canal installation directory.
[zhangsan@node01 canal]$ cat canal.log
2023-01-07 15:10:56.713 [main] INFO com.alibaba.otter.canal.deployer.CanalLauncher - ## set default uncaught exception handler
2023-01-07 15:10:56.759 [main] INFO com.alibaba.otter.canal.deployer.CanalLauncher - ## load canal configurations
2023-01-07 15:10:56.771 [main] INFO com.alibaba.otter.canal.deployer.CanalStarter - ## start the canal server.
2023-01-07 15:10:56.851 [main] INFO com.alibaba.otter.canal.deployer.CanalController - ## start the canal server[192.102.153.10(192.102.153.10):11111]
2023-01-07 15:10:58.627 [main] INFO com.alibaba.otter.canal.deployer.CanalStarter - ## the canal server is running now ......
2023-01-07 15:10:58.822 [canal-instance-scan-0] INFO com.alibaba.otter.canal.deployer.CanalController - auto notify start doris-load successful.
2023-01-07 15:15:34.251 [New I/O server worker #1-1] ERROR c.a.otter.canal.server.netty.handler.SessionHandler - something goes wrong with channel:[id: 0x71dc2f6d, /192.102.153.1:57500 => /192.102.153.10:11111], exception=java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:322)
at org.jboss.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:281)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:201)
at org.jboss.netty.util.internal.IoWorkerRunnable.run(IoWorkerRunnable.java:46)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
I searched for information about this error, but none of what I found helped:
https://github.com/alibaba/canal/issues/3585
I also ran into the following error:
https://github.com/alibaba/canal/issues/640
Failed to monitor MySQL data in real time
Reason: when Canal was first extracted, no installation directory had been created beforehand, so the archive's contents were scattered across the current directory; the scattered directories were then moved into a newly created canal-1.1.5 directory.
Solution: delete the canal-1.1.5 directory, then extract and install Canal again.
Official Documentation Reference
- AdminGuide
https://github.com/alibaba/canal/wiki/AdminGuide
- ClientAPI
https://github.com/alibaba/canal/wiki/ClientAPI
- ClientExample
https://github.com/alibaba/canal/wiki/ClientExample
Finish!