1. Background knowledge
1.1 Pronunciation of Canal
Canal is commonly read simply as the English word "canal".
1.2 Prerequisites
- Basic MySQL operations
- Java fundamentals
- Spring Boot
2. Introduction to Canal
2.1 Historical background
In the early days, Alibaba ran server rooms in both Hangzhou and the United States, and the business needed data synchronization across them. The first implementation relied on business-level triggers, which was inconvenient. From around 2010 this approach was gradually replaced by parsing database logs, which gave rise to a large family of incremental data subscription and consumption tools. Canal was born in this context.
Around 2014 it was first used for Tmall's Double Eleven to handle the massive concurrent reads and writes hitting MySQL during the promotion. It was later widely adopted and promoted inside Alibaba, and officially open sourced in 2017.
GitHub: https://github.com/alibaba/canal
2.2 Definition
Canal is a component based on parsing MySQL's incremental log (Binlog). It provides incremental data subscription and consumption, and supports delivering the incremental data to downstream consumers (such as Kafka or RocketMQ) or storage systems (such as Elasticsearch or HBase).
In plain language: Canal detects changes to MySQL data, parses the changed rows, and sends them to an MQ or synchronizes them to another database, where further business logic can process them.
3. The working principle of Canal
3.1 MySQL master-slave replication
- The MySQL master writes data changes to its binary log (Binlog for short).
- The MySQL slave copies the master's binary log into its own relay log.
- The MySQL slave replays the operations in the relay log, bringing its data up to date.
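The three replication steps above can be sketched as a toy model (purely illustrative: the class names Master and Slave are invented for this sketch, and a real MySQL replica does all of this inside the server):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReplicationSketch {
    // A single row change event, as it would appear in the binlog.
    static class RowChange {
        final long id;
        final String name;
        RowChange(long id, String name) { this.id = id; this.name = name; }
    }

    static class Master {
        final Map<Long, String> data = new HashMap<>();
        final List<RowChange> binlog = new ArrayList<>();   // step 1: changes go to the binlog
        void upsert(long id, String name) {
            data.put(id, name);
            binlog.add(new RowChange(id, name));
        }
    }

    static class Slave {
        final Map<Long, String> data = new HashMap<>();
        final List<RowChange> relayLog = new ArrayList<>();
        // step 2: copy the master's binlog into the relay log
        void fetchFrom(Master m) { relayLog.addAll(m.binlog); }
        // step 3: replay the relay log to catch up
        void replay() {
            for (RowChange c : relayLog) data.put(c.id, c.name);
            relayLog.clear();
        }
    }

    public static void main(String[] args) {
        Master master = new Master();
        Slave slave = new Slave();
        master.upsert(1L, "dafei");
        master.upsert(2L, "langfei");
        slave.fetchFrom(master);
        slave.replay();
        System.out.println("slave caught up: " + slave.data.equals(master.data));
    }
}
```

After replaying the relay log, the slave's data set matches the master's, which is exactly the invariant real replication maintains.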
3.2 MySQL Binlog log
3.2.1 Introduction
MySQL's Binlog is arguably its most important log: it records all DDL and DML statements in the form of events.
Binlog is disabled by default, because writing it costs extra work; official figures put the overhead at roughly 1%.
Whether to enable it depends on the actual requirements of your project.
Generally speaking, Binlog is enabled in the following two scenarios:
- Master-slave deployment: Binlog must be enabled on the master so that changes can be synchronized to the slaves.
- Data recovery: the mysqlbinlog tool can replay Binlog events to restore data.
3.2.2 Binlog formats
MySQL Binlog comes in three formats: STATEMENT, MIXED, and ROW. The format is selected in the configuration file with binlog_format=statement|mixed|row.
Format | Description | Advantages | Drawbacks |
---|---|---|---|
STATEMENT | Statement level: records every statement that performs a write. Saves space compared with ROW, but can cause inconsistencies; e.g. update tt set create_date=now() produces different data on master and slave because the statement executes at different times. | Saves space | May cause data inconsistency |
ROW | Row level: records what each affected row looks like after each operation. If one UPDATE changes 10,000 rows, STATEMENT stores a single statement while ROW stores all 10,000 row images. | Absolute data consistency: whatever SQL ran and whatever functions it called, only the resulting row values are recorded. | Takes much more space |
MIXED | An upgraded STATEMENT: statements that are unsafe to replay, such as ones calling UUID(), updates to tables with AUTO_INCREMENT columns, INSERT DELAYED, or UDF calls, are logged in ROW format instead. | Saves space while keeping reasonable consistency | A few rare cases can still diverge; moreover, STATEMENT and MIXED are inconvenient when the Binlog needs to be monitored row by row. |
Given this comparison, ROW is the most suitable format for Canal, which needs to monitor and parse concrete data changes.
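A tiny simulation of why STATEMENT replication can diverge while ROW cannot. The "clocks" are fake values standing in for what now() returns on the master versus at replay time on the slave; this is a conceptual sketch, not MySQL code:

```java
public class BinlogFormatSketch {
    public static void main(String[] args) {
        // Fake clocks: what now() returns on the master, and what it would
        // return later, when the slave replays the statement.
        long masterNow = 1000L;
        long slaveNow = 2000L;

        // STATEMENT format ships the SQL text "create_date = now()",
        // so the slave re-evaluates now() at replay time.
        long statementReplay = slaveNow;

        // ROW format ships the concrete value the master already computed.
        long rowReplay = masterNow;

        System.out.println("statement consistent: " + (statementReplay == masterNow));
        System.out.println("row consistent: " + (rowReplay == masterNow));
    }
}
```

The statement-based replay diverges because the non-deterministic function is evaluated twice at different moments, while the row-based replay simply copies the master's result.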
3.3 Canal's working principle
- Canal disguises itself as a MySQL slave and sends a dump protocol request to the MySQL master.
- The master accepts the dump request and starts pushing its binary log to this "slave" (that is, to Canal).
- Canal receives and parses the Binlog, extracts the changed data, and runs the subsequent processing logic.
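The same three steps can be sketched as a toy pipeline (illustrative only: FakeMaster and FakeCanal are invented names, and the real dump protocol is MySQL's binary replication protocol, not a Java callback):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class CanalPrincipleSketch {
    interface BinlogListener { void onEvent(String rawEvent); }

    static class FakeMaster {
        final List<BinlogListener> slaves = new ArrayList<>();
        // A "dump" request registers the caller as a replication target.
        void dump(BinlogListener slave) { slaves.add(slave); }
        // Every write is pushed to all registered "slaves".
        void write(String event) {
            for (BinlogListener s : slaves) s.onEvent(event);
        }
    }

    static class FakeCanal implements BinlogListener {
        final Consumer<String> downstream;
        FakeCanal(Consumer<String> downstream) { this.downstream = downstream; }
        @Override public void onEvent(String rawEvent) {
            // "Parse" the raw binlog event, then hand the change downstream.
            downstream.accept("parsed:" + rawEvent);
        }
    }

    public static void main(String[] args) {
        List<String> mq = new ArrayList<>();       // stands in for Kafka/RocketMQ
        FakeMaster master = new FakeMaster();
        master.dump(new FakeCanal(mq::add));       // Canal pretends to be a slave
        master.write("UPDATE user SET age=19 WHERE id=1");
        System.out.println(mq);
    }
}
```

The key design idea is that the master never knows the difference: from its point of view, Canal is just another replica asking for the binlog stream.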
4. Canal application scenarios
4.1 Data Synchronization
Canal can help users perform various data synchronization operations, such as real-time synchronization of MySQL data to Elasticsearch, Redis and other data storage media.
4.2 Database real-time monitoring
Canal can monitor MySQL update operations in real time and promptly notify the relevant staff when sensitive data is modified.
4.3 Data Analysis and Mining
Canal can post MySQL incremental data to message queues such as Kafka to provide data sources for data analysis and mining.
4.4 Database backup
Canal can replicate the incremental log from the MySQL master to a standby database, implementing a live backup.
4.5 Data Integration
Canal can integrate data from multiple MySQL databases to provide more efficient and reliable solutions for data processing.
4.6 Database migration
Canal can assist in completing MySQL database version upgrades and data migration tasks.
5. MySQL preparation
5.1 Create a database
Create a new database: canal-demo
5.2 Create table
user table
CREATE TABLE `user` (
`id` bigint NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`age` int DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3;
5.3 Modify the configuration file to enable Binlog
Edit the MySQL configuration file (my.ini) and add:
server-id=1
log-bin=C:/ProgramData/MySQL/MySQL Server 8.0/binlogs/mysql-bin.log
binlog_format=row
binlog-do-db=canal-demo
server-id: the MySQL instance id, used to tell instances apart in a cluster
log-bin: the Binlog file name (and path)
binlog_format: the storage format of the Binlog
binlog-do-db: the database(s) for which Binlog is enabled.
Note: usually you list only the databases that need to be synchronized; if this option is omitted, Binlog is enabled for all databases.
5.4 Verify that Binlog takes effect
Restart the MySQL service, then check the Binlog:
Method 1: run the following query and confirm that log_bin is ON:
show VARIABLES like 'log_bin';
Method 2: open the Binlog directory configured above (C:/ProgramData/MySQL/MySQL Server 8.0/binlogs/), run a few writes, and watch the mysql-bin.* files appear and grow:
insert into user(name, age) values('dafei', 18);
insert into user(name, age) values('dafei', 18);
insert into user(name, age) values('dafei', 18);
6. Canal installation and configuration
6.1 Download
Download address: https://github.com/alibaba/canal/releases
Download the canal.deployer package and simply unzip it.
6.2 Configuration
6.2.1 Modify the configuration of canal.properties
canal.port = 11111
# tcp, kafka, rocketMQ, rabbitMQ, pulsarMQ
canal.serverMode = tcp
canal.destinations = example
canal.port: the port Canal listens on; defaults to 11111
canal.serverMode: the serving mode; tcp means clients pull the data directly over TCP, while the MQ modes push the output to the corresponding message middleware
canal.destinations: Canal can collect data from multiple MySQL instances, each controlled by its own configuration. The rule: under the conf/ directory, each folder represents one MySQL instance, and canal.destinations lists the instances whose data should be monitored. Separate multiple destinations with commas.
6.2.2 Modify the MySQL instance configuration file instance.properties
These settings live under the conf/example/ directory (one subdirectory per destination):
canal.instance.mysql.slaveId=20
# position info
canal.instance.master.address=127.0.0.1:3306
# username/password
canal.instance.dbUsername=root
canal.instance.dbPassword=admin
canal.instance.mysql.slaveId: the slave id Canal uses when posing as a replica; it must not clash with the server-id of any MySQL instance
canal.instance.master.address: the ip:port of the source database
canal.instance.dbUsername: the account used to connect to MySQL
canal.instance.dbPassword: the password of that account
Note: the account needs replication privileges, e.g. GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'canal'@'%';
6.3 Start
Run bin/startup.bat on Windows (double-clicking it works) or bin/startup.sh on Linux to start the Canal server.
7. Canal programming
7.1 Helloworld
1>Create project: canal-hello
2> Import related dependencies
<dependency>
<groupId>com.alibaba.otter</groupId>
<artifactId>canal.client</artifactId>
<version>1.1.0</version>
</dependency>
3> Write test code
package com.langfeiyes.hello;
import com.alibaba.otter.canal.client.CanalConnector;
import com.alibaba.otter.canal.client.CanalConnectors;
import com.alibaba.otter.canal.protocol.CanalEntry;
import com.alibaba.otter.canal.protocol.Message;
import com.google.protobuf.ByteString;
import com.google.protobuf.InvalidProtocolBufferException;
import java.net.InetSocketAddress;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
public class CanalDemo {
    public static void main(String[] args) throws InvalidProtocolBufferException {
        // 1. Create the canal connector (server address, destination, username, password)
        CanalConnector canalConnector = CanalConnectors.newSingleConnector(
                new InetSocketAddress("localhost", 11111), "example", "", "");
        while (true) {
            // 2. Connect to the canal server
            canalConnector.connect();
            // 3. Subscribe to the database/tables to monitor
            canalConnector.subscribe("canal-demo.*");
            // 4. Pull up to 100 entries in one batch
            Message message = canalConnector.get(100);
            List<CanalEntry.Entry> entries = message.getEntries();
            if (entries.isEmpty()) {
                System.out.println("No data, sleeping for a while...");
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            } else {
                for (CanalEntry.Entry entry : entries) {
                    // Table the change belongs to
                    String tableName = entry.getHeader().getTableName();
                    // Entry type
                    CanalEntry.EntryType entryType = entry.getEntryType();
                    // Only ROWDATA entries carry row changes
                    if (CanalEntry.EntryType.ROWDATA.equals(entryType)) {
                        // Serialized payload
                        ByteString storeValue = entry.getStoreValue();
                        // Deserialize into a RowChange
                        CanalEntry.RowChange rowChange = CanalEntry.RowChange.parseFrom(storeValue);
                        // Event type (INSERT / UPDATE / DELETE ...)
                        CanalEntry.EventType eventType = rowChange.getEventType();
                        // The concrete row data
                        List<CanalEntry.RowData> rowDatasList = rowChange.getRowDatasList();
                        // Print every changed row, before and after images
                        for (CanalEntry.RowData rowData : rowDatasList) {
                            Map<String, Object> bMap = new HashMap<>();
                            for (CanalEntry.Column column : rowData.getBeforeColumnsList()) {
                                bMap.put(column.getName(), column.getValue());
                            }
                            Map<String, Object> afMap = new HashMap<>();
                            for (CanalEntry.Column column : rowData.getAfterColumnsList()) {
                                afMap.put(column.getName(), column.getValue());
                            }
                            System.out.println("table: " + tableName + ", event: " + eventType);
                            System.out.println("before: " + bMap);
                            System.out.println("after:  " + afMap);
                        }
                    }
                }
            }
        }
    }
}
4> Test
Perform DML operations on the user table in the canal-demo database and observe the printed output.
7.2 SpringBoot integration
1>Create project: canal-sb-demo
2> Import related dependencies
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>2.7.11</version>
</parent>
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter</artifactId>
</dependency>
<dependency>
<groupId>top.javatool</groupId>
<artifactId>canal-spring-boot-starter</artifactId>
<version>1.2.6-RELEASE</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>8.0.12</version>
</dependency>
<dependency>
<groupId>com.google.protobuf</groupId>
<artifactId>protobuf-java</artifactId>
<version>3.21.4</version>
</dependency>
</dependencies>
3>Configuration file
canal:
  server: 127.0.0.1:11111  # canal server address; the default port is 11111
  destination: example
spring:
  application:
    name: canal-sb-demo
  datasource:
    driver-class-name: com.mysql.cj.jdbc.Driver
    url: jdbc:mysql://127.0.0.1:3306/canal-demo?useUnicode=true&characterEncoding=utf-8&serverTimezone=UTC&useSSL=false
    username: root
    password: admin
4> Entity Object
package com.langfeiyes.sb.domain;
public class User {
    private Long id;
    private String name;
    private Integer age;

    public Long getId() { return id; }
    public void setId(Long id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public Integer getAge() { return age; }
    public void setAge(Integer age) { this.age = age; }

    @Override
    public String toString() {
        return "User{" + "id=" + id + ", name='" + name + '\'' + ", age=" + age + '}';
    }
}
5>Monitoring processing class
package com.langfeiyes.sb.handler;
import com.langfeiyes.sb.domain.User;
import org.springframework.stereotype.Component;
import top.javatool.canal.client.annotation.CanalTable;
import top.javatool.canal.client.handler.EntryHandler;
@Component
@CanalTable(value = "user")
public class UserHandler implements EntryHandler<User> {
    @Override
    public void insert(User user) {
        System.err.println("insert: " + user);
    }
    @Override
    public void update(User before, User after) {
        System.err.println("before: " + before);
        System.err.println("after:  " + after);
    }
    @Override
    public void delete(User user) {
        System.err.println("delete: " + user);
    }
}
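To see roughly what the starter does under the hood, here is a hedged sketch of how a row-change event could be dispatched to the matching EntryHandler callback. This is a simplified guess at the pattern, not the starter's actual code: the real canal-spring-boot-starter also maps column values onto the entity (typically via reflection), which is omitted here.

```java
import java.util.Map;

public class HandlerDispatchSketch {
    // Same callback shape as the starter's EntryHandler<T>.
    interface EntryHandler<T> {
        void insert(T after);
        void update(T before, T after);
        void delete(T before);
    }

    enum EventType { INSERT, UPDATE, DELETE }

    // Route one parsed row change to the right callback.
    static <T> void dispatch(EventType type, T before, T after, EntryHandler<T> h) {
        switch (type) {
            case INSERT: h.insert(after); break;
            case UPDATE: h.update(before, after); break;
            case DELETE: h.delete(before); break;
        }
    }

    public static void main(String[] args) {
        // Column maps stand in for the mapped entity objects.
        EntryHandler<Map<String, String>> handler = new EntryHandler<Map<String, String>>() {
            public void insert(Map<String, String> after) { System.out.println("insert:" + after); }
            public void update(Map<String, String> b, Map<String, String> a) {
                System.out.println("update:" + b + "->" + a);
            }
            public void delete(Map<String, String> before) { System.out.println("delete:" + before); }
        };
        dispatch(EventType.UPDATE, Map.of("age", "18"), Map.of("age", "19"), handler);
    }
}
```

This shows why your handler only has to implement three typed callbacks: the event-type branching and entity mapping are the framework's job.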
6>Start class
package com.langfeiyes.sb;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
@SpringBootApplication
public class App {
    public static void main(String[] args) {
        SpringApplication.run(App.class, args);
    }
}
7> Test
- Start the Canal server first
- Restart the project
- Modify the user table
- Observe the results
8. Similar technologies
Type 1: data synchronization components based on log parsing
These components obtain the database's insert, update, and delete operations by parsing log files such as MySQL's Binlog or Oracle's redo log, and record those operations. The operation records can then be applied to another database, achieving data synchronization. Representative products include Alibaba's open-source Canal and Tencent Cloud's DBSync.
Type 2: ETL-based data synchronization components
ETL stands for Extract-Transform-Load: extracting data from the source system, transforming it, and finally loading it into the target system. Such components usually require writing complex transformation rules and data mappings, and suit scenarios with frequently changing data structures, large data volumes, and many data sources. Representative products include Alibaba Cloud's DataWorks and Informatica PowerCenter.
Type 3: CDC-based data synchronization components
CDC (Change Data Capture) is a data synchronization technique that captures data changes in a database in real time or near-real time and transmits them to another database. CDC is implemented on top of the database's transaction log or redo log, enabling low-latency, high-performance synchronization. Representative CDC products include Oracle GoldenGate and IBM InfoSphere Data Replication.
Type 4: Data synchronization component based on message queue
Such components usually abstract database change operations into a data structure and publish it through a message queue for other systems to process, achieving asynchronous transmission and decoupling of data. Representative products include Apache Kafka and RabbitMQ.
9. Common Canal interview questions
Q: What is Canal? What are the characteristics?
Answer: Canal is an open-source incremental data subscription and consumption middleware from Alibaba, built on top of database log parsing (its networking layer is based on Netty). It is widely used in real-time data synchronization and data distribution scenarios. Its main features: parses and subscribes to database incremental logs (primarily MySQL); supports multiple output targets, such as Kafka, RocketMQ, and ActiveMQ; supports data filtering and format conversion; and offers low latency and high reliability.
Q: How does Canal work?
Answer: Canal obtains the database's insert, update, and delete operations by parsing its Binlog, then delivers these change events to downstream consumers. Canal has two core parts, Server and Client: the Server connects to the database, pulls and parses the Binlog; the Client connects to the Server, subscribes, and consumes the parsed change data. Canal also supports several data exporters, such as Kafka, RocketMQ, and ActiveMQ, so the parsed data can be sent to different message queues for further processing and analysis.
Q: What are the pros and cons of Canal?
Answer: Canal's advantages include high performance, a distributed design, good reliability, support for data filtering and conversion, and applicability across database types (such as MySQL and Oracle). Its disadvantages include a non-trivial learning curve, some overhead from keeping database logging enabled, and no support for backtracking: it cannot retrieve historical data from before the point where it started reading the log.
Q: What application scenarios does Canal have in business?
A: Canal is mainly used in real-time data synchronization and data distribution scenarios. Common application scenarios include: data backup and disaster recovery, incremental data extraction and synchronization, real-time data analysis, online data migration, etc. Especially in the Internet big data scenario, Canal has become one of the important tools for various data processing tasks.