Quickly build an HTAP system from open source applications

Use ProxySQL, MySQL, and ClickHouse to quickly build an HTAP system

1. About ClickHouse

As enterprise data volumes grow and analytical requirements become more complex, the pressure on MySQL, which is designed mainly for OLTP workloads, keeps increasing. The community edition of Infobright, which could be used for free years ago, has long since disappeared. InfiniDB was later picked up by MariaDB and turned into ColumnStore, but its development in recent years has been lackluster, and it is not an ideal OLAP solution.

ClickHouse, which originated in Russia, has been in the limelight in recent years and has more and more users in China. Several public clouds also offer it as a product or service. It is arguably the fastest OLAP database on the market at the moment, and its performance far exceeds that of Vertica, Sybase IQ, and others. For more on ClickHouse performance, see my earlier test report, "ClickHouse Performance Test".

ClickHouse is also MySQL-friendly. Besides sharing the same or similar syntax for many statements, it even accepts connections from the MySQL client over the MySQL protocol.

ClickHouse can be attached to MySQL as a replica: it first synchronizes the full data set and then applies incremental changes in real time. This is arguably the most eye-catching and most needed feature released this year. With it, we can easily build an enterprise-grade solution and take the headache out of integrating OLTP and OLAP. It currently supports MySQL 5.6, 5.7, and 8.0, handles DELETE and UPDATE statements, and supports most commonly used DDL operations. You only need to install a recent version of ClickHouse to try it, but note that the feature is still experimental and is still being improved.

2. About ProxySQL

ProxySQL is a powerful middleware layer for MySQL architectures. It supports traditional master-slave replication, semi-synchronous replication, MGR, PXC, and other MySQL topologies, and it provides practical features such as automatic failure detection and failover, connection pooling, read/write splitting, logging, monitoring, and clustered deployment. Its biggest drawback is the performance overhead, which I estimate at 20% to 30% or more, so it may not suit scenarios that demand very high performance. You can, however, reduce the load on each node by sharding databases and tables, and make full use of ProxySQL's clustering.

3. Build HTAP system

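This article does not cover installing ClickHouse and ProxySQL; I will go straight to building the HTAP system. In the overall architecture, applications connect to ProxySQL, which forwards each statement to either the MySQL primary or ClickHouse, while ClickHouse replicates data from MySQL as a replica.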

3.1 Configure ClickHouse as a MySQL replica

After logging in to ClickHouse, execute the following command to enable the new feature:

clickhouse :) SET allow_experimental_database_materialize_mysql = 1;

Then create a replication channel in ClickHouse, which sets it up as a MySQL replica. For example:

clickhouse :) CREATE DATABASE test ENGINE = MaterializeMySQL('172.24.10.10:3306', 'test', 'repl', 'repl');
clickhouse :) use test;
clickhouse :) show tables;
┌─name─────┐
│ sbtest1  │
│ sbtest10 │
│ sbtest11 │
...
32 rows in set. Elapsed: 0.006 sec.

When the replication channel is first created, ClickHouse quickly reads all existing data from the MySQL primary and applies it. You can check the replication progress in the database's .metadata file:

# cat metadata/sbtest/.metadata

Version: 2
Binlog File: binlog.001496
Executed GTID: 097ee9f2-2ded-11eb-9211-e4434ba52b50:1-952676723
Binlog Position: 789663343
Data Version: 2
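
If you want to track this progress from a script, the file is simple enough to parse. The following is a minimal Python sketch, assuming the file keeps the "Key: value" layout shown above. The function name parse_materialize_metadata is hypothetical and is not part of ClickHouse.

```python
def parse_materialize_metadata(text):
    """Parse 'Key: value' lines from a MaterializeMySQL .metadata file into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or unexpected lines
        # Split on the first colon only: the GTID value itself contains colons.
        key, value = line.split(":", 1)
        info[key.strip()] = value.strip()
    return info


# Sample content copied from the output shown above.
sample = """Version: 2
Binlog File: binlog.001496
Executed GTID: 097ee9f2-2ded-11eb-9211-e4434ba52b50:1-952676723
Binlog Position: 789663343
Data Version: 2
"""

progress = parse_materialize_metadata(sample)
print(progress["Binlog File"], progress["Binlog Position"])
```

Comparing the binlog file and position from this dict with the primary's current binlog coordinates gives a rough idea of how far the replica has caught up.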

Reminder: I set up a dedicated account for replication here. Compared with an ordinary replication account, the account used by the ClickHouse replica additionally needs at least read-only (SELECT) privileges on the replicated database, for example:

mysql> show grants for repl;
+--------------------------------------------------------------------------+
| Grants for repl@%                                                        |
+--------------------------------------------------------------------------+
| GRANT RELOAD, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO `repl`@`%` |
| GRANT SELECT ON `test`.* TO `repl`@`%`                                   |
+--------------------------------------------------------------------------+

ClickHouse's MaterializeMySQL engine copies data from MySQL very quickly, in my experience even faster than a native MySQL replica. You can try it yourself.

Next, create an application account and a monitoring account in ClickHouse (the latter is used by ProxySQL to monitor the back-end services). Edit ClickHouse's users.xml configuration file and add two users:

        <app_user>
            <password>app_user</password>
            <networks incl="networks" replace="replace">
                <ip>::/0</ip>
            </networks>
            <profile>default</profile>
            <quota>default</quota>
        </app_user>
        <monitor>
            <password>monitor</password>
            <networks incl="networks" replace="replace">
                <ip>::/0</ip>
            </networks>
            <profile>default</profile>
            <quota>default</quota>
        </monitor>

I only set simple passwords and did not change the permissions or quotas, because this is just a demonstration. In a production environment, adjust these settings to meet your security and compliance requirements.
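
One possible adjustment is to avoid plaintext passwords in users.xml. ClickHouse accepts a password_double_sha1_hex element in place of password, intended for compatibility with MySQL clients. The sketch below, in Python, computes such a value; the helper name double_sha1_hex is hypothetical, and whether your ProxySQL and client versions authenticate correctly against a hashed password is something to verify in your own environment.

```python
import hashlib


def double_sha1_hex(password):
    """Return SHA1(SHA1(password)) as a lowercase hex string."""
    first = hashlib.sha1(password.encode("utf-8")).digest()  # raw bytes, not hex
    return hashlib.sha1(first).hexdigest()


# "app_user" is the demo password used in this article.
digest = double_sha1_hex("app_user")
print("<password_double_sha1_hex>%s</password_double_sha1_hex>" % digest)
```

The printed element would replace the corresponding <password> line for that user.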

3.2 Configure ProxySQL

Configure the mysql_servers table: add two records and make the configuration take effect:

proxysql> insert into mysql_servers(hostgroup_id, hostname, port) values('0', '172.24.10.10', '3306');
proxysql> insert into mysql_servers(hostgroup_id, hostname, port) values('1', '172.24.10.11', '9004');
proxysql> save mysql servers to disk; load mysql servers to runtime;

Here, 172.24.10.10:3306 is the MySQL primary and 172.24.10.11:9004 is the ClickHouse replica. Port 9004 is the port on which ClickHouse serves the MySQL protocol, so you can connect to it with a MySQL client and run queries.

Hostgroup 0 is the read-write group and hostgroup 1 is the read-only group.

Configure the mysql_users table and add the application account and the monitoring account:

proxysql> select username,password,active from mysql_users;
+-----------+----------+--------+
| username  | password | active |
+-----------+----------+--------+
| app_user  | app_user | 1      |
| monitor   | monitor  | 1      |
+-----------+----------+--------+

proxysql> save mysql users to disk; load mysql users to runtime;

Configure the mysql_query_rules table. This is the key step: the rules determine which SQL statements are forwarded to the MySQL primary and which are forwarded to ClickHouse:

proxysql> select rule_id, active, match_pattern,destination_hostgroup from mysql_query_rules;
+---------+--------+-------------------------+-----------------------+
| rule_id | active | match_pattern           | destination_hostgroup |
+---------+--------+-------------------------+-----------------------+
| 1       | 1      | ^SELECT.*\+CLICKHOUSE.* | 1                     |
+---------+--------+-------------------------+-----------------------+

proxysql> save mysql query rules to disk; load mysql query rules to runtime;

The rule above means that a SELECT statement containing the "+CLICKHOUSE" marker is automatically forwarded to the ClickHouse backend, and all other statements are sent to the MySQL backend. For example, the following two statements are forwarded to MySQL and ClickHouse respectively:

#SQL #1
mysql> SELECT * FROM sbtest1 WHERE id=1;

#SQL #2
mysql> SELECT /*+CLICKHOUSE*/ * FROM sbtest1 WHERE id=1;

The second statement cleverly uses MySQL's comment syntax to carry the routing hint.
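
To see how the pattern behaves before loading it into ProxySQL, you can try it against a few statements offline. The following Python sketch only approximates the rule: ProxySQL uses its own regex engine, case-insensitive matching is assumed here, and the function name pick_hostgroup is hypothetical.

```python
import re

# The match_pattern from the query rule above. Python's re module stands in
# for ProxySQL's regex engine, so treat the result as an approximation.
CLICKHOUSE_RULE = re.compile(r"^SELECT.*\+CLICKHOUSE.*", re.IGNORECASE)

MYSQL_HOSTGROUP = 0       # read-write group (MySQL primary)
CLICKHOUSE_HOSTGROUP = 1  # read-only group (ClickHouse replica)


def pick_hostgroup(sql):
    """Return the hostgroup this sketch would route the statement to."""
    if CLICKHOUSE_RULE.search(sql):
        return CLICKHOUSE_HOSTGROUP
    # Assumption: statements that match no rule go to hostgroup 0.
    return MYSQL_HOSTGROUP


for sql in (
    "SELECT * FROM sbtest1 WHERE id=1",
    "SELECT /*+CLICKHOUSE*/ * FROM sbtest1 WHERE id=1",
    "UPDATE sbtest1 SET k=k+1 WHERE id=1",
):
    print(pick_hostgroup(sql), sql)
```

Because the pattern is anchored at SELECT, a write statement that happens to carry the same comment is not matched and stays on the MySQL side.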

Confirm the routing by querying the stats_mysql_query_digest table:

proxysql> select hostgroup, schemaname, username, digest, digest_text from stats_mysql_query_digest;
+-----------+------------+----------+--------------------+----------------------------------+
| hostgroup | schemaname | username | digest             | digest_text                      |
+-----------+------------+----------+--------------------+----------------------------------+
| 0         | sbtest     | app_user | 0x5662D7CF0442E794 | select * from sbtest1 where id=? |
| 1         | sbtest     | app_user | 0x5662D7CF0442E794 | select * from sbtest1 where id=? |
+-----------+------------+----------+--------------------+----------------------------------+

As you can see, the two statements produce the same digest, because the digest text does not include the hint comment, yet they were forwarded to different hostgroups.

Finally, configure ProxySQL's monitoring service (optional):

proxysql> set mysql-monitor_enabled="true"; 
proxysql> set mysql-monitor_username="monitor";
proxysql> set mysql-monitor_password="monitor";

proxysql> save mysql variables to disk; load mysql variables to runtime;

At this point, a simple HTAP system built entirely from open source applications is up and running.

4. Performance comparison

Here I use a benchmark described in the ClickHouse documentation: the Star Schema Benchmark.

After compiling ssb-dbgen, use it to generate the test data (with the parameter -s 50):

./dbgen -s 50 -T c &
./dbgen -s 50 -T l &
./dbgen -s 50 -T p &
./dbgen -s 50 -T s &
./dbgen -s 50 -T d &

Create the test tables, modifying their DDL to fit MySQL syntax. Then import the test data, and finally generate the lineorder_flat table as described in the documentation.

mysql> show table status;
+----------------+--------+---------+------------+-----------+----------------+--------------+
| Name           | Engine | Version | Row_format | Rows      | Avg_row_length | Data_length  |
+----------------+--------+---------+------------+-----------+----------------+--------------+
| customer       | InnoDB |      10 | Dynamic    |   1378209 |            120 |    166363136 |
| lineorder      | InnoDB |      10 | Dynamic    | 297927870 |            100 |  29871833088 |
| lineorder_flat | InnoDB |      10 | Dynamic    | 292584926 |            430 | 125952851968 |
| part           | InnoDB |      10 | Dynamic    |   1192880 |            111 |    132792320 |
| supplier       | InnoDB |      10 | Dynamic    |     99730 |            110 |     11026432 |
+----------------+--------+---------+------------+-----------+----------------+--------------+

After all the data is loaded, create a MaterializeMySQL replication channel in ClickHouse:

clickhouse :) CREATE DATABASE ssb ENGINE = MaterializeMySQL('172.24.10.10:3380', 'ssb', 'repl', 'repl');

The data set is fairly large, so wait patiently for the copy to finish.

Then connect to ProxySQL and first run a simple count(*) on the large table to compare the elapsed time:

# Run count(*) directly; it is forwarded to the backend MySQL instance
mysql> select count(*) from lineorder_flat;
+-----------+
| count(*)  |
+-----------+
| 300005811 |
+-----------+
1 row in set (3 min 2.14 sec)

# With the hint added, it is forwarded to the backend ClickHouse instance
mysql> select /*+CLICKHOUSE*/ count(*) from lineorder_flat;
+-----------+
| count(*)  |
+-----------+
| 300005811 |
+-----------+
1 row in set (5.67 sec)

Even a plain count(*) is roughly 32 times slower on MySQL (182.14 seconds versus 5.67 seconds).

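Then run the first four benchmark queries. The recorded times (in seconds) are as follows:

+-------+------------+----------------------+---------------------+
| Query | MySQL      | ClickHouse (replica) | ClickHouse (native) |
+-------+------------+----------------------+---------------------+
| Q1.1  | 308.388684 | 0.149                | 0.107               |
| Q1.2  | 320.373203 | 0.280                | 0.027               |
| Q1.3  | 279.673361 | 0.346                | 0.030               |
| Q2.1  | 286.451062 | 1.246                | 0.489               |
+-------+------------+----------------------+---------------------+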

Clearly, querying MySQL directly is far too slow. Although a MaterializeMySQL replica is somewhat slower than a native ClickHouse MergeTree table, the gap is not that big, and it is still very fast.
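
To put numbers on this comparison, the ratios can be computed from the table above. This is a small Python sketch; the timings are copied from the table and are assumed to be in seconds, and the three values per query are the MySQL, ClickHouse replica (MaterializeMySQL), and native ClickHouse columns in that order.

```python
# Elapsed times copied from the table above (assumed to be seconds):
# (MySQL, ClickHouse replica, native ClickHouse)
timings = {
    "Q1.1": (308.388684, 0.149, 0.107),
    "Q1.2": (320.373203, 0.280, 0.027),
    "Q1.3": (279.673361, 0.346, 0.030),
    "Q2.1": (286.451062, 1.246, 0.489),
}

ratios = {}
for query, (mysql_s, replica_s, native_s) in timings.items():
    # First ratio: how much slower MySQL is than the replica.
    # Second ratio: how much slower the replica is than a native table.
    ratios[query] = (mysql_s / replica_s, replica_s / native_s)
    print("%s: MySQL/replica = %.0fx, replica/native = %.1fx" % (query, *ratios[query]))
```

The first ratio is in the hundreds or thousands for every query, while the replica itself still answers each one in about a second or less.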

5. Other notes

  • ClickHouse MaterializeMySQL does not support the CREATE TABLE ... LIKE syntax. For example, if you execute create table db2.a like db1.a, where db1 is replicated to ClickHouse and db2 stays on the MySQL side only, replication on the ClickHouse side still reports an error and needs to be restarted.

  • ClickHouse MaterializeMySQL also does not support functional indexes.

  • Occasionally, after ProxySQL's monitoring module connects to ClickHouse, it sends a SET wait_timeout=N command, which makes ClickHouse report an error. This does not affect normal use; restarting ProxySQL or toggling the monitoring switch off and on clears it.

Enjoy it :)

Origin blog.csdn.net/n88Lpo/article/details/112057063