[Data Mining, Data Analysis] The practice of ClickHouse in the Go language

Table of Contents of Series Articles

[Data Mining] Clickhouse’s practice in Go language
[Data Mining] User portrait platform Construction and Business Practices



Preface

Today I would like to introduce ClickHouse, an OLAP engine for big data processing. Its reputation in the industry comes down to one word: "fast". Fast here does not mean fast driving; it means that for queries at big-data scale, ClickHouse delivers a large performance improvement over systems such as Spark, MySQL, Hive, and Hadoop.
Starting from the origin of ClickHouse, OLAP versus OLTP, development practice in Go, and an analysis of ClickHouse's table storage engines, this article explains why ClickHouse is suitable for big data analysis and data mining, which table engine to use in which situations, and what ClickHouse's shortcomings are.


1. The origin of clickhouse

ClickHouse originated in the Metrica product team at Yandex. Metrica is a web traffic analysis tool that collects user behavior data and performs OLAP analysis on it: each data collection event is generated by a click on a page and then flows into the data warehouse for OLAP analysis. The full name of ClickHouse comes from "Click Stream, Data Warehouse", shortened to ClickHouse. On September 20, 2021, the ClickHouse team became independent from Yandex and founded its own company.

2. OLAP/OLTP

OLAP and OLTP are two different approaches to data processing and transaction handling.

OLTP, or Online Transaction Processing, focuses on taking data received at the front end, passing it immediately to the computing center for processing, and returning the result within a very short time. It is one way of responding quickly to user operations, and its basic characteristic is processing small amounts of transactional data.

OLAP, or Online Analytical Processing, enables analysts to observe information from many angles quickly, consistently, and interactively in order to gain a deep understanding of the data. It helps analysts obtain data quickly and perform analysis and prediction.

2.1. Mainstream OLAP/OLTP databases

The following are some mainstream OLAP and OLTP databases:

OLTP databases:

  • MySQL — the representative OLTP database; good at transaction processing, supports frequent inserts and modifications, and is aimed at business developers.
  • Oracle — another powerful OLTP database, widely used in enterprise-level applications.
  • TiDB — an open-source distributed relational database designed and developed by PingCAP. It is a hybrid transactional and analytical processing (HTAP) database that supports both online transaction processing and online analytical processing, and it offers horizontal scale-out and scale-in, financial-grade high availability, real-time HTAP, a cloud-native distributed architecture, and compatibility with the MySQL 5.7 protocol and the MySQL ecosystem.

OLAP databases:

  • Greenplum — a distributed database that excels at multi-dimensional, complex analysis over large amounts of data, pursues extreme performance, and is aimed at analysts and decision-makers.
  • Hive — Hadoop's data warehouse tool; it processes large-scale structured data, provides SQL-like query capabilities, and suits data warehouses and BI platforms.
  • ClickHouse — an open-source columnar storage database suited to scenarios such as data warehouses and data lakes, supporting complex analytical queries.
  • AWS Redshift — a cloud data warehouse service provided by AWS. A high-performance, highly reliable data warehouse can be provisioned with a few clicks; Dense Storage (DS) nodes allow very large data warehouses at low cost. It can analyze structured and semi-structured data across data warehouses, operational databases, and data lakes, using AWS-designed hardware and machine learning to deliver strong price/performance at scale.
  • Doris — a real-time data warehouse for structured data that handles volumes up to the PB level (beyond that, Doris is not recommended; consider tools such as Hive instead), with query latencies generally at the second or millisecond level. Doris was developed by Baidu's big data department and was previously called Baidu Palo; it was renamed Doris after being contributed to the Apache community in 2018.

3. Go language development practice

It is recommended to use the GoFrame framework for Go. It already supports a variety of databases, including MySQL, MariaDB, TiDB, PostgreSQL, SQL Server, Oracle, ClickHouse, DM, and SQLite. The source code is open, PRs can be submitted, and support for other databases can be added as extension components.
I have personally verified that it works with MySQL and ClickHouse without trouble.

3.1. Install and configure the Go environment and the IDE

3.1.1. Go development environment installation

1. Download the Go development package
Visit the Go download page on the China mirror site, https://golang.google.cn/dl/, and pick the package that matches your operating system to download the latest Go development kit.
2. Installation guide
Visit the official installation page, https://golang.google.cn/doc/install, and follow the installation steps for your operating system.

Windows (msi) and macOS (pkg) users are advised to install via the installer package; the macOS (pkg) installer the author used is a standard click-through wizard.
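After the installer finishes, running `go version` in a terminal should print the installed Go version, which confirms the environment is ready.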

3.1.2. IDE development environment installation

Currently there are two popular Go IDE setups: VS Code with plugins (free) and JetBrains' GoLand (paid). Since JetBrains is also a sponsor of the GoFrame framework, GoLand is the preferred development IDE here. For download and registration, refer to tutorials online (Baidu or Google).

The official website of JetBrains is: https://www.jetbrains.com

**Note: readers familiar with the Java IDE IntelliJ IDEA will find GoLand easy to pick up; the two feel very similar.**

1. Use of Goland
Let's create our first Go program. Following tradition, we start with hello world.
2. Create a project
What needs attention here is the path to the Go installation (SDK); the official installation documentation explains this in detail, so read it carefully.

For Location, any local path will do.
3. Create a program
Create a new go file called hello.go, and enter the following code:
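For example, a minimal hello.go:

```go
package main

import "fmt"

func main() {
	fmt.Println("hello world")
}
```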
4. Execute and run
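In GoLand this can be done with the run icon in the gutter next to func main, or from a terminal with `go run hello.go`; either way the program prints hello world.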

3.2. Goframe tool installation

Help documentation: https://goframe.org/pages/viewpage.action?pageId=1115782
It is recommended to install the latest version
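With a working Go toolchain, the GoFrame v2 CLI can typically be installed with `go install github.com/gogf/gf/cmd/gf/v2@latest` and verified with `gf -v`; the help documentation above covers the other installation options.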

3.3. Introducing the ClickHouse component

ClickHouse help documentation:
https://goframe.org/pages/viewpage.action?pageId=1114245#ORM%E4%BD%BF%E7%94%A8%E9%85%8D%E7%BD%AE-%E9%85%8D%E7%BD%AE%E6%96%B9%E6%B3%95
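As a sketch of what introducing the component typically involves with GoFrame v2 (the module paths below are the v2 contrib drivers; verify versions against the documentation above), the database drivers are registered through blank imports so the ORM can use them:

```go
package main

import (
	// Blank imports register the database drivers with GoFrame's ORM at init time.
	_ "github.com/gogf/gf/contrib/drivers/clickhouse/v2"
	_ "github.com/gogf/gf/contrib/drivers/mysql/v2"
)

func main() {
	// Database access itself goes through g.DB(), configured via config.yaml (next).
}
```

The modules are added to go.mod with go get, for example `go get github.com/gogf/gf/contrib/drivers/clickhouse/v2`.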

Database configuration file: config.yaml
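A minimal sketch of the idea, assuming a default MySQL group and a second group named "clickhouse" (the connection strings, group names, and the `events` table are illustrative; the exact link format for each driver is documented on the ORM configuration page linked above), together with how a group is selected in code:

```go
// config.yaml (values are illustrative; see the ORM configuration docs for the
// exact link format of each driver):
//
//   database:
//     default:
//       link: "mysql:root:password@tcp(127.0.0.1:3306)/test"
//     clickhouse:
//       link: "clickhouse:default:@tcp(127.0.0.1:9000)/default"
//
// In code, the configuration group name selects the connection:
package main

import (
	"fmt"

	_ "github.com/gogf/gf/contrib/drivers/clickhouse/v2"
	_ "github.com/gogf/gf/contrib/drivers/mysql/v2"

	"github.com/gogf/gf/v2/frame/g"
	"github.com/gogf/gf/v2/os/gctx"
)

func main() {
	ctx := gctx.New()

	// g.DB() with no argument uses the "default" group (MySQL here);
	// g.DB("clickhouse") uses the "clickhouse" group.
	records, err := g.DB("clickhouse").Model("events").Ctx(ctx).Limit(10).All()
	if err != nil {
		panic(err)
	}
	fmt.Println(records)
}
```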

3.4. A complete project using GoFrame with ClickHouse

A complete Web3 blockchain project built with GoFrame + ClickHouse + MySQL + Redis, written entirely by the author.
Address: https://github.com/hd5723/chainApi


4. ClickHouse table engine analysis

Official documentation (supports Chinese, English, and Russian): https://clickhouse.com/docs/zh/engines/table-engines
Below is a brief introduction to two commonly used table engines: MergeTree and ReplacingMergeTree.

4.1. MergeTree

The most powerful table engines in ClickHouse are the MergeTree engine and the other engines in the same family (*MergeTree).

The MergeTree family of engines is designed for inserting extremely large volumes of data into a table. Data is written quickly as a series of data parts, and these parts are merged in the background according to certain rules. This strategy is far more efficient than continually rewriting the stored data on every insert.

Main features (a minimal table-creation sketch follows this list):

  • Stored data is sorted by primary key.
    This makes it possible to build a small sparse index that speeds up data retrieval.

  • Partitioning can be used if a partition key is specified.

  • Some operations on partitioned data are faster than the same operation over the full data set with the same result. When the partition key is specified in a query, ClickHouse automatically prunes to the relevant partitions, which effectively improves query performance.
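As a minimal sketch building on the setup in section 3.3 (the table name, columns, and monthly partition key here are illustrative, not from the original project), a MergeTree table can be created through GoFrame's raw SQL interface:

```go
package main

import (
	_ "github.com/gogf/gf/contrib/drivers/clickhouse/v2"

	"github.com/gogf/gf/v2/frame/g"
	"github.com/gogf/gf/v2/os/gctx"
)

func main() {
	ctx := gctx.New()
	// ORDER BY defines the sort (primary) key; PARTITION BY enables partition pruning.
	_, err := g.DB("clickhouse").Exec(ctx, `
		CREATE TABLE IF NOT EXISTS page_events (
			event_date Date,
			event_time DateTime,
			user_id    UInt64,
			url        String
		) ENGINE = MergeTree()
		PARTITION BY toYYYYMM(event_date)
		ORDER BY (user_id, event_time)
	`)
	if err != nil {
		panic(err)
	}
}
```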

4.2. ReplacingMergeTree

This engine differs from MergeTree in that it removes duplicates with the same sort key value.

Data deduplication only occurs during data merging. The merge happens in the background at an unspecified time, so you can't plan ahead. Some data may still be unprocessed. Although you can initiate an unplanned merge by calling the OPTIMIZE statement, do not rely on it because the OPTIMIZE statement can cause heavy reads and writes of data.

Therefore, ReplacingMergeTree is suitable for clearing duplicate data in the background to save space, but it does not guarantee that no duplicate data will appear.
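Continuing the sketch from section 4.1 (same imports plus "context", and a configured "clickhouse" group; the schema is again illustrative), a version column tells ReplacingMergeTree which row to keep when several rows share the same ORDER BY key:

```go
// Rows with the same user_id are deduplicated during background merges,
// keeping the row with the largest version value.
func createUserProfileTable(ctx context.Context) error {
	_, err := g.DB("clickhouse").Exec(ctx, `
		CREATE TABLE IF NOT EXISTS user_profile (
			user_id  UInt64,
			nickname String,
			version  UInt64
		) ENGINE = ReplacingMergeTree(version)
		ORDER BY user_id
	`)
	return err
}
```

Running OPTIMIZE TABLE user_profile FINAL would force a merge (and therefore deduplication), but as noted above it is expensive and should not be relied on routinely.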


5. Why ClickHouse is suitable for big data analysis and data mining, which table engine to use in which situations, and ClickHouse's shortcomings

5.1. Why is ClickHouse suitable for big data analysis and data mining?

Performance comparison between ClickHouse and MySQL

ClickHouse and MySQL are two different database management systems, and they have some differences in performance. The following is a comparison of ClickHouse and MySQL performance:

  • Data storage structure: ClickHouse is a columnar storage database suitable for processing large amounts of data, especially in analytical queries. MySQL is a row storage database that stores data in row units and is suitable for transaction processing and general queries.
  • Processing power: ClickHouse performs well when processing large-scale data sets, providing fast aggregation and analysis capabilities. MySQL performs better in small-scale data and transaction processing.
  • Query language: ClickHouse uses its own query language, ClickHouse SQL (similar to standard SQL), to support complex analytical queries and aggregation operations. MySQL uses standard SQL.
  • Data consistency: MySQL is a relational database that supports transactions and ACID features to ensure data consistency. ClickHouse is mainly used for analytical queries and has low requirements for data consistency.
  • Performance and scalability: ClickHouse has advantages in processing large-scale data and high-concurrency queries, can be scaled horizontally, and provides the ability to run in a distributed environment. MySQL performs better in small-scale applications and transaction processing.
In general, MySQL is suitable for transaction processing and general queries, while ClickHouse is suitable for large-scale data and analytical queries, especially statistical analysis over large, wide tables (hundreds of fields, massive data volumes).

5.2. ClickHouse query defects

5.2.1. Query processing defects on a single machine

When executing a query, ClickHouse mobilizes as many server resources as possible, and CPU usage commonly rises above 80%, so it is not suitable for highly concurrent queries.

5.2.2. High cluster cost

If you want to deal with high concurrency problems, you need to add servers, which is very costly.

5.2.3. Poor performance on multi-table join queries

ClickHouse is best suited to queries on a single large, wide table (hundreds of fields); even when a multi-table join involves only a few fields, performance is poor.

5.2.4. Very poor support for modification and deletion

ClickHouse supports real-time statistical queries, but its modification and deletion performance is very poor. The ReplacingMergeTree table engine is generally used to work around modification and deletion, as sketched below.
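As a hedged illustration of that workaround, reusing the illustrative user_profile table from section 4.2 (and the same imports and setup): an "update" is written as a new row with the same sort key and a larger version, and background merges, or a FINAL read, eventually collapse the duplicates.

```go
// "Update" user 42's nickname by inserting a newer version of the row;
// older versions with the same user_id are dropped during background merges.
func updateNickname(ctx context.Context) error {
	_, err := g.DB("clickhouse").Model("user_profile").Ctx(ctx).Insert(g.Map{
		"user_id":  42,
		"nickname": "new_name",
		"version":  1700000000, // e.g. a unix timestamp used as the version
	})
	// Reads that must not see stale duplicates can pay the query-time cost of FINAL:
	//   SELECT * FROM user_profile FINAL WHERE user_id = 42
	return err
}
```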

Modifications to ClickHouse may cause the following issues:

  • Inconsistent data types: If the data type of a field is changed when modifying the table structure, and the data type subsequently inserted does not match the modified table structure, data insertion failure or data inconsistency may result. For example, if the field type is changed from int to float in ClickHouse, but the subsequently inserted data is still of type int, then the data will be truncated, resulting in inconsistent data.
  • Frequent deletions: if a large amount of data is deleted in a short period, ClickHouse may report a "Cannot allocate memory" error, because it first builds a deletion tree in memory when processing deletions; if deletions are too frequent, memory can be exhausted.
  • Duplicate data processing: ClickHouse may have problems processing duplicate data. If the data inserted on the same shard already exists, ClickHouse will deduplicate it. However, ClickHouse does not guarantee deduplication if duplicate data is inserted between different shards. This may cause data duplication issues.

Therefore, when making modifications to ClickHouse, you must be careful and perform corresponding tests to avoid possible problems. If you encounter problems such as being unable to save data after modification, you can check whether it is caused by the above situation and solve it according to the corresponding method. If the data source data has been adjusted multiple times and the data fields have been adjusted, make sure that the adjusted data is compatible with the old data format.


6. Architecture design

In the actual project, we used Clickhouse + MySQL + Redis for data operations, with a clear division of labor.

6.1. ClickHouse

Function: real-time statistical queries
Data source: scheduled tasks summarize dozens or hundreds of fields from multiple MySQL tables and write them into ClickHouse through a Go program (a scheduling sketch follows)
Strategy: regular incremental updates; modifications/deletions by primary key are rare
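A sketch of this division of labor, with an assumed hourly schedule, made-up table names and aggregation SQL, and gcron (GoFrame v2's cron scheduler) standing in for whatever scheduling the real project uses:

```go
package main

import (
	"context"

	_ "github.com/gogf/gf/contrib/drivers/clickhouse/v2"
	_ "github.com/gogf/gf/contrib/drivers/mysql/v2"

	"github.com/gogf/gf/v2/frame/g"
	"github.com/gogf/gf/v2/os/gcron"
	"github.com/gogf/gf/v2/os/gctx"
)

func main() {
	ctx := gctx.New()
	// gcron patterns have six fields (seconds first): run at the top of every hour.
	_, err := gcron.Add(ctx, "0 0 * * * *", func(ctx context.Context) {
		// Aggregate in MySQL (the "default" configuration group)...
		rows, err := g.DB().GetAll(ctx,
			"SELECT user_id, COUNT(*) AS tx_count FROM transactions GROUP BY user_id")
		if err != nil {
			g.Log().Error(ctx, err)
			return
		}
		if len(rows) == 0 {
			return
		}
		// ...then append the snapshot to ClickHouse as an incremental batch insert.
		if _, err = g.DB("clickhouse").Model("user_tx_stats").Ctx(ctx).
			Insert(rows.List()); err != nil {
			g.Log().Error(ctx, err)
		}
	})
	if err != nil {
		panic(err)
	}
	select {} // keep the process alive so the scheduler can fire
}
```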

6.2. MySQL

Function: scheduled tasks and multi-table queries
Data source: data is obtained through the ETH RPC interface and written in
Strategy: frequent writes, periodic queries (minute-level, hour-level, etc.), and a small amount of modification/deletion

6.3. Redis

Used to maintain version numbers, cache the first page of business data, etc.


Summary

OLAP and OLTP databases have been developing for many years and there is plenty of debate, but in general the best choice is whatever suits your business. The nature of the business, the growth of the data volume, and the company's existing technology stack are all factors to consider.

Source: blog.csdn.net/s445320/article/details/133938428