ClickHouse Knowledge Summary

ClickHouse installation on CentOS and RedHat

ClickHouse can run on any Linux, FreeBSD or Mac OS X with x86_64, AArch64 or PowerPC64LE CPU architecture.

Pre-built binaries are usually compiled for x86_64 and make use of the SSE 4.2 instruction set, so unless otherwise noted, a CPU that supports SSE 4.2 becomes an additional system requirement. This is the command to check whether the current CPU supports SSE 4.2:

grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"

To run ClickHouse on processors that do not support SSE 4.2, or on the AArch64 or PowerPC64LE architectures, you should build ClickHouse from source with the appropriate configuration adjustments.

 

The Yandex ClickHouse team recommends using the official pre-compiled rpm packages for CentOS, RedHat and all other rpm-based Linux distributions.

First, you need to add the official repository:

sudo yum install yum-utils
sudo rpm --import https://repo.clickhouse.tech/CLICKHOUSE-KEY.GPG
sudo yum-config-manager --add-repo https://repo.clickhouse.tech/rpm/stable/x86_64

If you want to use the latest version, replace stable with testing (this is recommended only for test environments).
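
For example, assuming the testing packages follow the same repository layout as the stable ones (an assumption; the exact URL may differ between releases), the repository could be added like this:

sudo yum-config-manager --add-repo https://repo.clickhouse.tech/rpm/testing/x86_64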

Then run these commands to install the packages (here, the server and the client):

sudo yum install clickhouse-server clickhouse-client

You can run the following command to start the service in the background:

sudo service clickhouse-server start

You can view the logs in the /var/log/clickhouse-server/ directory.
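
For example (the log file names below are the usual defaults and are an assumption; they may vary slightly between versions):

sudo tail -f /var/log/clickhouse-server/clickhouse-server.log        # follow the main server log
sudo tail -n 100 /var/log/clickhouse-server/clickhouse-server.err.log  # show recent errors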

If the service does not start, check the configuration file  /etc/clickhouse-server/config.xml.

You can also start the service directly in the console:

clickhouse-server --config-file=/etc/clickhouse-server/config.xml

In this case, the log will be printed to the console, which is very convenient during development.
If the configuration file is in the current directory, you do not need to specify the --config-file parameter; ./config.xml is used by default.

You can connect to the service using the command line client:

clickhouse-client

By default, it connects to the service at localhost:9000 as the 'default' user without a password.
The client can also be used to connect to remote services, for example:

clickhouse-client --host=example.com --port 9000 --password ******

ClickHouse Database Engines

The statement for creating a database in ClickHouse is as follows:

CREATE DATABASE IF NOT EXISTS db_name [ENGINE = engine]

1. Lazy

Keeps tables in RAM only for expiration_time_in_seconds after their most recent access. It can only be used with *Log engine tables. Because tables of this type are accessed at long intervals, this engine optimizes the storage of a large number of small *Log tables.

CREATE DATABASE testlazy ENGINE = Lazy(expiration_time_in_seconds);
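
A hedged usage sketch (the 3600-second expiration, table name and sample data below are illustrative, not from the original text):

CREATE DATABASE IF NOT EXISTS testlazy ENGINE = Lazy(3600);  -- keep tables in RAM for 1 hour after last access
CREATE TABLE testlazy.events (id UInt32, msg String) ENGINE = TinyLog;  -- only *Log engines are allowed
INSERT INTO testlazy.events VALUES (1, 'hello');
SELECT * FROM testlazy.events;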

2. MySQL

The MySQL engine maps tables from a remote MySQL server into ClickHouse and lets you run INSERT and SELECT queries against them, which makes it easy to exchange data between ClickHouse and MySQL. Data from the remote MySQL server is pulled automatically, and tables using the MySQL table engine are created inside this database. The MySQL database engine translates its queries into MySQL syntax and sends them to the MySQL server, so you can run operations such as SHOW TABLES or SHOW CREATE TABLE.

However, the following operations cannot be performed on it:

  • RENAME
  • CREATE TABLE
  • ALTER

CREATE DATABASE [IF NOT EXISTS] db_name [ON CLUSTER cluster]
ENGINE = MySQL('host:port', ['database' | database], 'user', 'password')

-- Parameter description
-- host:port — address of the MySQL server to connect to.
-- database — the MySQL database to connect to.
-- user — the MySQL user.
-- password — the password of the MySQL user.
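
A hedged example; the host, database, user, password and table names are made up for illustration:

CREATE DATABASE IF NOT EXISTS mysql_db
ENGINE = MySQL('192.168.1.10:3306', 'shop', 'ck_reader', 'secret');

SHOW TABLES FROM mysql_db;                 -- lists the tables of the remote MySQL database
SELECT * FROM mysql_db.orders LIMIT 10;    -- data is pulled from MySQL on the fly
INSERT INTO mysql_db.orders VALUES (1001, 99.9);  -- the insert is sent back to MySQL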

3. Ordinary

Ordinary is the default engine; you do not need to declare it when creating a database. Tables in such a database can use any type of table engine.
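
A minimal sketch (the names are illustrative): omitting the ENGINE clause creates an Ordinary database, and tables inside it may use any table engine:

CREATE DATABASE IF NOT EXISTS db_ordinary;                        -- Ordinary is used by default
CREATE TABLE db_ordinary.t_log (id UInt32) ENGINE = TinyLog;
CREATE TABLE db_ordinary.t_mt (id UInt32) ENGINE = MergeTree() ORDER BY id;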

4. Dictionary

With the Dictionary engine, the database automatically creates a data table for every configured data dictionary (the table structure and data are loaded from the dictionary definitions in the configuration files).
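
A minimal sketch, assuming a dictionary (here called my_dict, a made-up name) is already defined in the server's dictionary configuration:

CREATE DATABASE IF NOT EXISTS db_dict ENGINE = Dictionary;
SHOW TABLES FROM db_dict;        -- one table per configured dictionary
SELECT * FROM db_dict.my_dict;   -- query the dictionary like an ordinary table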

5. Memory

The Memory engine stores temporary data in RAM only; no disk operations are involved, and the data is cleared when the server restarts.
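
A minimal sketch (the database name is illustrative):

CREATE DATABASE IF NOT EXISTS db_mem ENGINE = Memory;  -- data lives only in RAM and is lost on restart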

ClickHouse Table Engines

MergeTree Family

MergeTree

The most powerful table engines in ClickHouse are undoubtedly MergeTree and the other engines of the *MergeTree family. MergeTree engines are designed for inserting extremely large amounts of data into a table. Data is written quickly, part by part, and the parts are merged in the background according to certain rules. Compared with constantly modifying (rewriting) the stored data on insertion, this strategy is much more efficient.

Main features:

  • The stored data is sorted by the primary key. This allows you to create a small sparse index to speed up data retrieval.

  • Supports data partitioning if a partition key is specified. For the same data set and the same result set, some partitioned operations in ClickHouse are faster than ordinary operations. When the partition key is specified in a query, ClickHouse automatically prunes the partition data, which also effectively improves query performance.

  • Supports data replication. The ReplicatedMergeTree family of tables provides data replication. For more information, see the section on data replication.

  • Supports data sampling. If necessary, you can set a sampling expression for the table.

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1] [TTL expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2] [TTL expr2],
    ...
    INDEX index_name1 expr1 TYPE type1(...) GRANULARITY value1,
    INDEX index_name2 expr2 TYPE type2(...) GRANULARITY value2
) ENGINE = MergeTree()
ORDER BY expr
[PARTITION BY expr]
[PRIMARY KEY expr]
[SAMPLE BY expr]
[TTL expr [DELETE|TO DISK 'xxx'|TO VOLUME 'xxx'], ...]
[SETTINGS name=value, ...]

- ENGINE — engine name and parameters. ENGINE = MergeTree(); the MergeTree engine has no parameters. A complete example combining the clauses below is shown after this list.
- ORDER BY — sort key. It can be a tuple of columns or an arbitrary expression, for example: ORDER BY (CounterID, EventDate). If the primary key is not explicitly specified with PRIMARY KEY, ClickHouse uses the sort key as the primary key. If you do not need sorting, you can use ORDER BY tuple(); in that case ClickHouse stores the data in the order it was inserted. If you want to preserve the data order during INSERT ... SELECT, set max_insert_threads = 1, and use single-threaded queries to read the data back in its original order.

- PARTITION BY — partition key. To partition by month, use the expression toYYYYMM(date_column), where date_column is a column of type Date; the partition names then have the format "YYYYMM". A sparse index causes extra data to be read: when reading a single range of the primary key, at most index_granularity * 2 extra rows per data block may be read. Sparse indexes allow you to work with an extremely large number of rows, because in most cases such indexes fit in memory (RAM). ClickHouse does not require a unique primary key, so you can insert multiple rows with the same primary key.

- PRIMARY KEY — primary key, optional, used if you want a primary key that differs from the sort key. By default the primary key is the same as the sort key (specified by the ORDER BY clause), so in most cases there is no need to specify a separate PRIMARY KEY clause.

- SAMPLE BY — The expression used for sampling. If a sampling expression is to be used, this expression must be included in the primary key. For example: SAMPLE BY intHash32(UserID) ORDER BY (CounterID, EventDate, intHash32(UserID)).

- TTL — specifies the storage duration of rows and defines the list of rules for moving data parts between disks and volumes, optional. The expression must contain at least one Date or DateTime column, for example: TTL date + INTERVAL 1 DAY. The rule type DELETE|TO DISK 'xxx'|TO VOLUME 'xxx' specifies the action performed when the condition (the specified time) is reached: removing the expired rows, or moving the data part (if all rows in the part satisfy the expression) to the specified disk (TO DISK 'xxx') or volume (TO VOLUME 'xxx'). The default rule is DELETE. Multiple rules can be specified in the list, but at most one of them may be a DELETE rule.

- SETTINGS — additional parameters that control the behavior of MergeTree:
    - index_granularity — index granularity: the number of data rows between adjacent "marks" in the index. The default value is 8192. Each data part is logically divided into granules. A granule is the smallest indivisible data set that ClickHouse reads when querying. ClickHouse does not split rows or values, so each granule always contains an integer number of rows. The first row of each granule is marked with that row's primary key value, and for each data part ClickHouse creates an index file to store these marks. For every column, whether or not it is part of the primary key, ClickHouse stores the same kind of marks, which allow the data to be located directly in the column files. The granule size is limited by the index_granularity and index_granularity_bytes table engine settings. Depending on the row size, the number of rows in a granule lies in the range [1, index_granularity]. If the size of a single row exceeds index_granularity_bytes, the granule size can exceed index_granularity_bytes; in that case the granule size equals the row size.
    - index_granularity_bytes — Index granularity, in bytes, default value: 10Mb. If you want to limit the index granularity only by the number of data rows, please set it to 0 (not recommended).
    - enable_mixed_granularity_parts — whether to enable controlling the index granularity size with index_granularity_bytes. Before version 19.11, only the index_granularity setting could be used to limit the granule size. When querying data from tables with very large rows (tens or hundreds of megabytes per row), index_granularity_bytes improves ClickHouse performance. If your table contains very large rows, you can enable this setting to improve the performance of SELECT queries.
    - use_minimalistic_part_header_in_zookeeper — Whether to enable the smallest data segment header in ZooKeeper. If use_minimalistic_part_header_in_zookeeper=1 is set, ZooKeeper will store less data.
    - min_merge_bytes_to_use_direct_io — the minimum amount of data in a merge operation required for using direct I/O access to the disk. When merging data parts, ClickHouse calculates the total storage size of all the data to be merged; if it exceeds min_merge_bytes_to_use_direct_io bytes, ClickHouse reads and writes the data to disk using the direct I/O interface (the O_DIRECT option). If you set min_merge_bytes_to_use_direct_io = 0, direct I/O is disabled. Default value: 10 * 1024 * 1024 * 1024 bytes.
    - merge_with_ttl_timeout — The minimum interval time of TTL merge frequency, unit: second. Default value: 86400 (1 day).
    - write_final_mark — Whether to enable writing the final index mark at the end of the data segment. Default value: 1 (not recommended to change).
    - merge_max_block_size — the maximum number of rows per block for merge operations. Default value: 8192.
    - storage_policy — storage policy. See the section on using multiple block devices for data storage.
    - min_bytes_for_wide_part, min_rows_for_wide_part — the minimum number of bytes/rows in a data part required for it to be stored in Wide format. You can set neither, one, or both of these settings. Data parts can be stored in Wide or Compact format: in the Wide format each column is stored as a separate file in the file system, while in the Compact format all columns are stored in one file. The Compact format can improve performance when inserts are small and frequent. If the number of bytes or rows in a data part is less than the corresponding setting's value, the part is stored in Compact format; otherwise it is stored in Wide format.
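
Putting these clauses together, here is a hedged example table; the table name, columns, and setting values are illustrative only:

CREATE TABLE IF NOT EXISTS visits
(
    CounterID UInt32,
    EventDate Date,
    UserID UInt64,
    Duration UInt32
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(EventDate)                     -- monthly partitions named like 202011
ORDER BY (CounterID, EventDate, intHash32(UserID))   -- also used as the primary key (no PRIMARY KEY clause)
SAMPLE BY intHash32(UserID)                          -- the sampling expression is part of the sort key
TTL EventDate + INTERVAL 1 YEAR DELETE               -- drop rows older than one year
SETTINGS index_granularity = 8192;

-- Read an approximate 10% sample of users, relying on the SAMPLE BY expression:
SELECT CounterID, count() FROM visits SAMPLE 0.1 GROUP BY CounterID;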

 

 

 

 


Origin blog.csdn.net/qq_32323239/article/details/109552550