"Clickhouse Principle Analysis and Application Practice" Reading Notes

Tip: the electronic version can be read in the WeChat Reading app.

Chapter 1 The Past and Present of ClickHouse

Limitations of traditional BI.

Data warehouse
Introduced to solve the problem of data silos: a database dedicated to analytical scenarios that gathers scattered data in one place.

Concepts derived from the data warehouse: layering the data and building data marts layer by layer, thereby reducing the data volume scanned by the final query; and the data cube, which preprocesses (pre-aggregates) the data, trading space for time to improve query performance.

OLAP (online analytical processing), i.e. multi-dimensional analysis; basic operations: drill down, roll up, slice, dice, and rotate (see the SQL sketch after the list). The architectures can be roughly divided into three categories:

  1. ROLAP
    Relational OLAP
  2. MOLAP
    Multidimensional OLAP; preprocessing along the dimensions may cause data expansion
  3. HOLAP
    Hybrid OLAP, a hybrid of the two architectures
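
A rough SQL illustration of the basic operations, assuming a hypothetical fact table sales(city, product, dt, amount):

-- Roll up: aggregate from a finer grain (city, product) to a coarser one (city):
SELECT city, sum(amount) FROM sales GROUP BY city;

-- Drill down: the reverse, moving to a finer grain:
SELECT city, product, sum(amount) FROM sales GROUP BY city, product;

-- Slice: fix one dimension to a single value:
SELECT product, sum(amount) FROM sales WHERE city = 'Beijing' GROUP BY product;

-- Dice: restrict several dimensions to ranges or sets of values:
SELECT city, product, sum(amount) FROM sales
WHERE dt BETWEEN '2021-01-01' AND '2021-06-30'
  AND city IN ('Beijing', 'Shanghai')
GROUP BY city, product;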


Data stored in order yields higher query performance, because reading sequential files incurs less disk seek and rotational latency (this mainly applies to mechanical disks), and sequential reads can also exploit the read-ahead of the operating system's file cache. The query performance of a database is therefore closely related to how its data is ordered on the physical disk.

Applicable scenarios of ClickHouse:
BI (business intelligence), advertising traffic, Web and App analytics, telecommunications, finance, e-commerce, information security, online games, the Internet of Things, and many other fields.

Inapplicable scenarios

  1. Transactions are not supported
  2. Row-granularity point queries by primary key are supported, but are not a strength, so ClickHouse should not be used as a key-value database
  3. Deleting data by row is supported, but is not a strength

Chapter 2 Overview of ClickHouse Architecture

ClickHouse uses a multi-master, peer-to-peer network structure. Its
core features:

  1. Complete DBMS functionality
  2. Columnar storage and data compression
  3. Vectorized execution engine
  4. Relational model and SQL queries
  5. Diverse table engines
  6. Multithreading and distributed execution
  7. Multi-master architecture
  8. Online queries
  9. Data sharding and distributed queries

Architecture design

Chapter 3 Installation and Deployment

It is necessary to verify that the current server's CPU supports the SSE 4.2 instruction set, because vectorized execution requires it (on Linux this can be checked with grep -q sse4_2 /proc/cpuinfo).

Configuration file

  1. /etc/security/limits.d/clickhouse.conf: configuration of the number of file handles
  2. /etc/cron.d/clickhouse-server: cron scheduled-task configuration, used to restore a ClickHouse server process interrupted for abnormal reasons

The executable files are
mainly under the /usr/bin path:

  1. clickhouse: the executable file of the main program
  2. clickhouse-client: a soft link to the clickhouse executable, used for client connections
  3. clickhouse-server: a soft link to the clickhouse executable, used for server startup
  4. clickhouse-compressor: a built-in compression tool that can be used to compress and decompress data

Client access interface

  1. TCP
    Based on the TCP protocol, with better performance; the default port is 9000. Mainly used for communication inside the cluster and for the CLI client.
  2. HTTP
    Based on the HTTP protocol, with better compatibility; widely usable from programming-language clients such as Java and Python in the form of REST services. The default port is 8123.
  3. Encapsulated interfaces
    Include the CLI and JDBC; simple and easy to use, they are wrappers over the two underlying interfaces (the CLI over TCP, JDBC over HTTP).
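
A minimal sketch of the HTTP interface, assuming a local server on the default port 8123; the SQL is passed as a URL parameter:

-- Hypothetical call through the HTTP interface:
--   curl 'http://localhost:8123/?query=SELECT%20version()'
SELECT version()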

The CLI has two execution modes:

  1. Interactive execution
    Widely used in scenarios such as debugging, operations and maintenance, development, and testing. SQL statements executed interactively are uniformly recorded in the ~/.clickhouse-client-history file, which can be used for auditing.
  2. Non-interactive execution
    Used in batch-processing scenarios such as data import and export. Add the --query parameter to specify the SQL statement; the --multiquery parameter executes multiple queries, allowing several semicolon-separated SQL statements to run in one invocation, with their result sets returned in sequence (see the sketch below).
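
A minimal sketch of non-interactive execution, assuming a local server with default settings:

-- Single statement:
--   clickhouse-client --query="SELECT now()"
-- Multiple statements in one invocation; the result sets are returned in sequence:
--   clickhouse-client --multiquery --query="SELECT 1; SELECT 2;"
SELECT 1;
SELECT 2;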

JDBC

<dependency>
    <groupId>ru.yandex.clickhouse</groupId>
    <artifactId>clickhouse-jdbc</artifactId>
    <version>0.2.4</version>
</dependency>

The bottom layer of JDBC communicates over the HTTP interface. The driver URL has two forms:

  1. Normal mode: jdbc:clickhouse://<host>:<port>[/<database>]
  2. High-availability mode: multiple host addresses may be set, and one available address is randomly selected for each connection: jdbc:clickhouse://<first-host>:<first-port>,<second-host>:<second-port>[,…][/<database>]

Built-in tools

  1. clickhouse-local
    Can run most SQL queries independently, without relying on any ClickHouse server process; it can be understood as a stand-alone microkernel of the ClickHouse service, a lightweight application. clickhouse-local can only use the File table engine, and its data is completely isolated from a ClickHouse service running on the same machine; the two cannot access each other. It runs non-interactively, and the data source must be specified on each execution (see the sketch after this list).
    Core parameters:
    -S / --structure: the table structure
    -N / --table: the table name; the default value is table
    -if / --input-format: the format of the input data; the default value is TSV
    -f / --file: the path of the input data; the default value is stdin (standard input)
    -q / --query: the SQL statements to execute, separated by semicolons
  2. clickhouse-benchmark
    A benchmarking tool that reports QPS, RPS (requests per second), and the query execution time at each percentile. Multiple SQL statements can be tested at once; in that case, the SQL statements need to be defined in a file (see the sketch after this list).
    Core parameters:
    -i / --iterations: the number of times the SQL query is executed; the default value is 0
    -c / --concurrency: the number of queries executed concurrently; the default value is 1
    -r / --randomize: execute in random order when multiple SQL statements are given
    -h / --host: the server address; the default value is localhost. Comparison testing is supported; in that case, declare the addresses of two servers, and the test samples the gap between the two sets of query metrics
    --confidence: sets the confidence interval used in comparison tests; the default value is 5 (99.5%). Possible values are 0 (80%), 1 (90%), 2 (95%), 3 (98%), 4 (99%) and 5 (99.5%)
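
A minimal sketch of both tools, assuming hypothetical file and table names; the flags are the core parameters listed above:

-- clickhouse-local: query a local TSV file as a table named users:
--   clickhouse-local -S "id UInt32, city String" -N "users" -if "TSV" \
--     -f users.tsv -q "SELECT city, count() FROM users GROUP BY city"
SELECT city, count() FROM users GROUP BY city

-- clickhouse-benchmark: run the piped query 100 times, 4 at a time:
--   echo "SELECT count() FROM system.numbers LIMIT 1000000" | \
--     clickhouse-benchmark -i 100 -c 4
SELECT count() FROM system.numbers LIMIT 1000000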

Chapter 4 Data Definition

Data types

Include basic types, compound types, and special types.

Basic type

There are only three categories of basic types: numeric, string, and time. There is no Boolean type; the integers 0 or 1 can be used instead.

Numerical type

Numeric types are divided into three kinds: integer, floating-point, and fixed-point (Decimal)

String

It can be subdivided into three categories: String, FixedString and UUID.

  1. String: the length is not limited, and no character set is enforced, but it is recommended to follow a uniform encoding
  2. FixedString(N): pads the trailing characters with null bytes
  3. UUID: 32 hexadecimal digits in the 8-4-4-4-12 format; a null value is filled with zeros by default, i.e. 00000000-0000-0000-0000-000000000000
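
A small sketch (hypothetical table) showing FixedString padding and the UUID default value:

CREATE TABLE string_demo
(
    code FixedString(5),
    id   UUID
)
ENGINE = Memory;

-- 'abc' is padded to 5 bytes with null bytes; id falls back to the all-zero UUID:
INSERT INTO string_demo (code) VALUES ('abc');

SELECT code, length(code), id FROM string_demo;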

time

There are three time types: DateTime, DateTime64 and Date. ClickHouse currently has no dedicated timestamp type; DateTime is accurate at most to the second, so times finer than one second (milliseconds, microseconds) require DateTime64 or a plain UInt type.

  1. DateTime: contains hour, minute, and second information, accurate to the second, and supports writing in string form
  2. DateTime64: can record sub-second values, adding a precision setting on top of DateTime
  3. Date: contains no specific intra-day time information, is accurate only to the day, and also supports writing in string form
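
A minimal sketch (hypothetical table; time zone chosen arbitrarily) showing string-form writes:

CREATE TABLE datetime_demo
(
    dt   DateTime('Asia/Shanghai'),
    dt64 DateTime64(3, 'Asia/Shanghai'), -- 3 = millisecond precision
    d    Date
)
ENGINE = Memory;

INSERT INTO datetime_demo VALUES
    ('2021-06-01 12:00:00', '2021-06-01 12:00:00.123', '2021-06-01');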

Compound type

Arrays, tuples, enumerations, and nesting

  1. Array
    Has two forms of definition, array(1, 2, 3) and [1, 2, 3]. Type inference follows the principle of minimum storage cost: use the smallest data type that can express the values.

  2. Tuple
    A tuple consists of 1 to n elements; each element may have a different data type, and the types are not required to be compatible with each other. Tuples also support type inference, which again follows the principle of minimum storage cost.

  3. Enum
    Includes two enumeration types, Enum8 and Enum16.
    Keys and Values may not repeat and must each be unique; neither Key nor Value can be Null, but a Key is allowed to be an empty string.

  4. Nested
    A nested table structure. A data table may define any number of nested-type fields, but each field only supports one level of nesting, i.e. a nested type cannot itself contain another nested type. Nested is essentially a multidimensional array structure: every field in a nested table is an array, and the array lengths between rows do not need to be aligned. Dot notation is required when accessing nested data (see the sketch below).
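
A small sketch (hypothetical table) covering the four compound types, including dot-notation access:

CREATE TABLE compound_demo
(
    tags  Array(String),
    point Tuple(String, UInt8),
    state Enum8('ok' = 1, 'bad' = 2),
    dept  Nested(id UInt8, name String)
)
ENGINE = Memory;

-- Each Nested field is written as its own array:
INSERT INTO compound_demo VALUES
    (['a', 'b'], ('x', 1), 'ok', [1, 2], ['hr', 'it']);

-- Dot notation for nested access:
SELECT tags, dept.id, dept.name FROM compound_demo;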

Special type

  1. Nullable
    Cannot be regarded as an independent data type: it can only decorate a basic type, and cannot be used as an index field. Nullable columns (and tables full of them) should be used with caution, because they slow down both query and write performance. Under normal circumstances, each column's data is stored in a corresponding [Column].bin file; once a column is wrapped in Nullable, an additional [Column].null.bin file is generated specifically to store its Null values, so reading and writing that column requires double the file operations.

  2. Domain
    Domain types come in two kinds, IPv4 and IPv6, and are essentially further encapsulations of an integer and a string: IPv4 is based on UInt32, and IPv6 is based on FixedString(16). To get the string form of an IP back, explicitly call the IPv4NumToString or IPv6NumToString function (see the sketch below).
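
A small sketch (hypothetical table) of both special types:

CREATE TABLE special_demo
(
    code Nullable(UInt8), -- stored with an extra [Column].null.bin mask file
    ip   IPv4             -- a domain over UInt32
)
ENGINE = MergeTree()
ORDER BY ip;

INSERT INTO special_demo VALUES (NULL, '127.0.0.1');

-- Convert the IP back to its string form explicitly:
SELECT code, IPv4NumToString(ip) FROM special_demo;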

Data table definition

Data table operation

Partition operation

Currently, only the MergeTree family of table engines supports data partitioning. The system table parts is used specifically to query the partition information of data tables.

  1. Query: SELECT ... FROM system.parts WHERE table = 'table1'
  2. Delete: ALTER TABLE table1 DROP PARTITION partition1
  3. Replace: can be used in scenarios such as fast data writing, data synchronization between tables, and backup; two prerequisites: the two tables must have the same partition key, and their table structures must be exactly the same
  4. Reset: resets a column to its initial value: ALTER TABLE table1 CLEAR COLUMN column1 IN PARTITION partition1. If a default-value expression has been declared, that expression prevails; otherwise, the default value of the corresponding data type prevails
  5. Detach: ALTER TABLE table1 DETACH PARTITION partition1. After a partition is detached, its physical data is not deleted but moved to the detached subdirectory of the table's data directory, out of ClickHouse's management. ClickHouse will not actively clean up these files; they remain there until removed
  6. Attach: ALTER TABLE table1 ATTACH PARTITION partition1. The reverse of detaching; restores the data
  7. Backup: FREEZE
  8. Restore: FETCH
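
A sketch of these operations against a hypothetical table partition_demo partitioned by month (partition value '202106'):

-- Query partition information:
SELECT partition, name, active FROM system.parts WHERE table = 'partition_demo';

-- Delete a partition:
ALTER TABLE partition_demo DROP PARTITION '202106';

-- Replace: copy the partition from another table with the same structure and partition key:
ALTER TABLE partition_demo REPLACE PARTITION '202106' FROM partition_demo_v2;

-- Reset a column inside a partition:
ALTER TABLE partition_demo CLEAR COLUMN city IN PARTITION '202106';

-- Detach, then re-attach:
ALTER TABLE partition_demo DETACH PARTITION '202106';
ALTER TABLE partition_demo ATTACH PARTITION '202106';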

Distributed DDL execution

Converting an ordinary DDL statement into a distributed DDL execution is very simple: just add an ON CLUSTER cluster_name clause. When such a DDL statement is executed on any node in the cluster, every node in the cluster will execute the same statement in the same order.
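
For example (hypothetical cluster name ch_cluster):

CREATE TABLE test_ddl ON CLUSTER ch_cluster
(
    id UInt32
)
ENGINE = MergeTree()
ORDER BY id;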

Data writing, deletion and modification

Data writing (the INSERT statement) has three forms:
  1. Using the VALUES clause
  2. Using a statement with a specified format (such as CSV)
  3. Using a SELECT subquery
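
A sketch of the three forms on a hypothetical table insert_demo(id UInt32, name String); deletion and modification are carried out as ALTER ... DELETE/UPDATE mutations:

-- 1. VALUES clause:
INSERT INTO insert_demo VALUES (1, 'a'), (2, 'b');

-- 2. Specified format (the data rows follow the statement):
INSERT INTO insert_demo FORMAT CSV
3,"c"

-- 3. SELECT subquery:
INSERT INTO insert_demo SELECT id + 10, name FROM insert_demo;

-- Deletion and modification are executed as mutations:
ALTER TABLE insert_demo DELETE WHERE id = 1;
ALTER TABLE insert_demo UPDATE name = 'z' WHERE id = 2;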

Chapter 5 Data Dictionary

Data dictionary is a very simple and practical storage medium, which defines data in the form of key-value and attribute mapping. The data in the dictionary will be loaded into the memory actively or passively (whether the data is actively loaded when ClickHouse is started or lazily loaded during the first query is determined by the parameter settings), and supports dynamic updates. The dictionary data resides in memory and is very suitable for storing constant or frequently used dimension table data to avoid unnecessary JOIN queries.

Dictionaries are divided into built-in and extended forms. The built-in dictionaries are ClickHouse's defaults, while external extended dictionaries are implemented by users through custom configuration. Under normal circumstances, dictionary data can only be accessed through dictionary functions (ClickHouse provides a dedicated class of functions specifically for retrieving dictionary data). The exception is the special Dictionary table engine: with its help, a data dictionary can be mounted under a proxy data table, making JOIN queries between data tables and dictionary data possible (see the sketch below).
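
A minimal sketch of a dictionary function call, assuming a hypothetical dictionary user_dict with a String attribute name keyed by a UInt64 id:

SELECT dictGet('user_dict', 'name', toUInt64(1)) AS user_name;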

Built-in dictionary

Currently there is only one built-in dictionary, the Yandex.Metrica dictionary, which is disabled by default and must be enabled before use. Enabling it is very simple: open the path_to_regions_hierarchy_file and path_to_regions_names_files configurations in the config.xml file. Both are lazily loaded; the loading action is triggered only when the dictionary is queried for the first time. The geographic (geo) data in the Yandex.Metrica dictionary consists of two sets of models, which can be understood as the main table and the dimension tables of the regional data.

External extension dictionary

Currently, the extended dictionaries support 7 kinds of memory layouts and 4 kinds of data sources, which are registered as plug-ins.
By default, ClickHouse automatically recognizes and loads all configuration files ending in _dictionary.xml in the /etc/clickhouse-server directory. It also dynamically senses changes to the configuration files in this directory and supports online updates without restarting the service. Multiple dictionaries can be defined in a single dictionary configuration file, each defined by a set of dictionary elements (name, structure, layout, source, and lifetime):

<dictionaries>
	<dictionary>
	</dictionary>
</dictionaries>

Chapter 6 Principle Analysis of MergeTree

Only the MergeTree family of table engines supports primary key indexes, data partitioning, data replicas, and data sampling, and only this family supports ALTER-related operations. The Replicated prefix adds data replica support.

Chapter 7 MergeTree Series Table Engine

Table engines are roughly divided into six families: MergeTree, external storage, memory, file, interface, and others.
This chapter also covers data TTL and storage policies.

Data TTL
In MergeTree, a TTL can be set for a single column or for the entire table. If column-level and table-level TTLs are both set, whichever expires first prevails. The complete set of INTERVAL options includes SECOND, MINUTE, HOUR, DAY, WEEK, MONTH, QUARTER and YEAR.
Executing the optimize command forcibly triggers TTL cleanup (see the sketch below).
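
A minimal sketch (hypothetical table) with both TTL levels:

CREATE TABLE ttl_demo
(
    dt   DateTime,
    code UInt8 TTL dt + INTERVAL 10 SECOND -- column-level TTL
)
ENGINE = MergeTree()
ORDER BY dt
TTL dt + INTERVAL 1 DAY; -- table-level TTL

-- Force a merge so that expired data is cleaned up:
OPTIMIZE TABLE ttl_demo FINAL;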

Multipath storage strategy

Chapter 8

Chapter 9

Chapter 10

Chapter 11


Origin: blog.csdn.net/lonelymanontheway/article/details/108181649