Article 1 | ClickHouse Quick Start

Introduction to ClickHouse

ClickHouse is a columnar database management system (DBMS) for online analytical processing (OLAP). It was originally developed at Yandex for Yandex.Metrica, a product mainly used for web traffic analysis. The full name is "Click Stream, Data WareHouse", ClickHouse for short.

ClickHouse is very well suited to business intelligence. It is also widely used in advertising traffic analysis, web and app analytics, telecommunications, finance, e-commerce, information security, online gaming, the Internet of Things, and many other fields. ClickHouse has the following characteristics:

  • Complete SQL support

  • Columnar storage and data compression

  • Vectorized execution engine

  • Relational model (similar to traditional databases)

  • A rich set of table engines

  • Parallel processing

  • Online queries

  • Data sharding

    As a high-performance OLAP database, ClickHouse also has the following shortcomings:

  • Does not support transactions.

  • Not well suited to row-granularity queries by primary key (although supported), so ClickHouse should not be used as a key-value database.

  • Not well suited to deleting data row by row (although supported).

Stand-alone installation

Download RPM package

This article uses offline installation. You can download the corresponding RPM packages from the link below, or download them directly from Baidu Cloud:

-- RPM package address
https://packagecloud.io/Altinity/clickhouse
-- Baidu Cloud address
Link: https://pan.baidu.com/s/1pFR66SzLvPYMfcpuPJww5A
Extraction code: gh5a

The software we install includes these packages:

  • clickhouse-client : contains the clickhouse-client application, an interactive ClickHouse console client.
  • clickhouse-common : contains the ClickHouse executable file.
  • clickhouse-server : contains the ClickHouse configuration files needed to run as a server.

There are four RPM packages in total:

clickhouse-client-19.17.4.11-1.el7.x86_64.rpm
clickhouse-common-static-19.17.4.11-1.el7.x86_64.rpm
clickhouse-server-19.17.4.11-1.el7.x86_64.rpm
clickhouse-server-common-19.17.4.11-1.el7.x86_64.rpm

Tip: if the error "Dependency check failed" is reported during installation, a dependency package is missing.

You can manually install the libicu-50.2-4.el7_7.x86_64.rpm dependency package first.

Turn off the firewall

## Check the firewall status
systemctl status firewalld
## Temporarily stop the firewall; it starts again automatically after reboot
systemctl stop firewalld
## Permanently disable the firewall; it will not start automatically after reboot
systemctl disable firewalld

System Requirements

ClickHouse can run on any Linux, FreeBSD, or Mac OS X system with an x86_64, AArch64, or PowerPC64LE CPU architecture. Pre-built binaries are usually compiled for x86_64 and use the SSE 4.2 instruction set, so unless otherwise noted, a CPU that supports SSE 4.2 is an additional system requirement. This command checks whether the current CPU supports SSE 4.2:

grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"
SSE 4.2 supported

To run ClickHouse on processors that do not support SSE 4.2, or on the AArch64 or PowerPC64LE architectures, build ClickHouse from source with the appropriate configuration adjustments.

Install RPM package

## Upload the rpm packages to the /opt/software directory
## Run the following command to install them
[root@cdh06 software]# rpm -ivh *.rpm
error: Failed dependencies:
        libicudata.so.50()(64bit) is needed by clickhouse-common-static-19.17.4.11-1.el7.x86_64
        libicui18n.so.50()(64bit) is needed by clickhouse-common-static-19.17.4.11-1.el7.x86_64
        libicuuc.so.50()(64bit) is needed by clickhouse-common-static-19.17.4.11-1.el7.x86_64
        libicudata.so.50()(64bit) is needed by clickhouse-server-19.17.4.11-1.el7.x86_64
        libicui18n.so.50()(64bit) is needed by clickhouse-server-19.17.4.11-1.el7.x86_64
        libicuuc.so.50()(64bit) is needed by clickhouse-server-19.17.4.11-1.el7.x86_64
## The installation fails because dependency packages are missing;
## download the corresponding dependency package:
## installing libicu-50.2-4.el7_7.x86_64.rpm resolves the errors


View installation information

Directory Structure

  • /etc/clickhouse-server : the server configuration directory, including the global configuration config.xml and the user configuration users.xml.
  • /etc/clickhouse-client : the client configuration, including the conf.d folder and the config.xml file.

  • /var/lib/clickhouse : the default data storage directory (usually this path is changed so that data is saved to a path mounted on a large-capacity disk).

  • /var/log/clickhouse-server : the default log directory (likewise usually changed to a path mounted on a large-capacity disk).

Configuration file

  • /etc/security/limits.d/clickhouse.conf : file-handle (open files) limits for the clickhouse user
[root@cdh06 clickhouse-server]# cat /etc/security/limits.d/clickhouse.conf 
clickhouse      soft    nofile  262144
clickhouse      hard    nofile  262144
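As a quick sanity check, this file's values can also be read programmatically. A minimal sketch (the file content is embedded here as a string so the snippet is self-contained; it mirrors the listing above):

```python
# Parse a limits.d-style file and extract the nofile limits for a user.
LIMITS_CONF = """\
clickhouse      soft    nofile  262144
clickhouse      hard    nofile  262144
"""

def nofile_limits(text: str, user: str = "clickhouse") -> dict:
    """Return {"soft": N, "hard": N} for the given user's nofile entries."""
    limits = {}
    for line in text.splitlines():
        fields = line.split()
        # Line format: <domain> <type> <item> <value>
        if len(fields) == 4 and fields[0] == user and fields[2] == "nofile":
            limits[fields[1]] = int(fields[3])
    return limits

print(nofile_limits(LIMITS_CONF))  # {'soft': 262144, 'hard': 262144}
```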

The limit can also be modified through max_open_files in config.xml:

 <!-- Set limit on number of open files (default: maximum). This setting makes sense on Mac OS X because getrlimit() fails to retrieve correct maximum value. -->
    <!-- <max_open_files>262144</max_open_files> -->
  • /etc/cron.d/clickhouse-server : cron job configuration used to restart a ClickHouse server process that was interrupted abnormally; the default configuration is as follows.
[root@cdh06 cron.d]# cat /etc/cron.d/clickhouse-server
#*/10 * * * * root (which service > /dev/null 2>&1 && (service clickhouse-server condstart ||:)) || /etc/init.d/clickhouse-server condstart > /dev/null 2>&1

Executable files

Finally, there is a set of executable files under /usr/bin:

  • clickhouse : the main program executable.

  • clickhouse-client : a soft link to the clickhouse executable, used for client connections.

  • clickhouse-server : a soft link to the clickhouse executable, used for server startup.

  • clickhouse-compressor : a built-in tool for compressing and decompressing data.


Start/stop service

## Start the service
[root@cdh06 ~]# service clickhouse-server start
Start clickhouse-server service: Path to data directory in /etc/clickhouse-server/config.xml: /var/lib/clickhouse/
DONE
## Stop the service
[root@cdh06 ~]# service clickhouse-server stop

Client connection

[root@cdh06 ~]# clickhouse-client 
ClickHouse client version 19.17.4.11.
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 19.17.4 revision 54428.

cdh06 :) show databases;

SHOW DATABASES

┌─name────┐
│ default │
│ system  │
└─────────┘

2 rows in set. Elapsed: 0.004 sec. 
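Besides the native TCP client on port 9000, ClickHouse also serves queries over an HTTP interface on port 8123. A minimal sketch of building such a request with the Python standard library (the host and port are assumptions matching a default installation; the actual network call only works against a running server, so it is left commented out):

```python
from urllib.parse import urlencode
from urllib.request import urlopen  # only needed when a server is reachable

def query_url(sql: str, host: str = "localhost", port: int = 8123) -> str:
    """Build a ClickHouse HTTP-interface URL for a query."""
    return f"http://{host}:{port}/?{urlencode({'query': sql})}"

url = query_url("SHOW DATABASES")
print(url)  # http://localhost:8123/?query=SHOW+DATABASES

# Against a running server this would print one database name per line:
# print(urlopen(url).read().decode())
```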

Basic operation

Create database

  • Syntax
CREATE DATABASE [IF NOT EXISTS] db_name [ON CLUSTER cluster] [ENGINE = engine(...)]
  • Example
CREATE DATABASE IF NOT EXISTS tutorial

By default, ClickHouse uses its native database engine, Ordinary (any type of table engine can be used inside such a database, and in most cases the default database engine is all you need). You can also use the Lazy engine or the MySQL engine. With the MySQL engine, you can operate directly on the tables of the corresponding MySQL database from within ClickHouse. Assuming there is a database named clickhouse in MySQL, you can connect to it as follows:

-- -------------------------- Syntax ---------------------------------
CREATE DATABASE [IF NOT EXISTS] db_name [ON CLUSTER cluster]
ENGINE = MySQL('host:port', ['database' | database], 'user', 'password')
-- -------------------------- Example --------------------------------
CREATE DATABASE mysql_db ENGINE = MySQL('192.168.200.241:3306', 'clickhouse', 'root', '123qwe');
-- -------------------------- Usage ----------------------------------
cdh06 :) use mysql_db;
cdh06 :) show tables;

SHOW TABLES

┌─name─┐
│ test │
└──────┘

1 rows in set. Elapsed: 0.005 sec. 

cdh06 :) select * from test;

SELECT *
FROM test

┌─id─┬─name──┐
│  1 │ tom   │
│  2 │ jack  │
│  3 │ lihua │
└────┴───────┘

3 rows in set. Elapsed: 0.047 sec. 

Create table

  • Syntax
CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1] [compression_codec] [TTL expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2] [compression_codec] [TTL expr2],
    ...
) ENGINE = engine
  • Example
-- note: type names are case-sensitive (initial capital letters)
-- create the table
create table test(
    id Int32,
    name String
) engine=Memory;

The above command creates a memory table using the Memory engine. The table engine determines the characteristics of a data table, including how data is stored and loaded. The Memory engine is ClickHouse's simplest table engine: data is stored only in memory, and it is lost when the service restarts.
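The volatility of the Memory engine can be illustrated with a toy model in plain Python (this is an analogy for the engine's semantics, not ClickHouse code):

```python
class MemoryTable:
    """Toy model of a Memory-engine table: rows exist only in RAM."""
    def __init__(self):
        self.rows = []

    def insert(self, *rows):
        self.rows.extend(rows)

    def restart(self):
        # Simulates a server restart: in-memory data is gone.
        self.rows = []

t = MemoryTable()
t.insert((1, "tom"), (2, "jack"))
print(len(t.rows))  # 2
t.restart()
print(len(t.rows))  # 0
```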

Cluster installation

installation steps

The previous sections covered stand-alone installation and basic use of the ClickHouse client. Next, we look at cluster installation. Installing a ClickHouse cluster is very simple: first repeat the steps above to install ClickHouse on the other machines, then configure the /etc/clickhouse-server/config.xml and /etc/metrika.xml files on each. Note that a ClickHouse cluster depends on ZooKeeper, so make sure a ZooKeeper cluster is installed first (ZooKeeper installation is straightforward and is not covered in this article). This article demonstrates a three-node ClickHouse cluster. The specific steps are as follows:

  • First, repeat the stand-alone installation steps to install ClickHouse on the other two machines

  • Then, modify the /etc/clickhouse-server/config.xml file on each machine:

    <!-- If IPv6 is disabled, use the following configuration -->
    <listen_host>0.0.0.0</listen_host>
    <!-- If IPv6 is not disabled, use the following configuration
    <listen_host>::</listen_host>
    -->

    Tip (1):

    When IPv6 is disabled, using the <listen_host>::</listen_host> configuration causes the following error:

    <Error> Application: DB::Exception: Listen [::]:8123 failed: Poco::Exception. Code: 1000, e.code() =0, e.displayText() = DNS error: EAI: -9

    Tip (2):

    The default ClickHouse TCP port is 9000. If there is a port conflict, change the port in the /etc/clickhouse-server/config.xml file, for example <tcp_port>9001</tcp_port>.

  • Finally, create a metrika.xml file under /etc with the content below. This configuration defines shards without replicas; you can also configure multiple replicas per shard.

    <yandex>
    <!-- This tag must match <remote_servers incl="clickhouse_remote_servers" > in config.xml
    -->    
    <clickhouse_remote_servers>
      <!-- cluster name; it can be changed -->
      <cluster_3shards_1replicas>
          <!-- three shards, one per machine -->
          <shard>
              <replica>
                  <host>cdh04</host>
                  <port>9001</port>
              </replica>
          </shard>
          <shard>
              <replica>
                  <host>cdh05</host>
                  <port>9001</port>
              </replica>
          </shard>
          <shard>
              <replica>
                  <host>cdh06</host>
                  <port>9001</port>
              </replica>
          </shard>
      </cluster_3shards_1replicas>
    </clickhouse_remote_servers>
    <!-- This tag must match <zookeeper incl="zookeeper-servers" optional="true" />
    in config.xml
    --> 
    <zookeeper-servers>
      <node>
          <host>cdh02</host>
          <port>2181</port>
      </node>
      <node>
          <host>cdh03</host>
          <port>2181</port>
      </node>
      <node>
          <host>cdh06</host>
          <port>2181</port>
      </node>
    </zookeeper-servers>
    <!-- Shard and replica identifiers: <shard> sets the shard number and <replica> the replica host name.
    Modify these on each corresponding host. -->
    <macros>
      <shard>01</shard>
      <replica>cdh04</replica>
    </macros>    
    </yandex>
  • Start clickhouse-server on each machine

    # service clickhouse-server start
  • (Optional configuration) Modify the /etc/clickhouse-client/config.xml file

    Because clickhouse-client connects to localhost on port 9000 by default, and we changed the default port, the client's default connection port must be updated as well. Add the following to the file:

    <port>9001</port>

    You do not strictly have to modify it, but then remember to add the --port 9001 parameter when connecting with the client to indicate the port to connect to; otherwise an error will be reported:

    Connecting to localhost:9000 as user default.
    Code: 210. DB::NetException: Connection refused (localhost:9000)
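When the client reports "Connection refused" like this, it helps to first confirm that anything is listening on the expected port. A small self-contained sketch in plain Python (host 127.0.0.1 is an assumption; run it on the ClickHouse host):

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is accepting connections on (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        return s.connect_ex((host, port)) == 0

# Example: on a correctly configured node, port_in_use(9001) should be True
# while clickhouse-server is running with <tcp_port>9001</tcp_port>.
```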

Basic operation

Verify the cluster

After completing the above configuration, start clickhouse-server and then clickhouse-client on each machine:

// Start the server
# service clickhouse-server start
// Start the client; the -m flag enables multi-line input
# clickhouse-client -m

You can query the system tables to verify that the cluster configuration has been loaded:

cdh04 :) select cluster,shard_num,replica_num,host_name,port,user from system.clusters;
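The system.clusters query above should list three shards with one replica each. As a complementary offline check, the shard layout in metrika.xml can be parsed directly with Python's standard XML parser (the embedded snippet is a trimmed copy of the configuration above):

```python
import xml.etree.ElementTree as ET

METRIKA = """
<yandex>
  <clickhouse_remote_servers>
    <cluster_3shards_1replicas>
      <shard><replica><host>cdh04</host><port>9001</port></replica></shard>
      <shard><replica><host>cdh05</host><port>9001</port></replica></shard>
      <shard><replica><host>cdh06</host><port>9001</port></replica></shard>
    </cluster_3shards_1replicas>
  </clickhouse_remote_servers>
</yandex>
"""

root = ET.fromstring(METRIKA)
for cluster in root.find("clickhouse_remote_servers"):
    print("cluster:", cluster.tag)
    for i, shard in enumerate(cluster.findall("shard"), start=1):
        for replica in shard.findall("replica"):
            print(f"  shard {i}: {replica.findtext('host')}:{replica.findtext('port')}")
```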

Next, let's look at each node's shard information (the macro variables). Execute the following command on each machine:

cdh04 :) select * from system.macros;
┌─macro───┬─substitution─┐
│ replica │ cdh04        │
│ shard   │ 01           │
└─────────┴──────────────┘

cdh05 :) select * from system.macros;
┌─macro───┬─substitution─┐
│ replica │ cdh05        │
│ shard   │ 02           │
└─────────┴──────────────┘

cdh06 :) select * from system.macros;
┌─macro───┬─substitution─┐
│ replica │ cdh06        │
│ shard   │ 03           │
└─────────┴──────────────┘
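These macros are mainly used to parameterize per-host settings, such as the ZooKeeper paths of ReplicatedMergeTree tables, so that one CREATE statement works on every node. A sketch of the substitution (the path template here is a common convention, not taken from this article; the macro values match the system.macros output above):

```python
# Per-host macro values, as shown in system.macros above.
MACROS = {
    "cdh04": {"shard": "01", "replica": "cdh04"},
    "cdh05": {"shard": "02", "replica": "cdh05"},
    "cdh06": {"shard": "03", "replica": "cdh06"},
}

# Hypothetical path template of the kind used with ReplicatedMergeTree.
TEMPLATE = "/clickhouse/tables/{shard}/user_local"

for host, macros in MACROS.items():
    # str.format ignores unused keyword arguments (here: replica).
    print(host, "->", TEMPLATE.format(**macros))
```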

Distributed DDL operation

By default, CREATE, DROP, ALTER, and RENAME statements take effect only on the server where they are executed. In a cluster environment, you can add the ON CLUSTER clause so that a statement takes effect across the entire cluster.

For example, create a distributed table:

CREATE TABLE IF NOT EXISTS user_cluster ON CLUSTER cluster_3shards_1replicas
(
    id Int32,
    name String
)ENGINE = Distributed(cluster_3shards_1replicas, default, user_local,id);

The Distributed table engine is defined as follows (ClickHouse table engines will be explained in detail in a follow-up article):

Distributed(cluster_name, database_name, table_name[, sharding_key])

The meaning of each parameter is as follows:

  • cluster_name : the cluster name, matching the custom name in the cluster configuration.
  • database_name : the database name.
  • table_name : the table name.
  • sharding_key : optional; the key used for sharding. When data is written, the distributed table distributes it to the local table on each node according to the sharding key.

Tip:

A distributed table performs its checks at read time, so there is no required order between creating the distributed table and creating the local tables.

Also note that the statement above uses ON CLUSTER distributed DDL, which means a Distributed table is created on every shard node of the cluster, so read and write requests for all shards can be initiated from any node.

After creating the distributed table above, check each machine: the newly created table appears on every one.

Next, create the local table. Create it on every machine:

CREATE TABLE IF NOT EXISTS user_local 
(
    id Int32,
    name String
)ENGINE = MergeTree()
ORDER BY id
PARTITION BY id
PRIMARY KEY id;

First, insert data into the user_local table on one machine, then query the user_cluster table:

-- insert data
cdh04 :) INSERT INTO user_local VALUES(1,'tom'),(2,'jack');
-- query the user_cluster table; through user_cluster we can reach all the user_local tables
cdh04 :) select * from user_cluster;
┌─id─┬─name─┐
│  2 │ jack │
└────┴──────┘
┌─id─┬─name─┐
│  1 │ tom  │
└────┴──────┘

Next, insert some data through user_cluster and observe how the user_local tables change; the data is spread across the nodes.

-- insert data into user_cluster
cdh04 :)  INSERT INTO user_cluster VALUES(3,'lilei'),(4,'lihua'); 
-- view the user_cluster data
cdh04 :) select * from user_cluster;
┌─id─┬─name─┐
│  2 │ jack │
└────┴──────┘
┌─id─┬─name──┐
│  3 │ lilei │
└────┴───────┘
┌─id─┬─name─┐
│  1 │ tom  │
└────┴──────┘
┌─id─┬─name──┐
│  4 │ lihua │
└────┴───────┘

-- view user_local on cdh04
cdh04 :) select * from user_local;
┌─id─┬─name─┐
│  2 │ jack │
└────┴──────┘
┌─id─┬─name──┐
│  3 │ lilei │
└────┴───────┘
┌─id─┬─name─┐
│  1 │ tom  │
└────┴──────┘
-- view user_local on cdh05
cdh05 :) select * from user_local;
┌─id─┬─name──┐
│  4 │ lihua │
└────┴───────┘
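The placement observed above can be reproduced with a simplified model of how a Distributed table routes rows. With three shards of equal weight, a row goes to shard sharding_key % 3 (this is a simplification; ClickHouse actually selects shards by weight using prefix-sum intervals, which reduces to a plain modulo when all weights are 1):

```python
SHARDS = ["cdh04", "cdh05", "cdh06"]  # shard order from metrika.xml

def route(sharding_key: int) -> str:
    """Pick the target shard for a row, assuming equal shard weights."""
    return SHARDS[sharding_key % len(SHARDS)]

for row_id in (3, 4):
    print(f"id={row_id} -> {route(row_id)}")
# id=3 -> cdh04   (matches: lilei landed in cdh04's user_local)
# id=4 -> cdh05   (matches: lihua landed in cdh05's user_local)
```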

Summary

This article first introduced ClickHouse's basic features and usage scenarios, then walked through the offline installation of both the stand-alone and cluster versions, and gave a simple usage example. This is only a brief introduction; follow-up posts will explore the world of ClickHouse step by step.

Origin blog.51cto.com/12729470/2532791