Simple deployment and application of ClickHouse

ClickHouse

Introduction

The "C" in the KFC stack (Kafka, Flink, ClickHouse) that is popular right now is ClickHouse. Anyone working on real-time processing or architecture will be familiar with it. In this post I will briefly cover deployment, how we used it in our business, and why we eventually gave up on CK.
ClickHouse is a column-oriented OLAP database developed by Yandex, Russia's largest search engine. Query performance is extremely fast both on a single node and in a cluster, and for single-table queries it is hard to beat in the OLAP space. Toutiao, Tencent, Ctrip, and Kuaishou all use CK to analyze PB-scale data.
Advantages:

  • 1. A true column-oriented DBMS
  • 2. Efficient data compression => roughly a 0.2 compression ratio in practice (see the check after this list)
  • 3. Data stored on disk => lower memory usage
  • 4. Multi-core parallel processing => large queries are parallelized across cores and nodes
  • 5. Distributed processing across multiple servers
  • 6. SQL syntax support
  • 7. Ordered data storage => ClickHouse lets you specify the columns the data is sorted by when creating a table
  • 8. Primary key index + sparse index
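
For point 2, a quick way to check the actual compression ratio of your tables is the system.parts table. A minimal sketch (the database name default is just an example, adjust it to your own):

SELECT
    table,
    sum(data_compressed_bytes)   AS compressed,
    sum(data_uncompressed_bytes) AS uncompressed,
    round(compressed / uncompressed, 3) AS ratio
FROM system.parts
WHERE active AND database = 'default'
GROUP BY table;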

Build (Docker)

Dockerfile

FROM centos:7
MAINTAINER clickhouse

# Install the ClickHouse server and client from the Altinity repository
RUN yum install -y curl
RUN curl -s https://packagecloud.io/install/repositories/Altinity/clickhouse/script.rpm.sh | bash
RUN yum install -y clickhouse-server clickhouse-client
# Log and data directories (mounted as volumes by docker-compose below)
RUN mkdir -p /var/clickhouse/log
RUN mkdir -p /var/clickhouse/data

ADD clickhouse-start.sh /

ENTRYPOINT ["sh","/clickhouse-start.sh"]

clickhouse-start.sh

#!/bin/bash
set -e
# Start the server, then block so the container stays in the foreground
/etc/init.d/clickhouse-server start
tail -f /dev/null

I won't go into the basic config.xml settings in detail; the only change here is pointing the log and data directories at /var/clickhouse/.... If anything is unclear, message me privately.

metrika.xml can be understood as the cluster configuration. The example here is a highly available cluster with 3 shards and 2 replicas per shard. Briefly, on the architecture: normally you start with a single node and ignore all of this; if you want a highly available cluster, having both shards and replicas is the more reliable setup, and in CK that means explicitly assigning each node a shard and a replica. For example, shard 01 of my_cluster consists of ck01 and ck02, i.e. one shard made up of two replicas, so the macro variables defined on ck01 are shard 01, replica 01:

machine shard replica
ck01 01 01
ck02 01 02
ck03 02 01
ck04 02 02
ck05 03 01
ck06 03 02
<yandex>
    <clickhouse_remote_servers>
        <my_cluster>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>ck01</host>
                    <port>9000</port>
                </replica>
                 <replica>
                    <host>ck02</host>
                    <port>9000</port>
                </replica>
            </shard>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>ck03</host>
                    <port>9000</port>
                </replica>
                 <replica>
                    <host>ck04</host>
                    <port>9000</port>
                </replica>
            </shard>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>ck05</host>
                    <port>9000</port>
                </replica>
                 <replica>
                    <host>ck06</host>
                    <port>9000</port>
                </replica>
            </shard>
        </my_cluster>
    </clickhouse_remote_servers>

    <!-- ZooKeeper configuration -->
    <zookeeper-servers>
        <node index="1">
            <host>ck01</host>
            <port>2181</port>
        </node>
        <node index="2">
            <host>ck02</host>
            <port>2181</port>
        </node>
        <node index="3">
            <host>ck03</host>
            <port>2181</port>
        </node>
    </zookeeper-servers>

    <macros>
        <shard>01</shard>
        <replica>01</replica>
    </macros>

    <networks>
        <ip>::/0</ip>
    </networks>

    <clickhouse_compression>
        <case>
            <min_part_size>10000000000</min_part_size>
            <min_part_size_ratio>0.01</min_part_size_ratio>
            <method>lz4</method>
        </case>
    </clickhouse_compression>
</yandex>
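
Once config.xml and metrika.xml are in place on every node, the system tables can be used to confirm that the cluster definition and the per-node macros were picked up. A minimal sketch (run the first query on any node, the second on each node):

-- should list the 3 shards x 2 replicas defined above
SELECT cluster, shard_num, replica_num, host_name, port
FROM system.clusters
WHERE cluster = 'my_cluster';

-- should print this node's shard/replica values from the table above
SELECT * FROM system.macros;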

users.xml handles basic user management. Username: default, password: 123456

<yandex>
    <!-- Profiles of settings. -->
    <profiles>
        <!-- Default settings. -->
        <default>
            <!-- Maximum memory usage for processing single query, in bytes. -->
            <max_memory_usage>10000000000</max_memory_usage>
            <use_uncompressed_cache>0</use_uncompressed_cache>
            <load_balancing>random</load_balancing>
        </default>

        <!-- Profile that allows only read queries. -->
        <readonly>
            <readonly>1</readonly>
        </readonly>
    </profiles>

    <!-- Users and ACL. -->
    <users>
        <default>
            <password>123456</password>
            <networks incl="networks" replace="replace">
                <ip>::/0</ip>
            </networks>
            <profile>default</profile>
            <quota>default</quota>
        </default>
        <guest>
            <password></password>
            <networks incl="networks" replace="replace">
                <ip>::/0</ip>
            </networks>
            <profile>readonly</profile>
            <quota>default</quota>
        </guest>
    </users>

    <!-- Quotas. -->
    <quotas>
        <!-- Name of quota. -->
        <default>
            <!-- Limits for time interval. You could specify many intervals with different limits. -->
            <interval>
                <!-- Length of interval. -->
                <duration>3600</duration>
                <!-- No limits. Just calculate resource usage for time interval. -->
                <queries>0</queries>
                <errors>0</errors>
                <result_rows>0</result_rows>
                <read_rows>0</read_rows>
                <execution_time>0</execution_time>
            </interval>
        </default>
    </quotas>
</yandex>

Create the directories that will be mounted into the container
/xxx/clickhouse/conf
/xxx/clickhouse/log
/xxx/clickhouse/data/clickhouse-server

docker-compose-server.yml

version: '3.7'
services:
    clickhouse-server:
        image: clickhouse:v20.5.4
        restart: always
        # Note: with host networking, Docker ignores the port mappings below
        network_mode: "host"
        container_name: "clickhouse-server"
        ports:
            - "9000:9000"
            - "9440:9440"
            - "9009:9009"
            - "8123:8123"
        volumes:
            - "/xxx/clickhouse/conf:/etc/clickhouse-server"
            - "/xxx/clickhouse/log:/var/clickhouse/log"
            - "/xxx/clickhouse/data/clickhouse-server:/var/clickhouse/data"
            - "/etc/localtime:/etc/localtime:ro"

Upgrade + expansion

Read the release notes before upgrading. Upgrades are generally rolling: rebuild the image and replace the running one node at a time.
Scaling out is even simpler and is just configuration in metrika.xml. For example, to add two more machines, ck07 and
ck08, as a new shard of my_cluster:

 <shard>
    <internal_replication>true</internal_replication>
    <replica>
        <host>ck07</host>
        <port>9000</port>
    </replica>
     <replica>
        <host>ck08</host>
        <port>9000</port>
    </replica>
</shard>

Then set the macros on ck07 and ck08 to shard 04 / replica 01 and shard 04 / replica 02 respectively.
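
Note that existing data is not rebalanced onto the new shard automatically; only new writes routed through a Distributed table will land on ck07 and ck08. A rough way to watch the per-host row distribution, as a sketch that assumes the default.user_profile table created in the usage section below:

SELECT hostName() AS host, count() AS rows
FROM cluster('my_cluster', default.user_profile)
GROUP BY host
ORDER BY host;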

Usage

The application scenario is a T+1 user profile, supporting fast audience selection, feature filtering, fast queries, and report display.

UI

Tabix (http://ui.tabix.io/#!/login) is the external web UI we use; after logging in it also provides some monitoring.

Create a table (take the user portrait table as an example)

-- Create the CK table (replicated, highly available). ON CLUSTER my_cluster means the table is created on every node of the cluster; without it the table exists on a single machine only
CREATE TABLE `user_profile` ON CLUSTER my_cluster (
user_id              Int64    comment 'user id',
user_name            String   comment 'user name',
user_desc            String   comment 'user bio',
card_name            String   comment 'name on ID card',
...
partition_date       Int32    comment 'time partition'
)
-- {shard} and {replica} are the macro variables defined in metrika.xml
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/user_profile', '{replica}')
PARTITION BY partition_date
ORDER BY user_id
SETTINGS index_granularity = 8192;

-- Create the CK Distributed table. The table above is a local table that exists on every machine; data in different shards is not shared, so querying the local table only returns that shard's data. This table is a mapping over the local tables on all shards, so querying the Distributed table returns the data of all shards combined
create table user_profile_all ON CLUSTER my_cluster as user_profile
ENGINE = Distributed(my_cluster, default, user_profile, rand());
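
A quick way to see the difference between the two tables: the local table only returns the rows stored on the shard you are connected to, while the Distributed table fans the query out to every shard. A minimal sketch:

-- rows on the local shard only
SELECT count() FROM default.user_profile;

-- total rows across all shards
SELECT count() FROM default.user_profile_all;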

Import a table from Hive

Export from Hive/HDFS to the local filesystem, then load the local files into CK.
On a 6-node cluster, 13 GB of data (about 40 million rows) took about 4 minutes to export and 6 minutes to load.

#!/bin/bash
 
echo "select * from test.user_profile where partition_date=$1"
 
hive -e "SET hive.exec.compress.output=false; insert overwrite local directory '/home/tmp/test/user_profile/$1' row format delimited fields terminated by '\001' STORED AS TEXTFILE select * from test.user_profile where partition_date=$1"
 
echo "load csv to ck"
 
delimiter=$'\001'
cat ./$1/* | sed 's/"/\\"/g'| sed "s/'/\\\'/g"|clickhouse-client --host=ck01 --port=9000 --user=default --format_csv_delimiter="$delimiter" --query="INSERT INTO default.user_profile_all FORMAT CSV"
 
echo "load down! remove file"
 
rm -rf ./$1
  
# Complex data types: column 47 is an Array(Int), so wrap it in [] before loading
cat 20200714/* | sed 's/"/\\"/g'| sed "s/'/\\\'/g" | sed "s/\x02/, /g"  | awk -F '\x01' '{
for(i=1;i<=NF;i++) {
    if(i==NF){print $i}
    else if( i==47 ){printf "["$i"]""\x01"}
    else {printf $i"\x01"}
}}' | clickhouse-client --host=ck01 --port=9000 --user=default --format_csv_delimiter=$'\001' --query="INSERT INTO default.user_profile_all FORMAT CSV"
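
After the load it is worth verifying the row counts, for example per partition through the Distributed table. A minimal sketch (table name as above):

SELECT partition_date, count() AS rows
FROM default.user_profile_all
GROUP BY partition_date
ORDER BY partition_date;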
 

Import a table from MySQL

CREATE TABLE tablename ENGINE = MergeTree ORDER BY id AS
SELECT *
FROM mysql('host:port', 'databasename', 'tablename', 'username', 'password')

ALTER

-- Delete a partition's data (runs as an asynchronous mutation; see the check after this block)
ALTER TABLE default.tablename ON CLUSTER my_cluster delete where partition_date=20200615
-- Add a column
alter table tablename ON CLUSTER my_cluster add column cost int default 0 after user_id
-- Drop a column
alter table tablename ON CLUSTER my_cluster drop column cost
-- Comment on a column
alter table tablename ON CLUSTER my_cluster comment column cost 'test'
-- Change a column's type
alter table tablename ON CLUSTER my_cluster modify column cost String
-- Rename a column (our cluster will be upgraded to 20.5.4 in early September 2020; available in that version and later)
alter table default.tablename ON CLUSTER my_cluster rename column `oldname` to `newname`;
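
The ALTER ... DELETE above is executed as an asynchronous mutation: the statement returns immediately and the data is rewritten in the background. Progress can be checked via system.mutations, a minimal sketch:

SELECT database, table, mutation_id, command, parts_to_do, is_done
FROM system.mutations
WHERE is_done = 0;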

Problems

These are the problems that eventually made us give up on continuing with ClickHouse:

  • No transactions, and no real delete/update; rows can be deleted, but it is an asynchronous mutation that can take a minute or more even for a single row
  • No real support for high concurrency; the official recommendation is around 100 QPS. You can raise the connection limits in the configuration file if the hardware is good enough, and for a test phase this is fine
  • SQL support covers more than 80% of everyday syntax, but joins have to be written in a special way; recent versions support SQL-style joins, but the performance is poor
  • Write in batches of more than 1,000 rows and avoid row-by-row inserts and small-batch insert/update/delete operations, because ClickHouse keeps merging data parts asynchronously in the background and small writes hurt query performance; it cannot satisfy quasi-real-time writes
  • ClickHouse is fast because it parallelizes aggressively: by default even a single query uses half of the server's CPU cores (the core count is detected automatically, and the limit can be changed in the configuration; see the sketch after this list), so it cannot support high-concurrency scenarios
  • It is written in C++; when a small company uses it at scale, there is nobody who can fix problems in the internals
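
For the parallelism point, the number of threads a single query may use is controlled by the max_threads setting; it can be inspected and lowered per session. A minimal sketch (8 is just an example value):

-- the default is derived automatically from the CPU core count
SELECT name, value FROM system.settings WHERE name = 'max_threads';

-- limit the current session
SET max_threads = 8;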
