ClickHouse: Detailed Explanation and Installation Tutorial

1. Concept

ClickHouse is a columnar database management system (DBMS) for online analytical processing (OLAP).

Key Features of OLAP Scenarios

  • Mostly read requests

  • Data is updated in sizable batches (>1000 rows) instead of a single row; or not updated at all.

  • Data that has been added to the database cannot be modified.

  • For reads, a fairly large number of rows is fetched from the database, but only a small subset of the columns.

  • Wide tables, that is, each table contains a large number of columns

  • Relatively few queries (typically hundreds of queries per second or less per server)

  • For simple queries, about 50 milliseconds of latency is allowed

  • The data in the columns is relatively small: numbers and short strings (e.g. 60 bytes per URL)

  • Requires high throughput (billions of rows per second per server) when processing a single query

  • Transactions are not required

  • Low requirements for data consistency

  • Each query involves one big table; all the other tables are small.

  • The query results are significantly smaller than the source data. In other words, the data is filtered or aggregated so the results fit in a single server's RAM
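To make this read pattern concrete, here is a sketch of a typical OLAP-style query (the web_events table and its columns are hypothetical): it scans many rows but touches only three columns, and aggregates down to a result far smaller than the source data.

-- Hypothetical example of an OLAP-style read: many rows in, few columns out.
SELECT
    EventDate,
    count() AS hits,
    uniq(UserID) AS users    -- approximate distinct count
FROM web_events              -- hypothetical wide fact table
WHERE EventDate BETWEEN '2023-01-01' AND '2023-01-31'
GROUP BY EventDate
ORDER BY EventDate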

2. Application scenarios

Build real-time operation monitoring reports based on ClickHouse and BI tools

Use ClickHouse to build real-time interactive reports to analyze core business indicators such as orders, revenue, and number of users in real time; build a user source analysis system to track PV and UV sources of each channel.

Massive data real-time multi-dimensional query

For large, wide tables with hundreds of millions to tens of billions of records, hundreds of dimensions can be queried freely, with response times usually within 100 milliseconds. This lets business analysts run exploratory queries without interrupting their train of thought, makes it easy to dig deeply into business value, and provides a very good query experience.

User portrait analysis

As the data age develops, data platforms in all industries hold ever larger volumes of data, and the demand for personalized operations targeting individual users grows ever more prominent. The user label (tagging) system emerged as the basic service for personalized operations at scale. Today, almost every industry (Internet, games, education, etc.) needs real-time precision marketing: generate user portraits with the system, filter users by combining conditions during a marketing campaign, and quickly extract the target group.

Construction of User Characteristic Behavior Analysis System Based on ClickHouse

Use ClickHouse to filter crowd tag data in real time and perform group portrait statistics; customize conditions to filter massive detailed log records and analyze user behavior.

User Group Statistics

Build a large and wide table of user characteristics, arbitrarily select user attribute label data and filter conditions, and conduct statistical analysis of crowd characteristics.

Visitor source analysis display

Correlate user behavior from user access logs through offline batch computation, generate a large, wide table of user behavior paths and synchronize it to ClickHouse, then build an interactive visitor-source exploration and analysis visualization system on top of ClickHouse.

3. Product Features

A true columnar database management system

In a true columnar database management system, no extra data is stored alongside the values themselves. Among other things, this means fixed-length numeric types must be supported, so that a length "number" does not have to be stored next to each value. For example, a billion UInt8 values should consume about 1 GB uncompressed; anything more strongly affects CPU usage. Storing data compactly even when uncompressed matters, because decompression speed depends mainly on the size of the uncompressed data. This is worth noting because there are systems that can also store the values of different columns separately, but that cannot effectively process analytical queries due to being optimized for other scenarios, for example: HBase, BigTable, Cassandra, and HyperTable. In those systems you can get throughput of hundreds of thousands of rows per second, but not hundreds of millions of rows per second. Also note that ClickHouse is a database management system, not just a single database: it allows creating tables and databases, loading data, and running queries at runtime without reconfiguring or restarting the service.

Data compression

In some columnar database management systems (for example: InfiniDB CE and MonetDB) data compression is not used. However, if you want to achieve relatively good performance, data compression does play a vital role.

In addition to efficient general-purpose compression codecs with different trade-offs between disk space and CPU consumption, ClickHouse provides specialized codecs for specific kinds of data, which allow ClickHouse to compete with, and outperform, more specialized databases such as time-series databases.

Disk storage of data

Many columnar databases (such as SAP HANA and Google PowerDrill) can only work in memory, an approach that requires a larger hardware budget than is actually necessary.

ClickHouse is designed for systems that work on traditional disks, which offer lower storage costs per GB, but if SSDs and memory are available, it will also make good use of these resources.

Multi-core parallel processing

ClickHouse uses all available resources on the server to process large queries in parallel in the most natural way.

Multi-server distributed processing

Few of the columnar database management systems mentioned above support distributed query processing. In ClickHouse, data can be stored on different shards, each shard consisting of a group of replicas for fault tolerance, and a query can be processed on all shards in parallel. All of this is transparent to the user.

Support for SQL

ClickHouse supports a declarative query language based on SQL that is identical to the ANSI SQL standard in many cases. Supported queries include GROUP BY, ORDER BY, subqueries in FROM, JOIN, the IN operator, and non-correlated subqueries. Correlated (dependent) subqueries and window functions are not currently supported, but will be implemented in the future.
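For instance, this sketch (with hypothetical table names) combines several of the supported constructs: GROUP BY, ORDER BY, and IN with a non-correlated subquery.

SELECT CounterID, count() AS pageviews
FROM hits                                                               -- hypothetical table
WHERE CounterID IN (SELECT CounterID FROM counters WHERE IsActive = 1)  -- non-correlated subquery
GROUP BY CounterID
ORDER BY pageviews DESC
LIMIT 10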

Vector engine

To use the CPU efficiently, data is not only stored by columns but also processed by vectors (chunks of columns), which achieves high CPU efficiency.

Real-time data updates

ClickHouse supports defining a primary key on a table. To let queries perform fast range lookups on the primary key, data is stored incrementally in sorted order using MergeTree. Thanks to this, data can be written to a table continuously and efficiently, with no locking during writes.

Indexes

Physically sorting the data by primary key lets ClickHouse locate a specific value or range of values within tens of milliseconds, which makes it suitable for online queries.

Online querying means that the query is processed and the results are loaded into the user's page with very low latency without any preprocessing of the data.
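As a hedged sketch of how the two previous points fit together (the table and values are hypothetical): a MergeTree table declares its sort key when it is created, and range filters on a prefix of that key can use the sparse primary index.

CREATE TABLE user_actions
(
    `CounterID` UInt32,
    `EventDate` Date,
    `UserID` UInt64,
    `URL` String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate);  -- sort key; also the primary key by default

-- A range lookup on the key prefix can skip most of the stored data:
SELECT count()
FROM user_actions
WHERE CounterID = 34 AND EventDate BETWEEN '2023-01-01' AND '2023-01-07';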

Support for approximate calculations

ClickHouse provides several ways to trade a little data precision for query speed (a sketch follows this list):

  1. Aggregate functions for approximate calculation of distinct value counts, medians, and quantiles.

  2. Running a query on a partial sample of the data, so that only a small share of the data is read from disk.

  3. Running an aggregation on a limited number of random keys instead of all keys. Under certain conditions on the key distribution in the data, this provides a reasonably accurate result while using fewer computing resources.
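A hedged illustration of the first two options (table names are hypothetical; uniq and quantile are approximate by design, and SAMPLE requires the table to declare a SAMPLE BY clause):

-- 1. Approximate aggregate functions:
SELECT
    uniq(UserID) AS approx_distinct_users,
    quantile(0.5)(Duration) AS approx_median_duration
FROM visits;   -- hypothetical table

-- 2. Sampling: read roughly 10% of the data and extrapolate:
SELECT count() * 10 AS estimated_total
FROM hits SAMPLE 1 / 10;   -- hypothetical table with a SAMPLE BY clause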

Adaptive Join Algorithm

When joining multiple tables, ClickHouse chooses the join algorithm adaptively: it prefers the hash join algorithm, and falls back to the merge join algorithm when there is more than one large table.

Support for data replication and data integrity

ClickHouse uses asynchronous multi-master replication. After data is written to any available replica, the system distributes it to the other replicas in the background, so that all replicas eventually hold the same data. In most failure cases ClickHouse recovers automatically; in a few rare, complex cases manual recovery is required.

For more information, see Data Replication .

Role Access Control

ClickHouse implements user account management using SQL queries and allows role-based access control, similar to the ANSI SQL standard and popular relational database management systems.
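A hedged sketch of what this looks like in SQL (the user and role names are made up; note that SQL-driven access management must first be enabled for the administrating user via the access_management setting in users.xml):

CREATE USER analyst IDENTIFIED WITH sha256_password BY 'change_me';
CREATE ROLE reporting_readonly;
GRANT SELECT ON tutorial.* TO reporting_readonly;
GRANT reporting_readonly TO analyst;
SET DEFAULT ROLE reporting_readonly TO analyst;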

Limitations

  1. No full transaction support.

  2. Lack of the ability to modify or delete existing data with high frequency and low latency. Batch deletes and updates are available to clean up or modify data, for example to comply with GDPR (a sketch follows this list).

  3. The sparse index makes ClickHouse poorly suited for point queries that retrieve a single row by its key.
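For reference, such batch deletes and updates are expressed as mutations. A hedged sketch against the tutorial table created later in this article (the predicates are made up); mutations rewrite whole data parts asynchronously and are meant for occasional bulk changes, not frequent single-row edits:

ALTER TABLE tutorial.hits_v1 DELETE WHERE UserID = 1234567890;
ALTER TABLE tutorial.hits_v1 UPDATE Title = '' WHERE URLDomain = 'example.com';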

4. Performance

According to Yandex's internal test results, ClickHouse showed the best performance among comparable products in its class. You can view the detailed test results here. Many independent benchmarks confirm this as well; you can find them with a web search, or check some of the related links we have collected.

Throughput of a single large query

Throughput can be measured in rows processed per second or bytes processed per second. If the data is in the page cache, a not-too-complex query runs at roughly 2-10 GB/s (uncompressed) on a single server (up to about 30 GB/s for simple queries). If the data is not in the page cache, the speed depends on your disk subsystem and the data compression ratio. For example, if the disk subsystem allows reading at 400 MB/s and the compression ratio is 3, the processing speed is about 1.2 GB/s. To translate that into rows per second, divide by the total size of the columns used: fetching a 10-byte column at that rate works out to roughly 100-200 million rows per second.

For distributed processing, the processing speed scales almost linearly, but this is limited to cases where the aggregated or sorted results are not that large.

Latency in processing short queries

If a query uses a primary key, does not process too many rows (hundreds of thousands), and does not read too many columns, you can expect latency below 50 milliseconds (single-digit milliseconds in the best case) when the data is in the page cache. Otherwise, latency is dominated by the number of disk seeks. On HDDs, for a system that is not overloaded, latency can be estimated with this formula: seek time (10 ms) * number of columns queried * number of data parts queried.

Throughput for handling a large number of short queries

Under the same conditions, ClickHouse can handle several hundred queries per second on a single server (up to several thousand in the best case). However, since this usage pattern is not typical for analytical scenarios, we recommend a maximum of 100 queries per second.

Data write performance

We recommend inserting data in batches of at least 1000 rows, or making no more than one insert request per second. When writing tab-separated data into a MergeTree table, the write speed is about 50-200 MB/s. If the inserted rows are about 1 KB each, that is 50,000-200,000 rows per second; if your rows are smaller, the row throughput is higher. To improve write performance, you can run multiple INSERTs in parallel, and throughput scales roughly linearly.
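A sketch of the recommended pattern (the events table is hypothetical): accumulate rows on the client side and send them in one INSERT statement instead of row by row.

-- One INSERT carrying many rows (ideally >= 1000) instead of many single-row INSERTs:
INSERT INTO events (EventDate, UserID, URL) VALUES
    ('2023-01-01', 101, 'https://example.com/a'),
    ('2023-01-01', 102, 'https://example.com/b'),
    ('2023-01-01', 103, 'https://example.com/c');
-- ...continue up to thousands of rows per request.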

5. Installation

System Requirements

ClickHouse can run on any Linux, FreeBSD or Mac OS X with x86_64, AArch64 or PowerPC64LE CPU architecture.

Official prebuilt binaries are usually compiled for x86_64 and make use of the SSE 4.2 instruction set, so unless stated otherwise, a CPU that supports it is an additional system requirement. Here is the command to check whether the current CPU supports SSE 4.2:

$ grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"

To run ClickHouse on processors that do not support SSE 4.2, or on the AArch64 or PowerPC64LE architectures, you should build ClickHouse from source with appropriate configuration adjustments.

Available packages

DEB installation package

It is recommended to use the official precompiled deb packages for Debian or Ubuntu. Run the following commands to install them:

sudo apt-get install -y apt-transport-https ca-certificates dirmngr
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 8919F6BD2B48D754

echo "deb https://packages.clickhouse.com/deb stable main" | sudo tee \
    /etc/apt/sources.list.d/clickhouse.list
sudo apt-get update

sudo apt-get install -y clickhouse-server clickhouse-client

sudo service clickhouse-server start
clickhouse-client # or "clickhouse-client --password" if you've set up a password.


If you want to use the latest version, replace stable with testing (recommended only for testing environments).

You can also manually download the installation package from here: Download .

List of installation packages:

  • clickhouse-common-static — ClickHouse compiled binaries.

  • clickhouse-server — Creates a symbolic link for clickhouse-server and installs the default server configuration.

  • clickhouse-client — Creates a symbolic link for the clickhouse-client tool and installs the client configuration files.

  • clickhouse-common-static-dbg — ClickHouse binaries with debug information.

RPM installation package

Official precompiled rpm packages for CentOS, RedHat, and all other rpm-based Linux distributions are recommended.

First, you need to add the official repository:

sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://packages.clickhouse.com/rpm/clickhouse.repo
sudo yum install -y clickhouse-server clickhouse-client

sudo /etc/init.d/clickhouse-server start
clickhouse-client # or "clickhouse-client --password" if you set up a password.


If you want to use the latest version, replace stable with testing (recommended only for testing environments). prestable is also sometimes available.

Then run the command to install:

sudo yum install clickhouse-server clickhouse-client

You can also manually download the installation package from here: Download .

Tgz installation package

If your operating system does not support the installation of deb or rpm packages, it is recommended to use the official precompiled tgz package.

The required version can be downloaded via curl or wget from the repository https://packages.clickhouse.com/tgz/.

After downloading, extract the archives and run the installation scripts. Here is an example of installing the latest stable version:

LATEST_VERSION=$(curl -s https://packages.clickhouse.com/tgz/stable/ | \
    grep -Eo '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' | sort -V -r | head -n 1)
export LATEST_VERSION

case $(uname -m) in
  x86_64) ARCH=amd64 ;;
  aarch64) ARCH=arm64 ;;
  *) echo "Unknown architecture $(uname -m)"; exit 1 ;;
esac

for PKG in clickhouse-common-static clickhouse-common-static-dbg clickhouse-server clickhouse-client
do
  curl -fO "https://packages.clickhouse.com/tgz/stable/$PKG-$LATEST_VERSION-${ARCH}.tgz" \
    || curl -fO "https://packages.clickhouse.com/tgz/stable/$PKG-$LATEST_VERSION.tgz"
done

tar -xzvf "clickhouse-common-static-$LATEST_VERSION-${ARCH}.tgz" \
  || tar -xzvf "clickhouse-common-static-$LATEST_VERSION.tgz"
sudo "clickhouse-common-static-$LATEST_VERSION/install/doinst.sh"

tar -xzvf "clickhouse-common-static-dbg-$LATEST_VERSION-${ARCH}.tgz" \
  || tar -xzvf "clickhouse-common-static-dbg-$LATEST_VERSION.tgz"
sudo "clickhouse-common-static-dbg-$LATEST_VERSION/install/doinst.sh"

tar -xzvf "clickhouse-server-$LATEST_VERSION-${ARCH}.tgz" \
  || tar -xzvf "clickhouse-server-$LATEST_VERSION.tgz"
sudo "clickhouse-server-$LATEST_VERSION/install/doinst.sh" configure
sudo /etc/init.d/clickhouse-server start

tar -xzvf "clickhouse-client-$LATEST_VERSION-${ARCH}.tgz" \
  || tar -xzvf "clickhouse-client-$LATEST_VERSION.tgz"
sudo "clickhouse-client-$LATEST_VERSION/install/doinst.sh"


For production environments, it is recommended to use the latest stable version. You can find its number on the GitHub page https://github.com/ClickHouse/ClickHouse/tags; stable releases are tagged with the `-stable` suffix.

Docker installation package

To run ClickHouse in Docker, follow the guide on Docker Hub. The images use the official deb packages internally.

Other environment installation packages

For non-Linux operating systems and the AArch64 CPU architecture, ClickHouse provides builds compiled from the latest commit of the master branch (delayed by a few hours).

After downloading, you can use clickhouse client to connect to a server, or clickhouse local to process local data; note that you must additionally download the server and users configuration files from GitHub.

These builds are not recommended for production use because they are less thoroughly tested, and they also contain only a subset of ClickHouse's functionality; use them at your own risk.

Install from source

To compile ClickHouse manually, follow the Linux or Mac OS X instructions.

You can compile and install the packages, or run the programs without installing them. A manual build also lets you drop the SSE 4.2 requirement or target AArch64 CPUs.

  Client: programs/clickhouse-client
  Server: programs/clickhouse-server

You need to create data and metadata folders and chown them to the desired user. Their paths can be changed in the server configuration (src/programs/server/config.xml); by default they are:

  /opt/clickhouse/data/default/
  /opt/clickhouse/metadata/default/

On Gentoo, you can install ClickHouse from source using emerge clickhouse .

Startup

If the service command is not available, you can run the following command to start the server in the background:

$ sudo /etc/init.d/clickhouse-server start

Log files are written to the /var/log/clickhouse-server/ directory.

If the server does not start, check the configuration in /etc/clickhouse-server/config.xml.

You can also start the server manually from the console:

$ clickhouse-server --config-file=/etc/clickhouse-server/config.xml

In this case, logs will be printed to the console, which is handy during development.

If the configuration file is in the current directory, you do not need to specify the --config-file parameter. By default, its path is ./config.xml .

ClickHouse supports access restriction settings. They are located in the users.xml file (next to config.xml). By default, the default user is allowed access from anywhere without a password; see the user/default/networks element. For more information, see Configuration Files.

After starting the service, you can connect to it using the command line client:

$ clickhouse-client

By default, it connects to localhost:9000 as the default user without a password. You can also use the --host parameter to connect to a different server.

Terminals must use UTF-8 encoding. For more information, see Command-line client .

Example:

$ ./clickhouse-client
ClickHouse client version 0.0.18749.
Connecting to localhost:9000.
Connected to ClickHouse server version 0.0.18749.

:) SELECT 1

SELECT 1

┌─1─┐
│ 1 │
└───┘

1 rows in set. Elapsed: 0.003 sec.

:)

Congratulations, the system is working!

6. ClickHouse Tutorial

What can I gain from this tutorial?

By following this tutorial, you will learn how to set up a simple ClickHouse cluster. It will be small, but fault-tolerant and scalable. We'll then use one of the sample datasets to populate the data and perform some demo queries.

Single-node setup

To postpone the complexities of a distributed environment, we will first deploy ClickHouse on a single server or virtual machine. ClickHouse is usually installed from deb or rpm packages, but there are alternatives for operating systems that do not support them.

For example, if you choose the deb installation package, execute:

sudo apt-get install -y apt-transport-https ca-certificates dirmngr
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 8919F6BD2B48D754

echo "deb https://packages.clickhouse.com/deb stable main" | sudo tee \
    /etc/apt/sources.list.d/clickhouse.list
sudo apt-get update

sudo apt-get install -y clickhouse-server clickhouse-client

These packages are included in the installation:

  • The clickhouse-client package contains the clickhouse-client application, an interactive ClickHouse console client.

  • The clickhouse-common package contains a ClickHouse executable.

  • The clickhouse-server package contains ClickHouse configuration files to run as a server.

Server configuration files are located in /etc/clickhouse-server/. Before proceeding, note the <path> element in config.xml: it determines where the data is stored, so it should be on a volume with large disk capacity; the default is /var/lib/clickhouse/. Editing config.xml directly is not the best way to adjust the configuration, since the file may be overwritten in future package updates. The recommended way to override configuration elements is to create files in the config.d directory, which serve as patches applied on top of config.xml.

As you may have noticed, clickhouse-server does not start automatically after installation, nor does it restart automatically after an update. How you start the server depends on your init system; usually it is:

sudo service clickhouse-server start

or

sudo /etc/init.d/clickhouse-server start

The default location of server logs is /var/log/clickhouse-server/. The server is ready to handle client connections once it records the Ready for connections message in the log.

Once clickhouse-server is up and running, we can use clickhouse-client to connect to the server and run some test queries like SELECT "Hello, world!"; .


Import sample dataset

Now it's time to populate our ClickHouse server with some sample data. In this tutorial, we will use anonymized data from Yandex.Metrica, the first service to run ClickHouse in production before it became open source (more on that in the ClickHouse History section). There are several ways to import the Yandex.Metrica dataset; for this tutorial, we will use the most realistic one.

Download and extract table data

curl https://datasets.clickhouse.com/hits/tsv/hits_v1.tsv.xz | unxz --threads=`nproc` > hits_v1.tsv
curl https://datasets.clickhouse.com/visits/tsv/visits_v1.tsv.xz | unxz --threads=`nproc` > visits_v1.tsv

The extracted file size is about 10GB.

Create tables

Like most database management systems, ClickHouse logically groups tables into databases. A default database is included, but we will create a new one named tutorial:

clickhouse-client --query "CREATE DATABASE IF NOT EXISTS tutorial"

The syntax for creating tables is much more complex than for creating databases (see the reference documentation). In general, a CREATE TABLE statement must specify three key things (a minimal sketch follows the list):

  1. The name of the table to create.

  2. The table structure, i.e. the column names and their data types.

  3. The table engine and its settings, which determine all the details of how queries against this table are physically executed.
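Here is a minimal sketch (a hypothetical table) that labels those three parts before we get to the real tables:

CREATE TABLE tutorial.example  -- 1. the table name
(
    `id` UInt64,               -- 2. the structure: columns and types
    `ts` DateTime,
    `note` String
)
ENGINE = MergeTree()           -- 3. the engine and its settings
ORDER BY (id, ts);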

Yandex.Metrica is a web analytics service, and the sample dataset does not cover its full functionality, so there are only two tables to create:

  • The hits table contains every action done by all users on all sites covered by the service.

  • The visits table contains pre-built sessions, not individual actions.

Let's look at and execute the actual create table query for these tables:

CREATE TABLE tutorial.hits_v1
(
    `WatchID` UInt64,
    `JavaEnable` UInt8,
    `Title` String,
    `GoodEvent` Int16,
    `EventTime` DateTime,
    `EventDate` Date,
    `CounterID` UInt32,
    `ClientIP` UInt32,
    `ClientIP6` FixedString(16),
    `RegionID` UInt32,
    `UserID` UInt64,
    `CounterClass` Int8,
    `OS` UInt8,
    `UserAgent` UInt8,
    `URL` String,
    `Referer` String,
    `URLDomain` String,
    `RefererDomain` String,
    `Refresh` UInt8,
    `IsRobot` UInt8,
    `RefererCategories` Array(UInt16),
    `URLCategories` Array(UInt16),
    `URLRegions` Array(UInt32),
    `RefererRegions` Array(UInt32),
    `ResolutionWidth` UInt16,
    `ResolutionHeight` UInt16,
    `ResolutionDepth` UInt8,
    `FlashMajor` UInt8,
    `FlashMinor` UInt8,
    `FlashMinor2` String,
    `NetMajor` UInt8,
    `NetMinor` UInt8,
    `UserAgentMajor` UInt16,
    `UserAgentMinor` FixedString(2),
    `CookieEnable` UInt8,
    `JavascriptEnable` UInt8,
    `IsMobile` UInt8,
    `MobilePhone` UInt8,
    `MobilePhoneModel` String,
    `Params` String,
    `IPNetworkID` UInt32,
    `TraficSourceID` Int8,
    `SearchEngineID` UInt16,
    `SearchPhrase` String,
    `AdvEngineID` UInt8,
    `IsArtifical` UInt8,
    `WindowClientWidth` UInt16,
    `WindowClientHeight` UInt16,
    `ClientTimeZone` Int16,
    `ClientEventTime` DateTime,
    `SilverlightVersion1` UInt8,
    `SilverlightVersion2` UInt8,
    `SilverlightVersion3` UInt32,
    `SilverlightVersion4` UInt16,
    `PageCharset` String,
    `CodeVersion` UInt32,
    `IsLink` UInt8,
    `IsDownload` UInt8,
    `IsNotBounce` UInt8,
    `FUniqID` UInt64,
    `HID` UInt32,
    `IsOldCounter` UInt8,
    `IsEvent` UInt8,
    `IsParameter` UInt8,
    `DontCountHits` UInt8,
    `WithHash` UInt8,
    `HitColor` FixedString(1),
    `UTCEventTime` DateTime,
    `Age` UInt8,
    `Sex` UInt8,
    `Income` UInt8,
    `Interests` UInt16,
    `Robotness` UInt8,
    `GeneralInterests` Array(UInt16),
    `RemoteIP` UInt32,
    `RemoteIP6` FixedString(16),
    `WindowName` Int32,
    `OpenerName` Int32,
    `HistoryLength` Int16,
    `BrowserLanguage` FixedString(2),
    `BrowserCountry` FixedString(2),
    `SocialNetwork` String,
    `SocialAction` String,
    `HTTPError` UInt16,
    `SendTiming` Int32,
    `DNSTiming` Int32,
    `ConnectTiming` Int32,
    `ResponseStartTiming` Int32,
    `ResponseEndTiming` Int32,
    `FetchTiming` Int32,
    `RedirectTiming` Int32,
    `DOMInteractiveTiming` Int32,
    `DOMContentLoadedTiming` Int32,
    `DOMCompleteTiming` Int32,
    `LoadEventStartTiming` Int32,
    `LoadEventEndTiming` Int32,
    `NSToDOMContentLoadedTiming` Int32,
    `FirstPaintTiming` Int32,
    `RedirectCount` Int8,
    `SocialSourceNetworkID` UInt8,
    `SocialSourcePage` String,
    `ParamPrice` Int64,
    `ParamOrderID` String,
    `ParamCurrency` FixedString(3),
    `ParamCurrencyID` UInt16,
    `GoalsReached` Array(UInt32),
    `OpenstatServiceName` String,
    `OpenstatCampaignID` String,
    `OpenstatAdID` String,
    `OpenstatSourceID` String,
    `UTMSource` String,
    `UTMMedium` String,
    `UTMCampaign` String,
    `UTMContent` String,
    `UTMTerm` String,
    `FromTag` String,
    `HasGCLID` UInt8,
    `RefererHash` UInt64,
    `URLHash` UInt64,
    `CLID` UInt32,
    `YCLID` UInt64,
    `ShareService` String,
    `ShareURL` String,
    `ShareTitle` String,
    `ParsedParams` Nested(
        Key1 String,
        Key2 String,
        Key3 String,
        Key4 String,
        Key5 String,
        ValueDouble Float64),
    `IslandID` FixedString(16),
    `RequestNum` UInt32,
    `RequestTry` UInt8
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID))
SAMPLE BY intHash32(UserID)
CREATE TABLE tutorial.visits_v1
(
    `CounterID` UInt32,
    `StartDate` Date,
    `Sign` Int8,
    `IsNew` UInt8,
    `VisitID` UInt64,
    `UserID` UInt64,
    `StartTime` DateTime,
    `Duration` UInt32,
    `UTCStartTime` DateTime,
    `PageViews` Int32,
    `Hits` Int32,
    `IsBounce` UInt8,
    `Referer` String,
    `StartURL` String,
    `RefererDomain` String,
    `StartURLDomain` String,
    `EndURL` String,
    `LinkURL` String,
    `IsDownload` UInt8,
    `TraficSourceID` Int8,
    `SearchEngineID` UInt16,
    `SearchPhrase` String,
    `AdvEngineID` UInt8,
    `PlaceID` Int32,
    `RefererCategories` Array(UInt16),
    `URLCategories` Array(UInt16),
    `URLRegions` Array(UInt32),
    `RefererRegions` Array(UInt32),
    `IsYandex` UInt8,
    `GoalReachesDepth` Int32,
    `GoalReachesURL` Int32,
    `GoalReachesAny` Int32,
    `SocialSourceNetworkID` UInt8,
    `SocialSourcePage` String,
    `MobilePhoneModel` String,
    `ClientEventTime` DateTime,
    `RegionID` UInt32,
    `ClientIP` UInt32,
    `ClientIP6` FixedString(16),
    `RemoteIP` UInt32,
    `RemoteIP6` FixedString(16),
    `IPNetworkID` UInt32,
    `SilverlightVersion3` UInt32,
    `CodeVersion` UInt32,
    `ResolutionWidth` UInt16,
    `ResolutionHeight` UInt16,
    `UserAgentMajor` UInt16,
    `UserAgentMinor` UInt16,
    `WindowClientWidth` UInt16,
    `WindowClientHeight` UInt16,
    `SilverlightVersion2` UInt8,
    `SilverlightVersion4` UInt16,
    `FlashVersion3` UInt16,
    `FlashVersion4` UInt16,
    `ClientTimeZone` Int16,
    `OS` UInt8,
    `UserAgent` UInt8,
    `ResolutionDepth` UInt8,
    `FlashMajor` UInt8,
    `FlashMinor` UInt8,
    `NetMajor` UInt8,
    `NetMinor` UInt8,
    `MobilePhone` UInt8,
    `SilverlightVersion1` UInt8,
    `Age` UInt8,
    `Sex` UInt8,
    `Income` UInt8,
    `JavaEnable` UInt8,
    `CookieEnable` UInt8,
    `JavascriptEnable` UInt8,
    `IsMobile` UInt8,
    `BrowserLanguage` UInt16,
    `BrowserCountry` UInt16,
    `Interests` UInt16,
    `Robotness` UInt8,
    `GeneralInterests` Array(UInt16),
    `Params` Array(String),
    `Goals` Nested(
        ID UInt32,
        Serial UInt32,
        EventTime DateTime,
        Price Int64,
        OrderID String,
        CurrencyID UInt32),
    `WatchIDs` Array(UInt64),
    `ParamSumPrice` Int64,
    `ParamCurrency` FixedString(3),
    `ParamCurrencyID` UInt16,
    `ClickLogID` UInt64,
    `ClickEventID` Int32,
    `ClickGoodEvent` Int32,
    `ClickEventTime` DateTime,
    `ClickPriorityID` Int32,
    `ClickPhraseID` Int32,
    `ClickPageID` Int32,
    `ClickPlaceID` Int32,
    `ClickTypeID` Int32,
    `ClickResourceID` Int32,
    `ClickCost` UInt32,
    `ClickClientIP` UInt32,
    `ClickDomainID` UInt32,
    `ClickURL` String,
    `ClickAttempt` UInt8,
    `ClickOrderID` UInt32,
    `ClickBannerID` UInt32,
    `ClickMarketCategoryID` UInt32,
    `ClickMarketPP` UInt32,
    `ClickMarketCategoryName` String,
    `ClickMarketPPName` String,
    `ClickAWAPSCampaignName` String,
    `ClickPageName` String,
    `ClickTargetType` UInt16,
    `ClickTargetPhraseID` UInt64,
    `ClickContextType` UInt8,
    `ClickSelectType` Int8,
    `ClickOptions` String,
    `ClickGroupBannerID` Int32,
    `OpenstatServiceName` String,
    `OpenstatCampaignID` String,
    `OpenstatAdID` String,
    `OpenstatSourceID` String,
    `UTMSource` String,
    `UTMMedium` String,
    `UTMCampaign` String,
    `UTMContent` String,
    `UTMTerm` String,
    `FromTag` String,
    `HasGCLID` UInt8,
    `FirstVisit` DateTime,
    `PredLastVisit` Date,
    `LastVisit` Date,
    `TotalVisits` UInt32,
    `TraficSource` Nested(
        ID Int8,
        SearchEngineID UInt16,
        AdvEngineID UInt8,
        PlaceID UInt16,
        SocialSourceNetworkID UInt8,
        Domain String,
        SearchPhrase String,
        SocialSourcePage String),
    `Attendance` FixedString(16),
    `CLID` UInt32,
    `YCLID` UInt64,
    `NormalizedRefererHash` UInt64,
    `SearchPhraseHash` UInt64,
    `RefererDomainHash` UInt64,
    `NormalizedStartURLHash` UInt64,
    `StartURLDomainHash` UInt64,
    `NormalizedEndURLHash` UInt64,
    `TopLevelDomain` UInt64,
    `URLScheme` UInt64,
    `OpenstatServiceNameHash` UInt64,
    `OpenstatCampaignIDHash` UInt64,
    `OpenstatAdIDHash` UInt64,
    `OpenstatSourceIDHash` UInt64,
    `UTMSourceHash` UInt64,
    `UTMMediumHash` UInt64,
    `UTMCampaignHash` UInt64,
    `UTMContentHash` UInt64,
    `UTMTermHash` UInt64,
    `FromHash` UInt64,
    `WebVisorEnabled` UInt8,
    `WebVisorActivity` UInt32,
    `ParsedParams` Nested(
        Key1 String,
        Key2 String,
        Key3 String,
        Key4 String,
        Key5 String,
        ValueDouble Float64),
    `Market` Nested(
        Type UInt8,
        GoalID UInt32,
        OrderID String,
        OrderPrice Int64,
        PP UInt32,
        DirectPlaceID UInt32,
        DirectOrderID UInt32,
        DirectBannerID UInt32,
        GoodID String,
        GoodName String,
        GoodQuantity Int32,
        GoodPrice Int64),
    `IslandID` FixedString(16)
)
ENGINE = CollapsingMergeTree(Sign)
PARTITION BY toYYYYMM(StartDate)
ORDER BY (CounterID, StartDate, intHash32(UserID), VisitID)
SAMPLE BY intHash32(UserID)

You can execute these queries using clickhouse-client's interactive mode (just launch it in a terminal; there is no need to specify the query in advance), or try some alternative interfaces if you want.

As we can see, hits_v1 uses the basic MergeTree engine, while visits_v1 uses the CollapsingMergeTree variant.

Import Data

Data is imported into ClickHouse with INSERT INTO queries, as in many other SQL databases. However, the data is usually provided in one of the supported serialization formats instead of the VALUES clause (which is also supported).

The files we downloaded earlier are in tab-delimited format, so here's how to import them via the console client:

clickhouse-client --query "INSERT INTO tutorial.hits_v1 FORMAT TSV" --max_insert_block_size=100000 < hits_v1.tsvclickhouse-client --query "INSERT INTO tutorial.visits_v1 FORMAT TSV" --max_insert_block_size=100000 < visits_v1.tsv

ClickHouse has a lot of settings to tune. One way to specify them in the console client is through command-line arguments, as we saw with --max_insert_block_size above. The easiest way to find out which settings are available, their meaning, and their default values is to query the system.settings table:

SELECT name, value, changed, description
FROM system.settings
WHERE name LIKE '%max_insert_b%'
FORMAT TSV

max_insert_block_size    1048576    0    "The maximum block size for insertion, if we control the creation of blocks for insertion."

You can also OPTIMIZE the imported tables. Tables configured with a MergeTree-family engine always merge data parts in the background to optimize data storage (or at least check whether it makes sense). These queries force the table engine to perform storage optimization right away, rather than some time later:

clickhouse-client --query "OPTIMIZE TABLE tutorial.hits_v1 FINAL"clickhouse-client --query "OPTIMIZE TABLE tutorial.visits_v1 FINAL"

These queries are I/O- and CPU-intensive, so if the table keeps receiving new data, it is better to leave it alone and let merges run in the background.

Now we can check if the table import was successful:

clickhouse-client --query "SELECT COUNT(*) FROM tutorial.hits_v1"clickhouse-client --query "SELECT COUNT(*) FROM tutorial.visits_v1"

Query examples

SELECT
    StartURL AS URL,
    AVG(Duration) AS AvgDuration
FROM tutorial.visits_v1
WHERE StartDate BETWEEN '2014-03-23' AND '2014-03-30'
GROUP BY URL
ORDER BY AvgDuration DESC
LIMIT 10

SELECT
    sum(Sign) AS visits,
    sumIf(Sign, has(Goals.ID, 1105530)) AS goal_visits,
    (100. * goal_visits) / visits AS goal_percent
FROM tutorial.visits_v1
WHERE (CounterID = 912887) AND (toYYYYMM(StartDate) = 201403) AND (domain(StartURL) = 'yandex.ru')

Cluster deployment

A ClickHouse cluster is a homogeneous cluster. Setup steps:

  1. Install the ClickHouse server on all machines in the cluster.

  2. Set up the cluster configuration in the configuration file.

  3. Create local tables on each instance.

  4. Create a distributed table.

A distributed table is actually a kind of view that maps to the local tables on the shards of a ClickHouse cluster. A SELECT query against a distributed table executes using the resources of all the cluster's shards. You can specify configurations for multiple clusters, and create multiple distributed tables that provide views into different clusters.

Example configuration for a cluster with three shards, each with one replica:

<remote_servers>
    <perftest_3shards_1replicas>
        <shard>
            <replica>
                <host>example-perftest01j.yandex.ru</host>
                <port>9000</port>
            </replica>
        </shard>
        <shard>
            <replica>
                <host>example-perftest02j.yandex.ru</host>
                <port>9000</port>
            </replica>
        </shard>
        <shard>
            <replica>
                <host>example-perftest03j.yandex.ru</host>
                <port>9000</port>
            </replica>
        </shard>
    </perftest_3shards_1replicas>
</remote_servers>

To demonstrate further, let's create a new local table using the same CREATE TABLE statement as the hits_v1 table, but with a different name:

CREATE TABLE tutorial.hits_local (...) ENGINE = MergeTree() ...

Then create a distributed table that provides a view into the cluster's local tables:

CREATE TABLE tutorial.hits_all AS tutorial.hits_local
ENGINE = Distributed(perftest_3shards_1replicas, tutorial, hits_local, rand());

It is common practice to create similar distributed tables on all machines of the cluster, which allows running distributed queries from any machine. There is also an alternative: create a temporary distributed table for a given SELECT query using the remote table function.
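A hedged sketch of that alternative, reusing the hostnames from the example configuration (the remote table function builds a temporary distributed view for a single query):

SELECT count()
FROM remote('example-perftest0{1,2,3}j.yandex.ru', tutorial.hits_local);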

Let's run an INSERT SELECT into the distributed table to spread its data across multiple servers.

INSERT INTO tutorial.hits_all SELECT * FROM tutorial.hits_v1;

Note: this method is not suitable for sharding large tables. There is a separate tool, clickhouse-copier, that can re-shard arbitrarily large tables.

As you'd expect, computationally intensive queries run N times faster if they use 3 servers instead of one.

In this case, we used a cluster with 3 shards, each containing a single replica.

To provide resiliency in a production environment, we recommend that each shard should contain 2-3 replicas spread across multiple availability zones or data centers (or at least racks). Note that ClickHouse supports an unlimited number of replicas.

Example configuration for a cluster with one shard containing three replicas:

<remote_servers>
    ...
    <perftest_1shards_3replicas>
        <shard>
            <replica>
                <host>example-perftest01j.yandex.ru</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>example-perftest02j.yandex.ru</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>example-perftest03j.yandex.ru</host>
                <port>9000</port>
            </replica>
        </shard>
    </perftest_1shards_3replicas>
</remote_servers>

Native replication requires ZooKeeper. ClickHouse takes care of data consistency on all replicas and runs a recovery process automatically after a failure. It is recommended to deploy the ZooKeeper cluster on separate servers (where no other processes, including ClickHouse, are running).

Note: ZooKeeper is not a strict requirement. In some simple cases, you can duplicate data by writing it to all replicas from your application code. This approach is not recommended: in this case, ClickHouse cannot guarantee data consistency on all replicas, so that guarantee becomes your application's responsibility.

The ZooKeeper location is specified in the configuration file:

<zookeeper>
    <node>
        <host>zoo01.yandex.ru</host>
        <port>2181</port>
    </node>
    <node>
        <host>zoo02.yandex.ru</host>
        <port>2181</port>
    </node>
    <node>
        <host>zoo03.yandex.ru</host>
        <port>2181</port>
    </node>
</zookeeper>

Additionally, we need to set macros to identify each shard and replica used to create the table:

<macros>
    <shard>01</shard>
    <replica>01</replica>
</macros>

If there are no replicas at the moment a replicated table is created, a new first replica is instantiated. If there are already live replicas, the new replica clones the data from the existing ones. You can either create all the replicated tables first and then insert data into them, or create some replicas and add the others during or after data insertion.

CREATE TABLE tutorial.hits_replica (...)
ENGINE = ReplicatedMergeTree(
    '/clickhouse_perftest/tables/{shard}/hits',
    '{replica}'
)
...

Here, we use the ReplicatedMergeTree table engine. In the parameters, we specify the ZooKeeper path containing the shard and replica identifiers.

INSERT INTO tutorial.hits_replica SELECT * FROM tutorial.hits_local;

Replication operates in multi-master mode. Data can be loaded into any replica and is then automatically synchronized with the other instances. Replication is asynchronous, so at a given moment not all replicas may contain recently inserted data. At least one replica must be up to allow data ingestion; the others will sync the data and repair consistency once they become active again. Note that this approach allows for a low possibility of losing recently inserted data.
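As a hedged way to observe this asynchrony, each server exposes the state of its replicated tables in the system.replicas table (the exact column set varies by version):

SELECT database, table, is_leader, absolute_delay, queue_size
FROM system.replicas
WHERE table = 'hits_replica';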

Official website: https://clickhouse.com/docs/zh/introduction/performance
