1 Overview

Reprinted: Clickhouse query statement sample

Note:

The SAMPLE clause can only be used with tables of the MergeTree engine family, and the sampling expression must be declared with SAMPLE BY when the table is created.

Approximate computation: the SAMPLE clause provides data sampling, so that a query processes and returns only a sample of the data instead of all of it, which effectively reduces the query load.

Sampling is designed to be idempotent: as long as the data does not change, applying the same sampling rule returns the same data. This makes it well suited to scenarios where approximate query results are acceptable.
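The idempotent behaviour can be illustrated with a small Python sketch. This is a simplified model, not ClickHouse's implementation: it assumes a row is kept when the hash of its sample key falls below a threshold in a 32-bit hash space, and it uses SHA-256 as a stand-in for intHash32. The names sample_key and sample are hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 32  # size of a 32-bit hash space, like the UInt32 result of intHash32


def sample_key(user_id: int) -> int:
    """Stand-in for intHash32(UserID): first 4 bytes of SHA-256 as an unsigned int."""
    digest = hashlib.sha256(user_id.to_bytes(8, "little")).digest()
    return int.from_bytes(digest[:4], "little")


def sample(user_ids, k: float):
    """Keep the rows whose hash falls in the first k-th of the hash space (assumed model)."""
    upper = int(k * HASH_SPACE)
    return [u for u in user_ids if sample_key(u) < upper]


user_ids = list(range(100_000))
first = sample(user_ids, 0.1)
second = sample(user_ids, 0.1)

print(first == second)             # the same rule selects the same rows
print(len(first) / len(user_ids))  # close to 0.1, but not exactly 0.1
```

Because membership depends only on the hash of the key, repeating the query over unchanged data cannot pick a different subset, and the sampled fraction is only approximately k.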

The official documentation lists the following use cases:

When you have strict timing requirements (like <100ms) but you can't justify the cost of additional hardware resources to meet them.
When your raw data is not accurate, so approximation doesn't noticeably degrade the quality.
When business requirements target approximate results (for cost-effectiveness, or to market exact results to premium users).

Clickhouse> create table clicks(CounterID UInt64,EventDate DATE, UserID UInt64) engine=MergeTree() 
order by (CounterID,intHash32(UserID)) sample by intHash32(UserID);
 
CREATE TABLE clicks
(
    `CounterID` UInt64,
    `EventDate` DATE,
    `UserID` UInt64
)
ENGINE = MergeTree()
ORDER BY (CounterID, intHash32(UserID))
SAMPLE BY intHash32(UserID)

Insert test data:

Clickhouse> insert into clicks select CounterID,EventDate,UserID from hits_v1;
 
INSERT INTO clicks SELECT 
    CounterID,
    EventDate,
    UserID
FROM hits_v1
 
Ok.
 
0 rows in set. Elapsed: 1.003 sec. Processed 8.87 million rows, 124.23 MB (8.85 million rows/s., 123.88 MB/s.)

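The clicks table is defined so that sampling queries select rows according to the distribution of intHash32(UserID) values.
There are two points to note when declaring the sample key:

The expression declared in SAMPLE BY must also be included in the primary key declaration.
The sample key must be of an unsigned integer (UInt) type. If it is not, the table can still be created, but sampling queries will throw an exception.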

The SAMPLE clause supports three formats:

1. sample k
k is the sampling factor, with a value in the range [0,1]. A decimal strictly between 0 and 1 means sampling at that ratio; 0 or 1 is equivalent to not sampling.

select CounterID from clicks sample 0.1
is equivalent to:
select CounterID from clicks sample 1/10

Query to obtain approximate results:

Clickhouse> select count() from clicks;
 
SELECT count()
FROM clicks
 
┌─count()─┐
│ 8873898 │
└─────────┘
 
1 rows in set. Elapsed: 0.003 sec. 
 
Clickhouse> select count() from clicks sample 0.1;
 
SELECT count()
FROM clicks
SAMPLE 1 / 10
 
┌─count()─┐
│  839889 │
└─────────┘
 
1 rows in set. Elapsed: 0.029 sec. Processed 5.89 million rows, 94.27 MB (201.86 million rows/s., 3.23 GB/s.) 
 
 
Clickhouse> select CounterID,_sample_factor from clicks sample 0.1 limit 2;
 
SELECT 
    CounterID,
    _sample_factor
FROM clicks
SAMPLE 1 / 10
LIMIT 2
 
┌─CounterID─┬─_sample_factor─┐
│        57 │             10 │
│        57 │             10 │
└───────────┴────────────────┘
 
2 rows in set. Elapsed: 0.012 sec.

The sampling factor can be read from the virtual column _sample_factor. For sample 0.1 it is 10, the reciprocal of the sampling ratio.
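To turn a sampled count into an estimate of the full count, the sampled result is multiplied by the sampling factor. The following Python snippet simply redoes that arithmetic with the numbers from the output above; the variable names are illustrative only.

```python
sampled_count = 839_889  # count() with sample 0.1, from the output above
sample_factor = 10       # _sample_factor reported for sample 0.1
exact_count = 8_873_898  # count() without sampling

estimate = sampled_count * sample_factor
relative_error = abs(estimate - exact_count) / exact_count

print(estimate)                 # 8398890
print(f"{relative_error:.1%}")  # 5.4%
```

In this run the estimate is off by about 5%, which is the kind of deviation to expect when trading accuracy for a lighter query.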

2. sample n
n is the approximate minimum number of rows to sample. n = 1 means no sampling is used, and n ranges from 2 to the total number of rows in the table.

Clickhouse> select count() from clicks sample 10000;
 
SELECT count()
FROM clicks
SAMPLE 10000
 
┌─count()─┐
│    9251 │
└─────────┘
 
1 rows in set. Elapsed: 0.025 sec. Processed 5.48 million rows, 87.72 MB (223.47 million rows/s., 3.58 GB/s.) 
 
Clickhouse> select count()*any(_sample_factor) from clicks sample 10000;
 
SELECT count() * any(_sample_factor)
FROM clicks
SAMPLE 10000
 
┌─multiply(count(), any(_sample_factor))─┐
│                      8154379.059200001 │
└────────────────────────────────────────┘
 
1 rows in set. Elapsed: 0.024 sec. Processed 5.48 million rows, 54.82 MB (229.44 million rows/s., 2.29 GB/s.) 
 
Clickhouse> select CounterID,_sample_factor from clicks sample 10000 limit 2;
 
SELECT 
    CounterID,
    _sample_factor
FROM clicks
SAMPLE 10000
LIMIT 2
 
┌─CounterID─┬────_sample_factor─┐
│      1294 │ 881.4592000000001 │
└───────────┴───────────────────┘
┌─CounterID─┬────_sample_factor─┐
│      1366 │ 881.4592000000001 │
└───────────┴───────────────────┘
 
2 rows in set. Elapsed: 0.041 sec. Processed 7.69 thousand rows, 123.01 KB (187.84 thousand rows/s., 3.01 MB/s.)

The number of rows sampled is approximate, and the minimum granularity of sampled data is determined by index_granularity.
Setting n to a value smaller than the index granularity is therefore meaningless.
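The numbers in the sample 10000 output can be checked with a few lines of Python. The first calculation reproduces count() * any(_sample_factor). The second is only an inference from these figures, not something the source states: multiplying n by the reported factor gives the row total that the factor appears to be based on.

```python
requested_rows = 10_000            # n in sample 10000
sampled_count = 9_251              # count() actually returned
sample_factor = 881.4592000000001  # _sample_factor from the output above

# Estimate of the full row count, as in count() * any(_sample_factor).
estimate = sampled_count * sample_factor

# Inferred (assumption): the row total implied by factor = total / n.
implied_total = requested_rows * sample_factor

print(round(estimate))       # 8154379
print(round(implied_total))  # 8814592
```

The implied total of about 8.81 million rows is close to, but not the same as, the exact 8,873,898 rows, which fits the remark that row-based sampling works with approximate values.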

3. sample k offset n
Samples with factor k, starting from offset n (for example, sample 0.4 offset 0.5).

Clickhouse> select CounterID,_sample_factor from clicks sample 0.4 offset 0.5 limit 1;
 
SELECT 
    CounterID,
    _sample_factor
FROM clicks
SAMPLE 4 / 10 OFFSET 5 / 10
LIMIT 1
 
┌─CounterID─┬─_sample_factor─┐
│        57 │            2.5 │
└───────────┴────────────────┘
 
1 rows in set. Elapsed: 0.017 sec. 
 
Clickhouse> select CounterID,_sample_factor from clicks sample 0.6 offset 0.5 limit 1;
 
SELECT 
    CounterID,
    _sample_factor
FROM clicks
SAMPLE 6 / 10 OFFSET 5 / 10
LIMIT 1
 
┌─CounterID─┬─────_sample_factor─┐
│        57 │ 1.6666666666666667 │
└───────────┴────────────────────┘
 
1 rows in set. Elapsed: 0.007 sec.

When the sampling range overflows (the offset value plus the sample value is greater than 1), the overflowing part is automatically truncated.
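A minimal Python sketch of how the two queries above can be pictured, assuming the sample is a window over the hash space that starts at the offset and is cut off at the end of the space. The function sample_window is hypothetical, and the factor 1/k is taken from the _sample_factor values in the output rather than from ClickHouse internals.

```python
from fractions import Fraction


def sample_window(k: Fraction, offset: Fraction):
    """Return (lower, upper, factor) as fractions of the hash space (assumed model)."""
    lower = offset
    upper = min(offset + k, Fraction(1))  # assumed: the part past the end is cut off
    factor = 1 / k                        # matches the _sample_factor values shown above
    return lower, upper, factor


for k, offset in [(Fraction(4, 10), Fraction(5, 10)), (Fraction(6, 10), Fraction(5, 10))]:
    lower, upper, factor = sample_window(k, offset)
    covered = upper - lower
    print(f"k={float(k)} offset={float(offset)} covered={float(covered)} factor={float(factor)}")
```

Under this model, sample 0.6 offset 0.5 covers only half of the space while the reported factor is still 1/0.6, so results scaled by _sample_factor would probably come out too low in the overflow case.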
