[ClickHouse series] How to use ClickHouse to generate random test data

Starting with version 22.10, ClickHouse includes powerful functionality for generating random data with a high degree of flexibility.

uniform random distribution

randCanonical

Like every database and programming language, ClickHouse has a canonical random function. Use the randCanonical function to return a pseudorandom value uniformly distributed in the [0, 1) interval:

SELECT randCanonical()
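
Since a new value is drawn for every row, combining it with the numbers() table function yields as many samples as needed, a pattern used throughout this post. A trivial sketch:

SELECT randCanonical() AS r
FROM numbers(3)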

randUniform

To generate random numbers in the range X…Y, use the randUniform function. It returns a floating-point value uniformly distributed in the left-closed, right-open interval [X, Y):

SELECT randUniform(5, 10)

floor

To generate a random integer, you can use the floor() function to round the result down, returning an integer in the interval [X, Y):

SELECT floor(randUniform(X, Y)) AS r
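
For instance, a small sketch that produces integer dice rolls in the inclusive range 1–6 (the upper bound is passed as 7 because the interval is right-open):

SELECT floor(randUniform(1, 7)) AS dice
FROM numbers(5)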

non-uniform random distribution

Version 22.10 of ClickHouse provides random functions capable of generating non-uniform (continuous and discrete) distributions.

randNormal

This function takes the mean as its first argument and the variance as its second, and outputs floating-point numbers centered around the mean. Let's use randNormal(100, 5) to look at the generated distribution:

SELECT
    floor(randNormal(100, 5)) AS k,
    count(*) AS c,
    bar(c, 0, 50000, 100)
FROM numbers(100000)
GROUP BY k
ORDER BY k ASC

Results of the query:

┌───k─┬────c─┬─bar(count(), 0, 50000, 100)─┐
│  79 │    1 │                             │
│  80 │    5 │                             │
│  81 │    8 │                             │
│  82 │   17 │                             │
│  83 │   35 │                             │
│  84 │   80 │ ▏                           │
│  85 │  130 │ ▎                           │
│  86 │  208 │ ▍                           │
│  87 │  362 │ ▋                           │
│  88 │  559 │ █                           │
│  89 │  868 │ █▋                          │
│  90 │ 1347 │ ██▋                         │
│  91 │ 1798 │ ███▌                        │
│  92 │ 2648 │ █████▎                      │
│  93 │ 3471 │ ██████▊                     │
│  94 │ 4365 │ ████████▋                   │
│  95 │ 5285 │ ██████████▌                 │
│  96 │ 6081 │ ████████████▏               │
│  97 │ 7019 │ ██████████████              │
│  98 │ 7599 │ ███████████████▏            │
│  99 │ 8072 │ ████████████████▏           │
│ 100 │ 7909 │ ███████████████▋            │
│ 101 │ 7565 │ ███████████████▏            │
│ 102 │ 6994 │ █████████████▊              │
│ 103 │ 6240 │ ████████████▍               │
│ 104 │ 5384 │ ██████████▋                 │
│ 105 │ 4411 │ ████████▋                   │
│ 106 │ 3507 │ ███████                     │
│ 107 │ 2595 │ █████▏                      │
│ 108 │ 1890 │ ███▋                        │
│ 109 │ 1337 │ ██▋                         │
│ 110 │  890 │ █▋                          │
│ 111 │  538 │ █                           │
│ 112 │  336 │ ▋                           │
│ 113 │  221 │ ▍                           │
│ 114 │  103 │ ▏                           │
│ 115 │   63 │ ▏                           │
│ 116 │   24 │                             │
│ 117 │   26 │                             │
│ 118 │    4 │                             │
│ 119 │    4 │                             │
│ 121 │    1 │                             │
└─────┴──────┴─────────────────────────────┘

Here we generate 100,000 random numbers with randNormal(), round them down, and count the occurrences of each value. We can see that most of the time the function generates a random number close to the given mean, which is exactly how the normal distribution works. A normal distribution arises when we sum independent variables (such as errors in an aggregated system).
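
As a quick sketch, the same function can directly produce plausible metric-like values, here assuming a hypothetical response time centered around 250 ms:

SELECT round(randNormal(250, 10)) AS response_ms
FROM numbers(5)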

randBinomial

The binomial distribution models the number of successes in a series of yes-or-no trials; for coin tosses, it models the total number of heads. It visualizes similarly to a normal distribution:

SELECT
    floor(randBinomial(100, 0.85)) AS k,
    bar(count(*), 0, 50000, 100) AS b1
FROM numbers(100000)
GROUP BY k
ORDER BY k ASC

Results of the query:

┌──k─┬─b1──────────────────────┐
│ 68 │                         │
│ 70 │                         │
│ 71 │                         │
│ 72 │                         │
│ 73 │ ▏                       │
│ 74 │ ▎                       │
│ 75 │ ▋                       │
│ 76 │ █▏                      │
│ 77 │ █▊                      │
│ 78 │ ███▍                    │
│ 79 │ █████▎                  │
│ 80 │ ███████▊                │
│ 81 │ ███████████▏            │
│ 82 │ ███████████████         │
│ 83 │ ██████████████████▏     │
│ 84 │ █████████████████████   │
│ 85 │ ██████████████████████▏ │
│ 86 │ █████████████████████▋  │
│ 87 │ ███████████████████▋    │
│ 88 │ ████████████████▋       │
│ 89 │ ████████████▋           │
│ 90 │ █████████               │
│ 91 │ █████▍                  │
│ 92 │ ██▊                     │
│ 93 │ █▍                      │
│ 94 │ ▌                       │
│ 95 │ ▏                       │
│ 96 │                         │
│ 97 │                         │
│ 98 │                         │
└────┴─────────────────────────┘

randNegativeBinomial

The negative binomial distribution is similar, but models the number of attempts needed to reach a particular binary outcome, such as the number of coin tosses required to get a specified number of tails.

SELECT
    floor(randNegativeBinomial(100, 0.85)) AS k,
    bar(count(*), 0, 50000, 100) AS b1
FROM numbers(100000)
GROUP BY k
ORDER BY k ASC

Results of the query:

┌──k─┬─b1─────────────────┐
│  3 │                    │
│  4 │                    │
│  5 │                    │
│  6 │ ▎                  │
│  7 │ ▋                  │
│  8 │ █▍                 │
│  9 │ ██▋                │
│ 10 │ ████▎              │
│ 11 │ ██████▎            │
│ 12 │ ████████▊          │
│ 13 │ ███████████▌       │
│ 14 │ █████████████▊     │
│ 15 │ ████████████████▏  │
│ 16 │ █████████████████  │
│ 17 │ █████████████████▋ │
│ 18 │ █████████████████▍ │
│ 19 │ ███████████████▊   │
│ 20 │ ██████████████▏    │
│ 21 │ ████████████▏      │
│ 22 │ ██████████▎        │
│ 23 │ ███████▋           │
│ 24 │ ██████▎            │
│ 25 │ ████▍              │
│ 26 │ ███▎               │
│ 27 │ ██▍                │
│ 28 │ █▋                 │
│ 29 │ ▊                  │
│ 30 │ ▋                  │
│ 31 │ ▍                  │
│ 32 │ ▏                  │
│ 33 │ ▏                  │
│ 34 │                    │
│ 35 │                    │
│ 36 │                    │
│ 37 │                    │
│ 38 │                    │
│ 39 │                    │
│ 40 │                    │
│ 41 │                    │
│ 43 │                    │
└────┴────────────────────┘

randLogNormal

The log-normal distribution is often useful for modeling natural phenomena such as failure rates, game lengths, and income distribution.

SELECT
    floor(randLogNormal(1 / 100, 0.75)) AS k,
    bar(count(*), 0, 50000, 10) AS b1
FROM numbers(100000)
GROUP BY k
ORDER BY k ASC

Results of the query:

┌──k─┬─b1─────────┐
│  0 │ █████████▊ │
│  1 │ ██████▍    │
│  2 │ ██▏        │
│  3 │ ▋          │
│  4 │ ▎          │
│  5 │ ▏          │
│  6 │            │
│  7 │            │
│  8 │            │
│  9 │            │
│ 10 │            │
│ 11 │            │
│ 12 │            │
│ 13 │            │
│ 14 │            │
│ 15 │            │
│ 16 │            │
│ 17 │            │
│ 18 │            │
│ 22 │            │
│ 23 │            │
│ 25 │            │
│ 31 │            │
│ 34 │            │
└────┴────────────┘

randExponential

The exponential distribution can be used to model durations, such as the length of a customer's phone call, or amounts, such as total sales.

SELECT
    floor(randExponential(1 / 2)) AS k,
    bar(count(*), 0, 50000, 10) AS b1
FROM numbers(100000)
GROUP BY k
ORDER BY k ASC

Results of the query:

┌──k─┬─b1───────┐
│  0 │ ███████▋ │
│  1 │ ████▋    │
│  2 │ ██▊      │
│  3 │ █▋       │
│  4 │ █        │
│  5 │ ▋        │
│  6 │ ▍        │
│  7 │ ▏        │
│  8 │ ▏        │
│  9 │          │
│ 10 │          │
│ 11 │          │
│ 12 │          │
│ 13 │          │
│ 14 │          │
│ 15 │          │
│ 16 │          │
│ 17 │          │
│ 18 │          │
│ 19 │          │
│ 20 │          │
│ 21 │          │
│ 22 │          │
│ 26 │          │
└────┴──────────┘

randChiSquared

The chi-squared distribution is the distribution of the sum of squares of k independent standard normal random variables. It is mostly useful for statistical hypothesis testing, specifically checking whether a dataset matches a given distribution.

SELECT
    floor(randChiSquared(10)) AS k,
    bar(count(*), 0, 10000, 10) AS b1
FROM numbers(100000)
GROUP BY k
ORDER BY k ASC

Results of the query:

┌──k─┬─b1─────────┐
│  0 │            │
│  1 │ ▎          │
│  2 │ █▍         │
│  3 │ ███▍       │
│  4 │ █████▋     │
│  5 │ ███████▋   │
│  6 │ ████████▊  │
│  7 │ █████████▌ │
│  8 │ █████████▋ │
│  9 │ █████████  │
│ 10 │ ████████▏  │
│ 11 │ ███████▎   │
│ 12 │ ██████▏    │
│ 13 │ ████▊      │
│ 14 │ ████       │
│ 15 │ ███▏       │
│ 16 │ ██▌        │
│ 17 │ █▊         │
│ 18 │ █▍         │
│ 19 │ █          │
│ 20 │ ▋          │
│ 21 │ ▌          │
│ 22 │ ▍          │
│ 23 │ ▎          │
│ 24 │ ▏          │
│ 25 │ ▏          │
│ 26 │ ▏          │
│ 27 │            │
│ 28 │            │
│ 29 │            │
│ 30 │            │
│ 31 │            │
│ 32 │            │
│ 33 │            │
│ 34 │            │
│ 35 │            │
│ 36 │            │
│ 37 │            │
│ 38 │            │
│ 39 │            │
└────┴────────────┘

randStudentT

Student's t-distribution, commonly used for hypothesis testing on small samples.

SELECT
    floor(randStudentT(4.5)) AS k,
    bar(count(*), 0, 10000, 10) AS b1
FROM numbers(100000)
GROUP BY k
ORDER BY k ASC

Results of the query:

┌───k─┬─b1─────────┐
│ -18 │            │
│ -17 │            │
│ -15 │            │
│ -14 │            │
│ -12 │            │
│ -11 │            │
│ -10 │            │
│  -9 │            │
│  -8 │            │
│  -7 │            │
│  -6 │ ▏          │
│  -5 │ ▎          │
│  -4 │ █▏         │
│  -3 │ ███▋       │
│  -2 │ ██████████ │
│  -1 │ ██████████ │
│   0 │ ██████████ │
│   1 │ ██████████ │
│   2 │ ███▋       │
│   3 │ █▏         │
│   4 │ ▎          │
│   5 │ ▏          │
│   6 │            │
│   7 │            │
│   8 │            │
│   9 │            │
│  10 │            │
│  11 │            │
│  12 │            │
│  13 │            │
│  14 │            │
│  17 │            │
│  20 │            │
└─────┴────────────┘

randFisherF

The F-distribution is mainly used in statistical tests to assess whether two populations have the same variance.

SELECT
    floor(randFisherF(3, 20)) AS k,
    bar(count(*), 0, 10000, 10) AS b1
FROM numbers(100000)
GROUP BY k
ORDER BY k ASC

Results of the query:

┌──k─┬─b1─────────┐
│  0 │ ██████████ │
│  1 │ ██████████ │
│  2 │ █████████  │
│  3 │ ███▏       │
│  4 │ █▎         │
│  5 │ ▍          │
│  6 │ ▏          │
│  7 │            │
│  8 │            │
│  9 │            │
│ 10 │            │
│ 11 │            │
│ 12 │            │
│ 13 │            │
│ 21 │            │
└────┴────────────┘

randPoisson

The Poisson distribution can be used to model the number of events over a period of time (such as goals in a football match) or intervals between events (such as log messages).

SELECT
    floor(randPoisson(10)) AS k,
    bar(count(*), 0, 15000, 10) AS b1
FROM numbers(100000)
GROUP BY k
ORDER BY k ASC

Results of the query:

┌──k─┬─b1────────┐
│  0 │           │
│  1 │           │
│  2 │ ▏         │
│  3 │ ▌         │
│  4 │ █▏        │
│  5 │ ██▌       │
│  6 │ ████▏     │
│  7 │ ██████    │
│  8 │ ███████▌  │
│  9 │ ████████▎ │
│ 10 │ ████████▍ │
│ 11 │ ███████▌  │
│ 12 │ ██████▎   │
│ 13 │ ████▋     │
│ 14 │ ███▌      │
│ 15 │ ██▎       │
│ 16 │ █▍        │
│ 17 │ ▋         │
│ 18 │ ▍         │
│ 19 │ ▏         │
│ 20 │           │
│ 21 │           │
│ 22 │           │
│ 23 │           │
│ 24 │           │
│ 25 │           │
└────┴───────────┘

randBernoulli

The Bernoulli distribution can be used to model the failure or success of a particular operation.

SELECT
    floor(randBernoulli(0.75)) AS k,
    count(*) AS c
FROM numbers(100000)
GROUP BY k
ORDER BY k ASC

Results of the query:

┌─k─┬─────c─┐
│ 0 │ 24976 │
│ 1 │ 75024 │
└───┴───────┘

generate random data

We can pick whichever random generator fits our requirements and populate a table with test data. Let's populate a purchases table representing product sales:

CREATE TABLE purchases
(
    `dt` DateTime,
    `customer_id` UInt32,
    `total_spent` Float32
)
ENGINE = MergeTree
ORDER BY dt

We'll use the randExponential() function to generate data for the total_spent column, modeling the distribution of customer sales:

INSERT INTO purchases SELECT
    now() - randUniform(1, 1000000.),
    number,
    15 + round(randExponential(1 / 10), 2)
FROM numbers(1000000)

We populate the data using a sequential customer ID and a uniform random offset in time. We can see that the total_spent values are distributed exponentially, tending towards 15 (assuming $15.00 is the lowest possible purchase amount):

SELECT
    floor(total_spent) AS s,
    count(*) AS n,
    bar(n, 0, 350000, 50)
FROM purchases
GROUP BY s
ORDER BY s ASC

Results of the query:

┌───s─┬─────n─┬─bar(count(), 0, 350000, 50)─┐
│  15 │ 94520 │ █████████████▌              │
│  16 │ 86711 │ ████████████▍               │
│  17 │ 77290 │ ███████████                 │
│  18 │ 70446 │ ██████████                  │
│  19 │ 63580 │ █████████                   │
│  20 │ 57635 │ ████████▏                   │
│  21 │ 52518 │ ███████▌                    │
│  22 │ 47777 │ ██████▋                     │
│  23 │ 42808 │ ██████                      │
│  24 │ 38729 │ █████▌                      │
│  25 │ 34912 │ ████▊                       │
│  26 │ 31604 │ ████▌                       │
│  27 │ 28542 │ ████                        │
│  28 │ 26128 │ ███▋                        │
│  29 │ 23520 │ ███▎                        │
│  30 │ 21383 │ ███                         │
│  31 │ 19207 │ ██▋                         │
│  32 │ 17156 │ ██▍                         │
│  33 │ 15656 │ ██▏                         │
│  34 │ 14171 │ ██                          │
│  35 │ 12855 │ █▋                          │
│  36 │ 11772 │ █▋                          │
│  37 │ 10481 │ █▍                          │
│  38 │  9542 │ █▎                          │
│  39 │  8538 │ █▏                          │
│  40 │  7854 │ █                           │
│  41 │  7064 │ █                           │
│  42 │  6467 │ ▊                           │
│  43 │  5901 │ ▋                           │
│  44 │  5418 │ ▋                           │
│  45 │  4838 │ ▋                           │
│  46 │  4198 │ ▌                           │
│  47 │  3760 │ ▌                           │
│  48 │  3542 │ ▌                           │
│  49 │  3188 │ ▍                           │
│  50 │  2858 │ ▍                           │
│  51 │  2631 │ ▍                           │
│  52 │  2347 │ ▎                           │
│  53 │  2175 │ ▎                           │
│  54 │  1896 │ ▎                           │
│  55 │  1723 │ ▏                           │
│  56 │  1611 │ ▏                           │
│  57 │  1408 │ ▏                           │
│  58 │  1253 │ ▏                           │
│  59 │  1246 │ ▏                           │
│  60 │  1089 │ ▏                           │
│  61 │   976 │ ▏                           │
│  62 │   859 │                             │
│  63 │   785 │                             │
│  64 │   741 │                             │
│  65 │   666 │                             │
│  66 │   553 │                             │
│  67 │   524 │                             │
│  68 │   479 │                             │
│  69 │   394 │                             │
│  70 │   386 │                             │
│  71 │   356 │                             │
│  72 │   305 │                             │
│  73 │   307 │                             │
│  74 │   244 │                             │
│  75 │   233 │                             │
│  76 │   214 │                             │
│  77 │   189 │                             │
│  78 │   160 │                             │
│  79 │   154 │                             │
│  80 │   136 │                             │
│  81 │   131 │                             │
│  82 │   118 │                             │
│  83 │   121 │                             │
│  84 │   110 │                             │
│  85 │    90 │                             │
│  86 │    67 │                             │
│  87 │    76 │                             │
│  88 │    62 │                             │
│  89 │    73 │                             │
│  90 │    60 │                             │
│  91 │    52 │                             │
│  92 │    45 │                             │
│  93 │    41 │                             │
│  94 │    44 │                             │
│  95 │    23 │                             │
│  96 │    22 │                             │
│  97 │    24 │                             │
│  98 │    14 │                             │
│  99 │    16 │                             │
│ 100 │    16 │                             │
│ 101 │    28 │                             │
│ 102 │    16 │                             │
│ 103 │    16 │                             │
│ 104 │    10 │                             │
│ 105 │    17 │                             │
│ 106 │    10 │                             │
│ 107 │     8 │                             │
│ 108 │     5 │                             │
│ 109 │    12 │                             │
│ 110 │     6 │                             │
│ 111 │     3 │                             │
│ 112 │     5 │                             │
│ 113 │     3 │                             │
│ 114 │     4 │                             │
│ 115 │     4 │                             │
│ 116 │     6 │                             │
│ 117 │     4 │                             │
│ 118 │     2 │                             │
│ 119 │     3 │                             │
│ 120 │     3 │                             │
│ 121 │     2 │                             │
│ 122 │     1 │                             │
│ 123 │     2 │                             │
│ 124 │     1 │                             │
│ 125 │     1 │                             │
│ 128 │     2 │                             │
│ 129 │     2 │                             │
│ 130 │     1 │                             │
│ 131 │     2 │                             │
│ 132 │     1 │                             │
│ 135 │     1 │                             │
│ 137 │     1 │                             │
│ 141 │     1 │                             │
│ 154 │     1 │                             │
│ 155 │     2 │                             │
└─────┴───────┴─────────────────────────────┘

Notice how the exponential distribution makes higher total spend values progressively rarer. We could use a normal distribution (with the randNormal() function) or any other distribution to obtain different peaks and shapes.
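
As a rough sketch, the same insert with a normal distribution centered on a hypothetical $35 average basket could look like this (greatest() keeps the $15 minimum in place):

INSERT INTO purchases SELECT
    now() - randUniform(1, 1000000.),
    number,
    greatest(15., round(randNormal(35, 10), 2))
FROM numbers(1000000)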

Generating time-distributed data

While in our previous example we modeled values using a random distribution, we can also model time. Suppose we collect client events into the following table:

CREATE TABLE events
(
    `dt` DateTime,
    `event` String
)
ENGINE = MergeTree
ORDER BY dt

In reality, more events occur at certain times of day. The Poisson distribution is a good way to model a sequence of independent events over time. To simulate such a time distribution, we simply offset the time column by the generated random values:

INSERT INTO events SELECT
    toDateTime('2022-12-12 12:00:00') - (((12 + randPoisson(12)) * 60) * 60),
    'click'
FROM numbers(100000)

Here we insert 100,000 click events spread over approximately 24 hours, with noon being the peak event time (12:00 in our example):

SELECT
    toStartOfHour(dt) AS hour,
    count(*) AS c,
    bar(c, 0, 15000, 50)
FROM events
GROUP BY hour
ORDER BY hour ASC

Results of the query:

┌────────────────hour─┬─────c─┬─bar(count(), 0, 15000, 50)──────────────┐
│ 2022-12-10 16:00:00 │     1 │                                         │
│ 2022-12-10 20:00:00 │     3 │                                         │
│ 2022-12-10 21:00:00 │    13 │                                         │
│ 2022-12-10 22:00:00 │    19 │                                         │
│ 2022-12-10 23:00:00 │    42 │ ▏                                       │
│ 2022-12-11 00:00:00 │    71 │ ▏                                       │
│ 2022-12-11 01:00:00 │   183 │ ▌                                       │
│ 2022-12-11 02:00:00 │   289 │ ▊                                       │
│ 2022-12-11 03:00:00 │   543 │ █▋                                      │
│ 2022-12-11 04:00:00 │   971 │ ███▏                                    │
│ 2022-12-11 05:00:00 │  1606 │ █████▎                                  │
│ 2022-12-11 06:00:00 │  2662 │ ████████▋                               │
│ 2022-12-11 07:00:00 │  3830 │ ████████████▋                           │
│ 2022-12-11 08:00:00 │  5342 │ █████████████████▋                      │
│ 2022-12-11 09:00:00 │  7214 │ ████████████████████████                │
│ 2022-12-11 10:00:00 │  8896 │ █████████████████████████████▋          │
│ 2022-12-11 11:00:00 │ 10563 │ ███████████████████████████████████▏    │
│ 2022-12-11 12:00:00 │ 11502 │ ██████████████████████████████████████▎ │
│ 2022-12-11 13:00:00 │ 11532 │ ██████████████████████████████████████▍ │
│ 2022-12-11 14:00:00 │ 10581 │ ███████████████████████████████████▎    │
│ 2022-12-11 15:00:00 │  8729 │ █████████████████████████████           │
│ 2022-12-11 16:00:00 │  6618 │ ██████████████████████                  │
│ 2022-12-11 17:00:00 │  4304 │ ██████████████▎                         │
│ 2022-12-11 18:00:00 │  2536 │ ████████▍                               │
│ 2022-12-11 19:00:00 │  1220 │ ████                                    │
│ 2022-12-11 20:00:00 │   516 │ █▋                                      │
│ 2022-12-11 21:00:00 │   165 │ ▌                                       │
│ 2022-12-11 22:00:00 │    41 │ ▏                                       │
│ 2022-12-11 23:00:00 │     7 │                                         │
│ 2022-12-12 00:00:00 │     1 │                                         │
└─────────────────────┴───────┴─────────────────────────────────────────┘

In this case, instead of generating a value, we use a random function to compute the point in time at which each new record is inserted.
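
Other distributions model other temporal patterns. As a sketch reusing the same events table (the 'signup' event name is just an illustration), an exponential distribution could simulate activity that spikes at a launch time and then decays:

INSERT INTO events SELECT
    toDateTime('2022-12-12 12:00:00') + toUInt32(randExponential(1 / 3600)),
    'signup'
FROM numbers(100000)

Here the offset in seconds after the launch time is exponentially distributed with a mean of one hour.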

generate time-dependent values

Building on the previous example, we can use distributions to generate time-dependent values. For example, suppose we want to simulate hardware metric collection, such as CPU utilization or RAM usage, into the following table:

CREATE TABLE metrics
(
    `name` String,
    `dt` DateTime,
    `val` Float32
)
ENGINE = MergeTree
ORDER BY (name, dt)

In reality, we will experience peak hours when the CPU is fully loaded and periods of lower load. To model this, we can use a random function with the desired distribution to compute both the metric values and the points in time:

INSERT INTO metrics SELECT
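    -- outer level: jitter each timestamp within its hour and scale each value by a random factor in [0.95, 1.0)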
    'cpu',
    t + ((60 * 60) * randCanonical()) AS t,
    round(v * (0.95 + (randCanonical() / 20)), 2) AS v
FROM
(
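    -- turn each bucket k into an hour of the day around 12:00 and normalize its count to a 0..100 load value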
    SELECT
        toDateTime('2022-12-12 12:00:00') - toIntervalHour(k) AS t,
        round((100 * c) / m, 2) AS v
    FROM
    (
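        -- attach the maximum bucket count m to every row via a window function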
        SELECT
            k,
            c,
            max(c) OVER () AS m
        FROM
        (
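            -- bucket 1000 binomial samples into hour offsets k (centered near 0) and count each bucket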
            SELECT
                floor(randBinomial(24, 0.5) - 12) AS k,
                count(*) AS c
            FROM numbers(1000)
            GROUP BY k
            ORDER BY k ASC
        )
    )
) AS a
INNER JOIN numbers(1000000) AS b ON 1 = 1

Here, we draw 1,000 random values from a binomial distribution and count how many times each value occurs. We then compute the maximum of these counts using a window function and add it as a column to every row. Finally, in the outer query, each count divided by that maximum becomes a value in the 0…100 range, corresponding to a plausible CPU load figure. We also add noise to the time and val columns using randCanonical(), and join with the numbers table to fan each hourly value out into a large number of individual metric events.

SELECT
    toStartOfHour(dt) AS h,
    round(avg(val), 2) AS v,
    bar(v, 0, 100)
FROM metrics
GROUP BY h
ORDER BY h ASC

Results of the query:

┌───────────────────h─┬─────v─┬─bar(round(avg(val), 2), 0, 100)────────────────────────────────────────────────┐
│ 2022-12-12 04:00:00 │  1.78 │ █▍                                                                             │
│ 2022-12-12 05:00:00 │  0.59 │ ▍                                                                              │
│ 2022-12-12 06:00:00 │  5.35 │ ████▎                                                                          │
│ 2022-12-12 07:00:00 │ 10.11 │ ████████                                                                       │
│ 2022-12-12 08:00:00 │ 32.11 │ █████████████████████████▋                                                     │
│ 2022-12-12 09:00:00 │ 50.53 │ ████████████████████████████████████████▍                                      │
│ 2022-12-12 10:00:00 │ 65.39 │ ████████████████████████████████████████████████████▎                          │
│ 2022-12-12 11:00:00 │ 93.34 │ ██████████████████████████████████████████████████████████████████████████▋    │
│ 2022-12-12 12:00:00 │  97.5 │ ██████████████████████████████████████████████████████████████████████████████ │
│ 2022-12-12 13:00:00 │ 87.98 │ ██████████████████████████████████████████████████████████████████████▍        │
│ 2022-12-12 14:00:00 │ 58.86 │ ███████████████████████████████████████████████                                │
│ 2022-12-12 15:00:00 │ 51.13 │ ████████████████████████████████████████▉                                      │
│ 2022-12-12 16:00:00 │ 23.18 │ ██████████████████▌                                                            │
│ 2022-12-12 17:00:00 │ 13.07 │ ██████████▍                                                                    │
│ 2022-12-12 18:00:00 │  2.97 │ ██▍                                                                            │
│ 2022-12-12 21:00:00 │  0.59 │ ▍                                                                              │
└─────────────────────┴───────┴────────────────────────────────────────────────────────────────────────────────┘

generate multimodal distributions

All of our previous examples generated data with a single peak. A multimodal distribution contains multiple peaks and can be used to simulate real-world phenomena such as multiple seasonal sales peaks. We can achieve this by grouping the generated values by an additional sequence number, which repeats the distribution:

SELECT
    floor(randBinomial(24, 0.75)) AS k,
    count(*) AS c,
    number % 3 AS ord,
    bar(c, 0, 10000)
FROM numbers(100000)
GROUP BY
    k,
    ord
ORDER BY
    ord ASC,
    k ASC

This will repeat our binomially distributed data three times:

┌──k─┬────c─┬─ord─┬─bar(count(), 0, 10000)─────────────────────────────┐
│  7 │    1 │   0 │                                                    │
│  8 │    1 │   0 │                                                    │
│  9 │    5 │   0 │                                                    │
│ 10 │   12 │   0 │                                                    │
│ 11 │   44 │   0 │ ▎                                                  │
│ 12 │  162 │   0 │ █▎                                                 │
│ 13 │  440 │   0 │ ███▌                                               │
│ 14 │ 1059 │   0 │ ████████▍                                          │
│ 15 │ 2282 │   0 │ ██████████████████▎                                │
│ 16 │ 3802 │   0 │ ██████████████████████████████▍                    │
│ 17 │ 5380 │   0 │ ███████████████████████████████████████████        │
│ 18 │ 6126 │   0 │ █████████████████████████████████████████████████  │
│ 19 │ 5793 │   0 │ ██████████████████████████████████████████████▎    │
│ 20 │ 4372 │   0 │ ██████████████████████████████████▊                │
│ 21 │ 2542 │   0 │ ████████████████████▎                              │
│ 22 │ 1002 │   0 │ ████████                                           │
│ 23 │  277 │   0 │ ██▏                                                │
│ 24 │   34 │   0 │ ▎                                                  │
│  8 │    1 │   1 │                                                    │
│  9 │    2 │   1 │                                                    │
│ 10 │   10 │   1 │                                                    │
│ 11 │   39 │   1 │ ▎                                                  │
│ 12 │  153 │   1 │ █▏                                                 │
│ 13 │  435 │   1 │ ███▍                                               │
│ 14 │ 1120 │   1 │ ████████▊                                          │
│ 15 │ 2220 │   1 │ █████████████████▋                                 │
│ 16 │ 3768 │   1 │ ██████████████████████████████▏                    │
│ 17 │ 5352 │   1 │ ██████████████████████████████████████████▋        │
│ 18 │ 6080 │   1 │ ████████████████████████████████████████████████▋  │
│ 19 │ 5988 │   1 │ ███████████████████████████████████████████████▊   │
│ 20 │ 4318 │   1 │ ██████████████████████████████████▌                │
│ 21 │ 2537 │   1 │ ████████████████████▎                              │
│ 22 │  994 │   1 │ ███████▊                                           │
│ 23 │  285 │   1 │ ██▎                                                │
│ 24 │   31 │   1 │ ▏                                                  │
│  8 │    1 │   2 │                                                    │
│  9 │    1 │   2 │                                                    │
│ 10 │   17 │   2 │ ▏                                                  │
│ 11 │   52 │   2 │ ▍                                                  │
│ 12 │  211 │   2 │ █▋                                                 │
│ 13 │  474 │   2 │ ███▋                                               │
│ 14 │ 1110 │   2 │ ████████▊                                          │
│ 15 │ 2242 │   2 │ █████████████████▊                                 │
│ 16 │ 3741 │   2 │ █████████████████████████████▊                     │
│ 17 │ 5306 │   2 │ ██████████████████████████████████████████▍        │
│ 18 │ 6256 │   2 │ ██████████████████████████████████████████████████ │
│ 19 │ 5779 │   2 │ ██████████████████████████████████████████████▏    │
│ 20 │ 4360 │   2 │ ██████████████████████████████████▊                │
│ 21 │ 2461 │   2 │ ███████████████████▋                               │
│ 22 │ 1035 │   2 │ ████████▎                                          │
│ 23 │  255 │   2 │ ██                                                 │
│ 24 │   32 │   2 │ ▎                                                  │
└────┴──────┴─────┴────────────────────────────────────────────────────┘
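
To turn this into time-series data, one rough sketch (reusing the events table, with a hypothetical 'sale' event) shifts each repetition of the distribution by a week, producing three weekly sales peaks:

INSERT INTO events SELECT
    toDateTime('2022-12-01 00:00:00')
        + toIntervalDay(7 * (number % 3))
        + toIntervalHour(toInt32(randBinomial(168, 0.5))),
    'sale'
FROM numbers(100000)

The number % 3 term picks one of three consecutive weeks, and the binomial offset peaks in the middle of each week.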

Simulating binary states

The randBernoulli() function returns 0 or 1 based on a given probability. For example, if we want to get 1 in 90% of cases, we use:

SELECT randBernoulli(0.9)

This is useful when generating data for binary states such as failed or successful transactions:

SELECT
    if(randBernoulli(0.95), 'success', 'failure') AS status,
    count(*) AS c
FROM numbers(1000)
GROUP BY status

Results of the query:

┌─status──┬───c─┐
│ failure │  52 │
│ success │ 948 │
└─────────┴─────┘

Here we generate the success state in 95% of cases and failure in only 5%.

Generate random values for enums

We can combine an array with a random function to pick values from a fixed set, for example to populate an enum-like column:

SELECT
    ['200', '404', '502', '403'][toInt32(randBinomial(4, 0.1)) + 1] AS http_code,
    count(*) AS c
FROM numbers(1000)
GROUP BY http_code

Results of the query:

┌─http_code─┬───c─┐
│ 403       │   3 │
│ 502       │  49 │
│ 200       │ 685 │
│ 404       │ 263 │
└───────────┴─────┘

Here we use the binomial distribution to pick one of four possible HTTP response codes for each request. We would normally expect far more 200 responses than errors, so we model it accordingly.
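
As a sketch of how this could feed a table (the access_log table below is hypothetical, just for illustration):

CREATE TABLE access_log
(
    `dt` DateTime,
    `http_code` String
)
ENGINE = MergeTree
ORDER BY dt

INSERT INTO access_log SELECT
    now() - randUniform(1, 86400.),
    ['200', '404', '502', '403'][toInt32(randBinomial(4, 0.1)) + 1]
FROM numbers(1000000)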

generate random strings

ClickHouse can also generate random strings using the randomString(), randomStringUTF8(), and randomPrintableASCII() functions. All of these functions accept the string length as a parameter. To create a dataset with random strings, we can combine string generation with a random function to get strings of arbitrary length. Below we generate 10 random strings of human-readable characters, each 5 to 25 characters long:

SELECT 
    randomPrintableASCII(randUniform(5, 25)) AS s, 
    length(s) AS length 
FROM numbers(10) 

Results of the query:

┌─s──────────────────┬─length─┐
│ 3{{mc              │      5 │
│ F7g S)a:i*q/)_'g:  │     17 │
│ :ccV^f4{vpwgr'Qq#M │     18 │
│ G=G':              │      5 │
│ }6o,0yMDo*x`!BqnY$ │     18 │
│ \7Y5]"             │      6 │
│ kkS3q?fE+%4hD6ItJA │     18 │
│ .<S<+n&eyu59=*6g   │     16 │
│ [!cBR              │      5 │
│ +,hD}`7#B+HYv$     │     14 │
└────────────────────┴────────┘

generate noisy data

In the real world, data always contains errors. This can be simulated in ClickHouse using the fuzzBits() function. It takes a valid value supplied by the user and randomly flips its bits with a specified probability, producing corrupted variants. Suppose we want to add errors to a string field; the following will randomly generate errors based on our initial value:

SELECT fuzzBits('Good string', 0.01) 
FROM numbers(10) 

Results of the query:

┌─fuzzBits('Good string', 0.01)─┐
│ Good string                   │
│ Good string                   │
│ Eood string                   │
│ Good string                   │
│ Good string                   │
│ Good strkng                   │
│ Good string                   │
│ G/od string                   │
│ Good!strino                   │
│ Good string                   │
└───────────────────────────────┘

Be sure to tune the probability, since the number of errors generated depends on the length of the value you pass to the function. Use lower values for a lower error probability:

SELECT
    if(fuzzBits('Good string', 0.001) != 'Good string', 1, 0) AS has_errors,
    count(*)
FROM numbers(1000)
GROUP BY has_errors

Results of the query:

┌─has_errors─┬─count()─┐
│          0 │     721 │
│          1 │     279 │
└────────────┴─────────┘

Here, even with a probability of 0.001, just over a quarter of the values still contain errors.

Generating a real dataset

To wrap up, let's simulate a 30-day clickstream with a near-real-world intraday distribution that peaks at noon; we'll use a normal distribution for this. Each event will also have one of two possible states, success or fail, distributed using a Bernoulli function. Our table:

CREATE TABLE click_events
(
    `dt` DateTime,
    `event` String,
    `status` Enum8('success' = 1, 'fail' = 2)
)
ENGINE = MergeTree
ORDER BY dt

Let's populate this table with 10 million events:

INSERT INTO click_events SELECT
    (parseDateTimeBestEffortOrNull('12:00') - toIntervalHour(randNormal(0, 3))) - toIntervalDay(number % 30),
    'Click',
    ['fail', 'success'][randBernoulli(0.9) + 1]
FROM numbers(10000000)

We used randBernoulli() with a 90% probability of success, so 9 out of 10 rows get the success value in the status column, while randNormal() spreads the events over time around the noon peak. Let's visualize this data using the following query:

SELECT 
    dt, 
    count(*) AS c, 
    bar(c, 0, 100000) 
FROM click_events 
GROUP BY dt 
ORDER BY dt ASC

Results of the query:

┌──────────────────dt─┬─────c─┬─bar(count(), 0, 100000)────────────────────────────────────────────────┐
│ 1999-12-02 22:00:00 │     1 │                                                                        │
│ 1999-12-02 23:00:00 │     1 │                                                                        │
│ 1999-12-03 00:00:00 │     6 │                                                                        │
│ 1999-12-03 01:00:00 │    27 │                                                                        │
│ 1999-12-03 02:00:00 │   110 │                                                                        │
│ 1999-12-03 03:00:00 │   321 │ ▎                                                                      │
│ 1999-12-03 04:00:00 │   866 │ ▋                                                                      │
│ 1999-12-03 05:00:00 │  1965 │ █▌                                                                     │
│ 1999-12-03 06:00:00 │  4417 │ ███▌                                                                   │
│ 1999-12-03 07:00:00 │  8372 │ ██████▋                                                                │
│ 1999-12-03 08:00:00 │ 14392 │ ███████████▌                                                           │
│ 1999-12-03 09:00:00 │ 22358 │ █████████████████▉                                                     │
│ 1999-12-03 10:00:00 │ 31178 │ ████████████████████████▉                                              │
│ 1999-12-03 11:00:00 │ 39093 │ ███████████████████████████████▎                                       │
│ 1999-12-03 12:00:00 │ 87328 │ █████████████████████████████████████████████████████████████████████▊ │
│ 1999-12-03 13:00:00 │ 38772 │ ███████████████████████████████                                        │
│ 1999-12-03 14:00:00 │ 31445 │ █████████████████████████▏                                             │
│ 1999-12-03 15:00:00 │ 22470 │ █████████████████▉                                                     │
│ 1999-12-03 16:00:00 │ 14399 │ ███████████▌                                                           │
│ 1999-12-03 17:00:00 │  8251 │ ██████▌                                                                │
│ 1999-12-03 18:00:00 │  4296 │ ███▍                                                                   │
│ 1999-12-03 19:00:00 │  1973 │ █▌                                                                     │
│ 1999-12-03 20:00:00 │   823 │ ▋                                                                      │
│ 1999-12-03 21:00:00 │   322 │ ▎                                                                      │
│ 1999-12-03 22:00:00 │   106 │                                                                        │
│ 1999-12-03 23:00:00 │    30 │                                                                        │
│ 1999-12-04 00:00:00 │    22 │                                                                        │
......
│ 2000-01-01 00:00:00 │    18 │                                                                        │
│ 2000-01-01 01:00:00 │    36 │                                                                        │
│ 2000-01-01 02:00:00 │   120 │                                                                        │
│ 2000-01-01 03:00:00 │   302 │ ▏                                                                      │
│ 2000-01-01 04:00:00 │   798 │ ▋                                                                      │
│ 2000-01-01 05:00:00 │  1993 │ █▌                                                                     │
│ 2000-01-01 06:00:00 │  4339 │ ███▍                                                                   │
│ 2000-01-01 07:00:00 │  8280 │ ██████▌                                                                │
│ 2000-01-01 08:00:00 │ 14686 │ ███████████▋                                                           │
│ 2000-01-01 09:00:00 │ 22395 │ █████████████████▉                                                     │
│ 2000-01-01 10:00:00 │ 31481 │ █████████████████████████▏                                             │
│ 2000-01-01 11:00:00 │ 39121 │ ███████████████████████████████▎                                       │
│ 2000-01-01 12:00:00 │ 86806 │ █████████████████████████████████████████████████████████████████████▍ │
│ 2000-01-01 13:00:00 │ 39060 │ ███████████████████████████████▏                                       │
│ 2000-01-01 14:00:00 │ 31020 │ ████████████████████████▊                                              │
│ 2000-01-01 15:00:00 │ 22618 │ ██████████████████                                                     │
│ 2000-01-01 16:00:00 │ 14543 │ ███████████▋                                                           │
│ 2000-01-01 17:00:00 │  8235 │ ██████▌                                                                │
│ 2000-01-01 18:00:00 │  4263 │ ███▍                                                                   │
│ 2000-01-01 19:00:00 │  1947 │ █▌                                                                     │
│ 2000-01-01 20:00:00 │   809 │ ▋                                                                      │
│ 2000-01-01 21:00:00 │   320 │ ▎                                                                      │
│ 2000-01-01 22:00:00 │   114 │                                                                        │
│ 2000-01-01 23:00:00 │    31 │                                                                        │
│ 2000-01-02 00:00:00 │     3 │                                                                        │
│ 2000-01-02 01:00:00 │     3 │                                                                        │
└─────────────────────┴───────┴────────────────────────────────────────────────────────────────────────┘

724 rows in set. Elapsed: 0.453 sec.
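
As a final check, here is a sketch of a query combining both generated dimensions, computing the hourly failure rate (it should hover around the 10% we configured with randBernoulli):

SELECT
    toStartOfHour(dt) AS hour,
    round(countIf(status = 'fail') / count(*), 3) AS fail_rate
FROM click_events
GROUP BY hour
ORDER BY hour ASC
LIMIT 24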

Feel free to add the author on WeChat (xiedeyantu) to discuss technical issues.
