StarRocks table creation guide

Foreword

This article belongs to the column "Big Data Technology System". The column is original work by the author; please cite the source when quoting. Please point out any deficiencies or mistakes in the comments section. Thank you!

For the column's table of contents and references, see Big Data Technology System.


Differences between table creation in MySQL and StarRocks

StarRocks is compatible with the MySQL 5 protocol, but its CREATE TABLE syntax differs slightly from MySQL's.

Create table statement in MySQL

CREATE TABLE mysqltestdb.test_mysql(
    dateid DATE,
    siteid INT DEFAULT 10,
    citycode SMALLINT,
    username VARCHAR(32) DEFAULT '',
    pv BIGINT DEFAULT 0
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Table creation statement in StarRocks

CREATE TABLE srtestdb.test_sr(
    date_id DATE,
    site_id INT DEFAULT 10,
    city_code SMALLINT,
    user_name VARCHAR(32) DEFAULT '',
    pv BIGINT DEFAULT 0
)
DUPLICATE KEY(date_id, site_id, city_code)
PARTITION BY RANGE(date_id)(
    PARTITION p1 VALUES LESS THAN ('2020-01-31'),
    PARTITION p2 VALUES LESS THAN ('2020-02-29'),
    PARTITION p3 VALUES LESS THAN ('2020-03-31')
)
DISTRIBUTED BY HASH(site_id) BUCKETS 10;

Here, the DUPLICATE KEY clause specifies the table model used in StarRocks.


Table models

StarRocks has four table models: the detail model, the aggregate model, the update model, and the primary key model.

The DISTRIBUTED BY HASH clause specifies the bucket key and the number of buckets for the table.

Unlike MySQL, which mainly serves OLTP workloads, it is strongly recommended to partition tables in StarRocks.

Partitioning and bucketing in StarRocks

Partitions in StarRocks

Creating partitions

Partitions serve the same purpose as MySQL partitioned tables: once a table is partitioned, partition pruning can effectively reduce the amount of data scanned.

Currently, StarRocks supports only range partitioning. The following example introduces the partitioning feature:

CREATE TABLE site_access(
    date_id DATE,
    site_id INT DEFAULT '10',
    city_code VARCHAR(100),
    user_name VARCHAR(32) DEFAULT '',
    pv BIGINT DEFAULT '0'
)
DUPLICATE KEY(date_id, site_id, city_code)
PARTITION BY RANGE(date_id)(
    PARTITION p20200321 VALUES LESS THAN ("2020-03-22"),
    PARTITION p20200322 VALUES LESS THAN ("2020-03-23"),
    PARTITION p20200323 VALUES LESS THAN ("2020-03-24"),
    PARTITION p20200324 VALUES LESS THAN ("2020-03-25")
)
DISTRIBUTED BY HASH(date_id, site_id) BUCKETS 32;

Create partitions in batches

As shown in the following example, partitions can be created in batches by specifying a START ... END ... EVERY clause. The START value is inclusive and the END value is exclusive.

CREATE TABLE site_access (
	date_id DATE,
	site_id INT,
	city_code SMALLINT,
	user_name VARCHAR(32),
	pv BIGINT DEFAULT '0'
)
DUPLICATE KEY(date_id, site_id, city_code)
PARTITION BY RANGE (date_id)(
START ("2021-01-01") END ("2021-02-01") EVERY (INTERVAL 1 DAY)
)
DISTRIBUTED BY HASH(site_id) BUCKETS 10

After creating partitions in batches, we can still add partitions with the ADD PARTITION statement.

In the example above, the START END EVERY clause creates one partition per day from 2021-01-01 up to (but not including) 2021-02-01.

Rows whose partition-key values fall outside the defined ranges cause an error on insert.

If you want to keep such data, you can manually create two boundary partitions, as in the following example.

ALTER TABLE test.site_access2 ADD PARTITION p_low VALUES LESS THAN ("2021-01-01");
ALTER TABLE test.site_access2 ADD PARTITION p_high VALUES LESS THAN ("2999-01-01");

Managing partitions

Adding a partition

ALTER TABLE test.site_access2 ADD PARTITION p_low VALUES LESS THAN ("2021-01-01");

Dropping a partition

ALTER TABLE test.site_access2 DROP PARTITION p_low;

Modifying partition attributes

ALTER TABLE site_access SET("dynamic_partition.enable" = "false");
ALTER TABLE site_access SET("dynamic_partition.enable" = "true");

Viewing partition information

SHOW PARTITIONS FROM test.site_access2;

Why partition and bucket

In StarRocks, data is stored by first partitioning it and then bucketing within each partition.

If a table is not partitioned, the whole table defaults to a single partition, and the whole table is bucketed.

Let's take the following table as an example:

CREATE TABLE ads(
    ads_uuid INT,
    ads_date DATE,
    uuid INT,
    imp_cnt INT,
    click_cnt INT
)
DUPLICATE KEY(ads_uuid, ads_date, uuid)
PARTITION BY RANGE(ads_date)(
    PARTITION p1 VALUES LESS THAN ('2020-01-31'),
    PARTITION p2 VALUES LESS THAN ('2020-02-29'),
    PARTITION p3 VALUES LESS THAN ('2020-03-31')
)
DISTRIBUTED BY HASH(ads_uuid) BUCKETS 10;

When choosing partition and bucket keys, try to cover the conditions of your query statements as much as possible.

Once a table is partitioned and bucketed, data access becomes much more targeted: queries that previously required a full table scan only need to scan a few partitions and buckets.

In the following query against the ads table, the condition ads_date > '2020-02-29' AND ads_date < '2020-03-31' enables partition pruning to cut out most of the data, and the condition ads_uuid = 1 enables bucket pruning: nine of the ten buckets are skipped, and only the remaining one is scanned.

SELECT imp_cnt
FROM ads
WHERE ads_date > '2020-02-29'
  AND ads_date < '2020-03-31'
  AND ads_uuid = 1;

Bucketing in StarRocks

Bucket key selection

The next level below partitioning is bucketing. StarRocks uses hashing as the bucketing algorithm, which distributes the data within a partition evenly across nodes and avoids query hotspots.

Within a partition, rows whose bucket-key values hash to the same bucket form a data shard called a tablet. A tablet is the smallest unit of multi-replica redundant storage, and it is also the smallest scheduling unit for replica management.

Generally speaking, we try to make the partition and bucket keys cover most of the conditions in the WHERE clause.

For the following query, we will select the site_id column as the bucketing column:

SELECT city_code, SUM(pv)
FROM site_access
WHERE site_id = 54321
GROUP BY city_code;

But sometimes the data in the site_id column is unevenly distributed, and this bucketing scheme causes data skew, making some buckets locally overloaded.

We can spread the data out by using a combined bucket key:

CREATE TABLE site_access
(
 site_id INT DEFAULT '10',
 city_code SMALLINT,
 user_name VARCHAR(32) DEFAULT '',
 pv BIGINT
)
DUPLICATE KEY(site_id, city_code, user_name)
DISTRIBUTED BY HASH(site_id,city_code) BUCKETS 10;

Choosing the number of buckets

Bucketed data files are compressed with LZ4.

It is recommended that each bucket's data file be around 100 MB to 1 GB.

Generally speaking, we follow the following rules to determine the number of buckets:

  • With only a few machines, to make full use of machine resources, consider setting the number of buckets to (number of BEs) * (CPU cores per BE) / 2.
    For example, for a table with 100 GB of data on 4 BEs with 64 cores each and only one partition, the bucket count can be 4 * 64 / 2 = 128; each tablet then holds about 781 MB of data, and CPU resources are fully utilized.
  • The number of buckets affects query parallelism. Best practice is to estimate the stored data size and keep each tablet between 100 MB and 1 GB.
  • Compared with the source CSV file, StarRocks's compression ratio is around 0.3 to 0.5 (the calculation below uses 0.5 and decimal units). Suppose a 10 GB CSV file is imported into StarRocks and split evenly into 10 partitions. Each partition then carries 10 GB / 10 = 1 GB of CSV text; stored as a single replica at a 0.5 compression ratio, that is 1 GB * 0.5 = 500 MB; with the usual three replicas, a partition totals 500 MB * 3 = 1500 MB. If each tablet is planned at 300 MB, the partition needs 1500 MB / 300 MB = 5 buckets. For data coming from MySQL (one primary, two replicas), first estimate the size of a single MySQL replica, convert it to a CSV size using an empirical compression ratio of 0.7, and then compute the StarRocks bucket count with the same steps.
  • Choose a high-cardinality column as the bucket key (if there is a unique ID column, use it) so that data is balanced across buckets as much as possible. If the data is still severely skewed, multiple columns can be combined as the bucket key (but generally not too many).
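Putting the CSV-sizing arithmetic above into a concrete DDL (a sketch only; the database, table, and column names are hypothetical, with the sizes carried over from the 10 GB example):

```sql
-- 10 GB CSV split into 10 daily partitions -> 1 GB of CSV text per partition.
-- Compressed at ~0.5 -> 500 MB per replica; with 3 replicas -> 1500 MB per partition.
-- Planning ~300 MB per tablet -> 1500 MB / 300 MB = 5 buckets.
CREATE TABLE example_db.access_log (
    log_date DATE,
    user_id  INT,
    pv       BIGINT DEFAULT '0'
)
DUPLICATE KEY(log_date, user_id)
PARTITION BY RANGE(log_date)(
    START ("2021-01-01") END ("2021-01-11") EVERY (INTERVAL 1 DAY)
)
DISTRIBUTED BY HASH(user_id) BUCKETS 5;
```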

Managing buckets

Currently, the number of buckets cannot be changed after a table is created.

During a PoC, you can first import the data of one partition and judge the tablet size from the DataSize field (in bytes) in the SHOW TABLET output.

mysql> show tablet from srtestdb.test_duplicate_tbl \G
*************************** 1. row ***************************
               TabletId: 10297
              ReplicaId: 10298
              BackendId: 10002
             SchemaHash: 1515068627
                Version: 2
            VersionHash: 5815677282633857677
      LstSuccessVersion: 2
  LstSuccessVersionHash: 5815677282633857677
       LstFailedVersion: -1
   LstFailedVersionHash: 0
          LstFailedTime: NULL
               DataSize: 839
               RowCount: 16
                  State: NORMAL
LstConsistencyCheckTime: NULL
           CheckVersion: -1
       CheckVersionHash: -1
           VersionCount: 2
               PathHash: -5057820482300793837
                MetaUrl: http://192.168.88.14:8040/api/meta/header/10297/1515068627
       CompactionStatus: http://192.168.88.14:8040/api/compaction/show?tablet_id=10297&schema_hash=1515068627

Table models in StarRocks

Data models

Unlike MySQL, in addition to the bucketing information, creating a table in StarRocks also requires specifying a data model. In the following example, the DUPLICATE KEY keyword creates a table with the detail model.

CREATE TABLE srtestdb.test_sr(
    siteid INT,
    citycode SMALLINT,
    username VARCHAR(32) DEFAULT '',
    pv BIGINT
)
DUPLICATE KEY(siteid, citycode, username)
DISTRIBUTED BY HASH(siteid) BUCKETS 10;

According to different business needs, StarRocks provides the following data models:

  • Detail model: the table may contain rows with duplicate sort keys; stored rows correspond one-to-one to the ingested rows, and users can recall all historical data
  • Aggregate model: the table contains no rows with duplicate aggregate keys; ingested rows with the same key are merged into one row
  • Update model: the primary key satisfies a uniqueness constraint, and imported data replaces rows with duplicate primary keys, which is equivalent to an UPSERT operation
  • Primary key model: like the update model, the primary key satisfies a uniqueness constraint and imports behave as UPSERTs; it is described in its own section below

Detail model

The detail model is the default model in StarRocks.

As in relational databases such as MySQL, data is stored exactly as it is written, with no transformation.

The detailed model uses DUPLICATE KEY as the keyword:

CREATE TABLE srtestdb.test_duplicate_tbl(
    siteid INT,
    city SMALLINT,
    username VARCHAR(32) DEFAULT '',
    pv BIGINT
)
DUPLICATE KEY(siteid, city, username)
DISTRIBUTED BY HASH(siteid) BUCKETS 10;

In the following example, we insert a batch of rows; querying the full table shows that the inserted data has not been changed in any way. Note that the sort key (siteid, city, username) may contain duplicate values.

INSERT INTO srtestdb.test_duplicate_tbl VALUES
(10, 100, 'aaa', 1), (10, 100, 'aaa', 2),
(10, 200, 'aaa', 1), (10, 200, 'aaa', 2),
(20, 100, 'aaa', 1), (20, 100, 'aaa', 2),
(20, 200, 'aaa', 1), (20, 200, 'aaa', 2),
(10, 100, 'bbb', 1), (10, 100, 'bbb', 2),
(10, 200, 'bbb', 1), (10, 200, 'bbb', 2),
(20, 100, 'bbb', 1), (20, 100, 'bbb', 2),
(20, 200, 'bbb', 1), (20, 200, 'bbb', 2);

-- The sort key (siteid, city, username) contains duplicate values such as (10, 100, 'aaa')
SELECT * FROM srtestdb.test_duplicate_tbl;
+--------+------+----------+------+
| siteid | city | username | pv   |
+--------+------+----------+------+
|     10 |  100 | aaa      |    1 |
|     10 |  100 | aaa      |    2 |
|     10 |  100 | bbb      |    1 |
|     10 |  100 | bbb      |    2 |
|     10 |  200 | aaa      |    1 |
|     10 |  200 | aaa      |    2 |
|     10 |  200 | bbb      |    1 |
|     10 |  200 | bbb      |    2 |
|     20 |  100 | aaa      |    1 |
|     20 |  100 | aaa      |    2 |
|     20 |  100 | bbb      |    1 |
|     20 |  100 | bbb      |    2 |
|     20 |  200 | aaa      |    1 |
|     20 |  200 | aaa      |    2 |
|     20 |  200 | bbb      |    1 |
|     20 |  200 | bbb      |    2 |
+--------+------+----------+------+

Aggregate model

When queries do not need to recall the detail rows but only aggregated summaries, the aggregate model can be used. After data is ingested into the table, the detail rows are not stored; only the results of the aggregate computation are stored. The aggregate model uses the AGGREGATE KEY keyword:

CREATE TABLE srtestdb.test_aggregate_tbl
(
    siteid INT,
    city SMALLINT,
    username VARCHAR(32),
    pv BIGINT SUM DEFAULT '0'
)
AGGREGATE KEY(siteid, city, username)
DISTRIBUTED BY HASH(siteid) BUCKETS 10 PROPERTIES("replication_num" = "1");

We insert the same data as in the detail model into the aggregate model. Querying the table shows that the detail rows are not stored; only the results aggregated by (siteid, city, username) remain:

INSERT INTO srtestdb.test_aggregate_tbl VALUES
(10, 100, 'aaa', 1), (10, 100, 'aaa', 2),
(10, 200, 'aaa', 1), (10, 200, 'aaa', 2),
(20, 100, 'aaa', 1), (20, 100, 'aaa', 2),
(20, 200, 'aaa', 1), (20, 200, 'aaa', 2),
(10, 100, 'bbb', 1), (10, 100, 'bbb', 2),
(10, 200, 'bbb', 1), (10, 200, 'bbb', 2),
(20, 100, 'bbb', 1), (20, 100, 'bbb', 2),
(20, 200, 'bbb', 1), (20, 200, 'bbb', 2);

SELECT * FROM srtestdb.test_aggregate_tbl;
+--------+------+----------+------+
| siteid | city | username | pv   |
+--------+------+----------+------+
|     10 |  100 | aaa      |    3 |
|     10 |  100 | bbb      |    3 |
|     10 |  200 | aaa      |    3 |
|     10 |  200 | bbb      |    3 |
|     20 |  100 | aaa      |    3 |
|     20 |  100 | bbb      |    3 |
|     20 |  200 | aaa      |    3 |
|     20 |  200 | bbb      |    3 |
+--------+------+----------+------+

The aggregate model is equivalent to a materialized view over the detail model with an aggregation applied:

SELECT siteid, city, username, SUM(pv)
FROM srtestdb.test_duplicate_tbl
GROUP BY siteid, city, username;
+--------+------+----------+-----------+
| siteid | city | username | sum(`pv`) |
+--------+------+----------+-----------+
|     10 |  100 | bbb      |         3 |
|     20 |  200 | bbb      |         3 |
|     20 |  200 | aaa      |         3 |
|     10 |  100 | aaa      |         3 |
|     20 |  100 | aaa      |         3 |
|     10 |  200 | aaa      |         3 |
|     10 |  200 | bbb      |         3 |
|     20 |  100 | bbb      |         3 |
+--------+------+----------+-----------+

Primary key model

At present, StarRocks does not support the UPDATE statement; the primary key model is provided to implement UPSERT semantics.

When a row is inserted, if its primary key does not exist, StarRocks inserts the record; if the key already exists, StarRocks updates the existing record with the new values.

The primary key model takes PRIMARY KEY as the keyword:

CREATE TABLE srtestdb.test_primary_tbl
(
    siteid INT NOT NULL,
    city SMALLINT NOT NULL,
    username VARCHAR(32) NOT NULL,
    pv BIGINT DEFAULT '0'
)
PRIMARY KEY(siteid, city, username)
DISTRIBUTED BY HASH(siteid) BUCKETS 10 PROPERTIES("replication_num" = "1");

After inserting one row, we then insert a row whose primary key already exists and a row whose primary key does not. The table still ends up with two rows: the row with the existing primary key overwrites the original record (UPDATE), while the row with a new primary key is inserted directly (INSERT):

INSERT INTO srtestdb.test_primary_tbl VALUES (10, 100, 'aaa', 1);
SELECT * FROM srtestdb.test_primary_tbl;
+--------+------+----------+------+
| siteid | city | username | pv   |
+--------+------+----------+------+
|     10 |  100 | aaa      |    1 |
+--------+------+----------+------+

-- No row with primary key (20, 100, 'aaa') exists, so this row is inserted directly
INSERT INTO srtestdb.test_primary_tbl VALUES (20, 100, 'aaa', 1);
SELECT * FROM srtestdb.test_primary_tbl;
+--------+------+----------+------+
| siteid | city | username | pv   |
+--------+------+----------+------+
|     10 |  100 | aaa      |    1 |
|     20 |  100 | aaa      |    1 |
+--------+------+----------+------+

-- A row with primary key (10, 100, 'aaa') already exists, so the original record is updated
INSERT INTO srtestdb.test_primary_tbl VALUES (10, 100, 'aaa', 99);
SELECT * FROM srtestdb.test_primary_tbl;
+--------+------+----------+------+
| siteid | city | username | pv   |
+--------+------+----------+------+
|     20 |  100 | aaa      |    1 |
|     10 |  100 | aaa      |   99 |
+--------+------+----------+------+

Sort keys

Introduction to sort keys

The data in a StarRocks table is divided into key columns and value columns. In the examples above, all three models use (siteid, city, username) as the table's sort key.

Taking the example above, note the following points about sort columns:

  • Sort columns must be defined before all other columns in the CREATE TABLE statement.
  • The order of the sort columns is determined by the column order in CREATE TABLE, and the sort key must be a prefix of it: it can be (siteid, city) or (siteid, city, username), but not (city, username), (city, siteid), or (siteid, city, pv).

Sparse index

To speed up queries, StarRocks automatically creates a sparse index on the sort columns.

For range queries, the sparse index (shortkey index) helps quickly locate the starting target row.

When there are many sort columns, StarRocks automatically restricts the sparse index so that its content stays small and can be fully cached in memory.

Thanks to the sparse index, queries can be accelerated; the degree of acceleration depends on whether the query filters on the leading (prefix) columns of the index.
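As a sketch (using the site_access table defined earlier, whose sort key is (site_id, city_code, user_name)), the first query below filters on the leading sort column and can use the shortkey index, while the second filters on a non-leading column and cannot:

```sql
-- Filters on site_id, the leading sort column: the shortkey index
-- quickly narrows the scan to the matching data blocks.
SELECT pv FROM site_access WHERE site_id = 54321;

-- Filters only on user_name, a non-leading sort column: the index
-- prefix cannot be used, so many more data blocks are scanned.
SELECT pv FROM site_access WHERE user_name = 'aaa';
```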

How to choose a sort key

Based on the sparse-index acceleration rules, the following suggestions help when specifying sort columns:

  • Put columns with high selectivity (many distinct values) first, as leading columns
  • Put the columns most frequently used in query conditions first, as leading columns
  • Try to make the sort columns cover as many query conditions as possible

Origin blog.csdn.net/Shockang/article/details/128895157