1. Basic introduction to ClickHouse

1.1 Introduction

The name ClickHouse comes from Click Stream + Data Warehouse.

During data collection, every page click generates an event. ClickHouse was designed for OLAP analysis in a data warehouse built on this page-click event stream.

ClickHouse is an open-source, fully columnar, relational database management system. In data warehouses it is used mainly for online analytical processing (OLAP, Online Analytical Processing).

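1.2 Advantages

ROLAP (Relational On-Line Analytical Processing)

Online real-time queries; see the benchmarks at https://clickhouse.tech/benchmark/dbms/

Complete DBMS

	DDL: databases, tables, and views can be created, modified, or dropped dynamically, without restarting the service;

	DML: data can be queried, inserted, modified, or deleted dynamically;

	Access control: operation permissions on databases or tables can be set per user, which keeps the data secure;

	Backup and recovery: provides mechanisms for exporting backups and importing them for recovery;

	Distributed management: provides a cluster mode that can automatically manage multiple database nodes.

Fully columnar storage (reduces the amount of data scanned and the size of the data transferred), which in turn enables efficient data compression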

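	Suppose table A has 50 fields, A1~A50, and 100 rows.
		Column-oriented lookup: SELECT A1,A2,A3,A4,A5 FROM A; only the five requested columns are read, which effectively reduces the amount of data the query has to scan.
		Row-oriented lookup: the database first scans row by row, fetching all 50 fields of every row, and only then returns the five fields A1~A5 from each row.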
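To make the difference concrete, here is a minimal Python sketch of the table A example (50 fields, 100 rows). It is only a toy model that counts how many values each layout has to touch; the function names and the in-memory lists are assumptions made for illustration and say nothing about how ClickHouse stores data on disk.

```python
# Toy model of table A: 50 fields (A1..A50) and 100 rows.
NUM_FIELDS = 50
NUM_ROWS = 100

# Row-oriented layout: a list of rows, each row holding all 50 field values.
row_store = [
    [f"r{r}_A{c}" for c in range(1, NUM_FIELDS + 1)] for r in range(NUM_ROWS)
]

# Column-oriented layout: one list of values per field.
column_store = {
    f"A{c}": [f"r{r}_A{c}" for r in range(NUM_ROWS)]
    for c in range(1, NUM_FIELDS + 1)
}

wanted = ["A1", "A2", "A3", "A4", "A5"]


def scan_row_store(rows, fields):
    """Fetch every row in full, then keep only the requested fields."""
    values_read = 0
    result = []
    for row in rows:
        values_read += len(row)  # the whole row is fetched
        result.append([row[int(name[1:]) - 1] for name in fields])
    return result, values_read


def scan_column_store(columns, fields):
    """Read only the requested columns."""
    values_read = 0
    selected = []
    for name in fields:
        column = columns[name]
        values_read += len(column)
        selected.append(column)
    # Transpose the selected columns back into rows for the result set.
    result = [list(row) for row in zip(*selected)]
    return result, values_read


row_result, row_cost = scan_row_store(row_store, wanted)
col_result, col_cost = scan_column_store(column_store, wanted)
print(row_cost, col_cost)  # 5000 500
```

Both scans return the same result set, but the row layout touches 100 x 50 values while the column layout touches only 100 x 5.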
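	The essence of compression: scan the data for matches with a certain step size and, when a repeated part is found, replace it with an encoded reference. The more repeated values there are, the higher the compression ratio.
		Before compression: abcdefghi_bcdefghi
		After compression: abcdefghi_(9,8)
	Here (9,8) means: moving back 9 bytes from the position just after the underscore finds a repeated run of 8 bytes, namely bcdefghi.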
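The (9,8) pair can be reproduced with a few lines of Python. This is a minimal sketch of an LZ77-style back-reference (my own naming for the idea described above), not the codec ClickHouse actually uses; both function names are hypothetical.

```python
def find_back_reference(data: str, pos: int):
    """Find the longest earlier match for data[pos:].

    Returns (offset, length), where offset counts back from pos.
    Brute force on purpose: this is a sketch, not a fast matcher.
    """
    best_offset, best_length = 0, 0
    for start in range(pos):
        length = 0
        while pos + length < len(data) and data[start + length] == data[pos + length]:
            length += 1
        if length > best_length:
            best_offset, best_length = pos - start, length
    return best_offset, best_length


def expand_back_reference(prefix: str, offset: int, length: int) -> str:
    """Decode one (offset, length) pair appended to an already decoded prefix."""
    out = list(prefix)
    start = len(out) - offset
    if offset <= 0 or start < 0:
        raise ValueError("offset must point inside the decoded prefix")
    for i in range(length):
        out.append(out[start + i])  # copy byte by byte from earlier output
    return "".join(out)


original = "abcdefghi_bcdefghi"
pair = find_back_reference(original, pos=10)    # position just after "_"
print(pair)                                     # (9, 8)
print(expand_back_reference("abcdefghi_", *pair))  # abcdefghi_bcdefghi
```

The offset is measured from the current write position, which is why 9 bytes back from the end of "abcdefghi_" lands on "b".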

​ Vectorized execution engine

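Vectorization is an optimization that eliminates loops in a program: one SIMD instruction processes several values at once. To check whether the CPU supports the SSE 4.2 instruction set:
grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"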
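The same check can be scripted, for example as part of a deployment pre-flight. The sketch below is a rough Python equivalent of the grep command; the function name cpu_has_flag is hypothetical, and it assumes a Linux x86 /proc/cpuinfo with a "flags" line.

```python
from pathlib import Path


def cpu_has_flag(cpuinfo_text: str, flag: str) -> bool:
    """Return True if any 'flags' line of /proc/cpuinfo lists the given flag."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and flag in value.split():
            return True
    return False


def main() -> None:
    path = Path("/proc/cpuinfo")
    if not path.exists():
        print("cannot read /proc/cpuinfo on this platform")
        return
    supported = cpu_has_flag(path.read_text(), "sse4_2")
    print("SSE 4.2 supported" if supported else "SSE 4.2 not supported")


if __name__ == "__main__":
    main()
```

Unlike grep, this matches the flag as a whole word, so a longer flag name that merely contains "sse4_2" would not count.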

​ SQL query

​ Diversified database engines and table engines

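Different engines store data in different places and have different characteristics.
Database engines:
Ordinary		default engine
Dictionary		dictionary engine	Databases of this type automatically create tables for all data dictionaries.
Memory			memory engine		Holds temporary data; tables in such a database live only in memory, and the data is lost on restart.
Lazy			log engine			Only Log-family table engines can be used in such a database.
MySQL			MySQL engine		Databases of this type automatically pull data from a remote MySQL server and create tables that use the MySQL table engine.
Table engines:
Memory, MergeTree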

​ Multithreading and distributed

​ Multi-master architecture

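	ClickHouse uses a Multi-Master architecture in which all nodes play equal roles: a client gets the same result no matter which node it connects to, which avoids a single point of failure.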

1.3 Disadvantages

Does not support transactions.

Does not support secondary indexes.

Metadata management requires manual intervention and maintenance.

Writes should be done as batch inserts wherever possible.

Supports only a limited set of operating systems.

Does not support typical K/V storage.

Resource control for concurrent queries is hard to manage.
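The batch-insert advice in the list above usually means buffering rows on the client and sending them in large groups. Here is a minimal Python sketch of that pattern. The batched helper, the batch size of 10, and the insert_batch placeholder are assumptions for illustration; a real client would send each batch as one INSERT.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[int, str]


def batched(rows: Iterable[Row], batch_size: int) -> Iterator[List[Row]]:
    """Group an arbitrary stream of rows into lists of at most batch_size rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def insert_batch(batch: List[Row]) -> None:
    """Hypothetical placeholder: a real client would issue one INSERT here."""
    print(f"would insert {len(batch)} rows in one INSERT")


# 25 click events arrive one by one but are written in three batches.
events = ((i, f"click-{i}") for i in range(25))
sizes = []
for batch in batched(events, batch_size=10):
    insert_batch(batch)
    sizes.append(len(batch))

print(sizes)  # [10, 10, 5]
```

In practice the batch size would be far larger than 10; the small number here only keeps the output readable.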

1.4 Applicable scenarios

Suited to business intelligence (BI): advertising traffic, web and app traffic, the Internet of Things, and similar fields.

The Multi-Master architecture suits deployments across multiple data centers, including geo-distributed active-active setups.

1.5 Inapplicable scenarios

Does not support transactions, so it is not suitable for OLTP transactional workloads.

Not good at row-level lookups by primary key (supported, but not a strength), so ClickHouse should not be used as a key-value database.

Not good at deleting data row by row (supported, but not a strength).

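2. ClickHouse architecture design

Column and Field are the most basic mapping units of ClickHouse data.

	Why is ClickHouse so fast?
		0. It is written in C++, which has an advantage when working close to the hardware;
		1. It uses columnar storage;
		2. It uses a vectorized execution engine;
		3. Its software architecture is designed bottom-up, pushing each layer to the extreme;
		4. The MergeTree data structure is well suited to real-time data.

			Hardware ---> Algorithms ---> Special-case optimizations ---> Release cadence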

Hardware:

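	ClickHouse performs GROUP BY in memory and uses a hash table to hold the data.

	It pays close attention to the CPU L3 cache, because a single L3 cache miss costs 70~100 ns of latency. On a single core that amounts to wasting 40 million operations per second; on a 32-thread CPU, 500 million operations per second. In benchmark queries ClickHouse reaches a data scanning rate of 175 million per second.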

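Algorithms:

	In string search, the Volnitsky algorithm is used for constant patterns; for non-constant patterns, a brute-force search accelerated with the CPU's SIMD vectorized execution is used; regular expression matching uses the re2 and hyperscan libraries.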

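Special-case optimizations:

	For the same scenario, different implementations are chosen depending on the situation.

		For example, the distinct-count function uniqCombined picks a different algorithm depending on the data volume:

			when the data volume is small, it stores the values in an Array;

			when the data volume is medium, it uses a HashSet;

			when the data volume is large, it uses the HyperLogLog algorithm.

		For scenarios where the data structure is well defined, code generation is used to unroll loops and reduce the number of iterations.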
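The three-stage strategy described for uniqCombined can be sketched in Python. This is a simplified illustration of the idea (array, then hash set, then HyperLogLog), not ClickHouse's implementation: the thresholds, the SHA-1 based hash, and the class names are all assumptions chosen to keep the example short.

```python
import hashlib
import math

ARRAY_LIMIT = 16        # hypothetical threshold for leaving the array stage
HASH_SET_LIMIT = 1024   # hypothetical threshold for leaving the hash set stage
HLL_PRECISION = 12      # 2**12 = 4096 registers (hypothetical)


def hash64(value) -> int:
    """64-bit hash derived from SHA-1 (an assumption, chosen for determinism)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HyperLogLog:
    def __init__(self, precision: int = HLL_PRECISION):
        self.p = precision
        self.m = 1 << precision
        self.registers = [0] * self.m

    def add(self, value) -> None:
        h = hash64(value)
        index = h >> (64 - self.p)                  # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def count(self) -> int:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:      # small-range correction
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)


class AdaptiveDistinctCounter:
    """Switches representation as the number of distinct values grows."""

    def __init__(self):
        self.stage = "array"
        self.array = []
        self.hash_set = set()
        self.hll = None

    def add(self, value) -> None:
        if self.stage == "array":
            if value not in self.array:
                self.array.append(value)
            if len(self.array) > ARRAY_LIMIT:
                self.hash_set, self.array = set(self.array), []
                self.stage = "hash_set"
        elif self.stage == "hash_set":
            self.hash_set.add(value)
            if len(self.hash_set) > HASH_SET_LIMIT:
                self.hll = HyperLogLog()
                for item in self.hash_set:
                    self.hll.add(item)
                self.hash_set = set()
                self.stage = "hyperloglog"
        else:
            self.hll.add(value)

    def count(self) -> int:
        if self.stage == "array":
            return len(self.array)
        if self.stage == "hash_set":
            return len(self.hash_set)
        return self.hll.count()


counter = AdaptiveDistinctCounter()
for value in range(50000):
    counter.add(value)
print(counter.stage, counter.count())  # approximate count in the last stage
```

The first two stages are exact; only the HyperLogLog stage trades accuracy for bounded memory, which is the point of switching late.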

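Release cadence:

	A new version is released roughly every month, which means there is a mechanism for continuous validation and continuous improvement.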

3. Client access interface

3.1 Low-level access interfaces of ClickHouse

ClickHouse supports both the TCP and HTTP protocols.

The TCP protocol performs better. Its port is 9000, and it is used mainly for internal communication between cluster nodes and by the CLI (Command Line Interface) client.

The CLI has two execution modes:

1. Interactive execution:			clickhouse-client -h clickhouse-11 --port 9000

2. Non-interactive execution:		clickhouse-client --query

Standard input: cat /root/test.tsv | clickhouse-client --query "INSERT INTO test FORMAT TSV"

Standard output: clickhouse-client --query="select * from test FORMAT CSV" > /root/test.csv

By default, clickhouse-client runs only one SQL statement per invocation. To run several statements, add --multiquery:

clickhouse-client -h clickhouse-1 --port 9000 --multiquery --query="select 1;select 2;select 3;"

The HTTP protocol has better compatibility. Its port is 8123, and it can be used from clients in programming languages such as Java and Python through a REST-style service.

In most cases it is easier to connect through a wrapped interface such as the CLI or JDBC.
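For a quick look at the HTTP interface on port 8123 without any driver, a query can be sent as a URL parameter. The sketch below uses only the Python standard library; the helper names build_query_url and run_query are hypothetical, and the host is an assumption. Only the URL is built here, so the example runs without a server.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def build_query_url(host: str, query: str, port: int = 8123,
                    database: str = "default") -> str:
    """Build a GET URL for the HTTP interface (query passed as a URL parameter)."""
    params = urlencode({"database": database, "query": query}, quote_via=quote)
    return f"http://{host}:{port}/?{params}"


def run_query(host: str, query: str, **kwargs) -> str:
    """Send the query and return the response body (needs a reachable server)."""
    with urlopen(build_query_url(host, query, **kwargs), timeout=10) as response:
        return response.read().decode("utf-8")


url = build_query_url("localhost", "SELECT 1")
print(url)  # http://localhost:8123/?database=default&query=SELECT%201
```

GET requests are treated as read-only by the server, so statements that modify data would have to be sent with POST instead.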

clickhouse-client parameters

--host / -h 		: server address						  in config.xml: <listen_host>::1</listen_host>

    															   <listen_host>127.0.0.1</listen_host>

--port				: server TCP port, default 9000			in config.xml: <tcp_port>9000</tcp_port>

--user / -u			: login user name, default "default"

--password			: login password, empty by default		in users.xml: <password></password>

--database / -d		: database to use after login, default "default"		in config.xml: <default_database>default</default_database>

--query / -q		: the SQL statement to run; only for non-interactive execution

--multiquery / -n	: in non-interactive execution, allows several SQL statements to be run at once

--time / -t 		: in non-interactive execution, prints the execution time of each SQL statement

clickhouse-client -t -n -q "select RequestNum,RequestTry from test.hit limit 1000000,10;select count(*) from test.hit;select RequestNum,RequestTry from test.hit limit 100000,10;"
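When clickhouse-client is called from a script, it helps to assemble the argument list in one place. The sketch below builds the command from the parameters listed above; build_client_command is a hypothetical helper, and the host name clickhouse-1 is taken from the earlier example. The command is only printed, not executed.

```python
import shlex
from typing import List, Sequence


def build_client_command(queries: Sequence[str], host: str = "localhost",
                         port: int = 9000, user: str = "default",
                         database: str = "default",
                         show_time: bool = False) -> List[str]:
    """Assemble a non-interactive clickhouse-client invocation as an argv list."""
    if not queries:
        raise ValueError("at least one query is required")
    command = [
        "clickhouse-client",
        "--host", host,
        "--port", str(port),
        "--user", user,
        "--database", database,
    ]
    if show_time:
        command.append("--time")
    if len(queries) > 1:
        command.append("--multiquery")  # needed for more than one statement
    statements = ";".join(q.strip().rstrip(";") for q in queries)
    command += ["--query", statements]
    return command


cmd = build_client_command(["select 1", "select 2", "select 3"], host="clickhouse-1")
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would execute it; keeping it as a list avoids shell quoting problems inside the SQL text.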

Importing and exporting data non-interactively:

https://clickhouse.tech/docs/zh/getting-started/example-datasets/metrica/

Import data:

unxz hits_v1.tsv.xz 

clickhouse-client --query "create database if not exists test"

clickhouse-client --query "create table test.hit ( WatchID UInt64,  JavaEnable UInt8,  Title String,  GoodEvent Int16,  EventTime DateTime,  EventDate Date,  CounterID UInt32,  ClientIP UInt32,  ClientIP6 FixedString(16),  RegionID UInt32,  UserID UInt64,  CounterClass Int8,  OS UInt8,  UserAgent UInt8,  URL String,  Referer String,  URLDomain String,  RefererDomain String,  Refresh UInt8,  IsRobot UInt8,  RefererCategories Array(UInt16),  URLCategories Array(UInt16), URLRegions Array(UInt32),  RefererRegions Array(UInt32),  ResolutionWidth UInt16,  ResolutionHeight UInt16,  ResolutionDepth UInt8,  FlashMajor UInt8, FlashMinor UInt8,  FlashMinor2 String,  NetMajor UInt8,  NetMinor UInt8, UserAgentMajor UInt16,  UserAgentMinor FixedString(2),  CookieEnable UInt8, JavascriptEnable UInt8,  IsMobile UInt8,  MobilePhone UInt8,  MobilePhoneModel String,  Params String,  IPNetworkID UInt32,  TraficSourceID Int8, SearchEngineID UInt16,  SearchPhrase String,  AdvEngineID UInt8,  IsArtifical UInt8,  WindowClientWidth UInt16,  WindowClientHeight UInt16,  ClientTimeZone Int16,  ClientEventTime DateTime,  SilverlightVersion1 UInt8, SilverlightVersion2 UInt8,  SilverlightVersion3 UInt32,  SilverlightVersion4 UInt16,  PageCharset String,  CodeVersion UInt32,  IsLink UInt8,  IsDownload UInt8,  IsNotBounce UInt8,  FUniqID UInt64,  HID UInt32,  IsOldCounter UInt8, IsEvent UInt8,  IsParameter UInt8,  DontCountHits UInt8,  WithHash UInt8, HitColor FixedString(1),  UTCEventTime DateTime,  Age UInt8,  Sex UInt8,  Income UInt8,  Interests UInt16,  Robotness UInt8,  GeneralInterests Array(UInt16), RemoteIP UInt32,  RemoteIP6 FixedString(16),  WindowName Int32,  OpenerName Int32,  HistoryLength Int16,  BrowserLanguage FixedString(2),  BrowserCountry FixedString(2),  SocialNetwork String,  SocialAction String,  HTTPError UInt16, SendTiming Int32,  DNSTiming Int32,  ConnectTiming Int32,  ResponseStartTiming Int32,  ResponseEndTiming Int32,  FetchTiming Int32,  RedirectTiming Int32, 
DOMInteractiveTiming Int32,  DOMContentLoadedTiming Int32,  DOMCompleteTiming Int32,  LoadEventStartTiming Int32,  LoadEventEndTiming Int32, NSToDOMContentLoadedTiming Int32,  FirstPaintTiming Int32,  RedirectCount Int8, SocialSourceNetworkID UInt8,  SocialSourcePage String,  ParamPrice Int64, ParamOrderID String,  ParamCurrency FixedString(3),  ParamCurrencyID UInt16, GoalsReached Array(UInt32),  OpenstatServiceName String,  OpenstatCampaignID String,  OpenstatAdID String,  OpenstatSourceID String,  UTMSource String, UTMMedium String,  UTMCampaign String,  UTMContent String,  UTMTerm String, FromTag String,  HasGCLID UInt8,  RefererHash UInt64,  URLHash UInt64,  CLID UInt32,  YCLID UInt64,  ShareService String,  ShareURL String,  ShareTitle String,  ParsedParams Nested(Key1 String,  Key2 String, Key3 String, Key4 String, Key5 String,  ValueDouble Float64),  IslandID FixedString(16),  RequestNum UInt32,  RequestTry UInt8) ENGINE = MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY (CounterID, EventDate, intHash32(UserID)) SAMPLE BY intHash32(UserID) SETTINGS index_granularity = 8192"

cat hits_v1.tsv | clickhouse-client --query "insert into test.hit FORMAT TSV " --max_insert_block_size=100000

clickhouse-client --query "optimize table test.hit final"

clickhouse-client --query "select count(*) from test.hit"
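The import above streams an existing TSV file into INSERT ... FORMAT TSV. When the rows come from a program instead, they have to be serialized with the same escaping rules: tabs, newlines, and backslashes are backslash-escaped, and NULL is written as \N. A minimal Python sketch, with hypothetical helper names:

```python
def tsv_escape(value) -> str:
    """Serialize one value for the TSV (TabSeparated) format."""
    if value is None:
        return r"\N"  # NULL marker
    text = str(value)
    return (
        text.replace("\\", "\\\\")  # backslash first, so later escapes stay intact
        .replace("\t", "\\t")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_tsv_line(row) -> str:
    """Join escaped values with tabs and terminate the row with a newline."""
    return "\t".join(tsv_escape(value) for value in row) + "\n"


rows = [(1, "plain title"), (2, "tab\there"), (3, None)]
payload = "".join(to_tsv_line(row) for row in rows)
print(payload, end="")
```

The resulting payload can be piped to clickhouse-client on standard input in the same way as the file in the import step.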

Export data:

clickhouse-client --query "select * from test.hit" > /root/hit.tsv

How fast it is:

SELECT count(*) FROM test.hit

┌─count()─┐
│ 8873898 │
└─────────┘

1 rows in set. Elapsed: 0.002 sec. 

SELECT 
    RequestNum, 
    RequestTry
FROM test.hit
LIMIT 1000000, 10

┌─RequestNum─┬─RequestTry─┐
│        240 │          1 │
│          5 │          0 │
│          4 │          1 │
│       1188 │          0 │
│       1829 │          0 │
│       1229 │          0 │
│       1508 │          0 │
│       1348 │          0 │
│       1418 │          2 │
│       1988 │          0 │
└────────────┴────────────┘

10 rows in set. Elapsed: 0.007 sec. Processed 1.57 million rows, 7.86 MB (223.91 million rows/s., 1.12 GB/s.) 
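The "Processed" line is a nice illustration of columnar storage: the query touches only RequestNum (UInt32, 4 bytes) and RequestTry (UInt8, 1 byte), so each row costs about 5 bytes instead of the full row width. A small back-of-the-envelope check in Python, using only the rounded figures printed above (so the results are approximate):

```python
# Column widths in bytes, taken from the CREATE TABLE statement:
# RequestNum is UInt32 and RequestTry is UInt8.
COLUMN_BYTES = {"RequestNum": 4, "RequestTry": 1}

rows_processed = 1.57e6   # "Processed 1.57 million rows" (rounded)
elapsed_seconds = 0.007   # "Elapsed: 0.007 sec." (rounded)

bytes_per_row = sum(COLUMN_BYTES.values())
estimated_mb = rows_processed * bytes_per_row / 1e6
rows_per_second = rows_processed / elapsed_seconds

print(bytes_per_row)                       # 5
print(f"{estimated_mb:.2f} MB")            # 7.85 MB, close to the reported 7.86 MB
print(f"{rows_per_second / 1e6:.0f} million rows/s")  # 224, close to the reported 223.91
```

The small differences from the reported values come from the rounding of the row count and the elapsed time in the client output.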

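3.2 Built-in utilities

clickhouse-local: a stand-alone version of ClickHouse that runs without the server. It needs an explicitly specified data source, reads data from stdin, runs non-interactively, and can only use the File table engine.

clickhouse-benchmark: a performance testing tool for SQL statements. It runs the SQL queries automatically and produces a report of the corresponding runtime metrics.

clickhouse-benchmark parameters:

-i / --iterations			number of times the SQL queries are executed, default 0 (repeat indefinitely)

-c / --concurrency			number of queries executed concurrently, default 1

-r / --randomize			when several SQL statements are given, execute them in random order

-h / --host					server address, default localhost; for a comparison test, two servers must be specified

SQL statement performance test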

echo "select * from test.hit limit 1000000,10" |clickhouse-benchmark -i 5 

Loaded 1 queries.

Queries executed: 1.

localhost:9000, queries 1, QPS: 1.121, RPS: 1203315.971, MiB/s: 1131.422, result RPS: 11.213, result MiB/s: 0.011.

0.000%		0.892 sec.	

10.000%		0.892 sec.	

20.000%		0.892 sec.	

30.000%		0.892 sec.	

40.000%		0.892 sec.	

50.000%		0.892 sec.	

60.000%		0.892 sec.	

70.000%		0.892 sec.	

80.000%		0.892 sec.	

90.000%		0.892 sec.	

95.000%		0.892 sec.	

99.000%		0.892 sec.	

99.900%		0.892 sec.	

99.990%		0.892 sec.	

Queries executed: 2.

localhost:9000, queries 1, QPS: 1.651, RPS: 1772268.853, MiB/s: 1673.131, result RPS: 16.515, result MiB/s: 0.015.

0.000%		0.606 sec.	

10.000%		0.606 sec.	

20.000%		0.606 sec.	

30.000%		0.606 sec.	

40.000%		0.606 sec.	

50.000%		0.606 sec.	

60.000%		0.606 sec.	

70.000%		0.606 sec.	

80.000%		0.606 sec.	

90.000%		0.606 sec.	

95.000%		0.606 sec.	

99.000%		0.606 sec.	

99.900%		0.606 sec.	

99.990%		0.606 sec.	

Queries executed: 5.

localhost:9000, queries 5, QPS: 1.183, RPS: 1257808.054, MiB/s: 1184.972, result RPS: 11.829, result MiB/s: 0.012.

0.000%		0.606 sec.	

10.000%		0.651 sec.	

20.000%		0.696 sec.	

30.000%		0.753 sec.	

40.000%		0.823 sec.	

50.000%		0.892 sec.	

60.000%		0.894 sec.	

70.000%		0.897 sec.	

80.000%		0.941 sec.	

90.000%		1.027 sec.	

95.000%		1.070 sec.	

99.000%		1.104 sec.	

99.900%		1.112 sec.	

99.990%		1.113 sec.	
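To compare runs, the summary line of each report can be parsed into numbers. The sketch below handles the exact line layout shown above; parse_summary is a hypothetical helper, and the reading of the fields (QPS as queries per second, RPS as rows read per second, "result" values as what is returned to the client) is my interpretation of the report.

```python
SUMMARY = (
    "localhost:9000, queries 5, QPS: 1.183, RPS: 1257808.054, "
    "MiB/s: 1184.972, result RPS: 11.829, result MiB/s: 0.012."
)


def parse_summary(line: str) -> dict:
    """Turn one clickhouse-benchmark summary line into a dict of metrics."""
    parts = line.strip().rstrip(".").split(", ")
    metrics = {"server": parts[0]}
    for part in parts[1:]:
        if ": " in part:
            name, value = part.split(": ", 1)    # e.g. "QPS: 1.183"
            metrics[name] = float(value)
        else:
            name, value = part.rsplit(" ", 1)    # e.g. "queries 5"
            metrics[name] = int(value)
    return metrics


metrics = parse_summary(SUMMARY)
print(metrics["queries"], metrics["QPS"])

# Rough wall-clock time of the whole run: queries divided by queries per second.
total_seconds = metrics["queries"] / metrics["QPS"]
print(f"{total_seconds:.2f} s")  # 4.23 s
```

With -i 5 and a single connection, about 4.2 seconds for five queries is consistent with the per-query percentiles between 0.6 and 1.1 seconds.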

Origin blog.csdn.net/weixin_45320660/article/details/112761722