ClickHouse Overview

liuzx32

2018.12.29 17:38:06

One. Overview

With the arrival of the Internet of Things (IoT), the volume of sensor and alarm data stored by IoT devices keeps growing, and extracting the useful value from that data requires work by data analysts. Big data analytics has therefore become a very important field. The open source wave of the past few years has given data analysis engineers a large supply of tools. But it has also made it harder for developers, especially newcomers, to choose the right one. Learning costs and the diversity and complexity of frameworks have become a major problem. For example, Kafka, HDFS, Spark, Hive and others often have to be combined to produce the final result. Integrating the various open source frameworks, tools, libraries and platforms by hand is complex work. It is one of the things big data developers and data analysts complain about most often, and it is the primary reason they support simplifying and unifying big data analytics platforms.

Two. ClickHouse development history

On June 15, 2016, Yandex open sourced a data analysis database called ClickHouse, which was a big move for the usually conservative Russians. Even more surprising, in benchmarks this column-store database outperforms many popular commercial MPP database products, such as Vertica. If you have not heard of Vertica, you have surely heard of Michael Stonebraker: 2014 Turing Award winner, inventor of PostgreSQL and Ingres (Sybase and SQL Server both descend from Ingres), and founder of Paradigm4 and SciDB. Michael Stonebraker founded Vertica in 2005. The company was later acquired by HP, and HP Vertica became the representative high-performance commercial MPP columnar database. Facebook purchased Vertica for user behavior analysis.

Three. ClickHouse feature analysis

Before looking at the scenarios where ClickHouse is used, developers need to understand the features and drawbacks of its technical architecture. Only by knowing a tool's strengths and weaknesses can you use it well. Next, let's look at ClickHouse's specific characteristics:

Ø 1. True column-oriented DBMS

Ø 2. Efficient data compression

Ø 3. Data stored on disk

Ø 4. Multi-core parallel processing

Ø 5. Distributed processing on multiple servers

Ø 6. SQL syntax support

Ø 7. Vectorized engine

Ø 8. Real-time data updates

Ø 9. Indexes

Ø 10. Suitable for online queries

Ø 11. Support for approximate calculation

Ø 12. Support for nested data structures

Ø Support for arrays as a data type

Ø 13. Query complexity limits and quotas

Ø 14. Data replication and data integrity support

# Let's look at some of these features:

1. True column-oriented DBMS

In a true column-oriented DBMS, no "junk" is stored alongside the values. Among other things, this means fixed-length values must be supported, so that the length of each value does not have to be stored next to it. For example, one billion UInt8 values should consume about 1 GB of disk space uncompressed; anything more strongly affects CPU use. Because decompression speed (CPU usage) depends mainly on the volume of uncompressed data, it is very important to store data compactly (without any "junk") even when it is not compressed.

This matters because some systems can store the values of individual columns separately but, being optimized for other scenarios, cannot process analytical queries effectively. Examples are HBase, BigTable, Cassandra and HyperTable. In these systems you can get throughput of around a hundred thousand rows per second, but not hundreds of millions of rows per second.

In addition, ClickHouse is a DBMS, rather than a single database. ClickHouse allows you to create tables and databases at runtime, load data and run queries, without reconfiguring and restarting the server.
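To make the fixed-length storage point above concrete, here is a minimal sketch in Python. It is only an illustration of the arithmetic, not ClickHouse's on-disk format: the 1,000-value column and the 1-byte length prefix are assumptions chosen for the example.

```python
from array import array
import struct

# Hypothetical column of 1,000 UInt8 values (scaled down from the
# one billion values discussed in the text).
values = [i % 256 for i in range(1000)]

# Fixed-length layout: exactly one byte per value, nothing else stored.
fixed = array("B", values).tobytes()

# Length-prefixed layout (assumed for comparison): a 1-byte length is
# stored next to every value, as a variable-length format would require.
prefixed = b"".join(struct.pack("BB", 1, v) for v in values)

# Project the fixed-length cost up to one billion values.
bytes_per_value = len(fixed) / len(values)
projected_bytes = bytes_per_value * 1_000_000_000

print(len(fixed), len(prefixed), projected_bytes)
```

In this toy layout the length prefix doubles the stored size, while the fixed-length column stays at one byte per value, which is where the "about 1 GB for one billion UInt8 values" figure comes from.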

2. Data Compression

Some column-oriented DBMSs (InfiniDB CE and MonetDB) do not use data compression. However, data compression really does improve performance.
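One way to see why compression helps is that a single column usually holds similar values, so far fewer bytes have to be read. The sketch below is an illustration only: zlib from the Python standard library stands in for whatever codec a database uses, and the status-code column is hypothetical.

```python
import zlib
from array import array

# Hypothetical low-cardinality column: 100,000 UInt8 status codes that
# appear in long runs, as values within one column often do.
column = array("B", [(i // 1000) % 4 for i in range(100_000)]).tobytes()

# Compress the column as one block, then restore it.
compressed = zlib.compress(column)
restored = zlib.decompress(compressed)

# How many times smaller the stored block is than the raw column.
ratio = len(column) / len(compressed)

print(len(column), len(compressed), round(ratio, 1))
```

The column round-trips unchanged, and the compressed block is a small fraction of the raw size, so a query scanning it would read far fewer bytes from disk.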

3. Disk storage of data

Many column-oriented DBMSs (such as SAP HANA and Google PowerDrill) can only work in memory. But even on thousands of servers, memory is too small to hold all the page views and sessions in Yandex.Metrica.

4. Multi-core parallel processing

Large queries are parallelized across multiple cores and multiple nodes.

5. Distributed processing on multiple servers

Almost none of the column-oriented DBMSs listed above support distributed processing. In ClickHouse, data can reside on different shards. Each shard can be a group of replicas used for fault tolerance. A query is processed on all shards in parallel. This is transparent to the user.
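The idea of running one query on all shards in parallel and combining the results can be sketched as a scatter-gather aggregation. This is a simplified model, not ClickHouse's implementation: the shard contents and the `partial_aggregate` and `distributed_avg` helpers are hypothetical, and threads stand in for separate servers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each one holds part of a single numeric column.
shards = [
    [3, 5, 8],
    [1, 4],
    [7, 2, 6, 9],
]

def partial_aggregate(rows):
    """Runs on one shard: return a mergeable partial state (sum, count)."""
    return sum(rows), len(rows)

def distributed_avg(shards):
    # Send the same aggregation to every shard in parallel.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(partial_aggregate, shards))
    # Merge the partial states into the final answer.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

result = distributed_avg(shards)
print(result)
```

Each shard returns a (sum, count) pair rather than its own average, because partial averages cannot be merged correctly when shards hold different numbers of rows.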

6. SQL support

If you are familiar with standard SQL, ClickHouse's dialect cannot really be called full SQL support. NULL is not supported. All functions have different names. JOIN is supported. Subqueries are supported in FROM, IN and JOIN clauses, and scalar subqueries are supported. Correlated subqueries are not supported.

7. Vectorized engine

Data is not only stored by columns but also processed by vectors (parts of columns). This allows high CPU efficiency.
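The shape of vector-at-a-time processing can be sketched as follows. Plain Python gives none of the CPU benefit a real engine gets, so this only shows the control flow; `CHUNK_SIZE`, the two columns and the `sum_where_greater` helper are hypothetical.

```python
from array import array

CHUNK_SIZE = 4  # hypothetical; a real engine would use far larger blocks

def chunks(column, size):
    """Yield consecutive vectors (parts of a column) of at most `size` values."""
    for start in range(0, len(column), size):
        yield column[start:start + size]

def sum_where_greater(prices, quantities, threshold):
    """Sum prices for rows whose quantity exceeds threshold, one vector at a time."""
    total = 0
    for p_chunk, q_chunk in zip(chunks(prices, CHUNK_SIZE),
                                chunks(quantities, CHUNK_SIZE)):
        # Step 1: evaluate the filter over the whole vector.
        mask = [q > threshold for q in q_chunk]
        # Step 2: aggregate the matching values of the other column.
        total += sum(p for p, keep in zip(p_chunk, mask) if keep)
    return total

prices = array("I", [10, 20, 30, 40, 50, 60])
quantities = array("I", [1, 5, 2, 7, 3, 9])
result = sum_where_greater(prices, quantities, 2)
print(result)
```

Each operation is applied to a whole block of one column before moving on, instead of interpreting the query once per row.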

8. Real-time data updates

ClickHouse supports tables with a primary key. To perform range queries on the primary key quickly, the data is kept sorted incrementally using a merge tree (MergeTree). For this reason, data can be added to the table continuously. No locks are taken when data is added.
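The general merge tree idea (write each batch as its own sorted part, merge parts later, and binary-search sorted data for key ranges) can be sketched with a toy class. This is not ClickHouse's MergeTree engine: `TinyMergeTree` and its methods are hypothetical, keys are assumed to be integers, and everything lives in memory.

```python
import heapq
from bisect import bisect_left

class TinyMergeTree:
    """Toy model: every insert batch becomes a sorted part; parts merge later."""

    def __init__(self):
        self.parts = []  # each part is a sorted list of (key, value) tuples

    def insert(self, rows):
        # New data is written as a separate sorted part, so existing
        # parts are never modified and no lock on them is needed.
        self.parts.append(sorted(rows))

    def merge(self):
        # Combine all sorted parts into one sorted part.
        self.parts = [list(heapq.merge(*self.parts))]

    def range_query(self, lo, hi):
        # Binary-search each sorted part for integer keys in [lo, hi].
        # (lo,) sorts before every (lo, value) tuple, and (hi + 1,) sorts
        # after every tuple whose key is at most hi.
        found = []
        for part in self.parts:
            start = bisect_left(part, (lo,))
            stop = bisect_left(part, (hi + 1,))
            found.extend(part[start:stop])
        return sorted(found)

tree = TinyMergeTree()
tree.insert([(5, "e"), (1, "a")])
tree.insert([(3, "c"), (2, "b")])
before = tree.range_query(2, 4)
tree.merge()
after = tree.range_query(2, 4)
print(before, after)
```

The query returns the same rows before and after the merge; merging only reduces the number of parts that have to be searched.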

9. Index

With a primary key, data for a particular client (a Metrica counter) within a specific time range can be extracted with a latency of less than a few tens of milliseconds.

10. Support for online queries

This lets us use the system as the back end for a web interface. Low latency means queries can be processed without delay while the Yandex.Metrica interface page is loading (online mode).

11. Support for approximate calculation

1. The system includes aggregate functions for approximate calculation of the number of distinct values, medians and quantiles.

2. Queries can be run on a part (sample) of the data to get an approximate result. In this case, proportionally less data is read from disk.

3. Aggregation can be run for a limited number of random keys instead of all keys. Under certain conditions on the distribution of keys in the data, this gives a reasonably accurate result while using fewer resources.
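The sampling approach in point 2 can be illustrated with a small sketch: read roughly a tenth of the rows and compare the sampled average with the exact one. The data, the 10% fraction and the `sampled_avg` helper are assumptions for illustration; this says nothing about how ClickHouse selects its sample.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: 100,000 session durations in seconds.
durations = [rng.randint(1, 100) for _ in range(100_000)]

def exact_avg(values):
    """Average over every row."""
    return sum(values) / len(values)

def sampled_avg(values, fraction, rng):
    """Average over a random sample of about `fraction` of the rows."""
    sample = [v for v in values if rng.random() < fraction]
    return sum(sample) / len(sample), len(sample)

exact = exact_avg(durations)
approx, rows_read = sampled_avg(durations, 0.1, rng)
relative_error = abs(approx - exact) / exact

print(rows_read, round(exact, 2), round(approx, 2))
```

The sampled query touches only about one row in ten, and for evenly distributed data like this the estimate stays close to the exact average.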

12. Data replication and data integrity support

ClickHouse uses asynchronous multi-master replication. After data is written to any available replica, it is distributed to all remaining replicas. The system keeps the data identical across replicas. Data is restored automatically after a failure.

ClickHouse is not perfect:

Ø 1. Does not support transactions.

Ø 2. Does not support Update / Delete operations.

Ø 3. Limited operating system support.

Ubuntu is currently supported. On CentOS you need to compile it yourself, although enthusiastic community members have already built packages that can be used directly. Windows is not supported.

Four. ClickHouse use cases

After ClickHouse was open sourced on June 15, 2016, a ClickHouse Chinese community was established. The Chinese open source group started with staff from Analysys, Hikvision, Meituan, Sina, JD.com, 58, Tencent, Kugou Music and the Russian open source community. As the community grew more active, members from Digital China, Albatron, PingCAP, ChinaSoft International and other companies joined one after another. What began as technical discussion in the group grew as a number of large companies applied ClickHouse in their projects. Because a chat group is inconvenient for sharing solutions to problems, a dedicated forum was set up. According to those exchanges, some large companies are already using it.

# You can apply the following scenarios:

1. Data storage and statistics in the telecommunications industry.

2. Logging and analysis of Weibo user behavior data.

3. User behavior analysis for advertising networks, RTB and e-commerce.

4. Log analysis in information security.

5. Detection and mining of remote sensing information.

6. Business intelligence.

7. Data processing and value analysis for online games and the Internet of Things.

8. The largest application is Yandex's statistical analysis service Yandex.Metrica, which is similar to Google Analytics (GA), Umeng statistics or Xiaomi statistics: a tool that helps websites and mobile applications with data analysis and fine-grained operations. Yandex.Metrica is reportedly the world's second-largest web analytics platform. In this application, ClickHouse is deployed on nearly 400 machines and supports 20 billion events per day and a history of more than 13 trillion records. All of it is stored as raw (non-aggregated) data, so SQL can be used at any time for querying, analysis and generating user reports.

Five. Comparing ClickHouse with other technologies

1. Commercial OLAP databases

For example: HP Vertica, Actian Vector

Difference: ClickHouse is open source and free.

2. Cloud solutions

For example: Amazon Redshift and Google BigQuery

Difference: ClickHouse can be deployed on your own machines, without paying for cloud services.

3. Hadoop ecosystem software

For example: Cloudera Impala, Spark SQL, Facebook Presto, Apache Drill

Difference:

ClickHouse supports real-time, highly concurrent systems.

ClickHouse does not depend on the Hadoop ecosystem.

ClickHouse supports deployment distributed across data centers.

4. Open source OLAP databases

For example: InfiniDB, MonetDB, LucidDB

Difference: these projects are used only at small scale and not in large Internet services. By contrast, ClickHouse is far more mature and stable.

5. Open source analytical, non-relational databases

For example: Druid, Apache Kylin

Difference: ClickHouse can run queries directly on the raw data, and it supports a SQL-like language, which is convenient for users of traditional relational databases.

Six. Summary

Traditional big data analytics requires combining different frameworks and technologies to achieve the final result. Labor costs, technical skill requirements, and hardware and maintenance costs make big data analysis expensive. Many small and medium enterprises are therefore in a difficult position and are forced to rent big data analysis services from third-party companies.

The emergence of open source ClickHouse is a breath of fresh air for the many companies that want to do big data analysis. ClickHouse does not depend on the Hadoop ecosystem, is simple to install and maintain, offers fast queries, and supports SQL, so it should go further and further in big data analytics.


Origin blog.csdn.net/gwdgwd123/article/details/104037082