Why is columnar storage widely used in OLAP?

Hello everyone, my name is Big D.

Some of you may have wondered why columnar storage is so widely used in the OLAP field, and what advantages it has over row storage. Today we will compare these two storage methods.

In fact, columnar storage is not a new technology; it can be traced back to the 1983 paper on Cantor. However, limited by early hardware conditions and application scenarios, traditional transactional (OLTP) relational databases such as Oracle and MySQL store data in rows.

It was not until the rise of analytical databases (OLAP) in recent years that the concept of columnar storage became popular again. Big data-related databases such as HBase and Cassandra organize data by column (more precisely, by column family).

The principle and characteristics of row storage

For OLTP scenarios, most operations add, delete, modify, or query an entire row of records, so storing data on disk row by row is a good choice.

However, when a query filters on and returns only a few fields, those fields are buried inside each row of data, so every complete row must be read. The large number of disk seeks this requires greatly reduces read efficiency.

For example, the following figure shows the employee information emp table.

Data is stored on the disk in the form of rows, and the data of the same row is stored next to each other.

Suppose that, for the emp table, we need to query the names of all employees whose department (dept) is A.

SELECT name FROM emp WHERE dept = 'A';

Since the values of dept are scattered across the disk, the disk has to seek many times during the query to locate the data and return the result.
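To make this concrete, here is a minimal Python sketch of the query against a row layout. The sample rows and the function names are made up for illustration (the original figure is not reproduced here), and counting "values read" is only a rough stand-in for real disk I/O.

```python
# Hypothetical sample rows for the emp table: (id, name, dept, salary).
ROWS = [
    (1, "Alice", "A", 5000),
    (2, "Bob", "B", 6000),
    (3, "Carol", "A", 7000),
    (4, "Dave", "C", 5500),
]


def row_store_layout(rows):
    """Flatten rows into one sequence, mimicking how a row store places
    all fields of a row next to each other."""
    return [value for row in rows for value in row]


def query_names_row_store(rows, dept):
    """Sketch of: SELECT name FROM emp WHERE dept = ?

    Returns the matching names and the number of field values that
    had to be read to answer the query.
    """
    names = []
    values_read = 0
    for row in rows:
        # The whole row is read just to reach its dept and name fields.
        values_read += len(row)
        _id, name, row_dept, _salary = row
        if row_dept == dept:
            names.append(name)
    return names, values_read


names, values_read = query_names_row_store(ROWS, "A")
print(row_store_layout(ROWS)[:8])  # the first two rows, stored back to back
print(names, values_read)          # ['Alice', 'Carol'] 16
```

Even though the query only cares about name and dept, all 16 values in the table are touched, because the dept values sit four positions apart in the flattened layout.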

The principle and characteristics of column storage

For OLAP scenarios, a typical query needs to traverse the entire table to perform operations such as grouping, sorting, and aggregation. In this case, the advantage of storing a whole row of records together in row-based storage no longer exists. Moreover, analytical SQL often does not use all the columns but only operates on a few of them, yet in row-based storage the unrelated columns in each row still have to be scanned.

However, in columnar storage, the data in the same column is stored next to each other, as shown in the following figure.

Then, when a query filters on and returns only the required fields, there is no need to scan every row of data. The required data is located by column, fewer disk seeks are needed, and performance improves.

Take the query from the example above again. Since the values of dept are stored contiguously on disk in columnar storage, the disk only needs to read them sequentially to return the results.
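Here is a rough Python sketch of the same query against a column layout. The sample data and the function name are hypothetical, and "values read" is again only a simple proxy for disk I/O, not a measurement.

```python
# Hypothetical emp table stored column by column: each list holds one
# column, and position i in every list belongs to the same row.
COLUMNS = {
    "id": [1, 2, 3, 4],
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "dept": ["A", "B", "A", "C"],
    "salary": [5000, 6000, 7000, 5500],
}


def query_names_column_store(columns, dept):
    """Sketch of: SELECT name FROM emp WHERE dept = ?

    Returns the matching names and the number of values read.
    """
    values_read = 0

    # Step 1: scan only the dept column, sequentially.
    matching_positions = []
    for position, value in enumerate(columns["dept"]):
        values_read += 1
        if value == dept:
            matching_positions.append(position)

    # Step 2: fetch name only at the matching positions.
    names = []
    for position in matching_positions:
        values_read += 1
        names.append(columns["name"][position])

    return names, values_read


total_values = sum(len(column) for column in COLUMNS.values())
names, values_read = query_names_column_store(COLUMNS, "A")
print(names, values_read, total_values)  # ['Alice', 'Carol'] 6 16
```

In this sketch the id and salary columns are never touched, so only 6 of the 16 stored values are read.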

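Columnar storage does more than improve efficiency by reading only the columns a query needs. Because the data in one column is all of the same type (numeric, string, and so on) and is therefore highly similar, a suitable encoding and compression scheme can also be applied to reduce storage space, which in turn reduces I/O and improves read performance.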
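As one example of such an encoding, here is a minimal run-length encoding (RLE) sketch in Python. RLE is just one scheme chosen for illustration; the column values and function names are hypothetical, and real columnar engines pick among several encodings depending on the data.

```python
def rle_encode(values):
    """Collapse consecutive repeated values into (value, run_length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([value, 1])   # start a new run
    return [(value, count) for value, count in runs]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


# Hypothetical dept column, sorted so that equal values are adjacent.
dept_column = ["A", "A", "A", "B", "B", "C", "C", "C", "C"]

encoded = rle_encode(dept_column)
print(encoded)                              # [('A', 3), ('B', 2), ('C', 4)]
print(rle_decode(encoded) == dept_column)   # True
```

Nine stored values shrink to three pairs here. The same trick works poorly on a row layout, where a dept value is followed by a name or a salary rather than by another dept value.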

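In general, neither row storage nor columnar storage is superior to the other; each is simply better suited to certain application scenarios.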



Origin juejin.im/post/7083391031890149384