Huawei Open-Sources CarbonData, Bringing Second-Level Response to Big Data Ad Hoc Queries

Huawei has announced that it has open-sourced the CarbonData project, which passed the Apache community vote on June 3 and entered the Apache Incubator. CarbonData is a lightweight file storage format that enables low-latency queries on an architecture where storage and computation are separated. What advantages does CarbonData have over SQL on Hadoop solutions, traditional NoSQL databases, or search systems such as Elasticsearch? What does its technical architecture look like? What are the plans for its future? We interviewed the technical director of the CarbonData project to answer these questions.

InfoQ: When did the CarbonData project start? Why open-source it to the Apache Incubator now? What has the development history been, and what is the current status of the project?

CarbonData: The CarbonData project grew gradually out of Huawei's years of data-processing experience and industry understanding. In 2015 we restructured the system and evolved it into a general-purpose columnar storage format on HDFS; connected to the Spark engine, it forms a complete distributed OLAP analysis solution.

Huawei has long been a provider of big data platform solutions for telecommunications, finance, and IT enterprises. We continuously distilled data characteristics from many customer scenarios and summarized typical demands for big data analysis, which gradually took shape as the CarbonData architecture.

In the IT field, only openness and open source can ultimately connect the data of more customers and partners and generate greater business value. Open-sourcing is about building an end-to-end ecosystem. CarbonData is a storage-layer technology; for it to deliver value, it must integrate effectively with the compute and query layers, and only a real ecosystem makes that value possible.

Moreover, Apache is currently the most authoritative open-source organization in the big data field, and Hadoop and Spark have become the de facto standards of open-source big data. We also identify with Apache's philosophy of driving technological progress through community, so we chose to join Apache and build together with the community, integrating CarbonData into the big data ecosystem.

At present, the CarbonData project has passed the Apache community vote of June 3 and entered the Apache Incubator.

The relevant community links are as follows. Apache CarbonData GitHub repository: https://github.com/apache/incubator-carbondata

Everyone is welcome to join the Apache CarbonData community: https://github.com/apache/incubator-carbondata/blob/master/docs/How-to-contribute-to-Apache-CarbonData.md

InfoQ: What was the reason or opportunity that prompted the idea of CarbonData? What difficulties had you encountered in earlier projects?

CarbonData: We have long faced demands for high-performance data analysis. Traditionally, we used databases plus BI tools to build reports, dashboards, and interactive queries. But as enterprise data volumes keep growing and business-driven analysis demands ever more flexibility, some customers want analysis capabilities that go beyond SQL, and the traditional approach gradually fails to meet their needs. That is where the idea of the CarbonData project came from.

These demands generally come from several sources.

First, in terms of deployment: unlike the single-node systems of the past, enterprise customers want a distributed solution that can cope with ever-growing data and can scale out at any time by adding commodity servers.

Second, in terms of business functionality: many enterprises are gradually migrating their business from traditional databases to big data platforms, which requires the platform to be highly compatible with existing workloads; that means full standard SQL support as well as support for a variety of analysis scenarios. At the same time, to save costs, enterprises hope that "one copy of data can serve multiple usage scenarios": batch processing with large-scale scans and computation, multi-dimensional interactive OLAP analysis, ad hoc queries over detailed data, low-latency primary-key lookups, and real-time queries over streaming data, ideally all with second-level query response.

Third, in terms of ease of use: enterprise customers used BI tools in the past, and the OLAP model for business analysis had to be built inside the BI tool, which limits the flexibility of the data model and analysis methods in some scenarios. In the big data era, an ecosystem has formed around open source; the community advances constantly and new analysis tools keep emerging. Enterprise customers therefore hope to evolve their systems along with the community and quickly apply new analysis tools to their own data to extract greater business value.

Meeting all of the above requirements at the same time is undoubtedly a big challenge for a big data platform. To meet them, we accumulated experience in real projects and tried many different solutions, but we found no single solution that solved all the problems.

The first option that comes to mind for distributed storage with low-latency queries is a KV-style NoSQL database (such as HBase or Cassandra), which solves low-latency lookups on the primary key. But if the business query pattern changes even slightly, for example into a query over a flexible combination of dimensions, the point lookup turns into a full table scan and performance drops sharply. In some scenarios this can be alleviated by adding a secondary index, but that brings management problems such as maintaining and synchronizing the index, so KV storage is not a general solution to enterprise problems.
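To make the point-lookup-versus-scan distinction concrete, here is a minimal sketch using the HBase 1.x client API (the table name "events", the column family "d", and the row-key scheme are hypothetical): a Get on the row key is fast, while filtering on an ordinary column degenerates into a full table scan.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Scan}
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.util.Bytes

object KvLookupSketch {
  def main(args: Array[String]): Unit = {
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("events"))

    // Fast path: point lookup on the row key, answered in milliseconds.
    val row = table.get(new Get(Bytes.toBytes("user42#20160603")))

    // Slow path: filtering on a non-key column ("city"). With no secondary
    // index, HBase must scan the entire table to evaluate this predicate.
    val scan = new Scan()
    scan.setFilter(new SingleColumnValueFilter(
      Bytes.toBytes("d"), Bytes.toBytes("city"),
      CompareOp.EQUAL, Bytes.toBytes("Shenzhen")))
    val scanner = table.getScanner(scan)
    scanner.close(); table.close(); conn.close()
  }
}
```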

Next, to solve general multi-dimensional queries, one sometimes turns to a multi-dimensional time-series database (such as LinkedIn's Pinot). Its characteristic is that data enters the system in time order and is pre-aggregated and indexed on the way in. Because results are pre-computed, multi-dimensional queries are very fast and the data is very fresh, combining the advantages of multi-dimensional analysis and real-time processing; such systems are widely used in performance monitoring and real-time metrics analysis. However, they are limited in the types of queries they support: because the data is pre-computed, this architecture generally cannot handle queries over detailed records and does not support multi-table join analysis, which undoubtedly constrains enterprise usage scenarios.
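To make the trade-off concrete, here is a minimal sketch (using Spark 1.x rather than Pinot itself; the dataset and column names are hypothetical) of the kind of roll-up such systems perform at ingest time: queries on the pre-aggregated dimensions become fast, but the individual detail rows are no longer available to query or join.

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object PreAggregationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pre-agg-sketch"))
    val sqlContext = new SQLContext(sc)

    // Hypothetical raw events: (minute, region, device, latencyMs).
    sqlContext.read.json("hdfs:///data/events").registerTempTable("events")

    // Roll up at ingest time: fast for (minute, region, device) queries,
    // but detail records and cross-table joins can no longer be answered
    // from the rolled-up data alone.
    val rollup = sqlContext.sql(
      """SELECT minute, region, device,
        |       COUNT(*) AS cnt, AVG(latencyMs) AS avgLatency
        |FROM events
        |GROUP BY minute, region, device""".stripMargin)
    rollup.write.parquet("hdfs:///data/events_rollup")
  }
}
```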

The other category is search systems (such as Apache Solr and Elasticsearch). A search system can perform multi-dimensional aggregation as well as query detailed data, offers fast Boolean queries based on inverted indexes, and supports high concurrency, which seems to be exactly the solution we were looking for. However, in practical applications we found two problems. First, because search systems are generally designed for unstructured data, their data expansion ratio tends to be high; under an enterprise relational data model the storage is not compact enough, so the data volume becomes large. Second, a search system's data organization is tightly coupled to its compute engine, so once data has been ingested it can only be processed by the corresponding search engine, which to some extent breaks enterprise customers' expectation of applying a variety of community analysis tools to the same data. So search systems, too, have only their own applicable scenarios.

The last category is the SQL on Hadoop solutions that have emerged in large numbers in the community, represented by Hive, Spark SQL, and Flink. These systems are characterized by the separation of computing and storage, and they provide standard SQL over files stored in HDFS. They satisfy enterprise customers in terms of deployment and ease of use, and they cover scan, aggregation, and detail-query scenarios, so they can be regarded as a general solution. To improve performance, open-source projects such as Spark and Flink keep optimizing their own architectures, but the focus has been on enhancing the compute engine and the SQL optimizer; improving storage and data organization has not been a priority.
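For reference, a minimal Spark SQL sketch of the compute/storage-separated pattern described above (Spark 1.x API; the HDFS path and schema are hypothetical). The engine sees only generic files, so a selective filter still scans every block it cannot prune:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object SqlOnHadoopSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sql-on-hadoop"))
    val sqlContext = new SQLContext(sc)

    // Compute (Spark) and storage (HDFS files) are fully decoupled:
    // Spark reads plain Parquet files and knows nothing more about them.
    sqlContext.read.parquet("hdfs:///warehouse/sales").registerTempTable("sales")

    // Standard SQL over the files. Without format-level indexes, the
    // selective WHERE clause still costs close to a full scan.
    sqlContext.sql(
      "SELECT region, SUM(amount) FROM sales WHERE region = 'EU' GROUP BY region"
    ).show()
  }
}
```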

It is thus clear that although many current big data systems can support various query scenarios, each is really designed for one type of scenario. To cover batch processing, multi-dimensional analysis, and detailed-data queries at the same time, customers often have to copy the data several times and maintain a separate dataset for each scenario.

The original intent of CarbonData's design is to break this limitation: save only one copy of the data, yet optimally support multiple usage scenarios.

InfoQ: Can you talk about the technical architecture of CarbonData? What are its features and advantages?

CarbonData: The big data era arguably began with Google's MapReduce paper, which triggered the Hadoop open-source project and the ecosystem that followed. Its greatness lies in the architecture that decouples computing from storage: it freed part of enterprise workloads (mainly batch processing) from traditional vertical solutions and let computing and storage scale on demand, greatly improving the agility of business development. Many enterprises adopted this computing model and benefited from it.

Although MapReduce opened the big data era, it improves batch-processing performance through brute-force scanning plus distributed computing, so it cannot meet customers' low-latency requirements across all query scenarios.

In the current ecosystem, the solution closest to these customer requirements is actually the search engine. Through good data organization and indexing, search engines provide a variety of fast query capabilities; but a search engine's storage layer is tightly coupled to its compute engine, which conflicts with the enterprise expectation of "one copy of data, multiple scenarios".

This inspired us: why not build a more efficient data organization for general-purpose compute engines? That way we keep the decoupled compute/storage architecture and still provide high-performance queries. With this idea in mind, we started the CarbonData project. One copy of data serving more businesses, on separated computing and storage: that became CarbonData's architectural design philosophy.

Having established this philosophy, we naturally chose an architecture based on HDFS plus a general-purpose compute engine, because this architecture provides scale-out capability well. The next question we asked ourselves was: what is still missing in this architecture? Here, HDFS provides file replication and read/write capability, and the compute engine is responsible for reading files and performing distributed computation; the division of labor is very clear, with the two positioned to solve storage management and computation respectively. But it is not hard to see that, in order to fit more scenarios, HDFS makes a big "sacrifice": it gives up any understanding of the file contents. Precisely because of that, computation can often only be done by full scan, and in the end neither storage nor compute can exploit the characteristics of the data for optimization.

So, to address this problem, we focused CarbonData's effort on optimizing the data organization, with the ultimate goal of improving both I/O performance and computing performance. To this end, CarbonData does the following.

CarbonData Basic Features

  1. Multi-dimensional data organization: when data is loaded, it is reorganized along multiple dimensions so that it is "more cohesive in multi-dimensional space", which yields a better compression ratio in storage and better data-filtering efficiency at query time.
  2. Indexed columnar file structure: first, CarbonData designs multiple levels of indexes for different scenario types and incorporates some search features, including multi-dimensional indexes across files, multi-dimensional indexes within a file, min/max indexes on every column, and inverted indexes within columns. Second, to fit the storage characteristics of HDFS, CarbonData's indexes are stored together with the data files: some indexes are the data itself, while others are kept in the files' metadata structures, so all of them gain HDFS-local access (a usage sketch follows this list).
  3. Column groups: as a whole, CarbonData is a columnar storage structure, but compared with row storage, pure column storage pays a high row-reconstruction cost on detailed-data queries. To improve the performance of detailed-data queries, CarbonData therefore supports column groups: fields that are seldom used as filter conditions but must be returned in the result set can be declared as a column group, and after CarbonData encoding these fields are stored row-wise to improve query performance.
  4. Data types: CarbonData currently supports all common database primitive types, as well as the complex nested types Array and Struct. Community members have also proposed supporting a Map data type, which we plan to add in the future.
  5. Compression: CarbonData currently supports Snappy compression, applied to each column independently, since the columnar layout makes compression very effective. Depending on the application scenario, the compression ratio is generally between 2 and 8.
  6. Hadoop integration: by supporting the InputFormat/OutputFormat interfaces, CarbonData can exploit Hadoop's distributed strengths and be used throughout the Hadoop-based ecosystem.
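As a usage sketch for the features above, here is what the incubator-era Spark integration looks like (Spark 1.x with CarbonContext; the table name, schema, and HDFS paths are hypothetical, and DDL options may differ between versions):

```scala
import org.apache.spark.sql.CarbonContext
import org.apache.spark.{SparkConf, SparkContext}

object CarbonQuickStart {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("carbon-demo"))
    // The second argument is the HDFS path where CarbonData keeps its
    // indexed columnar files.
    val cc = new CarbonContext(sc, "hdfs:///user/carbon/store")

    // Create an indexed CarbonData table via standard Spark SQL DDL.
    cc.sql("""CREATE TABLE IF NOT EXISTS sales (
             |  id INT, city STRING, product STRING, amount DOUBLE)
             |STORED BY 'carbondata'""".stripMargin)

    // Bulk-load CSV data; CarbonData sorts, encodes, and builds its
    // multi-level indexes during this step.
    cc.sql("LOAD DATA INPATH 'hdfs:///data/sales.csv' INTO TABLE sales")

    // Filter queries can now prune blocks via the min/max and
    // multi-dimensional indexes instead of scanning everything.
    cc.sql("SELECT city, SUM(amount) FROM sales WHERE city = 'Shenzhen' GROUP BY city").show()
  }
}
```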

CarbonData Advanced Features

  1. Computation-friendly encodings: in addition to common encodings such as Delta, RLE, Dictionary, and BitPacking, CarbonData supports joint encoding of multiple columns and applies global dictionary encoding to enable decoding-free computation: the compute framework can aggregate, sort, and otherwise operate directly on the encoded data, which brings very noticeable gains for queries that involve heavy shuffles.
  2. Joint optimization with the compute engine: to make efficient use of CarbonData's optimized data organization, CarbonData provides targeted optimization strategies. The community has first integrated deeply with Spark: on top of the Spark SQL framework it adds features such as filter push-down, late materialization, and incremental loading, while supporting the full DataFrame API (see the sketch after this list). We believe that, through the community's future efforts, more compute frameworks will integrate with CarbonData and fully realize the value of its data organization.
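As referenced in item 2, a minimal DataFrame API sketch against the incubator-era integration (the "carbondata" format name and the tableName option follow the documentation of that period; the data itself is hypothetical):

```scala
import org.apache.spark.sql.{CarbonContext, SaveMode}
import org.apache.spark.{SparkConf, SparkContext}

object CarbonDataFrameSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("carbon-df"))
    val cc = new CarbonContext(sc, "hdfs:///user/carbon/store")
    import cc.implicits._

    // Hypothetical in-memory data written out as a CarbonData table.
    val df = sc.parallelize(Seq((1, "Shenzhen", 10.0), (2, "Beijing", 5.0)))
      .toDF("id", "city", "amount")
    df.write
      .format("carbondata")
      .option("tableName", "sales_df")
      .mode(SaveMode.Overwrite)
      .save()

    // Read it back; filters are pushed down into CarbonData's indexes
    // instead of being evaluated after a full scan.
    val back = cc.read.format("carbondata").option("tableName", "sales_df").load()
    back.filter($"city" === "Shenzhen").show()
  }
}
```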

At present, these features have all been merged into the Apache CarbonData mainline, and everyone is welcome to use them.

InfoQ: In which scenarios do you recommend using CarbonData? What are the performance test results? Are there application cases, and what are the current usage and user scale in China?

CarbonData: Recommended scenarios: wherever a single store must serve fast scans, multi-dimensional analysis, and detailed-data queries at the same time. In Huawei's customer use cases, CarbonData improves performance by 5 to 30 times over existing columnar storage solutions in the industry.

For more performance test data and application cases, please follow the WeChat public account ApacheCarbonData and the community at https://github.com/apache/incubator-carbondata

InfoQ: Can CarbonData integrate seamlessly with the currently popular Spark? What other mainstream frameworks is it compatible with?

CarbonData: CarbonData has already been deeply integrated with Spark; see the advanced features described above for details.

InfoQ: What are your plans for the project's future? Will more features be added? How will you ensure continuous maintenance of the project now that it is open source?

CarbonData: The community's next priorities are to improve the system's ease of use and to round out ecosystem integration (for example, integration with Flink and Kafka to enable real-time data ingestion into CarbonData).

In the first month after CarbonData was open-sourced, several hundred commits were submitted and more than 20 contributors took part, so the project will stay active going forward. More than ten core contributors will also continue to take part in building the community.

InfoQ: What stages did you go through in designing and developing CarbonData and bringing it into the Apache Incubator? What was the greatest difficulty? What experience or lessons can you share?

CarbonData: Most of the CarbonData team has experience contributing to Apache communities such as Hadoop and Spark, so we are very familiar with community processes and ways of working. The greatest difficulty was the incubation stage: persuading the Apache community to accept CarbonData, a new high-performance data format for the big data ecosystem. At OSCON, the open-source conference held in Austin in May, we gave a CarbonData technical talk and a live demo, showing CarbonData's excellent architecture and strong performance.

InfoQ: Are you a single team? How do you ensure the team's continued growth and excellence?

CarbonData: The CarbonData team is a global team, with engineers from China, the United States, and India. The experience accumulated through this global working model allows us to adapt quickly to the Apache open-source community's way of working.

 

http://carbondata.apache.org/
