"Depth Interview: Huawei's open source data format CarbonData project, ad hoc query large data-second response."

In-Depth Interview: Huawei's Open Source Data Format CarbonData, Second-Level Responses to Ad Hoc Queries on Big Data

 

Published at 19:00 on July 13, 2016

 

Huawei has announced the open-source CarbonData project, which passed the Apache community vote on June 3 and successfully entered the Apache Incubator. CarbonData is a lightweight file storage format built for low-latency queries, with storage separated from compute. So compared with SQL-on-Hadoop solutions, traditional NoSQL stores, or search systems such as ElasticSearch, what advantages does CarbonData offer? What does its technical architecture look like? What are its plans for the future? We interviewed the people in charge of the CarbonData project to clear up these questions.

InfoQ: When did the CarbonData project start? Why did you choose to open source it through the Apache Incubator? What are the current status and progress of the project?

CarbonData: The CarbonData project grew gradually out of Huawei's many years of experience in data processing and its understanding of the industry. In 2015 we restructured the system architecture, and it evolved into a general columnar storage format on HDFS which, paired with the Spark engine, forms a distributed OLAP analysis solution.

Huawei has long provided big data platform solutions to users in telecommunications, finance, IT, and other industries. From numerous customer scenarios we kept distilling the characteristics of the data, summarized the typical demands of big data analytics, and gradually shaped the CarbonData architecture.

In the IT field, only by being open and open source can we ultimately bring more customers and partners together around the data and produce greater business value. Open source is about building an end-to-end ecosystem: to maximize the value of CarbonData's storage-layer technology, it must be effectively integrated with the compute layer and the query layer to form a complete ecosystem that realizes its true value.

And because Apache is the most authoritative open source organization in the big data field, whose Hadoop and Spark have become de facto standards for big data, and because we identify with Apache's philosophy of community-driven technical progress, we chose to enter Apache, build capabilities together with the community, and integrate CarbonData into the big data ecosystem.

The CarbonData open source project passed the Apache community vote on June 3 and has successfully entered the Apache Incubator.

Community information is as follows. Apache CarbonData GitHub address: https://github.com/apache/incubator-carbondata

Everyone is welcome to join the Apache CarbonData community: https://github.com/apache/incubator-carbondata/blob/master/docs/How-to-contribute-to-Apache-CarbonData.md

InfoQ: What reasons or opportunities prompted the idea of creating CarbonData? What difficulties had you run into in previous projects?

CarbonData: We have long faced strong demand for high-performance data analysis. The traditional approach is a database plus BI tools, used to deliver reports, dashboards, interactive queries, and similar services. But as enterprise data keeps growing, as business demands ever more analytical flexibility, and as some customers want analysis capabilities more powerful than SQL, the traditional approach gradually fails to meet customer needs. That is what gave us the idea of building CarbonData.

These demands generally come from several aspects.

First, deployment. Unlike earlier standalone systems, enterprise customers want a distributed solution that can cope with ever-growing volumes of data and can scale out at any time by adding general-purpose servers.

Second, business functionality. Many enterprises are gradually migrating their business from traditional databases to big data platforms, which requires the platform to be highly compatible with legacy business: chiefly, full support for standard SQL and for a variety of analysis scenarios. At the same time, to save costs, enterprises want "one copy of the data to support multiple usage scenarios": large-scale batch scans and computation, OLAP multi-dimensional interactive analysis, ad hoc queries on detailed data, low-latency point lookups on primary keys, and real-time ingestion of streaming data, all supported on one platform with second-level query response.

Third, ease of use. Enterprise customers previously used BI tools, where the OLAP analysis model had to be built inside the BI tool; in some scenarios this limits the flexibility of the data model and the choice of analysis tools. In the big data era, the open source ecosystem advances constantly and new analytical tools emerge all the time, so enterprise customers want to keep upgrading their systems along with the community and quickly apply new analytical tools to their own data to extract greater business value.

Meeting all of these demands at once on a big data platform is undoubtedly a big challenge. To meet them, we accumulated experience in real projects and tried many different solutions, but could not find one solution that fixed every problem.

For low-latency queries on distributed storage, the first thing that comes to mind is KV-style NoSQL databases (such as HBase and Cassandra). They solve low-latency lookups by primary key, but if the query pattern changes even slightly, for example to a flexible combination of multi-dimensional filters, the query degenerates into a full table scan and performance drops dramatically. In some scenarios this can be alleviated by adding secondary indexes, but that brings maintenance, management, and synchronization problems for those indexes. So KV storage does not solve the general business problem.

For general multi-dimensional queries, one then thinks of multi-dimensional time-series databases (such as LinkedIn's Pinot). Their characteristic is that data enters the system in time-series fashion and is pre-aggregated and indexed. Because results are pre-computed, multi-dimensional queries are very fast and the data is very fresh; combining the strengths of multi-dimensional analysis and real-time processing, they are widely applied in performance monitoring and real-time metrics analysis. But they also restrict the types of queries supported: because the data is pre-computed, this architecture generally cannot serve queries on detailed data, nor does it support multi-table join analysis, which inevitably limits the scenarios enterprises can use it for.

Another category is search systems (such as Apache Solr and ElasticSearch). A search system can do multi-dimensional aggregation and also retrieve detailed data, and with its inverted indexes it offers fast Boolean queries and high concurrency; it seems to be exactly what we were looking for. In practice, however, we found two problems. First, search systems are generally designed for unstructured data, so data expansion is high and enterprise relational data is stored less compactly, resulting in a large data volume. Second, a search system's data organization is tightly coupled to its compute engine, so once the data is stored it can only be processed by the corresponding search engine, which to some extent defeats enterprise customers' wish to apply the community's many analysis tools. So search systems, too, fit only their own application scenarios.

The last category is the SQL-on-Hadoop solutions now appearing in the community in large numbers, represented by Hive, SparkSQL, and Flink. These systems separate compute from storage and provide standard SQL over files stored on HDFS. They meet enterprise customers' needs for deployment and ease of use, and their business coverage includes scans, aggregations, detail queries, and other scenarios, so they can be considered a class of general-purpose solutions. To improve performance, open source projects such as Spark and Flink continuously optimize their architectures, but the improvements focus on the compute engine and the SQL optimizer, not on storage and data organization.

So you can see that although current systems cover many big data query scenarios, each is designed for a particular class of scenario; outside its target scenarios it is either unsupported or degenerates into a full table scan. As a result, to cover batch, multi-dimensional analysis, and detail query scenarios, enterprises often need multiple copies of the data, maintaining one copy per scenario.

CarbonData is designed precisely to break this limitation: store only one copy of the data, optimized to support a variety of usage scenarios.


InfoQ: Could you talk specifically about CarbonData's technical architecture? What are its characteristics and advantages?

CarbonData: The whole open source big data era can be said to have originated with Google's MapReduce paper, which spawned the Hadoop open source project and the subsequent development of an entire ecosystem. Its "greatness" lies in an architecture that decouples compute from storage, freeing part of enterprise workloads (primarily batch processing) from traditional vertically integrated solutions. Compute and storage can each be expanded as needed, greatly improving business agility, so this computing model spread among enterprises and many benefited from it.

But while MapReduce opened the big data era, it improves performance purely by brute force, batch scanning plus distributed computing, so it cannot address customers' low-latency requirements across all query scenarios.

In the current ecosystem, the solutions closest to those customer needs are actually the search engines. Through good data organization and indexing, search engines can provide a variety of fast search functions; yet a search engine's storage layer and compute engine are tightly coupled, which does not match enterprises' expectation of "one copy of data for multiple scenarios".

This inspired us: why not create a more efficient data organization for general-purpose compute engines, so that we keep the decoupled compute-and-storage architecture while still providing high-performance queries? With that in mind, we launched the CarbonData project. Serving more kinds of business while keeping compute and storage separate became CarbonData's architectural design.

Having established this concept, we naturally chose an architecture based on HDFS plus a general-purpose compute engine, because it provides good scale-out capability. Next we asked ourselves what was still missing in this framework. In this architecture, HDFS provides replication and the ability to read and write files, while the compute engine is responsible for reading files and for distributed computation; the division of labor is clear, with each solving its own storage-management or computation problem. But it is not hard to see that, in order to serve more scenarios, HDFS makes a big "sacrifice": it gives up any understanding of file contents. Precisely because that understanding is abandoned, computation can proceed only by full scan, and in the end neither storage nor compute can exploit the characteristics of the data for optimization.

To address this, we focused CarbonData's development effort on optimizing the data organization, with the ultimate goal of improving both I/O performance and compute performance. To this end, we did the following work in CarbonData.

CarbonData Basic Features

  1. Multi-dimensional data clustering: at load time the data is re-organized along multiple dimensions, making it "more cohesive in the multi-dimensional space". This yields better compression ratios in storage and more efficient data filtering during computation.
  2. Columnar storage with an indexed file structure: first, CarbonData designs multiple levels of indexes for many kinds of scenarios, incorporating some features of search systems: a multi-dimensional index across files, a multi-dimensional index within each file, a min/max index for each column, an inverted index within a column, and so on. Second, to suit the characteristics of HDFS, CarbonData stores index and data together: part of the index is the data itself, and the other part is stored in the file's metadata structure, so both gain HDFS's local access capability.
  3. Column group: overall, CarbonData is a columnar format, but compared with row storage, a pure columnar layout incurs a high data-reconstruction cost for detail queries that return many columns. To significantly improve performance for such queries, CarbonData supports column groups: fields that users rarely filter on but often return in result sets can be stored as a column group, and CarbonData encodes those fields with row-style storage to speed up queries (see the table-creation sketch after this list).
  4. Data types: CarbonData currently supports all the basic types common in databases, as well as complex nested types such as Array and Struct. The community has also suggested supporting a Map type, which we plan to add in the future.
  5. Compression: CarbonData currently supports Snappy compression, applied separately to each column, since the columnar layout makes compression very efficient. Depending on the application scenario, the compression ratio is typically between 2 and 8.
  6. Hadoop integration: by supporting the InputFormat/OutputFormat interfaces, CarbonData can exploit Hadoop's distributed advantages and be used throughout the Hadoop-based ecosystem.
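
To show how these features surface to the user, here is a minimal sketch of creating and loading a CarbonData table from the Spark shell, in the style of the early incubating releases. The table, column, and path names are hypothetical, and DDL options such as DICTIONARY_INCLUDE and COLUMN_GROUPS have varied across CarbonData releases, so treat this as indicative rather than definitive.

```scala
// Sketch only: assumes the Spark 1.x-era CarbonContext API of the
// incubating releases, run from spark-shell (where `sc` exists);
// table, path, and property names are hypothetical.
import org.apache.spark.sql.CarbonContext

val cc = new CarbonContext(sc, "hdfs://namenode:8020/user/carbon/store")

// Dimension columns participate in the multi-dimensional sort key;
// 'ser_ip' and 'ser_port' are returned together but rarely filtered,
// so they are grouped into a row-wise stored column group.
cc.sql("""
  CREATE TABLE IF NOT EXISTS flow_record (
    time_stamp  TIMESTAMP,
    user_id     STRING,
    app_name    STRING,
    ser_ip      STRING,
    ser_port    STRING,
    down_bytes  BIGINT
  )
  STORED BY 'carbondata'
  TBLPROPERTIES (
    'DICTIONARY_INCLUDE'='user_id,app_name',
    'COLUMN_GROUPS'='(ser_ip,ser_port)'
  )
""")

// Bulk load; CarbonData builds its indexes while writing the files.
cc.sql("LOAD DATA INPATH 'hdfs://namenode:8020/data/flow.csv' INTO TABLE flow_record")
```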

CarbonData Advanced Features

  1. Computable encoding: in addition to common encodings such as Delta, RLE, Dictionary, and BitPacking, CarbonData supports joint encoding across multiple columns, as well as a globally applied dictionary encoding that enables decode-free computation: the compute framework can aggregate, sort, and otherwise operate directly on the encoded data. For queries that require heavy shuffling, the performance gain is very noticeable.
  2. Joint optimization with the compute engine: to make efficient use of CarbonData's optimized data organization, CarbonData provides targeted optimization strategies. The CarbonData community first integrated deeply with Spark, adding filter push-down, late materialization, incremental storage, and other enhancements on top of the SparkSQL framework, while supporting the full DataFrame API (see the sketch after this list). We believe that through the community's efforts, more compute frameworks will integrate with CarbonData in the future, maximizing the value of its data organization.
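
As a rough illustration of the Spark integration, the sketch below runs a filtered aggregation through the same hypothetical CarbonContext and flow_record table as in the previous example. With the CarbonData integration, the predicates can be pushed down into the format's indexes instead of forcing a full scan; the exact optimizations applied depend on the release.

```scala
// Sketch only: reuses the hypothetical `cc` and `flow_record`
// from the previous example.

// The predicates on 'app_name' and 'time_stamp' can be pushed down
// into CarbonData's indexes (multi-dimensional key, per-column
// min/max), so only the matching blocks are read.
val topUsers = cc.sql("""
  SELECT user_id, SUM(down_bytes) AS total_down
  FROM flow_record
  WHERE app_name = 'video' AND time_stamp >= '2016-06-01 00:00:00'
  GROUP BY user_id
  ORDER BY total_down DESC
  LIMIT 10
""")
topUsers.show()

// The DataFrame API is supported as well; this is equivalent to the
// SQL above, minus the ordering and limit.
val df = cc.table("flow_record")
  .filter("app_name = 'video'")
  .groupBy("user_id")
  .agg(Map("down_bytes" -> "sum"))
```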

All of these features have been merged into the Apache CarbonData trunk; everyone is welcome to try them.

InfoQ: What scenarios do you recommend CarbonData for? What are the performance test results? Are there application cases, and what are the usage and user scale in China?

CarbonData: Recommended scenarios: cases where you want a single store that simultaneously serves fast scans, multi-dimensional analysis, and detail data queries. In Huawei customer cases, compared with the industry's existing columnar storage solutions, CarbonData delivers a 5x to 30x performance improvement.

For more information on performance tests, application cases, and the like, please follow the WeChat public account ApacheCarbonData and the community at https://github.com/apache/incubator-carbondata

InfoQ: Can CarbonData currently combine seamlessly with Spark? What other major frameworks is it compatible with?

CarbonData: CarbonData has already been deeply integrated with Spark; see the advanced features above for details.

InfoQ: What development plans do you have for the project's future? What features will be added? And how will you ensure the continued maintenance of the open source project?

CarbonData: The community's next focus is to improve the system's ease of use and its ecosystem integration (for example, integration with Flink, Kafka, and other systems for importing real-time data into CarbonData).

In its first month as an open source project, CarbonData received hundreds of commits from more than 20 contributors, and it will continue to be an active project. More than 10 core contributors will also keep taking part in community building.

InfoQ: In designing and developing CarbonData and bringing it into the Apache Incubator, what stages did you go through, and what was the biggest difficulty? What feelings or experiences can you share?

CarbonData: Most of the CarbonData team have experience contributing to the Apache Hadoop and Spark communities, so we are very familiar with the Apache community and its working processes. The biggest difficulty was the stage of entering the incubator: convincing the Apache community to accept CarbonData as a new high-performance data format for the big data ecosystem. In May, at the OSCON open source conference in Austin in the United States, we gave a keynote with a live technology demo of CarbonData, showing its excellent architecture and strong performance results.

InfoQ: What is your team like? How do you ensure that the team keeps growing well?

CarbonData: CarbonData is a global team, with engineers from China, the USA, and India. Experience with this globalized mode of working lets us adapt quickly to the way the Apache open source community operates.

Interview guests: Li Kun and Chen Liang, PMC members and committers of Apache CarbonData.
