Build a lake or build a warehouse? StarRocks: the lake and the warehouse can now be unified into one

On November 17, StarRocks Summit 2023, the project's annual technology summit, was held in Shanghai. Big data experts from leading companies such as Ping An Bank, China Resources, Tencent Games, Alibaba Cloud, Yili, Midea, and JD.com shared the latest technologies and best practices in data analysis, drawing hundreds of enterprise user representatives and developers to attend and exchange ideas.

This is the third annual technology summit StarRocks has held. As a technically leading open source OLAP database, StarRocks has long been favored by large users: to date, more than 300 companies valued at over US$1 billion have adopted StarRocks, and the community now counts more than 10,000 users. At this summit, Zhang Youdong, CTO of Jingzhou Technology and a member of the StarRocks TSC, shared the project's latest progress.

In the past year, StarRocks released three major versions: 2.5, 3.0, and 3.1. The storage-compute separation architecture introduced in version 3.0 was a first among comparable open source projects. After migrating to it, users can cut storage costs by as much as 80%. Because compute nodes are stateless, availability can be improved through rapid elasticity and cross-availability-zone deployment, compute resources can be physically isolated, and each workload can scale independently on demand.

As of version 3.1, with the local cache enabled, performance under the storage-compute separation architecture is close to that of local storage.

At the same time, StarRocks' lakehouse analysis capabilities are now fairly complete: it supports not only the internal catalog but also data lake, JDBC, Elasticsearch, and other external catalogs, and it can run federated analysis across these data sources.
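
As a rough illustration of the federated analysis described above, the sketch below registers a Hive catalog and joins it with an internal table over StarRocks' MySQL-compatible protocol (Python with pymysql). The host, credentials, catalog, and table names are hypothetical placeholders, and the exact catalog properties should be checked against the documentation for the version in use.

```python
# Hypothetical cross-catalog query against StarRocks via its MySQL-compatible
# protocol (FE query port 9030). Names and addresses are placeholders.
import pymysql

conn = pymysql.connect(host="starrocks-fe.example.com", port=9030,
                       user="analyst", password="secret")
try:
    with conn.cursor() as cur:
        # Register an external catalog pointing at an existing Hive metastore.
        cur.execute("""
            CREATE EXTERNAL CATALOG hive_catalog
            PROPERTIES (
                "type" = "hive",
                "hive.metastore.uris" = "thrift://metastore.example.com:9083"
            )
        """)
        # Join an internal table with a table that lives in the Hive catalog.
        cur.execute("""
            SELECT o.order_id, o.amount, u.region
            FROM default_catalog.sales_db.orders AS o
            JOIN hive_catalog.dw.users AS u ON o.user_id = u.id
            LIMIT 10
        """)
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```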

In addition, the primary key table model has been improved steadily over the past year. It now supports both in-memory and persistent primary key indexes, as well as partial updates and conditional updates. For batch update scenarios, column-mode updates deliver more than a 10x performance improvement over row-mode updates.
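
The column-mode partial update mentioned above is typically driven through a load job. Below is a hedged sketch using Stream Load over HTTP from Python; the endpoint, database, table, and column names are hypothetical, and the header names (notably partial_update and partial_update_mode) should be verified against the documentation for the version in use.

```python
# Hypothetical column-mode partial update of a primary key table via Stream Load.
# Only the primary key and the column being updated are sent in the payload.
import requests

rows = "1001,Shanghai\n1002,Beijing\n"  # id,city pairs to update

resp = requests.put(
    "http://starrocks-fe.example.com:8030/api/sales_db/users/_stream_load",
    data=rows.encode("utf-8"),
    auth=("analyst", "secret"),
    headers={
        "label": "users_partial_update_001",   # idempotency label for this load
        "column_separator": ",",
        "columns": "id,city",                  # columns present in the payload
        "partial_update": "true",              # update only the listed columns
        "partial_update_mode": "column",       # column (batch) mode, per the text
        "Expect": "100-continue",
    },
)
# Note: the FE may redirect the request to a BE node; depending on the HTTP
# client's redirect handling you may need to target a BE's HTTP port directly.
print(resp.status_code, resp.text)
```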

Zhang Youdong said the future direction of data architecture is the convergence of lakes and warehouses, so users no longer need to agonize over whether to build a lake or a warehouse. Whichever they build, an enterprise's ultimate goal is to solve data analysis problems efficiently and at low cost. With heavyweight features such as storage-compute separation, lakehouse analysis, and materialized views in place, StarRocks has completed its upgrade into a lakehouse engine, letting users combine the advantages of a data lake and a data warehouse.

So how exactly does StarRocks achieve this, and where is it headed next? Here is what Zhang Youdong had to say.

 

Q1: StarRocks introduced a storage-compute separation architecture in 3.0. Compared with the coupled storage-compute architecture, how does it keep performance from degrading?

Zhang Youdong: There is an industry consensus that a storage-compute separated architecture performs worse than local data access, because data access latency is higher. The common technique today is to accelerate with a cache; whether it is Snowflake or other popular data warehouses and lakehouses, they all rely on caching. StarRocks likewise uses a local cache here.

In most business scenarios today, data can be divided into hot and cold. In data analysis, for instance, the last seven days or half a month of data may be worth more than data from six months or a year ago. Once storage and compute are separated and data sits in unified object storage, cost drops, but access gets slower. The answer is to keep the frequently accessed hot data on local SSD or NVMe disks so subsequent reads are served locally, with memory as another tier on top; together this forms a multi-level cache.

Because the cache only holds hot data, its cost stays controllable. You might have three years of data but only the last three months are hot, so only about a tenth of the data lives on high-performance media. That does not push up total storage cost, yet it preserves query performance on hot data.
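
The read path described here can be modeled with a small toy sketch: check memory first, then the local-disk cache, and only fall back to object storage on a miss, promoting the block on the way back. This is purely illustrative and is not StarRocks' actual cache implementation.

```python
# Toy model of a multi-level cache: memory -> local disk -> object storage.
from collections import OrderedDict

class TieredReader:
    def __init__(self, object_store, mem_capacity=1024):
        self.mem = OrderedDict()      # block_id -> bytes, kept in LRU order
        self.mem_capacity = mem_capacity
        self.disk = {}                # stand-in for a local SSD/NVMe cache
        self.remote = object_store    # stand-in for S3/OSS/COS

    def read(self, block_id):
        if block_id in self.mem:              # memory hit
            self.mem.move_to_end(block_id)
            return self.mem[block_id]
        if block_id in self.disk:             # local-disk hit
            data = self.disk[block_id]
        else:                                 # cold read from object storage
            data = self.remote[block_id]
            self.disk[block_id] = data        # populate the disk tier
        self.mem[block_id] = data             # promote to memory
        if len(self.mem) > self.mem_capacity:
            self.mem.popitem(last=False)      # evict the least recently used
        return data

# Three years of blocks live in "object storage"; only recent ones get cached.
store = {f"block-{i}": f"data-{i}".encode() for i in range(36)}
reader = TieredReader(store, mem_capacity=3)
for bid in ["block-35", "block-34", "block-35", "block-0"]:
    reader.read(bid)
```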

 

Q2: Lake-warehouse convergence is a major industry demand right now. What technical problems need to be solved to achieve it? How far along is the industry, and what results has adoption delivered so far?

Zhang Youdong: To do data analysis, you traditionally had to choose: build a data warehouse or build a data lake. The advantages and disadvantages of each are well known.

Today, the data warehouse is a widely accepted concept after decades of development, evolving from the original offline warehouse to the real-time warehouses we see now. The data lake, meanwhile, was popularized in China through the Hadoop ecosystem: many companies built Hadoop platforms, used them to construct data warehouses, and then evolved from Hive to Iceberg and Hudi data lakes. That is the status quo.

These two routes are now starting to merge. For example, data may originally be managed uniformly in Hive, a large lake, but when query performance falls short, part of the data is imported into a real-time data warehouse such as StarRocks for analysis, which introduces a complicated ETL process. Alternatively, the warehouse can query the lake's data directly alongside its own. Even though many enterprises are gradually combining the two in practice, they remain two separate systems, and the maintenance complexity is still a real challenge. Future evolution will certainly move toward full convergence; the industry is already on that path, and some leading companies have taken the lead and achieved an integrated lakehouse.

Once lake and warehouse are integrated, the result is unified storage and unified analysis, and the whole data stack becomes simpler to manage.

 

Q3: With the rise of large models, unstructured data will grow explosively. What challenges does that pose to the underlying data lake and lakehouse architectures? How might they combine with vector databases? Could StarRocks become part of the technical architecture for big data plus large models in the future?

Zhang Youdong: Large models, and AI in general, are certainly very popular right now. StarRocks' lakehouse currently focuses on structured and semi-structured data; its core positioning is BI, and the AI line is still exploratory. Because AI involves unstructured data, such as the raw data used for training, StarRocks does plan to invest in and strengthen its ability to process that kind of data.

Semi-structured and structured data are also widely used in AI. For example, StarRocks has started exploring the vector retrieval capabilities that sit underneath large AI models, and the community is working with Tencent on this. For the combination of AI and databases as a whole, I feel the bottleneck is not technology but implementation scenarios. Tencent has a clear scenario: they use StarRocks at a very large scale, the business data already lives there, and the question is how to make the business smarter on top of that data. The current idea is to extend AI capabilities inside StarRocks, so that, for example, vector retrieval can serve the business on existing data rather than requiring a whole new supporting stack. If everything goes well, we may contribute this capability to the StarRocks community next year. At that point StarRocks will be able to provide some basic capabilities within the AI and large-model ecosystem, though it will certainly not be a complete AI solution.
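
To make the idea of vector retrieval over existing business data concrete, here is an illustrative-only brute-force cosine-similarity search in Python. It is not a StarRocks feature or API, just a model of the kind of capability being explored.

```python
# Brute-force nearest-neighbour search over embeddings stored next to business rows.
import numpy as np

def top_k_similar(query_vec, rows, k=3):
    """rows: list of (row_id, embedding) pairs; returns the k most similar ids."""
    ids = [rid for rid, _ in rows]
    mat = np.stack([vec for _, vec in rows])
    q = query_vec / np.linalg.norm(query_vec)
    m = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    scores = m @ q                              # cosine similarity per row
    order = np.argsort(-scores)[:k]
    return [(ids[i], float(scores[i])) for i in order]

rng = np.random.default_rng(0)
catalog = [(f"item-{i}", rng.normal(size=8)) for i in range(100)]
print(top_k_similar(rng.normal(size=8), catalog))
```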

 

Q4: When building a data warehouse, people tend to benchmark against Snowflake, which was born as a cloud data warehouse. What is the current state of cloud data warehouse adoption in China, and what are the future usage trends?

Zhang Youdong: As for the "cloud data warehouse", the offerings from domestic cloud vendors, such as Alibaba's AnalyticDB and Hologres and Tencent's TCF, are already widely used. That said, there are still many scenarios in China where the data warehouse cannot be moved to the cloud and has to stay in a privatized environment. But I don't think this hinders the development of cloud data warehouses.

From the cloud's perspective, everyone can clearly feel that moving to the cloud is already the trend, whether for applications, databases, or other components. I think data warehouses moving to the cloud is also a trend, and that includes StarRocks' current storage-compute separation architecture. Of course you can use it in a private environment, building an HDFS cluster locally as the shared storage layer. But to maximize its value you should deploy it on the cloud and use object storage services such as OSS and COS, which gives the lowest cost and the best elasticity.

From the perspective of technology, architecture, and business trends, I believe the move toward the cloud will not change. In addition, the restrictions in certain industries may shift over time. Even within financial institutions data is tiered: the core part may need to stay entirely on-premises, but a lot of business data can be managed in a private cloud or dedicated-cloud setup.

 

Q5: From the perspective of a data warehouse vendor, how do you view HTAP? Database companies have been talking about it for the past couple of years, and this year data warehouse vendors are also leaning into the HTAP trend. How should we view HTAP as a technical hotspot?

Zhang Youdong: The workload difference between TP and AP is simply too large. Technically, systems such as TiDB in China are focusing on HTAP, and they can simplify the architecture in small and medium-sized scenarios. For example, if the workload is mostly TP with some simple report queries and analysis that do not put much pressure on the system, HTAP is worth trying and should be able to do the job.

However, for more complex analysis scenarios, an HTAP database built on a TP core is still very challenged. In real scenarios where users compare StarRocks with HTAP systems, they can clearly feel the performance gap on complex analytical queries.

I think the core point is that TP and AP serve different goals. TP is about high concurrency and stability; AP is about speed. One tries to stay stable and deliberately avoids saturating resources; the other, built on an MPP architecture, wants to use resources as aggressively as possible. Put the two together and anything beyond simple requirements becomes hard to reconcile.

 

Q6: You also mentioned "lake-warehouse integration" today. What do you think of the current Zero-ETL trend?

Zhang Youdong: I think the trend is quite clear. Whether you call it Zero-ETL or No-ETL, the point is to reduce ETL. We believe that in building a full data pipeline, ETL may be the most complex part, and simplifying it matters even more than optimizing internal query analysis.

StarRocks' unified lakehouse architecture is in fact solving this problem. Its core is that your data is stored in a unified way, but unified storage does not mean the data has to be imported into StarRocks via ETL. If the data is already stored in a system like Hive or Iceberg, you can query it there directly without ETL, which by itself reduces ETL.

In addition, for data processing and downstream acceleration, we use materialized views to make the ETL process less visible to the user. The user simply writes queries and defines a materialized view, and StarRocks handles the scheduling and refreshing. In other words, across the whole pipeline we are helping users simplify ETL.
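
As a hedged sketch of what that looks like in practice, the example below defines an asynchronously refreshed materialized view over a lake table. Connection details, catalog, and table names are hypothetical, and the CREATE MATERIALIZED VIEW syntax should be confirmed against the documentation for the version in use.

```python
# Hypothetical async materialized view: the user writes the query once and
# StarRocks schedules the refreshes.
import pymysql

conn = pymysql.connect(host="starrocks-fe.example.com", port=9030,
                       user="analyst", password="secret")
with conn.cursor() as cur:
    cur.execute("""
        CREATE MATERIALIZED VIEW sales_db.daily_revenue_mv
        REFRESH ASYNC EVERY (INTERVAL 1 HOUR)
        AS
        SELECT order_date, SUM(amount) AS revenue
        FROM hive_catalog.dw.orders      -- reads the lake table directly
        GROUP BY order_date
    """)
conn.close()
```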

 

Q7: Is there any plan to combine StarRocks with privacy computing? Combining data warehouses with privacy computing is quite popular abroad: regulatory requirements often mean there are multiple warehouses between which data cannot flow freely, and privacy computing makes it possible to keep the data in place while the computation or model moves to it.

Is there a similar trend in China? Can "privacy computing + lakehouse" enable data analysis or intelligent applications? That might be especially attractive to industries such as finance.

Zhang Youdong: That is a good question. We have also been thinking about how to handle data sharing once StarRocks evolves into an integrated lakehouse. Among the customers we have talked to, everyone pays attention to privacy computing, but relatively few have actually explored or practiced it, mainly because there is still little experience using it to solve business problems. Yet the issue really matters. Databricks, also a lakehouse, uses a unified catalog for data sharing, including permission management, to share private data with other companies or organizations and to control access under rules of varying strictness.

So first, in terms of trends, I feel there is not much investment in this in China yet. It is also possible that the companies really working on it simply have not considered StarRocks, or that we just have not noticed them.

Second, from a technical perspective, StarRocks' current lakehouse architecture can meet future needs for data sharing between organizations: it can enforce fine-grained data access control, making it easy for organizations and clusters to share data well.

 

Q8: Please introduce the next steps in the product and technology roadmap, the key technologies or major themes to be developed next.

Zhang Youdong: We will continue to enhance StarRocks along the path of a "cloud-native real-time lakehouse".

Cloud native is easy to understand: it is about making StarRocks more cost-effective and efficient, so that capabilities such as elastic scaling are expressed more fully in StarRocks.

In addition, we are committed to simplifying how users build real-time analysis pipelines. Today a pipeline might need a Flink job, or a chain of Spark Streaming components, to extract data and then load it into StarRocks for processing and analysis; the whole pipeline is very long. We hope to shorten it further and make real-time pipelines easier to build around StarRocks.
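
One existing mechanism in this direction is Routine Load, which lets StarRocks pull from Kafka itself instead of going through a separate streaming job. The sketch below is illustrative, with hypothetical broker addresses, topic, database, and table names; the exact CREATE ROUTINE LOAD options should be checked against the documentation.

```python
# Hypothetical Routine Load job: StarRocks consumes a Kafka topic directly.
import pymysql

conn = pymysql.connect(host="starrocks-fe.example.com", port=9030,
                       user="analyst", password="secret")
with conn.cursor() as cur:
    cur.execute("""
        CREATE ROUTINE LOAD sales_db.orders_kafka_job ON orders
        COLUMNS TERMINATED BY ","
        PROPERTIES ("format" = "csv")
        FROM KAFKA (
            "kafka_broker_list" = "kafka1.example.com:9092",
            "kafka_topic" = "orders_topic"
        )
    """)
conn.close()
```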

Third, once the lakehouse is unified, most of the user's workloads and management actions should be completed inside StarRocks. We started with interactive analysis, mostly reporting, and more and more data processing may move into StarRocks over time. So, around the unified lakehouse, StarRocks will strengthen its support for batch ETL workloads, so that analysis requirements can be fulfilled entirely with StarRocks as a single component.

 


Origin my.oschina.net/u/6852546/blog/10149239