Creating a new paradigm for data processing: DataPilot navigates the sea of data vectors

AI and data have always been closely related. Over the past decade or so, data has generally been treated as the raw material and basic element of AI; we call this the Data for AI era. The emergence of large models now allows AI to empower data in return, which marks the arrival of the New Data era.

When major breakthroughs in data and AI capabilities collide, how will the future change? At the Jiuzhang Yunji DataCanvas new product launch conference, Zhou Xiaoling, Vice President of Jiuzhang Yunji DataCanvas, gave an in-depth interpretation of the company's answer to the world: its self-developed new-generation data architecture tool, DataPilot.

Zhou Xiaoling, Vice President of Jiuzhang Yunji DataCanvas Company

Speech Transcript

Zhou Xiaoling: Good afternoon, everyone! Let me introduce our newly released product, DataPilot. DataPilot is an AI-enhanced new data architecture developed by Jiuzhang Yunji on top of the Yuanshi large model. It is also a new data processing paradigm with three characteristics: first, a multi-modal vector-sea data architecture; second, on-demand automated data integration, code generation, process orchestration and computational analysis; third, natural-language-based data acquisition, analysis and machine learning.

How does DataPilot implement these features? Let us first look at how data architecture has evolved. Over the past 15 years, our data architecture began with data warehouses: around 2008 we started building data warehouses, and from roughly 2016 until recently we were still building data lakes. These two generations of platforms are relatively stable. Since last year, we have been discussing the concept of the new data stack, especially ideas such as Data Mesh and Data Fabric; there were exchanges on these overseas last year, and in China, starting this year, everyone has been considering whether to move from the old architecture to the new one. Such plans, however, have never been implemented well. The emergence of generative AI and large language models, with their understanding of natural language and excellent code-generation performance, has brought a new dawn to the New Data Stack era.

From the perspective of data scale, scale is not just a number; it is a progression from the micro to the macro. In the early days, for a specific analysis goal, we brought in a small amount of well-governed, topic-specific data. As enterprises needed to analyze more data or explore more unknowns, we brought in all kinds of data from across the enterprise. In the era of large models, we are beginning to bring in world knowledge and even industry knowledge, not just knowledge internal to the enterprise. As data scale has grown, the data modalities themselves have also evolved. Early on, most data lived in structured tables; in the Data Lake era, data in more modalities, including images, video, voice and even time series, was stored in file systems or exposed through the file system in tabular form. In the new data era, we want this multi-modal data to be expressed and computed in a more effective way. This is where the industry introduced vectorization: using vectors to give data of every modality a unified encoding, unified computation and unified alignment in a mathematical space. That is how data modalities have evolved and changed.
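To make the idea of unified encoding concrete, here is a minimal, self-contained Python sketch. The two encoder functions are hypothetical placeholders standing in for real text and image embedding models that map into the same vector space; nothing here is a DataPilot or DingoDB API.

```python
# Minimal sketch of unified encoding across modalities, using NumPy only.
# encode_text / encode_image are hypothetical placeholders for real embedding
# models aligned to one shared vector space.
import numpy as np

DIM = 512  # shared embedding dimension for all modalities

def encode_text(text: str) -> np.ndarray:
    # Placeholder: a real system would call a text-embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def encode_image(path: str) -> np.ndarray:
    # Placeholder: a real system would call an image-embedding model here.
    rng = np.random.default_rng(abs(hash(path)) % (2**32))
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

# Once everything lives in the same space, "unified calculation" is ordinary
# linear algebra, e.g. cosine similarity between a query and stored items.
query = encode_text("red sports car")
items = np.stack([encode_image("car_001.jpg"),
                  encode_text("quarterly sales report")])
scores = items @ query  # cosine similarity, since all vectors are unit-normalized
print(scores)
```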

The third change is in the complexity of data analysis. In the early days of report building we mainly used SQL and OLAP; later we added statistical methods, machine learning and deep learning. In the new era, we want the system to make recommendations: suggest how to analyze, offer self-service analysis suggestions, help us generate data, and even carry out the modeling process through chat or dialogue. The complexity of data analysis keeps increasing. A business person is not a technical engineer; they want to see how the business has changed over the past month and get some suggestions, and they should be able to simply ask for that. After looking at the data, they can build a machine learning model by chatting, use that model to predict future business changes, and even use natural-language dialogue to analyze the reasons behind the past month's ups and downs. These are the new changes the new era shows us.
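To illustrate the "build a model by chatting" idea, here is a toy sketch in which a natural-language request is routed to an automated modeling step. The keyword-based intent parsing and the plain linear regression are illustrative stand-ins of my own choosing; TableGPT's actual conversational modeling is not shown here.

```python
# Toy sketch: map a natural-language request to an automated modeling step.
# The intent parsing is a trivial keyword match and the model is a plain
# linear regression; both are illustrative stand-ins, not DataPilot features.
import pandas as pd
from sklearn.linear_model import LinearRegression

def handle_request(message: str, df: pd.DataFrame) -> str:
    if "predict" in message.lower():
        # Fit a trivial forecasting model on monthly sales.
        model = LinearRegression().fit(df[["month_index"]], df["sales"])
        next_month = pd.DataFrame({"month_index": [df["month_index"].max() + 1]})
        forecast = model.predict(next_month)[0]
        return f"Forecast for next month: {forecast:.0f}"
    return "Try asking: 'predict next month's sales'"

sales = pd.DataFrame({"month_index": [1, 2, 3, 4, 5, 6],
                      "sales": [100, 120, 130, 160, 170, 200]})
print(handle_request("Please predict next month's sales", sales))
```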

The evolution of data architecture is essentially driven by business; business needs change our data technology stack. In the data warehouse era, the business was mainly reports and BI. What is a report? Looking at yesterday's data. What do we need that data for? To make operational decisions, do operational analysis, or file regulatory submissions. At that stage the requirements for data accuracy and completeness are very high, so we introduce strict modeling mechanisms and strong governance, while the timeliness requirements are not that high. Data computation, scheduling and access are all done in batch, and most computation is relational algebra expressed in SQL or OLAP.

In the Data Lake era, demand for data applications grows. Batch offline reports about yesterday are no longer enough; people want real-time reports. The pre-prepared reports are not enough; people want to look at data that has not been prepared and do interactive analysis. Looking at history alone is no longer enough; people want to look at the future and predict the unknown. At this point our computation is upgraded: we bring in data that was not prepared before, including weakly governed data, and computation moves from SQL to statistical analysis and numerical methods because there is more to examine. The data modalities shift from simple structured data to many other modalities, stored independently in the data lake as files or objects in a file system.

In the New Data Stack era, analysis demands rise again. The system should be able to offer suggested analyses, natural-language-based queries, automatic conversational machine learning, and vector search as a computation method. Needs are raised in real time, so the system may not have prepared anything in advance; based on its understanding of my need, it should immediately schedule and process the data and generate the data code. We also want data of different modalities to be stored under the same encoding, which is the opportunity that vectorization addresses.
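As one concrete example of the vector search mentioned above, the following sketch uses FAISS, an open-source similarity-search library, to index some stand-in embeddings and retrieve the nearest neighbors of a query vector. It illustrates the general technique only; the embeddings are random and no DataPilot or DingoDB interface is used.

```python
# Vector-search sketch with FAISS: index unit-normalized embeddings and
# retrieve the top-k most similar items for a query. The embeddings are
# random stand-ins for real multi-modal vectors.
import numpy as np
import faiss

dim = 128
rng = np.random.default_rng(0)

corpus = rng.standard_normal((1000, dim)).astype("float32")
faiss.normalize_L2(corpus)            # normalize so inner product equals cosine

index = faiss.IndexFlatIP(dim)        # exact inner-product index
index.add(corpus)

query = rng.standard_normal((1, dim)).astype("float32")
faiss.normalize_L2(query)

scores, ids = index.search(query, 5)  # top-5 nearest neighbours
print(ids[0], scores[0])
```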

There is something very interesting about the automation level of data architecture itself. After so many years of digitization and AI, our main purpose has been to provide automation capabilities to every industry and raise its level of automation. Yet, looked at from the other side, the automation level of the data work itself is relatively low. In the data warehouse and data lake eras, most data ingestion was actually manual: data was either connected by hand on the Linux side or manually integrated into the data platform, and only a small amount of data was automatically offloaded or replicated.

The second important aspect is the automation of data structure analysis and understanding. Today we rely entirely on manual work to understand the business and the data structures of the upstream systems, and then translate that understanding into code written for the downstream systems.

Third, data modeling and maintenance. Once the upstream systems are understood, the modeling of the midstream data can begin. This process is still done essentially by hand, and the downstream iterates only when changes to upstream data formats are explicitly notified.

Fourth, metadata management and data governance. Here the degree of automation is comparatively the highest: we have built visualization tools and canvas drag-and-drop interfaces for preparing data or extracting data for reports. Even so, there is a great deal of machine learning Python code that nobody fully understands, and metadata management is still done almost entirely by hand.

Fifth, data usage and services. After the data is processed, the final step is to deliver it to users, which today means manually coding an API, developing a service, writing a script or writing SQL to implement data services and applications. The level of automation here is also very low.

Sixth, data performance optimization and architecture optimization are still largely manual.

What do we hope the new, large-model-driven DataPilot era will look like? In the new data era, the automation of data work is greatly improved. At the level of metadata management, data services and governance, the automation level reaches 100%: when a customer has a need, the system understands it, turns it into the processing and organization of existing data, and delivers the result to the application. For data structure analysis and understanding, and for the maintenance of data modeling, automation reaches 75%: through the large model we hope to understand business knowledge, understand the data structures and the data of the upstream systems, and automatically generate the code for data modeling. Data ingestion and data performance optimization involve integration with peripheral systems, so their automation may not reach 100% or even be very high; here we have set a goal of 50%. What is our purpose in building DataPilot? To greatly raise the automation level of data work itself, using the Jiuzhang Yuanshi large model to understand natural language, that is, to understand customer needs and data structures and to automatically generate data code and data orchestration processes, making the whole process automated.

What does this process look like? If we overlay a New Data Stack on it, data applications sit at the far right. If the data has not been prepared in advance, it takes a very complicated process for it to reach the application stage. We sometimes run into this with customers: the customer asks for data, and the data engineers or the IT department responsibly reply that it will take days, weeks or even months to deliver it. Why? Because the data engineers must understand the need from scratch and slowly work through data ingestion, transformation, storage, computation, analysis, prediction and application.

Let's look at how DataPilot, with the support of the Jiuzhang Yuanshi large model, covers the entire process from data ingestion to application. Data ingestion is covered by the DDS product; data transformation is handled by the data pipeline orchestration GPT and the fusion computing platform RT; and in storage and computing, the unified multi-modal database DingoDB covers computation and analysis. We have also developed two data applications, TableGPT and DataQueryGPT, which I will briefly introduce.

DataPilot is a new data processing paradigm whose basic architecture is the Vector Ocean, and on top of the Vector Ocean we provide several products. DDS is a real-time database synchronization and transmission tool that supports adaptive collection from various data sources; this is the ingestion link mentioned earlier. The RT fusion processing platform supports data integration, development, computing and governance, and shifts from the original hand-coding approach to generated code. DingoDB is the underlying support platform of the Vector Ocean, providing unified storage and joint analysis of structured and unstructured data. As you know, we already have AutoML products, and TableGPT goes one step further, achieving automatic machine learning modeling and training in a conversational way.
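To show what "joint analysis of structured and unstructured data" can mean in practice, here is a conceptual sketch in plain pandas and NumPy: a structured filter (like a SQL WHERE clause) is applied first, and the surviving rows are then ranked by vector similarity. The table, embeddings and query vector are made up for illustration; this is not DingoDB's actual query interface.

```python
# Conceptual sketch of joint structured + vector analysis: filter rows by a
# structured predicate, then rank the survivors by vector similarity.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
dim = 64

products = pd.DataFrame({
    "product_id": range(6),
    "category": ["shoes", "shoes", "bags", "shoes", "bags", "shoes"],
    "price": [79, 120, 210, 65, 340, 99],
})
# One embedding per product (e.g. of its image or description), unit-normalized.
emb = rng.standard_normal((len(products), dim))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

query_vec = rng.standard_normal(dim)
query_vec /= np.linalg.norm(query_vec)

# Structured filter (the SQL-WHERE-style part) ...
mask = (products["category"] == "shoes") & (products["price"] < 110)
candidates = products[mask]
# ... followed by vector ranking over the filtered rows.
scores = emb[candidates.index] @ query_vec
print(candidates.assign(score=scores).sort_values("score", ascending=False))
```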

DataWaveGPT is an automated ETL code generation and data task orchestration tool, and DataQueryGPT is a conversational query and analysis tool for structured and unstructured data. What does the architecture underlying the Vector Ocean look like? That will be introduced next by Hu Zongxing, Senior Product Director of Jiuzhang Yunji DataCanvas!
Preview of the next article: Hu Zongxing's "Two-way Empowerment of AI and Data: DingoDB Becomes a Super Engine for the Vector Sea Era". Stay tuned!

