Jiuzhang Yunji DataPilot: A data navigator for large models sailing into the sea of vectors

Computing power, algorithms, and data are known as the three major elements of AI. Today's red-hot generative AI and large models are no exception.

With hardware and cloud vendors flourishing at home and abroad, AI computing power has become the easiest of the three elements to obtain. AI algorithms, too, rest on relatively mature classic methods and tuning techniques, and a wide range of basic AI software provides solid support.

What makes a large model "large" has more to do with data. The larger the volume of data and the higher its quality, the better the large model performs. The PC Internet, the mobile Internet, the Internet of Things, and other sources have generated massive amounts of data, and multi-modal forms such as text, images, and video have further increased its complexity. How to store, compute on, and circulate data effectively, so as to provide a reliable learning source for the evolution of large models, has become a top priority for large-model development.

On June 30, 2023, Jiuzhang Yunji DataCanvas, a provider of basic AI software, held a new product launch conference in Beijing. Alongside AIFS, an infrastructure platform for building AI applications, it introduced what it billed as the world's first data "vector sea" (Vector Ocean), and it launched the DataPilot data navigator, a new generation of data architecture tool built on large models that navigates the "vector sea".

Jiuzhang Yunji DataCanvas Product Strategy Map

Vector sea, the ultimate form of data development

AI and data have always been closely related. In the past, the relationship was mostly one-way: data was fed to AI and was usually regarded as AI's raw material and basic ingredient. The emergence of large models, however, allows AI to empower data in return.

Where does the future of data lie once AI capabilities leap forward and the relationship with data runs in both directions? The answer given by DataPilot is the "vector sea".

A vector, in mathematics, is a quantity that has both magnitude and direction. In two-dimensional space, a vector usually consists of two values, its components along the horizontal and vertical directions. In three-dimensional space, a vector usually consists of three values, its components along the three directions.

In computer science, a vector is a commonly used data structure, comparable to an array or list. Each vector contains a number of elements, and each element has an index that can be used to access or modify its value.

In machine learning and data science, a vector is usually represented as a set of numbers that locates a data item in a multidimensional numerical space. Each dimension of the vector represents a different feature or attribute of the data, such as the color value of a pixel in an image or the frequency of a word in a text. By performing mathematical operations on vectors, machine learning algorithms and data analysis techniques such as clustering, classification, and regression can be implemented.
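To make the idea of "mathematical operations on vectors" concrete, here is a minimal sketch in plain Python. It is an illustration only: the vocabulary, the sample sentences, and the helper names (`word_frequency_vector`, `cosine_similarity`) are assumptions made for this example and do not come from DataPilot or DingoDB.

```python
import math
from collections import Counter


def word_frequency_vector(text, vocabulary):
    """Turn a text into a vector: one dimension per vocabulary word (its count)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Magnitude (Euclidean length) of a vector."""
    return math.sqrt(dot(u, u))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    denom = norm(u) * norm(v)
    return dot(u, v) / denom if denom else 0.0


# Hypothetical vocabulary and documents, chosen only for illustration.
vocabulary = ["data", "model", "vector", "sea"]
vec_a = word_frequency_vector("vector data feeds the model", vocabulary)
vec_b = word_frequency_vector("the model learns from vector data", vocabulary)
vec_c = word_frequency_vector("the sea", vocabulary)

magnitude_2d = norm([3, 4])  # a 2-D vector with components 3 and 4
sim_ab = cosine_similarity(vec_a, vec_b)  # same word counts, so close to 1
sim_ac = cosine_similarity(vec_a, vec_c)  # no shared vocabulary words, so 0
```

Similarity scores of this kind are the building block behind clustering and classification: items whose vectors point in similar directions are treated as related.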

The "vector sea" is what Jiuzhang Yunji DataCanvas proposes as the ultimate form of data development, drawing on its years of research and practice in the database field and on the direction in which vector data is developing.

DataPilot, the data navigator for large models sailing into the sea of vectors

DataPilot, as proposed by Jiuzhang Yunji DataCanvas, serves as a bridge between the vector sea and large models: it establishes the links between them and points the way for applying vector data in large models.

As a new data processing paradigm and a new generation of data architecture tool built on large models, DataPilot helps users make data modeling intelligent and automated across the whole life cycle.

According to Zhou Xiaoling, vice president of Jiuzhang Yunji DataCanvas, the features of DataPilot include a multi-modal "vector sea" data architecture; on-demand automatic data integration, code generation, process orchestration, and analytical computation; and data acquisition, analysis, and machine learning modeling driven by natural language. DataPilot can greatly lower the technical threshold for data integration, governance, modeling, computation, query, analysis, and machine learning modeling, reduce the cost of data-driven business development, and accelerate digital innovation.

Building on the "vector sea" concept, DataPilot includes data software such as the DataCanvas RT real-time decision-making platform and the open source DingoDB multi-modal vector database, giving users the real-time, multi-modal data capabilities that breakthroughs in AI technology urgently call for.

Among them, DingoDB, an open source multi-modal vector database, is positioned as a powerful engine for the vector sea era. It combines the characteristics of data lakes and vector databases, and supports storing data of any type (key-value, PDF, audio, video, etc.) and any size. With DingoDB, users can build their own data "vector sea"; whether the data is structured or unstructured, a single set of SQL is enough to analyze and run scientific computation on multi-modal data.

Vector database, the future has come

Since last year, with the explosion of generative AI and large models, the vector database has been thrust into the spotlight.

As database systems dedicated to storing, indexing, and querying embedding vectors, vector databases let large models store and read knowledge bases more efficiently and make model fine-tuning cheaper. Vector databases also have multi-modal capabilities, which can greatly expand the time and space boundaries of large models. All of this makes the vector database a natural data companion for large models.
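The store, index, and query cycle described above can be sketched as a toy in-memory example. This is a sketch under stated assumptions, not DingoDB's or any other product's API: the class `TinyVectorStore`, its methods, the keys, and the hand-written three-dimensional "embeddings" are all hypothetical. It uses a brute-force scan, whereas production vector databases typically rely on approximate nearest-neighbor indexes to stay fast at scale.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    denom = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / denom if denom else 0.0


class TinyVectorStore:
    """Hypothetical in-memory vector store, for illustration only."""

    def __init__(self):
        self._items = []  # each entry is (key, vector, payload)

    def add(self, key, vector, payload):
        """Store an embedding vector together with the content it represents."""
        self._items.append((key, list(vector), payload))

    def search(self, query, top_k=1):
        """Return the top_k (score, key, payload) tuples most similar to query."""
        scored = [
            (cosine(query, vector), key, payload)
            for key, vector, payload in self._items
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:top_k]


# Hand-written toy embeddings; a real system would get these from a model.
store = TinyVectorStore()
store.add("doc-cat", [0.9, 0.1, 0.0], "Cats are small domesticated felines.")
store.add("doc-db", [0.0, 0.2, 0.9], "A database stores and indexes records.")
store.add("doc-dog", [0.8, 0.3, 0.1], "Dogs are loyal domesticated animals.")

results = store.search([1.0, 0.0, 0.0], top_k=2)
result_keys = [key for _, key, _ in results]
```

In a retrieval setup of this shape, the payloads of the best matches would be handed to a large model as context, which is one way a vector database can act as an external knowledge base.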

The vector database market is huge and is still in its zero-to-one stage. Since last year, many vector database products at home and abroad have raised considerable financing. According to a forecast by Northeast Securities, the global vector database market is expected to reach 50 billion US dollars by 2030, and the domestic market is expected to exceed 60 billion yuan.

"There is still broad room for development in the effective storage, computation, and circulation of data. In the real world, there are many independent data domains across industries, enterprises, and professions. The huge volume of data and the difficulty of bridging data domains point to how hard it is to put general-purpose large models into practice," said Fang Lei, chairman of Jiuzhang Yunji DataCanvas.

DataPilot, together with the vector sea and the DingoDB vector database, was created to solve the new generation of data problems that large models face. Looking ahead, DataPilot is expected to leave a strong mark on the development of large models.


Origin blog.csdn.net/dQCFKyQDXYm3F8rB0/article/details/131667530