[2023 Yunqi] Large models drive the intelligent upgrade of the DataWorks data development and governance platform

As large models set off a wave of AI innovation, big data has entered a period of deep integration with AI. At the 2023 Yunqi Conference, Alibaba Cloud DataWorks product manager Tian Qixian released a series of new product capabilities, including DataWorks Copilot, DataWorks AI enhanced analysis, and DataWorks lake-warehouse integrated data management. With these releases, DataWorks, a big data development and governance platform that has been under development for 14 years, continues to evolve from a one-stop platform into an intelligent one.

Data+AI two-wheel drive

In the AIGC era, "AI for Data" and "Data for AI" have become buzzwords. AI for Data is the easier one to understand: AI assistants driven by large models improve the efficiency of data platform tools. DataWorks has built a one-stop, full-link tool chain for enterprises and, in the process, has continuously accumulated enterprise data assets such as data models, metadata, data lineage, and data metrics. In the era of large models, these assets can also be regarded as enterprise-specific domain knowledge. Large models bring strong semantic understanding, reasoning, in-context learning, and memory capabilities. Through prompt engineering, the DataWorks one-stop platform can supply AI assistants with more relevant, timely, and comprehensive contextual information, so that the AI achieves better results. This is Data for AI. Building on this data foundation, many of the new products released today rely on the capabilities of large models, and the two-wheel drive of Data + AI offers a new paradigm for data development and analysis that further improves how efficiently enterprises obtain value from data.

Yunqi releases: DataWorks Copilot intelligent SQL programming assistant improves data development and analysis efficiency by 30%

DataWorks Copilot is a SQL programming assistant based on an NL2SQL large model. The model is trained and fine-tuned on public data sets and combined with prompt engineering to provide a rich set of operations for working with SQL in natural language.

  • SQL generation

Enter a natural-language description of the query or analysis you want, such as "rank products by sales over the last 7 days", and DataWorks Copilot automatically generates the corresponding SQL statement.

  • SQL completion

When you write SQL in the SQL IDE, DataWorks Copilot provides intelligent code completion and suggestions to improve SQL programming efficiency.

  • SQL error correction

When a SQL statement reports an error at run time, DataWorks Copilot provides one-click error correction to help ETL engineers and analysts quickly fix it.

  • SQL comments

Writing code comments used to be a burden. We didn’t want to write comments ourselves, but we wanted other people’s code to have comments. DataWorks Copilot can generate field comment information for table creation statements in batches, and can also add line-by-line comments to SQL statements to improve the readability of SQL.

  • SQL explanation

Business personnel and analysts are often handed a fairly complicated data extraction script by data warehouse engineers. They may not understand some of the advanced SQL syntax and functions it uses, yet they still want to change the extraction logic. In the past, they had to search for information everywhere or ask others for help. DataWorks Copilot can explain SQL code directly, helping business personnel understand SQL logic and usage faster and improving the efficiency of both data analysis and SQL learning.

The DataWorks Copilot intelligent SQL programming assistant has been used internally for some time. According to our observations, it improves the efficiency of ETL development and data analysis by more than 30%.
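As a rough illustration of how table metadata can be fed to an NL2SQL model through prompt engineering, the following Python sketch assembles a prompt from table and column descriptions. It is not the DataWorks Copilot implementation: the `Table` and `Column` classes, the `build_nl2sql_prompt` function, the prompt wording, and the `dwd_orders` table are all hypothetical, and the call to an actual model is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    dtype: str
    comment: str = ""


@dataclass
class Table:
    name: str
    comment: str
    columns: list[Column] = field(default_factory=list)


def render_schema(table: Table) -> str:
    """Render one table's metadata as a commented CREATE TABLE stub."""
    cols = ",\n".join(
        f"  {c.name} {c.dtype} COMMENT '{c.comment}'" for c in table.columns
    )
    return f"-- {table.comment}\nCREATE TABLE {table.name} (\n{cols}\n);"


def build_nl2sql_prompt(question: str, tables: list[Table]) -> str:
    """Combine table metadata and the user's question into one prompt string."""
    schema_block = "\n\n".join(render_schema(t) for t in tables)
    return (
        "You are a SQL assistant. Use only the tables below.\n\n"
        f"{schema_block}\n\n"
        f"Question: {question}\n"
        "SQL:"
    )


# Hypothetical table used only for illustration; not a real DataWorks asset.
orders = Table(
    name="dwd_orders",
    comment="One row per order line",
    columns=[
        Column("product_id", "STRING", "Product identifier"),
        Column("amount", "DOUBLE", "Sales amount"),
        Column("order_date", "DATE", "Order date"),
    ],
)

prompt = build_nl2sql_prompt(
    "Rank products by sales over the last 7 days", [orders]
)
print(prompt)
```

The point of the sketch is that the richer the metadata a platform already holds (table comments, column comments, and so on), the more context the prompt can carry without the user having to type it.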

From GUI to LUI, DataWorks Copilot assists ETL data warehouse development

The graphical user interface (GUI) appeared more than 40 years ago. The powerful natural language understanding of large models has now brought about the natural language user interface (LUI), a new mode of human-computer interaction. Whether a software product can provide an LUI is also one of the hallmark capabilities that marks the move of large-model applications from AI assistants to AI-native applications. DataWorks is thinking about and exploring how to hide complex product operation logic behind the scenes and use large models to give users a simple, direct, and more natural language-based interface.

We have tried this out in the product. To cite a few scenarios: in day-to-day work, finding the right table is a headache. To calculate a metric, business personnel have to ask a data warehouse colleague which table to use, and the data warehouse colleague, who handles this kind of question every day, finds it very tiresome. DataWorks Copilot supports quick table search in natural language, so nobody has to be asked when looking for a table, which improves the data consumption efficiency of the enterprise.

In the ETL development process, some operations are relatively complex or cumbersome, such as scheduling configuration, parameter configuration, and data quality rule configuration. In the past, users often had to jump back and forth between product pages and configure everything manually. DataWorks Copilot now provides a conversational natural language user interface: in a single dialogue window, many operations that span product tools can be completed through natural language interaction. For example, saying "Configure a certain quality rule for a certain table" is enough to complete the rule configuration for a data quality check. We will continue to extend the coverage of the natural language interface.

Click the link to view the video: https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/437757941217.mp4

DataWorks Copilot Product Demonstration

DataWorks Copilot provides two model services. The first is a large NL2SQL model trained and fine-tuned on public data sets; you can currently apply to join the invitation-only test on the Alibaba Cloud DataWorks official website. The second is for companies that expect more from the model or want Copilot to give answers closer to their internal business: we can provide company-specific model fine-tuning, combined with the Alibaba Cloud artificial intelligence platform PAI and large-model expert services, to tailor an exclusive code model and a private large-model deployment for the enterprise.

Yunqi releases: DataWorks AI enhanced data analysis

Enterprises invest substantial resources in data production and construction. The ultimate goal is to gain insight into the business value in the data and to guide operations and decision-making. Traditional statistical analysis often assumes a statistical model first and then estimates the model parameters from data samples to understand the characteristics of the data. In practice, however, much data does not conform to the assumed model. Exploratory data analysis emphasizes letting the data "speak" for itself: explore the characteristics and statistics of the data first, then select an appropriate model for further analysis. This approach is closer to how analysis works in practice. In the AI era, data insight keeps evolving toward intelligence. AI enhanced analysis uses AI technology to accelerate or automate data exploration and insight, freeing analysts from manual data exploration. AI technology can also better discover patterns and trends hidden in the data, helping analysts break through the limits of their own preconceptions.

DataWorks has combined with the DataV data visualization product and deeply integrated AI technology to launch an AI enhanced analysis product. It currently provides four core capabilities:

  • Automatic data exploration

Automatically explore data sets to quickly understand data characteristics and statistical distributions, with no professional technical background required.

  • AI automatic chart generation

Data chart cards are generated automatically from the results of automatic data exploration. AI technology identifies correlations between different combinations of data fields and generates charts for them. You do not need to write a lot of SQL by hand for the analysis, and you can quickly find inspiration and save your insights.

  • AI intelligent data query

Using large model technology, SQL queries are generated from natural language, and data chart cards are automatically recommended and generated for the query results.

  • Build and share data reports with one click

Much like building a slide deck, you can turn the data chart cards generated above into a long-form data report with one click, and export it as an image or share it with one click.

DataWorks AI enhanced analysis lets the data "speak" for itself, making the data insight process as automated and code-free as possible. With AI, it can also automatically discover potential trends in the data, tell data stories, and express data-driven viewpoints. The product is currently in public beta. After activating DataWorks and entering the data analysis product, you can apply for the public beta.
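To make the idea of automatic data exploration more concrete, here is a minimal Python sketch that profiles each column of a small data set and reports basic characteristics and distribution statistics. It only illustrates the general technique of column profiling and is not how DataWorks implements it; the function names `profile_column` and `profile_rows` and the sample rows are hypothetical.

```python
import statistics
from collections import Counter


def profile_column(values):
    """Return basic summary statistics for one column of values."""
    present = [v for v in values if v is not None]
    summary = {
        "count": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
    }
    is_numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if is_numeric:
        summary.update(
            kind="numeric",
            min=min(present),
            max=max(present),
            mean=statistics.fmean(present),
            median=statistics.median(present),
        )
    else:
        # For non-numeric columns, report the most frequent values instead.
        summary.update(kind="categorical", top=Counter(present).most_common(3))
    return summary


def profile_rows(rows):
    """Profile every column found in a list of dict-shaped rows."""
    columns = {key for row in rows for key in row}
    return {
        col: profile_column([row.get(col) for row in rows])
        for col in sorted(columns)
    }


# Hypothetical sample data used only for illustration.
rows = [
    {"region": "east", "sales": 120.0},
    {"region": "west", "sales": 80.0},
    {"region": "east", "sales": None},
    {"region": "north", "sales": 100.0},
]

report = profile_rows(rows)
for column, summary in report.items():
    print(column, summary)
```

A profile like this is the kind of information a chart recommender could start from, for example choosing a bar chart for a low-cardinality categorical column paired with a numeric one.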

Click the link to view the video: https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/438309479548.mp4

DataWorks Enhanced Analytics Product Demonstration

Yunqi releases: DataWorks lake warehouse integrated data management

As the market keeps changing and businesses keep developing, enterprises face increasing competition and uncertainty. Data needs now range from simple queries and statistics to BI, data science, recommendation and prediction, and AI applications. Overall, as needs have moved from simple, fixed query statistics to complex, changeable, and flexible intelligent analysis, enterprise data architecture has changed with them. From the database to the data warehouse to the data lake, and then to lake-warehouse integration, the whole evolution pursues higher data efficiency so that the varied and flexible data needs of enterprises can be met better and faster. The lake-warehouse integrated architecture combines the standardization and enterprise-grade capabilities of the data warehouse with the flexibility and open ecosystem of the data lake, and more and more enterprises are paying attention to it.

DataWorks now fully supports lake-warehouse integrated data management. At the storage layer, the offline data warehouse MaxCompute, the real-time data warehouse Hologres, and the data lake storage OSS/OSS-HDFS are seamlessly connected, so federated queries can be run on the data without copying or moving it. On top of this, DataWorks provides a unified user interface for lake-warehouse integrated data management.

  • Real-time data enters the lake in seconds

In terms of data integration, DataWorks supports offline and real-time synchronization of more than 50 heterogeneous data sources into the warehouse. This year it added real-time ingestion into the data lake, so data can enter the lake within seconds. It also supports automatically updating database table fields during synchronization, and metadata can be discovered and registered automatically in the same process. With the help of DLF, lake and warehouse metadata can be managed in a unified way in the DataWorks data map.

  • Lake-warehouse integrated ETL development and scheduling

For the various computing engines in the lake-warehouse architecture, such as MaxCompute, Hologres, Spark, Hive, and Presto, DataWorks provides unified ETL task development, task orchestration and scheduling, and operation and maintenance services. This yields a unified data development pipeline and addresses the hard-to-manage problems, such as fragmented and unstable data production links, that arise when an enterprise's data architecture is inconsistent.

  • Lake-warehouse integrated data governance

DataWorks newly supports lake-warehouse integrated data governance. It supports unified metadata management, data modeling, and data quality management across the lake and warehouse, and the proactive, automated data governance tool "DataWorks Data Governance Center" also fully supports the EMR+OSS data lake.

DataWorks Data Governance Center extends mature data warehouse governance capabilities in full to the EMR+OSS data lake. To reduce the difficulty of data governance under the lake-warehouse architecture, and to turn data governance from a one-off campaign into something truly sustainable, trackable, and actionable, the DataWorks Data Governance Center has added a "data governance plan" function that helps users carry out proactive governance planning and diagnosis.

For data managers, the data governance plan has built-in templates for governance scenarios such as computing and storage cost management and task stability management. It lets an enterprise set a data governance goal and provides a multi-dimensional data governance health assessment model to help evaluate the effectiveness of governance.

For data governance practitioners, the data governance plan provides a library of more than 60 governance rules covering 5 dimensions. Based on the governance goals that have been set, the product automatically recommends the governance issues relevant to those goals and provides corresponding governance measures, helping practitioners discover and solve problems in time. The Data Governance Center also provides pre-emptive interception: many problems, such as code specification issues and task naming issues, can be caught in advance during the data development stage. Both the pre-interception plug-ins and the after-the-fact issue discovery plug-ins can be customized by the enterprise.
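The pre-emptive interception described above can be pictured as a set of pluggable rules that run against a task before it is published. The Python sketch below shows that general pattern under stated assumptions: the `Task` and `Rule` classes, the rule IDs, the `<layer>_<snake_case>` naming convention, and the `intercept` function are all hypothetical and do not reflect the actual DataWorks rule library or plug-in interface.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Task:
    name: str
    sql: str


@dataclass
class Rule:
    rule_id: str
    description: str
    # A check returns None when the task passes, or a message when it fails.
    check: Callable[[Task], Optional[str]]


def check_task_name(task: Task) -> Optional[str]:
    """Hypothetical convention: <layer>_<snake_case>, layer in ods/dwd/dws/ads."""
    if re.fullmatch(r"(ods|dwd|dws|ads)_[a-z0-9_]+", task.name):
        return None
    return f"task name '{task.name}' does not match <layer>_<snake_case>"


def check_no_select_star(task: Task) -> Optional[str]:
    """Hypothetical code specification rule: forbid SELECT *."""
    if re.search(r"select\s+\*", task.sql, flags=re.IGNORECASE):
        return "avoid SELECT *; list columns explicitly"
    return None


# An enterprise could register its own rules by appending to this list.
RULES = [
    Rule("naming-001", "Task name follows the naming convention", check_task_name),
    Rule("sql-001", "SQL does not use SELECT *", check_no_select_star),
]


def intercept(task: Task, rules=RULES):
    """Run every rule and collect (rule_id, message) for each violation."""
    violations = []
    for rule in rules:
        message = rule.check(task)
        if message is not None:
            violations.append((rule.rule_id, message))
    return violations


good = Task("dwd_orders_di", "SELECT order_id, amount FROM ods_orders")
bad = Task("MyTask", "select * from ods_orders")

print(intercept(good))
print(intercept(bad))
```

Keeping each rule as a small function behind a common signature is one simple way to let a company add its own checks without touching the code that runs them.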

Data governance application: cost optimization by automatically taking invalid tasks offline

As an enterprise's business and personnel change, more and more invalid data tasks inevitably appear and consume a large amount of computing and storage cost every day. Traditional manual governance requires data engineers to analyze and judge each case by hand and to carry out complex impact analysis, and it also incurs communication and collaboration costs with the people affected. A small slip can easily affect online tasks and cause a failure, so data engineers are afraid of causing problems and neither dare nor want to clean up invalid tasks.

The DataWorks Data Governance Center provides a product function called "graceful offline", which takes invalid tasks offline in batches through an automated process. First, the impact of taking a task offline is analyzed automatically. The offline process is then broken down into five steps: delay scheduling, pause scheduling, take the task offline, back up the output table, and delete the output table. Each step has a silent period and automatically notifies the responsible persons or the people affected. The whole process resembles a "grayscale offline" mechanism: if something goes wrong, it can be recovered quickly and the impact is minimized.
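The five-step process with a silent period after each step can be sketched as a small state machine. The Python example below is an illustration only, not the DataWorks implementation: the `OfflinePlan` class, its method names, the step identifiers, and the three-day silent period are all hypothetical, and rollback is simplified to returning the completed steps in reverse order.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

# The five steps named in the article, in execution order.
STEPS = [
    "delay_scheduling",
    "pause_scheduling",
    "offline_task",
    "backup_output_table",
    "delete_output_table",
]


@dataclass
class OfflinePlan:
    task_name: str
    owners: list[str]
    silent_period: timedelta
    completed: list[str] = field(default_factory=list)
    next_allowed_at: Optional[datetime] = None
    notifications: list[str] = field(default_factory=list)

    def advance(self, now: datetime) -> Optional[str]:
        """Apply the next step if the silent period has passed, else do nothing."""
        if len(self.completed) == len(STEPS):
            return None  # every step is already done
        if self.next_allowed_at is not None and now < self.next_allowed_at:
            return None  # still inside the silent period of the previous step
        step = STEPS[len(self.completed)]
        self.completed.append(step)
        self.next_allowed_at = now + self.silent_period
        for owner in self.owners:
            self.notifications.append(f"{owner}: {step} applied to {self.task_name}")
        return step

    def roll_back(self) -> list[str]:
        """Return the completed steps in reverse order and reset the plan."""
        undone = list(reversed(self.completed))
        self.completed.clear()
        self.next_allowed_at = None
        return undone


# Hypothetical task and owners used only for illustration.
plan = OfflinePlan("dwd_orders_di", ["alice", "bob"], timedelta(days=3))
start = datetime(2023, 11, 1)
print(plan.advance(start))                      # first step is applied
print(plan.advance(start + timedelta(days=1)))  # blocked by the silent period
print(plan.advance(start + timedelta(days=3)))  # second step is applied
```

Because the destructive steps come last and each step waits out a silent period, a complaint from an affected owner can stop the plan while recovery is still cheap.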

In Alibaba's internal data team, a governance effort to take tasks offline used to cover a batch of 1,000 tasks with 30 responsible persons. From setting up a group and holding meetings, through analyzing the offline impact, formulating the offline plan, and executing the offline operations one by one, to following up on the results, it took 3-5 months. With the graceful offline function of the DataWorks Data Governance Center, the governance actions can be completed in 2 days, impact observation in 1 week, and the project can be formally closed in 15 days. Graceful offline has helped Alibaba's internal data warehouse team take tens of thousands of invalid tasks offline, saving a large amount of storage and computing cost.

The DataWorks Data Governance Center is available in DataWorks Enterprise Edition, and trial activities for the Enterprise Edition will be launched in the near future. Please follow the product's official website for details.

Since its birth within Alibaba Group in 2009, DataWorks has been an advocate and steadfast practitioner of the one-stop platform, covering data integration, the data development tool chain, the data governance tool chain, and analysis and service products on the data consumption side. Through this one-stop platform, DataWorks continues to build and accumulate data assets for enterprises. In the AI era, DataWorks is combining the product capabilities accumulated over the past 14 years with large models to provide enterprises with a one-stop intelligent data platform that improves the efficiency of enterprise data flow and accelerates the realization of data value.

Origin my.oschina.net/u/5583868/blog/10148350