A brief analysis of data engineering

Digital transformation is no longer new to the market. From a technical perspective, although the emergence of large models has drawn renewed attention, artificial intelligence and big-data technologies are still in an innovation stage. Industries are searching for the balance point where valuable business scenarios meet emerging technologies, hoping to secure a favorable position in fierce competition with the support of those technologies.

Data

Data is a factor of production under the new generation of technological revolution. Mastering this production factor and its processing methods means mastering the value code of the digital economy; this is already a basic consensus in the industry.

If enterprises want to better manage and utilize data, they must understand the sources and organizational forms of data in the modern enterprise. Enterprise digital transformation is generally divided into three stages:

[Figure: the three stages of enterprise digital transformation]

In the process from data generation to realized data value, the information density of the data increases and the knowledge it contains becomes richer. By analyzing the entire data process, an enterprise can identify the key links and formulate an implementation plan suited to its own conditions. Analyzing the entire data process is the prerequisite for any enterprise implementing data engineering.

Data engineering

From the emergence of software development to its gradual scaling up, IT practitioners have accumulated best practices in requirements, design, implementation, testing, operations, and maintenance. The flow of data within an enterprise likewise passes through multiple stages, and problems of various kinds arise between each of them.

[Figure: stages of data flow within an enterprise and the problems between them]

Data engineering is the set of best practices that helps enterprises efficiently mine the value of data, continuously empower business growth, and accelerate the elevation of data into assets.

Data engineering includes stages such as requirements, design, construction, testing, maintenance, and evolution, and covers project management, development process management, engineering tools and methods, construction management, and quality management. It is a set of requirements for producing and using data at scale, providing data support for the business and ultimately generating value.

  • Data engineering is a system

  • Data engineering is the best practice for accelerating the data-to-value process at scale

  • Data engineering is part of software engineering

  • Data engineering is not a simple replay of traditional software engineering in the data field

For enterprises, data engineering includes three strategic links: data vision alignment, data engineering implementation, and continuous data operation.

[Figure: the three strategic links of data engineering]

The first step of vision alignment is to identify business value scenarios by defining a unified framework for measuring business value. Each explored scenario should record its background, value points, the users involved, the capabilities required, user journeys, the entities involved, risks, and other information.
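The scenario information listed above can be captured in a simple record structure. A minimal sketch (field names and the example values are illustrative, not a prescribed template):

```python
from dataclasses import dataclass, field

@dataclass
class ValueScenario:
    """A business value scenario identified during vision alignment."""
    name: str
    background: str            # the context the scenario arises in
    value_points: list[str]    # the measurable value it delivers
    users: list[str]           # users involved
    capabilities: list[str]    # capabilities required
    user_journey: list[str]    # ordered journey steps
    entities: list[str]        # data entities involved
    risks: list[str] = field(default_factory=list)

# Hypothetical example scenario
scenario = ValueScenario(
    name="churn-early-warning",
    background="Subscription business with rising churn",
    value_points=["flag at-risk users before they cancel"],
    users=["customer-success team"],
    capabilities=["behavioral event collection", "churn scoring"],
    user_journey=["activity drops", "score crosses threshold", "outreach"],
    entities=["user", "subscription", "event"],
    risks=["false positives wasting outreach effort"],
)
```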

The implementation process is like bringing a new life into the world: data sorting plans the blueprint, data architecture design lays out the skeleton, data model design forms the organs, data access provides the capacity to perceive information, data processing forms the central brain, and the testing and security parts protect the newborn. Each step depends on the others, and none can be omitted. Data engineering is implemented through seven steps: data sorting, data architecture design, data access, data processing, data testing, data security, and capability reuse and assurance.
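The seven steps above can be pictured as an ordered pipeline in which each stage gates the next. A toy sketch, with each stage reduced to a placeholder operation (the stage functions are illustrative, not a real API):

```python
def run_data_engineering(raw_sources: list[str]) -> dict[str, str]:
    """Illustrative pipeline: each line stands in for one of the seven steps."""
    inventory = sorted(raw_sources)                              # 1. data sorting: catalogue what exists
    schema = {name: "table" for name in inventory}               # 2. architecture design: plan the skeleton
    accessed = {name: f"ingested:{name}" for name in schema}     # 3. data access: bring data in
    processed = {k: v.upper() for k, v in accessed.items()}      # 4. data processing
    assert all(v.startswith("INGESTED:")                         # 5. data testing: verify each output
               for v in processed.values())
    secured = dict(processed)                                    # 6. data security: apply controls
    return secured                                               # 7. capability reuse: expose as shared asset
```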

The purpose of data operations is to form a "data culture" in which the enterprise looks at data, uses data, and treats data as a language and tool for communication. Only when data is easy to discover does it have the possibility of generating value.
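Discoverability, the precondition named above, can start as something as simple as a searchable catalog of dataset descriptions. A minimal sketch (the catalog entries are invented; a real deployment would use a metadata/catalog platform):

```python
# Hypothetical catalog: dataset name -> human-readable description with an owner
catalog = {
    "orders": "one row per order; owner: sales-data team",
    "users": "registered users with signup channel; owner: growth team",
}

def discover(keyword: str) -> list[str]:
    """Return dataset names whose name or description mentions the keyword."""
    kw = keyword.lower()
    return [name for name, desc in catalog.items()
            if kw in name.lower() or kw in desc.lower()]
```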

Data engineering competency model

In the final analysis, data engineering is still implemented by people. Building an enterprise's own mechanism for training personnel, and a channel for improving its people's data skills, is an important guarantee for the continuous iteration of data engineering capability.

The data engineer competency model is as follows:

[Figure: data engineer competency model]

The competency model of data product managers is as follows:

[Figure: data product manager competency model]

The competency model of data analysts is as follows:

[Figure: data analyst competency model]

Data engineering is an important guarantee for realizing the value of data in the digital economy and an important means of accelerating the transformation of data into value, and it must keep pace with the broader trend of the digital economy. To deal with new problems in the data field, new technologies and concepts keep emerging: modern data warehouses, data lakes, lakehouse architectures, distributed data architecture, machine learning, cloud-native data platforms, and more have taken the stage one after another.

Data Engineering Tool Map

"Data engineering" is a term promoted by the consulting company Thoughtworks, but it is largely old wine in a new bottle. Personally, I think it maps onto data governance in the traditional sense, for which a relatively mature system already exists. The following is a panorama of data governance tools:

[Figure: panorama of data governance tools]

In particular, the map of AI computing capability support tools is shown in the figure below:

[Figure: map of AI computing capability support tools]

Large models and data engineering

Breakthroughs in artificial intelligence have benefited from high-quality data, and data is one of the key elements in the competition among large models. Training a large model requires high-quality, large-scale, and diverse data sets, yet high-quality Chinese data sets are scarce. Industry data is highly valuable, and companies that hold high-quality data plus some large-model capability may empower their businesses through industry-specific large models.

In the future, the share of data costs in large-model development may increase, mainly covering data collection, cleaning, and labeling. With the model held relatively fixed, the overall training result can be improved by improving the quality and quantity of the data. The data-centric AI workflow is shown in the figure below:
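The "improve the data, hold the model fixed" idea comes down to iterating on collection, cleaning, and labeling. A minimal cleaning sketch, assuming two illustrative rules (deduplication and a minimum-length filter; real pipelines use far richer heuristics):

```python
def clean_corpus(samples: list[str]) -> list[str]:
    """Deduplicate and drop low-quality text samples before training."""
    seen, kept = set(), []
    for text in samples:
        norm = " ".join(text.split()).lower()  # normalize whitespace and case
        if len(norm) < 10:                     # too short to carry signal
            continue
        if norm in seen:                       # exact-duplicate removal
            continue
        seen.add(norm)
        kept.append(text)
    return kept
```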

[Figure: data-centric AI workflow]

From GPT-1 to LLaMA, large language model data sets have mainly included six categories: Wikipedia, books, journals, Reddit links, Common Crawl, and other data sets. Multimodal large models require deeper networks and larger data sets for pre-training, and in recent years both the parameter counts and the data volumes of multimodal models have kept growing. For example, the data set behind Stable Diffusion, released by Stability AI in 2022, contains 5.84 billion image-text pairs, 23 times the size of the data set for DALL-E, released by OpenAI in 2021.

Domestic industries have abundant data resources, and the CAGR of China's data volume from 2021 to 2026 is higher than the global figure, with the data coming mainly from government, media, services, retail, and other industries. According to IDC, China's data volume will grow from 18.51 ZB in 2021 to 56.16 ZB in 2026, a CAGR of 24.9%, above the global average. Although domestic data resources are abundant, high-quality Chinese data sets remain scarce because of insufficient data mining and the inability of data to circulate freely in the market.
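The 24.9% figure can be checked directly from the two endpoints, since CAGR = (end/start)^(1/years) − 1 over the five years from 2021 to 2026:

```python
# Verify IDC's CAGR figure from the endpoint volumes cited above
start_zb, end_zb, years = 18.51, 56.16, 5
cagr = (end_zb / start_zb) ** (1 / years) - 1
print(f"{cagr:.1%}")  # matches the reported 24.9%
```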

The proprietary data for training Baidu's "Wenxin" large model mainly includes trillions of web pages and billions of search and image records. The training data of Alibaba's "Tongyi" large model comes mainly from Alibaba DAMO Academy. The proprietary training data of Tencent's "Hunyuan" large model comes mainly from high-quality sources such as WeChat official accounts and WeChat search. Besides public data, the training data of Huawei's "Pangu" large model is also supported by B-side industry data, including meteorological, mining, railway, and other industry data. The training data of SenseTime's "RiRiXin" model includes the self-built Omni Objects 3D multimodal data set.

Therefore, in this era of large models, enterprise data engineering must incorporate a data architecture oriented toward large models: complete self-annotation at the moment data is generated, supplement it with data from data service providers, and treat large models as the default option for building the enterprise's own domain model.
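Annotating data at the moment it is generated, as suggested above, can be sketched as attaching labels and provenance metadata when each record is emitted. All names here are hypothetical, and `auto_label` is a stand-in for a model-backed annotator:

```python
import json
import time

def auto_label(payload: dict) -> list[str]:
    # Placeholder rule: a real system might call a domain model here.
    return ["has_amount"] if "amount" in payload else []

def emit_event(payload: dict, source: str) -> str:
    """Wrap a raw record with provenance and labels at generation time."""
    record = {
        "payload": payload,
        "source": source,               # provenance for later dataset curation
        "ts": time.time(),              # capture time
        "labels": auto_label(payload),  # self-annotation at generation time
    }
    return json.dumps(record)
```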

Let’s wait and see!



Origin blog.csdn.net/wireless_com/article/details/132137892