Wang Cheng: When Data Governance Meets ChatGPT

AI technologies represented by ChatGPT are advancing at breakneck speed and reshaping the world. On April 27, at the 2023 Data Governance New Practice Summit, Mr. Wang Cheng, founder and CEO of Datablau Digital Technology, gave a keynote titled "New Practices in Data Governance and Artificial Intelligence," exploring with attendees what kind of "chemical reaction" this wave of AI technology will produce when data governance meets ChatGPT.

The following is a transcript of Mr. Wang Cheng's speech, lightly edited for readability.

Hello everyone. First of all, on behalf of Datablau, thank you for coming to the 2023 Data Governance New Practice Summit! Today's main topic revolves around ChatGPT, an inflection point in human history.

Why have data elements become a new factor of production?

First, let's look at data as an element. In China, data is now considered a new type of production factor. Why is that? I interpret it through the three stages of economic development. The first stage is the agricultural economy, whose core factors are labor and land; the second is the industrial economy, whose core factors are capital, technology, and so on; the third is the digital economy we talk about today. The core change is that the first two stages revolve around the "supply and demand sides," that is, resource allocation and value exchange between enterprises and customers. Once data is brought in, however, more content gets generated, including AIGC (AI-generated content), which means enterprises, customers, and other stakeholders create value together.

From the perspective of enterprise scenarios, this is the digital twin: digitize content and information, build digital twins, run predictive simulations, and generate corresponding value. Digital twin 1.0 can be called role optimization; 2.0 is the parallel world, in which a fully digitalized twin runs ahead of reality to predict what may happen in the real world and feeds back so the real world can be optimized in advance. I think this is the real value of introducing data as a factor of production.


What is the impact of technology-driven digital development?

Next, I'll quote from a few of Dr. Lu Qi's recent, widely circulated lectures. From the perspective of the labor force: in agricultural society, farmers were tightly coupled to the land; in the later industrial society, labor began to flow, and so did the products it produced; at the present stage of digitalization, we have more of a service economy, whose core roles are programmers, designers, analysts, and so on. Moving from the ubiquity of digital information to the ubiquity of digital models is a major inflection point. That is why everyone predicts that models may replace programmers, designers, analysts, and the like, a source of anxiety in today's society. Once models mature, the main jobs left may be entrepreneur or high-end scientist.

[Slide from Dr. Lu Qi's courseware]

Dr. Lu Qi divides the human environment into three systems. The first is the perception-information system: information is everywhere. The second is the thinking-model system, which is essentially our knowledge models. The third is the implementation-action system. In the early days of information systems, IBM, Microsoft, and others were all about sensing and collecting information; the inflection point came when Google drove the human cost of obtaining information down to essentially zero. Of course, information systems will be with us for a long time yet. We are now at the inflection point of the second system, the thinking-model system: OpenAI's ChatGPT 3.5 has brought a qualitative change, a new paradigm that lowers the cost of acquiring knowledge (thinking) by transforming data into knowledge representation and achieving expected memory and generalization through reasoning and induction. The final action system is more about conversion between people and the physical world.

[Slide from Dr. Lu Qi's courseware]

On transforming data into knowledge representation and achieving expected memory and generalization through reasoning and induction, a real example came up in the past couple of days. In the group chat of our open-source data modeling community, someone started a discussion about how to design party relationships in the LD-FSM model.

[Screenshot: the discussion in the data modeling community group]

Everyone chimed in at once from various angles, but no reply quite hit the nail on the head. That's when someone started posting ChatGPT's responses.

First, ChatGPT was given a context: "You are a senior data modeling expert." But this version of the reply still didn't feel quite right.
[Screenshot: ChatGPT's first reply]
So ChatGPT was asked to answer again. This time the answer was quite reliable, basically reaching the level of an industry expert.

But it still contained some vague wording, such as "The modeling of the relationship between the parties focuses on the interaction between the parties." What does this interaction refer to? So ChatGPT was asked to clarify once more, and it gave an example that made the issue very clear.
[Screenshot: ChatGPT's clarification and example]
Finally, ChatGPT was asked for one more clarification and example.

[Screenshot: ChatGPT's further clarification and example]
Isn't this exactly what it means to lower the cost of acquiring knowledge (thinking)? Behind it, data is transformed into knowledge representation, and expected memory and generalization are achieved through reasoning and induction.

Before, getting this done might have meant hiring a modeling expert for a consulting project, spending months and tens or hundreds of thousands of yuan; now the cost is nearly zero. It is just like when Google launched its search engine and drove our cost of obtaining information to zero. So we are standing at a major inflection point.

What are the core elements of ChatGPT's success?

ChatGPT's GPT model is based on the Transformer sequence-model architecture. Compared with earlier approaches such as knowledge graphs, the Transformer can compress a huge amount of information far more efficiently, and that is the core breakthrough. Secondly, English is a global language, and ChatGPT's information is in effect contributed by people all over the world. In a Chinese-language environment there may still be considerable challenges: Western languages carry a philosophical tradition of logic and deduction, while Chinese is more ambiguous and harder to parse, and the Chinese corpus is an order of magnitude smaller than the English one. From the Chinese perspective, for capturing and training on this information in the future, should we translate English information into Chinese, or start directly from Chinese? That is a major fork in the road.


How far can artificial intelligence develop?

The AI technology represented by ChatGPT is powerful. Broadly, the development of artificial intelligence can be divided into three stages. The stage at which AlphaGo defeated the human Go champion belongs to weak AI. The current stage is approaching strong AI, comparable to the level of the human brain or even beyond it. After that comes super AI, the stage at which all human knowledge is covered. Some predict that super AI may arrive by 2030 or 2040.

In the American TV quiz championship, the human champion competed against the machine and found it very hard to win. So quizzes, arithmetic, rote memorization, and the like have long been covered by AI. Next come autonomous driving, speech recognition, vision, translation, and so on, almost all of which AI can now handle; but science, design, book writing, and art are still hard to achieve in the short term. So there is debate about how far AI can ultimately go. Here is an interesting thought experiment, John Searle's "Chinese Room": can future machines have emotions, and could they develop to an uncontrollable level? There is no conclusion yet; I leave it as an open question for everyone.

Powered by AI: an intelligent engine for data governance

We have in fact done a lot of research on ChatGPT ourselves. First, we asked it what it can do to help data governance. Its answer: first, it can help with data governance policies and processes; second, it can analyze data for validity and consistency; third, it can support quality monitoring, security and compliance for data governance, and task automation. To test the first point, we asked it to list 100 industry data standards for manufacturing, and it gave an answer that roughly met expectations.

[Screenshot: ChatGPT listing manufacturing industry data standards]

Next, we asked it to write "SQL code to check the validity of an ID card number," and what it wrote was impeccable. Genuinely impressive.
[Screenshot: ChatGPT's SQL for validating ID card numbers]
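
For readers without the screenshot, here is a minimal sketch of the same validation logic, written in Python rather than SQL for brevity. It assumes the 18-digit mainland China resident ID format, whose last character is a GB 11643 (ISO 7064 MOD 11-2) check digit; this is a reconstruction of the standard checksum, not ChatGPT's actual output, and it omits date-of-birth and region-code checks.

```python
import re

# Weights and check-digit map for the 18-digit resident ID
# (GB 11643, ISO 7064 MOD 11-2); both are published constants.
WEIGHTS = [7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2]
CHECK_DIGITS = "10X98765432"

def is_valid_id_number(id_number: str) -> bool:
    """Check the format and checksum of an 18-digit ID number."""
    id_number = id_number.strip().upper()
    # 17 digits followed by a digit or 'X'.
    if not re.fullmatch(r"\d{17}[\dX]", id_number):
        return False
    total = sum(int(d) * w for d, w in zip(id_number[:17], WEIGHTS))
    return CHECK_DIGITS[total % 11] == id_number[17]

print(is_valid_id_number("11010519491231002X"))  # True: checksum matches
print(is_valid_id_number("110105194912310021"))  # False: wrong check digit
```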

So, how should data governance embrace the new wave of AI technology represented by ChatGPT?

Datablau's intelligent practice in security classification and grading

Starting from practice: Datablau has been doing intelligent R&D on data security classification and grading. In our product platform architecture, we build a classification-and-grading corpus by training on industry classification-and-grading frameworks, then use word2vec to compare distances between word vectors, that is, the distance between a category vector and a metadata vector. Of course, this process needs optimization. For long descriptive text we usually apply word segmentation, which can leave the split-up pieces meaningless, and manual optimization is needed at that point.

As shown in the figure below, we segment the category description into words, map them into the vector space, and run correlation operations to measure the correlation between a field and each category description; the category with the highest correlation to the field becomes the recommended data classification. A minimal sketch of this pipeline follows the figure.
[Figure: word segmentation and vector-space correlation for classification recommendation]
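
As a rough illustration (not Datablau's actual implementation), the pipeline can be sketched as follows. The `word_vectors` lookup stands in for a word2vec model trained on the classification-and-grading corpus, and `jieba` is used only as a stand-in Chinese word segmenter; both are assumptions made for the sketch.

```python
import numpy as np
import jieba  # stand-in Chinese word segmenter (assumption for this sketch)

# Hypothetical pre-trained vectors: token -> np.ndarray of shape (DIM,).
# In a real system these come from a word2vec model trained on the
# classification-and-grading corpus.
DIM = 100
word_vectors: dict[str, np.ndarray] = {}

def embed(text: str) -> np.ndarray:
    """Segment a description and average its known word vectors."""
    vecs = [word_vectors[tok] for tok in jieba.lcut(text) if tok in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity: the 'distance' compared in the vector space."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def recommend_category(field_desc: str, categories: dict[str, str]) -> str:
    """Return the category whose description correlates most with the field."""
    field_vec = embed(field_desc)
    return max(categories, key=lambda name: cosine(field_vec, embed(categories[name])))
```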

At present we have done a great deal of intelligent security classification and grading in the securities and banking industries, especially against the People's Bank of China's industry standard for data security classification and grading, whose corpus we supplement with an industry corpus of 12.2 million entries. As a result, the first-pass recognition rate for banking data classification and grading reaches 76%, and with manual optimization it reaches 90%. The whole process also has a self-feedback effect; it is a process of machine self-learning.
OK, that's all on the topic of ChatGPT.
* Some pictures in the article are from Dr. Lu Qi’s courseware
