How to solve the data bottleneck in the "battle of a hundred models"?




Since the debut of ChatGPT, large models have set off a wave of enthusiasm in China, with every sign of a "battle of a hundred models" taking shape. Data, algorithms, and computing power are the troika of large-model training, and the quantity, quality, and diversity of pre-training data have become key determinants of model performance. The importance of data to the field of artificial intelligence deserves re-examination.

However, while pursuing technological innovation, sufficient attention must also be paid to data legality, privacy protection, and ethics. Judging from the current development of large AI models, disputes over datasets have become increasingly common in recent years. Because large-scale training datasets are critical to building powerful AI models, the sourcing and use of such datasets has raised a series of legal and ethical controversies. With the rapid development and wide application of AI technology, ensuring the legal and transparent use of data is especially important.

Data, then, is the basis of large-model training. How can data security be ensured? Should large-model training focus on "quantity" or "quality"? And what are the solutions to the data problems in the current training process?

Data has become a bottleneck for large-model development

The quality of datasets is key to the development of large models: only with high-quality, diverse datasets can large models exhibit real intelligence and creativity. Yet in the course of large-model development, the data they consume has become a major obstacle to growth. Abroad, data disputes triggered by large-model development have already emerged.

OpenAI, the developer of ChatGPT, is stealing troves of personal information to train its artificial intelligence models in a desperate pursuit of profit, a group of anonymous individuals allege in a class-action lawsuit. They accuse OpenAI of violating privacy laws by secretly scraping 300 billion words from the internet, tapping "books, articles, websites and posts, including personal information obtained without consent."

Data disputes related to large models have also appeared in China. Among them, Bishen Composition's accusation against the Xueersi large model once again drew attention to how important data is to large models. Bishen Composition stated that Xueersi used "crawler" technology to illegally access and cache data from the Bishen Composition app's servers as many as 2.58 million times. This behavior not only breached the contract between the two parties but also violated relevant provisions of the Data Security Law, seriously infringing the data rights of the Bishen Composition app.


In response, Xueersi's official Weibo account replied: "First, MathGPT is a self-developed large model focused on the field of mathematics and contains no composition-related data; second, the 'Composition AI Assistant' is still under development and has not been released, and the service does not use any data from Bishen Composition."

In addition, Twitter and Reddit (often called the "American Tieba") both announced in the first half of this year that they would charge for API access, at no small price. Previously, content on these platforms could be crawled for free by companies such as Google and OpenAI and used as training corpora for large language models. Twitter CEO Elon Musk once said, "They (Microsoft) illegally used Twitter data for training. It's time to sue them."

Samsung has also taken note of the phenomenon, introducing a new policy that bars employees from using generative AI such as OpenAI's ChatGPT and Google Bard in the workplace. According to Samsung, internal source code was accidentally leaked in April after an engineer uploaded it to ChatGPT, raising concerns that company data could end up in the hands of other users through AI platforms. As a result, Samsung employees are prohibited from using AI tools on company devices, including computers, tablets, and phones. Employees may still use AI tools on personal devices, though only for non-work purposes.

Has data become a bottleneck for large-model training? To find out, Data Ape spoke with industry experts.

Lei Tao, CEO of Tianyun Data, said we need to trace this issue to its root: are we building a large model, or feeding one? At present, the corpora that large models can draw on are open, shared, and free. According to Phoenix Weekly, Chinese-language data accounts for just 0.09905% of ChatGPT's training data, less than one-thousandth of the total. If the steam engine encapsulated and transported power, and electricity encapsulated and transported energy, then artificial intelligence encapsulates and transports knowledge. The knowledge in large models will become tomorrow's infrastructure, and whether what they "preach" turns out to be the "Bible" or the "Hundred Schools of Thought" will make an enormous difference. So filling the large-model corpus is the fundamental bottleneck. As 1984 puts it: "Who controls the past controls the future; who controls the present controls the past." The line applies perfectly to large-model data.

Dr. Yang Xiaodong, Director of Computing Technology at Huayuan, believes the current chokepoints for large models are concentrated in two areas:

First, for companies and solution providers in specific industries, high-quality industry data is indeed a major bottleneck. The 80/20 rule applies here: roughly 80% of a large model's final effect is determined by data. With low-cost parameter-efficient fine-tuning (PEFT) on high-quality data, possibly combined with LangChain, one can build an industry model that performs well across the board (a minimal fine-tuning sketch follows below). But if the data is weak, one can only address generic, low-stakes scenarios using the base model's own capabilities.
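
To make the PEFT point concrete, here is a minimal sketch of LoRA-style parameter-efficient fine-tuning using the open-source Hugging Face peft library. The base model name and hyperparameters are illustrative assumptions, not anything the interviewees specified.

```python
# A minimal PEFT/LoRA sketch: adapt a small base model to industry data
# by training low-rank adapter matrices instead of all weights.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "bigscience/bloom-560m"  # assumed small base model, for illustration only
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA inserts small trainable matrices into attention layers, so a
# modest but high-quality industry dataset can adapt the model cheaply.
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of weights are trainable
```

The full model weights stay frozen; only the adapters are updated, which is what makes this "low-cost" relative to full fine-tuning.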

Second, from the perspective of large-model technology itself, continuously improving model performance and accelerating engineering deployment depend on innovation in the pre-training network architecture, optimization of the Transformer and its attention mechanism, and optimization of communication libraries such as NCCL. This requires investing in foundational research and shedding the role of a follower in basic science.
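
As one concrete illustration of the "Transformer and attention optimization" direction mentioned above, the sketch below uses PyTorch 2.x's fused scaled_dot_product_attention, which can dispatch to FlashAttention-style kernels rather than materializing the full attention matrix. The tensor shapes are arbitrary examples.

```python
# Fused attention in PyTorch 2.x: one call replaces the manual
# softmax(QK^T / sqrt(d)) V computation and avoids the O(seq_len^2)
# intermediate matrix when an optimized kernel is available.
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 1024, 64)  # (batch, heads, seq_len, head_dim)
k = torch.randn(1, 8, 1024, 64)
v = torch.randn(1, 8, 1024, 64)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```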

Rich and diverse data can help models better understand language structure, semantic relationships, and contextual information. However, constructing high-quality datasets is not an easy task.

Data is the cornerstone of large model training

In the artificial intelligence field's battle of a hundred models, the training of large language models has become a key arena of competition. Data, algorithms, and computing power, the troika of large-model training, all play vital roles. Among them, datasets, as the cornerstone of training, critically shape model performance and innovation capacity; data quality in particular cannot be ignored.

Currently, data for large models generally comes from multiple sources, including the following:

First, public datasets. Many fields have public datasets, such as the image datasets ImageNet and MNIST and text datasets such as Wikipedia. These datasets are released openly by research institutions, scholars, or companies and are widely used and shared within their fields; they are the main data source for most general-purpose large models (see the loading sketch after this list).

Second, cooperative data sharing. Many companies, institutions, and scholars hold unique data resources and are willing to share them to support research and applications in different fields. For example, many medical institutions collect large volumes of medical imaging data that can be used for training tasks such as image analysis or lung-cancer detection. This is exactly what happened with Bishen Composition: although the two parties were partners, they disagreed over the use of the data.

Third, large-scale web data. When we use the products and services of large internet companies, those companies usually collect and store our data, including search history, browsing history, GPS location, and social connections. Such data can be used to train large language models, natural-language-processing models, and so on. A domestic large model's data sources correlate strongly with its company's core business: Baidu, as China's leading search-engine company, sources the datasets for its large-model product Wenxin Yiyan (ERNIE Bot) mainly from web text, books, news, social media content, scientific papers, speech transcriptions, and the like, which is one of its advantages in model training.

Fourth, data crowdsourcing. Crowdsourcing solves problems by collecting data from large numbers of users or workers. It can quickly assemble large-scale datasets for tasks such as image annotation and audio transcription, which can then be used to train vision and speech models.
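
Returning to the first source above, a minimal sketch of pulling a public text dataset with the Hugging Face datasets library might look like the following; the Wikipedia snapshot name is just one common public example, and availability depends on the dataset hub.

```python
# Load a public text corpus for pre-training experiments.
from datasets import load_dataset

# The 20220301.en snapshot is a preprocessed public Wikipedia dump.
wiki = load_dataset("wikipedia", "20220301.en", split="train")
print(len(wiki))                 # number of articles
print(wiki[0]["text"][:200])     # inspect one article before any cleaning
```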

OpenAI previously disclosed that to make AI converse as smoothly as humans, researchers fed GPT-3.5 as much as 45 TB of text corpus, equivalent to 4.72 million copies of China's Four Great Classical Novels. The sources included Wikipedia, online articles, books and periodicals, and even the open-source code platform GitHub.

Recently, after half a year of R&D, the domestic AI quasi-unicorn Shizai Intelligence (Real Intelligence) officially began internal testing of TARS, its self-developed large language model for vertical domains. On the datasets currently used for training, founder and CEO Sun Linjun said the sources are varied: mainly public datasets, classic books, documents, knowledge content, encyclopedias, and open-source data collections, plus data accumulated from the company's own business; for vertical models, partner enterprises also provide relevant datasets. The proportions are not fixed, but public data accounts for by far the largest share, and training data is accessed mainly through a purpose-built database.

Dr. Ma Liang, CTO and Chief Data Scientist of HCR Huichen Co., Ltd., which recently released a range of AIGC products, said: we provide professional data-analysis services for industry, so our training focuses on building industry-specific AIGC analysis models, which places higher demands on business-intelligence generation for specific sectors. Accordingly, very little of our training data comes from external sources; it mainly comes from domain data accumulated over the company's long service to various industries, most of it expert-generated business data resources (including large amounts of industry public data, professional questionnaire templates, project-proposal templates, business-analysis report templates, and so on). We have not yet used data from partner enterprises.

Du Junping, chairman of the board of the LF AI & Data Foundation, has publicly remarked: "The large AI model is like a greedy 'monster'; researchers always need to feed it more and better-quality data." He said today's data comes almost entirely through three channels: active collection on the internet, purchase from third parties, and use of public datasets. In Du Junping's view, data from the first channel is strongly limited, since copyright issues mean many companies can only harvest their own private domains; data from the second channel faces problems of pricing and quality; and data from the third channel can only be used for research and carries many restrictions on commercial and other uses.

And industry data is core private-domain data: the greater its volume and quality, the more valuable it is.

Take the vertical-industry model trained by Xueersi in this incident: an education company holds a large amount of educational data, so it can develop products such as vertical education models. Similarly, project data in construction, user-profile data in finance, and vessel-position data in shipping are the keys to powering vertical large models.

However, this private-domain data is held by the enterprise itself or its partners, and for the sake of data security and compliance, most organizations will attempt large-model training only after on-premises deployment. It is hard to imagine enterprises handing their core data to others for training.

Upgrade from "quantity" to "quality"

If early large-model training was all about "quantity," then as training matures, "quality" will become the inevitable choice for large-model data.

It is therefore also important to classify and label data properly. Data classification and grading can improve efficiency, and high-precision labeled data can further strengthen a large model's professional performance. At this stage, however, obtaining high-precision labeled data for vertical industries is expensive, and public databases contain little industry-specific data, which places high demands on building vertical large models.

Regarding current dataset quality, He Conghui, research director of SenseTime's large-model unit, said large language models place high demands on pre-training data quality, mainly in fluency, cleanliness, knowledge density, and safety. Training data must contain large amounts of syntactically and semantically correct text so the model learns to understand and generate language that follows linguistic rules; fluency directly affects whether generated text reads smoothly. Cleanliness means pre-training data should be clean and accurate, free of errors, noise, or inconsistencies; during training the model learns the patterns and features in the data, and low-quality data can make generated text erroneous and inaccurate. Safety matters too: language models should observe ethical and legal norms and not generate harmful, offensive, or inappropriate content, so pre-training data must be screened and reviewed to exclude such material and ensure generated text conforms to social values and ethical standards.
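
As a toy illustration of the dimensions He Conghui describes, a rule-based pre-training filter might look like the sketch below. The thresholds and blocklist are illustrative assumptions; real pipelines add far richer heuristics plus model-based scoring.

```python
# Toy rule-based filters for pre-training documents, roughly mapping to
# knowledge density, fluency, cleanliness, and safety.
import re

BLOCKLIST = {"some_banned_term"}  # placeholder for a real safety lexicon

def keep(doc: str) -> bool:
    words = doc.split()
    if len(words) < 50:                       # too short to be knowledge-dense
        return False
    if len(set(words)) / len(words) < 0.3:    # heavy repetition -> low fluency
        return False
    if re.search(r"(.)\1{9,}", doc):          # runs of junk characters -> unclean
        return False
    if any(w.lower() in BLOCKLIST for w in words):  # safety screen
        return False
    return True

corpus = ["..."]  # raw documents from crawls or dumps
cleaned = [d for d in corpus if keep(d)]
```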

Sun Linjun, founder and CEO of Real Intelligence, said large-model training places relatively high demands on data quality: base training, fine-tuning, and reward-model training all need fairly high-quality datasets, and the quality of multi-turn interaction data and of ranked generation outputs greatly affects model performance. Low-quality public datasets should be either cleaned or discarded. Meanwhile, the distribution and density of the data are also important determinants of model quality and are themselves part of data quality.

GPT places high demands on data quality, and industry AIGC places even higher demands on data that embodies industry understanding, mainly in two respects: close fit with the industry, and deep professional cognition of the business. "Even the data we have accumulated in our professional domain has many problems before training: not just routine cleaning, but also how the industry's deep business knowledge is structured and expressed, which requires many adjustments. The same batch of raw corpus, cleaned and optimized in different ways, yields models with different business-analysis performance after training," said Dr. Ma Liang, CTO of HCR Huichen Co., Ltd.

Large language models are deep neural networks with billions to trillions of parameters, "pre-trained" on terabyte-scale corpora of natural language, including structured data, online books, and other content. ChatGPT's biggest breakthrough came with GPT-3, which has about 175 billion parameters and was trained on 45 TB of data.

Li Wei, vice president of Mobiz, believes data is the fuel of large models, and its quality largely determines model quality. Our data work divides into two parts, pre-training and subsequent alignment training (SFT, RLHF): the former seeks quantity, the latter quality. In principle, pre-training data should be as diverse and clean as possible. Alignment data, especially SFT data, need not be large in volume but must be high in quality and should reflect the diversity and proportions of the alignment task; the literature shows that a small amount of high-quality, diverse data can perform well in alignment. Of course, in actual engineering, blindly scaling up SFT data is inadvisable: a compact aligned set (for example, 1,000 to 10,000 examples) can work, while an excessively bloated SFT set (tens of millions of examples or more) does not necessarily produce a good model. Routine data-enhancement and alignment work must be streamlined for rapid iteration before quality improvements to the large model can take effect.
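
To picture what a "small but high-quality" SFT set looks like in practice, here is a hedged sketch: a few thousand instruction-response pairs stored as JSONL and exact-deduplicated before training. The Alpaca-style field names are a common convention, not a fixed standard and not Mobiz's actual format.

```python
# Build a compact SFT alignment set: diverse, reviewed pairs, deduplicated.
import hashlib
import json

records = [
    {"instruction": "Summarize the paragraph below.", "input": "...", "output": "..."},
    # ... a few thousand diverse, carefully reviewed examples
]

seen, unique = set(), []
for r in records:
    key = hashlib.md5(json.dumps(r, sort_keys=True).encode()).hexdigest()
    if key not in seen:          # exact-duplicate removal; quality over quantity
        seen.add(key)
        unique.append(r)

with open("sft_data.jsonl", "w", encoding="utf-8") as f:
    for r in unique:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")
```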

Can joint construction and sharing solve the dataset problem of large-model training?

The development of large models is inseparable from massive data, and at present the intellectual-property status of data sources has become the Achilles' heel of large-model development. The Xueersi and ChatGPT incidents above mainly involve alleged "data theft" by large AI models. By what criteria can data be judged to have been stolen?

In judicial practice in recent years, AI data-scraping cases differ little in essence from typical data-scraping cases. Courts must judge whether the scraping damaged the data holder's commercial interests and competitive advantage, whether the unauthorized use of another's work product violated business ethics, and whether the scraping itself was reasonable and lawful.

Scraping that uses technical means to destroy others' competitive advantage, with the subjective intent of gaining advantage for oneself, violates the principle of good faith and disrupts the order of competition, and may constitute unfair competition; it also directly violates relevant provisions of the Data Security Law.

For a partner's data in particular, if the cooperation agreement contains a breach-of-contract clause, that clause governs. If the agreement does not cover the situation, the act is treated as an infringement, with corresponding liability including, but not limited to, apologizing, ceasing the infringement, and compensating losses.

Balancing data use and privacy is a central issue for large-model applications: how to protect user privacy while ensuring data security is a problem that must be solved. At present, privacy-computing technology has become a key technical path for balancing data circulation with privacy and security.
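
As a toy illustration of one privacy-computing primitive, the sketch below adds Laplace noise to an aggregate statistic (the classic epsilon-differential-privacy mechanism) so a statistic can circulate without exposing individual records. Production privacy computing spans much more, such as federated learning and secure multi-party computation; the numbers here are illustrative.

```python
# Laplace mechanism for a differentially private mean over bounded values.
import numpy as np

def dp_mean(values, epsilon=1.0, value_range=100.0):
    true_mean = np.mean(values)
    # One record can shift the mean by at most value_range / n.
    sensitivity = value_range / len(values)
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_mean + noise

ages = [23, 35, 41, 29, 52]  # toy data, assumed bounded in [0, 100]
print(dp_mean(ages))  # close to the true mean, but individual ages stay hidden
```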

Facing these challenges, how should the dataset bottleneck be solved?

1. National and social levels.

First, data security can be guaranteed through legislation. Japan, the United Kingdom, and the European Union have recognized data mining as fair use in law: Japan has created a copyright exception for text-and-data mining under the name of "computerized information analysis," and the United Kingdom has likewise introduced a copyright license or exception for text and data mining.

On June 14, the European Parliament voted to adopt its negotiating mandate on the draft Artificial Intelligence Act, moving the bill into the final stage before EU regulation begins. The bill requires providers of foundation models, such as OpenAI, Google, and Microsoft, to disclose whether copyrighted data was used in training their models.

Earlier, the Measures for the Management of Generative Artificial Intelligence Services (Draft for Comment), released by the Cyberspace Administration of China in April, likewise specified that the pre-training and fine-tuning data used for generative AI products must comply with the Cybersecurity Law and other laws and regulations, must not contain content that infringes intellectual-property rights, must satisfy the "informed consent" principle where personal information is involved, and must be authentic, accurate, objective, and diverse.

Zhang Xin, executive director of the Digital Economy and Legal Innovation Research Center at the University of International Business and Economics, said the draft Measures establish a clear framework for the compliance of AI training datasets; beyond intellectual property, various other legal instruments can also be explored.

In Zhang Xin's analysis, enforcement still faces problems such as the difficulty of after-the-fact tracing, especially as algorithms grow more complex and "algorithmic black boxes" emerge. Reconstructing and auditing a dataset's compliance after the fact depends heavily on the model developer providing data-processing records and logs, and is hard to verify from the outside. Technically, moreover, it is difficult for a large model to precisely delete one user's personal information, which limits the exercise of the "right to deletion" in personal-information protection.

Second, through the joint construction and sharing of datasets, large-model enterprises can gain richer data.

Research institutions and developers are beginning to recognize the importance of cooperation and sharing. Dataset-sharing platforms and cooperation networks can promote the sharing and complementarity of data resources, reducing the collection and labeling burden on any single team.

By sharing datasets, data from different sources and domains can be obtained, increasing diversity; this helps train models with broader applicability, adapted to different scenarios and tasks. Sharing lets all parties make full use of their respective data resources, avoid duplicated labor and waste, and improve data-utilization efficiency. A joint-construction model can effectively integrate the expertise and resources of all parties for win-win cooperation, and the risks of data collection and use can be shared: parties can jointly formulate data-usage guidelines and cooperation agreements, clarify data rights and responsibilities, and reduce legal and ethical risk.

Dr. Yang Xiaodong said a shared, co-constructed data mechanism can continuously support large-model research and deployment, provided the interests of all parties are balanced first and data quality and quantity are guaranteed through administrative and technical means, so that real value is realized and a healthy ecosystem forms.

However, the joint-construction model also faces challenges and limits. First, data privacy and protection: partners must ensure data security, formulate privacy-protection measures, and abide by relevant laws and regulations to protect data owners' rights. Second, coordination: a multi-party model requires good cooperation mechanisms, with effective communication and collaboration on data collection, labeling, and use to ensure dataset consistency and quality. Finally, rights and benefit distribution: all parties must negotiate a consensus and formulate a fair, reasonable benefit-sharing mechanism so that everyone's rights and interests are respected and protected.

2. For large-scale model R&D enterprises.

For large-model R&D companies, resolving data disputes is crucial. First, ensure compliance with relevant laws and regulations throughout data collection, use, and storage, including data-protection and privacy rules, and establish clear policies and processes to ensure data compliance and legality.

Second, establish clear contracts and agreements with data providers, partners, and customers that define data rights, scope of use, and restrictions, so that both sides have an explicit agreement on data use and sharing and each party's responsibilities and obligations are clear.

In the course of data collection and use, conduct data review and verification to confirm sources and legality: verify the data's accuracy, completeness, and authority, and confirm with the data provider.

Moreover, adopt appropriate data-security measures, including encryption, access control, backups, and disaster-recovery plans, to prevent data from being stolen, tampered with, or leaked, and to protect its confidentiality and integrity.

At the same time, large-model R&D companies are advised to seek professional legal support, especially when handling data disputes; legal professionals can provide targeted advice and guidance so that disputes are resolved within the legal framework.

Finally, uphold integrity and business ethics in data collection and use: follow the principles of fair competition and reciprocity, respect the rights and interests of data owners, and avoid unauthorized or malicious use of others' data.

Large-model R&D enterprises should take data disputes seriously and act to resolve and prevent them. Compliance and legality, contracts and agreements, data review and verification, security measures, legal support, training and education, and integrity and business ethics all need to be applied and implemented effectively in an enterprise's data management and operations.

3. For partners or users.

Data security is a familiar refrain. How should large-model partners and users protect their own data from infringement?

First, read and review contracts carefully. Before cooperating with a large-model R&D company, scrutinize the contract terms, especially those on data use and protection, and make sure the contract includes clear data-security clauses covering confidentiality, security, and compliance.

Second, limit the scope of data provided. During cooperation, clearly stipulate the scope and purpose of data provision, supply only necessary data, and limit disclosure of sensitive information; using only what is reasonably needed reduces the risk of breaches and misuse. When sharing data, take steps to protect privacy and anonymity: data-masking techniques, encryption, and anonymization can reduce the likelihood that data is identified or linked.

Also develop internal risk-management mechanisms, including plans and procedures for monitoring and responding to incidents such as data leaks and unauthorized access, so that data-security issues can be handled promptly.
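
As a toy sketch of the masking and pseudonymization idea, the snippet below hashes obvious PII patterns before data leaves the organization. The regex patterns and salt handling are illustrative assumptions; production systems use dedicated PII-detection tooling and stronger anonymization guarantees.

```python
# Mask phone numbers and emails with salted one-way hashes, so records
# stay linkable for analysis without exposing the raw values.
import hashlib
import re

PHONE = re.compile(r"\b1\d{10}\b")            # mainland-China mobile format
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def pseudonymize(value: str, salt: str = "org-secret") -> str:
    # In practice the salt must be managed as a secret, not hard-coded.
    return hashlib.sha256((salt + value).encode()).hexdigest()[:10]

def mask(text: str) -> str:
    text = PHONE.sub(lambda m: "PHONE_" + pseudonymize(m.group()), text)
    text = EMAIL.sub(lambda m: "EMAIL_" + pseudonymize(m.group()), text)
    return text

print(mask("Contact Zhang San at 13812345678 or zs@example.com"))
```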

Of course, real-time monitoring of data use is also required: keep monitoring and tracking shared data, ensure it is used in accordance with contracts and agreements, and watch for unusual activity or unauthorized access. Require partners or large-model R&D companies to take appropriate security measures, such as encryption, access control, and vulnerability patching, to keep data secure and confidential.

Most important is choosing a reliable partner. When selecting one, carefully evaluate its data-security and privacy-protection capabilities; choose a reputable, trustworthy firm and understand its security measures and compliance posture.

In short, for large-model R&D enterprises, partners, and users alike, protecting data security is essential. As a key link in large-model development, datasets require the support of comprehensive technology, partnership, and ethical guidelines. Only by solving the dataset bottleneck can large models develop further and bring more innovation and applications to the field of artificial intelligence.

Text: Yu Xiaoyu  /  Data Ape

