OpenAI announces GPT-4's web crawler, GPTBot, which complies with the robots protocol and gathers data for model training

Xi Xiaoyao's Technology Sharing
Source | Heart of the Machine

As is well known, OpenAI has kept its technical details under wraps since GPT-4: it published only a technical report showing benchmark results, while staying silent about training data and model parameters. Even after netizens later leaked purported details, OpenAI never responded.

It is not hard to imagine that training GPT-4 requires massive amounts of data, far more than could be obtained by simply paying for it, so OpenAI almost certainly relies on web crawling. Many users have accused OpenAI of violating copyright and privacy on exactly these grounds.

Now OpenAI has come clean, directly announcing the web crawler it uses to collect data from across the Internet: GPTBot.

This data will be used to train AI models such as GPT-4 and GPT-5. However, OpenAI says that GPTBot filters out sources that require paywall access, are known to gather personally identifiable information, or contain text that violates its policies.

"GPTBot is used to crawl web data to improve the accuracy, functionality, and security of AI models," OpenAI said.

Website owners can allow or restrict GPTBot's access to their sites as they see fit. Next, let's look at how GPTBot identifies itself, and at how to block it along the way.

First, GPTBot's user-agent string is as follows:

User agent token: GPTBot

Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
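For illustration, here is a minimal Python sketch (not from OpenAI's documentation) of how a server-side filter could recognize GPTBot requests simply by matching the user-agent token:

def is_gptbot(user_agent: str) -> bool:
    # GPTBot identifies itself with the "GPTBot" token in its User-Agent header.
    return "GPTBot" in user_agent

# Example with the full user-agent string published by OpenAI:
ua = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
      "compatible; GPTBot/1.0; +https://openai.com/gptbot)")
print(is_gptbot(ua))  # True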

To prevent GPTBot from accessing a site entirely, add the following to the site's robots.txt:

User-agent: GPTBot
Disallow: /
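As a quick sanity check, Python's standard urllib.robotparser can confirm that these rules block GPTBot everywhere; the example.com URLs below are placeholders:

from urllib.robotparser import RobotFileParser

rules = "User-agent: GPTBot\nDisallow: /"

parser = RobotFileParser()
parser.parse(rules.splitlines())

# can_fetch() evaluates a user-agent token against the parsed rules.
print(parser.can_fetch("GPTBot", "https://example.com/any-page"))     # False: fully blocked
print(parser.can_fetch("Googlebot", "https://example.com/any-page"))  # True: no rule applies to it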

It is also possible to allow GPTBot access to only specific parts of a site:

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
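The same standard-library check works for per-directory rules; again, the directory names and domain are placeholders:

from urllib.robotparser import RobotFileParser

rules = """User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("GPTBot", "https://example.com/directory-1/page.html"))  # True: allowed
print(parser.can_fetch("GPTBot", "https://example.com/directory-2/page.html"))  # False: disallowed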

OpenAI has recently faced backlash for training large language models such as GPT-4 on website data without explicit approval. Critics say companies like OpenAI should follow rules governing how content may be used for training, even when that content is publicly accessible. There are also concerns that content can be taken out of context when fed into AI systems.

But even if the robots protocol is followed, it is only a convention rather than an enforceable specification, so by itself it cannot guarantee a website's privacy.
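For sites that want a hard guarantee, one option, sketched below purely as an illustration rather than anything OpenAI prescribes, is to refuse GPTBot at the server itself. This tiny standard-library handler returns 403 to any request whose User-Agent contains the GPTBot token (the localhost address is a placeholder):

from http.server import BaseHTTPRequestHandler, HTTPServer

class BlockGPTBotHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Reject requests that identify themselves as GPTBot.
        if "GPTBot" in self.headers.get("User-Agent", ""):
            self.send_error(403, "Crawler not permitted")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.end_headers()
        self.wfile.write(b"Hello, human visitor!")

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), BlockGPTBotHandler).serve_forever()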

The release of GPTBot has sparked a debate on Hacker News about the ethics and legality of using scraped web data to train artificial intelligence systems.

Some believe that the launch of GPTBot demonstrates the "grey area" of using public data to develop AI models:

“How convenient to scrape the data first and only allow blocking after the model is trained. Presumably these headers won’t affect any pages they’ve already crawled to train GPT.”

“Now they can lobby for anti-scraping regulation and block anyone else from catching up.”


Since GPTBot identifies itself, webmasters can block it via robots.txt. But unlike search-engine crawlers, which drive traffic back to websites, some see no benefit in allowing GPTBot in.

One concern is the use of copyrighted content without attribution; ChatGPT currently does not cite its sources.


Questions have also been raised about how GPTBot handles licensed images, videos, music, and other media found on websites. If such media are used in model training, it could constitute copyright infringement.

Other experts worry that content collected by the crawler could degrade model performance if AI-written text is fed back into training.

By contrast, some argue that OpenAI has the right to use public web data freely, likening it to a person learning from online content. Others counter that if OpenAI monetizes web data for commercial gain, the profits should be shared.

In sum, GPTBot has sparked complex debates about ownership, fair use, and incentives for creators of web content. While following robots.txt is a good step, there is still a lack of transparency.

This may be the next flashpoint of public opinion in the tech world: as AI products develop at a rapid pace, how should "data" be used?

 

