GPTBot

Website owners can allow or restrict GPTBot's crawling of their site as they see fit. The data it collects may feed future models such as GPT-5, and the bot can be blocked if necessary.

As is well known, OpenAI has kept its technical details under wraps since GPT-4: its Technical Report showed only benchmark results and said nothing about training data or model parameters. Details later leaked online, but OpenAI never responded.

It is easy to imagine that training GPT-4 required massive amounts of data, more than could simply be purchased. In all likelihood, OpenAI used a web crawler, and many users accused the company of violating copyright and privacy rights on exactly those grounds.

Now OpenAI has come clean, officially announcing the web crawler that scrapes data across the Internet: GPTBot.

The data will be used to train AI models such as GPT-4 and GPT-5. OpenAI says, however, that GPTBot filters out paywalled sources and content that violates privacy.

"GPTBot is used to crawl web data to improve the accuracy, capability, and safety of AI models," OpenAI said.

Website owners can allow or restrict GPTBot's access to their site as needed. Next, let's look at how GPTBot works and, along the way, how to block it.

First, GPTBot's user-agent string is as follows:

User agent token: GPTBot

Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
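Server logs can be checked against this token. Below is a minimal sketch of such a check; the full string is the one OpenAI documents, while the helper name is my own:

```python
# Detect GPTBot by its user-agent token. OpenAI's docs say the token
# "GPTBot" appears in the full user-agent string, so a case-insensitive
# substring match is enough for log analysis or request filtering.
GPTBOT_UA = ("Mozilla/5.0 AppleWebKit/537.36 "
             "(KHTML, like Gecko; compatible; GPTBot/1.0; "
             "+https://openai.com/gptbot)")

def is_gptbot(user_agent: str) -> bool:
    """Return True if the request's User-Agent contains the GPTBot token."""
    return "gptbot" in user_agent.lower()

print(is_gptbot(GPTBOT_UA))                         # True
print(is_gptbot("Mozilla/5.0 (Windows NT 10.0)"))   # False
```

Note that a user-agent string can be spoofed, so this identifies well-behaved GPTBot traffic, not all OpenAI-originated requests.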

Use the following method to add GPTBot to the robots.txt of the website to prohibit GPTBot from accessing the website:

User-agent: GPTBot
Disallow: /
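A quick way to confirm the rule behaves as intended is to feed those two lines to Python's standard-library robots.txt parser and ask whether GPTBot may fetch a URL (the site URL here is a placeholder):

```python
# Verify a "Disallow: /" rule for GPTBot using the stdlib parser.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: GPTBot",
    "Disallow: /",
])

# GPTBot is blocked everywhere; agents with no matching rule default to allowed.
print(rp.can_fetch("GPTBot", "https://example.com/any/page"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/any/page"))  # True
```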

It is also possible to allow GPTBot to access the content of specific parts of the website:

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
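The same stdlib parser can sanity-check this mixed policy; the directory names follow the example above, and the site URL is again a placeholder:

```python
# Check per-directory Allow/Disallow rules for GPTBot.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: GPTBot",
    "Allow: /directory-1/",
    "Disallow: /directory-2/",
])

print(rp.can_fetch("GPTBot", "https://example.com/directory-1/page.html"))  # True
print(rp.can_fetch("GPTBot", "https://example.com/directory-2/page.html"))  # False
print(rp.can_fetch("GPTBot", "https://example.com/other/"))                 # True
```

Paths not matched by any rule fall back to "allowed", so a site that wants to block everything except one directory should also add an explicit `Disallow: /`.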

Recently, OpenAI has faced backlash for training large language models such as GPT-4 on website data without explicit permission. Critics say companies like OpenAI should follow consent protocols even when content is publicly accessible, and some worry that content will be taken out of context once fed into AI systems.

But even when the robots.txt protocol is honored, it is a convention rather than an enforceable standard, so it cannot guarantee a site's privacy.

Since its release, GPTBot has sparked a debate on Hacker News about the ethics and legality of using scraped web data to train AI systems.

Some believe the launch of GPTBot highlights the "grey area" of using public data to develop AI models:

"Nice of them to announce this only after training the model. Presumably these headers won't affect any pages they've already crawled to train GPT."

"Now they can lobby for anti-crawling regulation and thwart anyone else trying to catch up." Since GPTBot identifies itself, webmasters can block it via robots.txt, but some see no point in allowing it at all: unlike search-engine crawlers, it brings a site no traffic in return.

One concern is the use of copyrighted content without attribution; ChatGPT currently provides none. Questions have also been raised about how GPTBot handles licensed images, videos, music, and other media on a site, since using such media in model training may constitute copyright infringement.

Other experts note that crawler-collected data could degrade model performance if AI-written content is fed back into training.

Some, by contrast, argue that OpenAI has the right to use public web data freely, likening it to a person learning from online content. Others counter that if OpenAI monetizes that data for commercial gain, the profits should be shared.

In sum, GPTBot has sparked complex debates about ownership, fair use, and incentives for creators of web content. While honoring robots.txt is a good step, transparency is still lacking.

This may be the tech world's next flashpoint: as AI products develop at speed, how should "data" be used?

Reference links:

https://twitter.com/GPTDAOCN/status/1688704103554359296

https://searchengineland.com/gptbot-openais-new-web-crawler-430360

https://platform.openai.com/docs/gptbot

https://news.ycombinator.com/item?id=37030568

https://www.searchenginejournal.com/openai-launches-gptbot-how-to-restrict-access/493394/#close

Origin: blog.csdn.net/qq_29788741/article/details/132178532