Nature: Why should generative AI be open source? A New York University professor makes the case in a Nature opinion piece, "The Moral Path of Scientific Development"


Source: Nature

By Arthur Spirling, Professor of Politics and Data Science, New York University

It seems like a new large language model (LLM) is released every day, and its creators and academic users marvel at its remarkable ability to respond to human prompts. It can fix code! It can write a letter of recommendation! It can summarize an article in seconds!

I'm a political scientist and data scientist who uses and teaches such models, and from my perspective, academics need to be wary of LLMs. The most widely touted LLMs are proprietary and closed: they are run by commercial companies, their underlying models are not publicly available for independent inspection or verification, and researchers and the public do not know what data the models were trained on.

The rush to incorporate such artificial intelligence (AI) models into research is a problem. Their use threatens hard-won progress in research ethics and the reproducibility of results.

Instead, researchers need to work together to develop open-source LLMs that are transparent and company-independent.

Admittedly, proprietary models are convenient and can be used "out of the box". But it is imperative to invest in open-source LLMs, both by helping to build them and by using them for research. I'm optimistic that they will be widely adopted, just as open-source statistical software was: proprietary statistical programs were popular at first, but today the community mostly uses open platforms such as R and Python.

An open-source LLM, BLOOM, was released last July, and other efforts to build open-source LLMs are under way. These projects are great, but I think we need more collaboration, pooling international resources and expertise. Open-source LLM projects are generally much less well funded than the big commercial efforts. And they need to keep running just to stay in place: the field moves so fast that a given LLM can become obsolete within weeks or months. The more scholars who join these efforts, the better.

Using an open-source LLM is also critical for reproducibility. The owners of closed LLMs can change their product or its training data at any time, and that can change the results of scientific research.

For example, a research group might publish a paper testing whether recommendations from a proprietary LLM help clinicians communicate more effectively with patients. If another group tries to replicate that study, they cannot know whether the model's underlying training data are still the same, or even whether the model is still supported. OpenAI's GPT-3 has already been superseded by GPT-4, and supporting earlier versions will no longer be a major priority for the company.

With an open-source LLM, by contrast, researchers can look at the details of a model to understand how it works, customize its code, and flag errors. These details include the model's adjustable parameters and the data it was trained on. Community participation and oversight help keep such models stable over time.
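To make that concrete, here is a minimal Python sketch (not from the article) of what such inspection can look like in practice, using the Hugging Face transformers library to load an openly released BLOOM variant. The specific model name and the idea of pinning a revision are illustrative assumptions on my part, not something the author prescribes.

```python
# Illustrative sketch: loading an open-source LLM, pinning an exact revision for
# reproducibility, and inspecting its configuration and parameter count.
# The model name and revision below are example choices, not the author's setup.

from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"   # a small, openly licensed BLOOM variant
revision = "main"                      # in real work, pin a specific commit hash

# The configuration file exposes the architecture and adjustable parameters.
config = AutoConfig.from_pretrained(model_name, revision=revision)
print(config)  # hidden size, number of layers, attention heads, vocabulary size, ...

tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)
model = AutoModelForCausalLM.from_pretrained(model_name, revision=revision)

# Because the weights are local, anyone can count, audit, or fine-tune them.
n_params = sum(p.numel() for p in model.parameters())
print(f"{model_name} has {n_params:,} parameters")
```

Because the weights and configuration live in openly versioned files rather than behind an API, a second group can fix the same revision and obtain the same model the original study used.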

Furthermore, the use of proprietary LLMs in scientific research has disturbing implications for research ethics. The texts used to train these models are unknown: they may include direct messages between users on social-media platforms, or content written by children who cannot legally consent to their data being shared. While the people who produced the public text may have agreed to a platform's terms of service, that may not meet the standard of informed consent researchers would like to see.

In my opinion, scientists should move away from using these proprietary models in their own work wherever possible. We should switch to open LLMs and promote them as best we can. Furthermore, academics, especially those with large social-media followings, should not urge others to use proprietary models. If prices spike, or companies fail, researchers may regret having promoted technologies that lock colleagues into expensive contracts.

Currently, researchers can turn to open LLMs produced by private organizations: for example, my colleagues and I are using Meta's openly released models, OPT-175B and LLaMA, both of which are free for researchers to use. But in the long run, this has the downside of making science dependent on the "benevolence" of corporations, an unstable situation.

Therefore, there should be a code of academic conduct, as well as regulation, for working with LLMs. But all of that takes time, and I expect any such regulation to be clumsy at first and slow to take effect.

At the same time, support is urgently needed for large-scale collaborative projects to train open-source models for research. Governments should increase funding through grants. The field is developing at lightning speed and now needs coordinated national and international efforts. The scientific community is best placed to assess the risks of the resulting models, and caution is needed in recommending these models to the public.

But it's clear that an open environment is the way to go.

Original link:

https://www.nature.com/articles/d41586-023-01295-4


Origin: blog.csdn.net/AMiner2006/article/details/130264034