Databricks shakes up the market: a zero-barrier ChatGPT clone, completely open source and freely modifiable for commercial use

The world's first fully open-source large language model, with performance comparable to GPT-3.5!

The big data boom spawned many successful companies such as Snowflake, Databricks, Splunk, and Cloudera. Now that we have entered the era of generative artificial intelligence, will a new combination of "artificial intelligence plus big data" emerge?

Recently, big data company Databricks made its move into generative AI. Two weeks ago, the company released an open-source large language model called Dolly (call it Dolly 1.0), aimed at meeting the market's strong demand for generative AI and related applications.

Generative AIs like ChatGPT and Bard are trained on staggering amounts of data, often collected from thousands of different websites, and training on that data takes thousands of powerful GPUs running behind the scenes. By open-sourcing Dolly 1.0 and its training data, Databricks hopes anyone can build genuinely human-like AI without investing millions of dollars, so that this kind of AI is no longer something only big tech companies can afford but something millions of small companies can benefit from as well.

Ali Ghodsi, CEO of Databricks, said Dolly 1.0 needs very little data and very little time to train: "With just $30, one server, and three hours, we can teach Dolly to start interacting at a human level."

On April 12, Databricks released an open-source iteration of its large language model (LLM), named Dolly 2.0. According to Databricks, Dolly 2.0 is the industry's first open-source, instruction-following LLM fine-tuned on a transparent, freely available dataset that is itself open source and licensed for commercial use. This means Dolly 2.0 can be used to build commercial applications without paying for API access or sharing data with third parties.

1. Birth of Dolly 2.0

Dolly 1.0 was based on GPT-J, a natural language processing model open-sourced by EleutherAI in 2021. GPT-J is a GPT-3-style model with 6 billion parameters. However, Dolly 1.0 was fine-tuned on the 52,000-example instruction dataset from the Stanford Alpaca project, which was generated from the output of OpenAI's ChatGPT. Because of OpenAI's terms of use, Dolly 1.0 cannot be used for commercial purposes.

Databricks pointed out in its official blog post: "The dataset used to train Dolly 1.0 contains output from ChatGPT, and as the Stanford team explicitly noted, OpenAI's terms of service seek to prevent anyone from creating a model that competes with OpenAI."

Dolly 2.0 builds on Databricks' first version of Dolly. To sidestep the licensing problem and produce a commercially usable model, Databricks built Dolly 2.0 on a 12-billion-parameter language model from EleutherAI's Pythia family.

The company says the model was trained and fine-tuned exclusively on a high-quality dataset of human-generated instructions, crowdsourced from 5,000 Databricks employees. Databricks calls this dataset of human-generated prompt/response pairs databricks-dolly-15k, and licenses it under the Creative Commons Attribution-ShareAlike 3.0 Unported License.

"Anyone can use, modify or extend this dataset for any purpose, including commercial applications." Databricks also emphasizes that the dataset is available through the GitHub page (https://github.com/databrickslabs/dolly/tree/ master/data) to download directly.

Model weights can be downloaded from the Databricks Hugging Face page (https://huggingface.co/databricks).
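For readers who want to try it, here is a minimal sketch of running Dolly 2.0 through the Hugging Face transformers pipeline, following the usage pattern Databricks documents on the model card. A GPU with enough memory for the 12-billion-parameter checkpoint is assumed; Databricks also publishes smaller dolly-v2 variants.

```python
# Minimal sketch: run Dolly 2.0 via the Hugging Face transformers pipeline.
import torch
from transformers import pipeline

generate_text = pipeline(
    model="databricks/dolly-v2-12b",
    torch_dtype=torch.bfloat16,  # halves memory use relative to float32
    trust_remote_code=True,      # Dolly ships a custom instruction-following pipeline
    device_map="auto",           # spread the model across available GPUs/CPU
)

result = generate_text("Explain the difference between a data lake and a data warehouse.")
print(result[0]["generated_text"])
```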

2. Dolly 2.0 Aims to Be a Boon for Companies Big and Small

Databricks released a large language model built on open data mainly because enterprise customers want to control the model and apply it to targeted scenarios and specific use cases. This stands in stark contrast to the closed, commercially trained models (such as ChatGPT) that are common in the industry.

Bradley Shimmin, chief analyst at market research firm Omdia, said, "Models like Dolly 2.0 are open and do not require months of training on large GPU clusters, which opens the door to a new world for companies that want to build internal generative AI solutions."

"These smaller models (smaller in training-parameter count) use large numbers of prompt/response pairs as training data, so they are especially suited to enterprise customers who want to control the entire solution and support targeted use cases. For example, they can train their own AI model on an existing help-desk database built from question-and-answer pairs."
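To make that concrete, here is a minimal, hypothetical sketch of the approach Shimmin describes: fine-tuning a small model from the same Pythia family that Dolly 2.0 builds on, using made-up help-desk prompt/response pairs. The pythia-160m checkpoint, the example pairs, and the training settings are all illustrative assumptions rather than a Databricks recipe; the "### Instruction/### Response" framing mirrors the prompt format used in Dolly's repository.

```python
# Minimal, hypothetical sketch: fine-tune a small Pythia model on
# help-desk prompt/response pairs. Not a Databricks recipe.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "EleutherAI/pythia-160m"  # same family Dolly 2.0 builds on, scaled down
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Pythia has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical help-desk Q&A pairs standing in for an internal database.
pairs = [
    {"prompt": "How do I reset my VPN password?",
     "response": "Open the self-service portal, choose 'VPN', then 'Reset password'."},
    {"prompt": "My build agent is offline. What should I check first?",
     "response": "Verify the agent service is running and can reach the CI server."},
]

def to_features(example):
    # Format each pair the way Dolly's repository frames instructions.
    text = (f"### Instruction:\n{example['prompt']}\n\n"
            f"### Response:\n{example['response']}")
    return tokenizer(text + tokenizer.eos_token, truncation=True, max_length=512)

dataset = Dataset.from_list(pairs).map(to_features,
                                       remove_columns=["prompt", "response"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="helpdesk-model",
                           num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

A real deployment would use far more pairs and a larger checkpoint, but the shape of the workflow, turning an existing question-and-answer database into instruction-formatted training examples, is the point being illustrated.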

According to Hyoun Park, principal analyst at consulting firm Amalgam Insights, another big advantage of open-source large language models is that efforts like Dolly 2.0 let enterprises better track data governance and data residency, and keep the model closely aligned with the use cases it supports.

Park also called out OpenAI by name: "Other models such as OpenAI's ChatGPT depend on an API for their use. For some enterprises, that dependence may raise questions about API compliance, governance, or data security."

This also means that Dolly 2.0 and other open-source-based large language models will be a boon for enterprises in heavily regulated industries. It is a good start for businesses to realize that they, too, can create and own their own models without paying for API access or sharing data with large-language-model providers, both of which can create enormous problems in a heavily regulated industry.

The Difference Between Open-Source and Closed-Source Large Language Models

Compared with closed-source large language models, an open-source model's training data is public, so the model can be fine-tuned and customized to meet an enterprise's business needs. By contrast, closed-source models such as ChatGPT are trained on data held by their developer (OpenAI), are accessed through a paid API, and prohibit direct commercial reuse.

According to Chandrasekaran, "'Open' large language models can be understood in many ways. The most obvious and most important aspect is the flexibility to adjust these models' source code and deployment. Beyond that, openness can also extend to model weights, training datasets, and open or collaborative decision-making."

IDC's Schubmehl said Dolly 2.0 follows the philosophy of an open-source model: "Dolly 2.0 is a set of large language models. The model itself, the training code, the datasets, and the model weights can all be obtained from Databricks as open-source resources, so enterprises can create their own customized large language models to fit their business needs." Schubmehl also noted that this approach stands in stark contrast to other large language models, which often do not open up the model's individual building blocks.

Analysts also mentioned that closed-source and open-source large language models differ in the scale of their training parameters, with closed-source models typically being much larger. GPT-4, for example, is rumored to have been trained with as many as 100 trillion parameters (OpenAI has not disclosed the actual figure, and the rumor is widely doubted); Dolly 2.0, by contrast, has only 12 billion parameters.

3. How Dolly 2.0 Fits Into Databricks' Generative AI Strategy

Constellation Research's Thurai said the launch of Dolly 2.0 can be seen as an important move in Databricks' strategy to capture a share of the generative AI market.

"Essentially, many of the big language model and base model businesses are in the hands of the hyperscalers. Each has their own variant - Microsoft has ChatGPT, Google has Bard, AWS has it through the Huggingface partnership Infrastructure, processes, tools, and model sharing and catalog services. Databricks certainly cannot sit still and must take a slice of the booming big language model market.”

Other analysts believe that Dolly's release is in line with Databricks' strategy of bringing open source products to market.

"Databricks specializes in helping customers get the most out of their data and operations through a variety of open source AI tools and services," said IDC's Schubmehl. It’s a big language model.” But analysts admit that Databricks’ Dolly 2.0 may not have an immediate impact on competitors like ChatGPT or Bard.

Shimmin of Omdia believes that "the emergence of Dolly and other open-source generative AI large language models will thoroughly reshape the future prospects of existing large language models such as Bard, ChatGPT, and Galactica, although the position of models like ChatGPT in products such as Microsoft Office will remain firmly entrenched."

Amalgam Insights' Park disagrees, arguing that Dolly will ultimately be a functional companion to general-purpose tools like ChatGPT: "People learn how to use and prompt generative AI from general-purpose tools, while models like Dolly help users handle more specific, specialized, work-oriented use cases."

In addition, some commentators pointed out that LLMs like Dolly can be used to write code, especially SQL, which may let people who are not SQL experts set up and run queries on the Databricks lakehouse.

This can be read in two ways: SQL developers can use it to boost their productivity, or you simply need fewer SQL developers. Dolly reduces Databricks' need for SQL programmers, and extending the idea to Snowflake and every other data warehouse environment, SQL skills may become less valuable in the future.
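As an illustration of that natural-language-to-SQL idea, here is a minimal sketch using the smaller dolly-v2-3b sibling checkpoint. The orders table schema and the prompt wording are hypothetical, and this is not an integrated Databricks lakehouse feature, just the general prompting pattern.

```python
# Minimal sketch: ask a Dolly model to draft a SQL query from plain English.
import torch
from transformers import pipeline

generate = pipeline(
    model="databricks/dolly-v2-3b",  # smaller sibling of dolly-v2-12b, same recipe
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# Hypothetical schema and question; generated SQL should still be reviewed
# by a human before it runs against a real warehouse.
prompt = (
    "Given a table orders(order_id INT, customer TEXT, amount DOUBLE, order_date DATE), "
    "write a SQL query that returns total sales per customer in 2023, highest first."
)
print(generate(prompt)[0]["generated_text"])
```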

Reference links:

https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
https://www.infoworld.com/article/3693349/why-did-databricks-open-source-its-llm-in-the-form-of-dolly-2-0.html

Reprinted from: InfoQ

Editor: Weng Peipei


