StarCoder: May the source be with you, the latest open-source code generation LLM is here

Foreword:

On May 4th, BigCode released the open-source code generation model StarCoder with the support of Hugging Face.

For easy reference, here are the main resource links:

Code repository: GitHub - bigcode-project/Megatron-LM (ongoing research on training transformer models at scale)

Official site: Open and responsible development of LLMs for code

Blog: StarCoder: A State-of-the-Art LLM for Code

Paper: https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view

Model address: bigcode/starcoder on Hugging Face

Demo address: BigCode - Playground - a Hugging Face Space by bigcode

Also accessible through the HuggingChat interface: HuggingChat

Contact email: [email protected]

VSCode plugin: HF Code Autocomplete

Release notes:

About BigCode

BigCode is an open science collaboration co-led by Hugging Face and ServiceNow dedicated to the responsible development of large code language models.

Introduction to StarCoder

StarCoder and StarCoderBase are Large Language Models of Code (Code LLMs) trained on permissively licensed data from GitHub, including code in 80+ programming languages, Git commits, GitHub issues, and Jupyter notebooks. Similar to LLaMA, the researchers trained a ~15B parameter model on 1 trillion tokens. They then fine-tuned StarCoderBase on 35 billion Python tokens, resulting in a new model called StarCoder.

The researchers found that StarCoderBase outperforms existing open code LLMs on popular programming benchmarks, and matches or exceeds closed models such as code-cushman-001 from OpenAI (the original Codex model that powered an early version of GitHub Copilot). With a context length of over 8,000 tokens, the StarCoder models can handle more input than any other open LLM, enabling a variety of interesting applications. For example, prompting the StarCoder models with a series of dialogues enables them to act as technical assistants. In addition, these models can be used to autocomplete code, modify code through instructions, and explain code fragments in natural language.

The researchers took several important steps towards a safe open model release, including an improved PII redaction pipeline, a novel attribution tracing tool, and making StarCoder publicly available under an improved version of the OpenRAIL license. The updated license simplifies the process for companies to integrate the model into their products. The researchers believe that, given its strong performance, the StarCoder models will be a solid foundation for the community to use and adapt to its own use cases and products.
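As a quick way to try the released checkpoint, here is a minimal sketch of prompting StarCoder for code completion through the Hugging Face transformers library; the model id comes from the release, while the prompt and generation settings are only illustrative.

```python
# Minimal sketch: code completion with StarCoder via transformers.
# The checkpoint is gated, so you must accept the OpenRAIL license on the Hub first.
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "bigcode/starcoder"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# ~15B parameters: in practice you would load this on a GPU (e.g. device_map="auto").
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "def print_hello_world():"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate a short continuation of the code prompt (settings are illustrative).
outputs = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```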

Evaluation

The researchers thoroughly evaluated StarCoder and several similar models on a variety of benchmarks. A popular Python benchmark is HumanEval, which tests whether a model can complete a function given its signature and docstring. The researchers found that both StarCoder and StarCoderBase outperform the largest models, including PaLM, LaMDA, and LLaMA, despite being much smaller. They also outperform CodeGen-16B-Mono and OpenAI's code-cushman-001 (12B) model. The researchers noted that one failure mode of the model was generating code like # Solution here, possibly because this kind of placeholder often appears as part of an exercise. To force the model to generate realistic solutions, the researchers added a prompt. This significantly improves StarCoder's HumanEval score, from 34% to over 40%, setting a new state of the art for open models. The researchers also tried this prompt with CodeGen and StarCoderBase, but did not observe much difference.
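For context, a HumanEval task looks roughly like the sketch below: the model receives a function signature plus docstring and must complete the body. The hint prefix is a hypothetical stand-in for the kind of "write an actual solution" prompt described above, not the researchers' exact wording.

```python
# Sketch of a HumanEval-style task (signature + docstring to be completed).
task = '''def has_close_elements(numbers, threshold):
    """Check if any two numbers in the list are closer to each other
    than the given threshold."""
'''

# Hypothetical prefix nudging the model toward a real implementation instead of
# placeholders like "# Solution here" (the researchers' exact prompt differs).
hint = "# Below is the correct and complete implementation of the exercise.\n"

prompt = hint + task  # the model is asked to continue generating from here
print(prompt)
```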

An interesting aspect of StarCoder is that it is multilingual, so the researchers evaluated it on MultiPL-E, which extends HumanEval to many other languages. They observed that StarCoder matches or outperforms other models on many languages. On a data science benchmark called DS-1000, it clearly beats code-cushman-001 as well as every other open-access model.

Technical assistant

Through extensive evaluation, the researchers found that StarCoder is very capable of writing code. But they also wanted to test whether it can be used as a technical assistant; after all, it was trained on plenty of documentation and GitHub issues. Inspired by Anthropic's HHH prompt, the researchers constructed a Tech Assistant prompt. Remarkably, with just this prompt, the model can act as a technical assistant and answer programming-related requests!
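The actual Tech Assistant prompt is published by BigCode; as a rough illustration of the idea, a shortened dialogue-style prompt might look like the following sketch (the wording here is illustrative, not the released prompt).

```python
# Rough sketch of a dialogue-style technical-assistant prompt (illustrative only).
TECH_ASSISTANT_PROMPT = """Below is a conversation between a human and a helpful
technical assistant. The assistant answers programming questions accurately,
shows code where useful, and says so when it does not know the answer.

Human: How do I reverse a list in Python?
Assistant: Use slicing, my_list[::-1], or call my_list.reverse() to reverse it in place.

Human: {question}
Assistant:"""

# Fill in a new question and feed the result to the model for generation.
prompt = TECH_ASSISTANT_PROMPT.format(question="What does git rebase do?")
```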

Training data

The model is trained on a subset of The Stack 1.2. The dataset contains only permissively licensed code and includes an opt-out process so that code contributors can have their data removed from the dataset (see "Am I in The Stack?"). Additionally, the researchers removed personally identifiable information, such as names, passwords, and email addresses, from the training data.
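The Stack is hosted on the Hugging Face Hub and can be streamed without downloading it in full; a minimal sketch is below. The dataset id comes from the release, while the per-language data_dir layout and the content field name are assumptions.

```python
# Minimal sketch: stream the Python subset of The Stack from the Hub.
# Accessing the dataset requires accepting its terms of use on Hugging Face.
from datasets import load_dataset

ds = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",   # assumed per-language directory layout
    split="train",
    streaming=True,           # avoid downloading the multi-terabyte dataset
)

example = next(iter(ds))
print(example["content"][:200])  # assumed field holding the source file text
```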

Other releases

In addition to the model, the researchers also published a list of resources and demos:

  • Model weights, including intermediate checkpoints with an OpenRAIL license
  • All code for data preprocessing and training is licensed under Apache 2.0
  • Comprehensive Evaluation Tool for Code Models
  • A new PII dataset for training and evaluating PII removal
  • Fully preprocessed dataset for training
  • Code attribution tool for finding generated code in the dataset

Paper: STARCODER: MAY THE SOURCE BE WITH YOU!

Main content

The paper, "StarCoder: may the source be with you!", was written by researchers from ServiceNow Research and Hugging Face. The main content of the paper is as follows:

The theme and research purpose of the paper is to explore the application of large language models (LLMs) to code generation tasks. A 15.5 billion parameter LLM called StarCoder is proposed, which can generate code from natural language, or generate natural language from code.

For its research method and data source, the paper trains StarCoder on a massive source code dataset covering more than 80 programming languages, called The Stack v1.2. The dataset consists of roughly one trillion tokens of permissively licensed source code, covering a variety of programming paradigms, domains, and difficulty levels.

The paper also uses 1,399 crowd annotators from 39 countries to label a personally identifiable information (PII) dataset, which is used to train PII removal.

The main finding and conclusion of the paper is that StarCoder demonstrates superior performance on multiple code generation tasks, surpassing existing models and baselines. These tasks include text-to-code, text-to-workflow, code completion, code summarization, and more. The paper also provides an extensive analysis of StarCoder, explores its strengths and limitations, and sheds some light on future research.

The innovation and significance of the paper are:

For the first time, a large-scale Transformer-based language model is proposed for generating code from natural language, or natural language from code, demonstrating its versatility and power across a variety of programming languages and tasks.

For the first time, a massive source code dataset spanning more than 80 programming languages was constructed for training large language models and published on the Hugging Face platform in an open-access, open-science, and open-governance manner for the community to use and improve.

For the first time, a responsible AI model license is adopted, which places restrictions on the modification and application of the model, such as prohibiting the model from being used to generate or distribute malicious code.

Links

Model

  • Paper: A technical report on StarCoder.
  • GitHub: Everything you need to know about using or fine-tuning StarCoder.
  • StarCoder: StarCoderBase further trained on Python.
  • StarCoderBase: Trained on 80+ languages from The Stack.
  • StarEncoder: Encoder model trained on The Stack.
  • StarPII: PII detector based on StarEncoder.

Tools and Demos

Data and Governance

You can find all resources and links at huggingface.co/bigcode!

Simple test:

HuggingChat:

Link: HuggingChat

Playground:

https://huggingface.co/spaces/bigcode/bigcode-playground

You can just write a function signature and let it fill in the body:

A more complicated one:

One more test:

Overall, the results feel quite good.

VS Code plugin

If you found this helpful, you are welcome to like, follow, and share. ^-^


Origin: blog.csdn.net/hawkman/article/details/130641629