CodeFuse has been open source for half a year


2023 could be called the first year of large models. Over the past year the field developed rapidly: new large models emerged one after another, and new products built on them drew wide attention. How many more surprises will this field bring?

 

Ant also launched its own code large model, CodeFuse, built on its Bailing foundation model. After nearly half a year of internal polishing, it was officially open sourced in September. Let's take a look at the progress CodeFuse has made in its first six months of open source.

 

1. Making research and development easier

 

As large models are deployed in more and more scenarios, automatic code generation has become a necessary part of putting the technology into practice. Against this trend, Ant Group launched the Ant Bailing R&D Assistant, based on the Bailing large model, to help developers automatically generate code, comments, test cases, and more, improving R&D efficiency.


CodeFuse grew out of Ant's own development scenarios and accumulated code base. It is built on massive amounts of high-quality code data, a domain-specific vocabulary for code, and the multi-task fine-tuning technique MFT, and it has been repeatedly validated and iterated in the daily coding, testing, and operations work of Ant's more than 10,000 internal R&D staff. CodeFuse has since evolved from R&D efficiency and DevOps toward enterprise IT intelligent-agent scenarios. On top of CodeFuse, Ant Group has also built a complete tool chain for code large models, covering model serving, risk protection, data quality, and platform engineering.


In mid-2023, CodeFuse and its supporting tool chain were opened to the technical community as open source, so that community developers can research, evaluate, and build and train on top of them.


CodeFuse is now in use across Ant's departments, supporting more than 40 programming languages and more than 10 mainstream IDE platforms, with an overall adoption rate of 30% and AI-generated code accounting for 20% of accepted code. For example, CodeFuse is fully integrated with Ant Digital Technology's SOFAStack cloud-native application product line, covering design, development, testing, operations and maintenance, and other areas, forming an end-to-end Copilot solution from domain modeling to intelligent operations. It improves the delivery efficiency and quality of enterprise applications and helps the industry digitalize, cut costs, and raise efficiency.

 

2. Rich open source content

CodeFuse's mission is to develop code large language models (Code LLMs) designed to support the entire software development life cycle. Its content currently covers six major directions: code, operations (DevOps), analysis, testing, inference, and evaluation. As of December 31, 2023, CodeFuse had open sourced 11 code repositories, 4 datasets, and 11 sets of large-model weights; the projects have gathered more than 3,000 followers/stars in total and more than 24,000 downloads, with 1 paper accepted and 2 preprints on arXiv.

 

1. Code - MFTCoder series:

The world's first fine-tuning framework for code large models that combines high precision and high efficiency with support for multiple tasks, multiple models, and multiple training algorithms. The technical details of multi-task fine-tuning have been published on arXiv; see the MFTCoder paper and our previously published articles.

Pre-trained language models learn general language patterns and structures from large amounts of text. Using unsupervised learning, a model learns to predict the next word in a sentence from the preceding words. Pre-training alone, however, does not yield high performance on specific natural language processing tasks, so pre-trained models are fine-tuned on small task-specific datasets to learn task-specific features and improve performance. Fine-tuning uses supervised learning to adapt the pre-trained model to a particular task. By splitting training into a pre-training stage and a fine-tuning stage, NLP models can combine the strengths of unsupervised and supervised learning.

When a model has a huge number of parameters, however, fine-tuning and deploying a separate copy for every downstream task consumes a great deal of resources. Is there a way for one model to support all downstream tasks at once? Yes: multi-task fine-tuning (MFT) provides an effective answer.

Multi-task fine-tuning not only saves resources but brings other benefits. Through joint training, the model learns the features and patterns shared across tasks, so compared with fine-tuning each task individually it can handle the various tasks better. Because the learned features and rules are interrelated, the model's generalization ability also improves: it can perform well even on unseen tasks, since it has already learned many characteristics of related ones.
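To make the idea concrete, here is a minimal sketch of one multi-task fine-tuning step, assuming a HuggingFace-style causal language model whose forward pass returns a loss. It illustrates only the weighted joint-loss idea, not MFTCoder's actual implementation; the task names, the weighting scheme, and the `model`/`task_batches` objects are assumptions for the example.

```python
# A toy multi-task fine-tuning step (illustration only, not MFTCoder's code).
# Assumes `model` is a HuggingFace-style causal LM whose forward pass returns
# an object with a `.loss` tensor, and each task supplies one tokenized batch.

def mft_step(model, optimizer, task_batches, task_weights):
    """One optimization step over several tasks with a weighted joint loss."""
    optimizer.zero_grad()
    total_loss = 0.0
    for task_name, batch in task_batches.items():
        out = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    labels=batch["labels"])
        # Weight each task so that no single task dominates the gradient.
        total_loss = total_loss + task_weights[task_name] * out.loss
    total_loss.backward()
    optimizer.step()
    return float(total_loss)

# Example: equal weights over three hypothetical code-related tasks.
weights = {"code_completion": 1/3, "test_generation": 1/3, "comment_generation": 1/3}
```

In practice, balancing tasks with very different data sizes and convergence speeds is the hard part; this sketch simply fixes the weights up front.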

 

2. Operation and maintenance - DevOps series:

The industry's first open-source Chinese DevOps model. It helps engineers answer questions that arise across the DevOps life cycle, and provides an AI assistant for the whole software development life cycle through retrieval-augmented generation, tool learning, and a sandbox environment. For a detailed introduction, see the earlier articles on DevOps-Eval, DevOps-Model, and DevOps-Chatbot.

We hope users will gradually shift from the traditional model of development and operations, with data queried in scattered places and platforms operated independently, to an intelligent model driven by large-model question answering, changing how people develop and operate software.

Core differentiating technologies and features:

  • Intelligent scheduling core: a scheduling core with complete system links, supporting one-click multi-mode configuration and simplifying operation.
  • Whole-codebase analysis: repository-level code understanding, plus project- and file-level code writing and generation, improving development efficiency.
  • Enhanced document analysis: integrates a document knowledge base with a knowledge graph, providing deeper document analysis through retrieval and reasoning enhancement.
  • Vertical domain knowledge: a knowledge base customized for the DevOps domain, with self-service one-click construction of vertical knowledge bases, convenient and practical.
  • Vertical model compatibility: small models tailored to the DevOps domain ensure compatibility with DevOps-related platforms and promote integration of the technology ecosystem.

Relying on open-source LLM and embedding models, the system can be deployed privately and fully offline; OpenAI API calls are also supported.
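As a rough illustration of the retrieval side of such a pipeline, the sketch below embeds a small document set with an open-source embedding model and fetches the closest entries for a question. The model name, the corpus, and the top-k value are placeholders chosen for the example, not DevOps-Chatbot's actual configuration.

```python
# Minimal retrieval sketch for a RAG pipeline (illustrative; not the actual
# DevOps-Chatbot implementation or model choice).
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "Rolling restarts drain traffic from one instance at a time.",
    "A canary release routes a small share of traffic to the new version.",
    "Blue-green deployment keeps two environments and switches traffic over.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder open-source model
doc_vecs = model.encode(docs, normalize_embeddings=True)

def retrieve(question: str, k: int = 2):
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                  # cosine similarity on unit vectors
    top = np.argsort(-scores)[:k]
    return [(docs[i], float(scores[i])) for i in top]

# The retrieved snippets would then be placed into the LLM prompt as context.
print(retrieve("How does a canary deployment work?"))
```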

 

3. Analysis - CodeFuse-Query:

A query-based code analysis engine, suited to large-scale, complex codebase analysis scenarios. See the paper at https://arxiv.org/abs/2401.01571; for a detailed introduction, see previous articles.

The features and advantages of CodeFuse-Query can be summarized as follows (a toy illustration of the "code as data" idea follows the list):

  • Highly scalable: CodeQuery handles large-scale codebases and adapts to different analysis needs, which lets it play an important role in large organizations.
  • Data-centric: CodeQuery treats both source code and analysis results as data, a perspective that gives it a unique advantage for code analysis in big-data environments.
  • Highly integrated: CodeQuery integrates seamlessly with the systems of a large organization, including data warehouses, data computing facilities, object storage, and elastic computing resources, making it easier and more efficient to use at scale.
  • Supports diverse requirements: CodeQuery handles not only large-scale codebases but also a variety of complex analysis needs, including service-quality analysis, cross-language analysis, algorithmic requirements, and performance requirements.
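CodeFuse-Query's own query language is not reproduced here; as a stand-in in the same "code as data" spirit, the toy script below parses a Python file into an AST and "queries" it for overly long functions. It only illustrates the idea; the real engine has its own query language and scales to whole repositories.

```python
# Toy "code as data" query: find functions longer than N lines in one file.
# Illustration only; CodeFuse-Query works at repository scale with its own
# query language rather than ad-hoc AST walks.
import ast
import sys

def long_functions(source: str, max_lines: int = 50):
    tree = ast.parse(source)
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            length = node.end_lineno - node.lineno + 1  # needs Python >= 3.8
            if length > max_lines:
                hits.append((node.name, node.lineno, length))
    return hits

if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8") as f:
        for name, line, length in long_functions(f.read()):
            print(f"{name} (line {line}): {length} lines")
```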

4. Testing - Test-Agent:

"Intelligent agents" in the testing field, create innovative solutions in the testing field, and build 24-hour online testing assistant services; for detailed introduction, please refer to previous articles

Large models are also evolving rapidly in the testing domain. Thanks to the rich world knowledge accumulated during pre-training, they have demonstrated remarkable reasoning and decision-making abilities in complex interactive environments.

Although base models have achieved notable results in testing, limitations remain: domain-specific testing tasks usually require specialized tools or domain knowledge. For example, a base model can handle tasks such as unit-test code generation and test text generation from its pre-trained knowledge, but complex integration-test generation, domain-specific test-case generation, and interaction with testing pipelines call for more specialized tools and knowledge.

Combining specialized tools with base models therefore plays to the strengths of both. The tools compensate for the model's stale knowledge, add professional expertise, and improve interpretability and robustness, while the base model contributes human-like reasoning and planning, understands complex data and scenarios, and interacts with the real world.
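A minimal sketch of that tool-plus-model loop is shown below. The `llm_plan` function, the tool registry, and the action format are placeholders invented for the example; Test-Agent's real orchestration is more involved.

```python
# Skeleton of a tool-augmented testing agent loop (illustration only).
# `llm_plan` stands in for a large-model call that, given the goal and the
# history so far, picks the next tool and its argument, or decides to finish.

def run_pytest(target: str) -> str:
    # Placeholder: a real agent would run pytest in a sandboxed environment.
    return f"ran tests in {target}: 12 passed, 1 failed"

def read_file(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return f.read()

TOOLS = {"run_pytest": run_pytest, "read_file": read_file}

def agent(goal: str, llm_plan, max_steps: int = 5) -> str:
    history = []
    for _ in range(max_steps):
        action = llm_plan(goal, history)       # e.g. {"tool": "run_pytest", "arg": "tests/"}
        if action["tool"] == "finish":
            return action["arg"]               # final answer, e.g. generated test cases
        observation = TOOLS[action["tool"]](action["arg"])
        history.append((action, observation))  # feed the result back to the model
    return "step budget exhausted"
```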

 

5. Inference - ModelCache:

A semantic caching system for large models: by caching generated results, it reduces response time for similar requests and improves user experience. For a detailed introduction, see previous articles.

Currently, large model services face the following three challenges:

  • High cost: large models have tens to hundreds of billions of parameters, and a single instance needs multiple A10 GPUs, so large-scale deployment is expensive. Services are therefore typically billed per token processed, which makes usage costly for users.
  • Slow: inference speed is also critical. Many real-time applications, such as dialogue systems and business assistants, demand millisecond-level response times, but large-model inference often takes seconds, so results cannot be returned in real time and the user experience degrades.
  • Unstable: because single-instance deployment is so expensive, current services resort to rate limiting under heavy traffic to keep the service from becoming unavailable.

A large-model cache addresses these challenges. By caching computed results, the service can answer similar requests directly from the cache, avoiding repeated computation, saving compute resources, and markedly improving response time and user experience. The cache also diverts traffic, reducing the number of requests passed through to the backend, lowering backend pressure, and improving service stability. As an important part of large-model service deployment, a cache helps enterprises and research institutions apply large language models more effectively and improve performance and efficiency where resources are limited and real-time demands are high. As large models spread across domains, the importance of caching will only grow.
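The core lookup is easy to sketch: embed the incoming query, compare it with cached queries, and return the stored answer when similarity clears a threshold. The embedding function and the 0.9 threshold below are assumptions for the example, not ModelCache's actual defaults.

```python
# Minimal semantic cache sketch (illustrative; not ModelCache's implementation).
import numpy as np

class SemanticCache:
    def __init__(self, embed, threshold: float = 0.9):
        self.embed = embed          # function: str -> unit-norm numpy vector
        self.threshold = threshold  # assumed cutoff; tune per workload
        self.keys, self.values = [], []

    def get(self, query: str):
        if not self.keys:
            return None
        q = self.embed(query)
        sims = np.stack(self.keys) @ q       # cosine similarity on unit vectors
        best = int(np.argmax(sims))
        return self.values[best] if sims[best] >= self.threshold else None

    def put(self, query: str, answer: str):
        self.keys.append(self.embed(query))
        self.values.append(answer)

# Usage: check the cache first and only call the expensive LLM on a miss.
# answer = cache.get(question)
# if answer is None:
#     answer = call_llm(question)   # expensive path
#     cache.put(question, answer)
```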

 

6. Evaluation - CodeFuse-Evaluation:

A multi-task evaluation benchmark for the programming domain, developed on top of HumanEval-x and MBPP. It can evaluate large models on tasks such as code completion, natural-language-to-code generation, test-case generation, cross-language code translation, and code generation from Chinese instructions. For a detailed introduction, see previous articles.

Evaluation of large language models is currently divided into objective and subjective evaluation, according to whether the generated results can be measured quantitatively (compare mathematical calculation with article writing). Objective evaluation scores the generated content along various dimensions against the most influential benchmarks in the industry; subjective evaluation organizes experts with relevant professional backgrounds to rate the relevant dimensions.

By how the evaluation is executed, it can be divided into three categories: automated evaluation, manual evaluation, and model-based evaluation.

After training, the model is scored against an evaluation benchmark; because this process can be fully engineered, it is called automated evaluation. Manual evaluation, especially of domain knowledge, requires experts from each field; it is more expensive, but its results are more convincing. Model-based evaluation (e.g., PandaLM) trains a large model to learn overall human preferences over different generated texts and then makes relative judgments based on those learned preferences; this approach is more stable and efficient than manual evaluation.
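For the execution-based, automated side of code evaluation, HumanEval-style benchmarks report pass@k. The standard unbiased estimator from the Codex paper, pass@k = 1 - C(n-c, k) / C(n, k) for n samples of which c pass, is short enough to show in full:

```python
# Unbiased pass@k estimator used by HumanEval-style code benchmarks:
# pass@k = 1 - C(n - c, k) / C(n, k), with n samples and c correct ones.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes."""
    if n - c < k:
        return 1.0  # too few failures for k draws to all fail
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 37 passing, estimate pass@10.
print(pass_at_k(200, 37, 10))
```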

 

3. Wonderful community activities

Open source is not just about publishing code; it is also about sharing and communicating with the community. Alongside all the useful open-source content, community activities have kept pace. Let's take a look at what we've been up to!

In August, we gave a dedicated talk, "Test Generation Based on AIGC," at the AI+ Software R&D Digital Summit;

In September, CodeFuse was officially announced as open source at the Bund Conference;

In October, we shared CodeFuse R&D experience publicly at MLSummit 2023;

In early November, we gave a dedicated talk on CodeFuse at the Yunqi Conference;

In November, we co-hosted the "Code Large Model Technology and Application Development" forum with Shizhi AI and others;

In early December, at the CCF China Software Conference, we ran on-site demos and interacted with attendees;

At the end of December, we shared "Exploration of next-generation R&D based on CodeFuse" at QCon, the global software developers conference.

 

4. Gaining industry recognition

This year, CodeFuse also received several awards; we thank the industry for its recognition:

  • Won the Outstanding Open Source Technology Team award from Open Source China 2023

  • Selected as one of Geek Park's TOP 10 Large Model Pioneer Cases of 2023

 

5. New expectations in 2024


Since 2023, large models have increasingly been applied in the code domain. After a year of practice, we understand the related technologies more deeply, and we have seen many interesting directions and implementations. In 2024 we will continue to invest in open source:

  • More innovative features: for example, MFTCoder v0.2 with MoE support will be released in January, and a training framework and model supporting front-end design-to-code will be released in February;
  • More offline activities: we will organize several CodeFuse offline meetups for interested peers, and actively take part in domestic and international industry conferences and forums to share more of CodeFuse's hands-on experience;
  • More community participation and interaction: community research so that everyone can take part in the project, including but not limited to community bug hunts, co-contributing new features, promoting standardization of related systems, and even organizing competitions.

 

We warmly welcome everyone to exchange ideas and explore with us, and to jointly define the next generation of large-model-based, full-life-cycle R&D solutions. Join our community to discuss and communicate. In 2024, let's move toward the future together!

CodeFuse official website: https://codefuse.alipay.com
