GitHub Copilot: It took only 6 people to build an epoch-making product

Compiled by | Chu Xingjuan, Nuclear Cola

Source | AI Frontline ID | ai-front

Copilot has become a commonly used assistive tool among developers in China. As one developer commented, "When coding, I want the least amount of distraction. Copilot has been a huge help to me in this regard. It cuts down on the time I might spend searching for solutions on the web, and it works right in my favorite IDE, at my fingertips." Copilot brings a great deal of convenience.

While artificial intelligence and automation have long been part of developer workflows, Copilot, a cloud-based AI tool developed by GitHub and OpenAI, made everyone truly feel the power of "intelligence." According to Stack Overflow's latest developer survey, Copilot is the most popular AI developer tool today. So how was such an "epoch-making" tool created?

The quiet research and development of six people

"We were kind of a skunkworks [that is, a secret research program], and no one knew about us," recalls Alex Graveley, one of the creators of GitHub Copilot. The project was run by a small team operating on startup principles and was developed inside "a very dysfunctional GitHub/MSFT organization" in less than a year. The team had only 6 developers, plus a PM and a VP who were mainly responsible for the landing page and the icon.

Alex is not sure exactly when it started, but at that time OpenAI and Microsoft had reached an agreement on a supercomputing facility for building a large training cluster. They were also working on another partnership agreement that could bring AI features to Office and Bing. GitHub was no exception: they wanted to see what role AI could play in development.

OpenAI intended to fine-tune the model to see whether a small model could better assist programming. What counts as a "small" model? At the time, no one on the team knew how to judge the scale, but it was clear the model would not have an enormous number of parameters or a huge footprint. Alex recalls that this "little" model was not as big as Davinci.

OpenAI's base model was like a training artifact. They wanted to feed code in and see how the base model reacted. "I think this has a positive impact on the chain of thinking. After all, code reasoning has a clear linearity, and an AI model should be well suited to this kind of scenario, where things are done one at a time and the previous step affects the next," said Alex.

But the results at the beginning were not ideal; they could even be called quite bad. After all, this was just a low-level artifact that had only encountered a small sample of GitHub data. Only Alex and another machine learning engineer, Albert Ziegler, were tinkering with the model. Alex felt that although it didn't work in most cases, the AI model seemed to be gathering strength.

At first, they fed it only Python code and wanted it to produce useful output from it. "We didn't know anything, so we started with the easy stuff and jumped in. See if this works, see if that works. Frankly, we had no idea what we were doing. So the first task was to run more tests and see what it could do."

Alex and his colleagues crowdsourced internally a set of Python problems that would definitely not appear in the training set. Then they started selecting repos and designing tests to see whether the functions generated by the model could pass. The basic process was to ask the model to generate a given function, then run the tests to see whether that function passed.

The pass rate at the beginning was very low, around 10%. Afterwards, the team began giving the model more attempts, trying to let it slowly work out the solution. In other independent tests, Alex and his colleagues wrote the test function themselves and tried to have the model fill in the function body; if the result passed, that proved the approach really worked. In the wild, they would download a repo and run all the tests, note which tests passed and which functions were called, check whether the function body was generated correctly, then re-run the tests to see if they passed. Finally, they recorded the results and calculated the percentage.
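The evaluation loop described above can be sketched in a few lines. This is a minimal illustration, not GitHub's actual harness: it defines a candidate function from a generated body, runs the pre-written test against it, and tallies a pass rate. The function names and prompt format here are assumptions for illustration.

```python
# Minimal sketch of a generate-then-test evaluation loop (not the real
# Copilot harness): exec a candidate implementation in a fresh namespace,
# run its test, and count how many candidates pass.

def run_candidate(signature: str, body: str, test: str) -> bool:
    """Define the generated function and run its test.
    Returns True if the test raises no exception."""
    namespace = {}
    source = signature + "\n" + "\n".join(
        "    " + line for line in body.splitlines())
    try:
        exec(source, namespace)   # define the generated function
        exec(test, namespace)     # run the crowdsourced test against it
        return True
    except Exception:
        return False

def pass_rate(candidates) -> float:
    """candidates: iterable of (signature, body, test) tuples."""
    results = [run_candidate(sig, body, test) for sig, body, test in candidates]
    return sum(results) / len(results)
```

Running many candidates through `pass_rate` gives exactly the kind of percentage the team tracked as it climbed from roughly 10% toward 60%.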

As you can imagine, the passing percentages in the early tests were very, very low. So the team started feeding the model all the code on GitHub, and introduced some other new tricks that hadn't even been thought of at the beginning. Ultimately, the pass rate in the wild went from less than 10 percent to more than 60 percent. In other words, give it any two code generation tests and it will pass roughly one of them. "It was a gradual process, from 10% to 20%, to 35% and 45%, and then slowly upward."

During this exploration, the team also tried to improve the design of the prompts and to guide the model at specific steps. The model was exposed to all versions of the code, not just the latest; with diffs, it could understand the small differences between versions.
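One way to fold version history into a prompt is to show the model a unified diff between two versions of a file alongside the code being completed. The sketch below uses Python's standard `difflib`; the prompt layout is an assumption for illustration, not Copilot's actual format.

```python
# Hypothetical sketch: build a prompt that pairs a recent diff with the
# code prefix to be completed, so the model can see what changed between
# versions. The "# Recent change:" layout is invented for illustration.

import difflib

def build_prompt(old_version: str, new_version: str, prefix: str) -> str:
    diff = "\n".join(difflib.unified_diff(
        old_version.splitlines(),
        new_version.splitlines(),
        fromfile="before",
        tofile="after",
        lineterm=""))
    return f"# Recent change:\n{diff}\n\n# Complete the following:\n{prefix}"
```

The model then sees both the edit history and the completion target in a single context window.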

"Anyway, it ended up getting better and stronger. But at least in the beginning, everything had to start from scratch, and we were like ignorant children. The only thought was that maybe one day this thing would replace Stack Overflow and other development workflow tools," says Alex.

Going one step further

The first iteration of Copilot could only be regarded as an internal tool to help people write simple tests. Then the team started trying to generate common UI code. "After all, the pass rate of the generated code was only 10% at the beginning, and UI design is a relatively open-ended problem, which might sidestep the model's weaknesses. It would be great if it succeeded."

So, next, the team started fine-tuning and testing the model. They also wanted it to power extensions to VS Code, such as code auto-completion. At the time, Alex felt this should be no problem, and the exploration of auto-completion also represented a huge leap forward. "Although the ultimate goal was still to replace Stack Overflow, I couldn't figure out how to achieve all that at the initial stage. Implementing some features in VS Code first was a real step."

"As a small step for us, autocompletion worked, and it was fun and useful. It pops up a prompt box like any other autocompletion feature and lets you select a string from it. It's convenient, easy to use, and very comfortable. We also tried some other methods of delivering functions, such as adding a small button on an empty function that could generate it quickly for the developer, or letting the developer press a control key and choose from a large pop-up list. In short, we tried almost every VS Code UI we could think of," said Alex.

Although everything was still in its infancy, the recommendation list it provided was improving by the day. Still, the model had only been exposed to a small number of samples at this point, so it could only serve as a toy for technical enthusiasts and test designers. The team wanted it to be as good as Gmail's text autocompletion.

"I really like that product. It was the first deployed result of a large language model; it's very fast and the effect is very good. Google even published a paper to share the specific technical details and fine-tuning. We worked hard in that direction. Our completion was pretty bad at first, but it felt like it was going in the right direction. After much trial and error, we finally came up with a small demo video," says Alex.

Alex recalls the team working 12-hour days, overcoming obstacles and ignoring best practices. At the time, only the CEO, the vice president, and the team itself believed in the project; others were more skeptical.

Microsoft's global push

Before the general release, Copilot was open for public beta, free for everyone to use, and many optimizations were made for different groups: how experienced programmers would use it, how novice developers would use it, and what habits and tendencies users in different countries and regions would have.

The Copilot team gathered a pile of statistics and realized that speed is the most important metric for every group. "We found that for every 10 milliseconds of added latency, 1% of users abandon the feature. Also, in the first few months of the public launch, India had the lowest completion rate. We weren't sure why, but the completion rate there was indeed significantly lower than in Europe."
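To see why the team treated latency as the top metric, the quoted relationship (1% of users lost per 10 ms of added latency) can be turned into a back-of-the-envelope estimate. This linear model is only an approximation implied by the quote, not an official formula.

```python
# Back-of-the-envelope sketch of the latency/abandonment relationship
# quoted above: roughly 1% of users abandon the feature per 10 ms of
# added latency. A simple linear approximation, clamped at 100%.

def abandonment_rate(extra_latency_ms: float,
                     rate_per_10ms: float = 0.01) -> float:
    """Estimated fraction of users abandoning a suggestion."""
    return min(1.0, (extra_latency_ms / 10.0) * rate_per_10ms)
```

Under this approximation, an extra 200 ms of round-trip delay would cost on the order of 20% of users, which is why a single faraway data center was so damaging.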

Later, the team discovered that this was because OpenAI had only one data center, located in Texas, USA. As you can imagine, if requests from India had to travel across Europe and the Atlantic to reach Texas, the round-trip delays would be maddening. This led to a mismatch between the suggestion rhythm and the typing rhythm, and the function completion rate inevitably suffered.

After finding the crux, the team was relieved. And users not far from Texas gave good reviews. For example, someone commented, "I don't know how to program, but for work I need to figure out how to write a 100-line script." As it turned out, AI models are particularly good at this development pattern, and once you find the pattern, the UI you design can come in handy.

Then came the team's "highlight moment": releasing the results, winning praise from the market, and then updating and iterating as quickly as possible.

"Some customers said they heard that Azure intended to fully host OpenAI within the next six months, but they couldn't wait; it would be best to open it up next month." Alex said the team tried to meet these demands, for example by providing infrastructure in Europe and Asia to bring the AI models closer to users on the West Coast, in Texas, and even in Europe. Microsoft put a lot of effort into this, and with the facilities up and running, Copilot is here to stay.

"Copilot would not be possible without the geniuses at OpenAI and the principled VSCode editors," said Alex.

Reference link:

https://sarahguo.com/blog/alexgraveley

https://twitter.com/alexgraveley/status/1607897474965839872


Origin blog.csdn.net/lqfarmer/article/details/131308715