GOTC 2023 Enterprise Insights: Machine learning tooling as a whole will eventually be dominated by open source

"The key is not the underlying model but its usability, including RLHF (Reinforcement Learning from Human Feedback) and the interface for interacting with it."

"RLHF is how we use human feedback to make adjustments. The simplest version is to show two outputs, ask which one a human reader prefers, and then use reinforcement learning to feed that back into the model."

"RLHF aligns models with the goals humans want."

These remarks come from a recent conversation between OpenAI CEO Sam Altman and MIT research scientist Lex Fridman.

ChatGPT has taken the world by storm, and both the breadth of human knowledge it covers and its emergent reasoning ability have been widely praised. The key link behind this is reinforcement learning from human feedback, which in turn has created a series of data labeling requirements.

In fact, the industry has been developing data annotation tools for a long time, and Beisai Technology is one of the leading companies. In 2015, Du Lin, a machine learning practitioner, founded Beisai Technology and began developing its own data labeling system, Origin1. Only at the end of 2017 did it begin taking on customer business. In 2022, Beisai created Xtreme1, an open source platform for "training data engineering" based on Origin1, and built a cloud system. In less than a year since it was open-sourced, dozens of users of the open source product have become customers of the commercial SaaS product.

Looking back, Beisai's growth seems to have anticipated the current AI wave, and its open source products have also done well in the commercial market. OSC invited Du Lin, the founder of Beisai, to talk about Beisai's growth and the story behind it, as well as his views on the data labeling field and open source.

Wang Jiajun, Technical R&D Director of Beisai Technology, will also attend GOTC 2023 and deliver a keynote speech titled "Xtreme1 Next Generation Multimodal Open Source Training Data Platform", introducing the technical features and architecture of the Xtreme1 training data platform in detail. Interested readers are welcome to register for the conference at the link below!

For conference registration, please visit: https://www.bagevent.com/event/8387611

"I've always believed that engineers and open source can change the world"

My earliest contact with computers was at my mother's workplace, where I used her computer and fell in love with it. Later, my parents bought me a computer of my own. I was in junior high school at the time, and I began teaching myself programming and became obsessed.

In my first year of high school, two classmates and I formed a robot team. We built a wheeled robot and took part in the National Science and Technology Innovation Competition. I was responsible for writing the robot's control and vision programs. Later, I entered the ACM class, the specialized computer science class at Shanghai Jiao Tong University, and continued studying in the computer field. All of this gave me a foundation in, and a love for, machine learning and computer vision.

I'm actually still writing code now, both as a hobby and as a career.

I have always believed that engineers can change the world, and so can open source. We do open source because we firmly believe in that.

I founded Beisai because, as a machine learning practitioner, I saw that training data is what consumes the most of a machine learning engineer's time. You can see that OpenAI's greatest strength is its training data engineering: a large part of its engineering work revolves around training data, and it is very good at it.

So at that time, we believed the industry needed a complete system centered on training data, from client-side MLOps to the AI Trainers marketplace, to accelerate machine learning. Having seen this opportunity, I founded Beisai. Beisai has now been around for six years, and data labeling has always been one of its core businesses. Through that business we learned about many customers' pain points with training data, so two years ago we decided to distill all of those pain points into a brand new product, and Xtreme1 was born.

Taking model training as the dividing point, data labeling work falls into two categories: work done before model training and work done after it. You may think of data labeling as only the first part: label data, then build a model. But there is another, even more important part: evaluating how well the model performs. The inference results that the model returns need humans to find and correct the machine's errors.

Both kinds of data matter a great deal to the model: the first is used to build a base model, and the second further improves the model's performance through human feedback.

Therefore, the entire machine learning cycle is always inseparable from Human In The Loop, but as the industry develops, the human part of the work will become more refined and more clearly defined. By clearly defined, I mean that we can ultimately design processes that let machines handle the tasks that do not need a human, and leave to humans only the tasks that humans must do! This is also our goal in building MLOps tools.
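As an editorial illustration of the idea of leaving to humans only what humans must do, here is a minimal Python sketch that accepts high-confidence model predictions automatically and queues the rest for human review. The `Prediction` class, the `route` function, and the 0.9 threshold are all hypothetical names and values chosen for this example. They are not taken from Xtreme1 or any Beisai product.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    item_id: str
    label: str
    confidence: float  # model score in [0, 1]; hypothetical field


def route(predictions, threshold=0.9):
    """Split predictions into auto-accepted ones and ones queued for human review.

    The threshold is an assumed value; a real pipeline would tune it per task.
    """
    auto_accepted, needs_human = [], []
    for p in predictions:
        if p.confidence >= threshold:
            auto_accepted.append(p)
        else:
            needs_human.append(p)
    return auto_accepted, needs_human


preds = [
    Prediction("img-001", "car", 0.98),
    Prediction("img-002", "pedestrian", 0.55),
    Prediction("img-003", "cyclist", 0.91),
]
auto, review = route(preds)
print([p.item_id for p in auto])    # ['img-001', 'img-003']
print([p.item_id for p in review])  # ['img-002']
```

Only the low-confidence item reaches a human labeler here. In practice the corrected labels would be fed back into the training data.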

"The paradigm of data labeling is changing, but the essence remains the same"

Large language models, especially the recent wave of them, have introduced Reinforcement Learning from Human Feedback, or RLHF for short.

At its core, human feedback means ranking the large model's outputs and correcting its answers. In essence, it brings human knowledge into machine learning so that machine-generated answers align with what humans expect.
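To make the ranking idea concrete, here is a minimal Python sketch of one common way to turn a pairwise human preference into a training signal for a reward model: a Bradley-Terry style loss, -log(sigmoid(r_preferred - r_rejected)). This formulation is an assumption added for illustration. The interview does not say which loss any particular system uses, and the function name and reward values are hypothetical.

```python
import math


def preference_loss(reward_preferred, reward_rejected):
    """Pairwise loss: -log(sigmoid(reward_preferred - reward_rejected)).

    The loss is small when the reward model already scores the
    human-preferred answer higher, and large when it scores it lower.
    """
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A human compared two answers and preferred the first one.
agrees = preference_loss(2.0, 0.5)     # reward model agrees with the human
disagrees = preference_loss(0.5, 2.0)  # reward model disagrees with the human
print(agrees < disagrees)  # True
```

Minimizing this loss over many human comparisons pushes the reward model toward the human ranking. A reinforcement learning step can then optimize the language model against that reward.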

So why is it a continuous process? A large language model is trained largely without supervision and has absorbed a great deal of general knowledge, but in very vertical scenarios, or on a corpus it has never seen, it often says things that sound right but are actually wrong (hallucination), because there has been no human labeling and alignment.

"From Human Feedback" is still Human In The Loop. In my opinion, it is a continuous evolution that runs through the entire machine learning life cycle. In other words, the paradigm of data labeling has changed, but the essence has not.

The paradigm shift has two dimensions.

The first is that the requirements for people are steadily rising. When Beisai was founded six years ago, there were still many Object Detection tasks in 2D images, where common objects had to be boxed and labeled. As models have improved, they have gradually mastered this type of work, and labeling has begun to migrate to more complex scenarios: scenarios deeply integrated with the customer's business, corner cases the machine has never encountered, or scenarios where the annotated objects have more complex attributes. As a result, the labeling industry's requirements for personnel keep rising.

The second change is that data is becoming increasingly multimodal. Originally, we mostly dealt with image, voice, or text data on its own. Later, joint annotation across modalities took off, such as images combined with text, or radar point clouds combined with image data.

As for the industry's difficulties: data labeling depends on two elements. The first is people, the AI Trainers, or what we call data labelers, so the first challenge is the supply of personnel. A platform is needed for managing, training, and coordinating data labelers, and for task distribution and quality inspection. Since the emergence of large language models in particular, the requirements for personnel have risen further. For example, OpenAI has used many PhDs for data labeling work, because it requires a stronger understanding of dialogue data.

We built Origin1, a quantifiable, on-demand human resources platform for labeling. Customers can use Origin1 to find the people they need and assign tasks. Accurate labeler profiles and intelligent matching algorithms make task distribution efficient. At present, more than 50,000 labelers and labeling companies have accepted tasks on our platform, which meets the industry's need for a flexible supply of AI Trainers.
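The interview does not describe how Origin1 matches labelers to tasks. Purely as a hypothetical sketch of what profile-based matching can look like, the Python below ranks labelers by the fraction of a task's required skills that appear in their profile. The skill tags, names, and scoring rule are all invented for illustration.

```python
def match_score(labeler_skills, task_requirements):
    """Fraction of the task's required skills that the labeler has."""
    if not task_requirements:
        return 1.0
    return len(labeler_skills & task_requirements) / len(task_requirements)


def assign(task_requirements, labelers):
    """Return labeler names ranked by match score, best first.

    Labelers with no matching skill are dropped; ties are broken by name.
    """
    scored = [
        (match_score(skills, task_requirements), name)
        for name, skills in labelers.items()
    ]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [name for _, name in ranked]


# Hypothetical labeler profiles: name -> set of skill tags.
labelers = {
    "alice": {"2d-detection", "point-cloud", "english"},
    "bob": {"2d-detection"},
    "chen": {"dialogue", "english"},
}
print(assign({"point-cloud", "2d-detection"}, labelers))  # ['alice', 'bob']
```

A production system would presumably weigh far more than skill overlap, for example past accuracy, availability, and cost.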

The second difficulty is data-centric MLOps, which is the main reason we created the open source product Xtreme1. We believe that a major pain point on the customer's business side is how to define the data and labeling requirements for their own business data and distribute them to the labeling party. Xtreme1 is a complete training data lifecycle management platform that gives machine learning engineers data management, visualization, labeling task definition, and monitoring of the labeling process.

Positive cycle from open source to commercialization

We now have three product lines: Origin1, a task assignment platform for annotators; Xtreme1, an open source training data management platform; and its commercial counterpart, BasicFinder Cloud. Our staff are spread fairly evenly across the three product lines, and we do not treat any one of them as more or less important, because together the three products cover the whole training data labeling chain, from supply to demand.

Recently, more efforts have been made on the open source end.

Open source is a very powerful ecosystem, and within it, tool chains can work in strong synergy, which speeds up model development.

We decided to open-source Xtreme1 in September 2022 because we had a clear plan for the product's positioning. At the MLOps layer, a data-centric training data management platform is needed, and we firmly believe that machine learning tooling as a whole will eventually be dominated by open source.

When Xtreme1 was open-sourced, there were not many open source labeling solutions on the market, especially ones with multimodal capabilities like ours. We saw a data-centric gap there, one that could connect to the other stages of machine learning, such as data pipelines, model training, inference, and evaluation systems. At the same time, we could share the product capabilities we had accumulated over years of business with community users and empower more business scenarios. So we decided to open-source it.

There are two types of open source companies. One type is open source from Day 0, often starting out as a hobby project. The other type recognizes the general trend toward open source and, once its product has accumulated enough experience and technology, wants to integrate with more of the upstream and downstream software ecosystem to expand the product's influence. We belong to the latter. Xtreme1 now has dozens of commercial users, thanks to our open source path and model. Many open source companies face the challenge of balancing open source investment against commercial revenue. But our model is relatively clear. As we build an ecosystem through open source, we find customers with stronger commercial needs. An individual engineer may well solve a problem with open source software, but an enterprise is more likely to want to buy a commercial version, because it offers better security, controllability, and system performance.

As a result, many customers have migrated from the open source product to our commercial and SaaS versions. In fact, whether they are open source users or commercial software customers, they inevitably have a large volume of labeling needs. They use our labeling software, and we provide them with labeling services at the same time, which forms a positive cycle from open source to commercialization.

One thing we are proud of is the decision to donate the entire open source system Xtreme1 to the Linux Foundation. It was open-sourced in September 2022, and we decided to donate it in December 2022. In January 2023, it was unanimously approved by the foundation's TAC members and became the world's first "Annotation & Visualization" project in the foundation's MLOps landscape. It is still the only one. That was very fast.

Our downloads across all channels exceeded 2,000 in early April this year. The project currently has 400+ stars on GitHub, from machine learning engineers all over the world.

In the future, we will continue to expand our open source products and invest in more research on multimodal data capabilities. The latest version of Xtreme1 already integrates a multi-turn dialogue labeling tool for LLMs. In addition, we will keep improving the synergy between products to achieve one-stop lifecycle management of machine learning training datasets.


Origin my.oschina.net/oscpyaqxylk/blog/8787138