InfoQ interview: Puck, Baidu's open-source high-performance vector search engine

Recently, Baidu open-sourced its self-developed vector search engine Puck under the Apache 2.0 license, making it the first open-source vector search engine in China designed for very large-scale datasets. Vector retrieval algorithms play an important role in applications such as personalized recommendation, multi-modal retrieval, and natural language processing, especially when handling large-scale, high-dimensional data.
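To make the core operation concrete: a vector retrieval engine answers nearest-neighbor queries over embedding vectors. The sketch below is a purely illustrative brute-force version in NumPy, not Puck's actual API; it shows the exact O(N·d) scan that approximate indexes like Puck's are designed to avoid at billion-vector scale.

```python
import numpy as np

def brute_force_knn(database, queries, k):
    """Return ids and L2 distances of the k nearest database vectors for
    each query. This exact scan is what ANN engines approximate to stay
    fast on very large datasets."""
    # Squared L2 distances via ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2
    d2 = (
        (queries ** 2).sum(axis=1, keepdims=True)
        - 2.0 * queries @ database.T
        + (database ** 2).sum(axis=1)
    )
    ids = np.argsort(d2, axis=1)[:, :k]              # k smallest per query
    dists = np.take_along_axis(d2, ids, axis=1)
    return ids, np.sqrt(np.maximum(dists, 0.0))      # clamp tiny negatives

rng = np.random.default_rng(0)
xb = rng.standard_normal((1000, 64)).astype(np.float32)  # database vectors
xq = rng.standard_normal((3, 64)).astype(np.float32)     # query vectors
ids, dists = brute_force_knn(xb, xq, k=5)
print(ids.shape)  # (3, 5): 5 nearest-neighbor ids per query
```

An ANN index trades a little recall for orders-of-magnitude speedups over this scan, which is exactly the trade-off tuned in competitions such as BIGANN.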

The name "Puck" comes from the agile hero Puck in the classic MOBA game DOTA, symbolizing elegance and agility. The project was carefully polished inside Baidu for many years, and at BIGANN, the world's first vector retrieval competition, held at NeurIPS at the end of 2021, Puck won first place in all four tracks it entered. InfoQ interviewed Ben, chief architect of Baidu's Search Content Technology Department, about the project's history and core advantages.

Open source address: https://github.com/baidu/puck

InfoQ: Could you briefly introduce your work experience and current responsibilities?

Ben: I joined Baidu right after graduation. I started in the mobile search department, working on basic search and relevance, and experienced the rapid growth of mobile search. Later, as a founding member, I helped establish the multi-modal search department, responsible for visual search, and was among the first Baidu employees to enter the AI field. I now work in the Search Content Technology Department, responsible for content-related technologies including content acquisition, content understanding, content computation, and content processing and generation.

InfoQ: When did you start paying attention to open source? What made you decide to open source Puck? What is the reason for choosing open source at this time?

Ben: We had been thinking about open source for a long time. We saw that FAISS (a large-scale vector retrieval library developed by Facebook AI Research) received widespread industry attention and adoption after being open-sourced. We hope open-sourcing Puck can likewise promote the development of the community, and that community involvement will improve code quality, accelerate technological innovation, and help Puck better adapt to market needs. As the open-source ecosystem matures and standardizes, it may also bring new business models and cooperation opportunities.

We had in fact been preparing for open source for a long time. The explosion of large models has drawn widespread attention to vector retrieval technology, and we felt this was the right moment.

InfoQ: Can you talk specifically about Puck's development history at Baidu, and, from your perspective, its value to Baidu Search?

Ben: The idea for Puck first came from the visual search business. We needed an ANN engine that could retrieve among tens of billions of similar images while delivering high throughput, low latency, high accuracy, low memory usage, and high flexibility. No engine in the industry at the time could meet those needs, so we started building our own.

In 2017, Puck launched for the first time and achieved very significant cost and quality improvements on tens of billions of images. Later, as Transformer models took off in NLP, embedding-based semantic retrieval became increasingly valuable and Puck's applications broadened. In 2019, Puck was open-sourced within Baidu, and the number of supported businesses grew rapidly: it is now widely used in internal product lines such as Baidu Search, recommendation, Baidu Netdisk, and the knowledge graph, with a supported index scale exceeding one trillion. Today, ANN has become one of the Internet's underlying basic technologies, a cornerstone of the AI era, and one of the most important supporting technologies for search.

InfoQ: Many optimizations were made over this period. What were the key ones? Can you describe them in detail?

Ben: Puck today is a product that has been polished for many years, with countless optimizations along the way. Broadly, they fall into the following stages:

  1. From 2016 to 2019, we polished the core algorithm and implementation, focusing on basic performance optimization, constantly refining details, and pushing optimization to the limit in our own scenarios. Puck's core framework was established during this period and is still in use today.

  2. From 2019 to 2021, marked by open-sourcing within the company, Puck had to adapt to an increasing variety of business scenarios and demands. Ease of use, scalability, and functional diversity became the main goals; features such as high-performance real-time insertion, multi-condition retrieval, and distributed index building were all completed during this period.

  3. From 2021 to 2022, driven by large-scale content-relationship computing applications, Puck focused on optimizing performance on ultra-large single-instance datasets, greatly improving performance and reducing costs on billion-scale datasets through large-scale quantization and index-structure optimization. Winning four first places in BIGANN, the world's first vector retrieval competition, proved Puck's competitive advantage in this area.

  4. Since 2022, we have innovated on the core algorithms, proposed new algorithms for different data scenarios, added more features, and improved supporting tooling in preparation for external open source.

This is only a rough division. In practice, Puck's progress is made up of many small optimizations. We came up with many interesting ideas and ran many experiments along the way; overall, only one or two out of ten ideas ended up as official features. These optimizations together form the Puck we see today.

InfoQ: Can you describe Puck's core advantages and application scenarios in detail?

Ben: The Puck open source project includes two Baidu self-developed retrieval algorithms and a series of supporting features. The core advantage is first and foremost performance: after years of polishing and tuning, Puck shows a clear performance advantage on benchmark datasets ranging from tens of millions to billions of vectors, significantly surpassing competing products. In BIGANN, the world's first vector retrieval competition, held at NeurIPS at the end of 2021, Puck won first place in all four tracks it entered.

Second, in terms of ease of use, Puck provides features suited to a wide range of scenarios. It offers an easy-to-use API and exposes as few parameters as possible; most parameters deliver good performance at their default settings.

Finally, Puck is a battle-tested engine. After years of verification and polishing in real large-scale scenarios, it is used in more than 30 product lines within Baidu, including search and recommendation, supporting trillions of indexed items and massive search traffic with very high reliability guarantees.

The Puck engine open-sources two retrieval algorithms, Puck and Tinker, suited to very large-scale datasets and small-to-medium datasets respectively, together covering most retrieval application scenarios. They are widely used in Baidu's internal search, recommendation, and other product lines, at data scales from one million to one trillion.

InfoQ: Facing the new wave of AI, large models have become more and more popular in the industry. In your view, will the open source market become more active in the future?

Ben: The emergence of large AI models has indeed intensified competition in the industry, but that is not a bad thing. First, the development of large models has advanced AI technology and improved its performance and efficiency. Second, large models bring more room for innovation to the industry and promote the development of the open source market.

In the future, competition in the industry's self-developed open source market will become more intense, but that does not mean it will become more complicated; on the contrary, it will bring many possibilities. Because the open source market is built on openness and sharing, enterprises and individuals can obtain the latest AI technologies and models through it without developing them from scratch. This helps the whole industry reduce R&D costs and improve R&D efficiency.

In addition, the open source market is a platform for technical exchange and innovation, where practitioners can share research results, absorb others' experience and knowledge, and jointly advance AI technology. So although competition will intensify, as long as we adapt to this trend and actively participate in exchange and innovation, we can all benefit from it.

InfoQ: What do you think the future development trends of Internet companies' open source projects will be? In what direction will they develop?

Ben:

  1. Deep specialization: As technology becomes more segmented, open source projects may become more specialized and in-depth, solving more specific problems. There will be more open source projects focused on a single problem; Puck is one of them.

  2. Diversification: Open source projects from Internet companies may reach into more industries and fields, enabling cross-domain integration of technology and forming open source projects around industry-specific solutions. This integration will help promote the broad application of technology across industries.

  3. Greater practicality: Future open source projects may focus more on practical application rather than purely theoretical research, providing more practical tools and frameworks to help developers apply theory to real work.

  4. Open data and algorithms: As data and algorithms grow in importance, more of them may be open-sourced in the future, accelerating the development of AI and other fields.

These changes will provide stronger impetus to promote technological development and solve practical problems.

InfoQ: You mentioned that Puck is widely used internally. Are there products or scenarios that everyone is familiar with? Can you give an example?

Ben: Both Baidu Search and the information-feed recommendations in the mobile Baidu app use Puck technology.

InfoQ: Have you received feedback from the community since open-sourcing? What inspiration did it give you?

Ben: Since Puck was open-sourced, we have received a lot of valuable feedback and suggestions from the community. They not only help us discover problems and shortcomings in Puck, but also give us directions for improvement and optimization.

For me personally, this feedback made me realize that although we have rich experience using Puck internally, we still need to keep learning and improving when facing a broader user base. Each user's needs may differ, and we must understand them more deeply to optimize Puck for different usage scenarios.

At the same time, the feedback made me deeply feel the vitality and innovative spirit of the open source community. Many community members not only raised questions but actively offered solutions, and that spirit of engagement and contribution inspires me. I hope that in the future we can work even more closely with the community to advance Puck together.

InfoQ: What does Puck mean to you personally, and what are your expectations for the future of Puck?

Ben: Puck is the result of the team's long-term research and hard work. As the person in charge of Puck, I have a deep attachment to this project. To me it is not just a search engine but the crystallization of the team's effort and wisdom: our pursuit of technology, our persistence in innovation, and our expectations and vision for the future. Every upgrade and optimization of Puck records our growth and progress.

I have high expectations for Puck's future. First, I hope Puck will be widely adopted in the developer community and continuously optimized and improved through community feedback; I look forward to more people participating in its development and use so that, through our joint efforts, Puck becomes an influential tool in the AI field. Second, I hope Puck keeps innovating and optimizing to maintain its technological lead, not only meeting existing needs but anticipating and leading future technology trends. Finally, I hope Puck delivers value in more real-world applications, providing strong support for AI adoption across industries and advancing technology as a whole.

About the interviewee:

Ben, chief architect of Baidu's Search Content Technology Department, is responsible for multi-modal content understanding, ultra-large-scale content relationship computing, content processing and generation, model optimization, and related directions.

Welcome to join the Puck technical exchange group: 913964818

The department is actively recruiting for multiple positions, including ANN retrieval engineers, model optimization engineers, and distributed computing R&D engineers. People who embrace challenges and have excellent problem-analysis and problem-solving skills are welcome to join!

Recruitment email: [email protected]

——END——

Recommended reading

A brief discussion on search presentation layer scenario technology-tanGo practice

First introduction to search: Baidu search product manager’s first lesson

Application of intelligent question and answer technology in Baidu search

Support OC code reconstruction practice through Python script (1): module calling relationship analysis

CVPR2023 Excellent Paper | Analysis of the Problem of Lack of Generalization in AIGC Forgery Image Identification Algorithm


Origin my.oschina.net/u/4939618/blog/10140186