Baidu open-sources Puck, its self-developed high-performance ANN search engine

Baidu has announced that Puck, its self-developed ANN search engine, is now open source under the Apache 2.0 license. ANN stands for Approximate Nearest Neighbor: the goal is to find the TopK vectors nearest to a query from the full set of vector data while balancing retrieval quality against retrieval cost.
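To make the quality-versus-cost trade-off concrete, here is a minimal sketch in plain Python. It is not Puck's API or algorithm; the function names (`exact_top_k`, `recall_at_k`) and the random data are assumptions made purely for illustration. An exact search scans every vector, while a cheaper search that scans only part of the data may miss some true neighbors, which is what recall measures.

```python
import heapq
import random


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_top_k(query, vectors, k):
    """Exact TopK by scanning every vector: accurate, but cost grows with N."""
    scored = ((squared_l2(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in heapq.nsmallest(k, scored)]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true TopK that the cheaper search also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


random.seed(0)
vectors = [[random.random() for _ in range(8)] for _ in range(1000)]
query = [random.random() for _ in range(8)]

exact = exact_top_k(query, vectors, 10)

# Hypothetical "cheap" search: scan only the first half of the data.
# It does half the work but can miss neighbors that live in the other half.
cheap = exact_top_k(query, vectors[:500], 10)

recall = recall_at_k(cheap, exact)
print(f"recall@10 of the half-scan: {recall:.2f}")
```

A real ANN index aims to keep recall high while visiting far fewer vectors than a full scan; the half-scan above is only a stand-in for that idea.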

Advantages of Puck

  • Ease of use: Puck provides a simple API, exposes as few parameters as possible, and achieves good performance with default settings for most parameters.
  • Scalability: Puck uses an entirely self-developed index structure that supports multiple feature extensions and adapts to multiple scenarios. The project is divided into well-separated modules, which makes it easy to modify and optimize, and users can add their own interfaces.
  • High performance: On benchmark datasets of 10 million, 100 million, and 1 billion, Puck shows a clear performance advantage and significantly outperforms competing products.
  • Reliability: After years of validation and refinement in real large-scale scenarios, Puck is widely used in more than 30 product lines within Baidu, including search and recommendation, and supports index data at the trillion scale and massive volumes of retrieval requests.

Puck feature extensions

  • Real-time insertion: Supports real-time insertion with a lock-free structure, so data can be updated in real time.
  • Conditional query: Supports conditional queries during retrieval. Results that do not meet the conditions are filtered out inside the underlying index search, which solves the truncation problem often encountered when merging multi-channel recall results and better meets the needs of combined retrieval.
  • Distributed index building: The index building process can be scaled out across machines. The full index can be built in one job with map-reduce instead of shard by shard, which greatly speeds up and simplifies index building.
  • Adaptive parameters: ANN methods have many retrieval parameters, which raises the barrier to adoption. Users who do not know the technical details have difficulty finding optimal values. Puck provides adaptive parameter tuning, so in most cases the default parameters give good results.
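The truncation problem mentioned under conditional query can be illustrated with a small brute-force sketch. This is not how Puck's index works internally and it uses none of Puck's API; the item layout, the `tag` field, and both function names are hypothetical. It only contrasts filtering after taking the TopK with filtering while candidates are being selected.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def post_filter_search(query, items, k, predicate):
    """Take the TopK first, then drop non-matching items.

    The result can be truncated: fewer than k items may survive the filter.
    """
    top = heapq.nsmallest(k, items, key=lambda it: squared_l2(query, it["vec"]))
    return [it["id"] for it in top if predicate(it)]


def filtered_search(query, items, k, predicate):
    """Apply the condition during candidate selection, so k matches come back
    whenever at least k matching items exist."""
    candidates = (it for it in items if predicate(it))
    top = heapq.nsmallest(k, candidates, key=lambda it: squared_l2(query, it["vec"]))
    return [it["id"] for it in top]


# Toy data: ten 1-D vectors; every third item carries the tag "a".
items = [{"id": i, "vec": [float(i)], "tag": "a" if i % 3 == 0 else "b"} for i in range(10)]
query = [0.0]


def wants_a(item):
    return item["tag"] == "a"


print(post_filter_search(query, items, 3, wants_a))  # [0]
print(filtered_search(query, items, 3, wants_a))     # [0, 3, 6]
```

With post-filtering, two of the three nearest items fail the condition and the caller is left with a single result, whereas filtering during selection still returns three matches.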

According to the announcement, Baidu has long invested in research on self-developed approximate nearest neighbor (ANN) search algorithms. Through continuous optimization and iteration, together with thorough research, development, and testing, it has worked to keep the technology both leading and mature.

The Puck open source project includes two retrieval algorithms developed by Baidu, Puck and Tinker. Both aim for high recall, high accuracy, and high throughput, and perform well on large, medium, and small datasets. On benchmark datasets of 10 million, 100 million, and 1 billion, Puck shows a clear performance advantage and significantly outperforms competing products. In BIGANN, the world's first vector retrieval competition, held at NeurIPS at the end of 2021, Puck won first place in all four of the projects it entered.

More detailed benchmarks can be viewed here.

Origin www.oschina.net/news/256905/puck-open-source