Datawhale Insights
Source: WhalePaper; organizer: Fu Qu
Introduction to WhalePaper
WhalePaper was initiated by members of the Datawhale team. It shares mature topics and open-source solutions from current academic papers, and helps participants learn efficiently, comprehensively, and with self-discipline by reading and discussing papers together, so that everyone comes away with something. Covered directions include paper interpretation and sharing in natural language processing (NLP), computer vision (CV), recommendation (Res), and related areas; more directions will be added in the future.
Open-source repository: https://datawhalechina.github.io/whale-paper
This session
Sharing time: July 29, 2023 (this Saturday) at 20:00
Sharing Direction: Vector Retrieval
Sharing tool: Tencent Meeting (#腾讯会议), meeting ID 815-856-759
Agenda: 45 minutes of presentation, followed by an open-ended Q&A.
Sharing Outline:
Introduction and Latest Development of Vector Retrieval Algorithms
Algorithm and System Design of Vector Database
Guest & paper overview
Guest profile: Chen Qi, Principal Researcher in the System Research Group at Microsoft Research Asia. She received her BS and PhD degrees in computer science from Peking University in 2010 and 2016, respectively, where she conducted research on distributed systems, cloud computing, and parallel computing under the supervision of Prof. Zhen Xiao. From 2013 to 2014, she was a visiting student in the Systems Group at New York University, working on distributed array frameworks under the guidance of Professor Li Jinyang. She has published more than 20 papers in top conferences and journals, some of which have won major awards, such as the OSDI Best Paper Award and the NeurIPS Outstanding Paper Award. Her current research interests include distributed systems, cloud computing, and deep learning algorithms and frameworks.
Topic: Vector Search and Vector Database
Topic introduction: Recent advances in deep learning have made it possible to map many types of data into high-dimensional vectors. Current state-of-the-art vector search libraries focus mainly on fast, high-recall search in memory. Extremely large-scale vector search, however, raises several challenges. First, tens of billions of vectors combined with limited memory cause capacity problems. Second, scalability is an issue: adding more server machines increases both query latency and computing cost. Third, high-dimensional vector indexes lack monotonicity, a key property of traditional indexes. Without it, existing vector systems must fall back on a temporary index that does preserve monotonicity, namely the TopK nearest neighbors of the target vector, in order to support complex queries that combine approximate similarity search with relational operations. This hurts performance because the optimal value of K is difficult to predict.
In this talk, we introduce SPANN, a distributed, disk-based approximate nearest neighbor search (ANNS) system that has been integrated into Bing and supports search over tens of billions of vectors with millisecond-level response times. We also introduce VBASE, a vector database system that handles complex queries efficiently by relying on a common property called relaxed monotonicity. This approach unifies two seemingly incompatible systems and delivers three orders of magnitude better performance than existing state-of-the-art vector systems.
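To make the TopK limitation described in the topic introduction concrete, here is a minimal Python sketch. It is not SPANN or VBASE: it uses exact brute-force search over a toy dataset, and every name in it (`topk_then_filter`, `smallest_k_needed`, the `prices` attribute) is a hypothetical chosen for illustration. It only shows why a fixed K is hard to choose when a similarity search is combined with a relational filter.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank_by_distance(vectors, query):
    """Indices of all vectors, nearest to the query first (brute force)."""
    return sorted(range(len(vectors)), key=lambda i: sq_dist(vectors[i], query))


def topk_then_filter(vectors, prices, query, k, max_price):
    """TopK-style plan: fetch the k nearest first, then apply the filter."""
    nearest = rank_by_distance(vectors, query)[:k]
    return [i for i in nearest if prices[i] <= max_price]


def smallest_k_needed(vectors, prices, query, want, max_price):
    """Smallest k for which topk_then_filter returns `want` results.

    This value depends on how the filter interacts with the data,
    so it cannot be known before the query runs.
    """
    hits = 0
    for position, i in enumerate(rank_by_distance(vectors, query), start=1):
        if prices[i] <= max_price:
            hits += 1
            if hits == want:
                return position
    return None  # fewer than `want` vectors satisfy the filter


# Toy data (assumed): the three nearest items all fail the price filter.
vectors = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
prices = [9, 9, 9, 1, 1, 1]
query = [0.0]

print(topk_then_filter(vectors, prices, query, k=3, max_price=5))  # []
print(topk_then_filter(vectors, prices, query, k=5, max_price=5))  # [3, 4]
print(smallest_k_needed(vectors, prices, query, want=2, max_price=5))  # 5
```

With K = 3 the query returns nothing, while K = 5 is exactly enough for two results. A K that is too small forces a retry; a K that is too large wastes work.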
How to participate
Scan the QR code to join the WhalePaper group
If the group is full, please reply "paper" to the official account
Contact information of the person in charge of WhalePaper:
Fu Qu (WeChat ID: MePhyllis)
Hua Hui (WeChat ID: BuShouY)