Core retrieval technology: a learning summary

1. Analysis of the necessity of learning retrieval technology

(1) Analysis of key reasons

There are many good reasons to learn retrieval technology (Information Retrieval, IR), especially in today's digital era of information explosion.

Overall, learning retrieval techniques helps improve the efficiency of information processing and utilization, both in personal life and in professional and academic settings. These skills enhance the ability to search, analyze, and organize information to better meet a variety of needs and goals.

(2) Examples of modern business system applications

Retrieval technology underlies many popular business systems, which rely on it for efficient information retrieval and relevance ranking. Here are some common application areas:

  1. Database management systems: A database management system (DBMS) uses retrieval technology to process queries, so users can quickly find and examine information in a database. This matters to any business or organization that stores and manages data.

  2. Search engines: Search engines are the best-known example of information retrieval. They use retrieval technology to return web pages and documents relevant to a user's query. A search engine must quickly index and retrieve the vast amount of information on the Internet and rank it by relevance.

  3. Advertising engines: Online advertising platforms use retrieval technology to decide where each ad should appear and which users it should target, in order to increase click-through and conversion rates.

  4. Recommendation engines: Recommendation engines use retrieval technology to analyze users' behavior and interests, and then recommend relevant products, content, or services. Social media, e-commerce sites, and streaming platforms all use this technology to increase user engagement and satisfaction.

  5. Content management systems: A content management system (CMS) uses retrieval technology to help users manage and organize content on a website or application, making it easier to create, edit, and find information.

  6. Knowledge graphs: A knowledge graph organizes knowledge so that it can be retrieved, and is used to build intelligent search and question answering systems. It helps machines understand and answer natural language questions.

In summary, retrieval technology plays a key role in many modern business systems, helping these systems process and serve information efficiently, thereby improving user experience, increasing revenue, and delivering more value. The continuous development of these technologies has also promoted the further development of the Internet and digital economy.

(3) Simple knowledge panorama analysis

A quick way to see the whole landscape of retrieval knowledge is Chen Dong's Geek Time course "20 Core Retrieval Technology Lectures". Much of the material below also comes from this course.

The knowledge can be divided into four layers, analyzed below:

  1. Storage media layer: This is the foundation of retrieval technology, because the way data is stored directly affects retrieval efficiency. Understanding the characteristics and trade-offs of different storage media, such as disk, memory, and distributed storage, is crucial to optimizing retrieval performance.

  2. Data structure and algorithm layer: Data structures and algorithms are the key to improving retrieval efficiency. Different types of data and queries call for different data structures and algorithms, so this layer requires an in-depth understanding and fluent use of them.

  3. Retrieval expertise layer: This layer covers more advanced retrieval techniques, including engineering architecture and algorithm strategies. On the architecture side, you need to understand how to build a scalable and highly available retrieval system. On the algorithm side, you need to understand retrieval algorithms and techniques such as inverted indexes, text analysis, and ranking algorithms.

  4. Application layer: This layer applies retrieval technology to real business scenarios, including search engines, advertising engines, and recommendation engines. Different application fields may share similar engineering architectures and algorithms, but each also has its own business requirements and processing workflows. Learning how to apply retrieval techniques to these business systems is very practical.

Overall, this hierarchical structure provides clear guidance for learning retrieval technology, from basic knowledge to advanced applications, helping people build a comprehensive retrieval technology knowledge system.

2. Basic technical analysis

Retrieval is a technique for efficiently obtaining the required information from wherever the data is stored. Retrieval efficiency is closely tied to how data is stored, so it is important to study how the storage characteristics of different data structures affect it.

  1. Data structure selection: Different data structures suit different storage and retrieval needs. For example, hash tables are good at fast point lookups but cannot serve range queries, because they do not keep keys in order. Tree structures (such as binary search trees or B-trees) keep keys ordered and therefore support range queries, but a single lookup costs O(log n) rather than the O(1) average of a hash table. It is therefore crucial to understand the characteristics of each data structure and when to use it.

  2. Index structure: Databases and search engines use index structures to speed up retrieval. Different index structures, such as inverted indexes, B-tree indexes, and hash indexes, suit different types of queries and data. Choosing the right one can significantly improve retrieval efficiency.

  3. Data encoding and compression: Data can be stored with different encoding and compression techniques. These reduce storage space and also affect retrieval speed to some extent. Knowing how to select and apply them is important for optimizing both storage and retrieval.

  4. Distributed storage: In large-scale systems, data is often spread across multiple nodes. Understanding the principles of distributed storage and how to retrieve distributed data efficiently is important for building high-performance systems.

In summary, data structure and storage characteristics have a significant impact on retrieval efficiency, so a deep understanding of these concepts and techniques is crucial for designing and optimizing storage and retrieval systems. In different application scenarios, choosing appropriate data structures and storage methods can significantly improve system performance and efficiency.
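To make the hash table versus ordered structure trade-off concrete, here is a minimal Python sketch. The sample records and the function names `point_lookup` and `range_lookup` are made up for illustration; a sorted list with the standard `bisect` module stands in for a tree-like ordered index.

```python
from bisect import bisect_left, bisect_right

# Hypothetical sample data: (key, value) pairs in no particular order.
records = [(17, "q"), (3, "c"), (42, "z"), (8, "h"), (25, "m")]

# Hash index: O(1) average point lookup, but the keys carry no order.
hash_index = dict(records)

# Ordered index: keys kept sorted, so a range maps to one contiguous slice.
sorted_keys = sorted(key for key, _ in records)


def point_lookup(key):
    """Return the value stored under key, or None if the key is absent."""
    return hash_index.get(key)


def range_lookup(low, high):
    """Return every key k with low <= k <= high using two binary searches."""
    start = bisect_left(sorted_keys, low)
    end = bisect_right(sorted_keys, high)
    return sorted_keys[start:end]
```

With this data, `point_lookup(8)` returns `"h"` and `range_lookup(5, 30)` returns `[8, 17, 25]`. Answering the same range query from the hash index alone would require scanning every key.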

The core idea of retrieval is to narrow the query scope as quickly as possible by organizing the data sensibly. In other words, however many retrieval algorithms and techniques exist, their essence is the same: organize data by flexibly exploiting the characteristics of various data structures, so that the query scope shrinks quickly.

(1) Linear structure retrieval of arrays and linked lists

Basic analysis

Arrays and linked lists are two different linear data structures, and their retrieval efficiency differs in some aspects, depending on the specific operations and usage scenarios.

Array retrieval efficiency:

  • Efficient random access: Arrays are stored contiguously in memory, which makes random access very efficient. Given an index, the element at that position can be accessed directly, with a time complexity of O(1).
  • Inefficient insertion and deletion: Inserting or deleting an element usually requires moving the subsequent elements to keep the array contiguous. The average time complexity of such an operation is O(n), where n is the number of elements in the array.

Retrieval efficiency of linked lists:

  • Inefficient random access: The elements of a linked list are not stored contiguously, so reaching an element means traversing the list from the head node or another known position. The average time complexity of random access is therefore O(n), where n is the length of the linked list.
  • Efficient insertion and deletion: Inserting or deleting an element only requires modifying node pointers, with no elements moved. These operations take O(1) time, assuming the node at the insertion or deletion point is already at hand.

To sum up, if you need to do frequent random access operations, arrays are generally more efficient. But if you need to perform frequent insertion and deletion operations, and the access efficiency requirements are not so high, a linked list may be more suitable. In practical applications, appropriate data structures are usually selected based on specific operational requirements, or higher-level data structures are considered to balance the performance of these operations when needed. For example, a balanced binary search tree can provide better insertion, deletion, and search efficiency.
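The contrast can be sketched in a few lines of Python. The `Node` class and helper names below are illustrative only; a built-in Python list plays the role of the array.

```python
class Node:
    """A singly linked list node: one value plus a pointer to the next node."""

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def build_linked_list(values):
    """Build a singly linked list and return its head (None if empty)."""
    head = None
    for value in reversed(values):
        head = Node(value, head)
    return head


def linked_get(head, index):
    """Random access in a linked list: walk index steps from the head, O(n)."""
    node = head
    for _ in range(index):
        if node is None:
            break
        node = node.next
    if node is None:
        raise IndexError("index out of range")
    return node.value


def insert_after(node, value):
    """Insert after a node we already hold: only pointers change, O(1)."""
    node.next = Node(value, node.next)


def to_list(head):
    """Collect the values of the linked list in order."""
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out


array = [10, 20, 30]
second = array[1]        # O(1): the position is computed directly
array.insert(1, 15)      # O(n): later elements are shifted one slot right

head = build_linked_list([10, 20, 30])
last = linked_get(head, 2)   # O(n): follows two pointers to reach the value
insert_after(head, 15)       # O(1): no elements are moved
```

Both containers end up holding 10, 15, 20, 30; the difference lies in which of the two operations had to touch many elements.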

Use binary search to improve array retrieval efficiency
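When the array is kept sorted, binary search compares the target with the middle element and discards half of the remaining range at every step, so a lookup takes O(log n) comparisons instead of O(n). A minimal Python sketch follows; the function name `binary_search` is chosen for illustration.

```python
def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            low = mid + 1    # target can only be in the right half
        else:
            high = mid - 1   # target can only be in the left half
    return -1
```

For example, `binary_search([1, 3, 5, 7, 9, 11], 7)` returns `3`, and searching the same list for `4` returns `-1`. The method relies on O(1) random access, which is why it works on arrays but not directly on linked lists.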

Flexibly transform linked lists to improve retrieval efficiency

Studying linked lists is really about learning how to organize "non-contiguous storage space". The following simple example shows how to design a variant of the linked list, based on actual needs, to improve retrieval efficiency.

Problem background: Suppose you need to design a music playlist (or song library) in which users can jump to any song, and you want to minimize the memory footprint.

Traditional linked list: A traditional singly linked list needs one node per song, and each node also stores a pointer to the next node. The per-node pointer overhead wastes memory, and random access has to follow the pointers one song at a time.

Improvement plan: To reduce memory usage and improve retrieval efficiency, you can design a variant in which each node stores a small batch of songs instead of a single song. This variant can be called a "song block linked list".

Design of the song block linked list:

  • Each node contains a small array (or list) that stores a certain number of songs. The size of the array can be tuned to balance memory usage against retrieval efficiency.
  • Each node also contains a pointer to the next node so that the entire linked list of song blocks can be traversed.

Search operation:

  • When a user wants to access a song at random, first determine which node's small array holds it. Because every node holds up to a fixed number of songs, the traversal can skip a whole block per step; binary search is also possible if an auxiliary index of the nodes is kept.
  • Once the node is found, the target song is located inside its small array, by direct index if the position is known or by a short linear search otherwise.

This design keeps the linked list's ability to use non-contiguous storage space, reduces memory usage, and still allows faster song retrieval. The example shows how to design an appropriate data structure from actual needs, building on the core idea of linked lists, to improve retrieval efficiency and save memory.
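A minimal Python sketch of the song block linked list is shown below. The class names `SongBlock` and `SongBlockList`, the default block capacity of 4, and lookup by playlist position are all assumptions made for illustration.

```python
class SongBlock:
    """One node of the list: a small array of songs plus a next pointer."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.songs = []
        self.next = None


class SongBlockList:
    """A linked list whose nodes each hold up to block_capacity songs."""

    def __init__(self, block_capacity=4):
        self.block_capacity = block_capacity
        self.head = SongBlock(block_capacity)
        self.tail = self.head
        self.size = 0

    def append(self, song):
        """Add a song at the end, opening a new block when the tail is full."""
        if len(self.tail.songs) == self.tail.capacity:
            new_block = SongBlock(self.block_capacity)
            self.tail.next = new_block
            self.tail = new_block
        self.tail.songs.append(song)
        self.size += 1

    def get(self, index):
        """Return the song at position index, skipping one whole block per step."""
        if not 0 <= index < self.size:
            raise IndexError("song index out of range")
        block = self.head
        while index >= len(block.songs):
            index -= len(block.songs)
            block = block.next
        return block.songs[index]  # direct array access inside the block

    def block_count(self):
        """Count the nodes, i.e. the number of next pointers that must be stored."""
        count, block = 0, self.head
        while block is not None:
            count += 1
            block = block.next
        return count


playlist = SongBlockList(block_capacity=4)
for number in range(10):
    playlist.append(f"song-{number}")
```

Here ten songs occupy three blocks, so `playlist.get(9)` follows two pointers instead of nine, and only three nodes carry pointer overhead instead of ten.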

(2) Tree and skip list nonlinear structure retrieval

Basic analysis

Trees and skip lists are non-linear data structures that both support efficient retrieval, each suited to somewhat different situations. The following compares the two:

Tree (usually a balanced binary search tree)

Advantages:

  • Efficient retrieval: Balanced binary search trees (such as AVL trees or red-black trees) keep the tree balanced even when the data changes frequently, so retrieval stays efficient. The retrieval time complexity is O(log n).
  • Insertion and deletion: Balanced trees also handle insertion and deletion in O(log n) time.

Applicable scenarios:

  • Suitable for situations that require frequent insertion, deletion, and retrieval, such as database indexes and ordered collections.
  • A good choice when the data must always be kept in sorted order.

Skip list

Advantages:

  • Efficient retrieval: A skip list achieves efficient retrieval by skipping over elements through multi-level indexes. The average retrieval time complexity is O(log n), similar to a balanced tree.
  • Simple implementation: Compared with balanced trees, a skip list is relatively simple to implement because it needs no rotations or explicit rebalancing; balance comes from randomly chosen node levels.

Applicable scenarios:

  • Suitable for scenarios that need efficient retrieval over ordered data and value a simple implementation. Insertion and deletion also take O(log n) on average, although that bound is probabilistic rather than guaranteed.
  • Can be used to implement ordered collections, high-performance skip list indexes, etc.

Summary:

  • Trees and skip lists are non-linear data structures used for efficient retrieval. Their average retrieval time complexity is similar.
  • Balanced trees suit scenarios that require frequent insertion, deletion, and retrieval with a guaranteed bound, and situations where the data must stay ordered.
  • Skip lists suit situations where efficient retrieval is required and a simpler implementation is preferred, accepting that their performance bounds hold on average rather than in the worst case.

In practical applications, the choice between a tree and a skip list depends on specific needs and performance requirements. If insertion and deletion are frequent and you need a guaranteed worst-case bound while keeping the data ordered, a balanced tree may be more suitable. If your main concern is efficient retrieval with a simpler implementation, and you can accept bounds that hold only on average, a skip list may be the better choice.

How to perform binary search in tree structure

A binary search tree is searched with the same idea as binary search, an efficient search algorithm for ordered data. The basic principle is as follows:

A binary tree is a tree-like data structure in which each node has at most two child nodes, usually called the left child and the right child.

In a binary search tree the nodes are arranged in a specific order: every node in the left subtree has a smaller value than its parent, and every node in the right subtree has a larger value (or vice versa, depending on the convention the tree uses).

Binary search:

  • Binary search is a divide-and-conquer strategy that starts at the root node of the tree and keeps narrowing the search range until the target element is found or shown not to exist.
  • Starting from the root node, compare the value of the target element with the current node.
  • If the target element is less than the value of the current node, continue searching in the left subtree, because all values in the left subtree are less than the current node.
  • If the target element is greater than the value of the current node, continue searching in the right subtree, because all values in the right subtree are greater than the current node.
  • Repeat this process until the target element is found or the search runs past a leaf into an empty subtree. In the latter case, the target element does not exist in the tree.

Time complexity:

  • The time complexity of searching a balanced binary search tree (such as an AVL tree) is O(log n), where n is the number of nodes in the tree. This is a very efficient retrieval algorithm.

In short, a binary search tree keeps its data ordered so that each lookup can proceed like a binary search. Comparing the target value with the current node determines the search direction, and in a balanced tree each step discards roughly half of the remaining nodes, which is what makes the search fast. This makes the binary search tree a very useful data structure for efficient search and for producing data in sorted order.
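The search steps above can be sketched in Python as follows. The names `TreeNode`, `bst_insert`, and `bst_search` are illustrative, and the insert helper does no balancing; it exists only to build a tree to search.

```python
class TreeNode:
    """A binary search tree node holding one key."""

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def bst_insert(root, key):
    """Insert key into an unbalanced BST and return the (possibly new) root."""
    if root is None:
        return TreeNode(key)
    if key < root.key:
        root.left = bst_insert(root.left, key)
    elif key > root.key:
        root.right = bst_insert(root.right, key)
    return root  # equal keys are ignored


def bst_search(root, target):
    """Return the node holding target, or None if the search falls off the tree."""
    node = root
    while node is not None:
        if target == node.key:
            return node
        # Smaller targets can only be on the left, larger ones on the right.
        node = node.left if target < node.key else node.right
    return None


root = None
for key in [50, 30, 70, 20, 40, 60, 80]:
    root = bst_insert(root, key)
```

Searching this tree for 60 visits 50, 70, and 60, while searching for 65 ends at the empty right child of 60 and returns `None`.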

Search space balancing scheme for binary search trees

The retrieval performance of a Binary Search Tree (BST) depends largely on the balance of the tree. If the tree is well balanced, the average time complexity of the retrieval operation will remain at the O(log n) level. However, if the BST is unbalanced, the retrieval operation may take O(n) time in the worst case, which significantly reduces its performance.

To keep a BST balanced, the following balancing schemes can be adopted:

Balanced binary search tree (AVL tree):

  • An AVL tree is a self-balancing BST that restores balance by performing rotations, where needed, after each insertion or deletion of a node.
  • Each node has a balance factor: the difference between the height of its left subtree and the height of its right subtree. After an insertion or deletion, the balance factors are updated, and wherever a factor falls outside the range -1 to 1, a single or double rotation restores balance.
  • The retrieval time complexity of an AVL tree is O(log n) even in the worst case. Because it is kept strictly balanced, it is particularly well suited to lookup-heavy workloads, while still supporting insertion and deletion in O(log n).
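As a rough illustration of balance factors and rotations, here is a minimal AVL insertion sketch in Python. It covers insertion only, ignores duplicate keys, and all names (`AVLNode`, `avl_insert`, and the helpers) are chosen for this example.

```python
class AVLNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.height = 1  # height of the subtree rooted at this node


def height(node):
    return node.height if node else 0


def balance_factor(node):
    """Left subtree height minus right subtree height."""
    return height(node.left) - height(node.right)


def update_height(node):
    node.height = 1 + max(height(node.left), height(node.right))


def rotate_right(y):
    """Lift y's left child x above y and return x as the new subtree root."""
    x = y.left
    y.left = x.right
    x.right = y
    update_height(y)
    update_height(x)
    return x


def rotate_left(x):
    """Lift x's right child y above x and return y as the new subtree root."""
    y = x.right
    x.right = y.left
    y.left = x
    update_height(x)
    update_height(y)
    return y


def avl_insert(node, key):
    """Insert key and rebalance on the way back up; returns the subtree root."""
    if node is None:
        return AVLNode(key)
    if key < node.key:
        node.left = avl_insert(node.left, key)
    elif key > node.key:
        node.right = avl_insert(node.right, key)
    else:
        return node  # duplicate keys are ignored in this sketch
    update_height(node)
    factor = balance_factor(node)
    if factor > 1:  # left-heavy
        if balance_factor(node.left) < 0:  # left-right case: double rotation
            node.left = rotate_left(node.left)
        return rotate_right(node)
    if factor < -1:  # right-heavy
        if balance_factor(node.right) > 0:  # right-left case: double rotation
            node.right = rotate_right(node.right)
        return rotate_left(node)
    return node


root = None
for key in [1, 2, 3, 4, 5, 6, 7]:  # sorted input would degrade a plain BST
    root = avl_insert(root, key)
```

Inserting the keys 1 to 7 in sorted order into a plain BST would produce a chain of height 7; with the rotations above, the tree ends up with root key 4 and height 3.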

Red-black tree:

  • A red-black tree is another self-balancing BST that stays balanced by coloring its nodes red or black and following a set of rules.
  • The rules include that no two red nodes may be adjacent (a red node cannot have a red child), and that every path from a node down to its descendant leaves contains the same number of black nodes.
  • The retrieval time complexity of a red-black tree is O(log n). Its balance is looser than an AVL tree's, so insertions and deletions typically need fewer rotations and may be slightly more efficient.

Splay tree:

  • A splay tree is a self-adjusting BST that, after each retrieval, moves the accessed node to the root through a series of rotations. This speeds up later retrievals of recently accessed nodes.
  • The amortized retrieval time complexity of a splay tree is O(log n), although a single operation can take O(n) in the worst case, and the restructuring adds some overhead to every operation.

The right balancing scheme depends on your specific needs and performance requirements. AVL trees and red-black trees are usually used where guaranteed balance is required, while splay trees suit scenarios where retrieval of recently accessed nodes should be optimized. Each scheme makes different trade-offs, so consider the specific needs of your application when choosing.

How to perform binary search using a skip list

A skip list is a data structure that supports efficient search, insertion, and deletion on an ordered collection of elements. Its search resembles binary search and is built on the idea of multi-level indexes. The basic principle is as follows:

Multi-level index

  • A skip list contains multiple levels (layers), each of which is an ordered linked list containing some of the original data elements. The bottom-level linked list contains all the elements, while each upper level contains a subset of the elements in the level below it.
  • Every level is an ordered linked list. A single linked list cannot be binary searched directly, but because each higher level skips over more elements, descending through the levels narrows the range in a way similar to binary search.

Search operation

  • The search starts from the head of the top-level linked list and works downward level by level. At each level, it compares the value of the next node with the target value.
  • If the next node's value is less than the target value, it moves right, and repeats until the next node is greater than or equal to the target value or the level ends.
  • It then drops down one level and continues from the same node. At the bottom level, the next node either is the target or shows that the target is absent.

Advantages

  • Multi-level indexes in a skip list allow many elements to be skipped quickly, narrowing the search to a smaller area in a way similar to binary search.
  • The average retrieval time complexity of a skip list is O(log n), where n is the number of elements, compared with O(n) for a traditional linked list.

In short, a skip list organizes data into multiple ordered linked lists through multi-level indexes, achieving efficient search in the spirit of binary search. Its average retrieval time complexity of O(log n) makes it an efficient data structure when frequent lookups must be performed on an ordered collection of elements. Skip lists are also relatively easy to implement and do not require the rebalancing algorithms of balanced trees, which gives them certain advantages in practice.
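The search procedure can be sketched in Python as follows. To keep the example reproducible, the list is built deterministically, with level i holding every 2**i-th element; real skip lists usually choose levels at random. The names `SkipNode`, `build_skip_list`, and `skip_search` are illustrative.

```python
class SkipNode:
    def __init__(self, value, level_count):
        self.value = value
        # forward[i] is the next node on level i; level 0 holds every element.
        self.forward = [None] * level_count


def build_skip_list(sorted_values, level_count=3):
    """Build a skip list from sorted input; level i keeps every 2**i-th element."""
    head = SkipNode(None, level_count)  # sentinel node, holds no value
    last = [head] * level_count         # last node linked so far on each level
    for position, value in enumerate(sorted_values):
        node_levels = 1
        while node_levels < level_count and position % (2 ** node_levels) == 0:
            node_levels += 1
        node = SkipNode(value, node_levels)
        for level in range(node_levels):
            last[level].forward[level] = node
            last[level] = node
    return head


def skip_search(head, target):
    """Return (found, steps), where steps counts the moves to the right."""
    node = head
    steps = 0
    for level in reversed(range(len(head.forward))):
        # Move right while the next value is still smaller than the target,
        # then fall through to the next level down.
        while node.forward[level] is not None and node.forward[level].value < target:
            node = node.forward[level]
            steps += 1
    candidate = node.forward[0]
    return candidate is not None and candidate.value == target, steps


head = build_skip_list([10, 20, 30, 40, 50, 60, 70, 80, 90])
found, steps = skip_search(head, 70)
```

In this example the search for 70 moves right through 10 and 50 on the top level and through 60 on the bottom level, so it takes fewer right moves than the six a plain linked list scan would need.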

Skip list insertion and deletion operations

Insertion and deletion in a skip list are somewhat more involved than in a plain linked list, because the element must be added to or removed from the bottom-level linked list and the multi-level indexes must be kept consistent as well.

Insert operation

  1. First, find the insertion position. Start at the head of the top-level linked list and move down level by level until you reach the position in the bottom-level list, remembering the last node visited on each level.

  2. Insert the new element into the bottom-level linked list at that position.

  3. Next, maintain the multi-level indexes:

    • Randomly decide whether the new element should be promoted to a higher-level index, for example by flipping a coin. If it is promoted, add it to the next level up and repeat until the coin flip fails or the maximum level is reached.
    • At each level the element is promoted to, link it in directly after the node remembered for that level in step 1. In the usual randomized design no splitting or re-indexing is needed, because random promotion keeps the levels balanced on average.
  4. After the insertion, every level of the skip list should still be in order.

Delete operation

  1. Deletion also starts by finding the element to be deleted. Start at the head of the top-level linked list and move down level by level, remembering the last node visited on each level, until you find the element.

  2. Remove the element from the bottom-level linked list.

  3. The multi-level indexes must be maintained as well:

    • Unlink the element from every upper level on which it appears, using the nodes remembered in step 1. If a level becomes empty as a result, it can be dropped.
  4. After the deletion, every level of the skip list should still be in order.

Real implementations of insertion and deletion also involve details such as how to handle duplicate elements and boundary cases at the insertion and deletion positions. Keeping the index levels consistent requires care to ensure efficient and correct operations, but in general skip list insertion and deletion can be implemented by following the steps above.
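A minimal Python sketch of a randomized skip list with insertion and deletion follows. It assumes the common coin-flip design: each element is promoted with probability 0.5 up to a fixed maximum number of levels, and duplicates are rejected. The class and method names, and the `seed` parameter, are made up for illustration.

```python
import random


class Node:
    def __init__(self, value, level_count):
        self.value = value
        self.forward = [None] * level_count  # forward[i]: next node on level i


class SkipList:
    def __init__(self, max_levels=8, promote_probability=0.5, seed=None):
        self.max_levels = max_levels
        self.promote_probability = promote_probability
        self.random = random.Random(seed)
        self.head = Node(None, max_levels)  # sentinel spanning every level

    def _random_level_count(self):
        """Flip a coin until it fails; each success promotes one level higher."""
        levels = 1
        while levels < self.max_levels and self.random.random() < self.promote_probability:
            levels += 1
        return levels

    def _predecessors(self, value):
        """For each level, find the last node whose value is smaller than value."""
        update = [self.head] * self.max_levels
        node = self.head
        for level in reversed(range(self.max_levels)):
            while node.forward[level] is not None and node.forward[level].value < value:
                node = node.forward[level]
            update[level] = node
        return update

    def contains(self, value):
        candidate = self._predecessors(value)[0].forward[0]
        return candidate is not None and candidate.value == value

    def insert(self, value):
        """Insert value; returns False if it is already present."""
        update = self._predecessors(value)
        candidate = update[0].forward[0]
        if candidate is not None and candidate.value == value:
            return False
        node = Node(value, self._random_level_count())
        for level in range(len(node.forward)):
            # Splice the node in right after its predecessor on this level.
            node.forward[level] = update[level].forward[level]
            update[level].forward[level] = node
        return True

    def delete(self, value):
        """Unlink value from every level it appears on; returns False if absent."""
        update = self._predecessors(value)
        target = update[0].forward[0]
        if target is None or target.value != value:
            return False
        for level in range(len(target.forward)):
            update[level].forward[level] = target.forward[level]
        return True

    def to_list(self):
        """Return all values in order by walking the bottom level."""
        out, node = [], self.head.forward[0]
        while node is not None:
            out.append(node.value)
            node = node.forward[0]
        return out


skip_list = SkipList(seed=1)
for value in [30, 10, 50, 20, 40]:
    skip_list.insert(value)
skip_list.delete(20)
```

After these calls, `skip_list.to_list()` is `[10, 30, 40, 50]` whatever levels the coin flips happened to choose, since the bottom level always holds every element in order.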

Reference articles and techniques

Geek Time, Chen Dong, "20 Core Retrieval Technology Lectures"

Origin blog.csdn.net/xiaofeng10330111/article/details/132866513