marisa-trie: a fast, memory-efficient tool for compressed string storage and query matching, built on an efficient Trie implementation

In the previous article, I mentioned that a real project required me to investigate which string query and matching algorithms currently perform best. If you are interested, you can read it directly:

"pyahocorasick - python efficient string matching practice based on AC automata"

This article has the same purpose as the previous one. Here we introduce another useful string query and matching module: marisa-trie. According to its official project page, the project currently has nearly 1k stars, slightly more than the AC automaton project covered in the previous article.

The marisa-trie module is a Python library that provides data structures for efficient storage and retrieval of string keys. It is based on the Trie (dictionary tree) data structure and uses a highly compressed representation to provide fast lookups and low memory consumption.

The detailed features of the marisa-trie module are as follows:

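  1. Efficient string retrieval: marisa-trie uses the Trie data structure, which can effectively store and retrieve strings. It supports exact lookup, common-prefix search (finding the stored keys that are prefixes of a query string) and predictive search (finding the stored keys that start with a given prefix).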

  2. Compressed storage: This module uses a highly compressed representation that can significantly reduce storage space requirements. This is especially useful in scenarios where a large number of string keys need to be stored, saving memory and disk space.

  3. Fast search: Because it is a Trie, the cost of a lookup depends mainly on the length of the key being searched rather than on the number of keys stored. This makes it well suited to string key lookup tasks that require high performance.

  4. Support for ordered keys: the keys stored in the Trie can be traversed in a well-defined order, so prefix-based queries can return results in sorted order. This is useful in applications that need sorted output for a group of string keys.

  5. Easy to use: This module provides a simple and intuitive API that is easy to use and integrate into your Python projects.
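As a quick illustration of the features listed above, here is a minimal sketch of the three basic query types. The sample keys are made up for this example; the calls used (`in`, `prefixes` and `keys`) are part of the `marisa_trie.Trie` interface, and the package is assumed to be installed with `pip install marisa-trie`.

```python
import marisa_trie

# Hypothetical sample keys, chosen only for illustration.
keys = ["an", "and", "android", "is", "un", "under"]
trie = marisa_trie.Trie(keys)

# Exact lookup: is this exact string one of the stored keys?
print("and" in trie)      # True
print("andro" in trie)    # False

# Common-prefix search: stored keys that are prefixes of the query.
print(sorted(trie.prefixes("android phone")))   # ['an', 'and', 'android']

# Predictive search: stored keys that start with the given prefix.
print(sorted(trie.keys("un")))                  # ['un', 'under']
```

The results are sorted here only so that the printed order does not depend on how the trie arranges its nodes internally.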

The marisa-trie algorithm is developed and built based on the Trie (dictionary tree) data structure and uses a highly compressed representation. The following is a detailed explanation of the algorithm construction principle of marisa-trie:

  1. Trie data structure: Trie is a tree data structure used to store and retrieve collections of strings. It starts from the root node, each node represents a character, and each path represents a string. Searching for a string is accomplished by moving along a path in the tree.

  2. Building the Trie: To build the marisa-trie, the string keys are first inserted into the Trie. For each key, start from the root node and walk down the Trie character by character, creating new nodes wherever no matching child exists. When the last character of the key has been processed, the node reached is marked as a terminating node.

  3. Sorting nodes: After building the Trie, the nodes need to be sorted. The purpose of sorting is to prepare for subsequent compression steps. marisa-trie uses a specific sorting algorithm to maintain the order of the string keys in the Trie.

  4. Compressed representation: Once node sorting is complete, marisa-trie converts the tree into a highly compressed representation. Keys with a common prefix already share the nodes for that prefix; on top of this, the tree is stored in a compact, succinct encoding rather than as individually linked node objects. This can greatly reduce storage space requirements, especially for string keys with a large number of shared prefixes.

  5. Completing the build: Once the compressed representation is complete, the build of marisa-trie is complete. The resulting structure is static: it is built once from the full key set and then only queried. A lookup takes time that depends mainly on the length of the key, not on the number of keys stored.

The algorithmic building blocks of marisa-trie make it an efficient data structure suitable for string key storage and retrieval tasks that require high performance and low memory consumption.
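To make the first two steps above concrete, here is a minimal sketch of a plain, uncompressed trie in pure Python. It is not how marisa-trie stores its data (the sorting and compression steps are left out entirely), and the class name `SimpleTrie` is hypothetical; it only shows how keys are inserted, how terminating nodes are marked, and how a common-prefix search walks the tree.

```python
class SimpleTrie:
    """Uncompressed, dict-based trie; illustrative only, not MARISA's layout."""

    _END = object()  # sentinel key meaning "a stored key ends at this node"

    def __init__(self, keys=()):
        self.root = {}
        for key in keys:
            self.insert(key)

    def insert(self, key):
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})  # create child nodes on demand
        node[self._END] = True              # mark the terminating node

    def prefixes(self, text):
        """Return every stored key that is a prefix of `text`."""
        found, node = [], self.root
        for i, ch in enumerate(text):
            node = node.get(ch)
            if node is None:                # no edge for this character
                break
            if self._END in node:
                found.append(text[:i + 1])
        return found


trie = SimpleTrie(["an", "and", "is", "un"])
print(trie.prefixes("android"))   # ['an', 'and']
print(trie.prefixes("quiet"))     # []
```

Each nested dictionary here costs far more memory than the characters it holds, which is the overhead a compressed representation is meant to remove.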

That’s it for the brief introduction. Next, let’s put the module into practice and analyze its performance. Using the same data as in the previous article, the core code is as follows:

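import marisa_trie

patterns = ['an', 'un', 'is']
trie = marisa_trie.Trie(patterns)
text = 'In quiet solitude, peace is found,Where thoughts can wander, unbound.'
results = []
# Try every start position and collect the patterns that begin there.
for i in range(len(text)):
    # prefixes() returns the stored keys that are prefixes of text[i:]
    matches = trie.prefixes(text[i:])
    for match in matches:
        results.append((match, i, i + len(match)))
print(results)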

The resulting output looks like this:

[('is', 25, 27), ('un', 30, 32), ('an', 50, 52), ('an', 54, 56), ('un', 61, 63), ('un', 65, 67)]

Each element of the result is a triple. The first item is the pattern string that was found. The next two numbers are the start index and the end index (exclusive) of that match in the original text.
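To make the index convention concrete, the following small check slices the text with each reported pair of indices and confirms that the slice equals the matched pattern. It uses only the text and the output shown above, so it runs without marisa-trie.

```python
text = 'In quiet solitude, peace is found,Where thoughts can wander, unbound.'
results = [('is', 25, 27), ('un', 30, 32), ('an', 50, 52),
           ('an', 54, 56), ('un', 61, 63), ('un', 65, 67)]

for match, start, end in results:
    # The end index is exclusive, so an ordinary slice recovers the match.
    assert text[start:end] == match
    print(f"{match!r} at [{start}, {end}) -> {text[start:end]!r}")
```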

Next, let’s measure the query matching performance of this module. The test code below generates random strings of increasing length:

import random

# The 52 ASCII letters used as the sampling alphabet
base_list=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U',
        'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p',
        'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
print(len(base_list))
# Number of 10-character chunks per test string (lengths 10 to 1,000,000)
all_list=[1,10,100,1000,10000,100000]
t_list=[]
for one_all in all_list:
    string=""
    for i in range(one_all):
        # sample() draws without replacement: 10 distinct letters per chunk
        string+="".join(random.sample(base_list,10))
    print("string_length: ", len(string))
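Because the test strings are random, two runs of the benchmark will not see the same input. If repeatable runs are wanted, one option is to wrap the generation in a function that takes a seed. The sketch below is an assumption about how that could look, not part of the original test code; the function name `make_text` and its parameters are hypothetical, and `string.ascii_letters` stands in for `base_list` (the same 52 letters in a different order).

```python
import random
import string


def make_text(num_chunks, chunk_len=10, seed=None):
    """Build a random string from chunks of distinct letters."""
    rng = random.Random(seed)          # private generator, reproducible per seed
    alphabet = string.ascii_letters    # same 52 letters as base_list
    return "".join(
        "".join(rng.sample(alphabet, chunk_len)) for _ in range(num_chunks)
    )


text = make_text(100, seed=0)
print(len(text))   # 1000
```

Calling `make_text` twice with the same seed returns the same string, so timings from different runs can be compared on identical input.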

The query targets are also built at random: 1000 patterns, each one_num characters long, as shown below:

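one_num = 1  # length of each query substring; the experiments use 1 through 10
one_patterns = []
for i in range(1000):
    # each pattern consists of one_num distinct random letters
    one_str="".join(random.sample(base_list, one_num))
    one_patterns.append(one_str)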

Build a marisa_trie matching query as follows:

trie = marisa_trie.Trie(one_patterns)
results = []
# Scan every start position of the generated string
for i in range(len(string)):
    matches = trie.prefixes(string[i:])
    for match in matches:
        results.append((match, i, i + len(match)))
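The article does not show how the running time was recorded. One plausible way, given here as a sketch and not as the original measurement code, is to wrap the scan in a function and time a single call with `time.perf_counter`. The helper names `find_all`, `timed` and `naive_prefixes` are hypothetical. A simple `str.startswith` matcher stands in for the trie so that the sketch runs on its own; with marisa-trie, the bound method `trie.prefixes` would be passed in its place.

```python
import time


def find_all(prefix_search, text):
    """Scan `text` and collect (match, start, end) for every hit.

    `prefix_search(s)` must return the patterns that are prefixes of `s`;
    with marisa-trie this would be the bound method `trie.prefixes`.
    """
    results = []
    for i in range(len(text)):
        for match in prefix_search(text[i:]):
            results.append((match, i, i + len(match)))
    return results


def timed(func, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# Stand-in matcher so the sketch runs without extra dependencies.
patterns = ["an", "un", "is"]


def naive_prefixes(s):
    return [p for p in patterns if s.startswith(p)]


hits, seconds = timed(find_all, naive_prefixes, "peace is found, unbound")
print(len(hits), f"{seconds:.6f}s")
```

Note that `text[i:]` copies the rest of the string at every position, which becomes expensive for very long inputs. Slicing only as many characters as the longest pattern, for example `text[i:i + max_len]`, returns the same matches while keeping the scan close to linear.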

Next come the experiments. We also want to examine how the length of the query substring affects the running time, so we ran a separate experiment for each length, as follows:

[Figures: results for query substring lengths 1 through 10, one chart per length]

Judging from the experimental results, there is no obvious relationship between the query performance of marisa_trie and the length of the query substring. We ran 10 groups of tests with substring lengths from 1 to 10 under otherwise identical conditions, and the overall trend was the same in every group.

For a more intuitive comparison, I also plotted the results for all ten lengths together as a single set of curves.

If you are interested, you can give it a try and I believe you will find out how powerful it is.

Origin blog.csdn.net/Together_CZ/article/details/132823574