Data structure: string

1. String

String: a finite sequence of zero or more characters. It is generally written as s = "a1a2…an", where n ≥ 0.

  • The length of the string: The number n of characters in the string is called the length of the string.

  • Empty string: A string composed of zero characters is called the "empty string". Its length is 0, and it can be written as "".

  • Substring: A sequence of consecutive characters taken from a string is called a "substring" of that string. Two kinds of substring are special: a substring of length k that starts at position 0 is called a "prefix", and a substring of length k that ends at position n - 1 is called a "suffix".

  • Main string: The string that contains a substring is correspondingly called the "main string".
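These definitions correspond closely to Python's built-in str operations. The following is a minimal sketch; the variable names s, n, and k are chosen here only for illustration.

```python
s = "Hello world"
n = len(s)              # length of the string

empty = ""              # the empty string has length 0

k = 5
prefix = s[:k]          # substring of length k starting at position 0
suffix = s[n - k:]      # substring of length k ending at position n - 1
middle = s[3:8]         # a substring that is neither a prefix nor a suffix

# Every substring occurs somewhere inside its main string.
is_substring = middle in s
```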

For example:

str = "Hello world"

In the sample code, str is the variable name of the string, "Hello world" is the value of the string, and the length of the string is 11 (the space counts as a character).

[Figure: representation of the string str]

2. Comparison of strings

2.1 Comparison operation of strings

The ordering of two strings depends on the order of the characters at their corresponding positions. For example, take str1 = "abc" and str2 = "acc". Their first letters are both a, so we move on to the second letter. Since the letter b comes before the letter c, b < c, so we can say "abc" < "acc", that is, str1 < str2.

Comparisons between strings are determined by the "character encoding" of the characters that make up each string. A character's encoding is its sequence number in the corresponding character set.

For two strings, their ordering can be defined by the following rules:

  • Starting from position 0 of both strings, compare the character encodings at corresponding positions in turn.

  • If the character encoding of str1[i] is equal to that of str2[i], move on to compare the next character.

  • If the character encoding of str1[i] is less than that of str2[i], then str1 < str2. For example: "abc" < "acc".

  • If the character encoding of str1[i] is greater than that of str2[i], then str1 > str2. For example: "bcd" > "bad".

  • If the comparison reaches the end of one string while the other string still has characters left:

      • If len(str1) < len(str2), then str1 < str2. For example: "abc" < "abcde".

      • If len(str1) > len(str2), then str1 > str2. For example: "abcde" > "abc".

  • If the characters at every position have equal encodings and the two strings have the same length, then str1 == str2. For example: "abcd" == "abcd".
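Python's comparison operators on str follow the same rules, which allows a quick check of the examples above. This is only a sketch: first_difference is a hypothetical helper written here for illustration, not a standard function.

```python
def first_difference(str1, str2):
    """Return the first index where the two strings differ, or None if
    one string is a prefix of the other (or they are equal)."""
    for i, (c1, c2) in enumerate(zip(str1, str2)):
        if ord(c1) != ord(c2):
            return i
    return None

# Position 1 decides the result: ord("b") is smaller than ord("c").
i = first_difference("abc", "acc")
decided_by_code = ord("abc"[i]) < ord("acc"[i])

# No differing position exists here, so the shorter string is the smaller one.
j = first_difference("abc", "abcde")

checks = ["abc" < "acc", "bcd" > "bad", "abc" < "abcde", "abcd" == "abcd"]
```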

According to the above rules, we can define a strcmp method and specify:

  • When str1 < str2, the strcmp method returns -1.

  • When str1 == str2, the strcmp method returns 0.

  • When str1 > str2, the strcmp method returns 1.

2.2 String comparison code

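def strcmp(str1, str2):
    # Walk both strings in step, comparing character codes position by position.
    index1, index2 = 0, 0
    while index1 < len(str1) and index2 < len(str2):
        if ord(str1[index1]) == ord(str2[index2]):
            # Equal so far: move on to the next character.
            index1 += 1
            index2 += 1
        elif ord(str1[index1]) < ord(str2[index2]):
            return -1
        else:
            return 1

    # One string ran out (or both did): the shorter string is the smaller one.
    if len(str1) < len(str2):
        return -1
    elif len(str1) > len(str2):
        return 1
    else:
        return 0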
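Because Python's built-in string comparison follows the same rules, the three-way result can also be derived from the operators. The sketch below uses a hypothetical name, strcmp_builtin, and may be useful as a reference when testing a hand-written strcmp.

```python
def strcmp_builtin(str1, str2):
    # True and False behave as 1 and 0, so the difference is -1, 0, or 1.
    return (str1 > str2) - (str1 < str2)

results = [
    strcmp_builtin("abc", "acc"),
    strcmp_builtin("bcd", "bad"),
    strcmp_builtin("abc", "abcde"),
    strcmp_builtin("abcde", "abc"),
    strcmp_builtin("abcd", "abcd"),
]
print(results)
```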

2.3 Character encoding of strings

Take ASCII, the encoding used for the most common characters in computers, as an example. Early on, a code table of 128 characters, ASCII, was defined for computer systems. The ASCII table includes uppercase and lowercase English letters, digits, and some symbols. Each character corresponds to a code: for example, the code of the uppercase letter A is 65, and the code of the lowercase letter a is 97. Later, to accommodate special characters, ASCII was extended to 8 bits, giving 256 codes.

ASCII is adequate for English-based text, but it cannot encode Chinese. To solve this, China formulated Chinese encoding standards such as GB2312, GBK, and GB18030. But there are hundreds of languages and scripts in the world, and separate national standards inevitably conflict with one another, which led to Unicode. The most commonly used Unicode encoding is UTF-8. UTF-8 encodes a Unicode character into 1 to 4 bytes depending on its code point (the original design allowed up to 6 bytes). Commonly used English letters are encoded in 1 byte, and Chinese characters usually take 3 bytes.
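The code values and byte lengths mentioned above can be checked with Python's built-in ord, chr, and str.encode. This is a small illustration; the variable names are chosen here.

```python
# Code points: ord() maps a character to its number, chr() maps back.
code_upper_a = ord("A")
code_lower_a = ord("a")
letter = chr(65)

# UTF-8 byte lengths vary with the character's code point.
ascii_bytes = "A".encode("utf-8")
chinese_bytes = "\u4e2d".encode("utf-8")      # the Chinese character U+4E2D
emoji_bytes = "\U0001F600".encode("utf-8")    # a code point outside the basic plane

print(code_upper_a, code_lower_a, letter)
print(len(ascii_bytes), len(chinese_bytes), len(emoji_bytes))
```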

2.4 String storage structure

Like a linear list, a string can be stored in either a "sequential storage structure" or a "chain storage structure" (linked storage).
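To make the two options concrete, here is a rough sketch. The Node class and the helper names to_chain and chain_to_str are hypothetical, introduced only for illustration.

```python
class Node:
    """One character plus a link to the next node (chain storage)."""
    def __init__(self, char, next_node=None):
        self.char = char
        self.next = next_node

def to_chain(s):
    """Build a linked chain of nodes from a string and return its head."""
    head = None
    for ch in reversed(s):
        head = Node(ch, head)
    return head

def chain_to_str(head):
    """Walk the chain and join the characters back into a string."""
    chars = []
    node = head
    while node is not None:
        chars.append(node.char)
        node = node.next
    return "".join(chars)

# Sequential storage: characters sit in consecutive slots and are indexed directly.
sequential = list("abc")

# Chain storage: reaching position i means following i links from the head.
chain = to_chain("abc")
```

The sequential form gives direct access by index, while the chained form trades that for cheap insertion and deletion once a node has been located.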

3. String matching problem

String matching: also known as "pattern matching". It can be understood simply as: given strings T and p, find the substring p in the main string T. The main string T is also called the "text string", and the substring p is also called the "pattern string".

String matching is one of the most important string problems. According to the number of pattern strings, string matching problems are divided into the "single pattern string matching problem" and the "multiple pattern string matching problem".

3.1 Single pattern string matching problem

Single pattern matching problem: given a text string T = t1t2…tn and a specific pattern string p = p1p2…pm, find all occurrences of the pattern string p in the text string T.
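As a baseline for the problem statement, a brute-force search simply tries every start position. This is a minimal sketch rather than one of the named algorithms; find_all is a hypothetical name, and an empty pattern is assumed to have no occurrences.

```python
def find_all(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []  # assumption: an empty pattern is treated as having no matches
    positions = []
    # Try each window of length m; an empty range results when m > n.
    for start in range(n - m + 1):
        if text[start:start + m] == pattern:
            positions.append(start)
    return positions
```

For example, find_all("abababa", "aba") reports the overlapping occurrences at positions 0, 2, and 4. The worst-case cost is proportional to n times m, which is what the algorithms in this section aim to improve.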

According to the different ways of searching for pattern strings in text, single pattern matching algorithms can be divided into the following three types:

  • Prefix-based search method: read the text characters in the search window one by one from front to back (in the forward direction of the text), and search for the longest common prefix of the text in the window and the pattern string. The well-known KMP algorithm and the faster Shift-Or algorithm use this method.

  • Suffix-based search method: read the text characters in the search window one by one from back to front (in the reverse direction of the text), and search for the longest common suffix of the text in the window and the pattern string. This kind of search can skip some text characters and therefore has sublinear average time complexity. The famous BM (Boyer-Moore) algorithm, as well as the Horspool algorithm and the Sunday algorithm, use this method.

  • Substring-based search method: read the text characters in the search window one by one from back to front (in the reverse direction of the text), and search for the longest string that is both a suffix of the text in the window and a substring of the pattern string. Like the suffix-based method, this method also has sublinear average time complexity. Its main disadvantage is that all substrings of the pattern string need to be recognized, which is a rather complicated problem. The Rabin-Karp algorithm, BDM algorithm, BNDM algorithm, and BOM algorithm use this idea. Among them, the Rabin-Karp algorithm performs the substring search using hashing.
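To illustrate the first two families, here are compact sketches of KMP (prefix-based) and Horspool (suffix-based). These are simplified teaching versions under the assumption of non-empty patterns; the function names build_next, kmp_search, and horspool_search are chosen here and do not come from any library.

```python
def build_next(pattern):
    """nxt[i] is the length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    nxt = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = nxt[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        nxt[i] = k
    return nxt

def kmp_search(text, pattern):
    """Prefix-based: scan the text left to right and never move backwards in it."""
    if not pattern:
        return []
    nxt = build_next(pattern)
    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = nxt[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = nxt[k - 1]  # keep going to find overlapping matches
    return matches

def horspool_search(text, pattern):
    """Suffix-based: compare each window right to left, then shift the window
    according to the text character under the window's last position."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Distance from the rightmost occurrence of each character (excluding the
    # last pattern position) to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    matches = []
    pos = 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            matches.append(pos)
        pos += shift.get(text[pos + m - 1], m)
    return matches
```

Both functions return the same positions for the same input. The difference is that horspool_search may jump over several text characters at once when the character under the end of the window does not occur in the pattern.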

3.2 Multi-pattern string matching problem

Multi-pattern matching problem: given a text string T and a set of pattern strings P = {p^1, p^2, …, p^r}, where each pattern string p^i is a string defined over a finite alphabet, find all occurrences of all pattern strings in the set P in the text string T.

Some strings in the pattern set P may be substrings, prefixes, or suffixes of other strings in the set, or exactly equal to them. The simplest way to solve the multi-pattern string matching problem is to run a "single-pattern string matching algorithm" r times, once per pattern. Each pattern is then preprocessed separately and the text is scanned r times, so the worst-case cost of both the preprocessing phase and the search phase grows with the number of patterns.
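The simple approach of searching once per pattern can be sketched as follows. The name find_each is hypothetical, the built-in str.find stands in for whichever single-pattern algorithm is preferred, and patterns are assumed to be non-empty.

```python
def find_each(text, patterns):
    """Search the text once per pattern; returns {pattern: [start, ...]}."""
    result = {}
    for p in patterns:
        positions = []
        start = text.find(p)
        while start != -1:
            positions.append(start)
            # Resume one character later so overlapping matches are found too.
            start = text.find(p, start + 1)
        result[p] = positions
    return result

hits = find_each("ushers", ["he", "she", "his", "hers"])
print(hits)
```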

Extending the ideas of single-pattern string matching to the multi-pattern problem, multi-pattern string matching algorithms can likewise be divided into the following three types according to how they search for pattern strings in the text:

  • Prefix-based search method: search from front to back (in the forward direction of the text), reading text characters one by one and recognizing them with an automaton built on P. For each text position, compute the longest string that is both a suffix of the text read so far and a prefix of some pattern string in P. The famous AC automaton (Aho-Corasick) algorithm and the Multiple Shift-And algorithm use this method.

  • Suffix-based search method: search from back to front (in the reverse direction of the text), looking for suffixes of the pattern strings. The current text position is moved according to the next occurrence of the suffix. This method avoids reading all text characters. The Set Horspool algorithm and the Wu-Manber algorithm use this method.

  • Substring-based search method: search from back to front (in the reverse direction of the text), looking for substrings within the prefixes of length min(len(p^i)) of the pattern strings, that is, the length of the shortest pattern string, in order to decide how far to move the current text position. This approach also avoids reading all text characters. The Multiple BNDM algorithm, SBDM algorithm, and SBOM algorithm all use this method.
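As an illustration of the prefix-based family, the following is a compact sketch of an Aho-Corasick style automaton. It is a simplified teaching version, not a tuned implementation: the names build_automaton and search_all are chosen here, states are plain list indices, and patterns are assumed to be non-empty.

```python
from collections import deque

def build_automaton(patterns):
    """Build the trie (goto), failure links (fail), and outputs (out)."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    # Breadth-first pass: a state's failure link points to the state for the
    # longest proper suffix of its string that is also a prefix of a pattern.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            # Patterns ending at the failure state also end here.
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out

def search_all(text, patterns):
    """Return (start, pattern) for every occurrence, in one pass over text."""
    goto, fail, out = build_automaton(patterns)
    found = []
    s = 0
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for p in out[s]:
            found.append((i - len(p) + 1, p))
    return found

matches = search_all("ushers", ["he", "she", "his", "hers"])
print(matches)
```

Unlike searching once per pattern, this reads each text character only once, with the failure links playing the same role as the next table in KMP.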


Origin blog.csdn.net/m0_64087341/article/details/129678959