Master Regular Expressions - Practical Tips on Regular Expressions

Table of contents

1. Match consecutive lines

1. Use dotall mode

2. Use non-dotall mode

2. Match IP address

1. Match numbers from 0-255

2. The first segment must be non-zero

3. Merge four sections

4. Determine application scenarios

3. Process file names

1. Remove the path at the beginning of the file name

2. Get the file name from the path

3. Path and file name

4. Match symmetrical brackets

5. Guard against unexpected matches

6. Match text within delimiters

7. Remove blank characters at the beginning and end of the text

8. HTML related examples

1. Match HTML Tag

3. Check the HTTP URL

4. Verify hostname

5. Extracting URLs in the real world

9. Maintain data coordination

1. Maintain alignment with expectations

2. Coordination should be ensured even when there is mismatch.

3. Use \G to ensure coordination

4. The significance of this example

10. Parse CSV files

1. Decompose the driving process

2. Another way

3. Further improve efficiency

4. Other formats


1. Match consecutive lines

        A common reason to match several consecutive lines of text is that one logical line is split across several physical lines, each of which (except the last) ends with a backslash.

mysql> set @s:=
    -> 'SRC=array.c buildin.c eval.c field.c gawkmisc.c io.c main.c\\
    '>          missing.c msg.c node.c re.c version.c';
Query OK, 0 rows affected (0.00 sec)

mysql> select @s\G
*************************** 1. row ***************************
@s: SRC=array.c buildin.c eval.c field.c gawkmisc.c io.c main.c\
         missing.c msg.c node.c re.c version.c
1 row in set (0.00 sec)

1. Use dotall mode

        This is simple, because in dotall mode the dot also matches newlines.

mysql> set @r:='^\\w+=.*';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s, @r, 'n') c, regexp_extract(@s, @r, 'n') s\G
*************************** 1. row ***************************
@r: ^\w+=.*
 c: 1
 s: SRC=array.c buildin.c eval.c field.c gawkmisc.c io.c main.c\
         missing.c msg.c node.c re.c version.c
1 row in set (0.00 sec)

2. Use non-dotall mode

        Another approach is to focus on the characters that are actually allowed to match at each point. Within a logical line, the expected text is either an ordinary character (anything except a backslash or a newline), a backslash followed by some other character, or a backslash followed by a newline. Note that inside a MySQL string literal each backslash must be written as two backslashes.

mysql> set @r:='^\\w+=([^\\n\\\\]|\\\\.|\\\\\\n)*';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s\G
*************************** 1. row ***************************
@r: ^\w+=([^\n\\]|\\.|\\\n)*
 c: 1
 s: SRC=array.c buildin.c eval.c field.c gawkmisc.c io.c main.c\
         missing.c msg.c node.c re.c version.c
1 row in set (0.01 sec)
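
        As a cross-check outside MySQL, the two approaches can be sketched with Python's re module. This is an assumption of this example, not part of the original session: re.DOTALL plays the role of the 'n' flag, and the variable names and the extra CFLAGS line are illustrative only.

```python
import re

# Logical line split over two physical lines; the first ends with a backslash.
s = ("SRC=array.c buildin.c eval.c field.c gawkmisc.c io.c main.c\\\n"
     "         missing.c msg.c node.c re.c version.c")

# Dotall approach: with re.DOTALL the dot also matches the newline.
dotall = re.compile(r"^\w+=.*", re.DOTALL)

# Non-dotall approach: an ordinary character, a backslash plus any
# non-newline character, or a backslash plus a newline.
explicit = re.compile(r"^\w+=([^\n\\]|\\.|\\\n)*")

dotall_match = dotall.search(s).group(0)
explicit_match = explicit.search(s).group(0)

# Hypothetical follow-up line that is NOT a continuation of the first.
extra = s + "\nCFLAGS=-O2"
dotall_extra = dotall.search(extra).group(0)
explicit_extra = explicit.search(extra).group(0)

print(dotall_match == s, explicit_match == s)
print(dotall_extra == extra, explicit_extra == s)
```

        On the original string both patterns return the whole logical line. With an unrelated line appended, the dotall pattern runs on to the end of the text, while the explicit pattern stops at the first newline that is not preceded by a backslash.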

2. Match IP address

        Analyze IP address rules:

  • Four numbers separated by dots.
  • Each number is between 0-255 (inclusive).
  • The first number cannot be 0.

1. Match numbers from 0-255

([01]?\d\d?|2[0-4]\d|25[0-5])

        The first branch matches the one-digit numbers 0-9, the two-digit numbers 00-99, and the three-digit numbers 000-199 (those starting with 0 or 1); the second branch matches 200-249; and the third branch matches 250-255.

2. The first segment must be non-zero

(?!0+\.)([01]?\d\d?|2[0-4]\d|25[0-5])

        A negative lookahead specifies that the first segment cannot be 0., 00., 000., and so on.

3. Merge four sections

mysql> set @r:='^(?!0+\\.)([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.((([01]?\\d\\d?|2[0-4]\\d|25[0-5]))\\.){2}(([01]?\\d\\d?|2[0-4]\\d|25[0-5]))$'; 
Query OK, 0 rows affected (0.00 sec)

mysql> set @s:='0.1.1.1';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s\G
*************************** 1. row ***************************
@r: ^(?!0+\.)([01]?\d\d?|2[0-4]\d|25[0-5])\.((([01]?\d\d?|2[0-4]\d|25[0-5]))\.){2}(([01]?\d\d?|2[0-4]\d|25[0-5]))$
 c: 0
 s: 
1 row in set (0.01 sec)

mysql> set @s:='255.255.255.255';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s\G
*************************** 1. row ***************************
@r: ^(?!0+\.)([01]?\d\d?|2[0-4]\d|25[0-5])\.((([01]?\d\d?|2[0-4]\d|25[0-5]))\.){2}(([01]?\d\d?|2[0-4]\d|25[0-5]))$
 c: 1
 s: 255.255.255.255
1 row in set (0.00 sec)

mysql> set @s:='001.255.255.255';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s\G
*************************** 1. row ***************************
@r: ^(?!0+\.)([01]?\d\d?|2[0-4]\d|25[0-5])\.((([01]?\d\d?|2[0-4]\d|25[0-5]))\.){2}(([01]?\d\d?|2[0-4]\d|25[0-5]))$
 c: 1
 s: 001.255.255.255
1 row in set (0.00 sec)

mysql> set @s:='001.255.255.256';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s\G
*************************** 1. row ***************************
@r: ^(?!0+\.)([01]?\d\d?|2[0-4]\d|25[0-5])\.((([01]?\d\d?|2[0-4]\d|25[0-5]))\.){2}(([01]?\d\d?|2[0-4]\d|25[0-5]))$
 c: 0
 s: 
1 row in set (0.00 sec)
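
        The same four test strings can be checked with a minimal Python sketch. It assumes Python's re module rather than MySQL; OCTET, IP and is_ip are illustrative names, and non-capturing groups are used only to keep the pattern short.

```python
import re

# One segment in the range 0-255 (the same three branches as above).
OCTET = r"(?:[01]?\d\d?|2[0-4]\d|25[0-5])"

# re.ASCII limits \d to 0-9; the lookahead rejects 0., 00., 000., ...
IP = re.compile(rf"^(?!0+\.){OCTET}\.(?:{OCTET}\.){{2}}{OCTET}$", re.ASCII)

def is_ip(text):
    """Return True if text is a dotted quad accepted by the pattern."""
    return IP.search(text) is not None

for candidate in ["0.1.1.1", "255.255.255.255",
                  "001.255.255.255", "001.255.255.256"]:
    print(candidate, is_ip(candidate))
```

        Building the pattern from a named segment makes it easier to see that the four segments are identical apart from the lookahead on the first one.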

4. Determine application scenarios

        The regular expression above relies on the anchors ^ and $ to work properly; without them it can match the wrong text.
 

mysql> set @r:='(?!0+\\.)([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.((([01]?\\d\\d?|2[0-4]\\d|25[0-5]))\\.){2}(([01]?\\d\\d?|2[0-4]\\d|25[0-5]))';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s:='ip=72123.3.21.993';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s\G
*************************** 1. row ***************************
@r: (?!0+\.)([01]?\d\d?|2[0-4]\d|25[0-5])\.((([01]?\d\d?|2[0-4]\d|25[0-5]))\.){2}(([01]?\d\d?|2[0-4]\d|25[0-5]))
 c: 1
 s: 123.3.21.99
1 row in set (0.01 sec)

mysql> set @s:='ip=123.3.21.223';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s\G
*************************** 1. row ***************************
@r: (?!0+\.)([01]?\d\d?|2[0-4]\d|25[0-5])\.((([01]?\d\d?|2[0-4]\d|25[0-5]))\.){2}(([01]?\d\d?|2[0-4]\d|25[0-5]))
 c: 1
 s: 123.3.21.22
1 row in set (0.00 sec)

        To avoid matching such embedded text, you must at least ensure that there are no digits or periods on either side of the matched text, which can be done with negative lookaround.
 

mysql> set @r:='(?<![\\d.])((?!0+\\.)([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.((([01]?\\d\\d?|2[0-4]\\d|25[0-5]))\\.){2}(([01]?\\d\\d?|2[0-4]\\d|25[0-5])))(?![\\d.])';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s:='ip=72123.3.21.993';
Query OK, 0 rows affected (0.01 sec)

mysql> select @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s\G
*************************** 1. row ***************************
@r: (?<![\d.])((?!0+\.)([01]?\d\d?|2[0-4]\d|25[0-5])\.((([01]?\d\d?|2[0-4]\d|25[0-5]))\.){2}(([01]?\d\d?|2[0-4]\d|25[0-5])))(?![\d.])
 c: 0
 s: 
1 row in set (0.00 sec)

mysql> set @s:='ip=123.3.21.223';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s\G
*************************** 1. row ***************************
@r: (?<![\d.])((?!0+\.)([01]?\d\d?|2[0-4]\d|25[0-5])\.((([01]?\d\d?|2[0-4]\d|25[0-5]))\.){2}(([01]?\d\d?|2[0-4]\d|25[0-5])))(?![\d.])
 c: 1
 s: 123.3.21.223
1 row in set (0.00 sec)
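
        The effect of the two lookarounds can be reproduced in a short Python sketch. It assumes Python's re module; BARE, GUARDED and first are hypothetical names chosen for this illustration.

```python
import re

OCTET = r"(?:[01]?\d\d?|2[0-4]\d|25[0-5])"
CORE = rf"(?!0+\.){OCTET}\.(?:{OCTET}\.){{2}}{OCTET}"

# Unanchored pattern: can match inside a longer run of digits and dots.
BARE = re.compile(CORE, re.ASCII)

# Guarded pattern: no digit or period directly before or after the match.
GUARDED = re.compile(rf"(?<![\d.]){CORE}(?![\d.])", re.ASCII)

def first(pattern, text):
    """Return the first match of pattern in text, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None

for text in ["ip=72123.3.21.993", "ip=123.3.21.223"]:
    print(text, first(BARE, text), first(GUARDED, text))
```

        Note that the trailing lookahead also forces the last segment to backtrack from 22 to 223 in the second string, which the unguarded pattern has no reason to do.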

3. Process file names

1. Remove the path at the beginning of the file name

        For example, change /usr/local/bin/gcc to gcc.

mysql> set @s:='/usr/local/bin/gcc';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='^.*/';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_replace(@s,@r,'');
+--------------------------+
| regexp_replace(@s,@r,'') |
+--------------------------+
| gcc                      |
+--------------------------+
1 row in set (0.00 sec)

        Because the star is greedy, .* first matches the entire line and then backtracks to the last slash to complete the match. Always think about what happens when a match fails. In this case, a failed match means the string contains no slash, so nothing is replaced and the string is unchanged, which is exactly what is needed.
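
        Both the successful and the failing case can be sketched in Python. This assumes Python's re.sub in place of regexp_replace; strip_dir is a hypothetical helper name.

```python
import re

def strip_dir(path):
    # ^.*/ is greedy: .* runs to the end of the line, then backtracks
    # to the last slash. Without a slash nothing matches and the
    # string is returned unchanged.
    return re.sub(r"^.*/", "", path)

print(strip_dir("/usr/local/bin/gcc"))
print(strip_dir("file.txt"))
```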

        To keep this efficient, remember how an NFA engine works. Suppose you forget the ^ at the beginning of the regular expression and apply it to a string that happens to contain no slash. The NFA then proceeds as follows.

        The regex engine starts searching at the beginning of the string. .* runs to the end of the string, but must then back off one character at a time looking for a slash. Eventually it has given back every character it matched and still cannot match. At this point the engine knows that there is no match starting at the beginning of the string, but it is far from finished. The transmission (the part of the engine that advances the starting position) now moves to the second character of the target string and tries the entire regular expression again. In theory, this scan-and-backtrack cycle is repeated at every position in the string.

        If the string is very long, this can mean a lot of backtracking. A DFA does not have this problem. The regular expression engine of MySQL 8 is a traditional NFA. In practice, a reasonably optimized transmission recognizes that almost any regular expression beginning with .* that fails to match at the beginning of a string cannot match anywhere else, so it tries only once, at the beginning of the string. Still, it is wiser to state this explicitly in the regular expression, which is exactly what this example does.

2. Get the file name from the path

mysql> set @s:='/usr/local/bin/perl';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='([^/]*)$';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_substr(@s, @r);
+-----------------------+
| regexp_substr(@s, @r) |
+-----------------------+
| perl                  |
+-----------------------+
1 row in set (0.00 sec)

        This time the anchor is not just an optimization; the anchor at the end is necessary for a correct match. This regular expression can always match: its only requirement is that the string has an end position for $ to match.

        In an NFA, ([^/]*)$ is inefficient. Even the short '/usr/local/bin/perl' requires more than forty backtracks before a match is found. Consider the attempt that starts at local. [^/]* matches through the second l and stops at the slash. Then $ is tried, and fails, at each of the saved states for l, a, c, o, and l in turn. The whole process is then repeated starting from ocal, then cal, and so on.

        For this task, a built-in MySQL function gives a better implementation:

mysql> select substring_index('/usr/local/bin/perl','/',-1);
+-----------------------------------------------+
| substring_index('/usr/local/bin/perl','/',-1) |
+-----------------------------------------------+
| perl                                          |
+-----------------------------------------------+
1 row in set (0.00 sec)
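
        The same contrast between a regular expression and a plain string function can be sketched in Python. This is an illustration under the assumption that str.rsplit is an acceptable stand-in for substring_index; both helper names are hypothetical.

```python
import re

def basename_regex(path):
    # [^/]*$ always matches: at worst it matches the empty string
    # at the very end of the path.
    return re.search(r"[^/]*$", path).group(0)

def basename_split(path):
    # Plain string splitting: keep whatever follows the last slash.
    return path.rsplit("/", 1)[-1]

for p in ["/usr/local/bin/perl", "file.txt", "/usr/bin/"]:
    print(p, repr(basename_regex(p)), repr(basename_split(p)))
```

        The two helpers agree on all three inputs, including the path that ends with a slash, where both return the empty string.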

3. Path and file name

mysql> set @r:='^(.*)/([^/]*)$';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s:='/usr/local/bin/perl';
Query OK, 0 rows affected (0.00 sec)

mysql> select if (t,regexp_replace(@s, @r, '$1'),'.') path, 
    ->        if (t,regexp_replace(@s, @r, '$2'),@s) filename 
    ->   from (select instr(@s,'/') t) t;
+----------------+----------+
| path           | filename |
+----------------+----------+
| /usr/local/bin | perl     |
+----------------+----------+
1 row in set (0.00 sec)

        The complete path is divided into two parts: the directory path and the file name. .* first captures all the text, leaving no characters for / and $2. The only reason .* gives back characters is the backtracking it performs while trying to match /([^/]*)$. The characters it gives back are left for the following [^/]*. So $1 is the directory that contains the file, and $2 is the file name.

        This expression has one problem: it requires at least one slash in the string. If you apply it to file.txt, nothing matches, so both the path and the file name come back as the original string. Therefore the subquery uses the instr function to check first whether a slash is present.
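
        A minimal Python sketch of the same split, including the no-slash fallback, might look like this. It assumes Python's re module; SPLIT and split_path are illustrative names, and the '.' default mirrors the query above.

```python
import re

SPLIT = re.compile(r"^(.*)/([^/]*)$")

def split_path(full):
    """Return (directory, file name); '.' is used when there is no slash."""
    m = SPLIT.search(full)
    if m is None:
        # No slash at all: the whole string is the file name.
        return ".", full
    return m.group(1), m.group(2)

print(split_path("/usr/local/bin/perl"))
print(split_path("file.txt"))
```

        Testing the match object takes the place of the instr check: a failed match is handled explicitly instead of letting the original string pass through unchanged.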

4. Match symmetrical brackets

        To match parentheses, you might try the following regular expressions:

  1. \(.*\) Parentheses and any characters between them.
  2. \([^)]*\) From an opening parenthesis to the nearest closing parenthesis.
  3. \([^()]*\) From an opening parenthesis to the nearest closing parenthesis, with no opening parentheses allowed in between.

        The results of applying these expressions to a simple string are shown below.

mysql> set @s:='var = foo(bar(this), 3.7) + 2 * (that - 1);';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r1:='\\(.*\\)';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r2:='\\([^)]*\\)';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r3:='\\([^()]*\\)';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_substr(@s,@r1) s1,regexp_substr(@s,@r2) s2,regexp_substr(@s,@r3) s3;
+-----------------------------------+------------+--------+
| s1                                | s2         | s3     |
+-----------------------------------+------------+--------+
| (bar(this), 3.7) + 2 * (that - 1) | (bar(this) | (this) |
+-----------------------------------+------------+--------+
1 row in set (0.00 sec)

        The part that needs to be matched is (bar(this), 3.7). As you can see, the first regular expression matches too much. .* is error-prone, so whenever you use it, make sure you really want a star applied to the dot. Usually .* is not the right choice. The second regular expression matches too little, and the third matches (this), which is not what is needed.

        None of these three expressions is suitable. The real problem is that in most systems regular expressions cannot match arbitrarily deep nesting. They can, however, match nested parentheses up to a specific depth. For example, the regular expression that handles one level of nesting is:

\([^()]*(\([^()]*\)[^()]*)*\)

        test:
 

mysql> set @s:='var = foo(bar(this), 3.7) + 2 * (that - 1);';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='\\([^()]*(\\([^()]*\\)[^()]*)*\\)';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_substr(@s,@r);
+----------------------+
| regexp_substr(@s,@r) |
+----------------------+
| (bar(this), 3.7)     |
+----------------------+
1 row in set (0.00 sec)

        Extending this pattern to deeper nesting quickly becomes horribly complex.
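
        The four candidate patterns can be compared side by side in a short Python sketch. It assumes Python's re module; the dictionary keys are labels invented for this illustration.

```python
import re

s = "var = foo(bar(this), 3.7) + 2 * (that - 1);"

patterns = {
    "greedy":    r"\(.*\)",                          # matches too much
    "no_close":  r"\([^)]*\)",                       # matches too little
    "no_parens": r"\([^()]*\)",                      # innermost pair only
    "one_level": r"\([^()]*(\([^()]*\)[^()]*)*\)",   # one level of nesting
}

# First match of each pattern in the sample string.
results = {name: re.search(p, s).group(0) for name, p in patterns.items()}

for name, text in results.items():
    print(f"{name:10} {text}")
```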

5. Guard against unexpected matches

        Suppose you want a regular expression that matches a number, either an integer or a floating-point number, possibly with a leading minus sign. '-?[0-9]*\.?[0-9]*' matches numbers such as 1, -272.37, 129238843., 191919, and even -.0. However, this expression also matches 'this has no number', 'nothing here', and the empty string.

mysql> set @r:='-?[0-9]*\\.?[0-9]*';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s1:='1';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s2:='-272.37';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s3:='129238843.';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s4:='191919';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s5:='-.0';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s6:='this has no number';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s7:='nothing here';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s8:='';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s1, @r, '') c, regexp_extract(@s1, @r, '') s;
+-------------------+------+------+
| @r                | c    | s    |
+-------------------+------+------+
| -?[0-9]*\.?[0-9]* |    2 | 1,   |
+-------------------+------+------+
1 row in set (0.00 sec)

mysql> select @r, regexp_count(@s2, @r, '') c, regexp_extract(@s2, @r, '') s;
+-------------------+------+----------+
| @r                | c    | s        |
+-------------------+------+----------+
| -?[0-9]*\.?[0-9]* |    2 | -272.37, |
+-------------------+------+----------+
1 row in set (0.00 sec)

mysql> select @r, regexp_count(@s3, @r, '') c, regexp_extract(@s3, @r, '') s;
+-------------------+------+-------------+
| @r                | c    | s           |
+-------------------+------+-------------+
| -?[0-9]*\.?[0-9]* |    2 | 129238843., |
+-------------------+------+-------------+
1 row in set (0.00 sec)

mysql> select @r, regexp_count(@s4, @r, '') c, regexp_extract(@s4, @r, '') s;
+-------------------+------+---------+
| @r                | c    | s       |
+-------------------+------+---------+
| -?[0-9]*\.?[0-9]* |    2 | 191919, |
+-------------------+------+---------+
1 row in set (0.00 sec)

mysql> select @r, regexp_count(@s5, @r, '') c, regexp_extract(@s5, @r, '') s;
+-------------------+------+------+
| @r                | c    | s    |
+-------------------+------+------+
| -?[0-9]*\.?[0-9]* |    2 | -.0, |
+-------------------+------+------+
1 row in set (0.00 sec)

mysql> select @r, regexp_count(@s6, @r, '') c, regexp_extract(@s6, @r, '') s;
+-------------------+------+--------------------+
| @r                | c    | s                  |
+-------------------+------+--------------------+
| -?[0-9]*\.?[0-9]* |   19 | ,,,,,,,,,,,,,,,,,, |
+-------------------+------+--------------------+
1 row in set (0.00 sec)

mysql> select @r, regexp_count(@s7, @r, '') c, regexp_extract(@s7, @r, '') s;
+-------------------+------+--------------+
| @r                | c    | s            |
+-------------------+------+--------------+
| -?[0-9]*\.?[0-9]* |   13 | ,,,,,,,,,,,, |
+-------------------+------+--------------+
1 row in set (0.00 sec)

mysql> select @r, regexp_count(@s8, @r, '') c, regexp_extract(@s8, @r, '') s;
+-------------------+------+------+
| @r                | c    | s    |
+-------------------+------+------+
| -?[0-9]*\.?[0-9]* |    1 |      |
+-------------------+------+------+
1 row in set (0.00 sec)

        Look carefully at this expression: no part of it is required to match. If there is a number at the beginning of the string, the regular expression does match it. But because nothing is required, it can also match the empty string at the beginning of each of the other examples. In fact, it matches the empty string at the beginning of 'num 123', because that empty string occurs earlier than the number.

mysql> set @s9:='num 123';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s9, @r, '') c, regexp_extract(@s9, @r, '') s;
+-------------------+------+----------+
| @r                | c    | s        |
+-------------------+------+----------+
| -?[0-9]*\.?[0-9]* |    6 | ,,,,123, |
+-------------------+------+----------+
1 row in set (0.00 sec)
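
        The empty matches are easy to see in a small Python sketch. It assumes Python 3.7 or later, whose handling of empty matches gives the same counts as the session above; LOOSE and all_matches are illustrative names.

```python
import re

# Every element is optional, so the pattern can match the empty string.
LOOSE = re.compile(r"-?[0-9]*\.?[0-9]*")

def all_matches(text):
    """Return every match, including empty ones, in order."""
    return [m.group(0) for m in LOOSE.finditer(text)]

print(all_matches("1"))
print(all_matches("num 123"))
print(len(all_matches("this has no number")))
```

        Each position where no number starts contributes one empty match, so an 18-character string with no digits produces 19 of them.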

        A floating-point number must have at least one digit, otherwise it is not a legal value. Assume for now that there is at least one digit before the decimal point (this restriction is removed later). The digits are then quantified with a plus sign: '-?[0-9]+'.

        To match an optional decimal point and the digits that follow it, note that the fractional digits must come immediately after the decimal point. If you simply write '\.?[0-9]*', then '[0-9]*' can match whether or not the decimal point is present.

        The solution is to apply the question mark to the decimal point together with the fractional digits, not to the decimal point alone: '(\.[0-9]*)?'. Inside this group the decimal point is required. If there is no decimal point, '[0-9]*' gets no chance to match.

        Combining these gives '-?[0-9]+(\.[0-9]*)?'.

mysql> set @r:='-?[0-9]+(\\.[0-9]*)?';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s1:='1';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s2:='-272.37';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s3:='129238843.';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s4:='191919';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s5:='-.0';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s6:='this has no number';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s7:='nothing here';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s8:='';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s1, @r, '') c, regexp_extract(@s1, @r, '') s;
+---------------------+------+------+
| @r                  | c    | s    |
+---------------------+------+------+
| -?[0-9]+(\.[0-9]*)? |    1 | 1    |
+---------------------+------+------+
1 row in set (0.00 sec)

mysql> select @r, regexp_count(@s2, @r, '') c, regexp_extract(@s2, @r, '') s;
+---------------------+------+---------+
| @r                  | c    | s       |
+---------------------+------+---------+
| -?[0-9]+(\.[0-9]*)? |    1 | -272.37 |
+---------------------+------+---------+
1 row in set (0.00 sec)

mysql> select @r, regexp_count(@s3, @r, '') c, regexp_extract(@s3, @r, '') s;
+---------------------+------+------------+
| @r                  | c    | s          |
+---------------------+------+------------+
| -?[0-9]+(\.[0-9]*)? |    1 | 129238843. |
+---------------------+------+------------+
1 row in set (0.00 sec)

mysql> select @r, regexp_count(@s4, @r, '') c, regexp_extract(@s4, @r, '') s;
+---------------------+------+--------+
| @r                  | c    | s      |
+---------------------+------+--------+
| -?[0-9]+(\.[0-9]*)? |    1 | 191919 |
+---------------------+------+--------+
1 row in set (0.00 sec)

mysql> select @r, regexp_count(@s5, @r, '') c, regexp_extract(@s5, @r, '') s;
+---------------------+------+------+
| @r                  | c    | s    |
+---------------------+------+------+
| -?[0-9]+(\.[0-9]*)? |    1 | 0    |
+---------------------+------+------+
1 row in set (0.00 sec)

mysql> select @r, regexp_count(@s6, @r, '') c, regexp_extract(@s6, @r, '') s;
+---------------------+------+------+
| @r                  | c    | s    |
+---------------------+------+------+
| -?[0-9]+(\.[0-9]*)? |    0 |      |
+---------------------+------+------+
1 row in set (0.00 sec)

mysql> select @r, regexp_count(@s7, @r, '') c, regexp_extract(@s7, @r, '') s;
+---------------------+------+------+
| @r                  | c    | s    |
+---------------------+------+------+
| -?[0-9]+(\.[0-9]*)? |    0 |      |
+---------------------+------+------+
1 row in set (0.00 sec)

mysql> select @r, regexp_count(@s8, @r, '') c, regexp_extract(@s8, @r, '') s;
+---------------------+------+------+
| @r                  | c    | s    |
+---------------------+------+------+
| -?[0-9]+(\.[0-9]*)? |    0 |      |
+---------------------+------+------+
1 row in set (0.00 sec)

        This expression cannot match '.007', because it requires at least one digit in the integer part. If the integer part is allowed to be empty, the fractional part must be changed at the same time, otherwise the expression could again match the empty string (the very problem it set out to solve).

        The solution is to add an alternative for the case that is not yet covered: '-?([0-9]+(\.[0-9]*)?|\.[0-9]+)'.

mysql> set @r:='-?([0-9]+(\\.[0-9]*)?|\\.[0-9]+)';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s5:='-.0';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s5, @r, '') c, regexp_extract(@s5, @r, '') s;
+--------------------------------+------+------+
| @r                             | c    | s    |
+--------------------------------+------+------+
| -?([0-9]+(\.[0-9]*)?|\.[0-9]+) |    1 | -.0  |
+--------------------------------+------+------+
1 row in set (0.00 sec)

        Although this expression is much better than the original, it still matches inside text such as '2003.04.12'. To strike a balance between matching the desired text and ignoring the undesired text, you need to understand the actual text being matched. The regular expression for a floating-point number should be embedded in a larger regular expression, such as '^...$' or 'num\s*=\s*...$'.

mysql> set @s10:='2003.04.12';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s10, @r, '') c, regexp_extract(@s10, @r, '') s;
+--------------------------------+------+-------------+
| @r                             | c    | s           |
+--------------------------------+------+-------------+
| -?([0-9]+(\.[0-9]*)?|\.[0-9]+) |    2 | 2003.04,.12 |
+--------------------------------+------+-------------+
1 row in set (0.00 sec)

mysql> set @r:='^-?([0-9]+(\\.[0-9]*)?|\\.[0-9]+)$';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s10, @r, '') c, regexp_extract(@s10, @r, '') s;
+----------------------------------+------+------+
| @r                               | c    | s    |
+----------------------------------+------+------+
| ^-?([0-9]+(\.[0-9]*)?|\.[0-9]+)$ |    0 |      |
+----------------------------------+------+------+
1 row in set (0.00 sec)
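
        The difference between searching for the pattern and requiring it to cover the whole string can be sketched in Python. This assumes Python's re module, with fullmatch standing in for the '^...$' wrapper; NUMBER, is_number and find_numbers are illustrative names.

```python
import re

NUMBER = re.compile(r"-?([0-9]+(\.[0-9]*)?|\.[0-9]+)")

def is_number(text):
    """True only if the entire string is a number."""
    return NUMBER.fullmatch(text) is not None

def find_numbers(text):
    """All number-like pieces found anywhere in the string."""
    return [m.group(0) for m in NUMBER.finditer(text)]

print(find_numbers("2003.04.12"))
print(is_number("2003.04.12"), is_number("-.0"), is_number("129238843."))
```

        Unanchored, the pattern splits the date into two number-like pieces; required to cover the whole string, it rejects the date while still accepting the earlier test values.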

6. Match text within delimiters

        Matching text delimited by certain characters is a common task. In addition to text within double quotes and IP addresses, other typical examples include:

  • Match a C comment between '/*' and '*/'.
  • Match an HTML tag, which is the text within angle brackets, such as <CODE>.
  • Extract the text enclosed by HTML tags, such as 'super exciting' in the HTML code 'a<I>super exciting</I>offer!'.
  • Match a line of a .mailrc file. Each line of this file has the following format:
    alias nickname email-address

    For example 'alias jeff [email protected]' (here the delimiters are the whitespace between the parts and the newline at the end).

  • Match a quoted string that may contain escaped quotes, for example 'a passport needs a "2\"x3\" likeness" of the holder'.
  • Parse CSV (comma-separated values) files.

        In summary, the steps for handling these tasks are:

  1. Match the opening delimiter.
  2. Match the main text (all text before the closing delimiter).
  3. Match the closing delimiter.

        Consider the 2\"x3\" example. The closing delimiter here is a quotation mark. Matching the opening and closing delimiters is easy, and the first regular expression that comes to mind is '".*"'. In this example, the default greediness of the quantifier happens to carry the match past the escaped double quotes in the text.

mysql> set @s:='a passport needs a "2\\"x3\\" likeness" of the holder';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='".*"';
Query OK, 0 rows affected (0.00 sec)

mysql> select @s, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
+-----------------------------------------------------+------+--------------------+
| @s                                                  | c    | s                  |
+-----------------------------------------------------+------+--------------------+
| a passport needs a "2\"x3\" likeness" of the holder |    1 | "2\"x3\" likeness" |
+-----------------------------------------------------+------+--------------------+
1 row in set (0.00 sec)

        Now consider a more general approach. Think carefully about which characters can appear in the body. If a character is not a quotation mark, that is, if it can be matched by '[^"]', it belongs to the body. If the character is a quotation mark and is preceded by a backslash, it also belongs to the body. Expressing this with a lookbehind for the "preceded by a backslash" case gives '"([^"]|(?<=\\)")*"', which matches the 2\"x3\" example completely.

mysql> set @s:='a passport needs a "2\\"x3\\" likeness" of the holder';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='"([^"]|(?<=\\\\)")*"';
Query OK, 0 rows affected (0.00 sec)

mysql> select @s, @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
+-----------------------------------------------------+--------------------+------+--------------------+
| @s                                                  | @r                 | c    | s                  |
+-----------------------------------------------------+--------------------+------+--------------------+
| a passport needs a "2\"x3\" likeness" of the holder | "([^"]|(?<=\\)")*" |    1 | "2\"x3\" likeness" |
+-----------------------------------------------------+--------------------+------+--------------------+
1 row in set (0.00 sec)

        However, this example also shows how a seemingly correct expression can match unexpected text. Consider the text: Darth Symbol: "/-|-\\" or "[^-^]"

        The expected match is "/-|-\\", but the expression matches "/-|-\\" or ".

mysql> set @s:='"/-|-\\\\" or "[^-^]"';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='"([^"]|(?<=\\\\)")*"';
Query OK, 0 rows affected (0.00 sec)

mysql> select @s, @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
+---------------------+--------------------+------+---------------+
| @s                  | @r                 | c    | s             |
+---------------------+--------------------+------+---------------+
| "/-|-\\" or "[^-^]" | "([^"]|(?<=\\)")*" |    1 | "/-|-\\" or " |
+---------------------+--------------------+------+---------------+
1 row in set (0.00 sec)

        There is indeed a backslash before the closing quotation mark of the first string, but that backslash is itself escaped. It does not escape the double quotation mark after it, so the quotation mark really does mark the end of the quoted text. The lookbehind cannot recognize an escaped backslash, and trying to handle any number of '\\' before the quotation mark with lookbehind only makes matters worse. In this example, a lazy quantifier matches the desired result directly.

mysql> set @s:='"/-|-\\\\" or "[^-^]"';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='".*?"';
Query OK, 0 rows affected (0.00 sec)

mysql> select @s, @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
+---------------------+-------+------+------------------+
| @s                  | @r    | c    | s                |
+---------------------+-------+------+------------------+
| "/-|-\\" or "[^-^]" | ".*?" |    2 | "/-|-\\","[^-^]" |
+---------------------+-------+------+------------------+
1 row in set (0.00 sec)
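
        The lookbehind and lazy patterns can be compared in a short Python sketch. It assumes Python's re module; LOOKBEHIND, LAZY and the variable names are illustrative. Raw strings are used so that each backslash in the sample text is written once.

```python
import re

# The text really contains two backslashes before the second quote.
s = r'"/-|-\\" or "[^-^]"'
passport = r'a passport needs a "2\"x3\" likeness" of the holder'

LOOKBEHIND = re.compile(r'"([^"]|(?<=\\)")*"')
LAZY = re.compile(r'".*?"')

# The lookbehind treats the quote after the escaped backslash as escaped.
too_much = LOOKBEHIND.search(s).group(0)

# The lazy quantifier stops at the first quote it can.
lazy_hits = LAZY.findall(s)
lazy_passport = LAZY.search(passport).group(0)

print(too_much)
print(lazy_hits)
print(lazy_passport)
```

        The lazy pattern gives the desired result for this text, but on the passport example it stops at the first escaped quote, so it is not a general solution either.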

        A more careful approach is to list everything that may appear in the body: escaped characters ('\\.') and any character that is not a quotation mark ('[^"]'). This gives '"(\\.|[^"])*"'.

mysql> set @s:='"/-|-\\\\" or "[^-^]"';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='"(\\\\.|[^"])*"';
Query OK, 0 rows affected (0.00 sec)

mysql> select @s, @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
+---------------------+---------------+------+------------------+
| @s                  | @r            | c    | s                |
+---------------------+---------------+------+------------------+
| "/-|-\\" or "[^-^]" | "(\\.|[^"])*" |    2 | "/-|-\\","[^-^]" |
+---------------------+---------------+------+------------------+
1 row in set (0.00 sec)

        That case is now handled, but the expression still has a problem and can still produce an unexpected match. Consider the following text: "You need a 2\"x3\" Photo.

        It should not match, because there is no closing delimiter, but it does.

mysql> set @s:='"You need a 2\\"x3\\" Photo.';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='"(\\\\.|[^"])*"';
Query OK, 0 rows affected (0.00 sec)

mysql> select @s, @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
+----------------------------+---------------+------+---------------------+
| @s                         | @r            | c    | s                   |
+----------------------------+---------------+------+---------------------+
| "You need a 2\"x3\" Photo. | "(\\.|[^"])*" |    1 | "You need a 2\"x3\" |
+----------------------------+---------------+------+---------------------+
1 row in set (0.00 sec)

        The expression first matches everything after the opening quotation mark up to the end of the text, but finds no closing quotation mark, so it backtracks. When it gets back to the backslash after the 3, '[^"]' matches that backslash, and the quotation mark that follows is then taken as the closing quotation mark.

        The important lesson of this example is that when backtracking through an alternation produces an undesired match, any correct matches were probably just an accident of the order of the alternation branches.

        In fact, if the two branches of this regular expression are swapped, it incorrectly matches any string that contains an escaped double quotation mark.

mysql> set @s:='"You need a 2\\"x3\\" Photo.';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='"([^"]|\\\\.)*"';
Query OK, 0 rows affected (0.00 sec)

mysql> select @s, @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
+----------------------------+---------------+------+-----------------+
| @s                         | @r            | c    | s               |
+----------------------------+---------------+------+-----------------+
| "You need a 2\"x3\" Photo. | "([^"]|\\.)*" |    1 | "You need a 2\" |
+----------------------------+---------------+------+-----------------+
1 row in set (0.00 sec)

        The real problem is that the alternation branches overlap in what they can match. The solution is to make the branches mutually exclusive. Here that means ensuring that a backslash cannot be matched in any other way, that is, changing '[^"]' to '[^\\"]'. This acknowledges that both double quotation marks and backslashes are "special" in this text and must be handled separately. The result is '"(\\.|[^\\"])*"'.

mysql> set @s:='"You need a 2\\"x3\\" Photo.';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='"(\\\\.|[^\\\\"])*"';
Query OK, 0 rows affected (0.00 sec)

mysql> select @s, @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
+----------------------------+-----------------+------+------+
| @s                         | @r              | c    | s    |
+----------------------------+-----------------+------+------+
| "You need a 2\"x3\" Photo. | "(\\.|[^\\"])*" |    0 |      |
+----------------------------+-----------------+------+------+
1 row in set (0.00 sec)
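
        The three variants above can be cross-checked outside MySQL. The following is a minimal sketch in Python, assuming its standard re module, which is also a backtracking engine and handles these patterns the same way. The helper name first_match is made up for this illustration.

```python
import re

# The unterminated sample text from above: "You need a 2\"x3\" Photo.
text = r'"You need a 2\"x3\" Photo.'

def first_match(pattern, s):
    """Return the first matched substring, or None when nothing matches."""
    m = re.search(pattern, s)
    return m.group(0) if m else None

# Overlapping branches: after backtracking, [^"] swallows a backslash.
overlapping = first_match(r'"(\\.|[^"])*"', text)
# Same branches in the opposite order: the wrong match shows up even sooner.
reversed_order = first_match(r'"([^"]|\\.)*"', text)
# Mutually exclusive branches: a backslash can only be consumed by \\.
exclusive = first_match(r'"(\\.|[^\\"])*"', text)

print(overlapping)     # "You need a 2\"x3\"
print(reversed_order)  # "You need a 2\"
print(exclusive)       # None
```

        Only the last pattern refuses to match, which agrees with the count of 0 in the MySQL session.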

        If the engine supports possessive quantifiers or atomic grouping, this expression can instead be written as '"(\\.|[^"])*+"' or '"(?>(\\.|[^"])*)"'. Both forms prevent the engine from backtracking into the place where the problem arose, so either one is sufficient.

mysql> set @s:='"You need a 2\\"x3\\" Photo.';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='"(\\\\.|[^"])*+"';
Query OK, 0 rows affected (0.00 sec)

mysql> select @s, @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
+----------------------------+----------------+------+------+
| @s                         | @r             | c    | s    |
+----------------------------+----------------+------+------+
| "You need a 2\"x3\" Photo. | "(\\.|[^"])*+" |    0 |      |
+----------------------------+----------------+------+------+
1 row in set (0.00 sec)

mysql> set @r:='"(?>(\\\\.|[^"])*)"';
Query OK, 0 rows affected (0.00 sec)

mysql> select @s, @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
+----------------------------+-------------------+------+------+
| @s                         | @r                | c    | s    |
+----------------------------+-------------------+------+------+
| "You need a 2\"x3\" Photo. | "(?>(\\.|[^"])*)" |    0 |      |
+----------------------------+-------------------+------+------+
1 row in set (0.00 sec)

        Possessive quantifiers and atomic grouping also solve this problem more efficiently, because a failed match is reported sooner.

7. Remove blank characters at the beginning and end of the text

        Removing whitespace characters from the beginning and end of text is a frequently performed task. Overall the best approach is to use two substitutions.

mysql> set @s1:='';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s2:=' ';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s3:='  aaa bbb  ccc';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s4:='aaa bbb  ccc   ';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s5:='  aaa bbb  ccc   ';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_replace(regexp_replace(@s1,'^\\s+',''),'\\s+$','') s1,
    ->        regexp_replace(regexp_replace(@s2,'^\\s+',''),'\\s+$','') s2,
    ->        regexp_replace(regexp_replace(@s3,'^\\s+',''),'\\s+$','') s3,
    ->        regexp_replace(regexp_replace(@s4,'^\\s+',''),'\\s+$','') s4,
    ->        regexp_replace(regexp_replace(@s5,'^\\s+',''),'\\s+$','') s5;
+------+------+---------------+---------------+---------------+
| s1   | s2   | s3            | s4            | s5            |
+------+------+---------------+---------------+---------------+
|      |      | aaa bbb  ccc  | aaa bbb  ccc  | aaa bbb  ccc  |
+------+------+---------------+---------------+---------------+
1 row in set (0.00 sec)

        For efficiency, '+' is used here instead of '*', because if there are actually no whitespace characters to be removed, there is no need to do the replacement.
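
        For readers who want to try the two-step approach outside MySQL, here is a minimal sketch in Python using the standard re module. The function name trim and the sample list are assumptions made for this illustration.

```python
import re

def trim(s):
    """Strip leading, then trailing whitespace with two simple substitutions."""
    s = re.sub(r'^\s+', '', s)      # '+' : nothing is replaced if nothing leads
    return re.sub(r'\s+$', '', s)   # same idea for trailing whitespace

samples = ['', ' ', '  aaa bbb  ccc', 'aaa bbb  ccc   ', '  aaa bbb  ccc   ']
print([trim(s) for s in samples])
```

        Each substitution is anchored at one end of the text, so the engine can decide almost immediately whether there is anything to remove.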

        For some reason, people seem to prefer solving the whole problem with a single regular expression. The following methods are given for comparison, to show how these expressions work and where they fall short; they are not recommended. In MySQL 8.0.16, replacing an empty string with this regular expression causes an error:

mysql> set @r:='^\\s*(.*?)\\s*$';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s1:='';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_replace(@s1,@r,'$1') s1;
ERROR 2013 (HY000): Lost connection to MySQL server during query
mysql> set @r:='^\\s*(.*?)\\s*$';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s2:=' ';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s3:='  aaa bbb  ccc';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s4:='aaa bbb  ccc   ';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s5:='  aaa bbb  ccc   ';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_replace(@s2,@r,'$1') s2,
    ->        regexp_replace(@s3,@r,'$1') s3,
    ->        regexp_replace(@s4,@r,'$1') s4,
    ->        regexp_replace(@s5,@r,'$1') s5;
+------+---------------+---------------+---------------+
| s2   | s3            | s4            | s5            |
+------+---------------+---------------+---------------+
|      | aaa bbb  ccc  | aaa bbb  ccc  | aaa bbb  ccc  |
+------+---------------+---------------+---------------+
1 row in set (0.00 sec)

        This expression is much slower than the two-step approach (about 5 times slower in Perl). It is inefficient because the lazy dot has to check for '\s*$' before every character it consumes, which requires a great deal of backtracking. The next expression also causes an error in MySQL 8.0.16 when it is applied to an empty string:

mysql> set @r:='^\\s*((?:.*\\S)?)\\s*$';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s1:='';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_replace(@s1,@r,'$1') s1;
ERROR 2013 (HY000): Lost connection to MySQL server during query
mysql> set @r:='^\\s*((?:.*\\S)?)\\s*$';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s2:=' ';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s3:='  aaa bbb  ccc';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s4:='aaa bbb  ccc   ';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s5:='  aaa bbb  ccc   ';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_replace(@s2,@r,'$1') s2,
    ->        regexp_replace(@s3,@r,'$1') s3,
    ->        regexp_replace(@s4,@r,'$1') s4,
    ->        regexp_replace(@s5,@r,'$1') s5;
+------+---------------+---------------+---------------+
| s2   | s3            | s4            | s5            |
+------+---------------+---------------+---------------+
|      | aaa bbb  ccc  | aaa bbb  ccc  | aaa bbb  ccc  |
+------+---------------+---------------+---------------+
1 row in set (0.01 sec)

        This expression looks more complicated than the previous one, but it takes only about twice as long as the two-step method. After '^\s*' matches the whitespace at the beginning of the text, '.*' immediately runs to the end of the text. The '\S' that follows forces it to backtrack until a non-whitespace character is found, leaving the trailing whitespace for the final '\s*$', which lies outside the capturing parentheses. The question mark after the non-capturing group is necessary: without it, the expression fails to match a line that contains only whitespace, and such a line is left unchanged.
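
        The role of the question mark can be checked with a small experiment. This is a sketch in Python's re module, not the MySQL setup used above; WITHOUT_Q is a hypothetical variant written only to show what goes wrong when the question mark is dropped.

```python
import re

WITH_Q = re.compile(r'^\s*((?:.*\S)?)\s*$')
WITHOUT_Q = re.compile(r'^\s*(.*\S)\s*$')   # hypothetical variant lacking the '?'

blank = '   '
padded = '  aaa bbb  ccc   '

trimmed_padded = WITH_Q.sub(r'\1', padded)
trimmed_blank = WITH_Q.sub(r'\1', blank)
untouched_blank = WITHOUT_Q.sub(r'\1', blank)   # no match, so nothing is replaced

print(repr(trimmed_padded))
print(repr(trimmed_blank))
print(repr(untouched_blank))
```

        Without the question mark, '\S' has nothing to match in a whitespace-only string, the overall match fails, and the substitution returns the text unchanged.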

mysql> set @r:='^\\s+|\\s+$';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s1:='';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s2:=' ';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s3:='  aaa bbb  ccc';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s4:='aaa bbb  ccc   ';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s5:='  aaa bbb  ccc   ';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_replace(@s1,@r,'') s1,
    ->        regexp_replace(@s2,@r,'') s2,
    ->        regexp_replace(@s3,@r,'') s3,
    ->        regexp_replace(@s4,@r,'') s4,
    ->        regexp_replace(@s5,@r,'') s5;
+------+------+---------------+---------------+---------------+
| s1   | s2   | s3            | s4            | s5            |
+------+------+---------------+---------------+---------------+
|      |      | aaa bbb  ccc  | aaa bbb  ccc  | aaa bbb  ccc  |
+------+------+---------------+---------------+---------------+
1 row in set (0.00 sec)

        This is the regular expression most people think of first, but its top-level alternation severely hampers optimizations that the engine might otherwise apply. It takes about 4 times as long as the two-step method.

        A simple double substitution is almost always the fastest, and obviously the easiest to understand.

8. HTML related examples

1. Match HTML Tag

        The most common way is to use '<[^>]+>' to match HTML tags. It usually works, for example to remove tags:

mysql> set @s:='<tag> aaa </tag>';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='<[^>]+>';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_replace(@s,@r,'');
+--------------------------+
| regexp_replace(@s,@r,'') |
+--------------------------+
|  aaa                     |
+--------------------------+
1 row in set (0.00 sec)

        This expression fails if the tag contains '>'. HTML does allow an unescaped '<' or '>' inside a quoted attribute value, as in <input name=dir value=">" >, and in that case the simple '<[^>]+>' stops too early.

mysql> set @s:='<input name=dir value=">">';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='<[^>]+>';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_replace(@s,@r,'');
+--------------------------+
| regexp_replace(@s,@r,'') |
+--------------------------+
| ">                       |
+--------------------------+
1 row in set (0.00 sec)

        '<...>' can contain both quoted text and unquoted "other stuff", where other stuff is any character except '>' and quotation marks. HTML attribute values may be quoted with single or double quotation marks, but a quotation mark of the same kind cannot be escaped inside the value, so '"[^"]*"' and ''[^']*'' are enough to match them. Combining these with the "other stuff" expression '[^'">]' gives '<("[^"]*"|'[^']*'|[^'">])*>'.

mysql> set @s:='<input name=dir value=">">';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='<("[^"]*"|\'[^\']*\'|[^\'">])*>';
Query OK, 0 rows affected (0.00 sec)

mysql> select @s, @r, regexp_replace(@s,@r,'');
+----------------------------+-----------------------------+--------------------------+
| @s                         | @r                          | regexp_replace(@s,@r,'') |
+----------------------------+-----------------------------+--------------------------+
| <input name=dir value=">"> | <("[^"]*"|'[^']*'|[^'">])*> |                          |
+----------------------------+-----------------------------+--------------------------+
1 row in set (0.00 sec)

        This expression treats each quoted part as a single unit and states clearly which characters are allowed at each position in the match. No two parts of the expression can match the same character, so there is no ambiguity, and there is no need to worry about unexpected matches "sneaking in" as they did in the earlier example.
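
        The difference between the simple and the quote-aware tag expressions is easy to reproduce. Below is a minimal sketch in Python's re module; the names SIMPLE_TAG and QUOTE_AWARE_TAG are invented for this illustration.

```python
import re

SIMPLE_TAG = re.compile(r'<[^>]+>')
# Quoted attribute values are consumed as whole units, so a '>' inside
# them cannot end the tag early.
QUOTE_AWARE_TAG = re.compile(r'''<("[^"]*"|'[^']*'|[^'">])*>''')

plain = '<tag> aaa </tag>'
tricky = '<input name=dir value=">">'

print(repr(SIMPLE_TAG.sub('', plain)))        # both handle ordinary tags
print(repr(QUOTE_AWARE_TAG.sub('', plain)))
print(repr(SIMPLE_TAG.sub('', tricky)))       # leaves the tail of the tag behind
print(repr(QUOTE_AWARE_TAG.sub('', tricky)))  # removes the whole tag
```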

        The first two branches use * rather than + inside the quotation marks, because a quoted string may be empty (for example 'alt=""'). The third branch '[^'">]' has no quantifier of its own; it is repeated only by the * outside the parentheses. Adding a plus sign to it, giving '([^'">]+)*', can lead to very unpleasant surprises, because the nested quantifiers let the engine try a huge number of ways to split the same text when no match is possible.

        There are also efficiency issues to consider with an NFA engine (such as the one in MySQL). Since the text matched by the parentheses is not used, they can be changed to non-capturing parentheses '(?:...)'. Because the branches do not overlap, if the final '>' cannot be matched, backtracking to try the other branches is futile: if one branch matches at a given position, none of the others can match there. The saved states are therefore not needed, and discarding them lets a failed match be reported sooner. To avoid the backtracking, use atomic grouping '(?>...)' instead of non-capturing parentheses, or use the possessive star '*+' as the quantifier.

        Suppose you need to extract URL and link text from a document, for example, http://www.oreilly.com and O'Reilly Media from the following text:

...<a href="http://www.oreilly.com">O'Reilly Media</a>...

        The contents of an <A> tag can be quite complex, so the task is done in two steps. The first step extracts the inside of the opening <A> tag (its attributes) together with the link text; the second step extracts the URL from those attributes.

        The regular expression to achieve the first step is:

'<a\b([^>]+)>(.*?)</a>'

        It puts the contents of the opening <A> tag into $1 and the link text into $2. A lazy quantifier is used for the link text. The tag itself should really be matched with the expression described in the previous section; the simple form '[^>]+' is used here only because it is easier to explain.

mysql> set @s:='<a href="http://www.oreilly.com">O\'Reilly Media</a>';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='<a\\b([^>]+)>(.*?)</a>';
Query OK, 0 rows affected (0.01 sec)

mysql> select @s, @r, regexp_replace(@s, @r, '$2', 1, 0, 'n');
+-----------------------------------------------------+-----------------------+-----------------------------------------+
| @s                                                  | @r                    | regexp_replace(@s, @r, '$2', 1, 0, 'n') |
+-----------------------------------------------------+-----------------------+-----------------------------------------+
| <a href="http://www.oreilly.com">O'Reilly Media</a> | <a\b([^>]+)>(.*?)</a> | O'Reilly Media                          |
+-----------------------------------------------------+-----------------------+-----------------------------------------+
1 row in set (0.00 sec)

        The match type 'n' enables dotall mode. MySQL's regular expression functions provide no way to return a single capturing group directly; it can only be obtained indirectly, by substitution with the regexp_replace function. To keep the result unambiguous, it is best to return just one capturing group per regexp_replace call. Obtaining all capturing groups this way clearly performs poorly: the engine already has the values of all capturing groups after applying the regular expression once, but MySQL provides no function that exposes them.

        If you prefer, you can get all capturing groups at once by joining them with a delimiter, for example regexp_replace(@s, @r, '$1|$2', 1, 0, 'n'), which uses | as the delimiter. For the result to be split again later, however, the original string must not contain the | character.

        Once the contents of the opening <A> tag are stored in $1, they can be examined with a separate regular expression. The URL is the value of the href attribute. HTML allows whitespace on either side of the equal sign, and the value may be quoted or unquoted. The regular expression that matches the URL is therefore:

\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^'">\s]+))

        Explanation:

  • \bhref matches the "href" attribute.
  • \s*=\s* matches "=", with optional whitespace on either side.
  • "([^"]*)" matches a double-quoted string.
  • '([^']*)' matches a single-quoted string.
  • ([^'">\s]+) matches other text: any characters except single and double quotation marks, >, and whitespace.

        Each alternation branch that matches a value has its own capturing parentheses, so that exactly the value text is captured. The outermost group does not need to capture, so non-capturing parentheses '(?:...)' are both clearer and more efficient. Because the entire href value must be captured, + is applied to the "other text" branch. This plus sign does not lead to strange results, because no quantifier applies directly to the alternation as a whole.

mysql> set @s:='<a href="http://www.oreilly.com">O\'Reilly Media</a>';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r1:='<a\\b([^>]+)>(.*?)</a>';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r2:='\\bhref\\s*=\\s*(?:"([^"]*)"|\'([^\']*)\'|([^\'">\\s]+))';
Query OK, 0 rows affected (0.00 sec)

mysql> select if(regexp_like(url,@r2),regexp_replace(url, @r2, '$1$2$3', 1, 0, 'n'),'') url, link
    ->   from (select trim(regexp_replace(@s, @r1, '$1', 1,0,'n')) url, regexp_replace(@s, @r1, '$2', 1,0,'n') link) t;
+------------------------+----------------+
| url                    | link           |
+------------------------+----------------+
| http://www.oreilly.com | O'Reilly Media |
+------------------------+----------------+
1 row in set (0.00 sec)

        The inner subquery performs the first step; its trim function removes the whitespace that '([^>]+)' captures right after the \b position in @r1. The outer query performs the second step and extracts the URL. Depending on the text, the URL ends up in $1, $2, or $3, while the other capturing groups are empty or undefined. Since the three branches are mutually exclusive, their references can simply be written together as '$1$2$3', and the replacement returns exactly the URL that is wanted.

        One more problem needs attention. When regexp_replace is used to obtain a capturing group indirectly and there is no match, the original string is returned, which is obviously wrong. To rule this out, the query uses MySQL's if function: regexp_like first checks whether there is a match, regexp_replace returns the capturing group only when there is one, and otherwise an empty string is returned. All this involves a lot of repeated work and is inefficient, simply because MySQL's regular expression functions cannot return a capturing group.
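
        In a language that exposes capturing groups directly, the same two-step extraction needs no substitution tricks. The following is a minimal sketch in Python's re module. The function name extract_links, the case-insensitive flag, and the extra sample links are assumptions added for this illustration and are not part of the MySQL example.

```python
import re

# Step 1: attributes of the opening tag in group 1, link text in group 2.
A_TAG = re.compile(r'<a\b([^>]+)>(.*?)</a>', re.IGNORECASE | re.DOTALL)
# Step 2: the href value is in exactly one of groups 1, 2, or 3.
HREF = re.compile(r'''\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^'">\s]+))''',
                  re.IGNORECASE)

def extract_links(html):
    """Return a list of (url, link text); url is '' when there is no href."""
    links = []
    for tag in A_TAG.finditer(html):
        attrs, text = tag.group(1), tag.group(2)
        href = HREF.search(attrs)
        # Pick whichever of the three mutually exclusive groups took part.
        url = next((g for g in href.groups() if g is not None), '') if href else ''
        links.append((url, text))
    return links

html = ('<a href="http://www.oreilly.com">O\'Reilly Media</a> '
        "<a href='b.html'>B</a> <a href=c.html>C</a> <a name=top>Top</a>")
print(extract_links(html))
```

        Because unmatched groups are reported as None, there is no need for a separate "did it match" test like the regexp_like call in the MySQL query.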

3. Check the HTTP URL

        Next, check whether the extracted URL is an HTTP URL. If so, break it into two parts: hostname and path. The hostname is whatever comes after '^http://' and before the first slash (if any), and the path is everything after that: '^http://([^/]+)(/.*)?$'.

        The URL may also contain a port number, which sits between the hostname and the path and starts with a colon: '^http://([^/:]+)(:(\d+))?(/.*)?$'.

mysql> set @s:='http://www.oreilly.com:8080/book/details/130986791?spm=1001.2014.3001.5501';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='^http://([^/:]+)(:(\\d+))?(/.*)?$';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_replace(@s, @r, '$1') host, regexp_replace(@s, @r, '$3') port, regexp_replace(@s, @r, '$4') path;
+-----------------+------+-------------------------------------------------+
| host            | port | path                                            |
+-----------------+------+-------------------------------------------------+
| www.oreilly.com | 8080 | /book/details/130986791?spm=1001.2014.3001.5501 |
+-----------------+------+-------------------------------------------------+
1 row in set (0.00 sec)
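
        The same decomposition can be written as a small function. This is a sketch in Python's re module; split_http_url is a hypothetical helper name, and the group numbers 1, 3, and 4 correspond to $1, $3, and $4 in the MySQL query.

```python
import re

HTTP_URL = re.compile(r'^http://([^/:]+)(:(\d+))?(/.*)?$')

def split_http_url(url):
    """Return (host, port, path), or None if url is not an http:// URL.

    port and path are None when the URL does not contain them.
    """
    m = HTTP_URL.match(url)
    if m is None:
        return None
    return m.group(1), m.group(3), m.group(4)

print(split_http_url(
    'http://www.oreilly.com:8080/book/details/130986791?spm=1001.2014.3001.5501'))
print(split_http_url('http://www.oreilly.com'))
print(split_http_url('ftp://www.oreilly.com'))
```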

4. Verify hostname

        Extract hostname from known text (such as a ready-made URL):

mysql> set @r:='https?://([^/:]+)(:(\\d+))?(/.*)?';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s:='http://www.google.com/';
Query OK, 0 rows affected (0.00 sec)

mysql> select @s, @r, regexp_replace(@s, @r, '$1') hostname;
+------------------------+----------------------------------+----------------+
| @s                     | @r                               | hostname       |
+------------------------+----------------------------------+----------------+
| http://www.google.com/ | https?://([^/:]+)(:(\d+))?(/.*)? | www.google.com |
+------------------------+----------------------------------+----------------+
1 row in set (0.00 sec)

        Accurately extract hostnames from arbitrary text:

mysql> set @r:='https?://([-a-z0-9]+(\\.[-a-z0-9]+)*\\.(com|edu|info))(:(\\d+))?(/.*)?';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s:='http://www.google.com/';
Query OK, 0 rows affected (0.00 sec)

mysql> select @s, @r, regexp_replace(@s, @r, '$1') hostname;
+------------------------+---------------------------------------------------------------------+----------------+
| @s                     | @r                                                                  | hostname       |
+------------------------+---------------------------------------------------------------------+----------------+
| http://www.google.com/ | https?://([-a-z0-9]+(\.[-a-z0-9]+)*\.(com|edu|info))(:(\d+))?(/.*)? | www.google.com |
+------------------------+---------------------------------------------------------------------+----------------+
1 row in set (0.00 sec)

        Regular expressions can also be used to validate hostnames. By rule, a hostname consists of dot-separated parts; each part is at most 63 characters long and may contain ASCII letters, digits, and hyphens, but cannot begin or end with a hyphen. In case-insensitive mode, a single part is therefore matched by '[a-z0-9]|[a-z0-9][-a-z0-9]{0,61}[a-z0-9]'. There are only a limited number of possible final suffixes (com, edu, uk, and so on). Combined, the following regular expression matches a syntactically valid hostname:
'^(?i)(?:[a-z0-9]\.|[a-z0-9][-a-z0-9]{0,61}[a-z0-9]\.)*(?:com|edu|gov|int|mil|net|org|biz|info|name|museum|coop|aero|[a-z][a-z])$'.

mysql> set @r:='^(?i)(?:[a-z0-9]\\.|[a-z0-9][-a-z0-9]{0,61}[a-z0-9]\\.)*(?:com|edu|gov|int|mil|net|org|biz|info|name|museum|coop|aero|[a-z][a-z])$';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s1:='ai';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s2:='www.google';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s3:='google.com';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s4:='www.google.com';
Query OK, 0 rows affected (0.00 sec)

mysql> select @s1 hostname, regexp_like(@s1, @r) isvalid
    ->  union all
    -> select @s2, regexp_like(@s2, @r)
    ->  union all
    -> select @s3, regexp_like(@s3, @r)
    ->  union all
    -> select @s4, regexp_like(@s4, @r);
+----------------+---------+
| hostname       | isvalid |
+----------------+---------+
| ai             |       1 |
| www.google     |       0 |
| google.com     |       1 |
| www.google.com |       1 |
+----------------+---------+
4 rows in set (0.00 sec)
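
        The validation can also be tried in Python. The sketch below is an adaptation, not the exact MySQL pattern: case-insensitivity is passed as a flag instead of the inline '(?i)', because newer Python versions reject an inline global flag that is not at the very start of the pattern. The name is_valid_hostname and the two extra negative samples are assumptions for this illustration.

```python
import re

HOSTNAME = re.compile(
    r'^(?:[a-z0-9]\.|[a-z0-9][-a-z0-9]{0,61}[a-z0-9]\.)*'
    r'(?:com|edu|gov|int|mil|net|org|biz|info|name|museum|coop|aero|[a-z][a-z])$',
    re.IGNORECASE)

def is_valid_hostname(name):
    """True if name is made of valid parts and ends in a known suffix."""
    return HOSTNAME.match(name) is not None

for name in ['ai', 'www.google', 'google.com', 'www.google.com',
             '-bad.com', 'bad-.com']:
    print(name, is_valid_hostname(name))
```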

5. Extracting URLs in the real world

        Recognizing hostnames and URLs in plain text is much harder than validating them. The approach below can extract several types of URLs (mailto, ftp, http, https, and so on) from text; the example handles ftp, http, and https. If 'http://' is found in the text, it must be the beginning of a URL, so the hostname can be matched with the simpler 'http://[-\w]+(\.\w[-\w]*)+', where '[-\w]' replaces '[-a-z0-9]'. Note that '\w' also matches underscores.

        However, URLs often do not start with http:// or mailto:. In that case the hostname must be matched more strictly: '(?i:[a-z0-9](?:[-a-z0-9]*[a-z0-9])?\.)+(?-i:com\b|edu\b|biz\b|gov\b|in(?:t|fo)\b|mil\b|net\b|org\b|[a-z][a-z]\b)'

        The hostname is followed by the path part, which uses a negative lookbehind to ensure that the URL does not end with the punctuation that ends a sentence, such as a period.

mysql> set @r_protocol:='(ftp|https?)://';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r_hostname:='[-\\w]+(\\.\\w[-\\w]*)+|(?i:[a-z0-9](?:[-a-z0-9]*[a-z0-9])?\\.)+(?-i:com\\b|edu\\b|biz\\b|gov\\b|in(?:t|fo)\\b|mil\\b|net\\b|org\\b|[a-z][a-z]\\b)';
Query OK, 0 rows affected (0.01 sec)

mysql> set @r_port:='(:\\d+)?';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r_path:='(/[-a-z0-9_:\\@&?=+,.!/~*\'%\\$]*(?<![.,?!]))?';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:=concat('\\b(',@r_protocol,@r_hostname,')',@r_port,@r_path);
Query OK, 0 rows affected (0.00 sec)

mysql> set @s:='https://www.tetet.com:8080/index.html?q=1';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_count(@s, @r, 'n') c, regexp_extract(@s, @r, 'n') s;
+------+-------------------------------------------+
| c    | s                                         |
+------+-------------------------------------------+
|    1 | https://www.tetet.com:8080/index.html?q=1 |
+------+-------------------------------------------+
1 row in set (0.00 sec)
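
        To see the negative lookbehind and the bare-hostname branch at work, the same parts can be assembled in Python. This is only a sketch under the assumption that Python's re module treats the pattern like MySQL's engine does for this input; the sample sentence and the function name extract_urls are invented for the illustration.

```python
import re

PROTOCOL = r'(ftp|https?)://'
HOSTNAME = (r'[-\w]+(\.\w[-\w]*)+'
            r'|(?i:[a-z0-9](?:[-a-z0-9]*[a-z0-9])?\.)+'
            r'(?-i:com\b|edu\b|biz\b|gov\b|in(?:t|fo)\b|mil\b|net\b|org\b|[a-z][a-z]\b)')
PORT = r'(:\d+)?'
# The lookbehind rejects a URL whose last character is sentence punctuation.
PATH = r"(/[-a-z0-9_:\@&?=+,.!/~*'%\$]*(?<![.,?!]))?"

# Same assembly as the concat() call in the MySQL session.
URL = re.compile(r'\b(' + PROTOCOL + HOSTNAME + ')' + PORT + PATH)

def extract_urls(text):
    """Return every URL-like substring found in text."""
    return [m.group(0) for m in URL.finditer(text)]

sample = ('See https://www.tetet.com:8080/index.html?q=1. '
          'Or visit www.oreilly.com today.')
print(extract_urls(sample))
```

        The period that ends the first sentence is left out of the URL, and the second URL is found even though it has no protocol prefix.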

9. Maintain data coordination

        Assume that the data to be processed is a run of consecutive 5-digit U.S. ZIP codes, and that the codes to extract are those beginning with 44. Here is a small sample, from which 44182 and 44272 should be extracted:
03824531449411615213441829505344272752010217443235

        The easiest one to think of is '\d{5}', which matches all postal codes. In MySQL, you only need to call the regexp_substr function in a loop. The focus here is on the regular expression itself, not the implementation mechanism of the language.

        Assuming all the data is well formed (an assumption that depends heavily on the specific case), '\d{5}' matches at every attempt throughout the parse, and the engine never needs to bump along (advance one character) and retry.

set @s:='03824531449411615213441829505344272752010217443235';
set @r:='\\d{5}';
with t1 as
(select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s),
t2 as
(with recursive tab1(lv) as
(select 1 lv union all select t1.lv + 1 from tab1 t1 where lv < length(@s)/5)
select lv from tab1),
t3 as
(select substring_index(substring_index(s,',',lv),',',-1) s from t1,t2)
select * from t3 where s like '44%';

        Changing '\d{5}' to '44\d{3}' to find the zip codes beginning with 44 does not work. After a match attempt fails, the engine bumps along by one character, so the search for '44' no longer starts at the first digit of a zip code, and '44\d{3}' incorrectly matches 44941 and 44323:

mysql> set @r:='44\\d{3}';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
+------+-------------------------+
| c    | s                       |
+------+-------------------------+
|    4 | 44941,44182,44272,44323 |
+------+-------------------------+
1 row in set (0.00 sec)

        What is needed is to keep the regex engine in sync with the data manually, so that unwanted zip codes are passed over. The key is to skip whole zip codes rather than letting the engine's bump-along advance one character at a time.
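
        The loss of alignment is easy to observe. The sketch below uses Python's re module on the same sample data; slicing the text into 5-character codes stands in for the MySQL query that splits the '\d{5}' matches, and the variable names are made up for this illustration.

```python
import re

zips = '03824531449411615213441829505344272752010217443235'

# Unaligned search: the engine advances one character at a time, so '44'
# may be found in the middle of a code or straddling two codes.
unaligned = re.findall(r'44\d{3}', zips)

# Aligned alternative: cut the text into 5-character codes, then filter.
codes = [zips[i:i + 5] for i in range(0, len(zips), 5)]
aligned = [code for code in codes if code.startswith('44')]

print(unaligned)
print(aligned)
```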

1. Maintain alignment with expectations

        Listed below are several ways to skip unwanted zip codes. Placing any of them in front of '44\d{3}' gives the desired result. The unwanted zip codes are matched inside non-capturing parentheses so that they can be skipped quickly, and the wanted zip code is captured in $1.

mysql> set @s:='03824531449411615213441829505344272752010217443235';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='\\d{5}';
Query OK, 0 rows affected (0.00 sec)

mysql> with t1 as
    -> (select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s),
    -> t2 as
    -> (with recursive tab1(lv) as
    -> (select 1 lv union all select t1.lv + 1 from tab1 t1 where lv < length(@s)/5)
    -> select lv from tab1),
    -> t3 as
    -> (select substring_index(substring_index(s,',',lv),',',-1) s from t1,t2)
    -> select * from t3 where s like '44%';
+-------+
| s     |
+-------+
| 44182 |
| 44272 |
+-------+
2 rows in set (0.00 sec)

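        The query above takes the brute-force route: it extracts every 5-digit code with '\d{5}' and then discards those that do not begin with 44. The same skipping can be done inside the regular expression by prefixing '(?:[0-35-9]\d{4}|\d[0-35-9]\d{3})*', which passes over every code whose first or second digit is not 4. Note that '(?:[0-35-9][0-35-9]\d{3})*' cannot be used, because it does not match (and therefore cannot skip) an unwanted zip code such as 43210.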

mysql> set @r:='(?:(?!44)\\d{5})*(44\\d{3})';
Query OK, 0 rows affected (0.00 sec)

mysql> with t1 as
    -> (select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s),
    -> t2 as
    -> (with recursive tab1(lv) as 
    -> (select 1 lv union all select t.lv + 1 from tab1 t,t1 where lv < t1.c)
    -> select lv from tab1)
    -> select regexp_replace(regexp_substr(@s, @r, 1, lv),@r,'$1') zip_44
    ->   from t1,t2;
+--------+
| zip_44 |
+--------+
| 44182  |
| 44272  |
| 44323  |
+--------+
3 rows in set (0.01 sec)

        This method also skips zip codes that do not begin with 44, and the idea is the same as in the previous method. Here a wanted zip code (one beginning with 44) makes the negative lookahead '(?!44)' fail, which stops the skipping.

mysql> set @r:='(?:\\d{5})*?(44\\d{3})';
Query OK, 0 rows affected (0.00 sec)

mysql> with t1 as
    -> (select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s),
    -> t2 as
    -> (with recursive tab1(lv) as 
    -> (select 1 lv union all select t.lv + 1 from tab1 t,t1 where lv < t1.c)
    -> select lv from tab1)
    -> select regexp_replace(regexp_substr(@s, @r, 1, lv),@r,'$1') zip_44
    ->   from t1,t2;
+--------+
| zip_44 |
+--------+
| 44182  |
| 44272  |
| 44323  |
+--------+
3 rows in set (0.00 sec)

        This method uses a lazy quantifier so that text is skipped only when necessary. The skipping expression is placed in front of the expression that really needs to match, and it consumes a zip code only when that expression fails. The lazy '(...)*?' is what makes this happen: because of the lazy quantifier, '(?:\d{5})' is not even tried until the expression after it fails. The star lets this repeat until a match is finally found, so only the text that should be skipped is skipped.

        Combined with '(44\d{3})', this expression extracts the zip codes beginning with 44 and automatically skips the others. It can be applied to the string repeatedly, because each match starts at the beginning of a zip code and ends at the end of one, so the next match is guaranteed to start at the beginning of a zip code as well, which is exactly what the regular expression expects.

        The first two methods rely on the * quantifier being greedy by default, and they do not match 44941 by mistake. The third method is lazy, so it skips unwanted zip codes one 5-character group at a time, and it does not match 44941 by mistake either. However, all three methods share a problem: 44323 is matched incorrectly once the engine bumps along. The following section analyzes why and shows how to fix it.

2. Coordination should be ensured even when there is mismatch.

        The previous regular expressions manually skip the zip codes that do not qualify. But once there are no more matches to find, the current attempt fails, and the engine's transmission naturally bumps along and retries from the next character, so an attempt can start in the middle of a zip code.

        Look at the sample data again. After 44272 is matched, no further aligned match exists in the target text, so the attempt from that position fails. But the overall match has not failed yet: the transmission bumps along and applies the regular expression starting from the next character, which breaks the alignment. After the fourth bump-along, the regex skips 10217 and incorrectly matches 44323.
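
        The loss of alignment can also be observed outside MySQL. The following is a minimal Python sketch, not part of the original MySQL session. The sample string ZIPS is an assumption (ten five-digit zip codes run together, chosen to be consistent with the results shown in this section), and the variable names are hypothetical. It records the offset at which each match of the lazy expression starts.

```python
import re

# Assumed sample data: ten 5-digit zip codes concatenated without separators.
ZIPS = "03824531449411615213441829503544272752010217443235"

LAZY = re.compile(r'(?:\d{5})*?(44\d{3})')

# Record (start offset, captured zip code) for every match the engine finds.
hits = [(m.start(), m.group(1)) for m in LAZY.finditer(ZIPS)]

# A match whose start offset is not a multiple of 5 began in the middle
# of a zip code, i.e. after the engine bumped along.
misaligned = [zip_code for start, zip_code in hits if start % 5 != 0]
```

        With this sample, the first two matches start at offsets that are multiples of 5, while the match that captures 44323 starts at offset 39, which is four characters past the end of the previous match.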

        All three expressions work as long as they are applied at the beginning of a zip code, but the bump-along of the transmission destroys the alignment. One fix is to make the bump-along unnecessary: in the first two methods, append the '?' quantifier to '(44\d{3})' so that it becomes optional (and greedy). The deliberately constructed '(?:(?!44)\d{5})*...' or '(?:[1235-9]\d{4}|\d[1235-9]\d{3})*...' then stops in only two situations: when a qualifying zip code follows, or at the end of the zip code string. '(44\d{3})?' matches a qualifying zip code if one is there and otherwise matches nothing, so no attempt ever fails and no bump-along is forced.

mysql> set @r:='(?:[1235-9]\\d{4}|\\d[1235-9]\\d{3})*(44\\d{3})?';
Query OK, 0 rows affected (0.00 sec)

mysql> with t1 as
    -> (select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s),
    -> t2 as
    -> (with recursive tab1(lv) as 
    -> (select 1 lv union all select t.lv + 1 from tab1 t,t1 where lv < t1.c-1)
    -> select lv from tab1)
    -> select regexp_replace(regexp_substr(@s, @r, 1, lv),@r,'$1') zip_44
    ->   from t1,t2;
+--------+
| zip_44 |
+--------+
| 44182  |
| 44272  |
|        |
+--------+
3 rows in set (0.00 sec)

mysql> set @r:='(?:(?!44)\\d{5})*(44\\d{3})?';
Query OK, 0 rows affected (0.00 sec)

mysql> with t1 as
    -> (select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s),
    -> t2 as
    -> (with recursive tab1(lv) as 
    -> (select 1 lv union all select t.lv + 1 from tab1 t,t1 where lv < t1.c-1)
    -> select lv from tab1)
    -> select regexp_replace(regexp_substr(@s, @r, 1, lv),@r,'$1') zip_44
    ->   from t1,t2;
+--------+
| zip_44 |
+--------+
| 44182  |
| 44272  |
|        |
+--------+
3 rows in set (0.01 sec)

        The 't1.c-1' drops the empty match at the end of the string. This method is still not perfect: even if the target string contains no qualifying zip code, or is empty, the match succeeds, which complicates the subsequent processing. Its advantage is speed: every attempt succeeds where it starts, so the transmission never has to bump along.
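
        The same behaviour can be sketched in Python, under the assumption that the sample string is the ZIPS value below (it is not shown in this section). In Python, a group that did not take part in a match is reported as None, which makes the "successful but useless" matches easy to filter out. The exact number of such matches depends on the engine.

```python
import re

# Assumed sample data: ten 5-digit zip codes concatenated without separators.
ZIPS = "03824531449411615213441829503544272752010217443235"

OPTIONAL = re.compile(r'(?:(?!44)\d{5})*(44\d{3})?')

matches = list(OPTIONAL.finditer(ZIPS))

# group(1) is None when the optional group matched nothing.
found = [m.group(1) for m in matches if m.group(1) is not None]
empty_results = len(matches) - len(found)

# The expression also "succeeds" on an empty string, as noted above.
matches_empty_string = OPTIONAL.search('') is not None
```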

        This method does not work for the third expression. In '(?:\d{5})*?' the * quantifier is lazy, and in '(44\d{3})?' the ? quantifier makes the group optional, so the expression can succeed without matching anything and produces many empty matches.


mysql> set @r:='(?:\\d{5})*?(44\\d{3})?';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
+------+--------------------------------------------------------+
| c    | s                                                      |
+------+--------------------------------------------------------+
|   35 | ,,,,,,,,44941,,,,,,,,44182,,,,,,44272,,,,,,,,,,44323,, |
+------+--------------------------------------------------------+
1 row in set (0.00 sec)

3. Use \G to ensure coordination

        A more general approach is to add '\G' to the beginning of each of the three expressions. If every match ends at the end of a qualifying zip code, the next match starts there without any bump-along. And if a bump-along does occur, the leading '\G' makes the attempt fail immediately, because in most regex flavors '\G' can match only at the position where the previous match ended, that is, only when no bump-along has taken place.

mysql> set @r:='\\G(?:[1235-9]\\d{4}|\\d[1235-9]\\d{3})*(44\\d{3})';
Query OK, 0 rows affected (0.00 sec)

mysql> with t1 as
    -> (select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s),
    -> t2 as
    -> (with recursive tab1(lv) as 
    -> (select 1 lv union all select t.lv + 1 from tab1 t,t1 where lv < t1.c)
    -> select lv from tab1)
    -> select regexp_replace(regexp_substr(@s, @r, 1, lv),@r,'$1') zip_44
    ->   from t1,t2;
+--------+
| zip_44 |
+--------+
| 44182  |
| 44272  |
+--------+
2 rows in set (0.00 sec)

mysql> set @r:='\\G(?:(?!44)\\d{5})*(44\\d{3})';
Query OK, 0 rows affected (0.01 sec)

mysql> with t1 as
    -> (select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s),
    -> t2 as
    -> (with recursive tab1(lv) as 
    -> (select 1 lv union all select t.lv + 1 from tab1 t,t1 where lv < t1.c)
    -> select lv from tab1)
    -> select regexp_replace(regexp_substr(@s, @r, 1, lv),@r,'$1') zip_44
    ->   from t1,t2;
+--------+
| zip_44 |
+--------+
| 44182  |
| 44272  |
+--------+
2 rows in set (0.00 sec)

mysql> set @r:='\\G(?:\\d{5})*?(44\\d{3})';
Query OK, 0 rows affected (0.00 sec)

mysql> with t1 as
    -> (select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s),
    -> t2 as
    -> (with recursive tab1(lv) as 
    -> (select 1 lv union all select t.lv + 1 from tab1 t,t1 where lv < t1.c)
    -> select lv from tab1)
    -> select regexp_replace(regexp_substr(@s, @r, 1, lv),@r,'$1') zip_44
    ->   from t1,t2;
+--------+
| zip_44 |
+--------+
| 44182  |
| 44272  |
+--------+
2 rows in set (0.01 sec)
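
        Python's re module has no '\G', but a similar effect can be approximated by calling Pattern.match(string, pos), which only tries the position given and never bumps along. The sketch below is an illustration under that assumption, not the MySQL method used above; ZIPS is an assumed sample string and anchored_findall is a hypothetical helper name. Two of the expressions are used, without the '\G' prefix.

```python
import re

# Assumed sample data: ten 5-digit zip codes concatenated without separators.
ZIPS = "03824531449411615213441829503544272752010217443235"

PATTERNS = [
    r'(?:(?!44)\d{5})*(44\d{3})',   # negative lookahead version
    r'(?:\d{5})*?(44\d{3})',        # lazy version
]

def anchored_findall(pattern, text):
    """Collect group 1 of successive matches, each forced to start
    exactly where the previous match ended (an emulation of \\G)."""
    regex = re.compile(pattern)
    pos = 0
    found = []
    while True:
        m = regex.match(text, pos)      # tries position pos only
        if m is None or m.end() == pos:
            break                       # no match here: stop instead of bumping along
        found.append(m.group(1))
        pos = m.end()
    return found

results = {p: anchored_findall(p, ZIPS) for p in PATTERNS}
```

        Because the loop stops at the first position where no match begins, the misaligned 44323 is never reached.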

4. The significance of this example

        This example is somewhat extreme, but it illustrates many techniques for keeping a regular expression in sync with the data. In practice, you might not use regular expressions for such a problem at all. For example, in MySQL 8 you can build a numeric auxiliary table with a recursive query, call the substring function for each number to cut out one zip code, and then check whether it starts with 44.


mysql> -- MySQL solution
mysql> select s 
    ->   from (select substring(@s,(lv-1)*5+1,5) s 
    ->           from (with recursive tab1(lv) as (select 1 lv union all select t1.lv + 1 from tab1 t1 where lv < length(@s)/5) select lv from tab1) t) t 
    ->  where s like '44%';
+-------+
| s     |
+-------+
| 44182 |
| 44272 |
+-------+
2 rows in set (0.00 sec)
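
        The same idea, cutting the string into fixed-width pieces and testing each one, is a few lines in a general-purpose language. The following Python sketch assumes the sample string ZIPS below; zips_with_prefix is a hypothetical helper name.

```python
# Assumed sample data: ten 5-digit zip codes concatenated without separators.
ZIPS = "03824531449411615213441829503544272752010217443235"

def zips_with_prefix(text, prefix, width=5):
    """Return the fixed-width chunks of text that start with prefix.
    A trailing chunk shorter than width is ignored."""
    return [
        text[i:i + width]
        for i in range(0, len(text) - width + 1, width)
        if text.startswith(prefix, i)
    ]

result = zips_with_prefix(ZIPS, '44')
```

        Since every chunk starts at a multiple of the width, alignment cannot be lost.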

10. Parse CSV files

        Each comma-separated value is either "bare", consisting of whatever precedes the next comma, or enclosed in double quotes, in which case a double quote inside the data is represented by a pair of double quotes. Below is an example:

Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K

        This line contains seven fields:

Ten Thousand
10000
 2710 
(empty field)
10,000
It's "10 Grand", baby
10K

        To parse the fields out of this line, the regular expression must handle both formats. An unquoted field contains any characters except double quotes and commas, and can be matched with '[^",]+'.

        A double-quoted field can contain any character other than a double quote, including commas and spaces, and can also contain pairs of adjacent double quotes. It can therefore be matched by any number of '[^"]|""' between '"..."', that is, '"(?:[^"]|"")*"'.

        Taken together, '[^",]+|"(?:[^"]|"")*"' matches one field. This expression can now be applied to a string containing a line of CSV text. For a double-quoted field, you also need to remove the enclosing double quotes and replace each pair of adjacent double quotes with a single double quote.

        In MySQL there is no need to know which alternative matched: the trim function strips the double quotes at both ends, and for unquoted fields it returns the value unchanged.
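
        For comparison, here is a minimal Python sketch of the same first attempt. It is only an illustration: unquote is a hypothetical helper, and unlike the trim-based approach it removes exactly one pair of enclosing quotes.

```python
import re

LINE = 'Ten Thousand,10000, 2710 ,,"10,000","It\'s ""10 Grand"", baby",10K'

# Either a run of non-quote, non-comma characters, or a quoted field.
FIELD = re.compile(r'[^",]+|"(?:[^"]|"")*"')

def unquote(raw):
    """Strip one pair of enclosing double quotes and collapse "" to "."""
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        return raw[1:-1].replace('""', '"')
    return raw

fields = [unquote(m.group(0)) for m in FIELD.finditer(LINE)]
```

        On the sample line this sketch yields six values, not seven: '[^",]+' requires at least one character, so the empty field between the two adjacent commas is never matched.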
 

mysql> set @s:='Ten Thousand,10000, 2710 ,,"10,000","It\'s ""10 Grand"", baby",10K';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='[^",]+|"(?:[^"]|"")*"';
Query OK, 0 rows affected (0.00 sec)

mysql> select replace(trim(both '"' from (trim(leading ',' from regexp_substr(@s,@r,1,lv)))),'""','"') s
    ->   from (with recursive tab1(lv) as (select 1 lv union all select t1.lv + 1 from tab1 t1 
    ->  where lv < regexp_count(@s, @r, '')) select lv from tab1) t;
+-----------------------+
| s                     |
+-----------------------+
| Ten Thousand          |
| 10000                 |
|  2710                 |
| 10,000                |
| It's "10 Grand", baby |
| 10K                   |
+-----------------------+
6 rows in set (0.00 sec)

        The output has only six rows; the empty fourth field is missing, which is obviously wrong. Changing '[^",]+' to '[^",]*' does not work either.

mysql> set @r:='[^",]*|"(?:[^"]|"")*"';
Query OK, 0 rows affected (0.00 sec)

mysql> select replace(trim(both '"' from (trim(leading ',' from regexp_substr(@s,@r,1,lv)))),'""','"') s
    ->   from (with recursive tab1(lv) as (select 1 lv union all select t1.lv + 1 from tab1 t1 
    ->  where lv < regexp_count(@s, @r, '')) select lv from tab1) t;
+--------------+
| s            |
+--------------+
| Ten Thousand |
|              |
| 10000        |
|              |
|  2710        |
|              |
|              |
|              |
| 10           |
|              |
| 000          |
|              |
|              |
|              |
| It's         |
|              |
|              |
| 10 Grand     |
|              |
|              |
|              |
|  baby        |
|              |
|              |
| 10K          |
|              |
+--------------+
26 rows in set (0.00 sec)

        Consider what happens after the first field is matched. Nothing in the expression can match the comma, so at that position a successful match of length 0 occurs. As a result, there is an empty match after each valid match, an empty match before each quoted field, and an empty match at the end of the string.

        In principle there could be an infinite number of such matches, because the regex engine could repeat the same zero-length match at the same position. Modern regex engines force a bump-along so that two matches of length 0 never occur at the same position.
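
        A small Python sketch makes the zero-length matches visible. It shows how Python's re module (3.7 or later) behaves on a two-field string; other engines, including the one behind the MySQL functions used here, may report a slightly different set of empty matches.

```python
import re

BARE = re.compile(r'[^",]*')

# Record (start offset, matched text) for every match in a two-field line.
matches = [(m.start(), m.group(0)) for m in BARE.finditer('a,b')]

# Offsets of the zero-length matches; no offset appears twice.
empty_offsets = [start for start, text in matches if text == '']
```

        The comma itself is never part of any match: the engine reports an empty match in front of it and then moves on.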

1. Decompose the driving process

        To solve the problem, we cannot rely on the transmission's bump-along to get past the commas; we have to control it ourselves. Two approaches come to mind:

  1. Match the commas ourselves. With this approach, the commas are matched as part of normal field matching, so that we "pace ourselves" through the string.
  2. Make sure every match starts where a field can start. A field can start at the beginning of the line or right after a comma.

        Perhaps an even better approach is to combine the two. Starting from the first approach (matching the commas ourselves), it is enough to require a comma before every field except the first, or after every field except the last. You can prepend '^|,' to the expression, or append '$|,' to it, using parentheses to limit the scope of the alternation.

mysql> set @r:='(?:^|,)(?:[^",]*|"(?:[^"]|"")*")';
Query OK, 0 rows affected (0.00 sec)

mysql> select @s,@r;
+-------------------------------------------------------------------+----------------------------------+
| @s                                                                | @r                               |
+-------------------------------------------------------------------+----------------------------------+
| Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K | (?:^|,)(?:[^",]*|"(?:[^"]|"")*") |
+-------------------------------------------------------------------+----------------------------------+
1 row in set (0.00 sec)

mysql> select regexp_substr(@s,@r,1,lv) s, length(convert(regexp_substr(@s,@r,1,lv) using utf8mb4)) l
    ->   from (with recursive tab1(lv) as (select 1 lv union all select t1.lv + 1 from tab1 t1 
    ->  where lv < regexp_count(@s, @r, '')) select lv from tab1) t;
+--------------+------+
| s            | l    |
+--------------+------+
| Ten Thousand |   12 |
| ,10000       |    6 |
| , 2710       |    7 |
| ,            |    1 |
| ,            |    1 |
| ,000         |    4 |
| ,            |    1 |
| , baby       |    6 |
| ,10K         |    4 |
+--------------+------+
9 rows in set (0.00 sec)

        The result is wrong. When several alternatives can match at the same position, their order must be chosen carefully. The first alternative, '[^",]*', can succeed without matching any characters, so the second alternative gets a chance only if something later in the expression forces a backtrack. Nothing follows the two alternatives, so the second one is never tried, and that is the problem.

        So swap the order of the alternatives:


mysql> set @r:='(?:^|,)(?:"(?:[^"]|"")*"|[^",]*)';
Query OK, 0 rows affected (0.00 sec)

mysql> select replace(trim(both '"' from (trim(leading ',' from regexp_substr(@s,@r,1,lv)))),'""','"') s
    ->   from (with recursive tab1(lv) as (select 1 lv union all select t1.lv + 1 from tab1 t1 
    ->  where lv < regexp_count(@s, @r, '')) select lv from tab1) t;
+-----------------------+
| s                     |
+-----------------------+
| Ten Thousand          |
| 10000                 |
|  2710                 |
|                       |
| 10,000                |
| It's "10 Grand", baby |
| 10K                   |
+-----------------------+
7 rows in set (0.00 sec)
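
        The effect of the order of the alternatives can be reproduced with a short Python sketch. It is an illustration only; the variable names are hypothetical, and the raw matches are kept (leading comma and quotes included) so that the difference is easy to see.

```python
import re

LINE = 'Ten Thousand,10000, 2710 ,,"10,000","It\'s ""10 Grand"", baby",10K'

# Bare-field alternative first: it can always succeed with zero characters,
# so the quoted-field alternative is never tried.
WRONG_ORDER = re.compile(r'(?:^|,)(?:[^",]*|"(?:[^"]|"")*")')

# Quoted-field alternative first: the bare alternative is only a fallback.
RIGHT_ORDER = re.compile(r'(?:^|,)(?:"(?:[^"]|"")*"|[^",]*)')

wrong = [m.group(0) for m in WRONG_ORDER.finditer(LINE)]
right = [m.group(0) for m in RIGHT_ORDER.finditer(LINE)]
```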

        With the alternatives swapped, the result is now correct, at least for the test data. A safer way is to add '\G', so that each match must start at the position where the previous match ended.

mysql> set @r:='\\G(?:^|,)(?:"(?:[^"]|"")*"|[^",]*)';
Query OK, 0 rows affected (0.00 sec)

mysql> select replace(trim(both '"' from (trim(leading ',' from regexp_substr(@s,@r,1,lv)))),'""','"') s
    ->   from (with recursive tab1(lv) as (select 1 lv union all select t1.lv + 1 from tab1 t1 
    ->  where lv < regexp_count(@s, @r, '')) select lv from tab1) t;
+-----------------------+
| s                     |
+-----------------------+
| Ten Thousand          |
| 10000                 |
|  2710                 |
|                       |
| 10,000                |
| It's "10 Grand", baby |
| 10K                   |
+-----------------------+
7 rows in set (0.00 sec)

        Now look at the result of adding '\G' without swapping the order of the alternatives:


mysql> set @r:='\\G(?:^|,)(?:[^",]*|"(?:[^"]|"")*")';
Query OK, 0 rows affected (0.00 sec)

mysql> select @s,@r;
+-------------------------------------------------------------------+------------------------------------+
| @s                                                                | @r                                 |
+-------------------------------------------------------------------+------------------------------------+
| Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K | \G(?:^|,)(?:[^",]*|"(?:[^"]|"")*") |
+-------------------------------------------------------------------+------------------------------------+
1 row in set (0.00 sec)

mysql> select regexp_substr(@s,@r,1,lv) s, length(convert(regexp_substr(@s,@r,1,lv) using utf8mb4)) l
    ->   from (with recursive tab1(lv) as (select 1 lv union all select t1.lv + 1 from tab1 t1 
    ->  where lv < regexp_count(@s, @r, '')) select lv from tab1) t;
+--------------+------+
| s            | l    |
+--------------+------+
| Ten Thousand |   12 |
| ,10000       |    6 |
| , 2710       |    7 |
| ,            |    1 |
| ,            |    1 |
+--------------+------+
5 rows in set (0.00 sec)

        After the comma in front of "10,000" has been matched, the next attempt starts at the opening double quote, where '(?:^|,)' cannot match. That attempt fails, and the transmission bumps along to apply the regular expression from the next character of the string. But once a bump-along has happened, the '\G' at the beginning makes every further attempt fail immediately, so matching stops after five results instead of returning wrong fields.
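
        Python's re module does not support '\G', but the check it performs can be approximated by accepting a match only if it starts exactly where the previous one ended. The sketch below is such an emulation, not the MySQL mechanism itself; contiguous_matches is a hypothetical helper name.

```python
import re

LINE = 'Ten Thousand,10000, 2710 ,,"10,000","It\'s ""10 Grand"", baby",10K'

GOOD = re.compile(r'(?:^|,)(?:"(?:[^"]|"")*"|[^",]*)')
BAD = re.compile(r'(?:^|,)(?:[^",]*|"(?:[^"]|"")*")')

def contiguous_matches(pattern, text):
    """Return (matches, stop_offset), keeping only the leading run of
    matches in which each one starts where the previous one ended."""
    pos = 0
    kept = []
    for m in pattern.finditer(text):
        if m.start() != pos:    # the engine had to bump along: stop here
            break
        kept.append(m.group(0))
        pos = m.end()
    return kept, pos

good, good_end = contiguous_matches(GOOD, LINE)
bad, bad_end = contiguous_matches(BAD, LINE)
```

        With the correct order the run of matches reaches the end of the line. With the wrong order it stops at the opening double quote of "10,000", so comparing the stop offset with the length of the line reveals that something went wrong.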

2. Another way

        As mentioned at the beginning of this section, the second way to match each field correctly is to make sure that a match can start only where a field is allowed to start. On the surface this resembles prepending '^|,', but it uses the lookbehind '(?<=^|,)' instead.

mysql> set @r:='(?:(?<=^|,))(?:"(?:[^"]|"")*"|[^",]*)';
Query OK, 0 rows affected (0.00 sec)

mysql> select replace(trim(both '"' from (trim(leading ',' from regexp_substr(@s,@r,1,lv)))),'""','"') s
    ->   from (with recursive tab1(lv) as (select 1 lv union all select t1.lv + 1 from tab1 t1 
    ->  where lv < regexp_count(@s, @r, '')) select lv from tab1) t;
+-----------------------+
| s                     |
+-----------------------+
| Ten Thousand          |
| 10000                 |
|  2710                 |
|                       |
| 10,000                |
| It's "10 Grand", baby |
| 10K                   |
+-----------------------+
7 rows in set (0.00 sec)

        Note that '\G' cannot be added to the beginning of this expression. Lookbehind is a zero-width assertion and does not consume the comma. After a field has been matched, the next attempt therefore starts at the comma, where the lookbehind fails, and the transmission has to bump along past the comma. That bump-along makes '\G' fail, so matching stops right after the first field.

mysql> set @r:='\\G(?:(?<=^|,))(?:"(?:[^"]|"")*"|[^",]*)';
Query OK, 0 rows affected (0.00 sec)

mysql> select replace(trim(both '"' from (trim(leading ',' from regexp_substr(@s,@r,1,lv)))),'""','"') s
    ->   from (with recursive tab1(lv) as (select 1 lv union all select t1.lv + 1 from tab1 t1 
    ->  where lv < regexp_count(@s, @r, '')) select lv from tab1) t;
+--------------+
| s            |
+--------------+
| Ten Thousand |
+--------------+
1 row in set (0.00 sec)

        Some regex engines only allow fixed-length lookbehind, and the two alternatives in '(?<=^|,)' have different lengths. In that case you can replace '(?<=^|,)' with '(?:^|(?<=,))'.

mysql> set @r:='(?:(?:^|(?<=,)))(?:"(?:[^"]|"")*"|[^",]*)';
Query OK, 0 rows affected (0.00 sec)

mysql> select replace(trim(both '"' from (trim(leading ',' from regexp_substr(@s,@r,1,lv)))),'""','"') s
    ->   from (with recursive tab1(lv) as (select 1 lv union all select t1.lv + 1 from tab1 t1 
    ->  where lv < regexp_count(@s, @r, '')) select lv from tab1) t;
+-----------------------+
| s                     |
+-----------------------+
| Ten Thousand          |
| 10000                 |
|  2710                 |
|                       |
| 10,000                |
| It's "10 Grand", baby |
| 10K                   |
+-----------------------+
7 rows in set (0.00 sec)
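
        Python's re module is one such engine: it requires every lookbehind to have a fixed width. The following sketch, given only as an illustration with hypothetical variable names, first checks whether '(?<=^|,)' compiles and then uses the fixed-width rewrite.

```python
import re

LINE = 'Ten Thousand,10000, 2710 ,,"10,000","It\'s ""10 Grand"", baby",10K'

# '^' has width 0 and ',' has width 1, so this lookbehind is not fixed-width.
try:
    re.compile(r'(?<=^|,)x')
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# Fixed-width rewrite: the anchor is moved out of the lookbehind.
FIELD = re.compile(r'(?:^|(?<=,))(?:"(?:[^"]|"")*"|[^",]*)')

raw_fields = [m.group(0) for m in FIELD.finditer(LINE)]
```

        Because the lookbehind consumes nothing, the raw matches contain no leading comma; only the quotes still have to be removed.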

        Compared with the first method, this implementation is more cumbersome. It also still relies on the transmission's bump-along to get past the commas, so if something else goes wrong, it allows a match to start at the comma inside '..."10,000"...'. In general, it is not as safe as the first method.

        However, you can append '(?=$|,)' to the expression, which requires each match to end just before a comma or at the end of the line. Put simply, the field content must be delimited on both sides by a comma or a line boundary, which rules out such incorrect matches.

mysql> set @r:='(?:(?<=^|,))(?:"(?:[^"]|"")*"|[^",]*)(?=$|,)';
Query OK, 0 rows affected (0.00 sec)

mysql> select replace(trim(both '"' from (trim(leading ',' from regexp_substr(@s,@r,1,lv)))),'""','"') s
    ->   from (with recursive tab1(lv) as (select 1 lv union all select t1.lv + 1 from tab1 t1 
    ->  where lv < regexp_count(@s, @r, '')) select lv from tab1) t;
+-----------------------+
| s                     |
+-----------------------+
| Ten Thousand          |
| 10000                 |
|  2710                 |
|                       |
| 10,000                |
| It's "10 Grand", baby |
| 10K                   |
+-----------------------+
7 rows in set (0.00 sec)

3. Further improve efficiency

        You can use atomic grouping to improve efficiency, for example by changing the subexpression that matches the content of a double-quoted field from '(?:[^"]|"")*' to '(?>[^"]+|"")*'.

mysql> set @r:='\\G(?:^|,)(?:"(?>[^"]+|"")*"|[^",]*)';
Query OK, 0 rows affected (0.00 sec)

mysql> select replace(trim(both '"' from (trim(leading ',' from regexp_substr(@s,@r,1,lv)))),'""','"') s
    ->   from (with recursive tab1(lv) as (select 1 lv union all select t1.lv + 1 from tab1 t1 
    ->  where lv < regexp_count(@s, @r, '')) select lv from tab1) t;
+-----------------------+
| s                     |
+-----------------------+
| Ten Thousand          |
| 10000                 |
|  2710                 |
|                       |
| 10,000                |
| It's "10 Grand", baby |
| 10K                   |
+-----------------------+
7 rows in set (0.00 sec)

        You can also use possessive quantifiers to improve efficiency.


mysql> set @r:='\\G(?:^|,)(?:"(?>[^"]++|"")*+"|[^",]*+)';
Query OK, 0 rows affected (0.00 sec)

mysql> select replace(trim(both '"' from (trim(leading ',' from regexp_substr(@s,@r,1,lv)))),'""','"') s
    ->   from (with recursive tab1(lv) as (select 1 lv union all select t1.lv + 1 from tab1 t1 
    ->  where lv < regexp_count(@s, @r, '')) select lv from tab1) t;
+-----------------------+
| s                     |
+-----------------------+
| Ten Thousand          |
| 10000                 |
|  2710                 |
|                       |
| 10,000                |
| It's "10 Grand", baby |
| 10K                   |
+-----------------------+
7 rows in set (0.00 sec)
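
        Atomic groups and possessive quantifiers only change how much backtracking state the engine keeps; for well-formed input the matches should be the same. The Python sketch below checks this on the sample line. It assumes Python 3.11 or later for the atomic syntax and falls back to the plain expression on older versions; '\G' is omitted because Python's re module does not support it.

```python
import re
import sys

LINE = 'Ten Thousand,10000, 2710 ,,"10,000","It\'s ""10 Grand"", baby",10K'

PLAIN = r'(?:^|,)(?:"(?:[^"]|"")*"|[^",]*)'
ATOMIC = r'(?:^|,)(?:"(?>[^"]++|"")*+"|[^",]*+)'

# (?>...) and the ++ / *+ quantifiers were added to re in Python 3.11.
chosen = ATOMIC if sys.version_info >= (3, 11) else PLAIN

fields_plain = [m.group(0) for m in re.finditer(PLAIN, LINE)]
fields_chosen = [m.group(0) for m in re.finditer(chosen, LINE)]

same_result = fields_plain == fields_chosen
```

        Any speed difference would only show up on long or malformed input, for example a quoted field whose closing quote is missing.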

4. Other formats

  • Use any character such as ';' or tab as delimiter.

        Just replace the commas with the corresponding delimiters.

mysql> set @s:='Ten Thousand;10000; 2710 ;;"10,000";"It\'s ""10 Grand"", baby";10K';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='\\G(?:^|;)(?:"(?>[^"]++|"")*+"|[^";]*+)';
Query OK, 0 rows affected (0.00 sec)

mysql> select replace(trim(both '"' from (trim(leading ';' from regexp_substr(@s,@r,1,lv)))),'""','"') s
    ->   from (with recursive tab1(lv) as (select 1 lv union all select t1.lv + 1 from tab1 t1 
    ->  where lv < regexp_count(@s, @r, '')) select lv from tab1) t;
+-----------------------+
| s                     |
+-----------------------+
| Ten Thousand          |
| 10000                 |
|  2710                 |
|                       |
| 10,000                |
| It's "10 Grand", baby |
| 10K                   |
+-----------------------+
7 rows in set (0.00 sec)
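
        If the delimiter varies, the expression can be built from it instead of being edited by hand. The following Python sketch is one possible way to do that, assuming a single-character delimiter; split_record is a hypothetical helper, and the contiguity check stands in for '\G', which Python's re module lacks.

```python
import re

def split_record(line, delimiter=','):
    """Split one record on a single-character delimiter, honoring "..." quoting."""
    d = re.escape(delimiter)
    # Group 1 is the field; the separator in front of it is matched but not captured.
    field = re.compile(r'(?:^|%s)("(?:[^"]|"")*"|[^"%s]*)' % (d, d))
    values = []
    pos = 0
    for m in field.finditer(line):
        if m.start() != pos:        # a gap means the engine had to bump along
            break
        raw = m.group(1)
        if raw.startswith('"'):     # quoted: drop the quotes, collapse "" to "
            raw = raw[1:-1].replace('""', '"')
        values.append(raw)
        pos = m.end()
    if pos != len(line):
        raise ValueError('malformed record near offset %d' % pos)
    return values

semicolon_fields = split_record(
    'Ten Thousand;10000; 2710 ;;"10,000";"It\'s ""10 Grand"", baby";10K', ';')
tab_fields = split_record('a\tb\t"c\td"', '\t')
```

        A record that cannot be consumed from start to end raises ValueError instead of returning partial or wrong fields.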

  • Spaces after delimiters are allowed, but they are not considered part of the value.

        Add '\s*' after the delimiter, for example by starting the expression with '(?:^|,\s*+)'.

mysql> set @s:='Ten Thousand,    10000, 2710 ,   ,   "10,000",   "It\'s ""10 Grand"", baby",   10K';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='\\G(?:^|,\\s*+)(?:"(?>[^"]++|"")*+"|[^",]*+)';
Query OK, 0 rows affected (0.00 sec)

mysql> select replace(trim(both '"' from (regexp_replace(regexp_substr(@s,@r,1,lv),'^,\\s*',''))),'""','"') s
    ->   from (with recursive tab1(lv) as (select 1 lv union all select t1.lv + 1 from tab1 t1 
    ->  where lv < regexp_count(@s, @r, '')) select lv from tab1) t;
+-----------------------+
| s                     |
+-----------------------+
| Ten Thousand          |
| 10000                 |
| 2710                  |
|                       |
| 10,000                |
| It's "10 Grand", baby |
| 10K                   |
+-----------------------+
7 rows in set (0.00 sec)

  • Use backslashes to escape quotes, such as \" instead of "" to represent quotes inside a value.

        Typically this means that a backslash may appear before any character and that the pair stands for that character, so replace '[^"]+|""' with '[^\\"]+|\\.'.

mysql> set @s:='Ten Thousand,    10000, 2710 ,   ,   "10,000",   "It\'s \\"10 Grand\\", baby",   10K';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='\\G(?:^|,\\s*+)(?:"(?>[^\\\\"]++|\\\\.)*+"|[^",]*+)';
Query OK, 0 rows affected (0.00 sec)

mysql> select replace(trim(both '"' from (regexp_replace(regexp_substr(@s,@r,1,lv),'^,\\s*',''))),'\\"','"') s
    ->   from (with recursive tab1(lv) as (select 1 lv union all select t1.lv + 1 from tab1 t1 
    ->  where lv < regexp_count(@s, @r, '')) select lv from tab1) t;
+-----------------------+
| s                     |
+-----------------------+
| Ten Thousand          |
| 10000                 |
| 2710                  |
|                       |
| 10,000                |
| It's "10 Grand", baby |
| 10K                   |
+-----------------------+
7 rows in set (0.00 sec)
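
        The last two variations can be combined in a short Python sketch. It is an illustration under stated assumptions: parse_escaped is a hypothetical helper, whitespace after a comma is skipped, a backslash inside quotes escapes whatever character follows it, and the contiguity check again stands in for '\G'. Unlike the replace call above, which only rewrites \", this sketch removes the backslash in front of any escaped character.

```python
import re

LINE = 'Ten Thousand,    10000, 2710 ,   ,   "10,000",   "It\'s \\"10 Grand\\", baby",   10K'

# Group 1 is the field; the comma and the whitespace after it are not captured.
FIELD = re.compile(r'(?:^|,\s*)("(?:[^\\"]|\\.)*"|[^",]*)')

def parse_escaped(line):
    """Split a line whose quoted fields use backslash escapes."""
    values = []
    pos = 0
    for m in FIELD.finditer(line):
        if m.start() != pos:        # a gap means the engine had to bump along
            break
        raw = m.group(1)
        if raw.startswith('"'):
            # Drop the enclosing quotes, then the backslash of each escape pair.
            raw = re.sub(r'\\(.)', r'\1', raw[1:-1])
        values.append(raw)
        pos = m.end()
    if pos != len(line):
        raise ValueError('malformed record near offset %d' % pos)
    return values

fields = parse_escaped(LINE)
```

        Note that only the whitespace after a delimiter is dropped; the trailing space of the third field is still part of its value.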

Origin blog.csdn.net/wzy0623/article/details/131548866