Applying C++ regular expressions. Although this is really a dynamic programming problem, using the library is very convenient. You can't always reinvent the wheel, right?

剑指 Offer 19. Regular expression matching
Please implement a function to match regular expressions containing '.' and '*'. The character '.' in the pattern represents any single character, and '*' means that the character before it can appear any number of times (including 0 times). In this question, matching means that all characters of the string match the entire pattern. For example, the string "aaa" matches the patterns "a.a" and "ab*ac*a", but matches neither "aa.a" nor "ab*a".

Example 1:

Input:
s = "aa"
p = "a"
Output: false
Explanation: "a" cannot match the entire string "aa".
Example 2:

Input:
s = "aa"
p = "a*"
Output: true
Explanation: '*' matches zero or more of the preceding element, which here is 'a'. Repeating 'a' once more therefore gives "aa".
Example 3:

Input:
s = "ab"
p = ".*"
Output: true
Explanation: ".*" means zero or more ('*') of any character ('.').
Example 4:

Input:
s = "aab"
p = "c*a*b"
Output: true
Explanation: Because '*' means zero or more, here 'c' appears 0 times and 'a' is repeated once. So the string "aab" can be matched.
Example 5:

Input:
s = "mississippi"
p = "mis*is*p*."
Output: false
s may be empty and contains only lowercase letters a-z.
p may be empty and contains only lowercase letters a-z and the characters . and *, with no consecutive '*'s.

Author: Krahets
Link: https://leetcode.cn/leetbook/read/illustration-of-algorithm/9a1ypc/
Source: LeetCode
The copyright belongs to the author. For commercial reprinting, please contact the author for authorization. For non-commercial reprinting, please indicate the source.

Solution: as follows.

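#include <regex>
#include <string>
using std::string;

class Solution {
public:
    // Delegates to std::regex: the '.' and '*' of this problem behave the
    // same way in the default ECMAScript grammar.
    bool isMatch(string s, string p) {
        std::regex reg(p);
        return std::regex_match(s, reg);
    }
};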

Results:

Passed

Execution time: 172 ms, beating 5.32% of all C++ submissions

Memory consumption: 9 MB, beats 11.58% of users among all C++ submissions

Passed test cases: 448/448

It seems that the library version runs rather slowly. That doesn't matter; it is still a good opportunity to learn C++ regular expressions.
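
For comparison, the dynamic programming approach mentioned at the beginning can be sketched roughly as follows. This is only a minimal sketch and not the book's official solution: the name isMatchDP is made up for illustration, and the code assumes a well-formed pattern (a '*' always follows some character).

```cpp
#include <string>
#include <vector>

// Hypothetical helper: dp[i][j] is true when the first i characters of s
// match the first j characters of p.
bool isMatchDP(const std::string& s, const std::string& p) {
    const std::size_t m = s.size(), n = p.size();
    std::vector<std::vector<bool>> dp(m + 1, std::vector<bool>(n + 1, false));
    dp[0][0] = true;  // the empty string matches the empty pattern
    for (std::size_t i = 0; i <= m; ++i) {
        for (std::size_t j = 1; j <= n; ++j) {
            if (p[j - 1] == '*') {
                if (j < 2) continue;  // assumption: a leading '*' matches nothing
                // Case 1: "x*" stands for zero occurrences of x.
                dp[i][j] = dp[i][j - 2];
                // Case 2: "x*" consumes s[i-1] when x matches it.
                if (i > 0 && (p[j - 2] == '.' || p[j - 2] == s[i - 1])) {
                    dp[i][j] = dp[i][j] || dp[i - 1][j];
                }
            } else if (i > 0 && (p[j - 1] == '.' || p[j - 1] == s[i - 1])) {
                // A plain character or '.' consumes exactly one character.
                dp[i][j] = dp[i - 1][j - 1];
            }
        }
    }
    return dp[m][n];
}
```

The table has (m+1) x (n+1) entries and each entry is filled in constant time, so the running time grows with the product of the two lengths.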

Regular expression library (regex)

A "regular expression" is a formula that describes a set of rules for text, and it is used for complex string operations such as matching, searching, and replacing.

std::regex is the class C++ uses to represent a regular expression; the regex library was added in C++11. It is the specialization of the class template std::basic_regex<> for the char type. The specialization for the wchar_t type is std::wregex.
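
A small sketch of the two specializations side by side. The helper names isAllDigits and isAllDigitsWide are invented for this illustration; the static_assert lines only restate the typedef relationship described above.

```cpp
#include <regex>
#include <string>
#include <type_traits>

// std::regex and std::wregex are typedefs of basic_regex specializations.
static_assert(std::is_same<std::regex, std::basic_regex<char>>::value,
              "std::regex is basic_regex<char>");
static_assert(std::is_same<std::wregex, std::basic_regex<wchar_t>>::value,
              "std::wregex is basic_regex<wchar_t>");

// Hypothetical helper for narrow strings.
bool isAllDigits(const std::string& text) {
    static const std::regex reg("\\d+");
    return std::regex_match(text, reg);
}

// Hypothetical helper for wide strings; note the L prefix on the pattern.
bool isAllDigitsWide(const std::wstring& text) {
    static const std::wregex reg(L"\\d+");
    return std::regex_match(text, reg);
}
```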

Regex grammar (regex syntaxes)

std::regex uses ECMAScript grammar by default. This grammar is easy to use and powerful. The meanings of common symbols are as follows:

symbol	meaning
^	Matches the beginning of a line
$	Matches the end of a line
.	Matches any single character (except a line terminator)
[…]	Matches any one character listed in []
(…)	Defines a group
\	Escape character
\d	Matches a digit [0-9]
\D	Negation of \d
\w	Matches letters [a-zA-Z], digits, and underscores
\W	Negation of \w
\s	Matches a whitespace character
\S	Negation of \s
+	The previous element is repeated one or more times
*	The previous element is repeated any number of times (including zero)
?	The previous element is repeated 0 or 1 times
{n}	The previous element is repeated exactly n times
{n,}	The previous element is repeated at least n times
{n,m}	The previous element is repeated at least n times and at most m times
|	Logical or

The symbols listed above are very commonly used and are enough to solve most problems.
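
To see a few of these symbols working together, here is a small sketch. The three helper functions and their names are made up for illustration, and each pattern deliberately combines several entries from the table.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: ^ and $ anchor the pattern, \d{4} is exactly four
// digits, \d{1,2} is one or two digits.
bool looksLikeDate(const std::string& text) {
    static const std::regex reg("^\\d{4}-\\d{1,2}-\\d{1,2}$");
    return std::regex_match(text, reg);
}

// Hypothetical helper: (...) groups, | is logical or, ? makes '!' optional.
bool isYesOrNo(const std::string& text) {
    static const std::regex reg("(yes|no)!?");
    return std::regex_match(text, reg);
}

// Hypothetical helper: [...] is a character class, \w* is any number of
// letters, digits, or underscores.
bool isIdentifier(const std::string& text) {
    static const std::regex reg("[A-Za-z_]\\w*");
    return std::regex_match(text, reg);
}
```

The functions only check the shape of the text: looksLikeDate accepts "2024-1-05" but would also accept an impossible date such as "2024-99-99".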

Match

A common operation in string processing is "matching", which checks whether the whole string corresponds exactly to the rule. The function used for matching is std::regex_match(), a function template. Let's go straight to an example:

std::regex reg("<.*>.*</.*>");
bool ret = std::regex_match("<html>value</html>", reg);
assert(ret);

ret = std::regex_match("<xml>value<xml>", reg);
assert(!ret);

std::regex reg1("<(.*)>.*</\\1>");
ret = std::regex_match("<xml>value</xml>", reg1);
assert(ret);

ret = std::regex_match("<header>value</header>", std::regex("<(.*)>value</\\1>"));
assert(ret);

// Use the basic grammar
std::regex reg2("<\\(.*\\)>.*</\\1>", std::regex_constants::basic);
ret = std::regex_match("<title>value</title>", reg2);
assert(ret);

This small example uses regex_match() to match strings in XML (or HTML) format; it returns true when the match succeeds. The meaning is simple, and if anything is unclear you can refer to the grammar section above.

\\ appears in the patterns because \ itself must be escaped inside a C++ string literal. C++11 and later support raw string literals, so the pattern can also be written like this:

std::regex reg1(R"(<(.*)>.*</\1>)");
auto ret = std::regex_match("<xml>value</xml>", reg1);
assert(ret);

Raw string literals are not available in C++03 and earlier, so keep that in mind when using them.

If you want to get matching results, you can use another overloaded form of regex_match():

std::cmatch m;
auto ret = std::regex_match("<xml>value</xml>", m, std::regex("<(.*)>(.*)</(\\1)>"));
if (ret)
{
	std::cout << m.str() << std::endl;
	std::cout << m.length() << std::endl;
	std::cout << m.position() << std::endl;
}

std::cout << "----------------" << std::endl;

// Iterate over the matched contents
for (std::size_t i = 0; i < m.size(); ++i)
{
	// Both forms work
	std::cout << m[i].str() << " " << m.str(i) << std::endl;
}

std::cout << "----------------" << std::endl;

// Iterate using iterators
for (auto pos = m.begin(); pos != m.end(); ++pos)
{
	std::cout << *pos << std::endl;
}

The output is:

<xml>value</xml>
16
0
----------------
<xml>value</xml> <xml>value</xml>
xml xml
value value
xml xml
----------------
<xml>value</xml>
xml
value
xml

cmatch is the specialization of the class template std::match_results<> for C-style strings (const char*). For std::string you have to use the smatch specialization instead. The corresponding wide-character versions, wcmatch and wsmatch, are also available.

Pass a match_results object as the second parameter of regex_match() to get the matching result. In the example the result is stored in a cmatch, which provides many member functions for working with it. Most of them resemble those of string, so they are easy to use.
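
When the input is a std::string, the same idea looks roughly like this with smatch. The function name tagName is hypothetical; the pattern is the one used in the example above.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the tag name of "<tag>...</tag>",
// or an empty string if the text does not have that shape.
std::string tagName(const std::string& text) {
    static const std::regex reg("<(.*)>(.*)</(\\1)>");
    std::smatch m;  // match_results for std::string iterators
    if (std::regex_match(text, m, reg)) {
        return m.str(1);  // content of the first capture group
    }
    return "";
}
```

An smatch stores iterators into the searched string rather than copies of it, so the string has to stay alive for as long as the result is used. Here the function returns a copy of the group before the parameter goes out of scope.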

m[0] stores the entire match. To capture a substring of the match, mark it with () in the regular expression, which is why a few more parentheses are added here:

std::regex("<(.*)>(.*)</(\\1)>")

In this way, these substrings will be stored after m[0] in sequence, and each substring can be accessed sequentially through m[1], m[2],...

Search

"Search" is very similar to "match". The corresponding function is std::regex_search, which is also a function template and is used the same way as regex_match. The difference is that a search succeeds as soon as the target appears anywhere in the string; it does not require the whole string to match.

Let’s take an example:

std::regex reg("<(.*)>(.*)</(\\1)>");
std::cmatch m;
auto ret = std::regex_search("123<xml>value</xml>456", m, reg);
if (ret)
{
	for (auto& elem : m)
		std::cout << elem << std::endl;
}

std::cout << "prefix:" << m.prefix() << std::endl;
std::cout << "suffix:" << m.suffix() << std::endl;

The output is:

<xml>value</xml>
xml
value
xml
prefix:123
suffix:456

If you switch to regex_match here, the match fails, because regex_match requires the whole string to match and this string has extra characters before and after the tag.

For "search", you can obtain the prefix and suffix respectively through prefix and suffix in the matching results. The prefix is the content before the matching content, and the suffix is the content after the matching content.
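
The difference between the two functions can be checked with a pair of small helpers. This is only a sketch; the names matchesWhole and containsMatch are made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: the whole string must match the pattern.
bool matchesWhole(const std::string& text) {
    static const std::regex reg("<(.*)>(.*)</(\\1)>");
    return std::regex_match(text, reg);
}

// Hypothetical helper: any part of the string may match the pattern.
bool containsMatch(const std::string& text) {
    static const std::regex reg("<(.*)>(.*)</(\\1)>");
    return std::regex_search(text, reg);
}
```

For "123<xml>value</xml>456", containsMatch returns true while matchesWhole returns false; for "<xml>value</xml>" both return true.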

So if there are multiple sets of content that meet the conditions, how do you get all the information? Let’s look at it through a small example:

std::regex reg("<(.*)>(.*)</(\\1)>");
std::string content("123<xml>value</xml>456<widget>center</widget>hahaha<vertical>window</vertical>the end");
std::smatch m;
auto pos = content.cbegin();
auto end = content.cend();
for (; std::regex_search(pos, end, m, reg); pos = m.suffix().first)
{
	std::cout << "----------------" << std::endl;
	std::cout << m.str() << std::endl;
	std::cout << m.str(1) << std::endl;
	std::cout << m.str(2) << std::endl;
	std::cout << m.str(3) << std::endl;
}

The output is:

----------------
<xml>value</xml>
xml
value
xml
----------------
<widget>center</widget>
widget
center
widget
----------------
<vertical>window</vertical>
vertical
window
vertical

Another overload of regex_search is used here, one that takes a pair of iterators (regex_match has the same overload). Every sub-match object, including the ones returned by prefix() and suffix(), is derived from std::pair<>: its first member points to the first character of the sub-match, and its second member points one past the last character. m.suffix().first is therefore the position immediately after the current match.

After a set of searches is completed, the search can be continued from suffix, so that all information that matches the content can be obtained.

Tokenize

There is another operation called "splitting". For example, if a piece of data stores many email accounts separated by commas, you can specify the comma as the separator and split the content to obtain each account.

In the C++ regex library this operation is called Tokenize, and the class template regex_token_iterator<> provides a tokenizing iterator. Again, let's look at an example:

std::string mail("[email protected],[email protected],[email protected],[email protected]");
std::regex reg(",");
std::sregex_token_iterator pos(mail.begin(), mail.end(), reg, -1);
decltype(pos) end;
for (; pos != end; ++pos)
{
	std::cout << pos->str() << std::endl;
}

In this way, you can get all the mailboxes separated by commas:

[email protected]
[email protected]
[email protected]
[email protected]

sregex_token_iterator is the specialization for the string type. Pay attention to the last parameter, which takes one or more integers specifying the content you are interested in. The -1 here selects the subsequences between matches of the regular expression, that is, the text before each match plus whatever remains after the last one. If you specify 0, you get the matches themselves, which here would be ",". You can also add groups to the regular expression and pass the number of the group you want. Give it a try.
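
Here is one way to try the different values of that last parameter. This is a minimal sketch; the helper name tokens and its signature are invented for this illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the tokens selected by `submatch`.
//   -1 : the text between matches
//    0 : the whole match
//    N : capture group N of each match
std::vector<std::string> tokens(const std::string& text,
                                const std::regex& reg, int submatch) {
    std::vector<std::string> out;
    std::sregex_token_iterator pos(text.begin(), text.end(), reg, submatch);
    std::sregex_token_iterator end;  // default-constructed: end of sequence
    for (; pos != end; ++pos) {
        out.push_back(pos->str());
    }
    return out;
}
```

With the pattern "," and the text "a,b,c", passing -1 yields a, b, c and passing 0 yields the two commas. With the pattern "(\\d+)-(\\w+)" and the text "001-Neo,002-Lucia", passing 2 yields Neo and Lucia.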

Replace

The last operation is called "replace", that is, replacing the text matched by the regular expression with the specified content. The regex library provides it through the function template std::regex_replace.

Now, given the data "he...ll..o, worl..d!", think about how to remove the "." characters typed in by accident.

Got an idea? Let’s take a look at the regular solution:

char data[] = "he...ll..o, worl..d!";
std::regex reg("\\.");
// output: hello, world!
std::cout << std::regex_replace(data, reg, "");

We can also use the grouping function:

char data[] = "001-Neo,002-Lucia";
std::regex reg("(\\d+)-(\\w+)");
// output: 001 name=Neo,002 name=Lucia
std::cout << std::regex_replace(data, reg, "$1 name=$2");

With groups, you can refer to the content of group N through $N in the replacement string. This feature is quite useful.
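
A few more replacement variations, as a rough sketch. The helper names are made up for illustration; $& (the whole match) and std::regex_constants::format_first_only are part of the standard library.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: turns "id-name" into "name(id)" using $1 and $2.
std::string swapIdAndName(const std::string& text) {
    static const std::regex reg("(\\d+)-(\\w+)");
    return std::regex_replace(text, reg, "$2($1)");
}

// Hypothetical helper: wraps every number in brackets; $& is the whole match.
std::string bracketNumbers(const std::string& text) {
    static const std::regex reg("\\d+");
    return std::regex_replace(text, reg, "[$&]");
}

// Hypothetical helper: removes only the first '.' instead of all of them.
std::string removeFirstDot(const std::string& text) {
    static const std::regex reg("\\.");
    return std::regex_replace(text, reg, "",
                              std::regex_constants::format_first_only);
}
```

For example, swapIdAndName("001-Neo,002-Lucia") gives "Neo(001),Lucia(002)", and removeFirstDot("a.b.c") gives "ab.c".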

Examples

1. Verify email

This requirement often comes up in registration and login, to check the validity of user input.

If high precision is not required, you can write it like this:

std::string data = "[email protected],[email protected],[email protected],[email protected]";
std::regex reg("\\w+@\\w+(\\.\\w+)+");

std::sregex_iterator pos(data.cbegin(), data.cend(), reg);
decltype(pos) end;
for (; pos != end; ++pos)
{
	std::cout << pos->str() << std::endl;
}

Another way of traversing search results is used here: a regex iterator, which is more convenient than looping with regex_search and a match object. The regular expression here is a loose match, but that is not a problem for ordinary user input. The key is simplicity. The output is:

[email protected]
[email protected]
[email protected]
[email protected]

But if I enter "[email protected]", it can still match successfully. This is obviously an illegal email address. A more precise regular expression should be written like this:

std::string data = "[email protected], \
	       [email protected], \
           [email protected], \
           [email protected], \
           [email protected] \
           [email protected]";
std::regex reg("[a-zA-Z0-9_]+@[a-zA-Z0-9]+(\\.[a-zA-Z]+){1,3}");

std::sregex_iterator pos(data.cbegin(), data.cend(), reg);
decltype(pos) end;
for (; pos != end; ++pos)
{
	std::cout << pos->str() << std::endl;
}

The output is:

[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
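
For the registration scenario, where a single input has to be checked as a whole, regex_match is the natural fit. The following is only a sketch: the function name isValidEmail and the example.com addresses in the note are assumptions made for illustration, and the pattern (written with A-Z in the character classes) is still far from a complete address check.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when the whole input looks like one address.
bool isValidEmail(const std::string& input) {
    static const std::regex reg(
        "[a-zA-Z0-9_]+@[a-zA-Z0-9]+(\\.[a-zA-Z]+){1,3}");
    // regex_match rejects inputs with anything before or after the address.
    return std::regex_match(input, reg);
}
```

With this sketch, "user_1@example.com" and "user@example.com.cn" are accepted, while "user@example" (no dot-separated suffix) and "user@@example.com" are rejected.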

2. Match IP

Given the following string of IP addresses: 192.68.1.254 102.49.23.013 10.10.10.10 2.2.2.2 8.109.90.30.
The requirement is to extract the IP addresses and output them sorted by address segment.

It's a bit late, so I won't explain it in detail. Here is the answer directly for your reference:

std::string ip("192.68.1.254 102.49.23.013 10.10.10.10 2.2.2.2 8.109.90.30");

std::cout << "Original content:\n" << ip << std::endl;

// 1. Pad every segment with leading zeros
ip = std::regex_replace(ip, std::regex("(\\d+)"), "00$1");

std::cout << "After padding:\n" << ip << std::endl;

// 2. Strip extra zeros so that each segment has exactly three digits
ip = std::regex_replace(ip, std::regex("0*(\\d{3})"), "$1");

std::cout << "After trimming to three digits:\n" << ip << std::endl;

// 3. Split out the IP addresses
std::regex reg("\\s");
std::sregex_token_iterator pos(ip.begin(), ip.end(), reg, -1);
decltype(pos) end;

// std::set (from <set>) keeps its elements sorted
std::set<std::string> ip_set;
for (; pos != end; ++pos)
{
	ip_set.insert(pos->str());
}

std::cout << "------\nFinal result:\n";

// 4. Output the sorted addresses
for (const auto& elem : ip_set)
{
	// 5. Remove the leading zeros again
	std::cout << std::regex_replace(elem, 
		std::regex("0*(\\d+)"), "$1") << std::endl;
}

The output is:

Original content:
192.68.1.254 102.49.23.013 10.10.10.10 2.2.2.2 8.109.90.30
After padding:
00192.0068.001.00254 00102.0049.0023.00013 0010.0010.0010.0010 002.002.002.002 008.00109.0090.0030
After trimming to three digits:
192.068.001.254 102.049.023.013 010.010.010.010 002.002.002.002 008.109.090.030
------
Final result:
2.2.2.2
8.109.90.30
10.10.10.10
102.49.23.13
192.68.1.254
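
As an alternative to the padding trick, the segments can be captured as numbers and sorted numerically. This is a different technique from the one above, sketched under assumptions: the function name sortedIps is invented, segments are assumed to be small enough for std::stoi, and no range check (0 to 255) is performed. Unlike the std::set version, duplicates are kept.

```cpp
#include <algorithm>
#include <array>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: extracts dotted quads and sorts them numerically.
std::vector<std::string> sortedIps(const std::string& text) {
    static const std::regex reg("(\\d+)\\.(\\d+)\\.(\\d+)\\.(\\d+)");
    std::vector<std::array<int, 4>> ips;
    for (std::sregex_iterator pos(text.begin(), text.end(), reg), end;
         pos != end; ++pos) {
        std::array<int, 4> ip;
        for (int i = 0; i < 4; ++i) {
            ip[i] = std::stoi(pos->str(i + 1));  // "013" becomes 13
        }
        ips.push_back(ip);
    }
    // std::array compares element by element, segment after segment.
    std::sort(ips.begin(), ips.end());
    std::vector<std::string> out;
    for (const auto& ip : ips) {
        out.push_back(std::to_string(ip[0]) + "." + std::to_string(ip[1]) +
                      "." + std::to_string(ip[2]) + "." + std::to_string(ip[3]));
    }
    return out;
}
```

Called on the string from this example, it returns the same five addresses in the same order as the output above.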

Thanks to the experts at Blog Park (cnblogs), I learned a lot!