Introduction to Regular Expressions and Simple Use in C++11

A regular expression is a text pattern. Regular expressions are a powerful, convenient, and efficient tool for processing text. The expression syntax works like a small programming language for describing and analyzing text. With the support of specific tools, regular expressions can be used to add, delete, split, overlay, insert, and trim many kinds of text and data.

A complete regular expression consists of two kinds of characters: special characters, called "metacharacters", and "literal" (normal text) characters such as letters, digits, Chinese characters, and underscores. Metacharacters give regular expressions their descriptive power.

Regular expressions are supported by many text editors and by most high-level programming languages, such as Perl, Java, Python, and C/C++, each of which has its own regular expression package.

A regular expression is just a string, with no length limit. A "subexpression" is a part of the whole regular expression, usually an expression enclosed in parentheses or one branch of an alternation separated by "|".

By default, letters in expressions are case-sensitive.

Common metacharacters:

1. ".": matches any single character except "\n". To match any character including "\n", use a pattern such as "[\s\S]";

2. "^": matches the position at the start of the input string; it does not consume any characters. To match the "^" character itself, use "\^";

3. "$": matches the position at the end of the input string; it does not consume any characters. To match the "$" character itself, use "\$";

4. "*": matches the preceding character or subexpression zero or more times; equivalent to "{0,}". For example, "\^*b" matches "b", "^b", "^^b", ...;

5. "+": matches the preceding character or subexpression one or more times; equivalent to "{1,}". For example, "a+b" matches "ab", "aab", "aaab", ...;

6. "?": matches the preceding character or subexpression zero or one time; equivalent to "{0,1}". For example, "a[cd]?" matches "a", "ac", "ad". When "?" immediately follows another quantifier ("*", "+", "?", "{n}", "{n,}", "{n,m}"), it makes that quantifier "non-greedy". A non-greedy quantifier matches as little of the searched string as possible, while the default greedy quantifier matches as much as possible. For example, in the string "oooo", "o+?" matches only a single "o", while "o+" matches all the "o"s;

7. "|": performs a logical "or" between two alternatives. For example, "(him|her)" finds a match in "it belongs to him" and "it belongs to her", but not in "it belongs to them.";

8. "\": marks the next character as a special character, a literal, a backreference, or an octal escape. For example, "n" matches the character "n", "\n" matches a newline, "\\" matches "\", and "\(" matches "(";

9. "\w": matches any single letter, digit, or underscore, that is, any one of A-Z, a-z, 0-9, _;

10. "\W": matches any character that is not a letter, digit, or underscore;

11. "\s": matches any whitespace character, such as a space, tab, or form feed; equivalent to "[ \f\n\r\t\v]";

12. "\S": matches any non-whitespace character; equivalent to "[^ \f\n\r\t\v]";

13. "\d": matches any single digit from 0 to 9; equivalent to "[0-9]";

14. "\D": matches any non-digit character; equivalent to "[^0-9]";

15. "\b": matches a word boundary, that is, the position between a word character and a non-word character such as a space; it does not consume any characters. For example, "er\b" matches the "er" in "never", but not the "er" in "verb";

16. "\B": matches a non-word boundary. For example, "er\B" matches the "er" in "verb", but not the "er" in "never";

17. "\f": Match a form feed, equivalent to "\x0c" and "\cL";

18. "\n": matches a newline, equivalent to "\x0a" and "\cJ";

19. "\r": matches a carriage return, equivalent to "\x0d" and "\cM";

20. "\t": matches a tab, equivalent to "\x09" and "\cI";

21. "\v": matches a vertical tab, equivalent to "\x0b" and "\cK";

22. "\cx": matches the control character indicated by "x". For example, "\cM" matches Control-M, i.e. a carriage return. The value of "x" must be in the range "A-Z" or "a-z"; otherwise "c" is taken to be the literal "c" character;

23. "{n}": "n" is a non-negative integer; matches exactly n times. For example, "o{2}" does not match the "o" in "Bob", but matches the two "o"s in "food";

24. "{n,}": "n" is a non-negative integer; matches at least n times. For example, "o{2,}" does not match the "o" in "Bob", but matches all the "o"s in "foooood". "o{1,}" is equivalent to "o+", and "o{0,}" is equivalent to "o*";

25. "{n,m}": "n" and "m" are non-negative integers with n<=m; matches at least n times and at most m times. For example, "o{1,3}" matches the first three "o"s in "fooooood", and "o{0,1}" is equivalent to "o?". Note that no spaces may be inserted between the comma and the numbers. As another example, "ba{1,3}" matches "ba", "baa", or "baaa";

26. "x|y": matches "x" or "y", for example, "z|food" matches "z" or "food"; "(z|f)ood" matches "zood" or "food";

27. "[xyz]": character set, matches any character contained, for example, "[abc]" matches "a" in "plain";

28. "[^xyz]": negated character set, matches any character not listed, i.e. any character except "x", "y", and "z". For example, "[^abc]" matches "p" in "plain";

29. "[a-z]": character range, matches any character within the specified range. For example, "[a-z]" matches any lowercase letter in the range "a" to "z";

30. "[^a-z]": negated character range, matches any character not in the specified range. For example, "[^a-z]" matches any character not in the range "a" to "z";

31. "( )": defines the expression between "(" and ")" as a group and saves the text matched by it to a temporary area. A regular expression can save up to 9 such groups, which can be referenced with "\1" to "\9";

32. "(pattern)": matches pattern and captures the matched subexpression. The captured matches can be retrieved from the resulting "matches" collection using the $0…$9 attributes;

33. "(?:pattern)": matches pattern but does not capture the matched subexpression, i.e. it is a non-capturing group and the match is not stored for later use. This is useful when combining parts of a pattern with the "or" character "|". For example, "industr(?:y|ies)" is a shorter expression than "industry|industries";

34. "(?=pattern)": positive lookahead, a non-capturing match. It matches at any position where the text that follows matches pattern; the lookahead text is not captured for later use. For example, "Windows(?=95|98|NT|2000)" matches "Windows" in "Windows2000", but not "Windows" in "Windows3.1". Lookahead does not consume characters: after a match, the search for the next match starts immediately after the last match, not after the characters that made up the lookahead;

35. "(?!pattern)": negative lookahead, a non-capturing match. It matches at any position where the text that follows does not match pattern; the lookahead text is not captured for later use. For example, "Windows(?!95|98|NT|2000)" matches "Windows" in "Windows3.1", but not "Windows" in "Windows2000";

To match a special character literally, add "\" before it. For example, to match the characters "^", "$", "(", ")", "[", "]", "{", "}", ".", "?", "+", "*", "|", use "\^", "\$", "\(", "\)", "\[", "\]", "\{", "\}", "\.", "\?", "\+", "\*", "\|".

In C++11, the <regex> header is available with GCC 4.9.0 and later and with Visual Studio 2013 and later. It provides functions such as regex_match, regex_search, and regex_replace. The following is the test code:

#include "regex.hpp"
#include <cstdio>    // fprintf
#include <iostream>
#include <iterator>  // std::back_inserter
#include <regex>
#include <string>
#include <vector>
  
int test_regex_match()  
{  
    std::string pattern{ "\\d{3}-\\d{8}|\\d{4}-\\d{7}" }; // fixed telephone  
    std::regex re(pattern);  
  
    std::vector<std::string> str{ "010-12345678", "0319-9876543", "021-123456789"};  
  
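    /* std::regex_match:
        Determines whether the regular expression (parameter re) matches the entire character sequence str; mainly used to validate text.
        Note that the regular expression must match the whole of the analyzed string, otherwise false is returned; if the entire sequence is matched, true is returned.
    */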
  
    for (auto tmp : str) {  
        bool ret = std::regex_match(tmp, re);  
        if (ret) fprintf(stderr, "%s, can match\n", tmp.c_str());  
        else fprintf(stderr, "%s, can not match\n", tmp.c_str());  
    }  
  
    return 0;  
}  
  
int test_regex_search()  
{  
    std::string pattern{ "https?://\\S+" }; // url: http or https scheme followed by non-space characters
    std::regex re(pattern);  
  
    std::vector<std::string> str{ "http://blog.csdn.net/fengbingchun", "https://github.com/fengbingchun",  
        "abcd://124.456", "abcd https://github.com/fengbingchun 123" };  
  
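    /* std::regex_search:
        Similar to regex_match, but it does not require the entire character sequence to match.
        regex_search can be used to find a subsequence of the input that matches the regular expression re.
    */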
  
    for (auto tmp : str) {  
        bool ret = std::regex_search(tmp, re);  
        if (ret) fprintf(stderr, "%s, can search\n", tmp.c_str());  
        else fprintf(stderr, "%s, can not search\n", tmp.c_str());  
    }  
  
    return 0;  
}  
  
int test_regex_search2()  
{  
    std::string pattern{ "[a-zA-Z]+://[^\\s]*" }; // url
    std::regex re(pattern);  
  
    std::string str{ "my csdn blog addr is: http://blog.csdn.net/fengbingchun , my github addr is: https://github.com/fengbingchun " };  
    std::smatch results;  
    while (std::regex_search(str, results, re)) {  
        for (auto x : results)  
            std::cout << x << " ";  
        std::cout << std::endl;  
        str = results.suffix().str();  
    }  
  
    return 0;  
}  
  
int test_regex_replace()  
{  
    std::string pattern{ "\\d{18}|\\d{17}X" }; // id card  
    std::regex re(pattern);  
  
    std::vector<std::string> str{ "123456789012345678", "abcd123456789012345678efgh",  
        "abcdefbg", "12345678901234567X" };  
    std::string fmt{ "********" };  
  
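    /* std::regex_replace:
        Finds all matches of the regular expression re in the whole character sequence.
        After each successful match, the matched string is replaced according to the parameter fmt.
    */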
  
    for (auto tmp : str) {  
        std::string ret = std::regex_replace(tmp, re, fmt);  
        fprintf(stderr, "src: %s, dst: %s\n", tmp.c_str(), ret.c_str());  
    }  
  
    return 0;  
}  
  
int test_regex_replace2()  
{  
    // reference: http://www.cplusplus.com/reference/regex/regex_replace/  
    std::string s("there is a subsequence in the string\n");  
    std::regex e("\\b(sub)([^ ]*)");   // matches words beginning with "sub"
  
    // using string/c-string (3) version:  
    std::cout << std::regex_replace(s, e, "sub-$2");  
  
    // using range/c-string (6) version:  
    std::string result;  
    std::regex_replace(std::back_inserter(result), s.begin(), s.end(), e, "$2");  
    std::cout << result;  
  
    // with flags:  
    std::cout << std::regex_replace(s, e, "$1 and $2", std::regex_constants::format_no_copy);  
    std::cout << std::endl;  
  
    return 0;  
} 

 
