C++: 57 --- Standard Library Special Facilities: Regular Expressions (regex)

First, an overview of regular expressions

  • C++ provides a regular expression library (the RE library) as part of the standard library. The RE library is defined in the <regex> header and consists of several components (described below)

The regex class

  • The regex class represents a regular expression. In addition to initialization and assignment, regex supports a few other operations (see the description of regex options below)

The regex_match and regex_search functions

  • The regex_match and regex_search functions determine whether a given character sequence matches a given regex
    • regex_match returns true if the entire input sequence matches the expression
    • regex_search returns true if some substring of the input sequence matches the expression
  • There is also a regex_replace function, described later
  • regex_search and regex_match take the same parameters: the input sequence, an optional match object that receives the details of the match, the regex object, and an optional match_flag_type value that controls matching
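The difference between the two functions can be sketched as follows. The function names and the digit pattern are made up for illustration; only regex_match and regex_search come from the library.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns true only if the ENTIRE input is a run of digits.
bool whole_input_is_digits(const std::string& input)
{
    static const std::regex digits("[[:digit:]]+");
    return std::regex_match(input, digits);
}

// Returns true if ANY substring of the input is a run of digits.
bool input_contains_digits(const std::string& input)
{
    static const std::regex digits("[[:digit:]]+");
    return std::regex_search(input, digits);
}
```

For the input "abc123", regex_search succeeds because the substring "123" matches, while regex_match fails because the letters are not part of the pattern.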

Second, using the regular expression library

Example

  • The example below looks for words that violate the spelling rule "i before e, except after c"
  • The code is shown below:
    • [^c]: matches any character other than c
    • [^c]ei: matches the string ei when the character before it is not c
    • [[:alpha:]]: matches any letter. The + and * symbols mean "one or more" and "zero or more" matches respectively. Here we use * so that zero or more letters are matched on either side
#include <iostream>
#include <string>
#include <regex>
using namespace std;

int main()
{
    //find the string ei that does not follow the character c
    std::string pattern("[^c]ei");
    //we want the whole word that contains pattern
    pattern = "[[:alpha:]]*" + pattern + "[[:alpha:]]*";

    std::regex r(pattern); //construct a regex to find the pattern
    std::smatch results;   //holds the details of the search result

    //test string
    std::string test_str = "reslut freind theif receive";

    //look in test_str for a substring that matches r and store the result in results
    if (regex_search(test_str, results, r))
        std::cout << results.str() << std::endl;

    return 0;
}
  • regex_search stops as soon as it finds the first match. (Both freind and theif match above, but only freind is printed)

Specifying options for a regex object

  • Below are some of the regex operations and flags
  • When we define a regex, or call assign on a regex to give it a new value, we can specify one or more flags that affect how the regex operates
  • The lower half of the table lists the nine flags (icase, nosubs, optimize, ECMAScript, basic, extended, awk, grep, and egrep):
    • The first three flags let us specify language-independent aspects of how the regular expression is processed
    • The last six flags indicate the language in which the regular expression is written. Exactly one of these six must be set. By default the ECMAScript flag is set, so the regex uses the ECMA-262 specification (the regular expression language that many web browsers use)
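As a minimal sketch of how the flags combine, the hypothetical helper below (the name matches and its parameters are not from the library) builds the flag value with bitwise OR. ECMAScript is spelled out only to show where a grammar flag goes; it is already the default.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Whole-string comparison against a pattern, optionally ignoring case.
bool matches(const std::string& input, const std::string& pattern, bool ignore_case)
{
    // One grammar flag (ECMAScript) plus, optionally, a language-independent flag (icase).
    std::regex_constants::syntax_option_type flags = std::regex_constants::ECMAScript;
    if (ignore_case)
        flags |= std::regex_constants::icase;
    std::regex r(pattern, flags);
    return std::regex_match(input, r);
}
```

Calling matches("HELLO.CC", "hello\\.cc", true) succeeds, while the same call with false for the last argument does not.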

  • Example:
    • Below we use the icase flag to find file names with a particular extension
    • Most operating systems recognize extensions case-insensitively. For example, a C++ source file can be saved with the extension .cc, .Cc, .cC, or .CC and the effect is the same
    • The following regular expression recognizes any of these extensions along with other common file extensions:
      • The dot character (.) normally matches any character. Because it has this special meaning, we must put a backslash before it (\.) to match a literal dot. Backslash is also a special character in C++, so the string literal needs two backslashes (\\) to represent one backslash
#include <iostream>
#include <string>
#include <regex>
using namespace std;

int main()
{
    //one or more alphanumeric characters, followed by a ".", followed by "cpp", "cxx", or "cc"
    std::regex r("[[:alnum:]]+\\.(cpp|cxx|cc)$", regex::icase);
    std::smatch results;  //holds the search result
    std::string filename; //holds the input

    while (std::cin >> filename)
        if (regex_search(filename, results, r)) //if there is a match, print it
            std::cout << results.str() << std::endl;

    return 0;
}
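The doubled backslash can also be avoided with a C++11 raw string literal. The sketch below is an alternative spelling, not something the article's program uses; the variable and function names are invented for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// The same pattern written twice: with doubled backslashes and as a raw string literal.
const std::string escaped_pattern = "[[:alnum:]]+\\.(cpp|cxx|cc)$";
const std::string raw_pattern     = R"([[:alnum:]]+\.(cpp|cxx|cc)$)";

// Returns true if filename ends in one of the extensions, ignoring case.
bool has_cpp_extension(const std::string& filename)
{
    static const std::regex r(raw_pattern, std::regex::icase);
    return std::regex_search(filename, r);
}
```

Both strings hold exactly the same characters, so either one can be passed to the regex constructor.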

Errors in specifying or using a regular expression (regex_error)

  • A regular expression is "compiled" at run time, when a regex object is initialized or assigned. A mistake in a regular expression is therefore not caught by the C++ compiler; it only shows up at run time

  • If a regular expression contains an error, the library throws an exception of type regex_error at run time:
    • The exception has a what() operation that describes the error that occurred
    • It also has a member named code() that returns a numeric code for the type of error. The values that code() returns are implementation-defined (as shown below, numbered from 0)

  • Example:
    • The pattern below is missing the right bracket ] after alnum
    • The constructor therefore throws an exception. Here the code is 4, which corresponds to error_brack in the table above
#include <iostream>
#include <string>
#include <regex>
using namespace std;

int main()
{
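    try{
        //the right bracket after alnum is missing, so the regex constructor throws
        std::regex r("[[:alnum:]+\\.(cpp|cxx|cc)$", regex::icase);
    }
    catch (const regex_error& e){ //catch by const reference to avoid copying the exception
        std::cout << e.what() << "\ncode: " << e.code() << std::endl;
    }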

    return 0;
}
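Because the numeric value of code() is implementation-defined, it is usually safer to compare it with the named constants in std::regex_constants than with a literal number. The helper below is a hypothetical sketch (describe_pattern is not a library function) that names only two of the error codes.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Tries to compile a pattern and reports the outcome as text.
// Only two of the error codes are named here; the rest are lumped together.
std::string describe_pattern(const std::string& pattern)
{
    try {
        std::regex r(pattern);
        return "ok";
    } catch (const std::regex_error& e) {
        if (e.code() == std::regex_constants::error_brack)
            return "mismatched [ and ]";
        if (e.code() == std::regex_constants::error_paren)
            return "mismatched ( and )";
        return std::string("other error: ") + e.what();
    }
}
```

Which code a given malformed pattern produces can vary between library implementations, so the sketch keeps a catch-all branch.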

Regular expression classes and input sequence types

  • We can search several types of input sequence. The input may be ordinary char data or wchar_t data, and the characters may be stored in a library string or in an array of char (or in the wide-character versions, wstring or an array of wchar_t). The RE library defines separate types for these different kinds of input sequence
  • For example:
    • The regex class holds regular expressions of type char
    • The wregex class holds regular expressions of type wchar_t
  • The match and iterator types (described below) are more specialized. These types differ not only by character type but also by whether the sequence is in a library string or an array:
    • smatch represents string input sequences
    • cmatch represents character array sequences
    • wsmatch represents wide string (wstring) input
    • wcmatch represents arrays of wide characters
  • The important point is that the RE library types we use must match the type of the input sequence. The table shows which RE library types correspond to which input sequence types:

  • Example:
std::regex r("[[:alnum:]]+\\.(cpp|cxx|cc)$", regex::icase);
std::smatch results;
if (regex_search("myfile.cc", results, r))
    std::cout << results.str() << std::endl;

  • The code above is an error: the type of the match argument does not agree with the type of the input sequence. To search a character array we must use a cmatch object:
std::regex r("[[:alnum:]]+\\.(cpp|cxx|cc)$", regex::icase);
std::cmatch results;
if (regex_search("myfile.cc", results, r))
    std::cout << results.str() << std::endl;
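The same pairing rule applies to wide characters. As a sketch (find_source_file is an illustrative name, not part of the library), a wstring input goes with wregex and wsmatch:

```cpp
#include <cassert>
#include <regex>
#include <string>

// Wide-character counterpart of the file-name search:
// wregex for the pattern, wsmatch for a wstring input sequence.
std::wstring find_source_file(const std::wstring& filename)
{
    static const std::wregex r(L"[[:alnum:]]+\\.(cpp|cxx|cc)$", std::regex_constants::icase);
    std::wsmatch results;
    if (std::regex_search(filename, results, r))
        return results.str();   // the matched text, as a wstring
    return std::wstring();      // empty result means no match
}
```

Passing a std::smatch here instead of a std::wsmatch would fail to compile, for the same reason as the char-array example above.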

Third, the match and regex iterator types

  • In the example above that finds words violating the "i before e, except after c" rule, the input contains several matches but only the first one is printed. We can use an sregex_iterator to get all the matches
  • A regex iterator is an iterator adaptor that is bound to an input sequence and a regex object
  • As shown above, each type of input sequence has its own regex iterator type
  • The iterator operations are as follows:

Using sregex_iterator

  • As an example, we modify the program above to find every word in a piece of text that violates the "i before e, except after c" rule
  • The program is shown below:
    • The for loop iterates through each substring of file that matches r. The initializer of the for statement defines it and end_it
    • When we define it, the sregex_iterator constructor calls regex_search to position it on the first match of r in file
    • end_it is an empty sregex_iterator that acts as the off-the-end iterator

#include <iostream>
#include <string>
#include <regex>
using namespace std;

int main()
{
    //find the string ei that does not follow the character c
    std::string pattern("[^c]ei");
    //we want the whole word that contains pattern
    pattern = "[[:alpha:]]*" + pattern + "[[:alpha:]]*";

    std::regex r(pattern, regex::icase);             //ignore case when matching
    std::string file = "reslut freind theif receive";//test string

    for (sregex_iterator it(file.begin(), file.end(), r), end_it; it != end_it;++it)
        std::cout << it->str() << std::endl;
    return 0;
}
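Since a pair of sregex_iterator objects forms an ordinary iterator range, it also works with standard algorithms. The sketch below counts matches with std::distance; count_matches is a made-up helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Counts the non-overlapping matches of pattern in text, ignoring case.
std::size_t count_matches(const std::string& text, const std::string& pattern)
{
    std::regex r(pattern, std::regex::icase);
    // first is positioned on the first match; a default-constructed iterator marks the end
    std::sregex_iterator first(text.begin(), text.end(), r), last;
    return static_cast<std::size_t>(std::distance(first, last));
}
```

With the test string and pattern from the program above, the two matching words are freind and theif.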

Using match data

  • Besides letting us print the part of the input string that matched, the match types provide other operations (shown below; the submatch operations are described later)

  • We cover the smatch and ssub_match types in more detail in part "four"
  • The match classes have two members that give context around a match:
    • prefix: returns an ssub_match object representing the part of the input sequence before the current match
    • suffix: returns an ssub_match object representing the part of the input sequence after the current match
  • An ssub_match object has two members that we use here:
    • str: returns the matched text as a string
    • length: returns the size of that string
  • Example:
    • We call prefix, which returns an ssub_match object representing the part of file before the current match
    • We call length() on that ssub_match object to get the number of characters in the prefix
    • Next we adjust pos so that it is the index 40 characters before the end of the prefix
      • If the prefix has fewer than 40 characters, we set pos to 0, which means the entire prefix is printed
    • We use substr() to print from that position to the end of the prefix
    • Having printed the characters before the current match, we print the matched word itself with some extra formatting so that it stands out in the output
    • After the matched word, we print (at most) the first 40 characters of the part of file that follows the match
#include <iostream>
#include <string>
#include <regex>
#include <vector>
using namespace std;

int main()
{
    std::string pattern("[^c]ei");
    pattern = "[[:alpha:]]*" + pattern + "[[:alpha:]]*";

    std::regex r(pattern, regex::icase);
    std::string file = "reslut freind theif receive";

    //on each iteration, *it is an smatch
    for (sregex_iterator it(file.begin(), file.end(), r), end_it; it != end_it; ++it)
    {
        auto pos = it->prefix().length();  //size of the prefix
        pos = pos > 40 ? pos - 40 : 0;     //we want at most 40 characters
        std::cout << it->prefix().str().substr(pos) << "\n";       //last part of the prefix
        std::cout << "\t\t>>>" << it->str() << " <<<\n";           //the matched word
        std::cout << it->suffix().str().substr(0, 40) << std::endl;//first part of the suffix
        std::cout << "****************************" << std::endl;
    }
    return 0;
}

Fourth, using subexpressions

  • A regular expression pattern typically contains one or more subexpressions. A subexpression is a part of the pattern that itself has meaning. Regular expression grammars usually use parentheses to denote subexpressions
  • For example, the following pattern has two subexpressions:
    • ([[:alnum:]]+): matches a sequence of one or more alphanumeric characters
    • (cpp|cxx|cc): matches the file extension
std::regex r("([[:alnum:]]+)\\.(cpp|cxx|cc)$", regex::icase);
  • We can rewrite the program above so that it prints only the file name, by changing the output statement. For example:
std::regex r("([[:alnum:]]+)\\.(cpp|cxx|cc)$", regex::icase);
std::cmatch results;
if (regex_search("myfile.cc", results, r))
    std::cout << results.str(1) << std::endl; //print the first subexpression

The str() function of the match classes

  • The str() member of the match classes returns the part of the match that corresponds to a subexpression
  • str(0) returns the entire match; str(1) returns the match for the first subexpression; str(2) returns the match for the second subexpression, and so on
  • For example, if the file name is foo.cpp and the pattern has the two subexpressions above, then:
    • str(0): returns foo.cpp
    • str(1): returns foo (the first subexpression)
    • str(2): returns cpp (the second subexpression)
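The foo.cpp case can be sketched in code. The helper name split_filename is invented for illustration, and the pattern is written with the + inside the first pair of parentheses so that the first subexpression captures the whole name.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Returns str(0), str(1), str(2), ... for the first match in filename,
// or an empty vector if there is no match.
std::vector<std::string> split_filename(const std::string& filename)
{
    static const std::regex r("([[:alnum:]]+)\\.(cpp|cxx|cc)$", std::regex::icase);
    std::vector<std::string> parts;
    std::smatch results;
    if (std::regex_search(filename, results, r))
        for (std::size_t i = 0; i != results.size(); ++i)
            parts.push_back(results.str(i));   // index 0 is the whole match
    return parts;
}
```

For "foo.cpp" the returned vector holds "foo.cpp", "foo", and "cpp", in that order.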

Using subexpressions for data validation

  • A common use of subexpressions is to validate data that must match a specific format
  • For example, a US phone number has ten digits: a three-digit area code followed by a seven-digit local number
    • The area code is often, but not always, enclosed in parentheses
    • The remaining seven digits can be separated by a dash, a dot, or a space, or not separated at all
  • We want to accept numbers in any of these formats and reject numbers in other formats. We do this in two steps:
    • First, we use a regular expression to find sequences that might be phone numbers
    • Then we call a function to complete the validation of the data
  • Before writing the code, we need a few features of the ECMAScript regular expression language:
    • \d matches a single digit, and \d{n} matches a sequence of n digits
    • A set of characters inside square brackets matches any one of those characters; for example, [-. ] matches a dash, a dot, or a space
    • A component followed by ? is optional
    • A backslash makes a character with special meaning represent itself, so \( and \) match literal parentheses

  • Because backslash is a special character in C++, each place where \ appears in the pattern must be written with an extra backslash, so the string literal ends up with two backslashes
  • To validate a phone number we need access to the components of the pattern. To obtain the matched parts, we define the regular expression using subexpressions. Each subexpression is enclosed in a pair of parentheses:
//the whole regular expression has 7 subexpressions, in the form: ( ddd ) separator ddd separator dddd
//subexpressions 1, 3, 4, and 6 are optional; 2, 5, and 7 hold the number
std::string phone = "(\\()?(\\d{3})(\\))?([-. ])?(\\d{3})([-. ]?)(\\d{4})";
  • The regular expression is fairly complex, so we explain it piece by piece:
    • (\\()? represents the optional open parenthesis before the area code
    • (\\d{3}) represents the area code
    • (\\))? represents the optional close parenthesis after the area code
    • ([-. ])? represents the optional separator after the area code
    • (\\d{3}) represents the next three digits of the number
    • ([-. ]?) represents another optional separator
    • (\\d{4}) represents the last four digits of the number
  • The following code reads a file and calls a function named valid to check whether each number has a valid format:
bool valid(const smatch& m)
{
    //checks whether the number is valid; the code is given later in the article
}

int main()
{
    std::string phone = "(\\()?(\\d{3})(\\))?([-. ])?(\\d{3})([-. ]?)(\\d{4})";
    std::regex r(phone);  //正则表达式类
    std::string s;        //保存输入的号码

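    //read phone numbers
    while (getline(std::cin, s))
    {
        //on each iteration, *it is an smatch representing the part matched by the regular expression
        for (sregex_iterator it(s.begin(), s.end(), r), end_it; it != end_it; ++it) {
            //dereferencing the iterator yields an smatch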
            if (valid(*it))
                std::cout << "valid: " << it->str() << std::endl;
            else
                std::cout << "not valid: " << it->str() << std::endl;
        }
    }
    return 0;
}
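To see what the seven subexpressions capture for a concrete number, the sketch below pulls the three digit groups out of a match and records whether the optional parentheses took part. The PhoneParts struct and extract_phone function are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Pieces of the first phone-like sequence found in a line of text.
struct PhoneParts {
    bool found = false;        // was any phone-like sequence found?
    bool open_paren = false;   // did subexpression 1 participate in the match?
    bool close_paren = false;  // did subexpression 3 participate in the match?
    std::string area, exchange, line;  // subexpressions 2, 5, and 7
};

PhoneParts extract_phone(const std::string& text)
{
    static const std::regex r("(\\()?(\\d{3})(\\))?([-. ])?(\\d{3})([-. ]?)(\\d{4})");
    PhoneParts p;
    std::smatch m;
    if (std::regex_search(text, m, r)) {
        p.found = true;
        p.open_paren = m[1].matched;
        p.close_paren = m[3].matched;
        p.area = m.str(2);
        p.exchange = m.str(5);
        p.line = m.str(7);
    }
    return p;
}
```

For the text "call (908) 555-1800 today", the three digit groups are "908", "555", and "1800", and both parenthesis subexpressions are marked as matched.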

Using the submatch operations (sub_match)

  • Now we write the valid() function used above. As the example shows, our regular expression has seven subexpressions, so each smatch object contains eight ssub_match elements (element [0] represents the entire match; elements [1] through [7] represent the individual subexpressions)
  • When we call valid, we know that there is an overall match, but we do not know which of the optional subexpressions took part in it. The matched member of an ssub_match object is true if the corresponding subexpression is part of the overall match
  • In a valid phone number, the area code is either fully parenthesized or not parenthesized at all. What valid has to do therefore depends on whether the number starts with a parenthesis. The code is shown below:
bool valid(const smatch& m)
{
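    //if there is an open parenthesis before the area code
    if (m[1].matched)
        //the area code must be followed by a close parenthesis, and then immediately by the rest of the number or by a space
        return m[3].matched && (!m[4].matched || m[4].str() == " ");
    else
        //otherwise there must be no close parenthesis after the area code,
        //and the separators between the other two components must match
        return !m[3].matched && m[4].str() == m[6].str();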
}
  • Now we can run the program to check whether the phone numbers in its input are valid
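As a rough check of the rule, the sketch below repeats valid() and wraps it in a hypothetical helper, first_number_is_valid, that tests only the first phone-like sequence in a string. The sample numbers in the note that follows are made up.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Same rule as valid() above: parentheses around the area code must be
// balanced, and without them the two separators must be identical.
bool valid(const std::smatch& m)
{
    if (m[1].matched)
        return m[3].matched && (!m[4].matched || m[4].str() == " ");
    return !m[3].matched && m[4].str() == m[6].str();
}

// Hypothetical helper: true if the first phone-like sequence in s is valid.
bool first_number_is_valid(const std::string& s)
{
    static const std::regex r("(\\()?(\\d{3})(\\))?([-. ])?(\\d{3})([-. ]?)(\\d{4})");
    std::smatch m;
    return std::regex_search(s, m, r) && valid(m);
}
```

Under this rule "(908) 555-1800", "908.555.1800", and "9085551800" are accepted, while "(908.555.1800" (unbalanced parenthesis) and "908 555-1800" (two different separators) are rejected.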

Fifth, using regex_replace

  • Regular expressions are useful not only for finding a given sequence but also for replacing the sequence that was found with another sequence
  • When we want to find and replace a regular expression in an input sequence, we call regex_replace. The table shows the related operations:
  • regex_replace takes an input character sequence and a regex object, as well as a string that describes the output we want

Example

  • For example, continuing with the phone numbers above, suppose we want to rewrite each number in the format "ddd.ddd.dddd". The replacement string should use the second, fifth, and seventh subexpressions and ignore the first, third, fourth, and sixth
  • We refer to a particular subexpression with a $ symbol followed by the index number of the subexpression. The code is shown below:
#include <iostream>
#include <string>
#include <regex>
using namespace std;

int main()
{
    std::string phone = "(\\()?(\\d{3})(\\))?([-. ])?(\\d{3})([-. ]?)(\\d{4})";
    std::regex r(phone);

    std::string fmt = "$2.$5.$7";

    std::string number = "(908) 555-1800";
    std::cout << regex_replace(number, r, fmt) << std::endl;
    return 0;
}

Replacing only part of the input sequence

  • A more interesting use of regular expressions is to replace the phone numbers embedded in a larger file
  • For example, suppose we have a file that stores names and phone numbers:

  • We want to convert it to:

  • The code is then as follows:
int main()
{
    std::string phone = "(\\()?(\\d{3})(\\))?([-. ])?(\\d{3})([-. ]?)(\\d{4})";
    std::regex r(phone);
    std::string s;
    std::string fmt = "$2.$5.$7"; //reformat numbers as ddd.ddd.dddd

    //read the input line by line
    while (getline(std::cin, s))
        std::cout << regex_replace(s, r, fmt) << std::endl;
    return 0;
}
  • Each record is read into s and passed to regex_replace. The function finds and transforms all the matches in its input sequence
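The effect on a single record can be sketched with a function instead of a read loop. The record text used in the note below is a made-up sample, and reformat_numbers is an illustrative name.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Rewrites every phone number in line as ddd.ddd.dddd and leaves the rest of the line alone.
std::string reformat_numbers(const std::string& line)
{
    static const std::regex r("(\\()?(\\d{3})(\\))?([-. ])?(\\d{3})([-. ]?)(\\d{4})");
    const std::string fmt = "$2.$5.$7";
    return std::regex_replace(line, r, fmt);
}
```

For the record "morgan (201) 555-2368 862-555-0123" the result is "morgan 201.555.2368 862.555.0123": both numbers are rewritten and the name is copied through unchanged.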

Flags to control matching and formatting

  • The library also defines flags that control the matching process or the formatting used during a replacement. The table shows these values, which can be passed to the regex_search or regex_match functions or to the format member of the smatch class:

  • These flags have type match_flag_type
  • The values are defined in a namespace named regex_constants. Like the placeholders namespace used with bind, regex_constants is defined inside namespace std. To use a name from regex_constants, we can write either of the following:
using std::regex_constants::format_no_copy; //bring in only the format_no_copy flag

using namespace std::regex_constants;       //bring in all the flags in regex_constants

Using format flags (format_no_copy as an example)

  • By default, regex_replace outputs its entire input sequence. The parts that do not match the regular expression are output unchanged; the parts that do match are formatted as the format string specifies
  • We can change this default behavior with the format_no_copy flag:
#include <iostream>
#include <string>
#include <regex>

using namespace std;
using std::regex_constants::format_no_copy;

int main()
{
    std::string phone = "(\\()?(\\d{3})(\\))?([-. ])?(\\d{3})([-. ]?)(\\d{4})";
    std::regex r(phone);
    std::string s;
    std::string fmt2 = "$2.$5.$7 "; //put a space after the last part of the number as a separator

    //tell regex_replace to copy only the text that it replaces
    while (getline(std::cin, s))
        std::cout << regex_replace(s, r, fmt2,format_no_copy) << std::endl;
    return 0;
}
  • Given the same input as above, the output is as follows:
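To compare the flags side by side, the sketch below passes the flag in as a parameter. The function name reformat and the sample record in the note are assumptions for illustration; format_default, format_no_copy, and format_first_only are library constants of type match_flag_type.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Applies fmt to the phone numbers in line, with the given match/format flags.
std::string reformat(const std::string& line, const std::string& fmt,
                     std::regex_constants::match_flag_type flags)
{
    static const std::regex r("(\\()?(\\d{3})(\\))?([-. ])?(\\d{3})([-. ]?)(\\d{4})");
    return std::regex_replace(line, r, fmt, flags);
}
```

For the record "morgan (201) 555-2368 862-555-0123", format_default with "$2.$5.$7" gives "morgan 201.555.2368 862.555.0123"; format_no_copy with "$2.$5.$7 " drops the name and gives "201.555.2368 862.555.0123 "; format_first_only with "$2.$5.$7" rewrites only the first number and gives "morgan 201.555.2368 862-555-0123".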

 

 


Origin blog.csdn.net/qq_41453285/article/details/104671121