Java regular expression grouping and replacement

Sub-expressions (groups) in regular expressions are not easy to grasp at first, but they are a very powerful text-processing tool.

1 Regular expression warm-up

Match phone number
// Phone number matching
// Mobile prefixes are limited to 13x, 15x, 17x and 18x
System.out.println("18304072984".matches("1[3578]\\d{9}"));   // true

// Landline numbers: 010-65784236, 0316-3312617, 022-12465647, 03123312336
String regex = "0\\d{2}-?\\d{8}|0\\d{3}-?\\d{7}";
String telStr = "010-43367458";
System.out.println(telStr.matches(regex));  // true
Match email address
String mail = "[email protected]";
String reg = "[a-zA-Z_0-9]+@[a-zA-Z0-9]+(\\.[a-zA-Z]+){1,2}";
System.out.println(mail.matches(reg));  // true
Special character replacement

Remove all non-Chinese characters:

String input = "神探狄仁&*%$杰之四大天王@bdfbdbdfdgds23532";
String reg = "[^\\u4e00-\\u9fa5]";
input = input.replaceAll(reg, "");
System.out.println(input);   // 神探狄仁杰之四大天王

In Unicode, the commonly used Chinese characters fall in the range \u4e00-\u9fa5.
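As a possible alternative to a hard-coded range, Java's regex engine also understands Unicode script properties such as \p{IsHan}. The sketch below is only an illustration (the class and method names are made up for this example); it keeps every character whose script is Han, including ideographs that lie outside \u4e00-\u9fa5.

```java
final class HanFilterDemo {
    // Removes every character whose Unicode script is not Han.
    static String keepHan(String input) {
        return input.replaceAll("[^\\p{IsHan}]", "");
    }

    public static void main(String[] args) {
        System.out.println(keepHan("神探狄仁&*%$杰之四大天王@bdfbdbdfdgds23532"));
    }
}
```

Whether the script property or the explicit range is the better choice depends on whether rarer ideographs should be kept.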

2 Group

A group is a part of a regular expression enclosed in parentheses, and a group can be referenced by its number. Group 0 represents the entire expression, group 1 is the group opened by the first left parenthesis, and so on.
See the description of Pattern in the Java API:
Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:

1. ((A)(B(C)))
2. (A)
3. (B(C))
4. (C)

Another example: A(B(C))D has three groups if group 0 is counted. Group 0 is ABCD, group 1 is BC, and group 2 is C.
The number of left parentheses determines the number of groups, and each parenthesized expression is called a sub-expression.
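The numbering rule can be checked with a few lines of code. This is a minimal sketch; the helper name groupsOf is an assumption made for the example, not part of the Java API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupNumberingDemo {
    // Returns group 0 followed by every numbered group for a full match,
    // or an empty array if the input does not match the regex.
    static String[] groupsOf(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] groups = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            groups[i] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] groups = groupsOf("A(B(C))D", "ABCD");
        for (int i = 0; i < groups.length; i++) {
            System.out.println("group " + i + " = " + groups[i]);
        }
    }
}
```

Note that groupCount() does not count group 0, which is why the array is one element longer than the count.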

Eg1:

The Matcher object provides many methods:

  1. groupCount() returns the number of capturing groups in the pattern, which corresponds to the number of left parentheses; group 0 is not included in the count
  2. group(int i) returns the text matched by group i, or null if that group did not take part in the match
  3. start(int group) returns the start index of the text matched by the given group
  4. end(int group) returns the index of the last character matched by the given group, plus one
// Requires java.util.regex.Matcher and java.util.regex.Pattern.
// This regular expression has two groups (counting group 0):
// group(0) is \\$\\{([^{}]+?)\\}
// group(1) is ([^{}]+?)
String regex = "\\$\\{([^{}]+?)\\}";
Pattern pattern = Pattern.compile(regex);
String input = "${name}-babalala-${age}-${address}";

Matcher matcher = pattern.matcher(input);
System.out.println(matcher.groupCount());
// find() scans forward through the input string, like an iterator
while (matcher.find()) {
    System.out.println(matcher.group(0) + ", pos: "
            + matcher.start() + "-" + (matcher.end() - 1));
    System.out.println(matcher.group(1) + ", pos: " +
            matcher.start(1) + "-" + (matcher.end(1) - 1));
}

Output:

1
${name}, pos: 0-6
name, pos: 2-5
${age}, pos: 17-22
age, pos: 19-21
${address}, pos: 24-33
address, pos: 26-32

group() or group(0) returns the content matched by the entire regular expression on each match,
and group(1) returns the content matched inside the parentheses (the sub-expression group).
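One detail worth seeing in code is what happens when a group takes no part in a match: group(n) then returns null and start(n) returns -1. The sketch below uses a made-up pattern for a number with an optional unit; the class and method names are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OptionalGroupDemo {
    // Group 1 is the number, group 2 is an optional unit.
    private static final Pattern SIZE = Pattern.compile("(\\d+)(px|em)?");

    // Describes a full match as "number/unit/start index of the unit".
    static String describe(String input) {
        Matcher m = SIZE.matcher(input);
        if (!m.matches()) {
            return "no match";
        }
        // group(2) is null and start(2) is -1 when the optional group was skipped.
        String unit = (m.group(2) == null) ? "no unit" : m.group(2);
        return m.group(1) + "/" + unit + "/" + m.start(2);
    }

    public static void main(String[] args) {
        System.out.println(describe("12px"));
        System.out.println(describe("12"));
    }
}
```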

Eg2:

To make the grouping easier to see, wrap the regular expression from Eg1 in one more pair of parentheses:

String regex = "(\\$\\{([^{}]+?)\\})";
Pattern pattern = Pattern.compile(regex);
String input = "${name}-babalala-${age}-${address}";

Matcher matcher = pattern.matcher(input);
// matcher.find() matches against input repeatedly; each successful match carries several groups from which we can extract the results we want
while (matcher.find()) {
    System.out.println(matcher.group(0) + ", pos: " + matcher.start());
    System.out.println(matcher.group(1) + ", pos: " + matcher.start(1));
    System.out.println(matcher.group(2) + ", pos: " + matcher.start(2));
}

Output:

${name}, pos: 0
${name}, pos: 0
name, pos: 2
${age}, pos: 17
${age}, pos: 17
age, pos: 19
${address}, pos: 24
${address}, pos: 24
address, pos: 26

This shows that each pair of parentheses forms one group, and that counting the left parentheses tells you how many groups there are.

Extracting matched substrings from groups with group() has a wide range of applications.
In one of the author's projects, this feature was used to implement a very interesting wildcard replacement.
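The author's project code is not shown, so the following is only a guess at what such a wildcard replacement could look like: a hypothetical render helper that fills ${key} placeholders from a map, reusing the pattern from Eg1. It relies on Matcher.appendReplacement and Matcher.appendTail; leaving unknown keys untouched is an arbitrary choice made for this sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TemplateDemo {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^{}]+?)\\}");

    // Replaces each ${key} with values.get(key); unknown keys stay as they are.
    static String render(String template, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String key = m.group(1);
            String value = values.containsKey(key) ? values.get(key) : m.group(0);
            // quoteReplacement keeps '$' and '\' in the value from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        values.put("name", "Tom");
        values.put("age", "18");
        System.out.println(render("${name}-babalala-${age}-${address}", values));
    }
}
```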

Eg3 (extracting the desired data with groups):
// Requires java.util.regex.Matcher and java.util.regex.Pattern.
// This regular expression extracts the digits and the letters from the string
Pattern pattern = Pattern.compile("([0-9]+).*?([a-zA-Z]+)");
String input = "那就20200719这样吧sunny。。。。。。。122432该拿什么与眼泪抗衡twinkle";
Matcher matcher = pattern.matcher(input);
// Number of groups in the pattern
int group = matcher.groupCount();
// If the input contains several matching substrings, find() matches repeatedly
while (matcher.find()) {
    System.out.println("Matched substring: " + matcher.group());
    for (int i = 1; i <= group; i++) {
        System.out.println("Group " + i + ": " + matcher.group(i));
    }
}

Output:

Matched substring: 20200719这样吧sunny
Group 1: 20200719
Group 2: sunny
Matched substring: 122432该拿什么与眼泪抗衡twinkle
Group 1: 122432
Group 2: twinkle
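When a pattern has several groups, counting parentheses can become error-prone. Java 7 and later also allow a group to be given a name with (?<name>...) and read back with group("name"). The sketch below rewrites the extraction pattern that way; the group names digits and letters and the helper extract are chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedGroupDemo {
    private static final Pattern PAIR =
            Pattern.compile("(?<digits>[0-9]+).*?(?<letters>[a-zA-Z]+)");

    // Collects every match as "digits:letters".
    static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            result.add(m.group("digits") + ":" + m.group("letters"));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(extract("那就20200719这样吧sunny。。。。。。。122432该拿什么与眼泪抗衡twinkle"));
    }
}
```

Named groups are still numbered, so group(1) and group("digits") refer to the same text here.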

3 Group replacement

Eg1 :
String tel = "18304072984";
// Parentheses denote groups; in the replacement, $n stands for the content of group n
tel = tel.replaceAll("(\\d{3})\\d{4}(\\d{4})", "$1****$2");
System.out.print(tel);   // output: 183****2984

replaceAll replaces every match of a regular expression in a string. The parentheses in the regular expression define groups, and in the second argument of replaceAll, $n (n is a number) refers to the text matched by group n. Here "(\\d{3})\\d{4}(\\d{4})" splits the number into (the first three digits), the middle four digits, and (the last four digits). The replacement "$1****$2" keeps the first group unchanged ($1), replaces the middle with ****, and keeps the second group unchanged ($2).

Eg2:
String one = "hello girl hi hot".replaceFirst("(\\w+)\\s+(\\w+)", "$2 $1"); 
String two = "hello girl hi hot".replaceAll("(\\w+)\\s+(\\w+)", "$2 $1"); 
System.out.println(one);   // girl hello hi hot
System.out.println(two);   // girl hello hot hi

Once Eg1 is clear, this example follows naturally: replaceFirst swaps only the first pair of words, while replaceAll swaps every pair.

Eg3 :

Here is a practical example, repeated punctuation replacement:

String input = "假如生活欺骗了你,,,相信吧,,,快乐的日子将会来临!!!…………";

// Collapse repeated punctuation
String duplicateSymbolReg = "([。?!?!,]|\\.\\.\\.|……)+";
input = input.replaceAll(duplicateSymbolReg, "$1");
System.out.println(input);

Output:

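假如生活欺骗了你,相信吧,快乐的日子将会来临……

In the regular expression ([。?!?!,]|\\.\\.\\.|……)+, the parentheses form one group that matches a single punctuation mark, and + means the group repeats one or more times. $1 in replaceAll refers to the content of the group. For a repeated group that is the text of its last repetition, so each run of punctuation is replaced by its final mark. This is why the mixed run !!!………… at the end of the input collapses to …… alone.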
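If each repeated mark should be collapsed on its own, so that a mixed run keeps one of each mark, a backreference can be used in place of the quantifier on the group. The sketch below contrasts the two forms. It uses a simplified ASCII-only character class, and the class and method names are assumptions for the example.

```java
final class PunctuationDemo {
    // Quantifier on the group: a whole run of marks is one match,
    // and $1 holds only the last mark of that run.
    static String collapseRun(String s) {
        return s.replaceAll("([?!,.])+", "$1");
    }

    // Backreference: a mark must be followed by copies of itself,
    // so runs of different marks are collapsed separately.
    static String collapseEach(String s) {
        return s.replaceAll("([?!,.])\\1+", "$1");
    }

    public static void main(String[] args) {
        System.out.println(collapseRun("wow!!!???"));
        System.out.println(collapseEach("wow!!!???"));
    }
}
```

The same idea should carry over to the full-width marks and the ellipsis alternatives of the original pattern by appending \\1+ after the group.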

Eg4:

IP address sorting

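// Requires java.util.Arrays.
String ip = "192.68.1.254 102.49.23.013 10.10.10.10 2.2.2.2 8.109.90.30";
// Step 1: prefix every segment with 00
ip = ip.replaceAll("(\\d+)", "00$1");
System.out.println(ip);

// Step 2: drop surplus leading zeros so that every segment has exactly three digits
ip = ip.replaceAll("0*(\\d{3})", "$1");
System.out.println(ip);
String[] strs = ip.split(" ");

// Step 3: sort as text, then strip the leading zeros for display
Arrays.sort(strs);
for (String str : strs) {
    str = str.replaceAll("0*(\\d+)", "$1");
    System.out.println(str);
}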

Output:

00192.0068.001.00254 00102.0049.0023.00013 0010.0010.0010.0010 002.002.002.002 008.00109.0090.0030
192.068.001.254 102.049.023.013 010.010.010.010 002.002.002.002 008.109.090.030
2.2.2.2
8.109.90.30
10.10.10.10
102.49.23.13
192.68.1.254
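
The code works in three steps:

  1. Prefix every segment with 00 so that each segment has at least three digits (some now have four or five)
  2. Trim each segment to exactly three digits by dropping the surplus leading zeros
  3. Sort the zero-padded strings, which now compare correctly as text, then strip the leading zeros for display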
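For comparison, the same ordering can be reached without regex padding by comparing the addresses as numbers. This is a different technique from the one above, sketched under the assumption that every segment is a valid decimal number from 0 to 255; the class and method names are made up for the example.

```java
import java.util.Arrays;
import java.util.Comparator;

final class IpSortDemo {
    // Packs the four segments into one long so that addresses compare numerically.
    static long toNumber(String ip) {
        long value = 0;
        for (String part : ip.split("\\.")) {
            value = value * 256 + Integer.parseInt(part);
        }
        return value;
    }

    // Splits a space-separated list of addresses and sorts it numerically.
    static String[] sortIps(String line) {
        String[] ips = line.trim().split("\\s+");
        Arrays.sort(ips, Comparator.comparingLong(IpSortDemo::toNumber));
        return ips;
    }

    public static void main(String[] args) {
        for (String ip : sortIps("192.68.1.254 102.49.23.013 10.10.10.10 2.2.2.2 8.109.90.30")) {
            System.out.println(ip);
        }
    }
}
```

Unlike the regex version, this sketch leaves each address exactly as written, so 102.49.23.013 keeps its leading zero.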

At this point the author can't help but marvel at how powerful this is!

4 Back references

Once parentheses define a sub-expression group, the text matched by that group can be used again, either later in the same expression or by the program. By default every group is numbered automatically: counting left parentheses from left to right, the first group is group 1, the second is group 2, and so on.

Eg:
/* This regular expression matches AABB-style reduplicated words such as 安安静静 */
String regex = "(.)\\1(.)\\2";  
System.out.println("安安静静".matches(regex));   // true
System.out.println("安静安静".matches(regex));   // false

Above, (.) is a group and . matches any character, so each group captures a single character.
\\1 means the text captured by group 1 must appear again, and \\2 means the text captured by group 2 must appear again.

How would you write a regular expression that matches 安静安静? Following the example above, put 安静 into one group and require that group to appear again:

String regex = "(..)\\1";  
System.out.println("安静安静".matches(regex));   // true
System.out.println("安安静静".matches(regex));   // false
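The two patterns can be combined into a small classifier. This is a minimal sketch; the class name and the labels AABB, ABAB and other are chosen only for illustration.

```java
final class ReduplicationDemo {
    // Labels a word by the backreference pattern it matches.
    static String classify(String word) {
        if (word.matches("(.)\\1(.)\\2")) {
            return "AABB";   // e.g. 安安静静
        }
        if (word.matches("(..)\\1")) {
            return "ABAB";   // e.g. 安静安静
        }
        return "other";
    }

    public static void main(String[] args) {
        System.out.println(classify("安安静静"));
        System.out.println(classify("安静安静"));
    }
}
```

A word made of four identical characters matches both patterns; this sketch reports it as AABB because that check runs first.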

5 Backreference substitution

Eg1:
String str = "我我...我我...我要..要要...要要...找找找一个....女女女女...朋朋朋朋朋朋...友友友友友..友.友...友...友友!!!";
        
/* Remove the dots */
str = str.replaceAll("\\.+", "");
System.out.println(str);

str = str.replaceAll("(.)\\1+", "$1");
System.out.println(str);

Output:

我我我我我要要要要要找找找一个女女女女朋朋朋朋朋朋友友友友友友友友友友!!!
我要找一个女朋友!

(.) turns any single character into a group, and \\1+ refers back to that group and requires it to repeat one or more times. In the replacement, $1 refers to the group (.), so each run of repeated characters is replaced by a single character.
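The same technique works on whole words as well as single characters. The sketch below removes accidentally doubled words; the pattern and the helper name are assumptions for illustration, and \b is used so that only complete words are compared.

```java
final class DuplicateWordDemo {
    // Replaces a word followed by one or more copies of itself with a single copy.
    static String dedupeWords(String text) {
        return text.replaceAll("\\b(\\w+)(\\s+\\1\\b)+", "$1");
    }

    public static void main(String[] args) {
        System.out.println(dedupeWords("this is is a a a test"));
    }
}
```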

Eg2:

Remove a repeated two-digit sequence together with everything between its two occurrences:

"xx12abdd12345".replaceAll("(\\d{2}).+?\\1", "");  // result: xx345

Pretty amazing, isn't it?

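One exception to watch for when using the replace family of methods: replaceAll() can fail with the error Illegal group reference. It is thrown as an IllegalArgumentException when the replacement string contains a $ that is not followed by a valid group reference. Escape a literal $ as \\$, or pass the replacement through Matcher.quoteReplacement().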
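A short sketch of the problem and one way around it. The {price} marker and the method names are invented for this example; Matcher.quoteReplacement is the standard library method that makes a replacement string literal.

```java
import java.util.regex.Matcher;

final class QuoteReplacementDemo {
    // Replaces every "{price}" marker with the given text, taken literally.
    static String fillPrice(String template, String price) {
        return template.replaceAll("\\{price\\}", Matcher.quoteReplacement(price));
    }

    // Returns true if using the text directly as a replacement is rejected.
    static boolean failsUnquoted(String template, String price) {
        try {
            template.replaceAll("\\{price\\}", price);
            return false;
        } catch (IllegalArgumentException e) {
            // A '$' that is not followed by a group reference ends up here.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(failsUnquoted("cost: {price}", "$x"));
        System.out.println(fillPrice("cost: {price}", "$x"));
    }
}
```

Quoting matters most when the replacement text comes from user input or data files, where a stray $ or backslash is easy to miss.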

Origin blog.csdn.net/jiaobuchong/article/details/81257570