How does this group() catch the text?

Dev_noob :

I came across this HackerRank problem, where the regex should match the text between HTML tags. The regex and the string are:

String str="<h1>Hello World!</h1>";
String regex="<(.+)>([^<]+)</\\1>";

Also, what if 'str' contains more than one HTML tag, such as String str="<h1><h1>Hello World!</h1></h1>"? How does ([^<]+) capture the text in that case?

My question is why ([^<]+) matches the text in 'str' while ([a-zA-Z]+) does not.

Here is the full source code:

import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/* Solution assumes we can't have the symbol "<" as text between tags */
public class Solution{
    public static void main(String[] args){
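        // Pattern from the question: group 1 is the tag name, group 2 is the text between the tags
        String regex = "<(.+)>([^<]+)</\\1>";
        Scanner scan = new Scanner(System.in);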
        int testCases = Integer.parseInt(scan.nextLine());

        while (testCases-- > 0) {
            String line = scan.nextLine();

            boolean matchFound = false;
            Pattern r = Pattern.compile(regex);
            Matcher m = r.matcher(line);

            while (m.find()) {
                System.out.println(m.group(2));
                matchFound = true;
            }
            if ( ! matchFound) {
                System.out.println("None");
            }
        }
    }
}

Sorry if this is a silly question, and thank you in advance!

Mad Physicist :

This regex guarantees that each match contains only one tag, assuming well-formed HTML input.

The initial <(.+)> captures the name of your tag. The capture group will also grab any attributes it can. Since . matches any character and + is a greedy quantifier, the group will capture multiple tags if it can.

The trailing </\\1> is a backreference: it must match exactly what the first group captured. That's why, if your HTML is well formed, the expression won't capture multiple tags or tags with attributes:

  • Opening tag <h1>, closing tag </h1>: the backreference expects </h1>, so this matches
  • Opening tag <h1 attr="value">, closing tag </h1>, but the regex expects </h1 attr="value">
  • Opening tag <h1><h2>, closing tag </h2></h1>, but the regex expects </h1><h2>
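The three cases above, plus the nested <h1><h1> string from the question, can be tried with a small sketch like the following. The class name TagRegexDemo and the helper firstMatch are made up for illustration; only the pattern itself comes from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagRegexDemo {
    // Same pattern as in the question.
    static final Pattern TAG = Pattern.compile("<(.+)>([^<]+)</\\1>");

    // Hypothetical helper: returns "tag|content" for the first match, or "None".
    static String firstMatch(String input) {
        Matcher m = TAG.matcher(input);
        return m.find() ? m.group(1) + "|" + m.group(2) : "None";
    }

    public static void main(String[] args) {
        // Plain tag: group 1 is "h1", so the backreference expects </h1>.
        System.out.println(firstMatch("<h1>Hello World!</h1>"));              // h1|Hello World!
        // Attributes end up in group 1, so </h1> no longer satisfies \1.
        System.out.println(firstMatch("<h1 attr=\"value\">Hello World!</h1>")); // None
        // Two different nested tags: only the inner pair can satisfy the pattern.
        System.out.println(firstMatch("<h1><h2>Hello World!</h2></h1>"));     // h2|Hello World!
        // The nested case from the question.
        System.out.println(firstMatch("<h1><h1>Hello World!</h1></h1>"));     // h1|Hello World!
    }
}
```

In the nested cases the attempt that starts at the outer tag fails, either because [^<]+ immediately runs into a < or because the backreference does not line up. Since find() then retries from later positions, it ends up matching the innermost pair.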

That's why the tag can be matched with .+ rather safely, while the contents must be matched with [^<]+. You want to make sure you don't grab any stray tags in the content, but any other character at all is allowed. [^<]+ (pronounced "not <, at least once") allows things like spaces and !, while [a-zA-Z]+ certainly would not.
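To see the difference between the two character classes, here is a minimal sketch that plugs each one into the same pattern. The class name ContentClassDemo and the helper extract are hypothetical names used only for this comparison.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentClassDemo {
    // Hypothetical helper: builds the question's pattern with the given
    // content class in group 2 and returns that group, or "None".
    static String extract(String contentClass, String input) {
        Pattern p = Pattern.compile("<(.+)>(" + contentClass + ")</\\1>");
        Matcher m = p.matcher(input);
        return m.find() ? m.group(2) : "None";
    }

    public static void main(String[] args) {
        String str = "<h1>Hello World!</h1>";
        // [^<]+ accepts the space and the '!', stopping only at the next '<'.
        System.out.println(extract("[^<]+", str));                  // Hello World!
        // [a-zA-Z]+ stops at the space, so "</" cannot follow and the match fails.
        System.out.println(extract("[a-zA-Z]+", str));              // None
        // Letters-only content works with either class.
        System.out.println(extract("[a-zA-Z]+", "<h1>Hello</h1>")); // Hello
    }
}
```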
