I came across this HackerRank problem, where the regex should match the text between HTML tags. The regex and the string are:
String str="<h1>Hello World!</h1>";
String regex="<(.+)>([^<]+)</\\1>";
Also, what if str contains more than one HTML tag, like String str="<h1><h1>Hello World!</h1></h1>", and how does ([^<]+) capture the text in that str?
My question is why ([^<]+) matches str and ([a-zA-Z]+) does not.
Here is the full source code:
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/* Solution assumes we can't have the symbol "<" as text between tags */
public class Solution {
    public static void main(String[] args) {
        // Group 1 is the tag, group 2 is the text between the tags
        String regex = "<(.+)>([^<]+)</\\1>";
        Scanner scan = new Scanner(System.in);
        int testCases = Integer.parseInt(scan.nextLine());
        while (testCases-- > 0) {
            String line = scan.nextLine();
            boolean matchFound = false;
            Pattern r = Pattern.compile(regex);
            Matcher m = r.matcher(line);
            // Print the content of every tag pair found on this line
            while (m.find()) {
                System.out.println(m.group(2));
                matchFound = true;
            }
            if (!matchFound) {
                System.out.println("None");
            }
        }
    }
}
Don't mind if I'm stupid to ask this question and thank you in advance!
This regex guarantees that each match contains only one tag, assuming well-formed HTML input.
The initial <(.+)> captures the name of your tag. The capture group will also get any attributes it can. Since + is a greedy quantifier, it will capture multiple tags if it can.
The trailing </\\1> matches against whatever the first group captured. That's why, if your HTML is well-formed, the expression won't capture multiple tags or tags with attributes:
- Opening tag <h1>, closing tag </h1> ✓
- Opening tag <h1 attr="value">, closing tag </h1>, but expecting </h1 attr="value">
- Opening tag <h1><h2>, closing tag </h2></h1>, but expecting </h1><h2>
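The cases above can be checked with a small sketch. The class name TagMatchDemo and the helper contents are made up for illustration; the pattern is the one from the question. Note that for the nested input from the question, find() fails at the outer tag (the back-reference does not line up) and then retries further along the string, so the inner <h1>...</h1> pair is still reported.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagMatchDemo {
    // Pattern from the question: group 1 is the tag, group 2 the content.
    static final Pattern TAG = Pattern.compile("<(.+)>([^<]+)</\\1>");

    // Hypothetical helper: collects group 2 of every match in the input.
    static List<String> contents(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TAG.matcher(input);
        while (m.find()) {
            out.add(m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        // Plain tag: the back-reference matches the closing tag.
        System.out.println(contents("<h1>Hello World!</h1>"));               // [Hello World!]
        // Tag with an attribute: </h1> is not </h1 attr="value">, so no match.
        System.out.println(contents("<h1 attr=\"value\">Hello World!</h1>")); // []
        // Nested tags: no match at the outer tag, but the inner pair matches.
        System.out.println(contents("<h1><h1>Hello World!</h1></h1>"));      // [Hello World!]
    }
}
```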
That's why the tag can be matched with .+ rather safely, while the contents must be matched with [^<]+. You want to make sure you don't grab any stray tags in the content, but any other character at all is allowed. [^<]+ (read as "not <, at least once") allows things like ! and spaces, while [a-zA-Z]+ certainly would not.
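To see the difference between the two character classes, here is a minimal sketch that runs both variants against the string from the question. ContentClassDemo and firstContent are hypothetical names used only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentClassDemo {
    // Hypothetical helper: group 2 of the first match, or null if none.
    static String firstContent(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String str = "<h1>Hello World!</h1>";
        // [^<]+ accepts letters, the space and the '!'.
        System.out.println(firstContent("<(.+)>([^<]+)</\\1>", str));     // Hello World!
        // [a-zA-Z]+ stops at the space, so </h1> cannot follow and there is no match.
        System.out.println(firstContent("<(.+)>([a-zA-Z]+)</\\1>", str)); // null
        // With letters only between the tags, [a-zA-Z]+ does match.
        System.out.println(firstContent("<(.+)>([a-zA-Z]+)</\\1>", "<h1>Hello</h1>")); // Hello
    }
}
```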