Negation of regex for finding HTML tags and their content - java

Ondřej Burda :

I'm doing a project at uni where I have to clean some HTML code using regex (I know, not the best approach...).

Input of body:

<h1>This is heading 1</h1>
<h2 style="color: aqua">This is heading 2</h2>
<h3>This is heading 3</h3>
<p>This is a paragraph.</p>
<p>This is another paragraph.</p>
<a href="https://www.w3schools.com">This is a link</a>
<ul>
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ul>

I get a list of allowed tags, for example {h3, p, ul}, and I have to remove every other tag along with its content.

First I strip all attributes (they're not allowed). Then I came up with this regex, which removes tags and their content:

String regex = "(?i)<([h3|ul|p]+)>\\n?.*\\n?<\\/\\1>";

It works, but now I have to negate it and remove all tags and content except those given in...

I tried this, but it doesn't work:

`...[?!h3|ul|p]...`

Desired outcome for this example:

<h3>This is heading 3</h3>
<p>This is a paragraph.</p>
<p>This is another paragraph.</p>
<ul>
</ul>

I don't really understand negative lookahead or how to apply it to my problem, so I'd be thankful for any advice.

Pushpesh Kumar Rajwanshi :

The negative lookahead you are trying to use needs to be written as (?!(?:h3|ul|p)\b), which prevents a match when the tag name is h3, ul, or p. Notice the word boundary \b after the group: it ensures that only those exact tag names are excluded, so a tag such as pre, which merely starts with p, can still be matched and removed. Besides removing the tags, you will also want to remove the whitespace left behind, so the overall regex you need is this:

\h*<(?!(?:h3|ul|p)\b)([^>]+).*?>[\w\W]*?</\1>\s*
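To see why the square brackets in the original attempt cannot work, and what the \b contributes, here is a small sketch. The class name LookaheadDemo and the constants LOOSE and EXACT are made up for illustration, and the patterns are cut down to just the opening `<` plus a tag name.

```java
import java.util.regex.Pattern;

public class LookaheadDemo {
    // Lookahead without \b: rejects any name that merely starts with h3, ul or p
    static final Pattern LOOSE = Pattern.compile("<(?!(?:h3|ul|p))\\w+");
    // Lookahead with \b: rejects only the exact names h3, ul and p
    static final Pattern EXACT = Pattern.compile("<(?!(?:h3|ul|p)\\b)\\w+");

    public static void main(String[] args) {
        // [h3|ul|p]+ is a character class: any run of the characters h, 3, |, u, l, p
        System.out.println(Pattern.matches("[h3|ul|p]+", "hulp"));  // true
        // (?:h3|ul|p) is an alternation: exactly one of the three names
        System.out.println(Pattern.matches("(?:h3|ul|p)", "hulp")); // false

        System.out.println(LOOSE.matcher("<pre>").find()); // false: pre starts with p
        System.out.println(EXACT.matcher("<pre>").find()); // true: pre is not exactly p
        System.out.println(EXACT.matcher("<p>").find());   // false: p is excluded
    }
}
```

Putting ?! inside [...] just adds the characters ? and ! to the character class, which is why the `[?!h3|ul|p]` attempt does not negate anything.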

Regex Explanation:

  • \h* - Matches zero or more horizontal whitespace characters (spaces, tabs, and similar) before the tag
  • < - Start of the opening tag
  • (?!(?:h3|ul|p)\b) - Negative lookahead that rejects exactly the h3, ul, and p tags
  • ([^>]+) - Matches one or more characters other than > and captures them in group 1 for the back-reference later. Because it is greedy, it first grabs any attributes too, then backtracks until the closing tag </\1> can match, which typically leaves just the tag name in the group. You can use something like \w+ or a character set with allowed characters to match only the name.
  • .*?> - Lazily matches zero or more characters (basically the attributes) up to the > that closes the opening tag
  • [\w\W]*? - Lazily matches zero or more of any character, including newlines
  • </\1> - Matches the closing tag, where \1 is whatever group 1 captured as the tag name
  • \s* - Matches zero or more whitespace characters, which consumes the empty space left by the removed tag
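Since the allowed tags arrive as a list, the lookahead can be assembled at runtime instead of being hard-coded. The following is a minimal sketch under some assumptions: the class TagFilter and the method stripDisallowed are hypothetical names, and it uses the (\w+)[^>]*> variant for the tag name and attributes rather than ([^>]+).*?>.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class TagFilter {
    // Hypothetical helper: removes every element whose tag name is not in allowedTags
    static String stripDisallowed(String html, List<String> allowedTags) {
        // Quote each name so it is treated literally, then join into an alternation
        String names = allowedTags.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // With no allowed tags there is nothing to exclude, so skip the lookahead
        String lookahead = allowedTags.isEmpty() ? "" : "(?!(?:" + names + ")\\b)";
        // Same shape as the regex above, with (\w+) capturing only the tag name
        String regex = "\\h*<" + lookahead + "(\\w+)[^>]*>[\\w\\W]*?</\\1>\\s*";
        return html.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String html = "<h1>Title</h1>\n<p>Keep me</p>\n<ul>\n  <li>Tea</li>\n</ul>";
        System.out.println(stripDisallowed(html, Arrays.asList("h3", "ul", "p")));
    }
}
```

For this input the h1 and li elements are removed, while the p element and the now empty ul remain. As with the original regex, the lazy match stops at the first closing tag with the same name, so an element nested inside another element of the same name would not be handled correctly.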

Java code demo:

String s = "<h1>This is heading 1</h1>\r\n" + 
        "<h2 style=\"color: aqua\">This is heading 2</h2>\r\n" + 
        "<h3>This is heading 3</h3>\r\n" + 
        "<p>This is a paragraph.</p>\r\n" + 
        "<p>This is another paragraph.</p>\r\n" + 
        "<a href=\"https://www.w3schools.com\">This is a link</a>\r\n" + 
        "<ul>\r\n" + 
        "  <li>Coffee</li>\r\n" + 
        "  <li>Tea</li>\r\n" + 
        "  <li>Milk</li>\r\n" + 
        "</ul>";

System.out.println("Before:\n" + s);
System.out.println("\nAfter:\n" + s.replaceAll("\\h*<(?!(?:h3|ul|p)\\b)([^>]+).*?>[\\w\\W]*?</\\1>\\s*", ""));

Output:

Before:
<h1>This is heading 1</h1>
<h2 style="color: aqua">This is heading 2</h2>
<h3>This is heading 3</h3>
<p>This is a paragraph.</p>
<p>This is another paragraph.</p>
<a href="https://www.w3schools.com">This is a link</a>
<ul>
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ul>

After:
<h3>This is heading 3</h3>
<p>This is a paragraph.</p>
<p>This is another paragraph.</p>
<ul>
</ul>
