Filtering HTML tags in Java while keeping img tags whose src attribute starts with a specific path

Background:

I recently built an article management module. One of its interfaces returns the article list and must show the first 300 characters of each article's content. Because an article may contain several pictures, attachments, and so on, the content is stored in the database as a blob. The front end sends the article to the back end as HTML, and the back end stores the whole article in the blob.

If the user inserts emoticons from the front-end editor (we use the tinymce editor) while typing the article, the editor converts them into img tags as well. The only difference from ordinary pictures is that the first part of the src path is fixed.

Requirements:

The first 300 characters of the article content returned by the back end must be plain text with the HTML tags filtered out: tags such as <p>, <a>, and <img> are replaced with an empty string. Emoticons, however, must not be filtered. If emoticons appear within the first 300 characters, they are displayed. All other pictures must be filtered out.

Solution:

Use a regular expression to replace the HTML tags in the article content with an empty string, leaving the emoticon img tags alone, and then take the first 300 characters of what remains.
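
The negative-lookahead idea can be tried in isolation before wiring it into the module. The following is only a minimal sketch: the sample HTML string is made up, and the names LookaheadDemo and stripTags are hypothetical. Only the emoticon path prefix comes from this article.

```java
import java.util.regex.Pattern;

final class LookaheadDemo {
    // Any tag is matched unless the text at that position starts an emoticon img tag.
    // The dots in the path are escaped so that they match literally.
    static final Pattern NON_EMOTICON_TAG = Pattern.compile(
        "(?!<(img|IMG) src=\"static/tinymce4\\.7\\.5/plugins/img/.*?/>)<.*?>");

    // Removes every tag except emoticon img tags.
    static String stripTags(String html) {
        return NON_EMOTICON_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical sample: one emoticon, one ordinary picture, one link.
        String html = "<p>Hi <img src=\"static/tinymce4.7.5/plugins/img/01.gif\" />"
                    + "<img src=\"upload/photo.png\" /> <a href=\"#\">there</a></p>";
        System.out.println(stripTags(html));
    }
}
```

On this sample the <p>, <a>, and ordinary <img> tags are removed, while the emoticon tag and the text "Hi" and "there" remain. Note that the pattern only recognizes emoticon tags that start with <img src="..." and end with />, which is how the regex in this article is written.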

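// content holds the article's HTML.
// static/tinymce4.7.5/plugins/img/01.gif is an emoticon path; every emoticon shares
// this path prefix and only the file name differs.
// Regex: (?!<(img|IMG) src="static/tinymce4\.7\.5/plugins/img/.*?/>)<.*?>
// The negative lookahead skips emoticon tags; every other tag is matched and removed.
content = content.replaceAll("(?!<(img|IMG) src=\"static/tinymce4\\.7\\.5/plugins/img/.*?/>)<.*?>", "");

// Remove tabs, carriage returns, and line feeds.
content = content.replaceAll("\t|\n|\r", "");

// Replace the curly-quote entities with a plain double quote.
content = content.replace("&ldquo;", "\"");
content = content.replace("&rdquo;", "\"");

// Remove non-breaking-space entities.
content = content.replace("&nbsp;", "");

// Keep at most the first 300 characters. Emoticon tag characters count toward the
// limit, so a tag that straddles position 300 is cut off.
if (content.length() > 300) {
    content = content.substring(0, 300);
}
// println rather than printf: printf would treat any % in the content as a format specifier.
System.out.println("content=" + content);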
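
One caveat with substring(0, 300) is that it counts the characters of the emoticon tags themselves and can cut a tag in half. A possible alternative, sketched below, copies emoticon tags whole and counts only the text between them. This is an assumption about how the limit should behave, not something the requirements spell out, and the names PreviewTruncator and truncate are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PreviewTruncator {
    // Matches one emoticon tag. The path prefix comes from the article;
    // the [^>]*? part assumes the tag contains no '>' before its closing "/>".
    private static final Pattern EMOTICON = Pattern.compile(
        "<(?:img|IMG) src=\"static/tinymce4\\.7\\.5/plugins/img/[^>]*?/>");

    // Keeps at most maxChars text characters; emoticon tags are copied whole
    // and are not counted toward the limit.
    static String truncate(String content, int maxChars) {
        StringBuilder out = new StringBuilder();
        Matcher m = EMOTICON.matcher(content);
        int pos = 0;               // next index of content to copy
        int remaining = maxChars;  // text characters still allowed
        while (m.find(pos)) {
            int textLen = m.start() - pos;
            if (textLen >= remaining) {
                break;             // the limit is reached before this emoticon
            }
            out.append(content, pos, m.start());
            remaining -= textLen;
            out.append(m.group()); // copy the emoticon tag in one piece
            pos = m.end();
        }
        // Copy the remaining text up to the limit; no emoticon tag lies in this range.
        int end = Math.min(content.length(), pos + remaining);
        out.append(content, pos, end);
        return out.toString();
    }

    public static void main(String[] args) {
        String tag = "<img src=\"static/tinymce4.7.5/plugins/img/01.gif\" />";
        System.out.println(truncate("ab" + tag + "cdef", 3));
    }
}
```

It would be called on the content after the tag, whitespace, and entity replacements, for example content = PreviewTruncator.truncate(content, 300). With a limit of 3, the sample in main keeps "ab", the complete emoticon tag, and "c".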

 


Source: blog.csdn.net/dhklsl/article/details/115620016