[Java usage] Filtering HTML tags in Java to extract plain text

Contents of this article

Solution 1: Hutool utilities

Solution 2: Spring's built-in utilities

Solution 3: Write your own regex utility


This feature is simple enough to implement with regular expressions, but capable libraries already provide it, so there is little reason to reinvent the wheel. Filtering HTML tags also matters for security: it helps guard against XSS attacks.
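As a rough illustration of the regular-expression approach mentioned above, the sketch below removes anything that looks like a tag using only the JDK. The class name `TagStripDemo` and the method `stripTags` are made up for this example, and the pattern is deliberately naive; it is not a substitute for a real sanitizer.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
final class TagStripDemo {

    // Matches everything from '<' up to the next '>' (a naive notion of a tag).
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Returns the input with all tag-like substrings removed.
    static String stripTags(String html) {
        if (html == null || html.isEmpty()) {
            return "";
        }
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // prints: Hello world
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
        // prints: alert(1)safe  (the tags go, but the script body stays as text)
        System.out.println(stripTags("<script>alert(1)</script>safe"));
    }
}
```

The second call shows why script and style blocks usually need to be removed as whole blocks before the generic tag pattern runs.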

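Solution 1: Hutool utilities

cn.hutool.http.HtmlUtil can strip HTML tags. It offers a number of methods, so try them out to see which one fits your use case. The example below serializes an object to JSON, removes all tags with HtmlUtil.cleanHtmlTag, and converts the result back into an object.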

package com.soft.practice.javacolume1;

import cn.hutool.http.HtmlUtil;
import cn.hutool.json.JSONUtil;
import com.soft.practice.domain.dataobject.MainsiteQuoteDO;

public class HtmlFilterTest {

    public static void main(String[] args) {
        // MainsiteQuoteDO is a project-specific data object whose String fields hold HTML fragments
        MainsiteQuoteDO mainsiteQuoteDO = new MainsiteQuoteDO();
        mainsiteQuoteDO.setArea("<div><p><span class='width:60px'>这是面积<span/></p></div>");
        mainsiteQuoteDO.setCity("<div class=\"article-content\"><p><span class=\"bjh-p\"感知</span></p><p><span class=\"bjh-p\">【美侦察机前往侦察】");
        mainsiteQuoteDO.setCustomerMessage("\"<div class=\\\"article-content\\\"><p><span class=\\\"bjh-p\\\">\" +\n" +
                "                \"感知</span></p><p><span class=\\\"bjh-p\\\">【美侦察机前往侦察】\"");
        mainsiteQuoteDO.setSearchKey("<div><p><span class='width:60px'>这是面积<span/></p></div>");

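        // Serialize the object to JSON, strip every HTML tag from the JSON string,
        // then deserialize the cleaned JSON back into a new object
        String s = HtmlUtil.cleanHtmlTag(JSONUtil.toJsonStr(mainsiteQuoteDO));
        MainsiteQuoteDO quoteDO = JSONUtil.toBean(s, MainsiteQuoteDO.class);
        System.out.println(quoteDO);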
    }
}

Solution 2: Spring's built-in utilities

org.springframework.web.util.HtmlUtils converts between HTML special characters and their escaped entity forms. Note that it escapes tags rather than removing them, so the markup is neutralized but still present in the text. The class has several other methods that are worth experimenting with.

The code is as follows:

// HTML escaping and unescaping
String s = HtmlUtils.htmlEscape("<div>hello world</div><p>&nbsp;</p>");
System.out.println(s);
String s2 = HtmlUtils.htmlUnescape(s);
System.out.println(s2);

Output result:

&lt;div&gt;hello world&lt;/div&gt;&lt;p&gt;&amp;nbsp;&lt;/p&gt;  
  
<div>hello world</div><p>&nbsp;</p>  
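To make the difference between escaping and stripping concrete, here is a JDK-only sketch of what escaping amounts to. `EscapeDemo` and `escape` are hypothetical names, and this is not how Spring implements `htmlEscape` (the real method handles more characters); it only covers the five characters that are significant in markup.

```java
// Hypothetical demo class illustrating the idea behind HTML escaping.
final class EscapeDemo {

    // Replaces &, <, >, " and ' with entity references; other characters pass through.
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: &lt;div&gt;hello world&lt;/div&gt;&lt;p&gt;&amp;nbsp;&lt;/p&gt;
        System.out.println(escape("<div>hello world</div><p>&nbsp;</p>"));
    }
}
```

The tags survive as visible text instead of being interpreted by the browser, which is the behavior to pick when the original markup should stay readable rather than disappear.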

Solution 3: Write your own regex utility

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import org.springframework.util.StringUtils;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Utility for filtering HTML tags
 *
 * @author jim
 * @date 2017/11/27
 */
public final class HtmlUtils {

    private static final Logger logger = LoggerFactory.getLogger(HtmlUtils.class);

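    /**
     * Instantiation is not allowed
     */
    private HtmlUtils() {
        throw new IllegalStateException("Instantiation is not allowed");
    }

    /**
     * Strips HTML tags and returns the plain text
     *
     * @param inputString the original string
     * @return the filtered string
     */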
    public static String Html2Text(String inputString) {
        if (StringUtils.isEmpty(inputString)) {
            return "";
        }

        // Input string that may contain HTML tags
        String htmlStr = inputString.trim();
        String textStr = "";

        try {
            // Regex for script blocks (a simpler form would be <script[^>]*?>[\\s\\S]*?<\\/script>)
            String regExScript = "<[\\s]*?script[^>]*?>[\\s\\S]*?<[\\s]*?\\/[\\s]*?script[\\s]*?>";

            // Regex for style blocks (a simpler form would be <style[^>]*?>[\\s\\S]*?<\\/style>)
            String regExStyle = "<[\\s]*?style[^>]*?>[\\s\\S]*?<[\\s]*?\\/[\\s]*?style[\\s]*?>";

            // Regex for any remaining HTML tag
            String regExHtml = "<[^>]+>";

            // Regex for runs of whitespace (spaces, tabs, carriage returns, newlines)
            String regExSpace = "\\s+";

            // Regex for HTML entities such as &nbsp;
            String regExEscape = "&.{2,6}?;";

            // Remove script blocks, including their contents
            Matcher scriptMatcher = Pattern.compile(regExScript, Pattern.CASE_INSENSITIVE).matcher(htmlStr);
            htmlStr = scriptMatcher.replaceAll("");

            // Remove style blocks, including their contents
            Matcher styleMatcher = Pattern.compile(regExStyle, Pattern.CASE_INSENSITIVE).matcher(htmlStr);
            htmlStr = styleMatcher.replaceAll("");

            // Remove all remaining HTML tags
            Matcher htmlMatcher = Pattern.compile(regExHtml, Pattern.CASE_INSENSITIVE).matcher(htmlStr);
            htmlStr = htmlMatcher.replaceAll("");

            // Remove all whitespace (note: this also removes spaces between words)
            Matcher spaceMatcher = Pattern.compile(regExSpace).matcher(htmlStr);
            htmlStr = spaceMatcher.replaceAll("");

            // Remove HTML entities
            Matcher escapeMatcher = Pattern.compile(regExEscape, Pattern.CASE_INSENSITIVE).matcher(htmlStr);
            htmlStr = escapeMatcher.replaceAll("");

            textStr = htmlStr;

        } catch (Exception e) {
            logger.warn("Html2Text failed: {}", e.getMessage(), e);
        }

        // Return the plain-text result
        return textStr;
    }
}
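For readers who want to try the same sequence of steps without the SLF4J and Spring dependencies, here is a JDK-only sketch. The class `HtmlTextSketch` and method `toText` are hypothetical names; the patterns are simplified equivalents of the ones above and are compiled once as constants, which is an assumption about what is worth optimizing rather than a change in behavior.

```java
import java.util.regex.Pattern;

// Hypothetical JDK-only variant of the utility above.
final class HtmlTextSketch {

    private static final Pattern SCRIPT = Pattern.compile(
            "<\\s*script[^>]*>[\\s\\S]*?<\\s*/\\s*script\\s*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern STYLE = Pattern.compile(
            "<\\s*style[^>]*>[\\s\\S]*?<\\s*/\\s*style\\s*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern SPACE = Pattern.compile("\\s+");
    private static final Pattern ENTITY = Pattern.compile("&.{2,6}?;");

    // Applies the same five steps in the same order: script, style, tags, whitespace, entities.
    static String toText(String html) {
        if (html == null || html.isEmpty()) {
            return "";
        }
        String s = html.trim();
        s = SCRIPT.matcher(s).replaceAll("");
        s = STYLE.matcher(s).replaceAll("");
        s = TAG.matcher(s).replaceAll("");
        s = SPACE.matcher(s).replaceAll("");
        s = ENTITY.matcher(s).replaceAll("");
        return s;
    }

    public static void main(String[] args) {
        String html = "<div><style>p{color:red}</style>"
                + "<p>Hello&nbsp;<b>world</b></p><script>alert(1)</script></div>";
        // prints: Helloworld
        System.out.println(toText(html));
        // prints: Helloworld  (the space between the words is removed too)
        System.out.println(toText("<p>Hello world</p>"));
    }
}
```

Because the whitespace step removes every space, this approach suits text without word spacing (such as the Chinese samples earlier in the article) better than English prose; replacing whitespace runs with a single space instead would be one way to adapt it.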

end!

Origin blog.csdn.net/weixin_44299027/article/details/110248100