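[Open Source Software] Analysis of an HtmlCleaner bug in page-encoding detection

HtmlCleaner is an open-source HTML parser written in Java. It is powerful and easy to use. This post does not cover how to use it; for that, see the official site (http://htmlcleaner.sourceforge.net/javause.php).

This post describes one bug in HtmlCleaner.

Symptom:

When fetching a page with HtmlCleaner, you can leave the encoding unset if you do not know it. The code looks like this: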

HtmlCleaner cleaner = new HtmlCleaner();
URL url = new URL("http://www.qq.com/");
TagNode node = cleaner.clean(url);

HtmlCleaner then detects the page encoding itself, but its detection misses one case: a page that declares its encoding in the following form:

<meta charset="UTF-8" />

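In this case HtmlCleaner cannot determine the page encoding and falls back to the system encoding. If the system encoding differs from the page's encoding, the text comes out garbled.

Solution: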
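To see why the short form slips through, the http-equiv pattern from the getCharsetFromContent method listed in this post can be run against both kinds of tag. This is only a minimal sketch: the class name HttpEquivMiss and the method sniff are hypothetical, and the pattern is copied from this post rather than checked against any particular HtmlCleaner release.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HttpEquivMiss {
    // Pattern copied from getCharsetFromContent as listed in this post.
    static final Pattern HTTP_EQUIV = Pattern.compile(
            "\\<meta\\s*http-equiv=[\\\"\\']content-type[\\\"\\']\\s*content\\s*=\\s*[\"']text/html\\s*;\\s*charset=([a-z\\d\\-]*)[\\\"\\'\\>]",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset captured by the http-equiv pattern, or null if it does not match. */
    static String sniff(String html) {
        Matcher m = HTTP_EQUIV.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String longForm = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=gb2312\">";
        String shortForm = "<meta charset=\"UTF-8\" />";
        System.out.println(sniff(longForm));  // the long form is recognised
        System.out.println(sniff(shortForm)); // the short form is not, so this is null
    }
}
```

The pattern requires http-equiv= right after <meta, so a tag that starts with charset= can never match it.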

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.htmlcleaner.HtmlCleaner;

// Place these static methods in a utility class of your own.

// Try each source in turn: HTTP header, http-equiv meta tag, <meta charset>, then the default.
public static String getCharset(URL url) throws IOException {
    URLConnection urlConnection = url.openConnection();
    String charset = getCharsetFromContentTypeString(urlConnection.getHeaderField("Content-Type"));
    if (charset == null) {
        charset = getCharsetFromContent(url);
    }
    if (charset == null) {
        charset = getCharsetFromMeta(url);
    }
    if (charset == null) {
        charset = HtmlCleaner.DEFAULT_CHARSET;
    }
    return charset;
}

// Extract the charset from a Content-Type header value such as "text/html; charset=utf-8".
public static String getCharsetFromContentTypeString(String contentType) {
    if (contentType == null) {
        return null;
    }
    return findCharset("charset=([a-z\\d\\-]*)", contentType);
}

// Look for <meta http-equiv="Content-Type" content="text/html; charset=..."> near the top of the page.
public static String getCharsetFromContent(URL url) throws IOException {
    String pattern = "\\<meta\\s*http-equiv=[\\\"\\']content-type[\\\"\\']\\s*content\\s*=\\s*[\"']text/html\\s*;\\s*charset=([a-z\\d\\-]*)[\\\"\\'\\>]";
    return findCharset(pattern, readHead(url));
}

// Look for the short form <meta charset="..."> near the top of the page.
// The opening quote is optional and comes after "charset=", not before it.
public static String getCharsetFromMeta(URL url) throws IOException {
    String pattern = "<meta\\s+charset\\s*=\\s*[\"']?([a-z\\d\\-]+)";
    return findCharset(pattern, readHead(url));
}

// Read up to the first 2048 bytes of the page and close the stream afterwards.
// ISO-8859-1 maps each byte to one char, so the ASCII meta tag survives whatever the real encoding is.
private static String readHead(URL url) throws IOException {
    try (InputStream stream = url.openStream()) {
        byte[] chunk = new byte[2048];
        int bytesRead = stream.read(chunk);
        return bytesRead > 0 ? new String(chunk, 0, bytesRead, StandardCharsets.ISO_8859_1) : "";
    }
}

// Return group 1 of the first match if it names a charset this JVM supports, otherwise null.
private static String findCharset(String pattern, String text) {
    Matcher matcher = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE).matcher(text);
    if (matcher.find()) {
        String charset = matcher.group(1);
        try {
            if (Charset.isSupported(charset)) {
                return charset;
            }
        } catch (IllegalCharsetNameException e) {
            // Malformed name (for example an empty match): treat as not found.
        }
    }
    return null;
}

Note: getCharsetFromContentTypeString and getCharsetFromContent are methods provided in the htmlCleaner package.

Use getCharset to obtain the encoding, then pass it to HtmlCleaner when cleaning the page:

HtmlCleaner cleaner = new HtmlCleaner();
URL url = new URL("http://www.qq.com/");
TagNode node = cleaner.clean(url,getCharset(url));
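
The same idea can be tried without a network connection. The sketch below sniffs the short-form tag from an in-memory byte array and decodes the bytes with whatever it finds. The class name MetaCharsetDecode, the methods sniff and decode, and the exact pattern are all assumptions made for illustration; none of them are part of HtmlCleaner.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetDecode {
    // Assumed pattern for the short form, e.g. <meta charset="UTF-8" /> or <meta charset=gbk>.
    private static final Pattern SHORT_META = Pattern.compile(
            "<meta\\s+charset\\s*=\\s*[\"']?([a-z\\d][a-z\\d\\-]*)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named by a short-form meta tag, or null if absent or unsupported. */
    static String sniff(byte[] page) {
        int n = Math.min(page.length, 2048);
        // ISO-8859-1 maps every byte to one char, so ASCII markup is preserved.
        String head = new String(page, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = SHORT_META.matcher(head);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return m.group(1);
        }
        return null;
    }

    /** Decodes the page with the sniffed charset, or with the fallback if none was found. */
    static String decode(byte[] page, Charset fallback) {
        String name = sniff(page);
        Charset cs = (name != null) ? Charset.forName(name) : fallback;
        return new String(page, cs);
    }

    public static void main(String[] args) {
        // \u4e2d\u6587 is a two-character Chinese string used as sample page content.
        String html = "<html><head><meta charset=\"UTF-8\" /><title>\u4e2d\u6587</title></head></html>";
        byte[] page = html.getBytes(StandardCharsets.UTF_8);
        System.out.println(sniff(page));
        System.out.println(decode(page, StandardCharsets.ISO_8859_1).equals(html));
    }
}
```

Decoding the same bytes with the wrong charset (here the ISO-8859-1 fallback) would turn the title into garbage, which is the garbling described above.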
