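Copyright notice: this is an original article by the blogger; please credit the source when reposting! Sometimes it is not actually original and I just picked that option too quickly (feel free to repost or copy my articles, I really don't mind!) https://blog.csdn.net/qq_31384551/article/details/81627840
Here is the situation: when writing a crawler with jsoup, the encoding of the raw bytes of a fetched page is not known in advance. Some pages are UTF-8 and others are GBK, so the encoding has to be detected before the bytes are decoded.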
Tool used: juniversalchardet
Maven dependency:
<!-- https://mvnrepository.com/artifact/com.googlecode.juniversalchardet/juniversalchardet -->
<dependency>
    <groupId>com.googlecode.juniversalchardet</groupId>
    <artifactId>juniversalchardet</artifactId>
    <version>1.0.3</version>
</dependency>
Encoding detection code (from CSDN; I copied this code, original post: https://blog.csdn.net/ajaxhu/article/details/12446917 )
package com.spider.common.tools;

import org.mozilla.universalchardet.UniversalDetector;

/**
 * Purpose: detect the character encoding of a byte array
 * Author: Tiddler
 * Date: 2018/8/13 12:00
 * Class: GetByteEncode
 **/
public class GetByteEncode {
    // Returned when the detector cannot make a guess
    private static final String DEFAULT_ENCODING = "UTF-8";

    public static String getEncoding(byte[] bytes) {
        UniversalDetector detector = new UniversalDetector(null);
        detector.handleData(bytes, 0, bytes.length); // feed the whole array
        detector.dataEnd(); // signal end of input so the detector settles on a result
        String encoding = detector.getDetectedCharset(); // null if nothing was detected
        detector.reset();
        if (encoding == null) {
            encoding = DEFAULT_ENCODING;
        }
        return encoding;
    }
}
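A detector only returns a guess, so it can be useful to cross-check that guess before trusting it. The sketch below is not part of juniversalchardet and not from the original post; it uses only the JDK, and the class and method names (StrictDecodeCheck, decodesCleanly) are made up for illustration. It reports whether a byte array decodes in a given charset without any malformed or unmappable input.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper: strict decoding as a sanity check on a detected charset.
class StrictDecodeCheck {

    /** Returns true if every byte sequence in `bytes` is valid in `charset`. */
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                    // REPORT makes the decoder throw instead of silently
                    // substituting U+FFFD for bad input.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

For example, StrictDecodeCheck.decodesCleanly(bytes, StandardCharsets.UTF_8) can be called when the detector says "UTF-8". A clean decode does not prove the guess is right (many byte sequences are valid in several charsets), but a failed one is a strong hint that it is wrong.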
Usage (calling GetByteEncode.getEncoding on a response body):
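String responseContext = null;
byte[] bytes = response.body().bytes();
String encoding = GetByteEncode.getEncoding(bytes); // detect the encoding
System.out.println("Detected encoding: " + encoding);
if (encoding.contains("GB")) { // there are several GB-family encodings, so a substring match is enough here
    responseContext = new String(bytes, "GBK");
} else {
    // "UTF-8" (also the default when detection fails); any other result falls back
    // to UTF-8 as well, so responseContext is never left null
    responseContext = new String(bytes, "UTF-8");
}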
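The branching above can also be wrapped in a small helper so that every detected name leads to some decoded string. This is only a sketch using JDK classes; the class and method names (DetectedCharsetDecoder, pickCharset, decode) are hypothetical, and mapping every GB-family name to GBK simply mirrors the choice made in the snippet above.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper: turn a detected charset name into a decoded String.
class DetectedCharsetDecoder {

    /** Chooses the charset to decode with; falls back to UTF-8 when unsure. */
    static Charset pickCharset(String detectedName) {
        if (detectedName == null) {
            return StandardCharsets.UTF_8;
        }
        // Treat all GB-family names the same way the usage snippet does.
        if (detectedName.toUpperCase(Locale.ROOT).contains("GB")) {
            return Charset.forName("GBK");
        }
        try {
            // Use the detected name directly when this JVM knows it.
            if (Charset.isSupported(detectedName)) {
                return Charset.forName(detectedName);
            }
        } catch (IllegalCharsetNameException e) {
            // The name is not even syntactically legal: fall through.
        }
        return StandardCharsets.UTF_8;
    }

    static String decode(byte[] bytes, String detectedName) {
        return new String(bytes, pickCharset(detectedName));
    }
}
```

With this in place the usage shrinks to a single call, DetectedCharsetDecoder.decode(bytes, GetByteEncode.getEncoding(bytes)), and passing a Charset instead of a name avoids the checked UnsupportedEncodingException.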
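If all you have is an already garbled string and you need to work out its encoding, the approach is:
First convert the garbled string back into a byte array, then detect the encoding with the method above. This only works if the bytes are recovered with the same charset that was wrongly used to decode them (for example getBytes("ISO-8859-1") for a string that was decoded as ISO-8859-1), and only if that wrong decode did not throw away any bytes.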
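The caveat about recovering bytes from a garbled string can be shown with the JDK alone. The following is a minimal sketch, with hypothetical names (MojibakeRoundTrip, recoverBytes, survivesRoundTrip), that re-encodes a garbled string with the wrongly used charset and checks whether the original bytes come back.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical helper: recover the raw bytes behind a garbled string.
class MojibakeRoundTrip {

    /** Re-encodes a garbled string with the charset that was wrongly used to decode it. */
    static byte[] recoverBytes(String garbled, Charset wronglyUsed) {
        return garbled.getBytes(wronglyUsed);
    }

    /** True if decoding `original` with `wronglyUsed` and re-encoding loses nothing. */
    static boolean survivesRoundTrip(byte[] original, Charset wronglyUsed) {
        String garbled = new String(original, wronglyUsed);
        return Arrays.equals(original, recoverBytes(garbled, wronglyUsed));
    }
}
```

ISO-8859-1 maps every byte value to exactly one character, so a string garbled by an ISO-8859-1 decode can always be turned back into its original bytes and then passed to the detector. A wrong decode as UTF-8 is different: invalid sequences are replaced with U+FFFD, the original bytes are gone, and detection on the re-encoded bytes is unlikely to give a useful answer.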