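Recently, telecom subscribers in some provinces reported that part of the data our Android client app fetched over a 3G SIM card was displayed as garbled text, while the same data was displayed correctly over Wi-Fi. Initial troubleshooting showed that the data is encoded differently before gzip compression: GBK in some provinces and UTF-8 in others. After decompression, any data that does not match the expected GBK encoding is decoded incorrectly and shows up as garbled text. The network stream therefore needs encoding detection, with the charset chosen according to the detection result.
A simple approach is to read the character encoding from the HttpEntity's Content-Type: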
public static String getContentCharSet(final HttpEntity entity)
        throws ParseException {
    if (entity == null) {
        throw new IllegalArgumentException("HTTP entity may not be null");
    }
    String charset = null;
    if (entity.getContentType() != null) {
        HeaderElement values[] = entity.getContentType().getElements();
        if (values.length > 0) {
            NameValuePair param = values[0].getParameterByName("charset");
            if (param != null) {
                charset = param.getValue();
            }
        }
    }
    return charset;
}
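Testing shows that this does work, but it is hard to rely on completely. For example, if the HTTP headers lack a Content-Type, how is the character encoding determined?
There are many blog posts online about detecting character encodings, but most of them target files and do not work for network streams. One open-source component worth recommending is cpdetector (494 KB in total), which can detect the encoding of both files and byte streams.
Below is the code for the EncodingDetector utility class. cpdetector is statistics-based: the more bytes it examines, the more accurate the result. For a file stream the byte count is known, so the number of bytes examined is the file length minus 1, capped at 2000.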
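One way to cope with a missing Content-Type is to fall back to a default charset. The following is a minimal sketch and not part of the original code: the class name `ContentTypeCharset` and the method `charsetOrDefault` are hypothetical, and it parses the raw header value with the standard library only, without HttpEntity.

```java
import java.nio.charset.Charset;
import java.util.Locale;

public class ContentTypeCharset {
    /**
     * Returns the charset named in a Content-Type header value, or the
     * fallback when the header is missing, carries no charset parameter,
     * or names a charset this JVM does not support.
     */
    public static String charsetOrDefault(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    if (!name.isEmpty() && Charset.isSupported(name)) {
                        return name;
                    }
                } catch (IllegalArgumentException e) {
                    // Illegal charset name: fall through to the fallback.
                }
                return fallback;
            }
        }
        return fallback;
    }
}
```

A header-based result is only as trustworthy as the server that sent it, which is why the content-based detection described next is still useful.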
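import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

import info.monitorenter.cpdetector.io.ASCIIDetector;
import info.monitorenter.cpdetector.io.CodepageDetectorProxy;
import info.monitorenter.cpdetector.io.JChardetFacade;
import info.monitorenter.cpdetector.io.ParsingDetector;
import info.monitorenter.cpdetector.io.UnicodeDetector;
import org.apache.http.protocol.HTTP;

public class EncodingDetector {
    private static final CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();

    static {
        detector.add(new ParsingDetector(false));
        detector.add(JChardetFacade.getInstance());
        detector.add(ASCIIDetector.getInstance());
        detector.add(UnicodeDetector.getInstance());
    }

    public static String getCharset(InputStream is, boolean useAvailable) {
        Charset charset = null;
        // The more bytes examined, the more accurate the result;
        // the count must not exceed the length of the stream.
        int detectCharNum = 2000;
        try {
            if (useAvailable) {
                int available = is.available();
                // Some input streams cannot report a byte count (a network stream,
                // for example, cannot know how much data has yet to arrive).
                if (available <= 1) {
                    return HTTP.UTF_8;
                }
                if (detectCharNum > available) {
                    detectCharNum = available - 1;
                }
            }
            BufferedInputStream bufferedInputStream = new BufferedInputStream(is);
            charset = detector.detectCodepage(bufferedInputStream, detectCharNum);
            bufferedInputStream.reset();
        } catch (Exception e) {
            // Detection failed: fall through and return whatever was detected, or null.
        }
        return null != charset ? charset.name() : null;
    }

    public static String getCharset(ByteArrayOutputStream bos) {
        String charset = null;
        try {
            ByteArrayInputStream is = new ByteArrayInputStream(bos.toByteArray());
            charset = getCharset(is, true); // the byte count of bos is known
            is.close();
        } catch (IOException e) {
            // Closing an in-memory stream should not fail; ignore.
        }
        return charset;
    }
}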
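As a dependency-free cross-check for the specific GBK versus UTF-8 case, strict UTF-8 decoding can be used instead of statistical detection. This is a different technique from cpdetector, offered only as a sketch: the class `Utf8Check` and its methods are hypothetical names, and the "GBK" fallback is an assumption taken from the scenario above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class Utf8Check {
    /** Returns true when the bytes decode as UTF-8 without any malformed sequence. */
    public static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Assumption: anything that is not valid UTF-8 is treated as GBK. */
    public static String guessCharset(byte[] data) {
        return isValidUtf8(data) ? "UTF-8" : "GBK";
    }
}
```

Pure ASCII data is valid in both encodings, so it is reported as UTF-8 here, which is harmless. A short GBK payload could in rare cases also happen to be valid UTF-8, so this check is most reliable on a fully buffered response rather than on a few leading bytes.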
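For a network stream, there is no way to know exactly how much data has yet to arrive, so the byte stream should be read and buffered first, and the encoding of the buffered bytes detected afterwards.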
HttpGet httpGet = null;
HttpResponse httpResponse;
InputStream is = null;
BufferedReader in = null;
ByteArrayOutputStream bos = null;
try {
    httpGet = new HttpGet(url);
    //httpGet.addHeader();
    httpResponse = httpClient.execute(httpGet);
    int statusCode = httpResponse.getStatusLine().getStatusCode();
    HttpEntity httpEntity = httpResponse.getEntity();
    String json = "";
    if (httpEntity != null) {
        is = httpEntity.getContent();
        Header val = httpEntity.getContentEncoding();
        if (val != null && val.getValue() != null && val.getValue().contains("gzip")) {
            is = new GZIPInputStream(is);
        } else {
            BufferedInputStream bis = new BufferedInputStream(is);
            bis.mark(2);
            // Read the first two bytes
            byte[] header = new byte[2];
            int result = bis.read(header);
            // Reset the input stream to its starting position
            bis.reset();
            // Check whether the payload is in GZIP format
            if (result != -1 && Utils.toInt(header, 0) == GZip_Value) {
                is = new GZIPInputStream(bis);
            } else {
                is = bis;
            }
        }
        if (encoding != null) {
            // Work around the garbled text seen in some provinces
            Boolean mustUseDefault = false;
            if (needDetectEncoding) {
                /*String chartsetFromHttpEntity = EntityUtils.getContentCharSet(httpEntity);
                if(!TextUtils.isEmpty(chartsetFromHttpEntity)) {
                    chartsetFromHttpEntity = chartsetFromHttpEntity.toUpperCase();
                    mustUseDefault = chartsetFromHttpEntity.contains("UTF");
                }*/
                bos = new ByteArrayOutputStream();
                byte[] buff = new byte[100]; // temporary buffer for the data read on each loop iteration
                int rc = 0;
                while ((rc = is.read(buff, 0, 100)) > 0) {
                    bos.write(buff, 0, rc);
                }
                String chartsetFromInputStream = EncodingDetector.getCharset(bos);
                if (!TextUtils.isEmpty(chartsetFromInputStream)) {
                    chartsetFromInputStream = chartsetFromInputStream.toUpperCase();
                    mustUseDefault = chartsetFromInputStream.contains("UTF");
                }
                //android.util.Log.e("httpGetWithZip", chartsetFromHttpEntity + chartsetFromInputStream);
                is.close();
                is = new ByteArrayInputStream(bos.toByteArray());
            }
            if (mustUseDefault) {
                in = new BufferedReader(new InputStreamReader(is));
            } else {
                in = new BufferedReader(new InputStreamReader(is, "GBK"));
            }
        } else {
            in = new BufferedReader(new InputStreamReader(is));
        }
        String line = "";
        while ((line = in.readLine()) != null) {
            json += line;
        }
    }
    if (statusCode != HttpStatus.SC_OK) {
    }
} catch (ClientProtocolException e) {
    if (httpGet != null) {
        httpGet.abort();
    }
} catch (IllegalArgumentException e) {
    if (httpGet != null) {
        httpGet.abort();
    }
} catch (OutOfMemoryError e) {
    if (httpGet != null) {
        httpGet.abort();
    }
} catch (IOException e) {
    rspInfo.setStatusCode(NetError);
    if (httpGet != null) {
        httpGet.abort();
    }
} finally {
    if (is != null) {
        try {
            is.close();
        } catch (IOException e) {
        }
    }
    if (in != null) {
        try {
            in.close();
        } catch (IOException e) {
        }
    }
    if (bos != null) {
        try {
            bos.close();
        } catch (IOException e) {
        }
    }
}
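The two stream-handling steps in the code above (sniffing the gzip magic bytes, then buffering the whole body before detection) can be isolated into small helpers. This is a sketch using only the standard library: `StreamBuffering`, `unwrapGzip`, and `readFully` are hypothetical names, and the magic bytes 0x1f 0x8b stand in for the `Utils.toInt` and `GZip_Value` helpers, whose definitions are not shown in this post.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class StreamBuffering {
    /** Wraps the stream in a GZIPInputStream when it starts with the gzip magic bytes. */
    public static InputStream unwrapGzip(InputStream raw) throws IOException {
        BufferedInputStream bis = new BufferedInputStream(raw);
        bis.mark(2);
        int b0 = bis.read();
        int b1 = bis.read();
        bis.reset(); // rewind so the consumer sees the stream from the start
        boolean gzip = (b0 == 0x1f && b1 == 0x8b);
        return gzip ? new GZIPInputStream(bis) : bis;
    }

    /** Reads the stream to the end and returns everything as a byte array. */
    public static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            bos.write(buf, 0, n);
        }
        return bos.toByteArray();
    }
}
```

With the full body in a byte array, the encoding can be detected once and the same bytes decoded with `new String(bytes, charsetName)`, which avoids reading the network stream twice.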
For detector.add(JChardetFacade.getInstance()); to work correctly, place cpdetector_1.0.10.jar in the \libs\ directory, together with antlr-2.7.4.jar, chardet-1.0.jar, and jargs-1.0.jar.