How to detect the Chinese character encoding from byte stream content, and so fix garbled text that the same app shows in some provinces

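Recently, China Telecom users in some provinces reported that part of the data our Android client app receives over a 3G SIM card is displayed as garbled text, while the same data displays correctly over Wi-Fi. Initial investigation showed that the data is encoded differently before gzip compression: GBK in some provinces and UTF-8 in others. After decompression, the bytes may therefore not match the GBK encoding the client expects, which produces the garbled text. The client needs to detect the encoding of the network stream and choose the decoder based on the result.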

One simple approach is to read the character encoding from the HttpEntity's Content-Type:

    public static String getContentCharSet(final HttpEntity entity)
        throws ParseException {

        if (entity == null) {
            throw new IllegalArgumentException("HTTP entity may not be null");
        }
        String charset = null;
        if (entity.getContentType() != null) { 
            HeaderElement values[] = entity.getContentType().getElements();
            if (values.length > 0) {
                NameValuePair param = values[0].getParameterByName("charset");
                if (param != null) {
                    charset = param.getValue();
                }
            }
        }
        return charset;
    }

Testing shows that this does work, but it is hard to trust completely. For example, if the HTTP header has no Content-Type, how do we determine the character encoding?
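When only two encodings are in play, as here (GBK and UTF-8), one dependency-free heuristic is to check whether the complete response body is strictly valid UTF-8 and fall back to GBK otherwise. This is not the approach the rest of this post takes, only a minimal sketch of the idea; the class and method names are made up for illustration, and it assumes the whole body has already been read (a buffer cut off in the middle of a multi-byte character would be reported as invalid).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original app or of cpdetector.
final class Utf8OrGbk {

    /** Returns true only if every byte sequence in data is well-formed UTF-8. */
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Assumption: the server only ever sends UTF-8 or GBK. */
    static String guessCharsetName(byte[] data) {
        return isValidUtf8(data) ? "UTF-8" : "GBK";
    }
}
```

Pure ASCII data is reported as UTF-8, which is harmless because both encodings decode ASCII bytes identically. GBK text that happens to form valid UTF-8 sequences is possible in principle, so this remains a heuristic rather than a proof.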

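Plenty of blog posts cover character encoding detection, but most of them deal with files, and those approaches do not work for network streams. Here I recommend the open-source component cpdetector (494 KB in total), which can detect the encoding of both files and byte streams.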

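Below is the code of the EncodingDetector utility class. cpdetector is statistics-based, so the more bytes it examines, the more accurate the result. For a file stream the byte count is known, and the number of bytes probed is the file length minus 1, capped at 2000.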

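import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

import info.monitorenter.cpdetector.io.ASCIIDetector;
import info.monitorenter.cpdetector.io.CodepageDetectorProxy;
import info.monitorenter.cpdetector.io.JChardetFacade;
import info.monitorenter.cpdetector.io.ParsingDetector;
import info.monitorenter.cpdetector.io.UnicodeDetector;

import org.apache.http.protocol.HTTP;

public class EncodingDetector {
    private static final CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();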

    static {    	
        detector.add(new ParsingDetector(false));
        detector.add(JChardetFacade.getInstance());
        detector.add(ASCIIDetector.getInstance());
        detector.add(UnicodeDetector.getInstance());
    }
    
    public static String getCharset(InputStream is, boolean useAvailable) {
        Charset charset = null;
        // The more bytes examined, the more accurate the result;
        // the count must not exceed the length of the text stream.
        int detectCharNum = 2000;
        try {
            if (useAvailable) {
                int available = is.available();
                // Some input streams cannot report their byte count (a network
                // stream cannot know how much data has yet to arrive).
                if (available <= 1) {
                    return HTTP.UTF_8;
                }
                if (detectCharNum > available) {
                    detectCharNum = available - 1;
                }
            }
            BufferedInputStream bufferedInputStream = new BufferedInputStream(is);
            charset = detector.detectCodepage(bufferedInputStream, detectCharNum);
            bufferedInputStream.reset();
        } catch (Exception e) {
            // Ignored: if detection failed, charset stays null and null is returned.
        }
        return null != charset ? charset.name() : null;
    }

	public static String getCharset(ByteArrayOutputStream bos) {
		String charset = null;
		try {
			ByteArrayInputStream is = new ByteArrayInputStream(bos.toByteArray());
			charset = getCharset(is, true); // the byte count of bos is known
			is.close();
		} catch (IOException e) {
		}
		return charset;
	}

}
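The probing rule described before the class (length minus 1, capped at 2000) can be isolated into a small function, which makes the edge cases easier to see. This is only a sketch of the rule as stated in the prose; the names `ProbeLength`, `probeLength` and `MAX_PROBE_BYTES` are made up for illustration and are not part of cpdetector.

```java
// Hypothetical helper illustrating the probe-length rule from the text.
final class ProbeLength {

    // Upper bound on the number of bytes handed to the detector.
    static final int MAX_PROBE_BYTES = 2000;

    /**
     * Returns how many bytes to probe for a source of known length:
     * length minus 1, capped at MAX_PROBE_BYTES, and 0 when there is
     * too little data to probe at all.
     */
    static int probeLength(int knownLength) {
        if (knownLength <= 1) {
            return 0;
        }
        return Math.min(MAX_PROBE_BYTES, knownLength - 1);
    }
}
```

For a 100-byte buffer this yields 99, and for a 5000-byte buffer it yields 2000. A result of 0 corresponds to the case where the class above gives up on detection and returns a fixed charset name.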

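For a network stream, there is no reliable way to know how much data has yet to arrive, so read and buffer the byte stream first, then detect the encoding of the buffered bytes.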
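The buffering step on its own might look like the following. It is a minimal sketch using only the JDK; `StreamBuffer` and `readAll` are names made up for illustration, and the 4096-byte chunk size is an arbitrary choice.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: drains a stream of unknown length into memory.
final class StreamBuffer {

    /** Reads until end of stream and returns everything that was read. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        // read() returns -1 at end of stream; any other value is the
        // number of bytes actually placed into chunk.
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }
}
```

Once the bytes are in memory, their length is known, so they can be handed to the detector and then decoded with `new String(bytes, charsetName)` without touching the network again.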

		HttpGet httpGet = null;
		HttpResponse httpResponse;
		InputStream is = null;
		BufferedReader in = null;
		ByteArrayOutputStream bos = null;
		try {
			httpGet = new HttpGet(url);
			//httpGet.addHeader();			
			httpResponse = httpClient.execute(httpGet);
			int statusCode = httpResponse.getStatusLine().getStatusCode();
			HttpEntity httpEntity = httpResponse.getEntity();
			String json = "";
			if(httpEntity != null){
				is= httpEntity.getContent();
				Header val = httpEntity.getContentEncoding();				
				if (val != null && val.getValue()!= null && val.getValue().contains("gzip")) {				
			        is= new GZIPInputStream(is);
				}
				else{
					BufferedInputStream bis = new BufferedInputStream(is);
					bis.mark(2);
					// Read the first two bytes
					byte[] header = new byte[2];
					int result = bis.read(header);
					// Reset the input stream to its starting position
					bis.reset();
					// Check whether the data is in GZIP format
					if(result!=-1 && Utils.toInt(header, 0)== GZip_Value) {
					    is= new GZIPInputStream(bis);
					} else {
					    is= bis;
					}
				}
				
				if(encoding != null) {
					// Work around the garbled text seen in some provinces
					Boolean mustUseDefault = false;
					if(needDetectEncoding) {
						/*String chartsetFromHttpEntity = EntityUtils.getContentCharSet(httpEntity);
						if(!TextUtils.isEmpty(chartsetFromHttpEntity)) {
							chartsetFromHttpEntity = chartsetFromHttpEntity.toUpperCase();
							mustUseDefault = chartsetFromHttpEntity.contains("UTF");
						}*/
						
						bos = new ByteArrayOutputStream();
						byte[] buff = new byte[100]; // temporary buffer for each read in the loop
						int rc = 0; 
						while ((rc = is.read(buff, 0, 100)) > 0) { 
							bos.write(buff, 0, rc); 
						}
						String chartsetFromInputStream = EncodingDetector.getCharset(bos);
						if(!TextUtils.isEmpty(chartsetFromInputStream)) {
							chartsetFromInputStream = chartsetFromInputStream.toUpperCase();
							mustUseDefault = chartsetFromInputStream.contains("UTF");
						}
						//android.util.Log.e("httpGetWithZip", chartsetFromHttpEntity + chartsetFromInputStream);
						is.close();
						is = new ByteArrayInputStream(bos.toByteArray());
					}
					if(mustUseDefault) {
						// Platform default charset, which is UTF-8 on Android
						in = new BufferedReader(new InputStreamReader(is));
					} else {
						in = new BufferedReader(new InputStreamReader(is, "GBK"));
					}
				} else{
					in = new BufferedReader(new InputStreamReader(is));
				}
			
				String line = "";
				while ((line = in.readLine()) != null) {
					json += line;
				}						
			}
			
			
			if(statusCode != HttpStatus.SC_OK){	
				
			}					
		} catch (ClientProtocolException e) {
			if(httpGet != null){
				httpGet.abort();
			}
		} 
		catch(IllegalArgumentException e){
			if(httpGet != null){
				httpGet.abort();
			}
		}
		catch(OutOfMemoryError e){
			if(httpGet != null){
				httpGet.abort();
			}
		}catch (IOException e) {
			rspInfo.setStatusCode(NetError);
			
			if(httpGet != null){
				httpGet.abort();
			}
		}
		finally{
			if(is != null){
				try {
					is.close();
				} catch (IOException e) {
				}
			}
			if(in != null){
				try {
					in.close();
				} catch (IOException e) {
				}
			}
			if(bos != null) {
				try {
					bos.close();
				} catch (IOException e) {
				}
			}
		}
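When the Content-Encoding header is missing, the request code above sniffs the first two bytes to decide whether the body is gzip-compressed. `Utils.toInt` and `GZip_Value` are not shown in this post, so the following is only a sketch of the same idea using the standard gzip magic number that the JDK exposes as `GZIPInputStream.GZIP_MAGIC`; the names `GzipSniffer` and `maybeGunzip` are made up for illustration. It reads the two bytes one at a time, because a single `read(byte[])` call is allowed to return fewer bytes than requested.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: wraps a stream in GZIPInputStream only if it
// starts with the gzip magic bytes 0x1f 0x8b.
final class GzipSniffer {

    static InputStream maybeGunzip(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);           // remember the start of the stream
        int b0 = in.read();   // each read() returns one byte, or -1 at end of stream
        int b1 = in.read();
        in.reset();           // rewind so no data is lost

        boolean gzip = b0 != -1 && b1 != -1
                && ((b1 << 8) | b0) == GZIPInputStream.GZIP_MAGIC;
        return gzip ? new GZIPInputStream(in) : in;
    }
}
```

The magic number is stored little-endian, which is why the second byte is shifted into the high position before the comparison.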

For detector.add(JChardetFacade.getInstance()); to work correctly, place cpdetector_1.0.10.jar in the \libs\ directory, together with antlr-2.7.4.jar, chardet-1.0.jar, and jargs-1.0.jar.
