Solutions and methods for garbled characters when reading HDFS file data in Java

## Reading HDFS files with the Java API returns garbled characters: a pitfall I fell into

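I wanted to write an interface that reads part of an HDFS file's data for preview. After implementing it by following blog posts found online, the data read back was sometimes garbled, for example when reading a comma-separated CSV file. The English string aaa displays correctly and the Chinese string "Hello" (Chinese characters in the original) displays correctly, but a mixed Chinese and English string such as "aaaHello" comes out garbled.

After going through many blog posts, the suggested fix was nearly always the same: decode with character set xxx. I tried them one by one, without much faith, and none of them worked.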

## Solution

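Files on the local machine may each be encoded differently, and HDFS accepts all of them: uploading a local file really just turns the file into a byte stream, which is sent to the file system and stored as it is. So when you GET file data, you are facing byte streams from different files in different character-set encodings, and decoding everything with one fixed character set cannot be right for all of them.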
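To make the mismatch concrete, here is a minimal sketch that uses only JDK classes. The class name `CharsetMismatchDemo` and the sample string are illustrative assumptions, not part of the original setup. It encodes one string as UTF-8 and then decodes the same bytes with two different character sets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetMismatchDemo {

    // Decode raw bytes with the given charset, as a reader with a fixed charset would.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "aaa" followed by two Chinese characters, written with Unicode escapes
        // so that the encoding of this source file does not matter.
        String mixed = "aaa\u4f60\u597d";
        byte[] stored = mixed.getBytes(StandardCharsets.UTF_8); // the bytes that get stored

        String right = decode(stored, StandardCharsets.UTF_8);
        String wrong = decode(stored, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(mixed));     // true: same charset on both sides
        System.out.println(wrong.equals(mixed));     // false: the non-ASCII part is garbled
        System.out.println(wrong.startsWith("aaa")); // true: the ASCII part survives
    }
}
```

Pure ASCII text decodes identically under most common character sets, which is one reason English-only content can look fine while other content is garbled.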

So there are really two solutions:

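1. Fix the character set used for encoding and decoding on HDFS. For example, if UTF-8 is chosen, the encoding is unified at upload time: the byte stream of every file is converted to UTF-8 before it is stored. Decoding with UTF-8 when fetching file data then works without problems. However, the transcoding step still raises many problems of its own, and this approach is hard to implement.
2. Dynamic decoding. Choose the character set to decode with according to the character set the file was encoded in. The file's original byte stream is left unchanged, and garbled text is essentially avoided.

Having chosen dynamic decoding, the remaining problem is how to determine which character set to decode with. The following content describes a solution.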
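The two options can be sketched side by side with JDK classes only. This is a hedged illustration: the class and method names are hypothetical, and option 2 is shown as strict trial decoding over a list of candidate character sets, which is a simpler stand-in for the detector library used later in this article.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class DecodingOptions {

    // Option 1: re-encode to UTF-8 at upload time.
    // This only works if the source charset is already known.
    static byte[] transcodeToUtf8(byte[] raw, Charset sourceCharset) {
        return new String(raw, sourceCharset).getBytes(StandardCharsets.UTF_8);
    }

    // Option 2: leave the bytes alone and pick the charset at read time.
    // Each candidate is tried strictly; the first one that decodes without
    // error wins. Returns null if no candidate fits.
    static String decodeWithFirstMatch(byte[] raw, List<Charset> candidates) {
        for (Charset candidate : candidates) {
            try {
                return candidate.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(raw))
                        .toString();
            } catch (CharacterCodingException e) {
                // The bytes are not valid in this charset; try the next one.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] latin1Bytes = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        // ISO-8859-1 accepts any byte sequence, so it goes last as a catch-all.
        List<Charset> candidates =
                Arrays.asList(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(decodeWithFirstMatch(latin1Bytes, candidates).equals("caf\u00e9")); // true

        byte[] utf8 = transcodeToUtf8(latin1Bytes, StandardCharsets.ISO_8859_1);
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals("caf\u00e9")); // true
    }
}
```

Note that option 1 still needs the source character set as an input, so the detection problem does not go away; it only moves from read time to upload time.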

## Detecting the encoding of text (a byte stream) in Java

Requirement:

Given a particular file or byte stream, detect its encoding.
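Before reaching for a statistical detector, a cheap first pass is to look for a byte order mark (BOM) at the start of the data. The sketch below is an assumption-labelled illustration with a hypothetical class name `BomSniffer`; it is not how jchardet works, and it only helps for files that actually carry a BOM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomSniffer {

    // Returns the charset announced by a byte order mark, or null if none is found.
    // Note: FF FE is also the start of the UTF-32LE BOM (FF FE 00 00); this sketch
    // does not distinguish that case.
    static Charset charsetFromBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    // True if data begins with the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

When `charsetFromBom` returns null, the data has no BOM and a content-based detector is still needed.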

Implementation:

Based on jchardet


    <dependency>
        <groupId>net.sourceforge.jchardet</groupId>
        <artifactId>jchardet</artifactId>
        <version>1.0</version>
    </dependency>

The code is as follows:

    import java.io.BufferedInputStream;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;

    import org.mozilla.intl.chardet.nsDetector;
    import org.mozilla.intl.chardet.nsICharsetDetectionObserver;
    import org.mozilla.intl.chardet.nsPSMDetector;

    public class DetectorUtils {

        private DetectorUtils() {
        }

        /** Records the charset that jchardet reports once it is certain. */
        static class ChineseCharsetDetectionObserver implements nsICharsetDetectionObserver {

            private boolean found;
            private String result;

            public ChineseCharsetDetectionObserver(boolean found, String result) {
                this.found = found;
                this.result = result;
            }

            // Called back by the detector when it has settled on a charset.
            @Override
            public void Notify(String charset) {
                found = true;
                result = charset;
            }

            public boolean isFound() {
                return found;
            }

            public String getResult() {
                return result;
            }
        }

        /**
         * Detects the charset of the stream, with the detector tuned for Chinese text.
         * Returns one name when the detector is certain (or the data is pure ASCII),
         * otherwise the list of probable charsets. The stream is closed on return.
         */
        public static String[] detectChineseCharset(InputStream in) throws Exception {
            nsDetector det = new nsDetector(nsPSMDetector.CHINESE);
            ChineseCharsetDetectionObserver detectionObserver =
                    new ChineseCharsetDetectionObserver(false, StandardCharsets.UTF_8.name());
            det.Init(detectionObserver);

            // Closing the buffered stream also closes the wrapped input stream.
            try (BufferedInputStream imp = new BufferedInputStream(in)) {
                byte[] buf = new byte[1024];
                int len;
                boolean isAscii = true;
                while ((len = imp.read(buf, 0, buf.length)) != -1) {
                    // Keep checking for pure ASCII until a non-ASCII byte shows up.
                    if (isAscii) {
                        isAscii = det.isAscii(buf, len);
                    }
                    // Feed non-ASCII data to the detector; stop once it is certain.
                    if (!isAscii && det.DoIt(buf, len, false)) {
                        break;
                    }
                }
                det.DataEnd();

                if (isAscii) {
                    return new String[] {"ASCII"};
                }
                if (detectionObserver.isFound()) {
                    return new String[] {detectionObserver.getResult()};
                }
                return det.getProbableCharsets();
            }
        }
    }

Test:

```java
// Print every charset that jchardet considers probable for the file
String file = "C:/3737001.xml";
String[] probableSet = DetectorUtils.detectChineseCharset(new FileInputStream(file));
for (String charset : probableSet) {
    System.out.println(charset);
}
```
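A detector hands back charset names as plain strings, and a name may be one that the running JVM does not support, or a placeholder rather than a real charset. Before decoding, it can help to map the candidates to an actual `Charset`. The following is a minimal sketch under that assumption; `CharsetResolver` and `firstSupported` are hypothetical names, not part of jchardet.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

final class CharsetResolver {

    // Returns the first candidate name that this JVM supports, or the fallback.
    static Charset firstSupported(String[] names, Charset fallback) {
        if (names != null) {
            for (String name : names) {
                try {
                    if (name != null && Charset.isSupported(name)) {
                        return Charset.forName(name);
                    }
                } catch (IllegalCharsetNameException e) {
                    // Syntactically invalid name; skip it and keep looking.
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        String[] detected = {"nomatch", "UTF-8"};
        Charset charset = firstSupported(detected, StandardCharsets.ISO_8859_1);
        System.out.println(charset.name()); // UTF-8
    }
}
```

The result can then be passed to `new String(bytes, charset)`, which avoids the checked `UnsupportedEncodingException` of the string-named overload.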
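Google provides a package for detecting the encoding of a byte stream (juniversalchardet, hosted on Google Code, whose `UniversalDetector` class appears in the code of this article). With that, the plan is clear: read some bytes of the file first, use the tool to detect the encoding, then decode with the matching character set.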
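The "read some bytes first" step can be sketched as a helper that takes at most a fixed number of bytes from any `InputStream` (which includes Hadoop's `FSDataInputStream`). The class name `SampleReader` and the method `readPrefix` are illustrative assumptions, not part of the original code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class SampleReader {

    // Reads up to maxBytes from the stream; returns fewer only if the stream ends first.
    // The caller remains responsible for closing the stream.
    static byte[] readPrefix(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int remaining = maxBytes;
        while (remaining > 0) {
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n == -1) {
                break; // end of stream
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[10000]);
        System.out.println(readPrefix(in, 5000).length); // 5000
    }
}
```

A fixed-size sample can end in the middle of a multi-byte character, so the last few characters of a decoded preview may need to be dropped.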

## The concrete solution code
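```xml
<!-- jchardet: used by the DetectorUtils class earlier in this article -->
<dependency>
    <groupId>net.sourceforge.jchardet</groupId>
    <artifactId>jchardet</artifactId>
    <version>1.0</version>
</dependency>
<!-- juniversalchardet: provides the UniversalDetector class used in the code that follows -->
<dependency>
    <groupId>com.googlecode.juniversalchardet</groupId>
    <artifactId>juniversalchardet</artifactId>
    <version>1.0.3</version>
</dependency>
```
Logic for reading part of a file from HDFS for preview:
```java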
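// Imports needed by these methods. The methods belong to a class (not shown) that
// holds an initialized org.apache.hadoop.fs.FileSystem field `fs`, an SLF4J-style
// `logger` and a getHdfsPath(String) helper. StringUtils is assumed to be the
// commons-lang3 class.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.mozilla.universalchardet.UniversalDetector;

// Read part of the file's data for preview
public List<String> getFileDataWithLimitLines(String filePath, Integer limit) {
    FSDataInputStream fileStream = openFile(filePath);
    return readFileWithLimit(fileStream, limit);
}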
 
// Open the file as a data stream; returns null if the file cannot be opened
private FSDataInputStream openFile(String filePath) {
    FSDataInputStream fileStream = null;
    try {
        fileStream = fs.open(new Path(getHdfsPath(filePath)));
    } catch (IOException e) {
        logger.error("fail to open file:{}", filePath, e);
    }
    return fileStream;
}
 
// Read at most `limit` lines of file data
private List<String> readFileWithLimit(FSDataInputStream fileStream, Integer limit) {
    byte[] bytes = readByteStream(fileStream);
    String data = decodeByteStream(bytes);
    if (data == null) {
        return null;
    }

    // Split on both Windows (\r\n) and Unix (\n) line endings
    List<String> rows = Arrays.asList(data.split("\\r?\\n"));
    return rows.stream().filter(StringUtils::isNotEmpty)
            .limit(limit)
            .collect(Collectors.toList());
}
 
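// Read the raw bytes from the file stream, then close it.
// Note: this loads the whole file into memory; for large files, consider
// capping the number of bytes read.
private byte[] readByteStream(FSDataInputStream fileStream) {
    if (fileStream == null) {
        return null; // the file could not be opened
    }
    byte[] bytes = new byte[1024 * 30];
    int len;
    ByteArrayOutputStream stream = new ByteArrayOutputStream();
    // try-with-resources ensures the HDFS stream is always closed
    try (FSDataInputStream in = fileStream) {
        while ((len = in.read(bytes)) != -1) {
            stream.write(bytes, 0, len);
        }
    } catch (IOException e) {
        logger.error("read file bytes stream failed.", e);
        return null;
    }
    return stream.toByteArray();
}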
 
// Decode the byte stream with the detected character set
private String decodeByteStream(byte[] bytes) {
    if (bytes == null) {
        return null;
    }

    String encoding = guessEncoding(bytes);
    String data = null;
    try {
        data = new String(bytes, encoding);
    } catch (Exception e) {
        logger.error("decode byte stream failed.", e);
    }
    return data;
}
 
// Guess the encoding with the Google-hosted tool (UniversalDetector from juniversalchardet)
private String guessEncoding(byte[] bytes) {
    UniversalDetector detector = new UniversalDetector(null);
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    String encoding = detector.getDetectedCharset();
    detector.reset();

    // Fall back to UTF-8 when the detector cannot decide
    if (StringUtils.isEmpty(encoding)) {
        encoding = "UTF-8";
    }
    return encoding;
}