# Solutions for garbled characters when reading HDFS file data in Java

Pitfalls I hit with garbled characters when reading HDFS files through the Java API

I wanted to write an interface that reads part of a file on HDFS for preview. After implementing it by following blog posts online, I found that the text it read was sometimes garbled. For example, when reading a CSV file whose strings are separated by commas:

The English string aaa displays normally, the Chinese string "Hello" displays normally, but a mixed Chinese-English string such as "aaaHello" comes out garbled.

Most of the blogs I consulted offered roughly the same fix: decode with character set xxx. Skeptical, I tried them one by one, and none of them helped.

## Solutions

HDFS stores a file as the raw bytes it was uploaded with and does not track the character set behind them, and the encoding of each local file is very likely to differ. When we upload a local file, we are really storing its byte stream, already encoded in some character set, on the file system. So when we GET file data, we face byte streams from different files in different encodings, and no single fixed character set can decode all of them correctly.
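
To make the mismatch concrete, here is a minimal sketch using only the JDK. The class and method names (`EncodingMismatchDemo`, `roundTrip`) are hypothetical, and UTF-8 and ISO-8859-1 stand in for whatever encodings the real files use. It shows why ASCII-only text survives a wrong decoder while Chinese text does not.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingMismatchDemo {

    // Encode text with one charset, then decode the stored bytes with another.
    static String roundTrip(String text, Charset writeAs, Charset readAs) {
        byte[] stored = text.getBytes(writeAs); // what ends up on the file system
        return new String(stored, readAs);      // what the reader sees
    }

    public static void main(String[] args) {
        Charset utf8 = StandardCharsets.UTF_8;
        Charset latin1 = StandardCharsets.ISO_8859_1;
        String mixed = "aaa\u4f60\u597d"; // "aaa" followed by two Chinese characters

        // ASCII-only text has the same bytes in both charsets: prints true
        System.out.println(roundTrip("aaa", utf8, latin1).equals("aaa"));
        // Same charset on both sides: prints true
        System.out.println(roundTrip(mixed, utf8, utf8).equals(mixed));
        // Wrong charset on the reading side: prints false
        System.out.println(roundTrip(mixed, utf8, latin1).equals(mixed));
    }
}
```

This is why a preview can look fine for English-only files and still break as soon as a file contains Chinese text in an encoding the reader did not expect.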

So there are really two options:

1. Fix the character set used for HDFS content. For example, if I choose UTF-8, every file is converted to UTF-8 at upload time before it is stored, so decoding with UTF-8 when fetching file data always works. However, the transcoding step still has many problems of its own (the source encoding has to be known first), which makes this hard to implement.
2. Decode dynamically. Choose the character set for decoding according to the encoding of the file. The file's original byte stream is left unchanged, and garbled characters are largely avoided.

I chose dynamic decoding, so the difficulty becomes how to determine which character set to use. The following content provided a solution.
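
For comparison, the first option (normalizing to UTF-8 at upload time) could be sketched as follows. This is only an illustration using the JDK: `Utf8Transcoder` and `transcode` are hypothetical names, and the sketch assumes the caller already knows the source charset, which is exactly the part that is hard in practice.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Utf8Transcoder {

    // Decode the input with sourceCharset and re-encode it as UTF-8.
    static void transcode(InputStream in, Charset sourceCharset, OutputStream out)
            throws IOException {
        Reader reader = new InputStreamReader(in, sourceCharset);
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        char[] buf = new char[8192];
        int n;
        while ((n = reader.read(buf)) != -1) {
            writer.write(buf, 0, n);
        }
        writer.flush(); // flush only; closing the streams is left to the caller
    }

    public static void main(String[] args) throws IOException {
        String text = "aaa\u4f60\u597d";
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16LE);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transcode(new ByteArrayInputStream(utf16), StandardCharsets.UTF_16LE, out);
        // The stored bytes now decode correctly as UTF-8.
        System.out.println(new String(out.toByteArray(), StandardCharsets.UTF_8).equals(text));
    }
}
```

If the wrong `sourceCharset` is passed, the garbling is baked into the stored file, which is one reason dynamic decoding is the less risky choice here.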

## Detecting the encoding of text (a byte stream) in Java

Requirement:

Detect the encoding of a given file or byte stream.

Implementation:

Based on jchardet. Maven dependency (groupId:artifactId:version): `net.sourceforge.jchardet:jchardet:1.0`

The code is as follows:

```java
import java.io.BufferedInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.commons.io.IOUtils;
import org.mozilla.intl.chardet.nsDetector;
import org.mozilla.intl.chardet.nsICharsetDetectionObserver;
import org.mozilla.intl.chardet.nsPSMDetector;

public class DetectorUtils {

    private DetectorUtils() {
    }

    // Records the charset that jchardet reports once it is confident.
    static class ChineseCharsetDetectionObserver implements nsICharsetDetectionObserver {

        private boolean found = false;
        private String result;

        public ChineseCharsetDetectionObserver(boolean found, String result) {
            this.found = found;
            this.result = result;
        }

        @Override
        public void Notify(String charset) {
            found = true;
            result = charset;
        }

        public boolean isFound() {
            return found;
        }

        public String getResult() {
            return result;
        }
    }

    public static String[] detectChineseCharset(InputStream in) throws Exception {
        BufferedInputStream imp = null;
        try {
            nsDetector det = new nsDetector(nsPSMDetector.CHINESE);
            ChineseCharsetDetectionObserver detectionObserver =
                    new ChineseCharsetDetectionObserver(false, StandardCharsets.UTF_8.name());
            det.Init(detectionObserver);
            imp = new BufferedInputStream(in);
            byte[] buf = new byte[1024];
            int len;
            boolean isAscii = true;
            while ((len = imp.read(buf, 0, buf.length)) != -1) {
                // Keep checking for pure ASCII until a non-ASCII byte shows up.
                if (isAscii) {
                    isAscii = det.isAscii(buf, len);
                }
                // Feed non-ASCII data to the detector; stop once it is sure.
                if (!isAscii && det.DoIt(buf, len, false)) {
                    break;
                }
            }
            det.DataEnd();

            if (isAscii) {
                return new String[] { "ASCII" };
            }
            if (detectionObserver.isFound()) {
                return new String[] { detectionObserver.getResult() };
            }
            // No confident match: fall back to the detector's list of candidates.
            return det.getProbableCharsets();
        } finally {
            IOUtils.closeQuietly(imp);
            IOUtils.closeQuietly(in);
        }
    }
}
```

Test:

```java
// Requires java.io.FileInputStream; run inside a method that may throw Exception.
String file = "C:/3737001.xml";
String[] probableSet = DetectorUtils.detectChineseCharset(new FileInputStream(file));
for (String charset : probableSet) {
    System.out.println(charset);
}
```
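
The detector returns charset names as plain strings, and not every name is guaranteed to be one the running JVM can decode. As far as I know, jchardet can also return a placeholder such as "nomatch" when it has no candidates. The sketch below is a hypothetical helper (`CharsetPicker.pick` is not part of jchardet) that picks the first usable candidate and otherwise falls back to a default.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

final class CharsetPicker {

    // Return the first candidate this JVM supports, or the fallback if none is usable.
    static Charset pick(String[] candidates, Charset fallback) {
        if (candidates != null) {
            for (String name : candidates) {
                if (name == null || name.isEmpty()) {
                    continue;
                }
                try {
                    if (Charset.isSupported(name)) {
                        return Charset.forName(name);
                    }
                } catch (IllegalCharsetNameException e) {
                    // Not a syntactically valid charset name; try the next one.
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // An unknown name falls through to the fallback.
        System.out.println(pick(new String[] { "nomatch" }, StandardCharsets.UTF_8));
        // An invalid name is skipped and the next candidate is used.
        System.out.println(pick(new String[] { "not a charset!", "UTF-16LE" }, StandardCharsets.UTF_8));
    }
}
```

The array returned by `detectChineseCharset` can be passed straight in as `candidates`, and the resulting `Charset` used with `new String(bytes, charset)`.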
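There is also a package for detecting the encoding of a byte stream that was published on Google Code: juniversalchardet, whose `UniversalDetector` class is what the code in this post uses. So the plan is clear: first read some of the file's bytes, detect the encoding with the tool, and then decode accordingly.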
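
Since a preview only needs the start of the file, the "read some bytes" step can be capped. Below is a minimal sketch with a hypothetical helper (`PrefixReader.readPrefix`) that works on any `InputStream`, which includes Hadoop's `FSDataInputStream`. The cap is an assumption to tune per use case, and a cut in the middle of a multi-byte character may garble the very last character of the preview.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

final class PrefixReader {

    // Read up to maxBytes from the stream; fewer if the stream ends first.
    static byte[] readPrefix(InputStream in, int maxBytes) throws IOException {
        byte[] buf = new byte[maxBytes];
        int total = 0;
        while (total < maxBytes) {
            int n = in.read(buf, total, maxBytes - total);
            if (n == -1) {
                break; // end of stream
            }
            total += n;
        }
        return Arrays.copyOf(buf, total);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100];
        // Longer stream than the cap: only the first 30 bytes are returned.
        System.out.println(readPrefix(new ByteArrayInputStream(data), 30).length);
        // Shorter stream than the cap: everything is returned.
        System.out.println(readPrefix(new ByteArrayInputStream(data), 1000).length);
    }
}
```

The loop matters because a single `read` call may return fewer bytes than requested even when more data is available.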

## Concrete solution code
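```xml
<!-- jchardet, as listed in the original post -->
<dependency>
    <groupId>net.sourceforge.jchardet</groupId>
    <artifactId>jchardet</artifactId>
    <version>1.0</version>
</dependency>
<!-- UniversalDetector, used in guessEncoding, ships in juniversalchardet -->
<dependency>
    <groupId>com.googlecode.juniversalchardet</groupId>
    <artifactId>juniversalchardet</artifactId>
    <version>1.0.3</version>
</dependency>
```

Logic for reading part of a file from HDFS for preview:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.commons.lang3.StringUtils; // assumed to be commons-lang3
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.mozilla.universalchardet.UniversalDetector;

// These methods belong to a service class that already provides
// `fs` (a Hadoop FileSystem), `logger`, and `getHdfsPath(...)`.

// Get part of a file's data for preview
public List<String> getFileDataWithLimitLines(String filePath, Integer limit) {
    FSDataInputStream fileStream = openFile(filePath);
    return readFileWithLimit(fileStream, limit);
}

// Open the file's data stream; returns null if the file cannot be opened
private FSDataInputStream openFile(String filePath) {
    FSDataInputStream fileStream = null;
    try {
        fileStream = fs.open(new Path(getHdfsPath(filePath)));
    } catch (IOException e) {
        logger.error("fail to open file:{}", filePath, e);
    }
    return fileStream;
}

// Read at most `limit` lines of file data
private List<String> readFileWithLimit(FSDataInputStream fileStream, Integer limit) {
    byte[] bytes = readByteStream(fileStream);
    String data = decodeByteStream(bytes);
    if (data == null) {
        return null;
    }

    // Split on both \r\n and \n line endings
    List<String> rows = Arrays.asList(data.split("\\r?\\n"));
    return rows.stream().filter(StringUtils::isNotEmpty)
            .limit(limit)
            .collect(Collectors.toList());
}

// Read the bytes from the file's data stream and close it afterwards.
// Note: this reads the whole file into memory.
private byte[] readByteStream(FSDataInputStream fileStream) {
    if (fileStream == null) {
        return null; // openFile failed and has already logged the error
    }
    byte[] bytes = new byte[1024 * 30];
    int len;
    ByteArrayOutputStream stream = new ByteArrayOutputStream();
    try (FSDataInputStream in = fileStream) {
        while ((len = in.read(bytes)) != -1) {
            stream.write(bytes, 0, len);
        }
    } catch (IOException e) {
        logger.error("read file bytes stream failed.", e);
        return null;
    }
    return stream.toByteArray();
}

// Decode the bytes with the detected encoding
private String decodeByteStream(byte[] bytes) {
    if (bytes == null) {
        return null;
    }

    String encoding = guessEncoding(bytes);
    String data = null;
    try {
        data = new String(bytes, encoding);
    } catch (Exception e) {
        logger.error("decode byte stream failed.", e);
    }
    return data;
}

// Guess the encoding with juniversalchardet's UniversalDetector
private String guessEncoding(byte[] bytes) {
    UniversalDetector detector = new UniversalDetector(null);
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    String encoding = detector.getDetectedCharset();
    detector.reset();

    // Fall back to UTF-8 when the detector cannot decide
    if (StringUtils.isEmpty(encoding)) {
        encoding = "UTF-8";
    }
    return encoding;
}
```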

Origin blog.csdn.net/dcj19980805/article/details/115173143