[Java] Manipulating doc files

Read txt files

To operate on files in Java, you need to use IO streams.

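    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;

    public class ReadTxt {
        public static void main(String[] args) {
            File f = new File("test.txt");
            // try-with-resources closes all three streams, even on error
            try (
                // InputStream is a byte stream: read() returns one byte at a time
                InputStream in = new FileInputStream(f);
                // InputStreamReader is a character stream: read() returns one character at a time
                InputStreamReader reader = new InputStreamReader(in, "gbk");
                // BufferedReader is a buffered character stream that can read a file line by line
                BufferedReader bufReader = new BufferedReader(reader)
            ) {
                int i = 1;
                String line;
                // Each call to readLine() reads one line; null means end of file
                while ((line = bufReader.readLine()) != null) {
                    System.out.println("Line " + i + ": " + line);
                    ++i;
                }
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }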

Output: each line of test.txt is printed, prefixed with its line number.
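
The charset passed to InputStreamReader decides how bytes become characters, so it has to match the encoding the file was saved in. Here is a minimal sketch of that point. The class name CharsetReadSketch and the helper readNumbered are made up for illustration, and the sample uses UTF-8 and ISO-8859-1 (instead of gbk) only because every JVM is guaranteed to ship them:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original post
class CharsetReadSketch {

    // Reads a text file line by line, decoding its bytes with the given charset,
    // and returns each line prefixed with its line number.
    static List<String> readNumbered(Path file, Charset charset) throws IOException {
        List<String> out = new ArrayList<>();
        try (InputStream in = Files.newInputStream(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            int i = 1;
            while ((line = reader.readLine()) != null) {
                out.add("Line " + i + ": " + line);
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sample", ".txt");
        try {
            // Two lines, the first one Chinese (written as escapes), saved as UTF-8
            Files.write(tmp, "\u4f60\u597d\nhello\n".getBytes(StandardCharsets.UTF_8));
            // Matching charset: the Chinese text survives
            System.out.println(readNumbered(tmp, StandardCharsets.UTF_8));
            // Wrong charset: the same bytes decode to garbled characters
            System.out.println(readNumbered(tmp, StandardCharsets.ISO_8859_1));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The second call reads the same bytes with the wrong charset, and the Chinese line comes out garbled while the ASCII line is unaffected.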

Manipulate doc files

If the same approach is used to read a doc file, the output is garbled, because a doc file is a binary format rather than plain text.
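
One way to see why text-based reading fails: a binary .doc is an OLE2 compound file, and such files start with the 8-byte signature D0 CF 11 E0 A1 B1 1A E1 instead of readable text. The following is a rough sketch that checks for that signature. The class and method names (Ole2HeaderSketch, looksLikeOle2) are invented for this example, and the check only says "this is an OLE2 container", not "this is definitely a Word document":

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper class, not part of the original post
class Ole2HeaderSketch {

    // Signature found at the start of every OLE2 compound file
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True if the given bytes begin with the OLE2 signature
    static boolean looksLikeOle2(byte[] head) {
        return head.length >= OLE2_MAGIC.length
            && Arrays.equals(Arrays.copyOf(head, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    // Reads the first 8 bytes of a file and checks them against the signature
    static boolean looksLikeOle2(Path file) throws IOException {
        byte[] head = new byte[OLE2_MAGIC.length];
        int off = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (off < head.length) {
                int n = in.read(head, off, head.length - off);
                if (n < 0) {
                    break; // file is shorter than the signature
                }
                off += n;
            }
        }
        return off == head.length && looksLikeOle2(head);
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args.length > 0 ? args[0] : "test.doc");
        if (!Files.exists(file)) {
            System.out.println(file + " not found");
            return;
        }
        System.out.println(file + " looks like OLE2: " + looksLikeOle2(file));
    }
}
```

A file that passes this check needs a parser that understands the container, which is where POI comes in.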

Apache POI can be used to work with doc files from Java. POI offers two main ways to read data from a doc file: WordExtractor and HWPFDocument. First, add the poi-scratchpad jar as a Maven dependency (for docx files, add the poi-ooxml jar instead).

Using WordExtractor

WordExtractor offers less than HWPFDocument. With WordExtractor we can only read the text content of the file and some document-level properties. It cannot read the properties of the document content, and it cannot modify the doc file. Reading file contents with WordExtractor:

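    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.poi.hwpf.extractor.WordExtractor;

    public class ReadDoc {
        public static void main(String[] args) {
            // try-with-resources closes the stream and the extractor
            try (InputStream in = new FileInputStream("test.doc");
                 WordExtractor ex = new WordExtractor(in)) {
                // getText() returns the whole document as plain text
                System.out.println(ex.getText());
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }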


Using HWPFDocument

Because my main goal is to modify the content of the doc file, I use HWPFDocument. The following removes all Chinese characters from the document:

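    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.usermodel.Range;

    public class RemoveChinese {
        public static void main(String[] args) {
            File f = new File("test.doc");
            try (InputStream in = new FileInputStream(f);
                 HWPFDocument ex = new HWPFDocument(in)) {
                Range range = ex.getRange();
                /*
                 * Chinese characters: [\\u4e00-\\u9fa5]
                 * Chinese punctuation ； ， 、: \\uff1b|\\uff0c|\\u3001
                 * Parenthesized content, including the parentheses: \\([^)]*\\)
                 * ([^)]* rather than a greedy .* so one match cannot span two groups)
                 */
                Pattern pattern = Pattern.compile("[\\u4e00-\\u9fa5]|\\uff1b|\\uff0c|\\u3001|\\([^)]*\\)");
                Matcher matcher = pattern.matcher(range.text());
                // Replace every match with "", i.e. delete it
                while (matcher.find()) {
                    range.replaceText(matcher.group(), "");
                }
                // Open the output only now: opening it earlier would truncate the file
                // even if the steps above failed. The document is already in memory.
                try (OutputStream os = new FileOutputStream(f)) {
                    // Write the modified content back to the document
                    ex.write(os);
                }
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }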

[Screenshot: document before modification]

[Screenshot: document after modification]
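
The matching step can be tried on a plain string first, without POI or a doc file. This is only a sketch: the class name StripChineseSketch, the method strip, and the sample string are invented for illustration. It uses `\\([^)]*\\)` for the parenthesized part so that a single match cannot stretch across two separate groups in the same paragraph, which is an assumption about the intended behaviour:

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original post
class StripChineseSketch {

    // Chinese characters in U+4E00..U+9FA5, the full-width marks U+FF1B, U+FF0C, U+3001,
    // and anything inside ASCII parentheses (parentheses included)
    static final Pattern TO_REMOVE =
        Pattern.compile("[\\u4e00-\\u9fa5]|\\uff1b|\\uff0c|\\u3001|\\([^)]*\\)");

    // Returns the text with every match deleted
    static String strip(String text) {
        return TO_REMOVE.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // An English word, a part-of-speech tag, and a Chinese gloss (written as escapes)
        String sample = "abandon (v.) \u653e\u5f03\uff0c\u9057\u5f03";
        System.out.println(strip(sample).trim()); // prints "abandon"
    }
}
```

On a plain String, replaceAll does the whole job in one call. The doc version loops over matches because the text has to be changed through Range.replaceText for the edit to land in the document itself.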

I wrote this to help with memorizing vocabulary: the document contains 5,500 words across 308 pages, and deleting the Chinese by hand would be far too tedious, so I did it in code.

Reference: http://www.jb51.net/article/101910.htm