Reading txt files
To operate on files in Java, you need to use IO streams.
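import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

// The class name is illustrative; the original snippet showed only main()
public class ReadTxt {
    public static void main(String[] args) {
        File f = new File("test.txt");
        // try-with-resources closes all three streams when the block exits
        try (
            // InputStream is a byte stream: read() returns one byte at a time
            InputStream in = new FileInputStream(f);
            // InputStreamReader is a character stream: read() returns one character at a time
            InputStreamReader reader = new InputStreamReader(in, "gbk");
            // BufferedReader is a buffered character stream that can read the file line by line
            BufferedReader bufReader = new BufferedReader(reader)
        ) {
            int i = 1;
            String line;
            // Each call to readLine() returns the next line, or null at end of file
            while ((line = bufReader.readLine()) != null) {
                System.out.println("Line " + i + ": " + line);
                ++i;
            }
        } catch (IOException e) {
            // Also covers FileNotFoundException and UnsupportedEncodingException
            e.printStackTrace();
        }
    }
}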
output:
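As a variation on the stream chain above, the same line-by-line read can be sketched with `java.nio.file.Files`, which builds the buffered character reader in one call. This is a minimal sketch, not the approach used in this post: the class name `ReadTextLines` and the helper `readLines` are hypothetical, and the charset is passed in by the caller.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ReadTextLines {
    // Hypothetical helper: reads every line of a text file using the given charset.
    static List<String> readLines(Path file, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        // try-with-resources closes the reader even if readLine() throws
        try (BufferedReader reader = Files.newBufferedReader(file, charset)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default file name, matching the example above
        Path file = Paths.get(args.length > 0 ? args[0] : "test.txt");
        if (!Files.exists(file)) {
            System.out.println("No such file: " + file);
            return;
        }
        List<String> lines = readLines(file, Charset.forName("GBK"));
        for (int i = 0; i < lines.size(); i++) {
            System.out.println("Line " + (i + 1) + ": " + lines.get(i));
        }
    }
}
```

One difference worth knowing: a reader obtained from `Files.newBufferedReader` throws an `IOException` on bytes that are invalid for the charset, whereas `InputStreamReader` created with a charset name silently substitutes a replacement character.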
Manipulate doc files
If the same approach is used to read a doc file, the output is garbled, because a .doc file is a binary format rather than plain text.
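One way to tell in advance that a file is a binary Word document rather than text is to look at its first bytes. Binary .doc files are stored in an OLE2 compound file container, which starts with the 8-byte signature `D0 CF 11 E0 A1 B1 1A E1`. The sketch below checks for that signature; the class name `DocSignatureCheck` and the method `looksLikeOle2` are hypothetical names chosen for illustration, and a match only means "OLE2 container", which also covers old .xls and .ppt files.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DocSignatureCheck {
    // First 8 bytes of an OLE2 compound file, the container used by binary .doc files
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};

    // Hypothetical helper: true if the file starts with the OLE2 signature.
    static boolean looksLikeOle2(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            for (int expected : OLE2_MAGIC) {
                // read() returns 0..255, or -1 if the file is shorter than the signature
                if (in.read() != expected) {
                    return false;
                }
            }
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed default file name, matching the examples in this post
        Path file = Paths.get(args.length > 0 ? args[0] : "test.doc");
        if (!Files.exists(file)) {
            System.out.println("No such file: " + file);
            return;
        }
        System.out.println(file + " is an OLE2 container: " + looksLikeOle2(file));
    }
}
```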
Apache POI can be used to work with doc files in Java. There are two main ways to read data from a doc file with POI: through WordExtractor and through HWPFDocument. First, add the poi-scratchpad jar as a Maven dependency (for docx files, add the poi-ooxml jar instead).
Using WordExtractor
WordExtractor offers less functionality than HWPFDocument. With WordExtractor we can read only the text content of the file and some document-level properties. Properties of the document content itself cannot be read, and WordExtractor cannot modify the doc file. Reading the file content with WordExtractor:
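import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.poi.hwpf.extractor.WordExtractor;

// The class name is illustrative; the original snippet showed only main()
public class ReadDocText {
    public static void main(String[] args) {
        File f = new File("test.doc");
        // try-with-resources closes the input stream when the block exits
        try (InputStream in = new FileInputStream(f)) {
            WordExtractor ex = new WordExtractor(in);
            // getText() returns the text of the whole document
            System.out.println(ex.getText());
        } catch (IOException e) {
            // Also covers FileNotFoundException
            e.printStackTrace();
        }
    }
}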
Using HWPFDocument
Because my main goal is to modify the content of the doc file, I use HWPFDocument. The following code removes all Chinese characters from the document:
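import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Range;

// The class name is illustrative; the original snippet showed only main()
public class RemoveChinese {
    public static void main(String[] args) {
        File f = new File("test.doc");
        try {
            HWPFDocument doc;
            // The document is loaded fully into memory, so the input can be closed right away
            try (InputStream in = new FileInputStream(f)) {
                doc = new HWPFDocument(in);
            }
            Range range = doc.getRange();
            /*
             * Chinese characters: [\\u4e00-\\u9fa5]
             * Chinese punctuation ； ， 、: \\uff1b|\\uff0c|\\u3001
             * Parenthesized content, including the parentheses: (\\(.*?\\))
             * The reluctant .*? stops at the first closing parenthesis, so text
             * between two parenthesized groups in the same paragraph is kept.
             */
            Pattern pattern = Pattern.compile("[\\u4e00-\\u9fa5]|\\uff1b|\\uff0c|\\u3001|(\\(.*?\\))");
            Matcher matcher = pattern.matcher(range.text());
            // Replace every match with "" to delete it
            while (matcher.find()) {
                range.replaceText(matcher.group(), "");
            }
            // Open the output only after editing succeeds, so a failure cannot truncate the file,
            // then write the modified content back to the document
            try (OutputStream os = new FileOutputStream(f)) {
                doc.write(os);
            }
        } catch (IOException e) {
            // Also covers FileNotFoundException
            e.printStackTrace();
        }
    }
}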
Before modification:
After modification:
I wrote this to help with memorizing vocabulary: the document contains 5,500 words across 308 pages of content. Deleting the Chinese text by hand would be too tedious, so I decided to do it in code.
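To see what the regular expression in the HWPFDocument example matches without touching a doc file, it can be tried on plain strings first. The sketch below is an illustration only: the class name `StripChineseDemo`, the helper `strip`, and the sample text are made up, and it uses `String`-level `replaceAll` rather than POI. It compares a greedy `\(.*\)` with a reluctant `\(.*?\)` for the parenthesized part. Chinese characters are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.util.regex.Pattern;

public class StripChineseDemo {
    // Greedy: \(.*\) runs from the first "(" to the LAST ")" on the line
    static final Pattern GREEDY =
            Pattern.compile("[\\u4e00-\\u9fa5]|\\uff1b|\\uff0c|\\u3001|(\\(.*\\))");
    // Reluctant: \(.*?\) stops at the FIRST ")" after each "("
    static final Pattern RELUCTANT =
            Pattern.compile("[\\u4e00-\\u9fa5]|\\uff1b|\\uff0c|\\u3001|(\\(.*?\\))");

    // Hypothetical helper: deletes every match of the pattern from the text.
    static String strip(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Made-up sample line; \u4e2d\u6587 is the word for "Chinese (language)"
        String sample = "apple (n.) \u4e2d\u6587 pear (n.) \u4e2d\u6587";
        System.out.println("greedy:    [" + strip(GREEDY, sample) + "]");
        System.out.println("reluctant: [" + strip(RELUCTANT, sample) + "]");
    }
}
```

On this sample the greedy form also deletes `pear`, because it sits between the first opening and the last closing parenthesis, while the reluctant form keeps it. Which behaviour is wanted depends on how the document is laid out, so it is worth testing against a few real lines before running the replacement on the whole file.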
Reference: http://www.jb51.net/article/101910.htm