Reading the text content of WORD, EXCEL, PDF, TXT, RTF, and HTML files in Java using POI

While building a system for a customer recently, we were asked to support importing Word files. Microsoft Word exists in several versions, including 97, 2003, and 2007, and the binary .doc format used up to Word 2003 is quite different from the .docx format introduced with Word 2007. Word 97 has largely disappeared from the market and almost nobody uses it anymore, so our system only considers the 2003 and 2007 versions. We only need to read the text content of a Word file; text styles, pictures, and other information can be ignored, and we never need to manipulate the Word file itself. For these reasons we chose Apache POI to do the reading.

 

    Reading a Word 2003 file (.doc) is relatively simple and needs only two jar packages: poi-3.5-beta6-20090622.jar and poi-scratchpad-3.5-beta6-20090622.jar. The 2007 version (.docx) is more troublesome. The trouble is not in writing the code but in the number of jar packages that must be imported, seven in total:
 1. openxml4j-bin-beta.jar
 2. poi-3.5-beta6-20090622.jar
 3. poi-ooxml-3.5-beta6-20090622.jar
 4. dom4j-1.6.1.jar
 5. geronimo-stax-api_1.0_spec-1.0.jar
 6. ooxml-schemas-1.0.jar
 7. xmlbeans-2.3.0.jar
Items 4-7 are the jar packages that poi-ooxml-3.5-beta6-20090622.jar depends on. They are in the ooxml-lib directory of poi-bin-3.5-beta6-20090622.tar.gz.
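
Because the two formats need different jar packages and different reader classes, it can help to check which format a file really is before choosing a reader. The following is a minimal sketch using only the JDK. The class name WordFormatSniffer and its methods are hypothetical and not part of POI. It relies on two well-known signatures: binary .doc files are OLE2 compound documents, and .docx files are ZIP archives. Note that the same signatures also match other files of the same container type (for example .xls or any ordinary .zip), so this only narrows the choice.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper: guesses the container format from the first bytes of a file. */
final class WordFormatSniffer {
    enum Format { DOC_OLE2, DOCX_ZIP, UNKNOWN }

    // Signature of an OLE2 compound document (binary .doc, .xls, ...).
    private static final int[] OLE2 = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // Local file header signature of a ZIP archive (.docx, .xlsx, ...).
    private static final int[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    /** Classifies the leading bytes of a file. */
    static Format sniff(byte[] head) {
        if (startsWith(head, OLE2)) return Format.DOC_OLE2;
        if (startsWith(head, ZIP)) return Format.DOCX_ZIP;
        return Format.UNKNOWN;
    }

    /** Reads up to 8 bytes from the file and classifies them. */
    static Format sniff(Path file) throws IOException {
        byte[] head = new byte[8];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while (total < head.length
                    && (n = in.read(head, total, head.length - total)) > 0) {
                total += n;
            }
        }
        return sniff(Arrays.copyOf(head, total));
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            System.out.println(sniff(Paths.get(args[0])));
        }
    }
}
```

A caller could route DOC_OLE2 files to the .doc reader and DOCX_ZIP files to a .docx reader, and reject everything else.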

 

    Before writing any code, download the required jar packages. Only poi-bin-3.5-beta6-20090622.tar.gz and openxml4j-bin-beta.jar need to be downloaded, because all the other required jar packages can be found inside poi-bin-3.5-beta6-20090622.tar.gz. The download addresses are:
poi-bin-3.5-beta6-20090622.tar.gz: http://apache.etoak.com/poi/dev/bin/poi-bin-3.5-beta6-20090622.tar.gz
openxml4j-bin-beta.jar: http://mirror.optus.net/sourceforge/o/op/openxml4j/openxml4j-bin-beta.jar
 
    Two things are worth noting about the Java code for reading Word files. First, POI does not read the picture information in a Word file. Second, for the 2007 version (.docx), if the file contains a table, all the data in the table ends up at the end of the extracted string.
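
To make the .docx case more concrete, here is a rough sketch that does not use POI at all. A .docx file is a ZIP archive whose main text lives in the entry word/document.xml, with text runs stored in w:t elements and paragraphs in w:p elements. The class DocxPlainText below is hypothetical and is not the approach used in our system; it only uses JDK classes (java.util.zip and StAX) and walks the XML in document order. It ignores headers, footers, footnotes, tabs, and line breaks inside a paragraph.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical JDK-only sketch: pulls plain text out of word/document.xml. */
final class DocxPlainText {
    // Namespace of the WordprocessingML elements (the usual "w:" prefix).
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Opens the .docx as a ZIP archive and extracts the main document text. */
    static String extract(String docxPath) throws IOException, XMLStreamException {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found in " + docxPath);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return extract(in);
            }
        }
    }

    /** Collects the text of every w:t element and ends each w:p with a newline. */
    static String extract(InputStream documentXml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_COALESCING, true);
        XMLStreamReader xml = factory.createXMLStreamReader(documentXml);
        StringBuilder text = new StringBuilder();
        boolean inText = false;
        try {
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if (W_NS.equals(xml.getNamespaceURI()) && "t".equals(xml.getLocalName())) {
                        inText = true;
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (W_NS.equals(xml.getNamespaceURI())) {
                        if ("t".equals(xml.getLocalName())) {
                            inText = false;
                        } else if ("p".equals(xml.getLocalName())) {
                            text.append('\n');
                        }
                    }
                } else if (event == XMLStreamConstants.CHARACTERS && inText) {
                    text.append(xml.getText());
                }
            }
        } finally {
            xml.close();
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 1) {
            System.out.print(extract(args[0]));
        }
    }
}
```

Since table cells are ordinary paragraphs nested inside the body, this walk emits their text at the position where the table appears.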

 


The following is Java code for reading the text content of several file types. The Office documents (WORD, EXCEL) are read with POI and PDF is read with PDFBox; TXT, RTF, and HTML use only classes from the JDK.

WORD Java code

    package textReader;

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.poi.hwpf.extractor.WordExtractor;

    public class WordReader {
        /**
         * @param filePath path of the .doc file
         * @return the text content of the Word file, or null if it cannot be read
         */
        public String getTextFromWord(String filePath) {
            String result = null;
            // try-with-resources closes the stream even if extraction fails
            try (InputStream fis = new FileInputStream(filePath)) {
                // WordExtractor reads the binary .doc format
                WordExtractor wordExtractor = new WordExtractor(fis);
                result = wordExtractor.getText();
            } catch (IOException e) {
                // also covers FileNotFoundException, a subclass of IOException
                e.printStackTrace();
            }
            return result;
        }
    }

EXCEL Java code

    package textReader;

    import org.apache.poi.hssf.usermodel.HSSFWorkbook;
    import org.apache.poi.hssf.usermodel.HSSFSheet;
    import org.apache.poi.hssf.usermodel.HSSFRow;
    import org.apache.poi.hssf.usermodel.HSSFCell;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    public class ExcelReader {
        /**
         * @param filePath path of the .xls file
         * @return the text content of the Excel file: cells separated by tabs, one line per row
         */
        @SuppressWarnings("deprecation")
        public String getTextFromExcel(String filePath) {
            StringBuilder buff = new StringBuilder();
            try (InputStream in = new FileInputStream(filePath)) {
                // Open the workbook
                HSSFWorkbook wb = new HSSFWorkbook(in);
                for (int numSheets = 0; numSheets < wb.getNumberOfSheets(); numSheets++) {
                    HSSFSheet aSheet = wb.getSheetAt(numSheets); // get a sheet
                    if (aSheet == null) continue;
                    // getLastRowNum() is the index of the last row, hence <=
                    for (int rowNumOfSheet = 0; rowNumOfSheet <= aSheet.getLastRowNum(); rowNumOfSheet++) {
                        HSSFRow aRow = aSheet.getRow(rowNumOfSheet); // get a row
                        if (aRow == null) continue; // empty rows are not stored
                        // getLastCellNum() is the last cell index PLUS ONE, hence <
                        for (int cellNumOfRow = 0; cellNumOfRow < aRow.getLastCellNum(); cellNumOfRow++) {
                            HSSFCell aCell = aRow.getCell(cellNumOfRow); // get a cell
                            if (aCell == null) continue;
                            switch (aCell.getCellType()) {
                                case HSSFCell.CELL_TYPE_NUMERIC:
                                    buff.append(aCell.getNumericCellValue()).append('\t');
                                    break;
                                case HSSFCell.CELL_TYPE_STRING:
                                    buff.append(aCell.getStringCellValue()).append('\t');
                                    break;
                                default:
                                    // formula cells and all other cell types are skipped
                                    break;
                            }
                        }
                        buff.append('\n');
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
            return buff.toString();
        }
    }
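
One detail of the Excel reader: getNumericCellValue() returns a double, so appending it directly writes a whole number such as 42 as "42.0". If that matters for the imported text, the value can be formatted first. The helper below is a hypothetical JDK-only sketch, not part of POI, and it does not treat dates specially (Excel also stores dates as numeric values).

```java
import java.math.BigDecimal;

/** Hypothetical helper: renders a numeric cell value without a trailing ".0". */
final class NumericCellText {
    static String format(double value) {
        // BigDecimal cannot represent NaN or infinity, so fall back to Double.toString
        if (Double.isNaN(value) || Double.isInfinite(value)) {
            return Double.toString(value);
        }
        // stripTrailingZeros() drops the ".0"; toPlainString() avoids exponent notation
        return BigDecimal.valueOf(value).stripTrailingZeros().toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(format(42.0)); // prints "42"
        System.out.println(format(1.5));  // prints "1.5"
    }
}
```

In the reader above, this would replace buff.append(aCell.getNumericCellValue()) with buff.append(NumericCellText.format(aCell.getNumericCellValue())).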

PDF Java code

    package textReader;

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    // org.pdfbox is the package name used by the older, pre-Apache PDFBox releases
    import org.pdfbox.pdfparser.PDFParser;
    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;

    public class PdfReader {
        /**
         * @param filePath path of the .pdf file
         * @return the text content of the PDF file, or null if it cannot be read
         */
        public String getTextFromPdf(String filePath) {
            String result = null;
            PDDocument document = null;
            // try-with-resources closes the input stream in every case
            try (InputStream is = new FileInputStream(filePath)) {
                PDFParser parser = new PDFParser(is);
                parser.parse();
                document = parser.getPDDocument();
                PDFTextStripper stripper = new PDFTextStripper();
                result = stripper.getText(document);
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                // the parsed document holds its own resources and must be closed too
                if (document != null) {
                    try {
                        document.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
            return result;
        }
    }

TXT Java code

    package textReader;

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class TxtReader {
        /**
         * @param filePath path of the .txt file
         * @return the text content of the file, with every line terminated by "\r\n"
         */
        public String getTextFromTxt(String filePath) throws IOException {
            StringBuilder buff = new StringBuilder();
            // FileReader decodes the file with the platform default charset
            try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
                String temp;
                while ((temp = br.readLine()) != null) {
                    buff.append(temp).append("\r\n");
                }
            }
            return buff.toString();
        }
    }
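
The TXT reader relies on FileReader, which decodes with the platform default charset, so the same file can read differently on machines with different defaults. A possible variant passes the charset explicitly. This is only a sketch: the class name CharsetAwareTxtReader is hypothetical, and which charset the imported files actually use is an assumption the caller has to supply. Files.newBufferedReader throws an exception on bytes that are invalid in the given charset instead of silently replacing them.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical variant of the TXT reader that takes the charset as a parameter. */
final class CharsetAwareTxtReader {
    static String read(Path file, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = Files.newBufferedReader(file, charset)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // same convention as the reader above: every line ends with "\r\n"
                text.append(line).append("\r\n");
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Write a small UTF-8 file and read it back with the matching charset
        Path tmp = Files.createTempFile("sample", ".txt");
        try {
            Files.write(tmp, "first\n\u4e2d\u6587".getBytes(StandardCharsets.UTF_8));
            System.out.print(read(tmp, StandardCharsets.UTF_8));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```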


RTF Java code

    package textReader;

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import javax.swing.text.BadLocationException;
    import javax.swing.text.DefaultStyledDocument;
    import javax.swing.text.rtf.RTFEditorKit;

    public class RtfReader {
        /**
         * @param filePath path of the .rtf file
         * @return the text content of the RTF file, or null if it cannot be read
         */
        public String getTextFromRtf(String filePath) {
            String result = null;
            try (InputStream is = new FileInputStream(filePath)) {
                DefaultStyledDocument styledDoc = new DefaultStyledDocument();
                new RTFEditorKit().read(is, styledDoc, 0);
                // Extract the text. To read Chinese, the text has to be converted back to
                // ISO8859_1 bytes and decoded again with the platform default charset,
                // otherwise the characters come out garbled.
                result = new String(styledDoc.getText(0, styledDoc.getLength()).getBytes("ISO8859_1"));
            } catch (IOException e) {
                e.printStackTrace();
            } catch (BadLocationException e) {
                e.printStackTrace();
            }
            return result;
        }
    }
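
The ISO8859_1 step in the RTF reader works because ISO-8859-1 maps every byte value to exactly one character, so text that was decoded with it by mistake can be turned back into the original bytes without loss. The sketch below demonstrates only that round trip with JDK classes; the class name Latin1RoundTrip is hypothetical. It uses UTF-8 as the real encoding because every JVM supports it; the RTF reader uses the platform default charset instead, which is assumed there to match the encoding of the Chinese text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo of repairing text that was wrongly decoded as ISO-8859-1. */
final class Latin1RoundTrip {
    /** Simulates the faulty step: each byte becomes one char in U+0000..U+00FF. */
    static String misdecode(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    /** Recovers the original bytes and decodes them with the real charset. */
    static String repair(String misdecoded, Charset actual) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1), actual);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters
        byte[] raw = original.getBytes(StandardCharsets.UTF_8);

        String garbled = misdecode(raw);
        String repaired = repair(garbled, StandardCharsets.UTF_8);

        System.out.println(garbled.equals(original));  // prints "false"
        System.out.println(repaired.equals(original)); // prints "true"
    }
}
```

The repair only works when the charset passed to repair is the one the bytes were really written in; with any other charset the result stays garbled.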

HTML Java code

    package textReader;

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class HtmlReader {
        /**
         * @param filePath path of the .html file
         * @return the entire content of the HTML file, markup included
         */
        public String readHtml(String filePath) {
            StringBuilder sb = new StringBuilder();
            // The file is assumed to be encoded in GB2312
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(new FileInputStream(filePath), "GB2312"))) {
                String temp;
                while ((temp = br.readLine()) != null) {
                    sb.append(temp); // line breaks are dropped
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
            return sb.toString();
        }

        /**
         * @param filePath path of the .html file
         * @return the text content of the HTML file, without tags
         */
        public String getTextFromHtml(String filePath) {
            // Read the whole file, then keep only what lies between '>' and '<'
            String str = readHtml(filePath);
            StringBuilder buff = new StringBuilder();
            int begin = 0;
            // Stop when no further '>' exists; the original loop never ended in that case
            while ((begin = str.indexOf('>', begin)) != -1) {
                int end = str.indexOf('<', begin);
                if (end == -1) {
                    break; // no tag follows, so nothing more to extract
                }
                if (end - begin > 1) {
                    buff.append(str.substring(begin + 1, end));
                }
                begin = end + 1;
            }
            return buff.toString();
        }
    }
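
To see what the tag-stripping approach of the HTML reader produces, the loop can be tried on a string directly. The following self-contained sketch uses a hypothetical class TagStripDemo and checks explicitly for indexOf returning -1, so it terminates on input that has no tags or that has text after the last tag. It is a simple scan, not an HTML parser: script and style content is kept, entities such as &amp;amp; are not decoded, and text before the first tag or after the last one is dropped.

```java
/** Hypothetical demo of extracting the text that lies between '>' and '<'. */
final class TagStripDemo {
    static String textBetweenTags(String html) {
        StringBuilder out = new StringBuilder();
        int begin = 0;
        // indexOf returns -1 when no further '>' exists, which ends the loop
        while ((begin = html.indexOf('>', begin)) != -1) {
            int end = html.indexOf('<', begin);
            if (end == -1) {
                break; // no tag follows the last '>'
            }
            if (end - begin > 1) {
                out.append(html, begin + 1, end); // text between the two tags
            }
            begin = end + 1;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello</p><p>World</p></body></html>";
        System.out.println(textBetweenTags(html)); // prints "HelloWorld"
    }
}
```

As the output shows, text from adjacent elements is joined without a separator, so a caller that needs word boundaries would have to append a space or newline after each fragment.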
