[Original] Getting the text content of a Word document in Java

Requirement scenario

  If a web-based office system has to manage a large number of Word documents (for example, several thousand), users will inevitably ask for a way to find the documents that contain certain keywords. This requires the ability to read the text content of a Word file while ignoring the text styles, tables, pictures and other information in it.

Solution analysis

  Option 1: Use Apache POI to extract the text of every document on the server and store it in the database. To search, use an SQL statement to check whether the document text stored in the database contains the keywords. However, Microsoft Word currently has two document formats, doc and docx, and the two store their data very differently. Investigation shows that Apache POI provides separate APIs for doc and docx, so different code has to be written for each format. The Word format itself is complex, so the code that reads document content puts a certain load on the server, and this approach does not let users work on Word documents online.

  POI homepage address: https://poi.apache.org/
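
  To illustrate why Option 1 needs two code paths, here is a minimal sketch that tells the two container formats apart by their leading bytes, using only the JDK. The class and method names (WordFormatSniffer, detect) are made up for this example and are not part of POI. Note the assumption: the signature only identifies the container (OLE2 compound file for doc, ZIP package for docx), and other Office formats share the same signatures, so this is a first dispatch step rather than a full validation.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Apache POI: guesses the Word container
// format from the first bytes of a file so the caller can pick a code path.
class WordFormatSniffer {
    enum WordFormat { DOC, DOCX, UNKNOWN }

    // OLE2 compound file signature (used by .doc, but also by .xls and .ppt)
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // ZIP local file header signature (used by .docx, but also by .xlsx and plain .zip)
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // Classifies a buffer holding the first bytes of the file.
    static WordFormat detect(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return WordFormat.DOC;
        if (startsWith(header, ZIP_MAGIC)) return WordFormat.DOCX;
        return WordFormat.UNKNOWN;
    }

    // Reads up to 8 bytes from the stream, then classifies them.
    static WordFormat detect(InputStream in) throws IOException {
        byte[] buf = new byte[8];
        int filled = 0;
        while (filled < buf.length) {
            int n = in.read(buf, filled, buf.length - filled);
            if (n < 0) break; // end of stream reached before 8 bytes
            filled += n;
        }
        byte[] header = new byte[filled];
        System.arraycopy(buf, 0, header, 0, filled);
        return detect(header);
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }
}
```

  A caller would then branch on the result, sending DOC files to the doc-specific extraction code and DOCX files to the docx-specific code. With POI that would mean its separate doc and docx APIs.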

  Option 2: Use the getDocumentText method of the PageOffice component's FileSaver object to obtain the plain text content of the Word document. Using PageOffice for this also gives you online editing of Word files.

  

Implementation steps

  1. Call PageOffice to open the Word file online, for example test.doc:

PageOfficeCtrl poCtrl = new PageOfficeCtrl(request);
// Set the server page
poCtrl.setServerPage(request.getContextPath() + "/poserver.zz");
// Set the save page: SaveFile.jsp, SaveFile.do, SaveFile.action, etc. An action method or a RequestMapping method can also be used
poCtrl.setSaveFilePage("SaveFile.jsp");
// Open the Word document
poCtrl.webOpen("doc/test.doc", OpenModeType.docNormalEdit, "Tom");

  2. In the page (SaveFile.jsp) or method that handles the save operation, save the file and obtain the plain text content of the document:

FileSaver fs = new FileSaver(request, response);
// Save the file
fs.saveToFile(request.getSession().getServletContext().getRealPath("doc/") + "/" + fs.getFileName());
// Get the plain text content of the document, without any formatting
String strDocumentText = fs.getDocumentText();
// - write code that saves the text content of the document to the database - //
...
fs.close();

  3. When full-text search is needed, simply run an SQL query against the database field that stores the plain text content of the Word file.
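
  The query in step 3 could look roughly like the sketch below, which uses plain JDBC. The table name word_document and the columns file_name and plain_text are assumptions made for this example, since the article does not define a schema. The class DocumentTextSearch is likewise hypothetical. The keyword is bound as a parameter rather than concatenated into the SQL, and the LIKE wildcards in it are escaped so that a search for "100%" matches that literal text.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical search helper; table and column names are assumptions.
class DocumentTextSearch {
    // '!' is declared as the LIKE escape character to avoid backslash quoting issues.
    static final String SEARCH_SQL =
            "SELECT file_name FROM word_document WHERE plain_text LIKE ? ESCAPE '!'";

    // Turns a user keyword into a "contains" pattern with wildcards escaped.
    static String toLikePattern(String keyword) {
        String escaped = keyword
                .replace("!", "!!")   // escape the escape character first
                .replace("%", "!%")
                .replace("_", "!_");
        return "%" + escaped + "%";
    }

    // Returns the names of all documents whose stored text contains the keyword.
    static List<String> findFileNames(Connection conn, String keyword) throws SQLException {
        List<String> names = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(SEARCH_SQL)) {
            ps.setString(1, toLikePattern(keyword));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    names.add(rs.getString("file_name"));
                }
            }
        }
        return names;
    }
}
```

  A LIKE scan reads every row, which is usually acceptable for a few thousand documents. For much larger collections, the database's own full-text index would be worth considering.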

 
