Problems encountered when converting Word to HTML with POI

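I recently worked on a feature that converts Word documents to HTML. Since this area was a blind spot for me, the code borrows from an expert's post found online.

For details, see: https://blog.csdn.net/qq_18219457/article/details/97943035 . This article only covers the pitfalls.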

I. Converting docx files to HTML

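1. Garbled text: every read and write must use the same encoding, "utf-8". Otherwise the file looks fine when opened on its own but shows garbled characters when previewed in a browser. If that happens, go through the code carefully: some read or write is missing an explicit encoding.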
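A minimal sketch of the "same encoding everywhere" rule, using only the JDK. The class name `Utf8HtmlWriter` and its methods are hypothetical helpers, not part of the original code. The `<meta charset>` declaration is an extra precaution I am assuming is useful here, so that the browser does not have to guess the encoding.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: writes and reads HTML with an explicit UTF-8 charset on both sides.
class Utf8HtmlWriter {

    /** Wraps the body in a page that declares UTF-8, then writes it as UTF-8 bytes. */
    static void write(Path target, String body) throws IOException {
        String html = "<!DOCTYPE html><html><head><meta charset=\"UTF-8\"></head><body>"
                + body + "</body></html>";
        // Never rely on the platform default charset here.
        try (Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write(html);
        }
    }

    /** Reads the file back, decoding with the same charset that was used to write it. */
    static String read(Path source) throws IOException {
        return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".html");
        // \u4e2d\u6587 is a two-character Chinese sample, escaped so the source file encoding does not matter
        write(tmp, "<p>\u4e2d\u6587</p>");
        System.out.println(read(tmp).contains("\u4e2d\u6587"));
        Files.delete(tmp);
    }
}
```

If either side falls back to the platform default charset, the round trip can still appear to work on one machine and break on another, which matches the symptom described above.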

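2. A docx file is essentially a compressed (ZIP) archive of a Word document. Opening it with an archive tool shows the XML files inside. As shown in the figure:

Among them, document.xml is the main file of the docx. It holds the document's detailed content, such as text, images, formulas, and styles.

For details, see: https://blog.csdn.net/qq_18219457/article/details/98963136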
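Because a docx is a ZIP archive, its parts can be inspected without POI. The sketch below uses only `java.util.zip`. The class `DocxZipPeek` and the tiny stand-in archive are hypothetical, built in memory so the example runs without a real file. With a real document you would pass a `FileInputStream` for the .docx instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for looking inside a docx-style ZIP archive.
class DocxZipPeek {

    /** Lists the entry names of a ZIP stream (a .docx is such a stream). */
    static List<String> entryNames(InputStream zipData) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(zipData)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    /** Returns the UTF-8 text of one entry, or null if the entry does not exist. */
    static String readEntry(InputStream zipData, String name) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(zipData)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals(name)) {
                    ByteArrayOutputStream buf = new ByteArrayOutputStream();
                    byte[] chunk = new byte[8192];
                    int n;
                    while ((n = zip.read(chunk)) != -1) {
                        buf.write(chunk, 0, n);
                    }
                    return new String(buf.toByteArray(), StandardCharsets.UTF_8);
                }
            }
        }
        return null;
    }

    /** Builds a tiny stand-in archive with the same entry layout as a docx. */
    static byte[] sampleArchive() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("[Content_Types].xml"));
            zip.write("<Types/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write("<w:document/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] archive = sampleArchive();
        System.out.println(entryNames(new ByteArrayInputStream(archive)));
        System.out.println(readEntry(new ByteArrayInputStream(archive), "word/document.xml"));
    }
}
```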

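3. Conversion code: this only extracts the text content (including tables), images, and formulas. Whether other requirements come up still has to be tested, so this is a record for now.

Superscripts and subscripts in the text are not treated as formulas and must be converted manually.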

VerticalAlign vertAlign = run.getSubscript();

int sort = vertAlign.getValue();

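The value of sort means: 1. BASELINE: the text in the parent run sits on the baseline and is displayed at the same size as the surrounding text. 2. SUPERSCRIPT: the text is a superscript. 3. SUBSCRIPT: the text is a subscript.

Underline and similar run properties are handled the same way as superscripts and subscripts. For the available methods, see the documentation: http://poi.apache.org/apidocs/dev/org/apache/poi/xwpf/usermodel/XWPFRun.html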
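One way to do the manual conversion is to map the integer value onto HTML `<sup>` and `<sub>` tags. The sketch below is a hypothetical helper, not code from the original project. It assumes the value codes 1 = baseline, 2 = superscript, 3 = subscript, and that the text passed in has already been HTML-escaped.

```java
// Hypothetical helper: wraps run text in <sup>/<sub> according to the vertical-align code.
class VerticalAlignHtml {

    static final int BASELINE = 1;
    static final int SUPERSCRIPT = 2;
    static final int SUBSCRIPT = 3;

    /**
     * @param code        vertical-align value of the run (assumed: 1, 2 or 3 as listed above)
     * @param escapedText run text, already HTML-escaped by the caller
     */
    static String wrap(int code, String escapedText) {
        switch (code) {
            case SUPERSCRIPT:
                return "<sup>" + escapedText + "</sup>";
            case SUBSCRIPT:
                return "<sub>" + escapedText + "</sub>";
            default:
                // Baseline or unknown value: emit the text unchanged.
                return escapedText;
        }
    }

    public static void main(String[] args) {
        // x squared, then H2O with a subscript 2
        System.out.println("x" + wrap(SUPERSCRIPT, "2"));
        System.out.println("H" + wrap(SUBSCRIPT, "2") + "O");
    }
}
```

In the conversion code, `sort` from `vertAlign.getValue()` would be passed as `code` while the paragraph's runs are being concatenated.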

/**
	 * Converts a docx file to HTML.
	 * @param storeFile the docx file to convert
	 * @param relPath   output directory, relative to CommonConstants.fileRoot
	 * @return relative URL of the generated HTML file
	 */
	public synchronized static String docxToHtml(File storeFile,String relPath) {
		String htmlUrl = "";
		// try-with-resources closes the input stream and the document even if conversion fails
		try (FileInputStream in = new FileInputStream(storeFile);
				XWPFDocument document = new XWPFDocument(in)) {
			List<IBodyElement> elements = document.getBodyElements();
			StringBuffer text = new StringBuffer();
			String s = UUID.randomUUID().toString();
			// Remove the "-" separators from the UUID
			String aString = s.substring(0,8)+s.substring(9,13)+s.substring(14,18)+s.substring(19,23)+s.substring(24);
			String htmlName = aString + ".html";
			htmlUrl = relPath +"/"+ htmlName;
			// Target HTML file
			File htmlFile = new File(CommonConstants.fileRoot + relPath +"/"+htmlName);
			// Image folder; creating it also creates the parent folders of the HTML file
			String imgPath = CommonConstants.fileRoot+relPath +"/"+aString+ "/image";
			File folder = new File(imgPath);
			if (!folder.exists()) {
				folder.mkdirs();
			}
			// Build the ImageParse helper
			ImageParse imageParse = new ImageParse(imgPath, "");
			imageParse.setRelUrl(relPath+"/"+aString+ "/image/");
			if(elements != null){
				for (IBodyElement element : elements) {
					if (element instanceof XWPFParagraph) {// paragraph
						text.append(getParagraphText((XWPFParagraph) element,document,imgPath,imageParse));
					}else if (element instanceof XWPFTable) {// table
						text.append(getTabelText((XWPFTable) element,document,imgPath,imageParse));
					}
				}
			}
			// Write with an explicit encoding so the browser preview is not garbled
			try (OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(htmlFile),"UTF-8")) {
				out.write(text.toString());
			}
		} catch (Exception e) {
			e.printStackTrace();
		} finally {
			// map (a field of this class) holds auto-numbering state; clear it on success and on failure
			map.clear();
		}
		
		return htmlUrl;
	}
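The five `substring` calls that strip the dashes from the UUID can be replaced by a single `String.replace`. The sketch below compares both forms. The class `UuidNames` is a hypothetical helper for illustration only.

```java
import java.util.UUID;

// Hypothetical helper comparing two ways of removing the dashes from a UUID string.
class UuidNames {

    /** The approach used in docxToHtml: cut around the dashes at fixed positions. */
    static String bySubstring(String s) {
        return s.substring(0, 8) + s.substring(9, 13) + s.substring(14, 18)
                + s.substring(19, 23) + s.substring(24);
    }

    /** Equivalent and shorter: remove every "-" literally. */
    static String byReplace(String s) {
        return s.replace("-", "");
    }

    public static void main(String[] args) {
        String uuid = UUID.randomUUID().toString();
        // Both forms yield the same 32-character name.
        System.out.println(bySubstring(uuid).equals(byReplace(uuid)));
    }
}
```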

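4. Handling math formulas:

Math formulas commonly come in two formats.

The first kind is edited in WPS. These formulas can be converted to images. The best approach I know of so far is converting them to SVG. SVG is a vector format, so scaling does not blur the image regardless of screen resolution, and it displays well on screen.

The other kind is edited in Microsoft Word. These formulas are stored as OMML markup, which can be converted to MathML. Some browsers, such as Firefox, already support MathML tags. Some mainstream browsers cannot display it directly yet, although it can be shown by including a plugin on the front end. It can also be converted to an image, but I only managed PNG and do not know how to generate SVG. I would appreciate pointers from anyone who does.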

Formula type 1:

/**
 * Builds an <img> tag for a formula that is stored as an image (OLE object).
 * @param run     the run that contains the formula
 * @param runNode DOM node of the run
 * @param imgPath directory where extracted images are saved
 * @return the <img> HTML, or an empty string if the run holds no such formula
 * @throws Exception
 */
private static String getMath(XWPFRun run, Node runNode,String imgPath) throws Exception {
	StringBuffer math = new StringBuffer("<img");
	Node objectNode = getChildNode(runNode, "w:object");
	if (objectNode == null) {
		return "";
	}
	Node shapeNode = getChildNode(objectNode, "v:shape");
	if (shapeNode == null) {
		return "";
	}
	Node imageNode = getChildNode(shapeNode, "v:imagedata");
	if (imageNode == null) {
		return "";
	}
	Node binNode = getChildNode(objectNode, "o:OLEObject");
	if (binNode == null) {
		return "";
	}

	XWPFDocument word = run.getDocument();

	NamedNodeMap shapeAttrs = shapeNode.getAttributes();
	// Width and height of the image as displayed in Word
	String style = shapeAttrs.getNamedItem("style").getNodeValue();

	NamedNodeMap imageAttrs = imageNode.getAttributes();
	// Relationship ID of the image in the Word document
	String imageRid = imageAttrs.getNamedItem("r:id").getNodeValue();
	// Look up the image part
	PackagePart imgPart = word.getPartById(imageRid);
	String imgUrl = imgPart.getPartName().getName();
	// Image name without extension; it keeps the leading "/"
	String imgName = imgUrl.substring(imgUrl.lastIndexOf("/"),imgUrl.lastIndexOf("."));
	// Save the formula image
	try (InputStream in = imgPart.getInputStream();
			FileOutputStream out = new FileOutputStream(new File(imgPath + imgName + ".wmf"))) {
		byte[] buffer = new byte[8192];
		int readByte;
		while ((readByte = in.read(buffer)) != -1) {
			out.write(buffer, 0, readByte);
		}
	}
	// Convert the .wmf image to svg
	convertSVG(imgPath + imgName + ".wmf");
	imgName += ".svg";
	NamedNodeMap binAttrs = binNode.getAttributes();
	// Relationship ID of the formula's binary file in the Word document
	String binRid = binAttrs.getNamedItem("r:id").getNodeValue();
	// Look up the binary part
	PackagePart binPart = word.getPartById(binRid);
	// Save the formula source file for later use
	File file = new File(imgPath.substring(0, imgPath.lastIndexOf("/image")) + "/math_source");
	if (!file.exists()) {
		file.mkdirs();
	}
	String binUrl = binPart.getPartName().getName();
	String binName = binUrl.substring(binUrl.lastIndexOf("/"));
	try (InputStream inBin = binPart.getInputStream();
			FileOutputStream outBin = new FileOutputStream(new File(file.getPath() + binName))) {
		byte[] bufferBin = new byte[8192];
		int readByteBin;
		// Fixed: the original read into bufferBin but wrote 0 bytes from the other buffer
		while ((readByteBin = inBin.read(bufferBin)) != -1) {
			outBin.write(bufferBin, 0, readByteBin);
		}
	}
	// Quote fileRoot so that regex metacharacters in the path are matched literally
	String relPath = imgPath.replaceFirst(java.util.regex.Pattern.quote(CommonConstants.fileRoot), "");
	math.append(" src=\""+relPath+imgName+"\"");
	math.append(" style=\""+style+"\"/> ");
	return math.toString();
}
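Both copy loops in `getMath` follow the same read-then-write pattern, where the count returned by `read` and the buffer passed to `write` must be the same pair. A minimal sketch of that pattern as a reusable method follows. `StreamCopy` is a hypothetical helper name, and in-memory streams stand in for the package part and the output file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;

// Hypothetical helper: one copy loop shared by every place that saves a stream.
class StreamCopy {

    /** Copies all bytes from in to out and returns how many bytes were copied. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        // read() returns the number of bytes placed in buffer, or -1 at end of stream
        while ((n = in.read(buffer)) != -1) {
            // write exactly the n bytes that were just read, from the same buffer
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[20000];
        Arrays.fill(data, (byte) 7);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(data), sink);
        System.out.println(copied == data.length && Arrays.equals(data, sink.toByteArray()));
    }
}
```

When the destination is a file, `java.nio.file.Files.copy(InputStream, Path)` does the same job without a hand-written loop.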
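/**
 * Finds the first descendant element with the given tag name (depth-first).
 * @param node     node to search under
 * @param nodeName qualified tag name, for example "w:object"
 * @return the matching node, or null if none is found
 */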
private static Node getChildNode(Node node, String nodeName) {
	if (!node.hasChildNodes()) {
		return null;
	}
	NodeList childNodes = node.getChildNodes();
	for (int i = 0; i < childNodes.getLength(); i++) {
		Node childNode = childNodes.item(i);
		if (nodeName.equals(childNode.getNodeName())) {
			return childNode;
		}
		childNode = getChildNode(childNode, nodeName);
		if (childNode != null) {
			return childNode;
		}
	}
	return null;
}
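To see how this depth-first lookup behaves, the sketch below runs the same search on a hand-written XML fragment with the JDK's DOM parser. The class `DomSearch`, the sample fragment, and its attribute values are hypothetical. The parser is left in its default non-namespace-aware mode, so prefixed names such as `v:shape` are compared as plain strings, which is what the lookup above relies on.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo of the depth-first tag lookup on a small XML fragment.
class DomSearch {

    // Made-up fragment shaped like a run that embeds an OLE formula object.
    static final String SAMPLE =
            "<w:r><w:object><v:shape style=\"width:10pt;height:12pt\">"
            + "<v:imagedata r:id=\"rId5\"/></v:shape>"
            + "<o:OLEObject r:id=\"rId6\"/></w:object></w:r>";

    static Document parse(String xml) throws Exception {
        // Default factory settings: not namespace aware, prefixes are part of the node name.
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Returns the first descendant named nodeName in document order, or null. */
    static Node findDescendant(Node node, String nodeName) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (nodeName.equals(child.getNodeName())) {
                return child;
            }
            Node deeper = findDescendant(child, nodeName);
            if (deeper != null) {
                return deeper;
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        Node root = parse(SAMPLE).getDocumentElement();
        Node image = findDescendant(root, "v:imagedata");
        System.out.println(image.getAttributes().getNamedItem("r:id").getNodeValue());
    }
}
```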
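/**
 * Converts a wmf file to an svg image.
 * @param path path of the wmf file
 * @return path of the converted file, or null on failure
 */
public static String convertSVG(String path) {
	try {
	    // Swap only the trailing extension; replacing every "wmf" would also rewrite directory names
	    int dot = path.lastIndexOf('.');
	    String svgFile = (dot >= 0 ? path.substring(0, dot) : path) + ".svg";
	    wmfToSvg(path, svgFile);
	    return svgFile;
	} catch (Exception e) {
	    e.printStackTrace();
	}
	return null;
}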
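Deriving the .svg path deserves some care: replacing every occurrence of "wmf" in the path also rewrites a directory that happens to be named wmf. The sketch below shows a helper that swaps only the trailing extension. `ExtensionSwap` and the sample paths are hypothetical.

```java
// Hypothetical helper: change a file's extension without touching the rest of the path.
class ExtensionSwap {

    /** Returns path with its extension replaced by newExt (given without the dot). */
    static String withExtension(String path, String newExt) {
        int dot = path.lastIndexOf('.');
        int sep = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        // Only treat the dot as an extension separator if it is in the file name itself.
        String base = dot > sep ? path.substring(0, dot) : path;
        return base + "." + newExt;
    }

    public static void main(String[] args) {
        String path = "/data/wmf/image1.wmf";
        // A plain substring replacement changes the directory name as well.
        System.out.println(path.replace("wmf", "svg"));
        System.out.println(withExtension(path, "svg"));
    }
}
```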

/**
 * Converts a wmf file to svg.
 * 
 * @param src  path of the source wmf file
 * @param dest path of the svg file to write (a ".svgz" suffix produces gzip-compressed output)
 */
public static void wmfToSvg(String src, String dest) {
    boolean compatible = false;
    // try-with-resources closes the input file even if parsing fails
    try (InputStream in = new FileInputStream(src)) {
        WmfParser parser = new WmfParser();
        final SvgGdi gdi = new SvgGdi(compatible);
        parser.parse(in, gdi);

        Document doc = gdi.getDocument();
        OutputStream out = new FileOutputStream(dest);
        if (dest.endsWith(".svgz")) {
            out = new GZIPOutputStream(out);
        }

        // output() flushes and closes out
        output(doc, out);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
private static void output(Document doc, OutputStream out) throws Exception {
    TransformerFactory factory = TransformerFactory.newInstance();
    Transformer transformer = factory.newTransformer();
    transformer.setOutputProperty(OutputKeys.METHOD, "xml");
    transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    transformer.setOutputProperty(OutputKeys.INDENT, "yes");
    transformer.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC,
            "-//W3C//DTD SVG 1.0//EN");
    transformer.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM,
            "http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd");
    transformer.transform(new DOMSource(doc), new StreamResult(out));
    out.flush();
    out.close();
}

Formula type 2:

// XSL stylesheet resource that converts Office OMML formulas to MathML
	private static File stylesheet = new File(CommonConstants.fileRoot+"/OMML2MML.XSL");
	private static StreamSource streamSource = new StreamSource(stylesheet);
	
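    /**
     * Converting the node directly breaks when the formula contains an equals sign,
     * so serialize it to an XML string first, then transform that into a MathML string.
     *
     * @param node the OMML node
     * @return the MathML markup as a string
     * @throws Exception 
     */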
    private static String getMathMLFromNode(Node node) throws Exception {
        String s = W3cNodeUtil.node2XmlStr(node);
        // encoding utf-16
        String mathML = W3cNodeUtil.xml2Xml(s, streamSource);

        mathML = mathML.replaceAll("xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\"", "");
        mathML = mathML.replaceAll("xmlns:mml", "xmlns");
        mathML = mathML.replaceAll("mml:", "");
        return mathML;
    }
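The three replacements at the end of `getMathMLFromNode` turn prefixed MathML into unprefixed MathML. The sketch below applies the same three steps to a small hand-written string so the effect is visible. `MathMlCleanup` and the sample input are hypothetical and are not real output of OMML2MML.XSL. The sketch uses `String.replace`, which matches literally, instead of the regex-based `replaceAll`.

```java
// Hypothetical demo of the namespace cleanup applied to the transformed MathML string.
class MathMlCleanup {

    static final String OMML_NS_DECL =
            "xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\"";

    static String clean(String mathML) {
        // 1. Drop the leftover OMML namespace declaration.
        String result = mathML.replace(OMML_NS_DECL, "");
        // 2. Turn the prefixed MathML namespace declaration into the default namespace.
        result = result.replace("xmlns:mml", "xmlns");
        // 3. Strip the mml: prefix from every tag.
        return result.replace("mml:", "");
    }

    public static void main(String[] args) {
        // Made-up input in the shape the original code expects from the stylesheet.
        String input = "<mml:math xmlns:mml=\"http://www.w3.org/1998/Math/MathML\" "
                + OMML_NS_DECL + "><mml:mi>x</mml:mi></mml:math>";
        System.out.println(clean(input));
    }
}
```

The order matters: the declaration `xmlns:mml` has to be rewritten before the `mml:` prefix is stripped, otherwise the second step would find nothing to match.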


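    /**
     * Converts an OMML formula to a PNG image, going through MathML.
     * @param xmlObject   the OMML formula object
     * @param imageParser helper that stores the image and knows its relative URL
     * @return the <img> HTML, or null if the conversion fails
     */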
    public static String convertOmathToPng(XmlObject xmlObject, ImageParse imageParser) {
        Document document = null;
        try {
            String mathMLStr = getMathMLFromNode(xmlObject.getDomNode());
            document = W3cNodeUtil.xmlStr2Node(mathMLStr, "utf-16");
            return documentToImageHTML(document, imageParser);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }

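    /**
     * Renders a MathML Document to PNG and returns the <img> HTML for it.
     * @param node        the MathML document
     * @param imageParser helper that stores the image and knows its relative URL
     * @return the <img> HTML, or null if rendering fails
     */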
    private static String documentToImageHTML(Document node, ImageParse imageParser) {
        try {
            Converter mathMLConvert = Converter.getInstance();
            LayoutContextImpl localLayoutContextImpl = new LayoutContextImpl(LayoutContextImpl.getDefaultLayoutContext());
            localLayoutContextImpl.setParameter(Parameter.MATHSIZE, 64);
            ByteArrayOutputStream os = new ByteArrayOutputStream();
            Dimension den = mathMLConvert.convert(node, os, "image/png", localLayoutContextImpl);
            String pngName = imageParser.parse(os.toByteArray(), ".png");
            os.close();
            double height = den.getHeight()*0.17;
            return "<img src=\""+ imageParser.getRelUrl() + pngName + "\"  style=\"vertical-align: text-bottom; height:"+height+"pt;\"/>";
        } catch (IOException e) {
            e.printStackTrace();
            logger.error("OmmlUtils.documentToImageHTML", e);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }

5. Image handling: omitted. For details, see: https://blog.csdn.net/qq_18219457/article/details/98184621

II. Converting doc files to HTML

https://blog.csdn.net/qq_18219457/article/details/97943035

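Converting a doc file to HTML produces markup that carries tags, styles, and so on. If you want clean data, do not edit in the doc format. Mature parsing tools exist for it, but Office now creates new Word documents as docx by default, so doc files will become rarer and gradually fall out of use. I will not focus on them here.

Notes from a newcomer. If anything is wrong, corrections are welcome.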
