Convert HTML files to PDF files

Without further ado, let’s start with the code:

package com.xxxxx.util.file;

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.io.Reader;
import java.io.StringReader;

import org.apache.commons.io.FileUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.w3c.tidy.Tidy;
import org.xhtmlrenderer.pdf.ITextFontResolver;
import org.xhtmlrenderer.pdf.ITextRenderer;

import com.itextpdf.text.pdf.BaseFont;
import com.lowagie.text.DocumentException;
import com.xxxxx.entity.SystemParam;
import com.xxxxx.service.SystemParamService;
import com.xxxxx.util.ApplicationUtil;
import com.xxxxx.util.Const;
import com.xxxxx.util.StringUtil;

/**
 * Converts HTML to PDF.
 * 
 * @author sxgkwei
 *
 */
public class HtmlToPdf {
	private static final Log log = LogFactory.getLog(HtmlToPdf.class);

	public static void main(String[] args) throws IOException, Exception {

	}

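	/**
	 * Generates a PDF for the source file in the system default directory.
	 * 
	 * @param sourceFile
	 *            the source HTML file
	 * @return -1 = failure; 0 = created successfully. If HTML rendering throws,
	 *         the result of the MsToPdf fallback is returned instead.
	 */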
	public static int htmlToPdf(String sourceFile) {
		String outPath = com.xxxxx.util.FileUtils.getPdfByPath(sourceFile);
		try {
			String txt = getStringFromHtml(sourceFile);
			if (StringUtils.isBlank(txt)) {
				log.error("HTML to PDF conversion failed: no content could be read from the HTML. path=" + sourceFile);
				return -1;
			}
			toPdf(formatHtml(txt), outPath);
		} catch (Exception e) {
			log.error("HTML to PDF conversion failed: path=" + sourceFile, e);
			// Conversion still failed: fall back to MS Office so that a PDF is produced whenever possible
			return MsToPdf.officeToPdf(sourceFile, outPath);
		}

		return 0;
	}

	private static void toPdf(String html, String savePath) throws Exception {
		ITextRenderer renderer = new ITextRenderer();
		// Register the font directory so that Chinese text is rendered
		addFontDirectory(renderer.getFontResolver(), ApplicationUtil.getRoot() + "/css/fonts", BaseFont.NOT_EMBEDDED);
		renderer.setDocumentFromString(html);
		renderer.layout();
		try (OutputStream os = FileUtils.openOutputStream(new File(savePath))) {
			renderer.createPDF(os);
			os.flush();
		}
	}

	private static void addFontDirectory(ITextFontResolver resolver, String dir, boolean embedded) throws DocumentException, IOException {
		File f = new File(dir);
		if (f.isDirectory()) {
			File[] files = f.listFiles((d, name) -> {
				String lower = name.toLowerCase();
				return lower.endsWith(".otf") || lower.endsWith(".ttf") || lower.endsWith(".ttc");
			});
			if (files != null) {
				for (int i = 0; i < files.length; i++) {
					resolver.addFont(files[i].getAbsolutePath(), BaseFont.IDENTITY_H, embedded);
				}
			}
		}

	}

	private static String formatHtml(String html) throws IOException {

		SystemParamService service = ApplicationUtil.getBean(SystemParamService.class);
		String command = service.queryValueByKey(SystemParam.KEY_3664);
		html = "<style>" + command + " body{font-family: SimSun;}</style>" + html;

		Tidy tidy = new Tidy();
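		tidy.setQuiet(true);// do not print Tidy's long description of the HTML to the console
		tidy.setShowErrors(-1);// suppress all per-line and unrecognized-tag error messages

		tidy.setWraplen(Integer.MAX_VALUE);// the wrap length must be set; otherwise even error-free tags can be broken by line wrapping in the formatted output
		tidy.setMakeClean(true);// clean up presentational markup, e.g. in HTML exported from MS Office
		tidy.setXHTML(true); // output XHTML (XML output is also possible)
		tidy.setTidyMark(false); // otherwise Tidy adds a meta tag to the output
		tidy.setXmlPi(true); // add the <?xml version="1.0"?> declaration
		tidy.setInputEncoding(Const.WEB_CHARSET);// input character set
		tidy.setOutputEncoding(Const.WEB_CHARSET);// output character set
		tidy.setForceOutput(true);// always emit the HTML source, even if errors remain; otherwise Tidy outputs nothing when there are errors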

		try (ByteArrayOutputStream os2 = new ByteArrayOutputStream(); Reader reader = new StringReader(html)) {
			tidy.parse(reader, os2);
			return os2.toString(Const.WEB_CHARSET);
		}
	}

	private static String getStringFromHtml(String path) throws IOException {
		String txt = "";
		File file = new File(path);
		if (file.exists() && file.length() > 0) {
			txt = FileUtils.readFileToString(file, Const.WEB_CHARSET);
			String charset = StringUtil.getHtmlCharset(txt);
			if (!Const.WEB_CHARSET.equalsIgnoreCase(charset)) {
				txt = FileUtils.readFileToString(file, charset);
			}
		}
		return txt;
	}

}

Now let's walk through the code:

 

1. In the formatHtml method, the statement prepended to every HTML document is actually:

<style>@page{size:210mm 297mm;} body{font-family: SimSun;}</style>

This statement is very important. In it:

@page is a directive that specifies the page size of the generated PDF file; 210mm by 297mm is exactly A4 paper. For a landscape PDF, just swap the two numbers. Larger numbers give a larger PDF page. When the HTML flows vertically the content can be very long, so you may consider raising the vertical value to 500mm.

The font-family declaration on body sets the page's fallback font: if the page specifies no font of its own, the default Chinese font is used. For this to work, just put the Chinese font file SIMSUN.TTC in the project's /css/fonts/ directory.
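
As a small illustration of the two declarations above, the sketch below builds the same style prefix from a page width and height in millimetres. The names PageStyle and stylePrefix are hypothetical and not part of the original code, which reads the @page rule from a system parameter instead.

```java
// Hypothetical helper for illustration; not part of the original HtmlToPdf class.
class PageStyle {

    /**
     * Builds the <style> prefix that is placed in front of the HTML before cleaning.
     * widthMm and heightMm go into the @page size rule; fontFamily is the body fallback font.
     */
    static String stylePrefix(int widthMm, int heightMm, String fontFamily) {
        return "<style>@page{size:" + widthMm + "mm " + heightMm + "mm;}"
                + " body{font-family: " + fontFamily + ";}</style>";
    }

    public static void main(String[] args) {
        // A4 portrait, identical to the statement shown above.
        System.out.println(stylePrefix(210, 297, "SimSun"));
        // A4 landscape: swap the two numbers.
        System.out.println(stylePrefix(297, 210, "SimSun"));
        // Extra-tall page for long vertical content.
        System.out.println(stylePrefix(210, 500, "SimSun"));
    }
}
```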

 

2. The HTML file is read with the correct encoding to prevent garbled characters; this is handled mainly in the getStringFromHtml method. The system's default read encoding is:

Const.WEB_CHARSET="UTF-8"

This relies on reading the file once and then determining the HTML's actual encoding. The helper method is as follows:

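import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Finds the charset value declared in the HTML text; returns UTF-8 if none is found.
 * 
 * @param html
 *            the HTML text
 * @return the declared charset name, or UTF-8 by default
 */
public static String getHtmlCharset(String html) {
	String charset = Const.WEB_CHARSET;
	// letters, digits, '-' and '_' (the underscore covers names such as Shift_JIS)
	String reg = "charset[\\s]*=[\\s]*['\"]?[\\s]*([a-zA-Z0-9_\\-]+)[\\s]*['\"]?";
	Pattern p = Pattern.compile(reg);
	Matcher m = p.matcher(html);
	if (m.find()) {
		charset = m.group(1);
	}
	return charset;
}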

In effect, it reads the charset declaration that HTML pages normally carry and returns UTF-8 when none is found. As the regular expression above suggests, detection relies on the two conventional forms of that declaration:

<meta charset='utf-8'>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

Of course, both forms have many variants, such as extra spaces in various positions or single versus double quotes, and the regular expression has to allow for all of them.

So the HTML file is first read as UTF-8 and then inspected for its declared encoding. If the declared encoding differs from the one used for the first read, the file is read again with the declared encoding.
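
The read, detect, and re-read flow described above can be sketched without any file or project dependencies. This is only an illustration under assumptions: CharsetSniffer, detect, and decode are hypothetical names, the input is a byte array rather than a file, and an illegal or unsupported charset name falls back to UTF-8, which the original code does not do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch of the two-pass read; not the original getStringFromHtml.
class CharsetSniffer {
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*['\"]?\\s*([a-zA-Z0-9_\\-]+)\\s*['\"]?");

    /** Returns the declared charset, or UTF-8 if none is declared or the name is unusable. */
    static Charset detect(String html) {
        Matcher m = CHARSET.matcher(html);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported name: fall back to UTF-8 (an assumption of this sketch).
            }
        }
        return StandardCharsets.UTF_8;
    }

    /** First pass decodes as UTF-8; a second pass runs only if the page declares another charset. */
    static String decode(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_8);
        Charset declared = detect(text);
        return StandardCharsets.UTF_8.equals(declared) ? text : new String(raw, declared);
    }

    public static void main(String[] args) {
        String page = "<meta charset='ISO-8859-1'><p>caf\u00e9</p>";
        byte[] raw = page.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(raw));
    }
}
```

The first pass works even when the guess is wrong because the meta tag itself consists of ASCII characters, which decode identically in UTF-8 and in most legacy encodings.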

 

3. Dynamic support for a Chinese font folder. The idea is that when customers find some Chinese characters rendered as blanks, they can add font files to the folder at any time and run the conversion again without restarting the application server. This is handled mainly by the addFontDirectory method.

This method is essentially a copy of the addFontDirectory method of the ITextFontResolver class, slightly modified in two ways: it also accepts files with the .ttc suffix, and it changes the font encoding from CP1252 to IDENTITY_H.

 

4. The use of JTidy. JTidy cleans up the HTML code; this is the core of the conversion, so it gets the most attention here.

    a. JTidy has a known flaw: its last update was in 2009, so it lacks support for HTML5 tags and simply drops them when it encounters them.

    b. I tested org.jsoup.Jsoup. Its HTML processing leans toward formatting, and its ability to repair errors in the HTML itself is weak, so the later iText conversion step reports all kinds of format errors.

    c. I tested htmlcleaner. Unlike JTidy, it does not wrap JS and CSS code blocks in CDATA sections, so the later conversion step often fails on characters inside the JS code.

    To sum up: of the solutions I tried, only JTidy cleans the HTML well enough that the later iText conversion step does not report errors.
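
To see why the CDATA wrapping mentioned in point c matters, recall that the ITextRenderer used above parses its input as XML. The sketch below checks well-formedness with the JDK's built-in XML parser only. WellFormedCheck is a hypothetical helper, not part of the original code, and the sample input omits the DOCTYPE for simplicity.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper for illustration; not part of the original code.
class WellFormedCheck {

    /** Returns true if the text parses as well-formed XML with the JDK parser. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Rethrow parse errors instead of letting the parser print them to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXParseException { throw e; }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String js = "if (a < b && c) { run(); }";
        String raw = "<html><head><script>" + js + "</script></head><body/></html>";
        String wrapped = "<html><head><script><![CDATA[" + js + "]]></script></head><body/></html>";
        System.out.println("raw script:    " + isWellFormed(raw));
        System.out.println("CDATA-wrapped: " + isWellFormed(wrapped));
    }
}
```

The raw script is rejected because `<` and `&` are markup characters in XML, while the CDATA-wrapped version is accepted.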

    Here is the key point: JTidy has a dizzying number of settings, and the settings you can find through a Baidu search are always unsatisfactory in some cases. After going through the source code item by item, I arrived at the following settings:

Tidy tidy = new Tidy();
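tidy.setQuiet(true);// do not print Tidy's long description of the HTML to the console
tidy.setShowErrors(-1);// suppress all per-line and unrecognized-tag error messages
tidy.setWraplen(Integer.MAX_VALUE);// the wrap length must be set; otherwise even error-free tags can be broken by line wrapping in the formatted output
tidy.setMakeClean(true);// clean up presentational markup, e.g. in HTML exported from MS Office
tidy.setXHTML(true); // output XHTML (XML output is also possible)
tidy.setTidyMark(false); // otherwise Tidy adds a meta tag to the output
tidy.setXmlPi(true); // add the <?xml version="1.0"?> declaration
tidy.setInputEncoding(Const.WEB_CHARSET);// input character set
tidy.setOutputEncoding(Const.WEB_CHARSET);// output character set
tidy.setForceOutput(true);// always emit the HTML source, even if errors remain; otherwise Tidy outputs nothing when there are errors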

Every line above earns its place. Compared with the settings you can find through a Baidu search, the ones added here deal with console output, line width, forced output, and MS Office cleanup. There are only a few short lines of code, but it took me two days of reading the source item by item to sort them out; use them, and cherish them.

 

5. As the last resort in the htmlToPdf method, MsToPdf.officeToPdf is called. Why is it written this way?

    Our system already had a conversion method that called MS Office through jacob. Its output was poor, though: table borders and styles were frequently lost. That is what prompted us to implement a new and better strategy for HTML to PDF conversion.

    Still, MS Office conversion has one irreplaceable benefit: it does not fail and always gives you a PDF file, even if that PDF is ugly or its layout is slightly messy. That is the fallback strategy above: we aim for better output, but when better is not available, something to look at beats nothing.
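
The fallback strategy can be reduced to a small pattern: try the preferred converter, and on any exception hand the same input to the converter of last resort. In the sketch below the Converter interface and the lambdas are hypothetical stand-ins for toPdf and MsToPdf.officeToPdf; only the control flow mirrors the htmlToPdf method.

```java
// Hypothetical sketch of the fallback pattern; the converters are stand-ins.
class FallbackConvert {

    /** A conversion step that returns 0 on success and -1 on failure, or throws. */
    interface Converter {
        int convert(String sourceFile, String outPath) throws Exception;
    }

    /** Runs the primary converter; if it throws, returns whatever the fallback returns. */
    static int convertWithFallback(Converter primary, Converter fallback,
                                   String sourceFile, String outPath) {
        try {
            return primary.convert(sourceFile, outPath);
        } catch (Exception e) {
            System.err.println("primary conversion failed: " + e.getMessage());
            try {
                return fallback.convert(sourceFile, outPath);
            } catch (Exception fallbackError) {
                return -1; // both strategies failed
            }
        }
    }

    public static void main(String[] args) {
        Converter strict = (src, out) -> { throw new IllegalStateException("malformed HTML"); };
        Converter lenient = (src, out) -> 0; // pretends to always produce a PDF
        System.out.println(convertWithFallback(strict, lenient, "a.html", "a.pdf"));
    }
}
```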

 

A closing note:

    In programming, technique is everywhere. Technique is not the latest version, nor distributed microservices, cloud computing, or big data. It is every carefully considered idea, the most appropriate decision in each scenario, the thought given to the boundaries that every external interface must handle, and a tireless pursuit of detail. Don't be obsessed with new technologies and new versions; real skill is your mastery of how things work.
