Interface Scraping: Extracting Table Data from a Web Page

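I was recently given a task: scrape a data set that sits in a table on a web page, a few hundred rows in all. Opening the browser's developer tools showed that the endpoint simply returns an HTML page, so the response can be handled as a plain string. (Parsing the HTML with an XPath-based crawler is somewhat cumbersome.) My approach was to use a regular expression to match every table row and then extract the cell contents. A few other problems came up along the way: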

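1. I first tried to extract the cell contents directly, but the contents include text in many different languages, which turned out to be a pitfall, so I dropped that approach.

2. After cutting out each table row, I found that the two fields were separated by spaces and that the number of spaces varied, so I used the split method with a limit to cap the size of the resulting array.

3. An incorrect character encoding produced garbled text.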
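Points 2 and 3 can be illustrated with a minimal sketch. The class and method names below (`CellCleanupSketch`, `splitFields`, `decode`) are made up for illustration and are not part of the original code. The sketch assumes the first field contains no spaces, and it uses UTF-8 only as an example; the right charset is whatever the page actually declares.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CellCleanupSketch {

    // Splits "name<spaces>code" into two fields. The limit of 2 caps the array at
    // two elements, so only the first space acts as a separator; the extra spaces
    // stay at the front of the second field and are removed afterwards.
    static String[] splitFields(String row) {
        String[] parts = row.trim().split(" ", 2);
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].replace(" ", "");
        }
        return parts;
    }

    // Decodes raw response bytes with an explicitly chosen charset instead of
    // relying on a default that may not match the page.
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitFields("German      de")));

        byte[] body = "Fran\u00e7ais".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(body, StandardCharsets.UTF_8));      // decoded as intended
        System.out.println(decode(body, StandardCharsets.ISO_8859_1)); // garbled: wrong charset
    }
}
```

Without the limit, `split(" ")` would return one empty string for each extra space, so the array length would depend on how many spaces the row happened to contain.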

Here is the code, shared for reference:

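public static void main(String[] args) {

		// getHttpGet, getHttpResponse, output, testOver and the constants LINE, EMPTY and SPACE_1
		// are helpers defined elsewhere and are not shown in this post.
		String url = "https://docs.oracle.com/cd/E13214_01/wli/docs92/xref/xqisocodes.html";
		HttpGet httpGet = getHttpGet(url);
		JSONObject httpResponse = getHttpResponse(httpGet);
		// The response body is the raw HTML page, handled as a plain string.
		String content = httpResponse.getString("content");
		// Each table row spans several lines of HTML, so the pattern chains per-line pieces with LINE.
		List<String> strings = regexAll(content, "<tr.+</a>" + LINE + ".+" + LINE + ".+" + LINE + ".+" + LINE + ".+" + LINE + ".+" + LINE + "</div>");
		int size = strings.size();
		for (int i = 0; i < size; i++) {
			// Strip the markup and the line breaks, leaving only the cell text.
			String s = strings.get(i).replaceAll("<.+>", EMPTY).replaceAll(LINE, EMPTY);
			// Split at the first space only, so the array has at most two elements.
			String[] split = s.split(" ", 2);
			String sql = "INSERT country_code (country,code) VALUES (\"%s\",\"%s\");";
			// Remove the leftover SPACE_1 occurrences from each field before formatting the statement.
			output(String.format(sql, split[0].replace(SPACE_1, EMPTY), split[1].replace(SPACE_1, EMPTY)));
		}
		testOver();
	}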
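Because the method above depends on helpers that are not shown, here is a self-contained sketch of the same idea that runs against an inline sample. The sample markup, the class name `RowExtractSketch` and the method `extractRows` are hypothetical; the real page's layout is different, so the row pattern here is much shorter than the one above. The sketch also swaps in the tag pattern `<[^>]+>` in place of the greedy `<.+>`, since the sample puts tags and text on the same line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RowExtractSketch {

    // Hypothetical two-line-per-row markup; the real page's layout may differ.
    static final String SAMPLE =
            "<table>\n"
            + "<tr><td><div>German</div></td>\n"
            + "<td><div>de</div></td></tr>\n"
            + "<tr><td><div>Greek</div></td>\n"
            + "<td><div>el</div></td></tr>\n"
            + "</table>\n";

    // Returns one {name, code} pair per table row found in the HTML string.
    static List<String[]> extractRows(String html) {
        List<String[]> rows = new ArrayList<>();
        // '.' does not match a line break by default, so the row pattern spells
        // out the break between the row's two lines explicitly.
        Matcher matcher = Pattern.compile("<tr.+\\n.+</tr>").matcher(html);
        while (matcher.find()) {
            // Replace each tag with a space, leaving only the cell text.
            String text = matcher.group().replaceAll("<[^>]+>", " ").trim();
            // Split on the first run of whitespace; the limit keeps two fields.
            String[] fields = text.split("\\s+", 2);
            if (fields.length == 2) {
                rows.add(fields);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String[] row : extractRows(SAMPLE)) {
            System.out.println(String.format(
                    "INSERT country_code (country,code) VALUES (\"%s\",\"%s\");", row[0], row[1]));
        }
    }
}
```

Rows that do not yield exactly two fields are skipped rather than allowed to throw an `ArrayIndexOutOfBoundsException`, which is one way to guard the `split[1]` access.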

Some of the wrapper methods used above:

/**
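	 * Returns all matches.
	 *
	 * @param text  the text to search
	 * @param regex the regular expression
	 * @return every substring of text matched by regex, in order of occurrence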
	 */
	public static List<String> regexAll(String text, String regex) {
		List<String> result = new ArrayList<>();
		Pattern pattern = Pattern.compile(regex);
		Matcher matcher = pattern.matcher(text);
		while (matcher.find()) {
			result.add(matcher.group());
		}
		return result;
	}
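A quick usage sketch for regexAll follows. The input string and the class name `RegexAllUsage` are made up for illustration, and the method body is repeated only so that the example compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexAllUsage {

    // Same logic as the regexAll method above, repeated so the example is self-contained.
    static List<String> regexAll(String text, String regex) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        // Each find() call resumes after the previous match, so matches never overlap.
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up input: collect every two-letter code wrapped in parentheses.
        List<String> codes = regexAll("German (de), Greek (el), Hindi (hi)", "\\([a-z]{2}\\)");
        System.out.println(codes); // prints [(de), (el), (hi)]
    }
}
```

When nothing matches, the method returns an empty list rather than null, so callers can loop over the result without a null check.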

Part of the resulting SQL output:

INSERT country_code (country,code) VALUES ("German","de");
INSERT country_code (country,code) VALUES ("Greek","el");
INSERT country_code (country,code) VALUES ("Greenlandic","kl");
INSERT country_code (country,code) VALUES ("Guarani","gn");
INSERT country_code (country,code) VALUES ("Gujarati","gu");
INSERT country_code (country,code) VALUES ("Hausa","ha");
INSERT country_code (country,code) VALUES ("Hebrew","he");
INSERT country_code (country,code) VALUES ("Hindi","hi");
INSERT country_code (country,code) VALUES ("Hungarian","hu");
INSERT country_code (country,code) VALUES ("Icelandic","is");
INSERT country_code (country,code) VALUES ("Indonesian","id");
INSERT country_code (country,code) VALUES ("Interlingua","ia");
INSERT country_code (country,code) VALUES ("Interlingue","ie");
INSERT country_code (country,code) VALUES ("Inuktitut","iu");
INSERT country_code (country,code) VALUES ("Inupiak","ik");
INSERT country_code (country,code) VALUES ("Irish","ga");
INSERT country_code (country,code) VALUES ("Italian","it");
INSERT country_code (country,code) VALUES ("Japanese","ja");

Anyone interested is welcome to join the discussion. Group number: 340964272
