HtmlUnit Java crawler: a real-world example of crawling data from an e-commerce website

I recently used some free time to help a friend build a small program that crawls data from several e-commerce websites. It uses HtmlUnit, and I found HtmlUnit's crawling speed and stability to be quite good, so I am writing this blog post to record how I used it.

This is the main page of the site

The idea is to find the div that contains the products, read the href of each product's <a> tag, follow that URL to the product page, crawl the div elements there, and then export the data to an Excel table, with extra features such as automatic translation.

1. First we need to get the data of the main page

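// Imports assume HtmlUnit 2.x; in HtmlUnit 3.x the packages start with org.htmlunit instead
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

WebClient webClient = new WebClient(BrowserVersion.CHROME); // simulate opening a Chrome browser window
webClient.getOptions().setTimeout(15000); // page response timeout, in milliseconds
webClient.getOptions().setUseInsecureSSL(true); // accept SSL certificates without validating them
webClient.getOptions().setRedirectEnabled(true); // follow redirects automatically
webClient.getOptions().setThrowExceptionOnScriptError(false); // do not throw on JavaScript errors in the page
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false); // do not throw on failing HTTP status codes
webClient.getOptions().setJavaScriptEnabled(false); // HtmlUnit's JavaScript support is poor, so turn it off
webClient.getOptions().setCssEnabled(false); // HtmlUnit's CSS support is poor, so turn it off
String url = "https://shop.sanrio.co.jp/products/list.php?product_status=1";
HtmlPage page = webClient.getPage(url); // fetch the whole page by URL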
This gives us the HtmlPage object for the page.

If you look at the methods of the page object, you will find that it lets you operate on page elements in much the same way as you would in JavaScript.

2. View the page source and find the div that holds the products' <a> tags


The source shows that the products are nested inside the div with id="prdlist", so we can write:

// Get the div that contains the <a> tags
HtmlDivision element = (HtmlDivision) page.getHtmlElementById("prdlist");
This gives us the div that contains the products. Next we look at the source of each individual product div to get its URL.


This is the source of an individual product div. Each product div contains only one <a> tag, so all we need to do is get that <a> tag.

DomNodeList<HtmlElement> list = element.getElementsByTagName("a"); // get all <a> links (product links) inside the div
out2: for (HtmlElement htmlElement : list) {
   HtmlPage click = htmlElement.click(); // open the product page
   HtmlDivision div3 = (HtmlDivision) click.getByXPath("//div[@class='box_right_summary']").get(0); // the div that holds the name
}
The earlier example fetched a div by its ID. What should we do when a div or other tag has no ID? HtmlUnit provides the getByXPath method for exactly this case: it selects elements with an XPath expression rather than by ID or tag name. The code above retrieves the collection of div elements whose class attribute equals box_right_summary.
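To make it easier to see what an expression like //div[@class='box_right_summary'] selects, here is a minimal sketch that evaluates the same kind of XPath with the JDK's built-in javax.xml.xpath API instead of HtmlUnit. The class name, the sample markup and the product name are all made up for illustration. The JDK parser also requires well-formed XML, which real pages often are not, so in the actual crawler it still makes sense to let HtmlUnit do the parsing.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathClassDemo {

    // Made-up, well-formed sample markup (not copied from the real site).
    static final String SAMPLE =
        "<html><body>"
        + "<div id='prdlist'><a href='/products/detail.php?product_id=1'>item</a></div>"
        + "<div class='box_right_summary'><h2>Sample name</h2></div>"
        + "<div class='other'><h2>Ignored</h2></div>"
        + "</body></html>";

    // Returns the text content of every node matched by the XPath expression.
    static List<String> selectText(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Select the <h2> inside the div whose class attribute equals box_right_summary.
        System.out.println(selectText(SAMPLE, "//div[@class='box_right_summary']/h2"));
    }
}
```

Note that @class='box_right_summary' compares the whole attribute value, so a div with class="box_right_summary extra" would not match this expression.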

HtmlUnit ships with classes for essentially all HTML tags, such as HtmlInput, HtmlTable, HtmlSpan and so on. Readers can download the jar and explore them.

For XPath syntax, you can refer to this article: Xpath syntax

HtmlUnit can also simulate submitting a form: get the form's input fields, set their values, and then call the click() method on the submit button to simulate a click event.

// Get the home page
final HtmlPage page1 = (HtmlPage) webClient.getPage("http://htmlunit.sourceforge.net");

// Get the form by its name; it can also be fetched by index: page.getForms().get(0)
final HtmlForm form = page1.getFormByName("myform");
final HtmlSubmitInput button
    = (HtmlSubmitInput) form.getInputByName("submitbutton");
final HtmlTextInput textField
    = (HtmlTextInput) form.getInputByName("userid");

// Set the value of the form field
textField.setValueAttribute("root");

// Submit the form; this returns the page we land on after submission
final HtmlPage page2 = (HtmlPage) button.click();
This is just one way to get elements. You can also look them up by XPath, ID, tag name and so on, which I leave for readers to explore.

3. Next, enter the product details page to get the product information

This is the product details page. We need to get the product's name, description, size, images, price and so on.

Then look at the source.

From this div we can get the name and price:

HtmlDivision div3 = (HtmlDivision) click.getByXPath(
        "//div[@class='box_right_summary']").get(0); // the div that holds the name

String name = div3.getElementsByTagName("h2").get(0).asText();
excels.setJname(name); // set the Japanese name (excels is my own entity class for the Excel export)
// Get the product's price information
HtmlSpan span = (HtmlSpan) click.getByXPath("//span[@class='priceSelect']").get(0);
String cname = getCname(name); // get the Chinese name through the Baidu Translate API
excels.setCname(cname);
Next come the product's size, product code and other details.


They can be found in the part of the page shown in the figure.

// Get the product's detailed information
HtmlDivision div2 = (HtmlDivision) click.getByXPath(
        "//div[@class='productSummary accordionBlock01']").get(0);

DomNodeList<HtmlElement> rows = div2.getElementsByTagName("tr");
for (HtmlElement row : rows) {
    DomNodeList<HtmlElement> headers = row.getElementsByTagName("th");
    DomNodeList<HtmlElement> cells = row.getElementsByTagName("td");
    if (headers.isEmpty() || cells.isEmpty()) {
        continue; // skip rows that do not have both a header cell and a value cell
    }
    String label = headers.get(0).asText();
    String value = cells.get(0).asText();
    if (label.equals("サイズ")) {
        excels.setSize(value); // set the product size
    }
    if (label.equals("商品コード")) {
        excels.setNums2(value); // set the product code
    }
}
The code is not written to any particular standard, so please bear with it.
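The loop above checks each row's label one at a time. One possible alternative is to collect every header/value pair into a map first and then read whichever labels are needed. The sketch below shows that idea with the JDK XML APIs rather than HtmlUnit. The class name, the English labels and the values are made-up sample data (the real page uses labels such as サイズ and 商品コード), so treat it as an illustration under those assumptions.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SpecTableDemo {

    // Made-up sample table; labels and values are illustrative only.
    static final String SAMPLE =
        "<div class='productSummary accordionBlock01'><table>"
        + "<tr><th>Size</th><td>10 x 5 x 3 cm</td></tr>"
        + "<tr><th>Product code</th><td>ABC-123</td></tr>"
        + "<tr><td>row without a header cell</td></tr>"
        + "</table></div>";

    // Maps the text of each row's first <th> to the text of its first <td>.
    // Rows without a <th> are skipped.
    static Map<String, String> readRows(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList rows = (NodeList) xpath.evaluate("//tr", doc, XPathConstants.NODESET);
            Map<String, String> result = new LinkedHashMap<>();
            for (int i = 0; i < rows.getLength(); i++) {
                Node row = rows.item(i);
                // evaluate(String, Object) returns "" when nothing matches.
                String label = xpath.evaluate("th[1]", row).trim();
                String value = xpath.evaluate("td[1]", row).trim();
                if (!label.isEmpty()) {
                    result.put(label, value);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read the table rows", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> specs = readRows(SAMPLE);
        System.out.println(specs.get("Size"));
        System.out.println(specs.get("Product code"));
    }
}
```

With a map like this, adding another field to the export only needs one more lookup instead of another if block inside the loop.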

Next we get the product's image information:

HtmlDivision div = (HtmlDivision) click.getByXPath(
        "//div[@class='box_pic']").get(0);
// System.out.println(div.asXml());
DomNodeList<HtmlElement> imgs = div.getElementsByTagName("img");
// Loop over the images and download each one to local disk
for (HtmlElement img : imgs) {
    // download() is my own helper method for downloading an image
    download(SANLIOU + img.getAttribute("src"), DOWNDS + filename + "/");
}
Here SANLIOU + img.getAttribute("src") is the full URL of the image. The src of an image is usually not a complete URL, so we have to prepend the root of the site to the src attribute of the img tag to get the real image address. For example:

This is the img src of a product image, which is not a complete URL.

The address of the web page is: https://shop.sanrio.co.jp/products/detail.php?product_id=54355

We take https://shop.sanrio.co.jp, which is the root of the site, and append /upload/save_image/N-1801-318795_1.jpg to form the real path of the image.
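Instead of copying the site root by hand, the same address can be computed from the page URL with java.net.URI from the JDK. This is a minimal sketch; the class and method names are hypothetical and are not part of the original project.

```java
import java.net.URI;

class ImageUrlDemo {

    // Resolves an img src (relative or absolute) against the URL of the page it appeared on.
    static String resolve(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(src).toString();
    }

    public static void main(String[] args) {
        String page = "https://shop.sanrio.co.jp/products/detail.php?product_id=54355";
        // prints https://shop.sanrio.co.jp/upload/save_image/N-1801-318795_1.jpg
        System.out.println(resolve(page, "/upload/save_image/N-1801-318795_1.jpg"));
    }
}
```

This also handles a src that is relative to the current directory or is already a complete URL, which plain string concatenation with the site root would get wrong. HtmlUnit's HtmlPage has a getFullyQualifiedUrl method that, as far as I know, does a similar resolution directly on the page object.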

That roughly completes the crawl. I have uploaded the full project to my CSDN downloads, where you can take a look at it.

HtmlUnit is actually a very simple little framework. As long as you have a solid grounding in front-end web pages, it is easy to get started and implement the features you want.




Origin blog.csdn.net/q690080900/article/details/79072729