WebCollector crawler learning record (1)

1. Crawling the knowledge section of the International Petroleum Network

The website to be crawled is: http://oil.in-en.com/zhishi/

The site structure is fairly regular: there is only one kind of news list page, and it contains the article links (href), the next-page link, and other information.

1.1 Adding seeds

	} else if (crawler.webMoudle == 38) {
		if (crawler.mk.equals("Sybk")) {
			/* start page */
			crawler.addSeed("http://oil.in-en.com/zhishi/");
			crawler.addRegex("http://oil.in-en.com/html/oil.*");
		}
	}
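To see which URLs the rule above lets through, here is a minimal sketch using only java.util.regex. It assumes the rule is applied as a full match against each URL, the way Pattern.matches behaves; the class name, the STRICT_RULE variant and the sample URLs are hypothetical and are not part of the crawler.

```java
import java.util.regex.Pattern;

public class SeedRegexSketch {

    // The rule as written in the addRegex call above.
    static final String RULE = "http://oil.in-en.com/html/oil.*";

    // Hypothetical stricter variant: the dots in the host name are escaped,
    // so they match only a literal '.'.
    static final String STRICT_RULE = "http://oil\\.in-en\\.com/html/oil.*";

    // Assumption: a URL is accepted when the whole URL matches the rule.
    static boolean accepted(String rule, String url) {
        return Pattern.matches(rule, url);
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://oil.in-en.com/html/oil-123.html",  // made-up article URL
            "http://oil.in-en.com/zhishi/",            // the list page itself
            "http://oilXin-enYcom/html/oil-1.html"     // made-up look-alike host
        };
        for (String url : samples) {
            System.out.println(url
                    + "  rule=" + accepted(RULE, url)
                    + "  strict=" + accepted(STRICT_RULE, url));
        }
    }
}
```

Under this assumption the list page does not match the rule, which is consistent with the next-page links being queued by hand in visit. The unescaped dots are harmless in practice here, but escaping them keeps the rule from matching unintended hosts.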

1.2 Overriding visit

  1. First select the Elements of the list block on the page; its class is "clist sborder".
  2. Filter the a[href] links inside it; if a link contains "http://oil.in-en.com/html/oil", pass it on to cleaning and storage.
  3. If the link text contains "next page" (indexOf("next page") >= 0), call next.add(href), i.e. queue this link as the seed for the next page.
	else if (this.webMoudle == 38) {
		// Look at every div that has a class attribute and pick out the list block.
		Elements pageHaveClass = page.select("div[class]");
		for (Element pageSelectedClass : pageHaveClass) {
			String classAttr = pageSelectedClass.attr("class");
			if (classAttr.equals("clist sborder")) {
				Elements es = pageSelectedClass.select("a[href]");
				for (Element e : es) {
					// "abs:href" resolves relative links against the page URL.
					String href = e.attr("abs:href");
					// Pagination link: queue it as the next list page.
					if (e.text().indexOf("next page") >= 0) {
						next.add(href);
					}
					// Article link: hand it to content extraction and storage.
					if (href.indexOf("http://oil.in-en.com/html/oil") != -1) {
						String title = e.text();
						datebaseByContentExtractor(href, title);
					}
				}
			}
		}
	}

 

The cleaning, time-condition filtering, keyword filtering, and storage steps are omitted here.

Filtering on class="clist sborder" this way is rather cumbersome, but I am not familiar enough with jsoup's CSS selectors to know how to handle the space in the middle of the attribute value, and I have not found a solution yet. I will see whether it can be solved later.
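One possible approach, offered as a sketch rather than the method used in this crawler: in HTML, the space in class="clist sborder" separates two class names, so the block can be selected by chaining two class selectors, div.clist.sborder. The example below uses plain jsoup on an inline HTML string. The class name ClassSelectorSketch, the method articleLinks and the sample markup (including the article URL oil-1.html) are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class ClassSelectorSketch {

    /** Collects absolute article URLs found inside the list block. */
    static List<String> articleLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        // "clist sborder" is two classes, so both are chained in one selector.
        Elements links = doc.select("div.clist.sborder a[href]");
        List<String> result = new ArrayList<>();
        for (Element a : links) {
            // "abs:href" resolves the link against baseUri.
            String href = a.attr("abs:href");
            if (href.contains("http://oil.in-en.com/html/oil")) {
                result.add(href);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up markup that imitates the structure described above.
        String html =
              "<div class=\"clist sborder\">"
            + "<a href=\"/html/oil-1.html\">Article A</a>"
            + "<a href=\"list2.html\">next page</a>"
            + "</div>"
            + "<div class=\"other\"><a href=\"/html/oil-2.html\">Article B</a></div>";
        System.out.println(articleLinks(html, "http://oil.in-en.com/zhishi/"));
    }
}
```

Unlike the equals("clist sborder") check, the chained selector matches regardless of class order and still matches if the div carries additional classes, so it is slightly more permissive than the original filter.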
