1. Crawling the knowledge section of the International Petroleum Network
The website to be crawled is: http://oil.in-en.com/zhishi/
The site structure is fairly regular: the news list page has a simple, uniform layout and contains the article href links, the next-page link, and other information.
1.1 Adding seeds
} else if (crawler.webMoudle == 38) {
    if (crawler.mk.equals("Sybk")) {
        /* start page */
        crawler.addSeed("http://oil.in-en.com/zhishi/");
        crawler.addRegex("http://oil.in-en.com/html/oil.*");
    }
}
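To make it clearer which links the regex above lets through, here is a minimal sketch using only java.util.regex. It assumes the crawler requires the whole URL to match the pattern; the class name SeedRegexDemo, the helper isArticleUrl, and the sample URLs are hypothetical and only serve as illustration.

```java
import java.util.regex.Pattern;

class SeedRegexDemo {
    // Same pattern string as the one passed to crawler.addRegex(...) above.
    static final Pattern ARTICLE = Pattern.compile("http://oil.in-en.com/html/oil.*");

    // Assumption: a URL is accepted only if the entire URL matches the pattern.
    static boolean isArticleUrl(String url) {
        return ARTICLE.matcher(url).matches();
    }

    public static void main(String[] args) {
        // Hypothetical URLs, for illustration only.
        String[] urls = {
            "http://oil.in-en.com/html/oil-0001.shtml",
            "http://oil.in-en.com/zhishi/",
            "http://www.in-en.com/html/oil-0001.shtml"
        };
        for (String url : urls) {
            System.out.println(isArticleUrl(url) + "  " + url);
        }
    }
}
```

Note that the dots in the pattern are unescaped, so each one matches any character. For this site that is harmless, but writing them as \\. in the Java string would make the rule stricter.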
1.2 Overriding visit
- First select the list container element on the list page; its class is "clist sborder".
- Filter the a[href] links inside it; if a link's URL contains "http://oil.in-en.com/html/oil", pass it on to cleaning and storage.
- If the link text contains "next page" (indexOf("next page") >= 0), call next.add(href), i.e. use this link as the seed for the next page.
else if (this.webMoudle == 38) {
    Elements pageHaveClass = page.select("div[class]");
    for (Iterator it = pageHaveClass.iterator(); it.hasNext();) {
        Element pageSelectedClass = (Element) it.next();
        String classAttr = pageSelectedClass.attr("class");
        if (classAttr.equals("clist sborder")) {
            Elements es = pageSelectedClass.select("a[href]");
            for (Iterator itHref = es.iterator(); itHref.hasNext();) {
                Element e = (Element) itHref.next();
                String href = e.attr("abs:href");
                if (e.text().indexOf("next page") >= 0) {
                    next.add(href);
                }
                if (href.indexOf("http://oil.in-en.com/html/oil") != -1) {
                    String title = e.text();
                    datebaseByContentExtractor(href, title);
                }
            }
        }
    }
}
The cleaning, time-based filtering, keyword filtering, and storage steps are omitted here.
Filtering on class="clist sborder" by looping over every div and comparing the attribute string is cumbersome. I am not yet familiar enough with jsoup's CSS selectors to know how to handle the space inside the attribute value, and I have not found a solution so far. I will see whether it can be solved later on.
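One possible way around the space, sketched here under assumptions: in HTML, class="clist sborder" means the element carries two classes, clist and sborder, so a chained class selector such as div.clist.sborder should select it directly in jsoup. The markup below is a hypothetical stand-in for the real list page, and ClassSelectorDemo and listHrefs are illustrative names, not part of the crawler.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ClassSelectorDemo {
    // Hypothetical markup standing in for the list page.
    static final String HTML =
        "<div class=\"clist sborder\">"
        + "<a href=\"/html/oil-0001.shtml\">Article one</a>"
        + "<a href=\"/zhishi/list2.html\">next page</a>"
        + "</div>"
        + "<div class=\"sidebar\"><a href=\"/html/oil-0002.shtml\">Other</a></div>";

    // Returns the absolute URLs of all links inside div elements
    // that carry both the "clist" and the "sborder" class.
    static List<String> listHrefs(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> hrefs = new ArrayList<String>();
        for (Element a : doc.select("div.clist.sborder a[href]")) {
            hrefs.add(a.attr("abs:href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        for (String href : listHrefs(HTML, "http://oil.in-en.com/zhishi/")) {
            System.out.println(href);
        }
    }
}
```

Unlike the string comparison with equals("clist sborder"), the chained class selector also matches if the classes appear in a different order or if the element has additional classes, which is usually acceptable for a list container like this one.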