WebCollector crawler learning record (1) - Code World

Others 2022-05-18 21:55:20 views: 0

1. Crawling the knowledge section of the International Petroleum Network

The website to be crawled is: http://oil.in-en.com/zhishi/

The site structure is fairly regular: each news list page is a single page that contains the article links (href), a next-page link, and related information.

1.1 Adding seeds

	} else if (crawler.webMoudle == 38) {
		if (crawler.mk.equals("Sybk")) {
			// Start page (seed) of the news list
			crawler.addSeed("http://oil.in-en.com/zhishi/");
			// Article links are expected to match this pattern
			crawler.addRegex("http://oil.in-en.com/html/oil.*");
		}
	}
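To make the effect of the addRegex pattern concrete, here is a minimal stand-alone sketch that uses only java.util.regex, not WebCollector itself. It assumes the crawler tests the whole URL against the pattern (full match); that matching mode, the class name SeedRegexDemo, the ESCAPED variant, and the sample URLs are all assumptions for illustration and do not come from the original project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Stand-alone illustration; it does not call WebCollector.
class SeedRegexDemo {
    // Pattern exactly as passed to crawler.addRegex(...) in the snippet above.
    static final String ORIGINAL = "http://oil.in-en.com/html/oil.*";
    // Hypothetical stricter variant: the literal dots of the host name are escaped.
    static final String ESCAPED = "http://oil\\.in-en\\.com/html/oil.*";

    // Keep only the URLs that match the regex in full (assumed matching mode).
    static List<String> filter(String regex, List<String> urls) {
        Pattern pattern = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (pattern.matcher(url).matches()) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Sample URLs: only the first one appears in the article, the others are made up.
        List<String> urls = Arrays.asList(
                "http://oil.in-en.com/zhishi/",              // the seed (list page)
                "http://oil.in-en.com/html/oil-1234.shtml",  // hypothetical article URL
                "http://oilxin-en.com/html/oil-1.shtml");    // different host
        System.out.println(filter(ORIGINAL, urls));
        System.out.println(filter(ESCAPED, urls));
    }
}
```

Under these assumptions the seed URL itself does not match the pattern, so the pattern only describes article pages. An unescaped dot matches any character, so the original pattern also accepts the made-up host in the third URL, while the escaped variant does not.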

1.2 Overriding visit

First extract the list container from the list page; its class is "clist sborder".
Then go through each a[href] inside it: if the link contains "http://oil.in-en.com/html/oil", pass it on to cleaning and storage.
If the link text contains "next page" (indexOf("next page") >= 0), call next.add(href) so that the link becomes the seed for the next page.

	else if (this.webMoudle == 38) {
		// Collect every div that carries a class attribute
		Elements pageHaveClass = page.select("div[class]");
		for (Element pageSelectedClass : pageHaveClass) {
			String classAttr = pageSelectedClass.attr("class");
			// Keep only the list container, class="clist sborder"
			if (classAttr.equals("clist sborder")) {
				Elements es = pageSelectedClass.select("a[href]");
				for (Element e : es) {
					// Resolve the link to an absolute URL
					String href = e.attr("abs:href");
					// Pagination link: queue it as the next page to crawl
					// (the literal must match the link text used on the site)
					if (e.text().indexOf("next page") >= 0) {
						next.add(href);
					}
					// Article link: hand it over to cleaning and storage
					if (href.indexOf("http://oil.in-en.com/html/oil") != -1) {
						String title = e.text();
						datebaseByContentExtractor(href, title);
					}
				}
			}
		}
	}

The cleaning, time-based filtering, keyword filtering, and storage steps are omitted here.

Filtering for class="clist sborder" by looping over every div is cumbersome. I am not yet familiar with how jsoup's CSS selectors handle an attribute value that contains a space, and I have not found a solution so far. I will see whether this can be solved later.
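One possible way around the space, sketched here as a suggestion and not as the original project's code: in HTML, class="clist sborder" means the element has two classes, and jsoup's class selector can be chained, so div.clist.sborder selects a div that carries both. The class name ClistSelectorDemo, the method name listLinks, and the HTML fragment are made up for illustration; the example needs the jsoup library on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Illustrative sketch, not taken from the original crawler.
class ClistSelectorDemo {
    // Returns the absolute URL of every link inside a div that has
    // both the "clist" and the "sborder" class.
    static List<String> listLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> hrefs = new ArrayList<>();
        // ".clist.sborder" requires both class names on the same element,
        // so no manual comparison of the class attribute is needed.
        for (Element a : doc.select("div.clist.sborder a[href]")) {
            hrefs.add(a.attr("abs:href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Made-up fragment that imitates the structure described above.
        String html = "<div class=\"clist sborder\">"
                + "<a href=\"/html/oil-1.shtml\">Article</a>"
                + "<a href=\"/zhishi/list2.html\">next page</a>"
                + "</div>"
                + "<div class=\"clist\"><a href=\"/other.html\">Other</a></div>";
        System.out.println(listLinks(html, "http://oil.in-en.com/zhishi/"));
    }
}
```

Unlike the exact string comparison in the visit code, the chained class selector also matches when the classes appear in a different order or alongside additional classes, which may or may not be wanted for this site.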
