Yesterday we learned the simple architecture of a web crawler; today we put it into practice with a concrete exercise: crawling an encyclopedia entry.
First, the environment: Eclipse + Python 3.8.
The overall framework has four components:
url_manager: URL manager; html_downloader: page downloader; html_parser: page parser; html_outputer: result output
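To show how the four components cooperate, here is a minimal sketch of the driver loop. This is illustrative only: the class and function names, and the in-memory toy "site" standing in for real pages, are assumptions for demonstration, not the exact code of this exercise.

```python
from collections import deque

# Toy in-memory "site": URL -> (html, list of linked URLs). Stands in for
# the network so the sketch runs without a downloader; purely an assumption.
SITE = {
    "/a": ("<h1>A</h1>", ["/b", "/c"]),
    "/b": ("<h1>B</h1>", ["/a"]),
    "/c": ("<h1>C</h1>", []),
}

class UrlManager:
    """url_manager role: tracks URLs waiting to be crawled vs. already seen."""
    def __init__(self):
        self.new_urls = deque()
        self.old_urls = set()

    def add(self, url):
        # Skip URLs that are already queued or already crawled
        if url not in self.old_urls and url not in self.new_urls:
            self.new_urls.append(url)

    def has_new(self):
        return bool(self.new_urls)

    def get(self):
        url = self.new_urls.popleft()
        self.old_urls.add(url)
        return url

def html_downloader(url):
    # html_downloader role: a real crawler would fetch over HTTP here;
    # we just read the toy site.
    return SITE[url]

def html_parser(url, page):
    # html_parser role: extract data and new links from the fetched page.
    html, links = page
    title = html.replace("<h1>", "").replace("</h1>", "")
    return {"url": url, "title": title}, links

def crawl(root):
    urls, results = UrlManager(), []
    urls.add(root)
    while urls.has_new():
        url = urls.get()
        data, links = html_parser(url, html_downloader(url))
        results.append(data)          # html_outputer role: collect results
        for link in links:
            urls.add(link)            # feed newly found links back in
    return results
```

Running `crawl("/a")` visits each page exactly once, even though "/b" links back to "/a": the `old_urls` set is what prevents the loop from revisiting pages.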
In general, the component we care most about is the parser: how do we get the data out of the page? Usually you right-click the element you want to extract and inspect it to see where it sits in the page's markup.
For example:
Right-click the target element
Choose Inspect Element
In the developer tools, right-click the node and choose Edit as HTML
Copy the markup of the element you want
<dd class="lemmaWgt-lemmaTitle-title">
<h1>blockchain</h1>
</dd>
In the crawler's parser we then write:
# locate the <dd> title node, then drill down to its <h1>
title_node = soup.find('dd', class_="lemmaWgt-lemmaTitle-title").find("h1")
res_data['title'] = title_node.get_text()
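To see those two lines work in isolation, the snippet below feeds the sample markup from above straight into BeautifulSoup instead of a downloaded page (this assumes the third-party bs4 package is installed):

```python
from bs4 import BeautifulSoup

# The markup copied via "Edit as HTML" above, used as a stand-in
# for the real downloaded page.
html = '<dd class="lemmaWgt-lemmaTitle-title"><h1>blockchain</h1></dd>'

soup = BeautifulSoup(html, "html.parser")
res_data = {}

# Same extraction as in the parser: find the <dd> by class, then its <h1>
title_node = soup.find('dd', class_="lemmaWgt-lemmaTitle-title").find("h1")
res_data['title'] = title_node.get_text()

print(res_data['title'])  # blockchain
```

Note that `class_` has a trailing underscore because `class` is a reserved word in Python; `get_text()` strips the tags and returns only the text content.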