Day 2: Python crawler practice, crawling a Wikipedia entry

Yesterday we learned about the simple architecture of a crawler. Today we do a concrete exercise: crawling a Wikipedia entry.

First, the environment: Eclipse + Python 3.8

The specific framework has four modules:

url_manager: URL manager; html_downloader: page downloader; html_parser: page parser; html_outputer: output writer
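
As a rough sketch of how these four modules might fit together (not the code from this exercise), the loop below wires a URL manager to a downloader, a parser, and an output list. The class name UrlManager, its method names, the crawl function, and the max_pages parameter are all assumptions made for illustration. The downloader and parser are passed in as plain functions so the sketch runs without network access.

```python
class UrlManager:
    """Tracks URLs waiting to be crawled and URLs already crawled (hypothetical API)."""

    def __init__(self):
        self.new_urls = set()  # not yet crawled
        self.old_urls = set()  # already crawled

    def add_new_url(self, url):
        # Skip empty URLs and anything already seen, so no page is crawled twice
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # Move one URL from the pending set to the crawled set
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url


def crawl(root_url, download, parse, max_pages=10):
    """Run the loop: url_manager -> html_downloader -> html_parser -> output.

    download(url) returns HTML text or None.
    parse(url, html) returns (new_urls, data).
    """
    manager = UrlManager()
    manager.add_new_url(root_url)
    results = []  # stands in for html_outputer
    while manager.has_new_url() and len(results) < max_pages:
        url = manager.get_new_url()
        html = download(url)
        if html is None:
            continue  # download failed, move on to the next URL
        new_urls, data = parse(url, html)
        for new_url in new_urls:
            manager.add_new_url(new_url)
        results.append(data)
    return results
```

In a real crawler, download would fetch the page over HTTP and parse would use the page parser described below.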

In general, the part we care about most is the parser: how to extract data from the page. Usually you right-click the content you want to extract and inspect the element to see where it sits in the page's HTML.

 

For example: right-click the content and choose Inspect Element, then right-click the highlighted node and choose Edit as HTML, and copy the HTML of the module you want:

<dd class="lemmaWgt-lemmaTitle-title">
<h1>Blockchain</h1>

In the crawler's parser we then use:

  # Find the <dd> that wraps the title, then the <h1> inside it
  title_node = soup.find('dd', class_="lemmaWgt-lemmaTitle-title").find("h1")
  res_data['title'] = title_node.get_text()
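
To try the two lines above on their own, here is a minimal self-contained sketch. It assumes the BeautifulSoup library (the bs4 package) is installed and uses Python's built-in "html.parser" backend. The SAMPLE_HTML string and the parse_title function name are made up for this illustration, and the None checks are an addition so a page without the title block does not raise an error.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page, based on the snippet copied from the browser
SAMPLE_HTML = """
<dd class="lemmaWgt-lemmaTitle-title">
<h1>Blockchain</h1>
</dd>
"""


def parse_title(html):
    """Return {'title': ...} if the title block is present, else an empty dict."""
    soup = BeautifulSoup(html, "html.parser")
    res_data = {}
    # class_ is used because "class" is a reserved word in Python
    dd_node = soup.find("dd", class_="lemmaWgt-lemmaTitle-title")
    title_node = dd_node.find("h1") if dd_node is not None else None
    if title_node is not None:
        res_data["title"] = title_node.get_text().strip()
    return res_data


print(parse_title(SAMPLE_HTML))
```

If the site changes its class names, soup.find returns None, which is why the sketch checks for None before calling get_text().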

Origin www.cnblogs.com/1983185414xpl/p/12177593.html