Get started quickly with a Python web crawler! A beginner-friendly introductory tutorial!

Environment preparation:

With PyCharm installed in advance, open File——>Settings——>Project——>Project Interpreter.


Click the plus sign (circled in red in the screenshot).


Click the button in the red circle.


Select the first entry, click the pencil icon, and replace the original link with the following (it has already been replaced here):
https://pypi.tuna.tsinghua.edu.cn/simple/
After clicking OK, type requests-html, press Enter to select requests-html, then click Install Package.


Wait for the installation to succeed, then close the window.
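If you prefer the command line, the same package can be installed with pip pointed at the Tsinghua mirror configured above (this assumes pip is available on your PATH):

```shell
# Install requests-html from the Tsinghua PyPI mirror.
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple/ requests-html
```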

Crawling by parsing the source code of the web page

Example task:
Crawl the desired content from all articles of a given blogger.
Example background:
Get the title, date, and view count of each article from all the articles of the blogger (https://me.csdn.net/weixin_44286745).

  1. Import the HTMLSession class from requests_html and create an instance of it.

from requests_html import HTMLSession
session = HTMLSession()
  2. Send a GET request to the site to be crawled and obtain the page's source code.

html = session.get("https://me.csdn.net/weixin_44286745").html
  3. Find all articles.

allBlog = html.xpath("//dl[@class='tab_page_list']")
  • Open the blogger's homepage (in this example: https://me.csdn.net/weixin_44286745).
  • Right-click in the blank space of an article and choose Inspect to locate that article's tag.
  • Do the same for the other articles, then find the tag common to all of them (here the class shared by all articles is 'my_tab_page_con').
  • XPath can traverse the tags and attributes of the HTML to locate and extract the information we need.
  • Analyze the page to get the title, view count, and date.
for i in allBlog:  # i is one article's block
    title = i.xpath("dl/dt/h3/a")[0].text                        # article title
    views = i.xpath("//div[@class='tab_page_b_l fl']")[0].text   # view count
    date = i.xpath("//div[@class='tab_page_b_r fr']")[0].text    # publication date
    print(title + ' ' + views + ' ' + date)
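The same extraction logic can be tried offline. This sketch uses only the standard library's xml.etree.ElementTree on a hand-made fragment that imitates the class names above; the sample markup and its values are invented for illustration, and the real page's structure may differ:

```python
import xml.etree.ElementTree as ET

# Invented fragment imitating the blog-list markup used in this tutorial.
SAMPLE = """
<root>
  <dl class="tab_page_list">
    <dt><h3><a>My first article</a></h3></dt>
    <div class="tab_page_b_l fl">123 reads</div>
    <div class="tab_page_b_r fr">2020-10-14</div>
  </dl>
</root>
"""

root = ET.fromstring(SAMPLE)
rows = []
for article in root.findall(".//dl[@class='tab_page_list']"):
    # findall returns a list; [0] picks the first (and only) match,
    # and .text reads the element's text content.
    title = article.findall(".//a")[0].text
    views = article.findall(".//div[@class='tab_page_b_l fl']")[0].text
    date = article.findall(".//div[@class='tab_page_b_r fr']")[0].text
    rows.append((title, views, date))

print(rows)
```

Running this prints one tuple per matched article, mirroring what the loop above does against the live page.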

Web page analysis:

  • Because there are multiple articles, each is processed separately in a for loop. The code above has already collected all the articles, so i stands for a single article.
  • The second line, which gets the article's title, works like getting the article itself: hover over the title, right-click, and choose Inspect. Because an article has only one title, you can use an absolute path, descending tag by tag to the title's position.
  • xpath returns a list. We want the first element, so we add a subscript [0] (the list contains only one element), and since we want the text, .text extracts it.
  • Getting the view count and the date repeats the same operation.
  • You can use either a relative path or an absolute path. A relative path is generally used; model its format on the code above.
  • The fifth line prints the information each time an article is processed, so everything is printed once the loop finishes.
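The difference between stepping through each level and using a relative, descendant (.//) path can be seen with a tiny standard-library example (the tags here are invented for illustration):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<dl><dt><h3><a>Title</a></h3></dt></dl>")

# Absolute-style path: name each child level explicitly.
by_steps = doc.findall("dt/h3/a")
# Relative path: .// matches the <a> at any depth below <dl>.
by_descend = doc.findall(".//a")

# Both return a list; [0] picks the element and .text reads its text.
print(by_steps[0].text, by_descend[0].text)
```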

Complete code:

from requests_html import HTMLSession

session = HTMLSession()

# Get the page source of the blogger's homepage.
html = session.get("https://me.csdn.net/weixin_44286745").html

# Find all article blocks.
allBlog = html.xpath("//dl[@class='tab_page_list']")

# Extract and print the title, view count, and date of each article.
for i in allBlog:
    title = i.xpath("dl/dt/h3/a")[0].text
    views = i.xpath("//div[@class='tab_page_b_l fl']")[0].text
    date = i.xpath("//div[@class='tab_page_b_r fr']")[0].text
    print(title + ' ' + views + ' ' + date)

You can crawl other things yourself, such as the article images. Try it out!
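As a pointer for that exercise, here is an offline sketch of extracting image addresses with the standard library's xml.etree.ElementTree. The fragment and its src values are invented for illustration; on the real page you would run a similar XPath through requests-html:

```python
import xml.etree.ElementTree as ET

# Invented fragment standing in for an article's HTML.
SAMPLE = """
<article>
  <p>Some text</p>
  <img src="https://example.com/a.png"/>
  <img src="https://example.com/b.png"/>
</article>
"""

root = ET.fromstring(SAMPLE)
# .//img finds every <img> at any depth; .get('src') reads its address.
srcs = [img.get("src") for img in root.findall(".//img")]
print(srcs)
```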
To be continued


Origin blog.csdn.net/weixin_43881394/article/details/109077749