Environment preparation:
Install PyCharm in advance.
Open File -> Settings -> Project -> Project Interpreter
and click the plus sign (the red circle in the figure).
Click the button in the red circle,
select the first entry, and click the pencil to replace the original link with:
https://pypi.tuna.tsinghua.edu.cn/simple/
After clicking OK, type requests-html in the search box and press Enter.
Select requests-html, click Install Package,
wait for the installation to succeed, then close the window.
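To confirm the installation worked, a quick check like the following can be run in the PyCharm Python console. This is only a sketch: `is_installed` is a helper name made up for this example. Note that the package is installed as requests-html but imported as requests_html.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the module (without importing it)."""
    return importlib.util.find_spec(module_name) is not None


# The import name uses an underscore, unlike the package name on PyPI.
print("requests_html installed:", is_installed("requests_html"))
```

If this prints False, the package most likely went into a different interpreter than the one selected for the project.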
Crawling by parsing the source code of the web page:
Example content:
Crawl the desired content from all articles of a certain blogger.
Example background:
Get the title, time, and reading volume of each article from all the articles of the blogger (https://me.csdn.net/weixin_44286745).
- Import the HTMLSession class from requests_html and create an instance of it
from requests_html import HTMLSession
session = HTMLSession()
- Send a GET request to the site to be crawled and get the source code of the page.
html = session.get("https://me.csdn.net/weixin_44286745").html
- Find all articles
allBlog = html.xpath("//dl[@class='tab_page_list']")
- Open the homepage of the site (in this example: https://me.csdn.net/weixin_44286745).
- Right-click in the blank area of an article and choose Inspect to locate the tag of that article.
- Do the same for the other articles, then find the tag that all articles have in common (here the class of all articles is 'my_tab_page_con').
- XPath can traverse the tags and attributes of an HTML document to locate and extract the information we need.
- Parse each article to get its title, reading volume, and date.
for i in allBlog:
    title = i.xpath("dl/dt/h3/a")[0].text
    views = i.xpath("//div[@class='tab_page_b_l fl']")[0].text
    date = i.xpath("//div[@class='tab_page_b_r fr']")[0].text
    print(title + ' ' + views + ' ' + date)
Web page analysis:
- Because there are multiple articles, a for loop handles them one at a time. The code above has already collected all the articles, so i represents a single article.
- The second line of code gets the title of the article, in the same way the articles themselves were located: place the mouse on the title, right-click, and choose Inspect. Because each article has only one title, you can use an absolute path that steps down to the title tag by tag.
- xpath returns a list. We want the first element (the list has only one), so we add the index [0]; what we want to output is text, so .text gets the text.
- The reading volume and date are obtained the same way.
- You can use a relative path or an absolute path. A relative path is generally used; model the format on the code.
- The fifth line of code prints the information for each article as soon as it is obtained, so everything has been printed once the loop finishes.
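The points above (xpath returns a list, [0] takes the first match, .text reads the text) can be tried offline on a small fragment. The sketch below uses the standard library's xml.etree.ElementTree instead of requests_html; it supports only a subset of XPath, and the markup and the class names 'views' and 'date' are made up for illustration, not taken from the real page.

```python
import xml.etree.ElementTree as ET

# Made-up page fragment for illustration; not the real CSDN markup.
SAMPLE = """
<div>
  <dl class="tab_page_list">
    <dt><h3><a>First post</a></h3></dt>
    <dd><div class="views">120</div><div class="date">2020-01-01</div></dd>
  </dl>
  <dl class="tab_page_list">
    <dt><h3><a>Second post</a></h3></dt>
    <dd><div class="views">45</div><div class="date">2020-02-02</div></dd>
  </dl>
</div>
"""

root = ET.fromstring(SAMPLE)

# ".//" searches all descendants; the predicate filters on the class attribute.
articles = root.findall(".//dl[@class='tab_page_list']")

rows = []
for article in articles:
    # Step-by-step path from the article down to its title link.
    title = article.findall("dt/h3/a")[0].text
    # Search anywhere below this article for the matching div.
    views = article.findall(".//div[@class='views']")[0].text
    date = article.findall(".//div[@class='date']")[0].text
    rows.append((title, views, date))

for title, views, date in rows:
    print(title + " " + views + " " + date)
```

Each findall call returns a list even when only one element matches, which is why the index [0] is needed before reading .text.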
Complete code:
from requests_html import HTMLSession
session = HTMLSession()
html = session.get("https://me.csdn.net/weixin_44286745").html
allBlog = html.xpath("//dl[@class='tab_page_list']")
for i in allBlog:
    title = i.xpath("dl/dt/h3/a")[0].text
    views = i.xpath("//div[@class='tab_page_b_l fl']")[0].text
    date = i.xpath("//div[@class='tab_page_b_r fr']")[0].text
    print(title + ' ' + views + ' ' + date)
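One thing to watch for: indexing with [0] raises IndexError when xpath finds nothing, for example if the page layout changes. A small guard like the one below may help. It is a sketch only: `first_text` is a helper name invented here, not part of requests_html, and `FakeElement` merely stands in for a real element so the example runs without network access.

```python
def first_text(matches, default=""):
    """Return the stripped text of the first match, or `default` if there is none."""
    if not matches:
        return default
    text = matches[0].text
    return text.strip() if text else default


class FakeElement:
    """Stand-in for an element returned by xpath(); it only has a .text attribute."""

    def __init__(self, text):
        self.text = text


# One match: the surrounding whitespace is stripped.
print(first_text([FakeElement("  My title \n")]))
# No match: the default is returned instead of raising IndexError.
print(first_text([], default="n/a"))
```

In the loop, this would be used as title = first_text(i.xpath("dl/dt/h3/a")), and likewise for the reading volume and date.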
You can crawl other things yourself, such as article pictures. Try it out!
To be continued