Dynamic web scraping with Python: crawling blog comment data by simulating a browser with Selenium

This example crawls the comments on the personal blog of the author of the book "Python Web Crawler: From Getting Started to Practice". URL: http://www.santostang.com/2017/03/02/hello-world/

1) Find the HTML tag that holds the comments. Open the article page in Chrome, right-click the page, and choose "Inspect". Using the method from Chapter 2, locate the comment data. In the inspector (figure not included here) you can see that a comment sits in a tag such as: <div class="reply-content">The 21st test comment</div>

2) Try to get one comment. With the page open in the browser, the following code gets the first comment. Because the comments are rendered inside an iframe titled 'livere', the code first switches into that frame with driver.switch_to.frame. driver.find_element_by_css_selector then finds an element by CSS selector, here the div with class 'reply-content-wrapper'; find_element_by_tag_name finds an element by tag name, here the p element inside the comment. Finally, the text of the p element is printed.

Relevant code 1:

from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

# Note: this uses the Selenium 3 API (find_element_by_*, firefox_binary=, capabilities=).
caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = True
# Path to the local Firefox executable; adjust it for your machine.
binary = FirefoxBinary(r'E:\software installation directory\installation prerequisite software\Mozilla Firefox\firefox.exe')
driver = webdriver.Firefox(firefox_binary=binary, capabilities=caps)
driver.get("http://www.santostang.com/2017/03/02/hello-world/")
# page = driver.find_element_by_xpath(".//html")
# The comments live inside the 'livere' iframe, so switch into it first.
driver.switch_to.frame(driver.find_element_by_css_selector("iframe[title='livere']"))
# Find the first comment wrapper, then the p element that holds its text.
comment = driver.find_element_by_css_selector('div.reply-content-wrapper')
content = comment.find_element_by_tag_name('p')
print(content.text)
# driver.page_source

output:

I can't find https://api.gentie.163.com/products/ in JS. Which god can help me. thanks.
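The find_element_by_* helpers used in code 1 belong to the Selenium 3 API. Selenium 4 deprecated them, and later 4.x releases removed them in favor of find_element(by, value). The same lookup in that style might look like the sketch below. The function name first_comment_text is hypothetical, and the two locator strings are written out literally (they are the values of By.CSS_SELECTOR and By.TAG_NAME in selenium.webdriver.common.by) so the snippet can be read and exercised without a browser. It assumes the driver has already loaded the article page.

```python
# Locator strategy names. In a real script these would normally come from
#   from selenium.webdriver.common.by import By
# where By.CSS_SELECTOR == "css selector" and By.TAG_NAME == "tag name".
CSS_SELECTOR = "css selector"
TAG_NAME = "tag name"


def first_comment_text(driver):
    """Return the text of the first comment (hypothetical helper).

    Assumes `driver` is a Selenium 4 WebDriver that has already opened
    the article page.
    """
    # The comments are rendered inside the 'livere' iframe, so enter it first.
    frame = driver.find_element(CSS_SELECTOR, "iframe[title='livere']")
    driver.switch_to.frame(frame)

    # Same selectors as code 1: the wrapper div, then the p element inside it.
    comment = driver.find_element(CSS_SELECTOR, "div.reply-content-wrapper")
    content = comment.find_element(TAG_NAME, "p")
    return content.text
```

With a real driver, print(first_comment_text(driver)) would take the place of the last lookup lines of code 1.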

 

The previous step retrieved only one comment. To get all the comments, find every matching element and loop over the results.

Relevant code 2:

from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

# Note: this uses the Selenium 3 API (find_elements_by_*, firefox_binary=, capabilities=).
caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = True
# Path to the local Firefox executable; adjust it for your machine.
binary = FirefoxBinary(r'E:\software installation directory\installation prerequisite software\Mozilla Firefox\firefox.exe')
driver = webdriver.Firefox(firefox_binary=binary, capabilities=caps)
driver.get("http://www.santostang.com/2017/03/02/hello-world/")
# page = driver.find_element_by_xpath(".//html")
# The comments live inside the 'livere' iframe, so switch into it first.
driver.switch_to.frame(driver.find_element_by_css_selector("iframe[title='livere']"))

# find_elements (plural) returns a list of every matching div.
comments = driver.find_elements_by_css_selector('div.reply-content')
for eachcomment in comments:
    # Each comment's text is in the p element inside the div.
    content = eachcomment.find_element_by_tag_name('p')
    print(content.text)
# driver.page_source

output:

I can't find https://api.gentie.163.com/products/ in JS, which great god can help me to answer. thanks.
@Mr. Zhang originally had to follow the operation here. . .
I can't find https://api.gentie.163.com/products/ in JS. Which god can help me. thanks.
@Mr. Zhang This is a connection address on NetEase Cloud, that server is closed
I can't find https://api.gentie.163.com/products/ in JS. Which god can help me. thanks.
test
Why does the article I open with the code only have two comments, there are originally 46 comments, does anyone know what's going on?
A rookie, looking for a learning group
lalala1
I'll give it a try
I'll give it a try
You should click JS, and then look at the Preview or Response inside, which responds to the content of Ajax, and then if you want to crawl the comments of the website, click on the js request and click Headers --> Copy the RequestURL in General.
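The loop in code 2 prints each comment as it goes. If the comments are needed afterwards, for example to count or save them, it can be handier to return them as a list. The sketch below does that under the same assumptions as code 2 (page already loaded, comments inside the 'livere' iframe). collect_comments is a hypothetical name, and the calls follow the Selenium 4 style find_element(by, value) / find_elements(by, value), with the locator strings written out literally instead of imported from By.

```python
# Same string values as By.CSS_SELECTOR and By.TAG_NAME in
# selenium.webdriver.common.by; written literally to keep the sketch standalone.
CSS_SELECTOR = "css selector"
TAG_NAME = "tag name"


def collect_comments(driver):
    """Return the text of every comment as a list (hypothetical helper)."""
    # Enter the iframe that holds the comments.
    frame = driver.find_element(CSS_SELECTOR, "iframe[title='livere']")
    driver.switch_to.frame(frame)

    texts = []
    # find_elements returns a list, so it can be iterated directly.
    for each_comment in driver.find_elements(CSS_SELECTOR, "div.reply-content"):
        content = each_comment.find_element(TAG_NAME, "p")
        texts.append(content.text)

    # Leave the iframe so later lookups start from the top-level page again.
    driver.switch_to.default_content()
    return texts
```

Calling len(collect_comments(driver)) then gives the number of comments that were actually loaded, which is a quick way to compare against the count shown on the page.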

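Note that comment=driver.find_element_by_css_selector('div.reply-content-wrapper') in code 1 becomes comments=driver.find_elements_by_css_selector('div.reply-content') in code 2. The method name is now find_elements, with an added "s": find_element returns only the first matching element, while find_elements returns a list of all matching elements.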
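The two methods also differ when nothing matches: find_element raises NoSuchElementException, whereas find_elements simply returns an empty list. That makes the plural form convenient for an optional lookup. A small sketch follows; first_or_none is a hypothetical helper, not part of Selenium, and it is written against the Selenium 4 call style find_elements(by, value).

```python
def first_or_none(parent, by, value):
    """Return the first element matching the locator, or None if there is none.

    `parent` can be a WebDriver or a WebElement; both expose find_elements.
    """
    matches = parent.find_elements(by, value)  # empty list when nothing matches
    return matches[0] if matches else None
```

For example, first_or_none(driver, "css selector", "div.reply-content") would return None on a page whose comments have not loaded, instead of raising an exception.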

 

 

Reference: Tang Song, "Python Web Crawler: From Getting Started to Practice"

 
