Python3 Crawlers, Step Two: Using Selenium to Crawl Complex Page Information the Simple Way

Introduction to Selenium

The previous crawler article in this column series can be found here.

As websites grow more complex, so do the crawlers written to scrape them. With Selenium, you can crawl complex pages in a simple way and extract the information you want.

Selenium automates the browser: it can visit websites, click buttons, and collect information automatically. Compared with scraping pages directly with bs4, Selenium's crawling speed is a real weakness; but when there are not many pages to crawl and those pages are complicated, Selenium is a good choice.
This article uses Selenium for some simple scraping. Readers who want to study Selenium in depth can read the two-part "selenium3 underlying analysis" series I wrote earlier.

Notes on using Selenium

Before using Selenium, install it with pip:

pip install selenium

After installing Selenium, you also need to download a browser driver:

  • Google Chrome: the driver (ChromeDriver) version must match your browser version; different browsers use different drivers.
  • Firefox: check your Firefox browser version, then download the matching driver from the GitHub driver releases page (each release notes the browser versions it supports; readers not comfortable with English can right-click and translate the page).

The author's environment is as follows:

  • Operating system: Windows 7 SP1, 64-bit
  • Python version: 3.7.7
  • Browser: Google Chrome
  • Browser version: 80.0.3987 (64-bit)

After downloading the driver, either add its location to the system PATH or drop it into your Python root directory.
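Once that is done, the driver can be found automatically and Chrome starts without any arguments. A minimal sketch, assuming chromedriver is already discoverable on the PATH:

from selenium import webdriver

# Assumes chromedriver is on the system PATH (or in the Python root directory)
driver = webdriver.Chrome()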

Getting started

First, import the webdriver module from selenium:

from selenium import webdriver

If you have not added the driver to the PATH, you can specify its location explicitly:

driver = webdriver.Chrome(executable_path=r'F:\python\dr\chromedriver_win32\chromedriver.exe')

The code above calls the Chrome method and uses executable_path to point to the driver at "F:\python\dr\chromedriver_win32\chromedriver.exe". Specifying the driver location this way means there is no need to configure the environment variable.
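One caveat worth noting: in newer Selenium releases (4.x), the executable_path argument has been removed, and the driver path is passed through a Service object instead. A minimal sketch for that case, using the same driver path as above:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4+: wrap the driver path in a Service object
service = Service(r'F:\python\dr\chromedriver_win32\chromedriver.exe')
driver = webdriver.Chrome(service=service)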
Now run the code and see whether it opens a browser. Google Chrome should open successfully.
The driver variable is now a browser object, and we operate the browser through it. Use the get method to visit a website; here we visit Baidu. The code is as follows:

from selenium import webdriver
driver = webdriver.Chrome(executable_path=r'F:\python\dr\chromedriver_win32\chromedriver.exe')
driver.get("http://baidu.com")

The Baidu search page opens successfully.
Suppose we want to search for 爬虫 (crawler) and have Selenium perform the search automatically. The first function to understand is find_element_by_id, which finds a page element by its id. In HTML, most elements with a special function are given an id. To search, we need to type into the text box where Baidu's search keywords go. Move the mouse over the text box, right-click on it, and choose Inspect to view the element.
After clicking Inspect, a source-code panel appears. The input element is the text box, and the value of its id is kw.
Knowing that the text box's id is kw, we can pass that value to find_element_by_id to find the element object and then operate on it. Since find_element_by_id is a method of the browser object, call it on driver:

input = driver.find_element_by_id('kw')

We still need to type the search term into the element. Use the send_keys method to enter the value automatically:

input.send_keys("爬虫")

Here input is the element object obtained above. Run the code to see the effect.
The keyword "爬虫" is typed in automatically. Next, following the same steps, find the id of Baidu's search button so we can click it. Inspecting the button gives this HTML:

<input type="submit" id="su" value="百度一下" class="bg s_btn">

The id is su; use find_element_by_id to get the element object:

enter = driver.find_element_by_id('su')

Click the element by calling its click method:

enter.click()

The final code is as follows:

from selenium import webdriver
driver = webdriver.Chrome(executable_path=r'F:\python\dr\chromedriver_win32\chromedriver.exe')
driver.get("https://www.baidu.com/")
input = driver.find_element_by_id('kw')
input.send_keys("爬虫")
enter = driver.find_element_by_id('su')
enter.click()

Running this performs the whole search automatically.

Getting the information

The search now runs automatically; the next step is to extract the search results.
This requires a new concept: XPath. You can think of an XPath as something like x and y coordinates: it is used to locate a position within an HTML or XML document. For simple use you do not need to learn to write XPath by hand, because the browser can generate it for us.
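As a quick illustration of what an XPath looks like, the two locators below find the same element, Baidu's search box with id kw (a sketch; the attribute-based XPath is written by hand here rather than copied from the browser):

# Both lines locate Baidu's search box: by id, and by an equivalent XPath
box_by_id = driver.find_element_by_id('kw')
box_by_xpath = driver.find_element_by_xpath('//input[@id="kw"]')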

Right-click the title of the first search result and click Inspect; the source code appears. Right-click the element in the source panel, choose Copy, then Copy XPath, and we get the XPath of the current element. Pasting the copied XPath into a text editor shows the following:

//*[@id="3001"]/div[1]/h3/a

Note that, in theory, this XPath should locate the first result on every page, so it would not need to be copied again for each page; when that turns out not to hold, the specific situation has to be analyzed case by case.
Even without really understanding XPath, we can use find_element_by_xpath to get the element object:

res_element=driver.find_element_by_xpath('//*[@id="3001"]/div[1]/h3/a')

After obtaining the element object, read its text property to get the current text value:

print(res_element.text)

The complete code is as follows:

from selenium import webdriver
import time
driver = webdriver.Chrome(executable_path=r'F:\python\dr\chromedriver_win32\chromedriver.exe')
driver.get("https://www.baidu.com/")
input = driver.find_element_by_id('kw')
input.send_keys("爬虫")
enter = driver.find_element_by_id('su')
enter.click()
time.sleep(2)

res_element=driver.find_element_by_xpath('//*[@id="3001"]/div[1]/h3/a')
print(res_element.text)

The time.sleep(2) in the code above waits for the page to load its data after the search is clicked; without it, the element cannot be found yet.
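A fixed sleep is the simplest option, but it can still fail on a slow connection. A more robust sketch uses Selenium's explicit waits, which poll until the element actually appears:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the first result link to appear,
# instead of sleeping for a fixed 2 seconds
res_element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//*[@id="3001"]/div[1]/h3/a'))
)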
Running it prints the title of the first result (the browser opening and searching steps are omitted here; we look at the output directly).
Since we can now get the first result on a page, we just need to click the next-page button automatically and repeat. First, inspect the 下一页 (next page) button and copy its XPath:

//*[@id="page"]/div/a[10]

The code to get the next-page element and click it is:

nextbtn_element=driver.find_element_by_xpath('//*[@id="page"]/div/a[10]')
nextbtn_element.click()

After running this, the browser successfully jumps to the second page, and we can get the first result in the results list again. A loop turns this into a repeatable process. To collect the first result of each of the first 10 pages, the code can be written as:

from selenium import webdriver
import time
driver = webdriver.Chrome(executable_path=r'F:\python\dr\chromedriver_win32\chromedriver.exe')
driver.get("https://www.baidu.com/")
input = driver.find_element_by_id('kw')
input.send_keys("爬虫")
enter = driver.find_element_by_id('su')
enter.click()
time.sleep(2)


for _ in range(10):
    res_element=driver.find_element_by_xpath('//*[@id="3001"]/div[1]/h3/a')
    print(res_element.text)
    nextbtn_element=driver.find_element_by_xpath('//*[@id="page"]/div/a[10]')
    nextbtn_element.click()
    time.sleep(2)

The 2-second sleep at the bottom of the for loop is the wait for data to load after clicking the next page.
After running it, however, an error is reported. Line 12 of the script is:

res_element=driver.find_element_by_xpath('//*[@id="3001"]/div[1]/h3/a')

So the XPath //*[@id="3001"]/div[1]/h3/a must be failing: the element it points to cannot be found. Comparing the XPath of the first result on the first several pages:

Page 1: //*[@id="3001"]/div[1]/h3/a
Page 2: //*[@id="11"]/h3/a
Page 3: //*[@id="21"]/h3/a
Page 4: //*[@id="31"]/h3/a
Page 5: //*[@id="41"]/h3/a

From this data, only the first page's XPath differs; the other pages follow the rule 11-21-31-41, with the id increasing by 10 per page.
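Based on this observed pattern, the XPath of the first result on page n (for n ≥ 2) could be built with a small helper; first_result_xpath below is a hypothetical name for illustration only:

# Hypothetical helper: the first result on page n (n >= 2)
# has id (n - 1) * 10 + 1, i.e. 11, 21, 31, ...
def first_result_xpath(page_number):
    return '//*[@id="' + str((page_number - 1) * 10 + 1) + '"]/h3/a'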
It also turns out that the XPath of the next-page button changes from the second page on, becoming:

//*[@id="page"]/div/a[11]

The complete code is as follows:

from selenium import webdriver
import time
# Request the page
driver = webdriver.Chrome(executable_path=r'F:\python\dr\chromedriver_win32\chromedriver.exe')
driver.get("https://www.baidu.com/")
# Type the keyword and search
input = driver.find_element_by_id('kw')
input.send_keys("爬虫")
enter = driver.find_element_by_id('su')
enter.click()
# Wait 2 seconds for the page to load
time.sleep(2)
# Get the first result, then click next page
res_element=driver.find_element_by_xpath('//*[@id="3001"]/div[1]/h3/a')
print(res_element.text)
nextbtn_element=driver.find_element_by_xpath('//*[@id="page"]/div/a[10]')
nextbtn_element.click()
time.sleep(2)

# Set a start variable
start=1
# Loop: click next page and get the first result each time
for _ in range(10):
    start+=10
    xpath_val=r'//*[@id="'+str(start)+r'"]/h3/a' # e.g. //*[@id="11"]/h3/a
    res_element=driver.find_element_by_xpath(xpath_val)
    print(res_element.text)
    nextbtn_element=driver.find_element_by_xpath('//*[@id="page"]/div/a[11]') # note: a[11] from page 2 on
    nextbtn_element.click()
    time.sleep(2)

In the code above, the first part is the same as the earlier script:

from selenium import webdriver
import time
# Request the page
driver = webdriver.Chrome(executable_path=r'F:\python\dr\chromedriver_win32\chromedriver.exe')
driver.get("https://www.baidu.com/")
# Type the keyword and search
input = driver.find_element_by_id('kw')
input.send_keys("爬虫")
enter = driver.find_element_by_id('su')
enter.click()
# Wait 2 seconds for the page to load
time.sleep(2)
# Get the first result, then click next page
res_element=driver.find_element_by_xpath('//*[@id="3001"]/div[1]/h3/a')
print(res_element.text)
nextbtn_element=driver.find_element_by_xpath('//*[@id="page"]/div/a[10]')
nextbtn_element.click()
time.sleep(2)

On top of that, a new loop is added to step through the next pages and grab the first result of each:

# Set a start variable
start=1
# Loop: click next page and get the first result each time
for _ in range(10):
    start+=10
    xpath_val=r'//*[@id="'+str(start)+r'"]/h3/a' # e.g. //*[@id="11"]/h3/a
    res_element=driver.find_element_by_xpath(xpath_val)
    print(res_element.text)
    nextbtn_element=driver.find_element_by_xpath('//*[@id="page"]/div/a[11]') # note: a[11] from page 2 on
    nextbtn_element.click()
    time.sleep(2)

First, a start variable is set. Since the id in the XPath changes as 11-21-31... from the second page on, we initialize start to 1 and add 10 on each pass, so the first statement in the loop is:

start+=10

Since the rest of the XPath string does not change, the whole XPath can be built as:

xpath_val=r'//*[@id="'+str(start)+r'"]/h3/a'

Then pass it to find_element_by_xpath to get the element:

res_element=driver.find_element_by_xpath(xpath_val)

The remaining statements change very little; only the XPath of the next-page button needed updating. The rest of the code is similar to what came before. The final run prints the first result of each page (some unrelated information in the output is redacted). This is a simple way to write a Selenium crawler, and the crawler series will continue to be updated.
