[Python crawler] 10. Commanding the browser to work automatically (selenium)

Preface

In the previous level, we learned about cookies and sessions: what each is used for and how they differ.
We also built a project: logging in with a cookie and then leaving a comment on a blog.

In addition to the login issues covered in the previous level, we may run into all sorts of thorny problems while crawling:

Some website logins are very complicated and the verification code is difficult to crack, such as the famous 12306.

Some sites have complex page interactions and use technologies that are hard to crawl, such as Tencent Docs.

Some sites encrypt their URLs with very convoluted logic. For example, with the QQ Music song comments crawled in Level 4, the URL's parameter variables are quite difficult to work out.

In cases like these, it is hard to break through the site's anti-crawler defenses.

But don't worry. In this level I will teach you the ultimate weapon, selenium, with which you can solve all of the problems above.

What is selenium

What is selenium? It is a powerful Python library.

What can it do? It can use a few lines of code to control the browser and perform operations such as automatic opening, input, and clicks, just like a real user is operating it.

Let's look at a short screen recording first; a video shows this far better than words can.

This is the script I wrote using selenium to let the browser automatically open the web page, then enter text and click the submit button. I will talk about the code used here later.

Let me mention a student I taught before. Logging in to his company's intranet was very cumbersome, and the work after logging in was mechanically repetitive. After learning selenium, he wrote a Python program.

Now the first thing he does at work each day is turn on his computer and run that script: the browser automatically opens the company intranet, completes the login, and grinds through the repetitive tasks, while he sits back and drinks tea.

Selenium can control the browser. How does this help solve the problems we just raised?

First, when you hit a site with a complex verification code, selenium lets a human type in the code by hand and hands all the remaining operations over to the machine.

For sites with complex interactions or convoluted encryption, selenium simplifies the problem: it lets you crawl a dynamic web page as easily as a static one.

What are dynamic and static web pages? You have actually dealt with both already.

In Level 2 you learned to write web pages in HTML; those are static web pages. We used BeautifulSoup to crawl that type of page, because the page's source code contains all of its information, so the URL in the address bar is also the URL of the source code.


Later you came into contact with more complex pages, such as QQ Music, where the data to crawl is not in the HTML source code but in JSON. You cannot simply use the URL in the address bar; you need to find the real URL of the JSON data. That is a dynamic web page.

No matter where the data is stored, the browser constantly makes requests to the server, and once those requests complete, together they form the rendered page source shown under Elements in the developer tools.

This is where selenium comes in handy for complex page interactions or convoluted URL encryption: it actually opens a browser, waits for all the data to be loaded into Elements, and then lets you crawl the page as if it were static.

Having said so many advantages, of course there are also shortcomings when using selenium.

Since it genuinely runs a local browser, and opening it and waiting for the page to render over the network takes time, selenium inevitably sacrifices speed and consumes more resources; but at least it is no slower than a human.

Knowing its advantages and disadvantages, let's start learning how to use selenium.

How to use it

First of all, like all other Python libraries, selenium needs to be installed. The method is also very simple. Use pip to install it.

pip install selenium # install selenium on Windows
pip3 install selenium # install selenium on Mac

Selenium scripts can control all common browsers. Before using it, you need to install the browser driver.

I recommend the Chrome browser. Open the link below to download the ChromeDriver package; versions are available for both Windows and Mac.

https://localprod.pandateacher.com/python-manuscript/crawler-html/chromedriver/ChromeDriver.html

I strongly recommend that you download it now and install the browser driver on your computer, because this level is quite special and requires you to learn while running the code in the local environment.

That is because, in this format, I can only show you how it works through animations. So while studying this level, if you want to watch the browser's operation process first-hand, you have to run the scripts on your own computer.

Before we officially start on the knowledge, I want you to first experience running a selenium script in your local terminal, because at the start of learning selenium, personally watching the browser pop up and operate by itself helps enormously with everything that follows.

The code below is the code for the animation at the beginning of this lesson. You don't need to understand the specific meaning now, you will learn how to use each line later.

Now you just need to copy this code to your local code editor and run it to experience the effect of your browser automatically working for you. Of course, the premise is that you have installed the selenium library and Chrome browser driver.

# Local Chrome browser setup
from selenium import webdriver  # import the webdriver module from the selenium library
import time

driver = webdriver.Chrome()  # set the engine to Chrome and really open a Chrome browser
driver.get('https://localprod.pandateacher.com/python-manuscript/hello-spiderman/')  # open the page
time.sleep(4)  # wait while the page loads and auto-redirects

teacher = driver.find_element_by_id('teacher')  # locate the "favorite teacher" input box by its id
teacher.send_keys('必须是吴枫呀')  # type the answer
assistant = driver.find_element_by_name('assistant')  # locate the "favorite assistant" input box by its name
assistant.send_keys('都喜欢')  # type the answer
time.sleep(1)
button = driver.find_element_by_class_name('sub')  # locate the submit button by its class name
time.sleep(1)
button.click()  # click submit
time.sleep(1)
driver.close()  # close the browser

Beyond watching the program run, it is even better to open the site yourself and perform the same operations by hand. Here is the URL:

https://localprod.pandateacher.com/python-manuscript/hello-spiderman/

The first thing you see is the big words [Hello, Spider-Man!]; after a second, the page automatically jumps to a new screen asking you to enter your favorite teacher and teaching assistant. Once you click submit, it jumps to a Chinese-English page of the Zen of Python.

If you watch carefully, you will notice that the URL never changes during this whole process: [Hello, Spider-Man!] is a dynamic web page.

After experiencing selenium, we will officially start explaining the code.

Setting up the browser engine

As always, to use a new Python library you first have to import it. Selenium is a little different: besides importing it, you also need to set up the browser engine.

# Local Chrome browser setup
from selenium import webdriver  # import the webdriver module from the selenium library
driver = webdriver.Chrome()  # set the engine to Chrome and really open a Chrome browser

That is the browser setup: set Chrome as the engine and assign it to the variable driver. driver is an instantiated browser, and you will keep seeing it from here on, which makes sense, since it is this instantiated browser that we command to do things for us.
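A version note before we continue: webdriver.Chrome() assumes the ChromeDriver executable can be found on your system PATH. If you installed it elsewhere, or you are on Selenium 4, you can point at it explicitly through the Service class; in Selenium 4.6 and later, a built-in Selenium Manager can even fetch a matching driver automatically. A minimal sketch ('/path/to/chromedriver' below is a placeholder, not a real location):

# Sketch: explicitly telling Selenium 4 where ChromeDriver lives.
# '/path/to/chromedriver' is a placeholder -- substitute your own path.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

service = Service('/path/to/chromedriver')
driver = webdriver.Chrome(service=service)  # opens a real Chrome window
driver.quit()  # done with the demo, shut the browser down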

After configuring the browser, we can start letting it work for us!

Next, we will learn selenium's concrete usage. This part of the explanation will use the site you have now seen several times, [Hello, Spider-Man!], as the example:

https://localprod.pandateacher.com/python-manuscript/hello-spiderman/

Let's follow the four steps of crawling to explain selenium's usage and see how it gets, parses, and extracts data. Since the data in this level is not complicated, we will just print it in the terminal and skip the storage step.


Getting data

First, let’s take a look at how to write the code to obtain data.

import time

# Local Chrome browser setup
from selenium import webdriver  # import the webdriver module from the selenium library
driver = webdriver.Chrome()  # set the engine to Chrome and really open a Chrome browser

driver.get('https://localprod.pandateacher.com/python-manuscript/hello-spiderman/')  # open the page
time.sleep(1)
driver.close()  # close the browser

The first three lines you have already learned: importing the module and setting up the browser. Only the last two lines are new.

driver.get(URL) is a webdriver method whose job is to open the web page at the given URL.

As mentioned above, driver is an instantiated browser, so the page is opened through this browser.

When a page is opened, its data is loaded into the browser; in other words, we have obtained the data.

driver.close() closes the browser. Every time you use webdriver, add a line of driver.close() once you are done with it.

Just as you must remember to shut the refrigerator door after opening it to put something in, you must remember to close the browser after selenium has opened it.
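One related detail, standard selenium behavior rather than something specific to this course: driver.close() closes only the current browser window, while driver.quit() shuts down every window and ends the WebDriver session entirely. For the single-window scripts in this level the two look the same, but quit() is the more thorough cleanup:

driver.close()  # closes only the currently focused browser window
# or, to shut down the whole browser session and the driver process:
driver.quit()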

Copy the code above and run it on your local computer. You will see a browser start by itself and open a page for you; after one second, it closes.

Next, we let the browser parse and extract the data, and then print it out so that we can see the returned results.

Parsing and extracting data

We spent the previous two levels learning to use BeautifulSoup to parse the source code of web pages and then extract the data.

The selenium library can also parse and extract data. Its underlying principle is consistent with BeautifulSoup's, but it differs in some details and syntax.

The first obvious difference is that selenium parses and extracts everything in Elements, while BeautifulSoup parses only the response of the 0th request in Network.

As I said at the beginning of this level: open the page with selenium and all the information loads into Elements, after which you can crawl the dynamic page the way you would crawl a static one.

How does selenium parse and extract data? Let's try extracting the content of the <label> element in the [Hello, Spider-Man!] page.

I have written the code; run it and take a look! Tip: if the reference code errors when you run it, copy it out and adapt it yourself.

from selenium import webdriver  # import the webdriver module from the selenium library
import time

chrome_options = webdriver.ChromeOptions()  # instantiate an Options object
chrome_options.add_argument('--headless')  # run the browser in headless mode
driver = webdriver.Chrome(options=chrome_options)  # declare the browser object

driver.get('https://localprod.pandateacher.com/python-manuscript/hello-spiderman/')  # open the page
time.sleep(2)  # wait 2 seconds
label = driver.find_element_by_tag_name('label')  # parse the page and extract the first <label> element
print(label.text)  # print the label's text
driver.close()  # close the browser

Output:

(提示:吴枫)

As the output shows, we extracted the text inside the <label></label> element, namely (提示:吴枫).

Only the last few lines of this code are new: wait two seconds, then parse the page and extract its first <label> element, then print that label's text.

We wait with time.sleep(2) because the browser needs a little time to buffer and load the page, and this site waits one second on its home page before redirecting to the input page, so pausing a couple of seconds before parsing and extracting is more stable.

Looking at it this way, parsing and extracting data actually only uses one line of code here:

label = driver.find_element_by_tag_name('label') # parse the page and extract the first <label> element's text

Can you tell which part is doing the parsing and which part is doing the extraction?

Recall that with BeautifulSoup, you must first parse the Response object into a BeautifulSoup object and then extract data from it.

In selenium, the fetched page lives inside the driver, and parsing and extraction happen together, both performed by the driver, our instantiated browser.

So the answer to the question above is: parsing is completed automatically by the driver, and extraction is done through the driver's methods.

Now that we understand the essence of parsing and extraction, let’s talk about the methods of parsing data in detail.

Of course, selenium can extract data not only by tag; it has many methods for locating elements, and all of them are very literal.

The locating methods are all nearly literal translations of their English names. Here is an example of each; please read the comments in the code below carefully:

# Each of the following methods can extract the text '你好,蜘蛛侠!' from the page

find_element_by_tag_name: select by the element's tag name
# e.g. <h1>你好,蜘蛛侠!</h1>
# use find_element_by_tag_name('h1')

find_element_by_class_name: select by the element's class attribute
# e.g. <h1 class="title">你好,蜘蛛侠!</h1>
# use find_element_by_class_name('title')

find_element_by_id: select by the element's id
# e.g. <h1 id="title">你好,蜘蛛侠!</h1>
# use find_element_by_id('title')

find_element_by_name: select by the element's name attribute
# e.g. <h1 name="hello">你好,蜘蛛侠!</h1>
# use find_element_by_name('hello')

# The following two methods extract hyperlinks

find_element_by_link_text: select a link by its full link text
# e.g. <a href="spidermen.html">你好,蜘蛛侠!</a>
# use find_element_by_link_text('你好,蜘蛛侠!')

find_element_by_partial_link_text: select a link by part of its text
# e.g. <a href="https://localprod.pandateacher.com/python-manuscript/hello-spiderman/">你好,蜘蛛侠!</a>
# use find_element_by_partial_link_text('你好')

The above is how to extract a single element.
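A version caveat: the find_element_by_* helpers used throughout this level come from Selenium 3 and were removed in Selenium 4. If your installed selenium is version 4 or newer, the same lookups are written with a single find_element method plus a By locator. Here is a one-to-one mapping, a sketch assuming a driver object like the one above:

# Selenium 4 equivalents of the Selenium 3 helpers used in this level
from selenium.webdriver.common.by import By

driver.find_element(By.TAG_NAME, 'h1')             # find_element_by_tag_name('h1')
driver.find_element(By.CLASS_NAME, 'title')        # find_element_by_class_name('title')
driver.find_element(By.ID, 'title')                # find_element_by_id('title')
driver.find_element(By.NAME, 'hello')              # find_element_by_name('hello')
driver.find_element(By.LINK_TEXT, '你好,蜘蛛侠!')    # find_element_by_link_text(...)
driver.find_element(By.PARTIAL_LINK_TEXT, '你好')   # find_element_by_partial_link_text(...)

The plural find_elements works the same way and returns a list.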

So what exactly do these methods give us? What attributes and methods does the returned object have? Read the code below and run it:

from selenium import webdriver  # import the webdriver module from the selenium library
import time

chrome_options = webdriver.ChromeOptions()  # instantiate an Options object
chrome_options.add_argument('--headless')  # run the browser in headless mode
driver = webdriver.Chrome(options=chrome_options)  # declare the browser object

driver.get('https://localprod.pandateacher.com/python-manuscript/hello-spiderman/')  # open the page
time.sleep(2)  # wait 2 seconds
label = driver.find_element_by_tag_name('label')  # extract the first <label> element
print(type(label))  # print label's data type
print(label.text)  # print label's text
print(label)  # print label itself
driver.close()  # close the browser

Output:

<class 'selenium.webdriver.remote.webelement.WebElement'>
(提示:吴枫)
<selenium.webdriver.remote.webelement.WebElement (session="6d400c6ad6f0aa4f5a241b4332ea0c4c", element="0.9387651316030954-1")>

The result has three lines. As you can see, the extracted data is a WebElement object; printing it directly just returns a string describing it.

It is similar to BeautifulSoup's Tag object: it too has a .text attribute that gives the extracted element's content as a string.

One more addition: the WebElement object, like the Tag object, also has a method for extracting an attribute's value by name: .get_attribute().


Let's try an example: locate the input box whose class is "teacher" (the box under 【请输入你喜欢的老师】), and extract the value of its type attribute.

from selenium import webdriver  # import the webdriver module from the selenium library
import time

chrome_options = webdriver.ChromeOptions()  # instantiate an Options object
chrome_options.add_argument('--headless')  # run the browser in headless mode
driver = webdriver.Chrome(options=chrome_options)  # declare the browser object

driver.get('https://localprod.pandateacher.com/python-manuscript/hello-spiderman/')  # open the page
time.sleep(2)  # wait 2 seconds
label = driver.find_element_by_class_name('teacher')  # locate the element by its class name
print(type(label))  # print label's data type
print(label.get_attribute('type'))  # get the value of the type attribute
driver.close()  # close the browser

Output:

<class 'selenium.webdriver.remote.webelement.WebElement'>
text

So we can summarize the object conversions during selenium's parsing and extraction: driver.get() loads and parses the page into the driver, find_element_by_ returns a WebElement object, and .text or .get_attribute() turns it into an ordinary string.
So far we have only extracted the first piece of data on the page that matches. Next, let's look at extracting multiple elements.

find_element_by_ is like BeautifulSoup's find: it extracts the first element that matches. And just as BeautifulSoup has find_all to extract all matches, selenium has corresponding methods too.

They are just as simple: replace element in the method name with the plural elements.

Let's try extracting the text of all the label tags in [Hello, Spider-Man!].

from selenium import webdriver  # import the webdriver module from the selenium library
import time

chrome_options = webdriver.ChromeOptions()  # instantiate an Options object
chrome_options.add_argument('--headless')  # run the browser in headless mode
driver = webdriver.Chrome(options=chrome_options)  # declare the browser object

driver.get('https://localprod.pandateacher.com/python-manuscript/hello-spiderman/')  # open the page
time.sleep(2)  # wait 2 seconds

labels = driver.find_elements_by_tag_name('label')  # extract all elements with the label tag
print(type(labels))  # print labels' data type
print(labels)  # print labels
driver.close()  # close the browser

Output:

<class 'list'>
[<selenium.webdriver.remote.webelement.WebElement (session="87d373c4e7a09aef4dd31f5940f8cf84", element="0.794826797904179-1")>, <selenium.webdriver.remote.webelement.WebElement (session="87d373c4e7a09aef4dd31f5940f8cf84", element="0.794826797904179-2")>]

As the result shows, what we get is a list (<class 'list'>) whose contents are WebElement objects; those strings are just their descriptions, and as we just learned, you need .text to get their text content.

With the list in hand, you can traverse it with a for loop, just like the result of find_all, and take each value out of it.

So, please write this code:

Reference Code:

labels = driver.find_elements_by_tag_name('label')  # extract all <label> elements
for i in labels:
    print(i.text)  # print each label's text

The above covers selenium's own methods for parsing and extracting data.

Besides using selenium itself to parse and extract, there is another option: use selenium to obtain the page, then hand it to BeautifulSoup to parse and extract.

Next, let's see how selenium and BeautifulSoup can happily cooperate.

Let's review how BeautifulSoup works.

Insert image description here
BeautifulSoup needs to parse the web page source code in string format into a BeautifulSoup object, and then extract data from it.

Selenium, for its part, can obtain the fully rendered page source.

How do we get it? Through another of driver's properties: page_source.

html_source_string = driver.page_source  # the fully rendered page source, as a string

Let's do it now and get the page source of [Hello, Spider-Man!]:

from selenium import webdriver  # import the webdriver module from the selenium library
import time

chrome_options = webdriver.ChromeOptions()  # instantiate an Options object
chrome_options.add_argument('--headless')  # run the browser in headless mode
driver = webdriver.Chrome(options=chrome_options)  # declare the browser object

driver.get('https://localprod.pandateacher.com/python-manuscript/hello-spiderman/')  # open the page
time.sleep(2)  # wait 2 seconds

pageSource = driver.page_source  # get the fully rendered page source
print(type(pageSource))  # print pageSource's type
print(pageSource)  # print pageSource
driver.close()  # close the browser

We successfully obtained and printed the web page source code O(∩_∩)O~~ and its data type is <class 'str'>.

Remember that what requests.get() returns is a Response object, and before handing it to BeautifulSoup you had to use .text to get its content as a string.

The page source obtained with selenium is itself already a string.

Once you have the page source as a string, you can parse and extract it with BeautifulSoup. That is the homework I am leaving you.
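To get you started on that homework, here is the skeleton of the hand-off from selenium to BeautifulSoup, a sketch assuming a driver that has already loaded the page; the actual extraction is still yours to write:

from bs4 import BeautifulSoup

pageSource = driver.page_source                  # fully rendered HTML, already a string
soup = BeautifulSoup(pageSource, 'html.parser')  # parse it exactly as in earlier levels
# ...from here on, use soup.find() / soup.find_all() as usual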

At this point, the methods of parsing and extracting data have been explained.

As for selenium's usage, what is left unexplained? Right: the feature we demonstrated at the start of this level, controlling the browser to type text and click submit automatically.

The web URL is given to you again:

https://localprod.pandateacher.com/python-manuscript/hello-spiderman/

I will solve this mystery for you now.

Automating the browser

In fact, to achieve the effect shown in the above animation, you only need to learn two new methods:

.send_keys() # simulate keystrokes to fill in a form automatically
.click() # click an element

With these two methods, combined with the locating methods we just covered, you can fully operate the browser.

Having learned this much, we can write out the complete code. It is exactly the code I gave you at the start of the level to copy and run locally.

# Local Chrome browser setup
from selenium import webdriver  # import the webdriver module from the selenium library
import time  # import the time module
driver = webdriver.Chrome()  # set the engine to Chrome and really open a Chrome browser

driver.get('https://localprod.pandateacher.com/python-manuscript/hello-spiderman/')  # open the page
time.sleep(3)  # pause three seconds while the browser loads and the page redirects

teacher = driver.find_element_by_id('teacher')  # locate the input box under 【请输入你喜欢的老师】 (favorite teacher)
teacher.send_keys('必须是吴枫呀')  # type the text
assistant = driver.find_element_by_name('assistant')  # locate the input box under 【请输入你喜欢的助教】 (favorite assistant)
assistant.send_keys('都喜欢')  # type the text
button = driver.find_element_by_class_name('sub')  # locate the 【提交】 (submit) button
button.click()  # click the submit button
time.sleep(1)
driver.close()  # close the browser

Since this code only commands the browser to perform operations, nothing is returned to the terminal.

As you copied it, did you notice how the last several lines pair up? Before each input or click you must first locate the corresponding element, and the locating method is exactly the parsing-and-extracting method you learned above.

For example, before entering your favorite teacher you must find the input box in the page source. The method is the old one: click the small arrow in the top-left corner of the developer tools, then hover over the input box on the page.

As the page source shows, you can locate it here by id="teacher" or class="teacher".

Assign the located element to teacher, then use teacher.send_keys() to type the text you want into the blank.

That completes one full operation, and the next two work the same way. With that, the whole script is written.

One more bit of knowledge: besides typing and clicking, there is a method often used alongside them, .clear(), which clears the content of an element.

If [蜘蛛侠] has already been typed into the box and you want to change it to [吴枫], you must first wipe out [蜘蛛侠] with .clear() and then fill in the new text.
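In code, that correction might look like this, a small sketch reusing the teacher input box located earlier:

teacher = driver.find_element_by_id('teacher')  # locate the input box
teacher.send_keys('蜘蛛侠')  # suppose this was typed in first
teacher.clear()             # wipe the box clean
teacher.send_keys('吴枫')    # then type the new text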

At this point, the knowledge explanation part of this level is all completed. Let’s work on a project together~

Whenever you learn something new, practice it promptly to consolidate it; that is how understanding and memory deepen.

Practical application

Confirming the target

This time we will use selenium to crawl the comments of a song on QQ Music. The song I chose is 《甜甜的》 ("Sweet").

https://y.qq.com/n/yqq/song/000xdZuV2LcQ19.html

I don't know whether you still remember: when learning JSON in Level 5, we crawled a song's latest comments on QQ Music. This time we crawl the wonderful (hot) comments; the two kinds are crawled in essentially the same way.


Now I will take you through a project I have done before, this time with selenium. This is not laziness or a whim; it is a deliberate choice, because the same project is worth doing twice, or even many times.

Reaching the same goal by different paths is a way of training that gives you a more thorough grasp of the knowledge.

After confirming the goal, let’s start taking action! As always, before writing code, analyze your ideas first.

Analyzing the process

We still analyze according to the four steps of crawling.

The first is to get the data:

From Level 5 you already know that the comments we want are not in the page source but stored in JSON; you would have to look through XHR to find the real URL of each page of comment data.

But this time we are using selenium, so we don't need to spend energy finding and cracking that URL: opening the page through selenium loads the data straight into Elements.

Getting more comments also becomes very simple: just have selenium control the browser and click the [Click to load more] button, and the comment data naturally loads into Elements. Perfect.
Next is parsing and extracting data:

The first solution is to use selenium to extract data.

The second solution is to first obtain the complete page source and then parse and extract it with BeautifulSoup. Either method gets the job done.

We skip the final step of storing data and print it directly in the terminal.

Once the whole process is sorted out, you can start to write! the! code!

Code

First, call all required modules, set up the Chrome browser engine, access the web page, and obtain data.

from selenium import webdriver  # import the webdriver module from the selenium library
from bs4 import BeautifulSoup  # import the BeautifulSoup library

chrome_options = webdriver.ChromeOptions()  # instantiate an Options object
chrome_options.add_argument('--headless')  # run the browser in headless mode
driver = webdriver.Chrome(options=chrome_options)  # declare the browser object

driver.get('https://y.qq.com/n/yqq/song/000xdZuV2LcQ19.html')  # open the page

Then, use selenium's parsing and extraction method to get the song comments and print them.

Note that between getting the page and parsing/extracting, you must add time.sleep(2): loading the page takes a few tenths of a second, and to be safe we wait two seconds.
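As an aside, a fixed sleep is the simplest wait but not the most reliable: on a slow connection, two seconds may not be enough. Selenium also provides explicit waits that block until an element actually appears. A minimal sketch, assuming the js_hot_list class name used below:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the hot-comment list to appear,
# instead of always sleeping a fixed 2 seconds.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'js_hot_list'))
)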

When extracting data, you first need to know where it sits on the page. The method is the old one: right-click, choose Inspect, hover the mouse over one of the song's wonderful comments, and find the corresponding position in Elements.

Note that in this page's source, the element containing the comments has several class names in its class attribute, and with selenium you can use only one of them to extract the data.

After analyzing the page structure, we choose class_name and tag_name to extract the data. The code for getting the first page of this song's wonderful comments:

from selenium import webdriver  # import the webdriver module from the selenium library
import time

chrome_options = webdriver.ChromeOptions()  # instantiate an Options object
chrome_options.add_argument('--headless')  # run the browser in headless mode
driver = webdriver.Chrome(options=chrome_options)  # declare the browser object

driver.get('https://y.qq.com/n/yqq/song/000xdZuV2LcQ19.html')  # open the page
time.sleep(2)

comments = driver.find_element_by_class_name('js_hot_list').find_elements_by_class_name('js_cmt_li')  # locate the comments by class name
print(len(comments))  # print how many comments were found
for comment in comments:  # loop over the comments
    sweet = comment.find_element_by_tag_name('p')  # find the comment text
    print('评论:%s\n ---\n' % sweet.text)  # print the comment
driver.close()  # close the browser

Output:

15
评论:想起那晚我在你耳边轻轻的的说爱你,你一脸害羞的看着我并且点点头,你此时此刻的笑像夹心饼干,双手捂着嘴,而你的脸庞笑的却如此可爱,这或许就是恋爱最甜甜的趣事吧 Jay在录制这一首歌的时候,就考虑到这歌曲风格甜美是否适合自己以往的演唱,但想到这首歌能表达出Jay对学生时代那种单纯感觉的怀念,于是造就了现在的经典 那一晚过后,我们的每次相约,你的眼中只有我,望着我的样子眼神中充满满满宠溺,我用拥抱给了你一切的回应 我喜欢的样子你都有
 ---

评论:我想留着西瓜最中间的一勺,掺杂着巧克力屑的奶油蛋糕,草莓曲奇的第一口,双皮奶的最上层,偷喝妹妹奶粉的最后一口,所有我见过最甜蜜的。却没有甜过有我所有喜欢的样子的你,你眼中的只有最喜欢你的我。
 ---

评论:这首一定是婚礼必备。这首歌里最喜欢的歌词是“啾!”,告诉我不止我一个人
 ---

评论:我也超喜欢杰伦这首《甜甜的》!从高中听到结婚生子!依旧没有改变那种甜甜的旋律!
 ---

评论:第一次实在广告里面听的,然后就开始找啊找,找的好辛苦啊。。。。一听钟情!
 ---

评论:这首甜甜的 满满的都是中学时代的回忆。 那时候还很懵懂,那时候还不懂什么是爱情,就是喜欢某个女生 喜欢和她一起的那个时光, 午后的操场 六楼的钢琴室 学校周边的街道 … 如今再也回不去了,但是这首歌里满满的都是回忆。
 ---

评论:听到这首歌想起了初中的时候,每个人心中都住着那么一个人,不是爱,也不是喜欢,但是每次见到哪怕是提到他的名字就会怦然心动的感觉,要怪就怪当时没有提起勇气告诉他,也许有些人就是用来怀念的
 ---

评论:这首歌!!真的炒鸡炒鸡甜!炒鸡甜!甜到掉牙!好了,我要去看书了
 ---

评论:刘霞,你在哪里。我为你跑了很远很远。我知道你喜欢周董,希望你能看到。我相信缘分,,,,
 ---

评论:歌如其名,如果你有心上人,大概会不自主的想到ta吧?嘴角一定也是上扬露出笑容,因为,我喜欢的样子你都有~
 ---

评论:听到这首歌想起了初中的时候,每个人心中都住着那么一个人,不是爱,是喜欢,但是每次见到哪怕是提到她的名字就会怦然心动的感觉,要怪就怪当时没有提起勇气告诉她,也许有些人就是用来怀念的
 ---

评论:早晨领份狗粮去上课那句「啾」真的萌爆了以前竟然都没注意过,这也是首适合告白的歌
 ---

评论:又啥都没干听了两小时周杰伦了…
 ---

评论:明明很煽情却一点都没有腻的感觉,这就是周董的实力吧!听到广告就觉得很海森!大爱!
 ---

评论:如果用周杰伦的歌代表我对感情的认知,应该是从情窦初开的<简单爱>到热恋期的<甜甜的>,俩个人的世界满满的好<星晴>,然而异地恋开始了,我们中间隔了一片<珊瑚海>。最后我选择了<退后>,失去了关于你的<轨迹>,但我承认这一切都是我的错,是我<搁浅>了我们之间的感情。再后来我们失去了联系,而<一路向北>也成了我的单曲循环。
 ---

This time we extracted 15 comments. Next we want more: clicking [Click to load more] on the page loads 15 new comments.

At this point, the way to write the code is clear: find the position of [Click to load more] in the page source, click it, wait for the new source to load, and then extract all 30 comments.

Please try to complete the following:

Reference Code:

from selenium import webdriver  # import the webdriver module from the selenium library
import time

chrome_options = webdriver.ChromeOptions()  # instantiate an Options object
chrome_options.add_argument('--headless')  # run the browser in headless mode
driver = webdriver.Chrome(options=chrome_options)  # declare the browser object

driver.get('https://y.qq.com/n/yqq/song/000xdZuV2LcQ19.html')  # open the page
time.sleep(2)

loadmore = driver.find_element_by_class_name('comment__show_all').find_element_by_tag_name('a')  # locate the [Click to load more] link
loadmore.click()  # click it
time.sleep(2)  # wait for the new comments to render
comments = driver.find_element_by_class_name('js_hot_list').find_elements_by_class_name('js_cmt_li')

for i in comments:
    txt = i.find_element_by_tag_name('p')
    print(txt.text)
driver.close()  # close the browser

We successfully obtained two pages of comments. Applause for you~

If you want even more comments, add a loop plus a condition (whether the click-to-load-more link can still be found) and you're done. I won't write that code here; practice it yourself after class. The point of the exercise is the method; there is no need to actually fetch all the thousands of comments.
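If you want a starting point for that exercise, one possible shape for the loop is sketched below. It reuses the class names from the reference code above, which QQ Music may change at any time, and it assumes the driver and time import from that code:

from selenium.common.exceptions import NoSuchElementException

while True:
    try:
        # look for the [Click to load more] link; stop once it can no longer be found
        loadmore = driver.find_element_by_class_name('comment__show_all').find_element_by_tag_name('a')
    except NoSuchElementException:
        break  # the link is gone, so there are no more pages to load
    loadmore.click()
    time.sleep(2)  # give the newly loaded comments time to render

# ...then extract and print the comments exactly as in the reference code above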

That was the first parsing-and-extraction approach. Of course, the second one also works: selenium combined with BeautifulSoup.

First use selenium to obtain the complete web page source code, and then use BeautifulSoup, which you are already familiar with, to parse and extract data.

I have written the code; the difference from before lies in the last few lines:

from selenium import webdriver  # import the webdriver module from the selenium library
from bs4 import BeautifulSoup  # import the BeautifulSoup library
import time

chrome_options = webdriver.ChromeOptions()  # instantiate an Options object
chrome_options.add_argument('--headless')  # run the browser in headless mode
driver = webdriver.Chrome(options=chrome_options)  # declare the browser object

driver.get('https://y.qq.com/n/yqq/song/000xdZuV2LcQ19.html')  # open the page
time.sleep(2)

button = driver.find_element_by_class_name('js_get_more_hot')  # locate [Click to load more] by class name
button.click()  # click it
time.sleep(2)  # wait two seconds

pageSource = driver.page_source  # get the fully rendered page source from Elements
soup = BeautifulSoup(pageSource, 'html.parser')  # parse the page with BeautifulSoup
comments = soup.find('ul', class_='js_hot_list').find_all('li', class_='js_cmt_li')  # extract the elements with BeautifulSoup
print(len(comments))  # print the number of comments

for comment in comments:  # loop over the comments
    sweet = comment.find('p')  # extract the comment text
    print('评论:%s\n ---\n' % sweet.text)  # print the comment
driver.close()  # close the browser

At this point, all the code has been written.

We completed the same project with a different method from Level 5, and within it, we parsed and extracted the data in two different ways.

Having learned this many methods, when you meet a similar problem in the future you can judge which of them fit the actual situation and pick one to do the project with.

Summary of this level

Thank you for your hard work: with the knowledge learned and the project done, we have reached the end of the level.

In this level you learned to install selenium and ChromeDriver, and then how to set up the browser:

# Local Chrome browser, visible-mode setup:
from selenium import webdriver  # import the webdriver module from the selenium library
driver = webdriver.Chrome()  # set the engine to Chrome and really open a Chrome browser

With this setup you can watch the browser operate. One addition: in a local environment, you can also put Chrome into silent mode, letting the browser run in the background without opening a visible window.

When writing a crawler there is usually no need to watch the browser: the goal is the data, not the browser's performance. In that case, use silent mode.

Its setting method is like this:

# Local Chrome browser, silent-mode setup:
from selenium import webdriver  # import the webdriver module from the selenium library
from selenium.webdriver.chrome.options import Options  # import the Options class from the options module

chrome_options = Options()  # instantiate an Options object
chrome_options.add_argument('--headless')  # put Chrome into silent (headless) mode
driver = webdriver.Chrome(options=chrome_options)  # set the engine to Chrome, running quietly in the background

Compared with the visible-mode setup above, three things are new: we import the Options class, instantiate it, and pass the --headless argument to the browser through it. Finally, those settings are handed to Chrome when the browser is created.

The visible and silent modes differ only in those lines; everything after them is identical.
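One small compatibility note: Chrome 109 and later replaced the old headless implementation, and on recent versions the flag is written slightly differently. If '--headless' misbehaves on your machine, try the newer form:

chrome_options.add_argument('--headless=new')  # newer headless-mode flag in recent Chrome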

That is all the knowledge I wanted to cover, so let's finish with the usual end-of-level summary~

We just learned how to use selenium to get data: driver.get('URL').

Its methods for parsing and extracting data: find_element_by_tag_name, find_element_by_class_name, find_element_by_id, find_element_by_name, find_element_by_link_text, and find_element_by_partial_link_text, plus their plural find_elements_by_ forms for extracting all matches.

And the object conversions along the way: the driver loads and parses the page, find_element_by_ returns a WebElement, and .text or .get_attribute() turns it into an ordinary string.
Besides the methods above, selenium can also work together with BeautifulSoup to parse and extract data, provided you first obtain the page source as a string:

html_source_string = driver.page_source  # the fully rendered page source, as a string

Plus the methods that automate browser operations: .send_keys() to type, .click() to click, and .clear() to clear an input box.

Finally, remember to close the browser when you are done, to avoid wasting resources: just add a line of driver.close() at the end of your code.

By now you should feel that selenium is a powerful tool for collecting data from the web. Its advantage is that it is simple and intuitive, though it naturally has shortcomings too.

Since it genuinely simulates a human operating a browser, it must wait for pages to buffer, so it is relatively slow when crawling large volumes of data.

Usually, in crawler projects, selenium is used when problems cannot be solved or are difficult to solve by other methods.

Of course, selenium has many uses beyond crawling; for example, it can control whether images are displayed in a page, and whether CSS and JavaScript are loaded and executed, and so on.

Our course only gets you started with the simple, common operations. To learn more, you can read selenium's official documentation (currently English only):

https://seleniumhq.github.io/selenium/docs/api/py/api.html

You can also refer to this Chinese document:

https://selenium-python-zh.readthedocs.io/en/latest/

In the next level we will cover another practical technique: scheduling and notifications. See you there!
