"Python crawler series explanation" 11. Selenium Weibo crawler based on login analysis

This column takes teacher Yang Xiuzhang's crawler book *Python Web Data Crawling and Analysis: From Beginner to Proficiency* as its main line, with my personal study and understanding as the main content, and is written in the form of study notes.
This column is not only a record of my own study and sharing, but also hopes to popularize some knowledge about crawlers and provide some small crawler ideas.
Column address: Python Web Data Crawling and Analysis: From Beginner to Proficiency
For more crawler examples, please see the column: Python Crawler Hands-On Practice

Previous article review:
"Python crawler series explanation" 1. Network data crawling overview
"Python crawler series explanation" 2. Python knowledge beginners 
"Python crawler series explanation" 3. Regular expression crawler's powerful test 
"Python crawler series explanation" 4. BeautifulSoup Technology
"Python crawler series explanation" 5. Use BeautifulSoup to crawl movie information
"Python crawler series explanation" 6. Python database knowledge
"Python crawler series explanation" 7. BeautifulSoup database-based recruitment crawling
"Python crawler series explanation" 8. Selenium technology
"Python crawler series explanation" 9. Use Selenium to crawl online encyclopedia knowledge
"Python crawler series explanation" 10. Selenium blog crawler based on database storage


Table of Contents

1 Login verification

1.1 Positioning elements

1.2 Open the Chrome browser

1.3 Use Selenium to get elements

1.4 Set a pause to enter the verification code and log in

2 First encounter with Weibo crawlers

2.1 Weibo

2.2 Login entrance

2.2.1 Common login entrances on Sina Weibo

2.2.2 Sina Weibo mobile phone login entrance

2.3 Weibo automatic login

3 Crawl hot Weibo information

3.1 Search the desired Weibo topic

3.2 Crawling Weibo content

3.2.1 Requirements analysis

3.2.2 Analyze the patterns in Weibo's HTML source code

3.2.3 Locate user name

4 Summary of this article


When writing web crawlers in Python, you often encounter situations where data can only be crawled after login verification, such as Qzone data, Sina Weibo data, and mailboxes. Without logging in, some websites only let you crawl the homepage data, and many cannot be crawled at all. At the same time, as social networks become more and more popular, the massive data they produce has more and more application value, and it is often used in fields such as public opinion analysis, text analysis, and recommendation systems.

This article mainly introduces login verification based on Selenium technology and walks through an example of using Selenium to crawl Weibo data.

Before this, I also wrote a similar article; you can click to view it → From login to crawling: using Python to bypass anti-crawling and obtain thousands of public business records from a certain shopping site

1 Login verification

At present, many websites have a login verification page. On the one hand, it improves the security of the website; on the other hand, the website can manage and schedule resources differently according to different users' permissions. For example, Baidu's login verification page requires a user name, password, and captcha. So, if the data the user wants can only be crawled after logging in, perhaps even after entering a captcha, how can this be solved?

Python crawlers have many ways to handle login verification. Common ones include setting the request headers during login, simulating login, and bypassing the login interface. This article mainly explains login verification in combination with Selenium technology.
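As a rough illustration of the first of those approaches, setting the request headers: one common trick is to copy the cookies of an already-logged-in browser session into the request headers so the server treats the script as that session. The sketch below uses only the Python standard library; the URL and cookie value are placeholders, not real credentials.

```python
import urllib.request

# A minimal sketch of the "set the request headers" approach: reuse cookies
# copied from a browser session that is already logged in. The cookie string
# is a placeholder -- in practice, copy the real value from the browser's
# developer tools (F12 -> Network -> request headers).
def build_logged_in_request(url, cookie_string):
    headers = {
        # Pretend to be a normal browser rather than a Python script
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/90.0 Safari/537.36"),
        # Carry the login state captured from the real browser session
        "Cookie": cookie_string,
    }
    return urllib.request.Request(url, headers=headers)

req = build_logged_in_request("https://weibo.cn/", "SUB=placeholder-session-cookie")
print(req.get_header("Cookie"))  # -> SUB=placeholder-session-cookie
```

Cookies expire, so this approach needs the cookie refreshed periodically; the Selenium approach explained below avoids that by performing the login itself.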

Besides crawling, Selenium technology is widely used in automated website testing. It can drive the keyboard and mouse to simulate click operations, so it is well suited to simulating login. Of course, simulated login is not always smooth sailing; sometimes human-machine verification is required: for example, you may have to drag a slider to the correct position, or fill in a captcha, before you can log in.

Suppose you now need to write Python code that automatically logs in to the NetEase 163 mailbox. Only after logging in can you crawl the mailbox's received and sent emails for related data-analysis experiments.

1.1 Positioning elements

First visit the 163 website and locate elements such as the login user name and password. Normally, you can quickly locate the HTML source code of a target element by pressing the F12 key and using the browser's element selector.

As can be seen from the figure above, the elements that need to be located are "<input name="email">" and "<input name="password">", which correspond to the user name and password respectively.

1.2 Open the Chrome browser

Create the Chrome browser driver with driver = webdriver.Chrome(), and then open the target page URL in the browser through the driver.get(url) function.

1.3 Use Selenium to get elements

Call the find_element_by_name() or find_element_by_xpath() function through Selenium to locate the elements corresponding to the 163 mailbox login user name and password, and enter the correct user name and password through the send_keys() function. The core code is as follows:

elem_user = driver.find_element_by_name("email")
elem_user.send_keys("这里填用户名")
elem_pwd = driver.find_element_by_name("password")
elem_pwd.send_keys("这里填密码")

1.4 Set a pause to enter the verification code and log in

If the website requires a captcha, call time.sleep(3) to pause for 3 seconds, enter the captcha manually, and wait for the automatic login. If slider verification is needed, you can refer to the previous article and automate it further by simulating mouse and keyboard operations. After logging in, you can fetch the data you need.

import time
import os
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# Open the Chrome browser (set the driver path and wait for it to load)
chromedriver = 'E:/software/chromedriver_win32/chromedriver.exe'
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)

# Simulate logging in to the 163 mailbox
url = 'https://mail.163.com/'
driver.get(url)

# User name and password (note: these auto-generated ids change on every page load)
driver.find_element_by_xpath('//*[@id="auto-id-1594007552696"]').send_keys('username')
time.sleep(1)
elem_pwd = driver.find_element_by_xpath('//*[@id="auto-id-1594002566766"]')
elem_pwd.send_keys('password')
time.sleep(2)
elem_pwd.send_keys(Keys.RETURN)

time.sleep(3)
driver.close()
driver.quit()

If you run the code above, you will find that you still cannot log in, and may even get an error. This is because the login pages of many websites are dynamically loaded: the ids in the XPath expressions above are auto-generated and change on every page load, so Selenium cannot locate the nodes and the subsequent operations fail. Moreover, to guard against malicious attacks and crawling, the programmers who develop a website often modify its HTML source code from time to time. Still, the general approach has been shown here, and I hope everyone keeps improving it to crawl the data they need.
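One more robust tactic for dynamically loaded pages is to wait explicitly for an element to appear instead of assuming it is already there; Selenium offers this as WebDriverWait with expected_conditions. The snippet below is a simplified, library-free sketch of the same polling idea, where fake_locate is a hypothetical stand-in for a real element lookup:

```python
import time

# A simplified sketch of the polling pattern behind Selenium's WebDriverWait:
# keep retrying a locate function until it succeeds or a timeout expires,
# instead of failing immediately on a page that is still loading.
def wait_for(locate, timeout=10, poll=0.5):
    deadline = time.time() + timeout
    last_error = None
    while time.time() < deadline:
        try:
            result = locate()
            if result is not None:
                return result
        except Exception as e:  # element not present yet
            last_error = e
        time.sleep(poll)
    raise TimeoutError("element did not appear within {}s: {}".format(timeout, last_error))

# Toy demonstration: the "element" only becomes available on the third poll.
calls = {"n": 0}
def fake_locate():
    calls["n"] += 1
    return "element" if calls["n"] >= 3 else None

print(wait_for(fake_locate, timeout=5, poll=0.01))  # -> element
```

In real Selenium code the equivalent would be roughly `WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.NAME, "username")))`, though even that cannot help when the id itself changes on every load; in that case, locate by a stable attribute such as name or class instead.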

2 First encounter with Weibo crawlers

2.1 Weibo

Weibo, short for MicroBlog, is a broadcast-style social media and network platform that shares short, real-time information through a follow mechanism based on user relationships, supporting information sharing, dissemination, and acquisition. Users can access it through the Web, WAP, Mail, App, IM, and SMS, on various terminals such as PCs and mobile phones, and share and spread information instantly in multimedia forms such as text, pictures, and videos.

As a sharing and communication platform, Weibo pays more attention to timeliness and spontaneity: it is better at expressing one's thoughts and latest activities at any moment, while a blog focuses more on organizing what one has seen, heard, and felt over a period of time. Common Weibo platforms include Sina Weibo, Tencent Weibo, NetEase Weibo, Sohu Weibo, etc. Unless otherwise specified, "Weibo" in this article refers to Sina Weibo.

The official address of the Sina Weibo web version is https://weibo.com/ , and the interface after login is shown in the figure below. You can see popular Weibo posts, posts of special interest, dynamic information, etc. Each post usually includes the user name, content, reading volume, number of comments, and number of likes.

When you click on a profile, you can view that user's personal information, basic information, and the celebrities they follow or their fans. This information is of great value for social network analysis, public opinion analysis, graph relationship analysis, and Weibo user profiling.

2.2 Login entrance

Why log in? Because without logging in, much of the data on Sina Weibo cannot be obtained or accessed, such as fan lists and personal information; opening those hyperlinks directly automatically redirects to the login page. This is a measure the developers take to protect Weibo. Software companies usually also provide API interfaces for developers to access Weibo data or perform operations, but here we use Selenium to simulate browser operations for login verification.

First, you need to find the Weibo login entrance. Open the URL " https://weibo.com/ " and the home page shown in the figure below is displayed, with the login area on the right. However, the site uses HTTPS authentication, which makes it more secure, and the login button is dynamically loaded, which prevents Selenium from locating it, so we need to find another login entrance.

2.2.1 Common login entrances on Sina Weibo

Sina Weibo common login entry URL: https://login.sina.com.cn/  or  https://login.sina.com.cn/signup/signin.php

2.2.2 Sina Weibo mobile phone login entrance

The mobile version of Sina Weibo serves the data of the Weibo mobile app. The data is more condensed, the pictures are smaller, and the pages load faster, making it suitable for real-time access on mobile devices. Its entry address is https://weibo.cn/ or https://weibo.cn/pub/ . You can see that the information on the Sina Weibo mobile page is still quite complete.

Next, we explain how to log in to Weibo automatically and how to crawl hot topics, a particular user's Weibo posts, and so on.

2.3 Weibo automatic login

First, open the target URL in the browser, press the F12 key, locate the "login name" and "password" fields through the element selector, and view the HTML source code of the relevant buttons, as shown in the figure below.

We can locate the node whose id attribute is "username" and whose name attribute is "username" to find the "login name" text box, or locate the second input node under the <li class="item"> path. Here we use the Selenium library's functions to locate the node. The core code is as follows:

elem_user = driver.find_element_by_name("username")
elem_user.send_keys("登录名")

In the same way, we then locate the HTML source code of the "Password" text box. Here is the core code:

elem_pwd = driver.find_element_by_name("password")
elem_pwd.send_keys("密码")

Call the find_element_by_xpath() function to locate the "login" button node, and then call the click() function to click it and log in. The code is as follows:

elem_sub = driver.find_element_by_xpath("//input[@class='W_btn_a btn_34px']")
elem_sub.click()    # click the "login" button

You can also log in by pressing the Enter key, that is, elem_pwd.send_keys(Keys.RETURN). Finally, the complete code for automatically logging in to Sina Weibo with Selenium is given below; after the account and password are entered, the login button is clicked.

import time
import os
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# Open the Chrome browser (set the driver path and wait for it to load)
chromedriver = 'E:/software/chromedriver_win32/chromedriver.exe'
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)

# Simulate logging in to Sina Weibo
url = 'https://login.sina.com.cn/signup/signin.php'
driver.get(url)

driver.implicitly_wait(10) # implicit wait (in seconds): stops waiting once the page has rendered
driver.maximize_window() # maximize the browser window

# User name and password
elem_user = driver.find_element_by_name("username")
elem_user.send_keys("your account")
elem_pwd = driver.find_element_by_name("password")
elem_pwd.send_keys("your password")
elem_pwd.send_keys(Keys.RETURN)

# Pause 20 s to enter the captcha manually
time.sleep(20)
elem_sub = driver.find_element_by_xpath('//input[@class="W_btn_a btn_34px"]')
elem_sub.click()    # click the "login" button

print("Login successful!")
driver.close()
driver.quit()

Note: Since a captcha must be entered when logging in to Weibo, and the captcha only appears after the "login" button is clicked, the program automatically enters the account and password and presses the Enter key, after which a captcha prompt pops up. The time.sleep(20) call then pauses for 20 s so that the captcha can be entered manually, completing the login. The figure below shows the process of logging in successfully after entering the account, password, and captcha.

3 Crawl hot Weibo information

The following explains how to use Python to crawl Weibo data on a particular topic.

3.1 Search the desired Weibo topic

After logging in to Weibo, a search box appears at the top of the page for keyword searches. As before, press the F12 key and use the element selector to select the target location and view its HTML source code. As you can see, it is located at the <input id="search_input"> position.

Then use the driver.find_element_by_xpath() function to locate the search text box. The core code is as follows:

elem_topic = driver.find_element_by_xpath("//input[@id='search_input']")
elem_topic.send_keys("高考")    # topic keyword: "gaokao" (college entrance examination)
elem_topic.send_keys(Keys.RETURN)

Here, however, we use another method to enter keywords and search Weibo topics: visit the "Weibo Search" page (URL: https://s.weibo.com/ ) and locate the HTML source code of its search text box, as follows:

Call the find_element_by_xpath() function to locate the search text box, and press the Enter key to search and jump. The core code is as follows:

elem_topic = driver.find_element_by_xpath('//*[@id="pl_homepage_search"]/div/div[2]/div/input')
elem_topic.send_keys("高考")
elem_topic.send_keys(Keys.RETURN)

The complete code of the Weibo search part is as follows:

import time
import os
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# Open the Chrome browser (set the driver path and wait for it to load)
chromedriver = 'E:/software/chromedriver_win32/chromedriver.exe'
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)

try:
    # Visit the Sina Weibo search page
    url = 'https://s.weibo.com/'
    driver.get(url)

    driver.implicitly_wait(10) # implicit wait (in seconds): stops waiting once the page has rendered
    driver.maximize_window() # maximize the browser window

    # Enter the topic and press Enter to search
    elem_topic = driver.find_element_by_xpath('//*[@id="pl_homepage_search"]/div/div[2]/div/input')
    elem_topic.send_keys("高考")    # topic keyword: "gaokao" (college entrance examination)
    elem_topic.send_keys(Keys.RETURN)
    time.sleep(5)

except Exception as e:
    print('Error: ', e)

finally:
    print('Crawling finished!')

3.2 Crawling Weibo content

After the search results are returned, you can crawl the corresponding Weibo content. Again, this uses the browser's element-inspection feature to locate nodes; because this technique can identify the HTML source code of the content to be crawled, it is widely used in web crawlers.

3.2.1 Requirements analysis

Determine what information to obtain from each Weibo post. As shown in the figure below, the information obtained includes the user name, content, release time, number of reposts, number of comments, and number of likes. Among them, the numbers of reposts, comments, and likes can be used to analyze the popularity of posts and to build user profiles.

3.2.2 Analyze the patterns in Weibo's HTML source code

Weibo's HTML source code usually presents search results as a list. For example, if you search for the topic "高考" (college entrance examination), it returns many posts on the topic, laid out in sequence, as shown in the figure below, and every post uses the same layout. As shown in the figure above, we need to inspect the source code to find its patterns. Follow the previously described method to view the HTML source code of the target location.

Each post is located under a <div class="card-feed">...</div> node. Multiple posts can be obtained through the find_elements_by_xpath() function, and then the core information can be extracted from each: the user name, content, release time, number of reposts, number of comments, number of likes, and so on. The core code is as follows:

info = driver.find_elements_by_xpath('//div[@class="card-feed"]')

for value in info:
    print(value.text)
    content = value.text

At this point, the crawled content is shown in the figure below, and the required fields can be extracted in turn using regular expressions and string operations.
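To make that extraction step concrete, here is a small sketch that pulls fields out of one card's text using string splitting and a regular expression. The sample string only imitates the general shape of value.text; the real layout may differ, so treat the line positions and patterns as assumptions to adjust against actual output:

```python
import re

# Hypothetical sample imitating the text of one crawled card (value.text):
# line 1 user name, line 2 time/source, line 3 content, line 4 counts.
sample = """某用户
今天 12:30 来自 微博网页版
高考加油!祝大家都能考上理想的大学。
转发 120 评论 45 赞 678"""

lines = sample.split("\n")
username = lines[0]                      # first line: user name
time_and_source = lines[1]               # second line: release time and source
content = lines[2]                       # third line: post text
counts = re.findall(r"\d+", lines[3])    # numbers: reposts, comments, likes

print(username, counts)  # -> 某用户 ['120', '45', '678']
```

In a real crawl, each `value.text` from the find_elements_by_xpath() loop above would take the place of `sample`.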

3.2.3 Locate user name

The user name is the first hyperlink under the <div class="info">...</div> node; its corresponding source code is shown in the following figure:

The core Python code for locating the user name is:

YHM = driver.find_element_by_xpath('//*[@id="pl_feedlist_index"]/div[1]/div[6]/div[2]/div[1]/div[2]/div[1]/div[2]/a[1]')    # YHM: user-name element ("yong hu ming")

Finally, the complete code of this article is given, for reference only:

import time
import os
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# Open the Chrome browser (set the driver path and wait for it to load)
chromedriver = 'E:/software/chromedriver_win32/chromedriver.exe'
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)


# Log in to Weibo
def LoginWeibo(username, password):
    print('Preparing to log in to Weibo...')
    driver.get('https://login.sina.com.cn/')
    # User name and password
    elem_user = driver.find_element_by_name("username")
    elem_user.send_keys(username)
    elem_pwd = driver.find_element_by_name("password")
    elem_pwd.send_keys(password)
    elem_pwd.send_keys(Keys.RETURN)

    # Pause 20 s to enter the captcha manually
    time.sleep(20)
    elem_sub = driver.find_element_by_xpath('//input[@class="W_btn_a btn_34px"]')
    elem_sub.click()    # click the "login" button

    print("Login successful!")

# Search a Weibo topic and print the results
def SearchWeibo(topic):
    try:
        # Visit the Sina Weibo search page
        url = 'https://s.weibo.com/'
        driver.get(url)

        driver.implicitly_wait(10)  # implicit wait (in seconds): stops waiting once the page has rendered
        driver.maximize_window()  # maximize the browser window

        # Enter the topic and press Enter to search
        elem_topic = driver.find_element_by_xpath('//*[@id="pl_homepage_search"]/div/div[2]/div/input')
        elem_topic.send_keys(topic)
        elem_topic.send_keys(Keys.RETURN)
        time.sleep(5)

        # Get the user names
        for i in range(1, 11):
            elem_name = driver.find_elements_by_xpath('//*[@id="pl_feedlist_index"]/div[1]/div[{}]/div[2]/div[1]/div[2]/div[1]/div[2]/a[1]'.format(i))
            for value in elem_name:
                print(value.text)

        # Get the contents
        for i in range(1, 11):
            elem_content = driver.find_elements_by_xpath('//*[@id="pl_feedlist_index"]/div[1]/div[{}]/div[2]/div[1]/div[2]/p[1]'.format(i))
            for value in elem_content:
                print(value.text)

        # Get the release times
        for i in range(1, 11):
            elem_time = driver.find_elements_by_xpath('//*[@id="pl_feedlist_index"]/div[1]/div[{}]/div[2]/div[1]/div[2]/p[3]/a[1]'.format(i))
            for value in elem_time:
                print(value.text)

        # Get the post sources ("from")
        for i in range(1, 11):
            elem_from = driver.find_elements_by_xpath('//*[@id="pl_feedlist_index"]/div[1]/div[{}]/div[2]/div[1]/div[2]/p[3]/a[2]'.format(i))
            for value in elem_from:
                print(value.text)

        # Get the comment counts
        for i in range(1, 11):
            elem_PLnumber = driver.find_elements_by_xpath('//*[@id="pl_feedlist_index"]/div[1]/div[{}]/div[2]/div[2]/ul/li[3]/a'.format(i))
            for value in elem_PLnumber:
                print(value.text)

        # Get the repost counts
        for i in range(1, 11):
            elem_ZFnumber = driver.find_elements_by_xpath('//*[@id="pl_feedlist_index"]/div[1]/div[{}]/div[2]/div[2]/ul/li[2]/a'.format(i))
            for value in elem_ZFnumber:
                print(value.text)

        # Get the like counts
        for i in range(1, 11):
            elem_DZnumber = driver.find_elements_by_xpath('//*[@id="pl_feedlist_index"]/div[1]/div[{}]/div[2]/div[2]/ul/li[4]/a/em'.format(i))
            for value in elem_DZnumber:
                print(value.text)

    except Exception as e:
        print('Error: ', e)

    finally:
        print('Crawling finished!')

# Main
if __name__ == '__main__':
    # Define the user name, password, and topic
    username = 'your account'
    password = 'your password'
    topic = '高考'    # topic keyword: "gaokao" (college entrance examination)
    # Log in to Weibo
    LoginWeibo(username, password)
    # Search the hot topic
    SearchWeibo(topic)

4 Summary of this article

When designing web crawlers in Python, you often encounter situations where data can only be crawled after logging in, sometimes even after entering a captcha, for example on Weibo, Zhihu, mailboxes, Qzone, etc. A common solution is to simulate login by setting the request headers. This article introduced another method: using Selenium technology to drive the browser, operating the mouse and keyboard to enter the user name and password automatically, and then submitting the form to log in. If a captcha is required during login, time.sleep() can pause the program while the captcha is entered manually, after which the required information can be crawled. This method can handle logins to Weibo, mailboxes, Baidu, Taobao, and more. Pay special attention: when crawling massive amounts of data in a short time, the anti-crawler mechanisms of some websites, such as Weibo or Taobao, will detect your crawler and block your current IP; this is solved with an IP proxy. Of course, more practical applications still require in-depth research and analysis.
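As a minimal illustration of the IP-proxy idea mentioned above, the sketch below picks a random proxy from a pool before each batch of requests. The addresses are placeholders, not working proxies; with Selenium, the chosen proxy could then be passed to Chrome through an option such as --proxy-server.

```python
import random

# Placeholder proxy pool -- in practice you would collect or purchase
# working proxy servers and replace these dummy addresses.
PROXY_POOL = [
    "111.111.111.111:8080",
    "122.122.122.122:8080",
    "133.133.133.133:8080",
]

def pick_proxy(pool):
    """Pick a random proxy for the next batch of requests."""
    return random.choice(pool)

proxy = pick_proxy(PROXY_POOL)
print(proxy in PROXY_POOL)  # -> True
```

With Selenium's Chrome driver, a rough (unverified here) way to apply it would be adding the argument `--proxy-server=http://` plus the chosen address to the browser's launch options before creating the driver.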


Welcome to leave a message, learn and communicate together~

Thanks for reading

END

Origin blog.csdn.net/IT_charge/article/details/107149950