Introductory crawler study notes, Day 5, plus a record of small problems encountered

1. Extracting text content and attribute values from an element object

1. Get the text: element.text
2. Get an attribute value: element.get_attribute("attribute name")

Code: (the body of the for loop is rewritten from what was covered on Day 4)

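from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://bj.58.com/chuzu/?PGTID=0d200001-0000-1fd0-27cd-e5f5a55e18c9&ClickID=1'
driver = webdriver.Chrome()
driver.get(url)

# Extract the title links
# (find_elements_by_xpath() is deprecated in Selenium 4, so find_elements() is used here)
el_list = driver.find_elements(By.XPATH, '/html/body/div[6]/div[2]/ul/li/div[2]/h2/a')

for el in el_list:
    # Print the link text and the value of its href attribute
    print(el.text, el.get_attribute('href'))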

The for loop prints each title's text followed by the value of its href attribute.
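To try the two accessors without opening a browser, here is a minimal sketch that collects (text, href) pairs from any objects exposing .text and .get_attribute(). The helper name extract_links and the stand-in class FakeElement are hypothetical and are not part of Selenium; with a real driver you would pass in the list returned by find_elements().

```python
def extract_links(elements):
    """Return (text, href) pairs for elements that have an href attribute."""
    pairs = []
    for el in elements:
        href = el.get_attribute('href')  # attribute value, or None if absent
        if href:
            pairs.append((el.text.strip(), href))
    return pairs


class FakeElement:
    """Hypothetical stand-in for a WebElement so the sketch runs without a browser."""

    def __init__(self, text, attrs):
        self.text = text
        self._attrs = attrs

    def get_attribute(self, name):
        return self._attrs.get(name)


sample = [
    FakeElement(' Two-bedroom flat ', {'href': 'https://example.com/1'}),
    FakeElement('No link here', {}),
]
print(extract_links(sample))
```

Skipping elements without an href is a design choice of this sketch, not something the original loop does.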

2. Switching tabs

1. Get the window handles (a list)
A handle is an identifier that points to a tab object.
current_windows = driver.window_handles

2. Switch tabs using a window handle (selected by list index)
driver.switch_to.window(current_windows[0])

Code:
On the 58.com home page, find the "rent" button, right-click it, choose Inspect, copy its XPath, and pass it to find_element().

from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'http://jn.58.com/'
driver = webdriver.Chrome()
driver.get(url)

# Print the URL and the handles first as a check
print(driver.current_url)
print(driver.window_handles)

# Locate and click the "rent" button
el = driver.find_element(By.XPATH,'/html/body/div[3]/div[1]/div[1]/div/div[1]/div[1]/span[1]/a')
el.click()

# Print the URL and the handles again after the click
print(driver.current_url)
print(driver.window_handles)

The code above prints the URL and the list of handles before and after the click. (I first used find_element_by_xpath(), which produced an error; the fix is described in the problems section at the end of these notes, and I ended up using find_element() instead.) Before the click the handle list contains one entry; after the click there are two windows, so it contains two. The printed URL is the home page's both times, which shows that the driver keeps operating on the home page and does not automatically move to the newly opened page.

To switch windows, add the code below after the code above.
Note: copy the XPath of a title on the newly opened page into find_elements(), and delete the index on the li tag so that all titles are selected.

driver.switch_to.window(driver.window_handles[-1])

el_list = driver.find_elements(By.XPATH,'/html/body/div[6]/div[2]/ul/li/div[2]/h2/a')
print(len(el_list)) # After the switch, the length is no longer 0

If switch_to.window() is commented out, the printed length is 0, because the driver is still on the home page, which has no elements matching that XPath.
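Indexing with [-1] assumes the new tab is the last entry in the handle list. As far as I know the order of window_handles is not guaranteed, so a slightly more defensive sketch is to record the handles before the click and look for the one that was not there before. The function names find_new_handle and switch_to_new_tab are my own and hypothetical.

```python
def find_new_handle(handles_before, handles_after):
    """Return the first handle present after the click but not before, or None."""
    before = set(handles_before)
    for handle in handles_after:
        if handle not in before:
            return handle
    return None


def switch_to_new_tab(driver, handles_before):
    """Switch the driver to the tab opened since handles_before was recorded."""
    new_handle = find_new_handle(handles_before, driver.window_handles)
    if new_handle is not None:
        driver.switch_to.window(new_handle)
    return new_handle
```

In the script above this would mean saving before = driver.window_handles ahead of el.click() and calling switch_to_new_tab(driver, before) afterwards.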

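3. Switching into a frame (iframe)

Take logging in to Qzone (QQ space) as an example:
Inspect the page to find the id of the 'account password login' button, the ids of the account and password input boxes, and the id of the login button. The login form sits inside a frame whose id is login_frame, so the driver has to switch into that frame before it can locate these elements.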

Code: (replace the placeholders in send_keys() below with your own account and password)

from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://qzone.qq.com/'
driver = webdriver.Chrome()
driver.get(url)

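# The login form is inside a frame, so locate the frame and switch into it first
el_frame = driver.find_element(By.XPATH, '//*[@id="login_frame"]')
driver.switch_to.frame(el_frame)

# Choose "account password login", fill in the credentials, and submit
driver.find_element(By.ID, 'switcher_plogin').click()
driver.find_element(By.ID, 'u').send_keys('your account')
driver.find_element(By.ID, 'p').send_keys('your password')
driver.find_element(By.ID, 'login_button').click()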
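Once the driver is inside a frame, later lookups only see that frame's content, so it has to switch back with driver.switch_to.default_content() before touching the main page again. A minimal sketch that makes this automatic is a small context manager; the name inside_frame is hypothetical and not part of Selenium.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame_reference):
    """Switch into a frame for the duration of a with-block, then switch back."""
    driver.switch_to.frame(frame_reference)
    try:
        yield driver
    finally:
        # Return to the top-level document so later lookups see the main page.
        driver.switch_to.default_content()
```

With the login example this could be used as: with inside_frame(driver, el_frame): followed by the four find_element() lines indented inside the block.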

4. Working with cookies

Code:

from selenium import webdriver

url = 'https://www.baidu.com/'
driver = webdriver.Chrome()
driver.get(url)

# print(driver.get_cookies())

# cookies = {}
# for data in driver.get_cookies():
#     cookies[data['name']] = data['value']
# The three lines above can be written as a dict comprehension:
cookies = {data['name']: data['value'] for data in driver.get_cookies()}

print(cookies)

Result: the script prints a dictionary that maps each cookie name to its value.
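The conversion can be tried without a browser. The sketch below applies the same dict comprehension to made-up sample data shaped like the list that get_cookies() returns (a list of dicts with 'name' and 'value' keys), and then joins the result into a Cookie header string. The function names and the sample cookies are hypothetical.

```python
def cookies_to_dict(cookie_list):
    """Collapse a list of cookie dicts into a {name: value} mapping."""
    return {c['name']: c['value'] for c in cookie_list}


def cookie_header(cookie_dict):
    """Join a {name: value} mapping into a 'name=value; name=value' string."""
    return '; '.join('{}={}'.format(name, value) for name, value in cookie_dict.items())


# Made-up sample data in the shape of driver.get_cookies() output
sample_cookies = [
    {'name': 'session', 'value': 'abc123', 'domain': '.example.com'},
    {'name': 'lang', 'value': 'en', 'domain': '.example.com'},
]

cookies = cookies_to_dict(sample_cookies)
print(cookies)                 # {'session': 'abc123', 'lang': 'en'}
print(cookie_header(cookies))  # session=abc123; lang=en
```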

5. Executing JavaScript code

Sometimes the button to click on a newly loaded page is not in view (you have to scroll down to see it).

Dragging the scroll bar:

js = 'scrollTo(x, y)'

Here x is usually 0; to scroll down, y must be greater than 0.

code:

from selenium import webdriver
import time

url = 'https://www.example.com/'  # placeholder: replace with the page you want to scroll
driver = webdriver.Chrome()
driver.get(url)

# JavaScript statement: scroll to the bottom of the page
js = 'window.scrollTo(0,document.body.scrollHeight)'
# Execute the JavaScript statement
driver.execute_script(js)

time.sleep(5)
driver.quit()
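When the page loads more content as you scroll, jumping straight to the bottom can skip over it. A rough sketch of scrolling in fixed steps instead is shown here; scroll_to_js and scroll_in_steps are hypothetical helper names, and the only driver method they rely on is execute_script().

```python
def scroll_to_js(x, y):
    """Build the JavaScript that scrolls the window to absolute position (x, y)."""
    return 'window.scrollTo({}, {})'.format(int(x), int(y))


def scroll_in_steps(driver, step=500, steps=3):
    """Scroll down step by step and return the scripts that were executed."""
    executed = []
    for i in range(1, steps + 1):
        js = scroll_to_js(0, i * step)
        driver.execute_script(js)
        executed.append(js)
    return executed
```

In practice you would probably add a short pause between steps so the page has time to load.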

6. Page waits

There are three kinds of page wait:

1. Forced wait

Sleep for a fixed time:

time.sleep(5)

2. Implicit wait (recommended)

Set a maximum waiting time. If the element is located before the time is up, execution moves on to the next step immediately.

driver.implicitly_wait(10)

3. Explicit wait (good to know)

Wait explicitly for a particular element, rechecking a condition until it holds or a timeout expires.
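The idea behind an explicit wait can be sketched in plain Python: call a condition repeatedly until it returns something truthy or a deadline passes. This is only an illustration of the polling idea, not Selenium's implementation; wait_until is a hypothetical name. Selenium ships its own version of this as WebDriverWait in selenium.webdriver.support.ui.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within {} seconds'.format(timeout))
        time.sleep(poll_interval)
```

With a real driver the condition could be something like lambda: driver.find_elements(By.XPATH, '...'), since find_elements() returns an empty list when nothing matches.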

4. Example: (scrolling the Taobao home page until an element appears)

Code:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
import time

url = 'https://www.taobao.com/'
driver = webdriver.Chrome()
driver.get(url)
time.sleep(1)

for i in range(1, 11):
    try:
        time.sleep(3)
        element = driver.find_element(By.XPATH, '//div[@class="shop-inner"]/h3[1]/a')
        print(element.get_attribute('href'))
        break
    except NoSuchElementException:
        # Not found yet: scroll 500 pixels further down and try again
        js = 'window.scrollTo(0,{})'.format(i * 500)
        driver.execute_script(js)

driver.quit()

7. Configuration object

Enabling headless mode

Code:

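from selenium import webdriver

url = 'http://www.baidu.com/'

# Create the configuration object
opt = webdriver.ChromeOptions()

# Add configuration arguments
opt.add_argument('--headless')
opt.add_argument('--disable-gpu')

# Pass the configuration object when creating the browser object
driver = webdriver.Chrome(options=opt)

driver.get(url)

# Save a screenshot to confirm the page loaded without a visible window
driver.save_screenshot('headless_browser_screenshot.png')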

Small problems encountered

1. The syntax of Selenium's element-locating methods has changed

Error: DeprecationWarning: find_element_by_* commands are deprecated. Please use find_element() instead
The line that triggered it:
el = driver.find_element_by_xpath
Since I want to locate by XPath here, the fix is to call find_element(By.XPATH, xpath) instead of find_element_by_xpath(xpath).

Similarly, find_element_by_id, find_element_by_class_name, and the other find_element_by_* methods are replaced by find_element() with the matching By constant.

Finally, remember to import By:

from selenium.webdriver.common.by import By
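As a quick reference while migrating old code, this sketch maps each legacy method name to the equivalent new-style call and prints it as text. It does not import or call Selenium; the names LEGACY_TO_BY and modernize are hypothetical, while the By constant names on the right-hand side are the ones Selenium defines.

```python
# Old helper suffix -> name of the matching By constant.
LEGACY_TO_BY = {
    'id': 'ID',
    'name': 'NAME',
    'xpath': 'XPATH',
    'link_text': 'LINK_TEXT',
    'partial_link_text': 'PARTIAL_LINK_TEXT',
    'tag_name': 'TAG_NAME',
    'class_name': 'CLASS_NAME',
    'css_selector': 'CSS_SELECTOR',
}


def modernize(method_name, locator):
    """Rewrite a legacy call such as find_element_by_xpath(locator) as new-style text."""
    for prefix in ('find_elements_by_', 'find_element_by_'):
        if method_name.startswith(prefix):
            suffix = method_name[len(prefix):]
            new_method = prefix[:-len('_by_')]  # 'find_element' or 'find_elements'
            return '{}(By.{}, {!r})'.format(new_method, LEGACY_TO_BY[suffix], locator)
    raise ValueError('not a legacy locator method: ' + method_name)


print(modernize('find_element_by_xpath', '//h2/a'))
print(modernize('find_elements_by_class_name', 'title'))
```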

2. The chrome_options parameter reports an error when passing the configuration object

The original code:

# Pass the configuration object when creating the browser object
driver = webdriver.Chrome(chrome_options=opt)

Error:

DeprecationWarning: use options instead of chrome_options

As the message says, the chrome_options parameter has been deprecated.

Just replace chrome_options with options, that is, webdriver.Chrome(options=opt).

Origin blog.csdn.net/qq_51669241/article/details/122530796