Please set up the Python environment first O(∩_∩)O
Preliminary work
First, install ChromeDriver.
Download site: ChromeDriver
Find the download matching your Chrome version on the site:
Unzip it and move it into place:
If running it prints a version number, the installation succeeded.
On macOS, remember to grant the binary permission to run!
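To confirm the driver is installed correctly, here is a small sketch (assuming `chromedriver` is on your PATH) that runs the version check from Python; the helper names are my own, not part of the original script:

```python
import re
import subprocess

def parse_chromedriver_version(output):
    """Extract the version number from `chromedriver --version` output."""
    match = re.search(r'ChromeDriver\s+([\d.]+)', output)
    return match.group(1) if match else None

def check_chromedriver():
    """Run the binary and return its version, or None if it is not on PATH."""
    try:
        out = subprocess.run(['chromedriver', '--version'],
                             capture_output=True, text=True).stdout
    except FileNotFoundError:
        return None
    return parse_chromedriver_version(out)
```

If `check_chromedriver()` returns a version string, selenium should be able to launch Chrome.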
First, the figure above shows the result:
Only part of the output is shown, because there are many data sources; the complete code is at the end.
The principle: the code drives a real browser, so once the asynchronously loaded data has rendered, the full page content can be read
(this website is only an example; the same approach works on other sites).
Required libraries; remember to import them:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By  # this and the two imports below are used for explicit waits
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
Login part:
Inspect the page data first: open the login page and examine it
(how to inspect a web page is not covered here).
1. Get the XPath of the two input fields
Code to fill in the username and password:
url = 'http://www.mirrorpsy.cn/index.php?s=member&c=login&m=index'
driver = webdriver.Chrome()
driver.get(url)
wait = WebDriverWait(driver, 25)
driver.maximize_window()  # maximize the window
user = wait.until(
    EC.presence_of_element_located((By.XPATH, '//*[@id="myform"]/div/div[1]/input')))
user.send_keys('xxxx')  # real credentials not shown
pw = wait.until(
    EC.presence_of_element_located((By.XPATH, '//*[@id="myform"]/div/div[2]/input')))
pw.send_keys('xxxxx')
print('login param input success')
Run it and you will see the browser launch automatically and type in the username and password.
2. Login
Get the XPath of the login button:
login = wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="myform"]/div/div[4]/button')))  # login button
login.click()
print('login success')
time.sleep(10)
Sleep for 10 s so the page can fully load.
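A fixed sleep works but wastes time when the page loads faster. Selenium's WebDriverWait implements the same idea as this generic polling helper (a sketch of the pattern, not selenium's actual code): call a predicate repeatedly until it returns a truthy value or a timeout expires.

```python
import time

def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll predicate() until it returns a truthy value or timeout seconds pass.

    Returns the truthy value, or raises TimeoutError, mirroring how
    WebDriverWait.until behaves.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError('condition not met within %.1f s' % timeout)
```

With selenium you would pass a lambda that checks for a post-login element instead of sleeping a fixed 10 s.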
3. Cookies
Some sites require cookies for subsequent requests. This site does not, but here is the code that collects them anyway:
cookies = driver.get_cookies()
cookie_list = []
for c in cookies:  # avoid shadowing the built-in name dict
    cookie_list.append(c['name'] + '=' + c['value'])
cookie = ';'.join(cookie_list)
print('cookie: ' + cookie)
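The loop above can be packaged as a one-line helper. Given the list of dicts that `driver.get_cookies()` returns, it builds the header string; the sample data below is made up for illustration:

```python
def cookies_to_header(cookies):
    """Join selenium cookie dicts into a single 'name=value;name=value' string."""
    return ';'.join(c['name'] + '=' + c['value'] for c in cookies)

# Example with fabricated cookie dicts:
# cookies_to_header([{'name': 'PHPSESSID', 'value': 'abc123'},
#                    {'name': 'lang', 'value': 'zh'}])
# -> 'PHPSESSID=abc123;lang=zh'
```

The resulting string can be sent as a `Cookie` header if you later switch to plain HTTP requests.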
4. Crawl the data and extract the image links
This block is the list of images you need (pagination is covered later).
Inspect the XPath of the ul tag directly (see the inspection note above):
//*[@id="cn_perio_list"]
Then get the li elements under the ul:
items = driver.find_element(By.XPATH, '//*[@id="cn_perio_list"]')
lis = items.find_elements(By.XPATH, './li')
The result is a list; iterate over it, find the img child of each li, and read its src attribute:
image_list = []
for li in lis:
    img = li.find_element(By.TAG_NAME, 'img')
    print(img.get_attribute('src'))
    image_list.append(img.get_attribute('src'))
That takes care of one page of data.
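Once image_list is filled you will probably want the files themselves. Here is a sketch that derives a filename from each URL and downloads it with the standard library; `download_images` is my own helper, not part of the original script, and the site may require cookies or headers for the actual requests:

```python
import os
import urllib.request
from urllib.parse import urlparse

def filename_from_url(url):
    """Take the last path segment of the URL as the local filename."""
    name = os.path.basename(urlparse(url).path)
    return name or 'unnamed.jpg'

def download_images(urls, dest='images'):
    """Fetch each image URL into the dest directory (network calls!)."""
    os.makedirs(dest, exist_ok=True)
    for url in urls:
        path = os.path.join(dest, filename_from_url(url))
        urllib.request.urlretrieve(url, path)  # may need cookies/headers on some sites
```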
Pagination
There are pagination links below the list, but this site is unusual: the page count is loaded dynamically and the content of the last page keeps changing.
So we simply request the next page in a loop and break out when the ul contains no li. The code for reading the page count is also shown (though it turned out to be useless):
# I originally planned to parse the page count, but it changes dynamically, so it is unused
pages = driver.find_element(By.XPATH, '//*[@id="laypage_1"]')
a = pages.find_elements(By.XPATH, './a')
pageEle = pages.find_element(By.XPATH, '//*[@id="laypage_1"]/a[' + str(len(a) - 1) + ']')
print('page:' + str(pageEle.text))
page = int(pageEle.text)
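Since the page count is unreliable, the crawl just increments a page parameter until a page comes back empty. That logic can be isolated from selenium entirely (`photo_page_url` and `crawl_pages` are hypothetical helpers built around the site's observed URL pattern):

```python
def photo_page_url(page):
    """Build the URL of one photo-list page (1-based), per the site's URL scheme."""
    return 'http://www.mirrorpsy.cn/index.php?s=photo&c=search&page=' + str(page)

def crawl_pages(fetch_lis):
    """Call fetch_lis(url) for page 1, 2, ... and stop on the first empty page.

    fetch_lis is expected to return the list of items found on that page,
    e.g. a wrapper around driver.get + find_elements.
    """
    results = []
    page = 1
    while True:
        lis = fetch_lis(photo_page_url(page))
        if not lis:
            break
        results.extend(lis)
        page += 1
    return results
```

In the real script, `fetch_lis` would drive the browser; the stop condition stays the same.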
That completes the crawl.
Complete code:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By  # this and the two imports below are used for explicit waits
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

if __name__ == "__main__":
    url = 'http://www.mirrorpsy.cn/index.php?s=member&c=login&m=index'
    driver = webdriver.Chrome()
    driver.get(url)
    wait = WebDriverWait(driver, 25)
    driver.maximize_window()  # maximize the window
    user = wait.until(
        EC.presence_of_element_located((By.XPATH, '//*[@id="myform"]/div/div[1]/input')))
    user.send_keys('xxx')  # leave a comment if you want to test
    pw = wait.until(
        EC.presence_of_element_located((By.XPATH, '//*[@id="myform"]/div/div[2]/input')))
    pw.send_keys('xxxx')
    print('login param input success')
    login = wait.until(EC.element_to_be_clickable(
        (By.XPATH, '//*[@id="myform"]/div/div[4]/button')))  # login button
    login.click()
    print('login success')
    time.sleep(10)
    cookies = driver.get_cookies()
    cookie_list = []
    for c in cookies:  # avoid shadowing the built-in name dict
        cookie_list.append(c['name'] + '=' + c['value'])
    cookie = ';'.join(cookie_list)
    print('cookie: ' + cookie)
    print('get knowledge')
    # subject knowledge
    url_knowledge = 'http://www.mirrorpsy.cn/index.php?s=xkzs'
    # psychology videos
    url_video = 'http://www.mirrorpsy.cn/index.php?s=video'
    # psychology audio
    url_jyxl = 'http://www.mirrorpsy.cn/index.php?s=jyxl'
    # psychology images
    url_photo = 'http://www.mirrorpsy.cn/index.php?s=photo'
    # academic papers
    url_xslw = 'http://www.mirrorpsy.cn/index.php?s=xslw'
    # psychology cases
    url_xlal = 'http://www.mirrorpsy.cn/index.php?s=xlal'
    # psychology experts
    url_news = 'http://www.mirrorpsy.cn/index.php?s=news'
    # driver.get(url_photo)
    # I originally planned to parse the page count, but it changes dynamically, so it is unused
    # pages = driver.find_element(By.XPATH, '//*[@id="laypage_1"]')
    # a = pages.find_elements(By.XPATH, './a')
    # pageEle = pages.find_element(By.XPATH, '//*[@id="laypage_1"]/a[' + str(len(a) - 1) + ']')
    # print('page:' + str(pageEle.text))
    # page = int(pageEle.text)
    image_list = []
    pageWeb = 0
    while True:
        print('get page:' + str(pageWeb))
        url_pic = 'http://www.mirrorpsy.cn/index.php?s=photo&c=search&page=' + str(pageWeb + 1)
        driver.get(url_pic)
        items = driver.find_element(By.XPATH, '//*[@id="cn_perio_list"]')
        lis = items.find_elements(By.XPATH, './li')
        print('len:' + str(len(lis)))
        if len(lis) == 0:
            break
        for li in lis:
            img = li.find_element(By.TAG_NAME, 'img')
            print(img.get_attribute('src'))
            image_list.append(img.get_attribute('src'))
        time.sleep(5)
        pageWeb = pageWeb + 1
    print('image crawling finished')
    print('image list: ' + str(image_list))
    time.sleep(500000)  # keep the browser open for inspection
Thanks for reading; give me a follow, and I will post irregular updates every week.