UserWarning: Selenium support for PhantomJS has been deprecated

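Using PhantomJS produced a warning that PhantomJS support has been deprecated. After some searching, I found this is mostly reported on Python 3 setups; the underlying cause is that Selenium itself no longer supports PhantomJS.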

PS C:\Users\jiangcheng\Documents\Python Scripts> & F:/Program/Anaconda3/python.exe "c:/Users/jiangcheng/Documents/Python Scripts/save_data.py"
F:\Program\Anaconda3\lib\site-packages\selenium\webdriver\phantomjs\webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
  warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
Traceback (most recent call last):
  File "c:/Users/jiangcheng/Documents/Python Scripts/save_data.py", line 9, in <module>
    driver = webdriver.PhantomJS()
  File "F:\Program\Anaconda3\lib\site-packages\selenium\webdriver\phantomjs\webdriver.py", line 56, in __init__
    self.service.start()
  File "F:\Program\Anaconda3\lib\site-packages\selenium\webdriver\common\service.py", line 98, in start
    self.assert_process_still_running()
  File "F:\Program\Anaconda3\lib\site-packages\selenium\webdriver\common\service.py", line 111, in assert_process_still_running
    % (self.path, return_code)
selenium.common.exceptions.WebDriverException: Message: Service phantomjs unexpectedly exited. Status code was: 4294967295

Solution: following the hint in the warning, switch to headless Chrome. Download chromedriver, then change the code as follows:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import csv

# URL of the first page of NetEase Cloud Music playlists
url = 'https://music.163.com/#/discover/playlist/'\
      '?order=hot&cat=%E5%85%A8%E9%83%A8&limit=35&offset=0'

# Previously a Selenium WebDriver was created through the PhantomJS interface; headless Chrome now replaces it
# driver = webdriver.PhantomJS()
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
driver = webdriver.Chrome(chrome_options=chrome_options)

After running the script, a new error appeared:

[0709/142159.018:INFO:CONSOLE(990)] "Mixed Content: The page at 'https://music.163.com/#/discover/playlist/?order=hot&cat=%E5%85%A8%E9%83%A8&limit=35&offset=0' was loaded over HTTPS, but requested an insecure image 'http://p1.music.126.net/ybIVQ5KJE8SkZuPIQlLrvA==/19172184254128371.jpg?param=140y140'. This content should also be served over HTTPS.", source: https://music.163.com/discover/playlist/?order=hot&cat=%E5%85%A8%E9%83%A8&limit=35&offset=0 (990)
[0709/142159.172:INFO:CONSOLE(0)] "The SSL certificate used to load resources from https://s4.music.126.net will be distrusted in the future. Once distrusted, users will be prevented from loading these resources. See https://g.co/chrome/symantecpkicerts for more information.", source: https://music.163.com/#/discover/playlist/?order=hot&cat=%E5%85%A8%E9%83%A8&limit=35&offset=0 (0)
[0709/142159.183:INFO:CONSOLE(0)] "The SSL certificate used to load resources from https://s3.music.126.net will be distrusted in the future. Once distrusted, users will be prevented from loading these resources. See https://g.co/chrome/symantecpkicerts for more information.", source: https://music.163.com/discover/playlist/?order=hot&cat=%E5%85%A8%E9%83%A8&limit=35&offset=0 (0)
[0709/142159.787:INFO:CONSOLE(0)] "The SSL certificate used to load resources from https://p1.music.126.net will be distrusted in the future. Once distrusted, users will be prevented from loading these resources. See https://g.co/chrome/symantecpkicerts for more information.", source: https://music.163.com/discover/playlist/?order=hot&cat=%E5%85%A8%E9%83%A8&limit=35&offset=0 (0)
[0709/142159.932:INFO:CONSOLE(0)] "The SSL certificate used to load resources from https://acstatic-dun.126.net will be distrusted in the future. Once distrusted, users will be prevented from loading these resources. See https://g.co/chrome/symantecpkicerts for more information.", source: https://music.163.com/#/discover/playlist/?order=hot&cat=%E5%85%A8%E9%83%A8&limit=35&offset=0 (0)
Traceback (most recent call last):
  File "c:/Users/jiangcheng/Documents/Python Scripts/save_data.py", line 30, in <module>
    for i in range(len(data)):
TypeError: object of type 'WebElement' has no len()

After some digging, I found I had called the wrong method. The correct line is:

data = driver.find_element_by_id("m-pl-container").find_elements_by_tag_name("li")

I had mistakenly fetched the "li" tags with the singular form, which returns a single WebElement rather than a list:

find_element_by_tag_name

After fixing that and running the script again, a different error came up:

[0709/142739.480:INFO:CONSOLE(990)] "Mixed Content: The page at 'https://music.163.com/#/discover/playlist/?order=hot&cat=%E5%85%A8%E9%83%A8&limit=35&offset=0' was loaded over HTTPS, but requested an insecure image 'http://p1.music.126.net/ybIVQ5KJE8SkZuPIQlLrvA==/19172184254128371.jpg?param=140y140'. This content should also be served over HTTPS.", source: https://music.163.com/discover/playlist/?order=hot&cat=%E5%85%A8%E9%83%A8&limit=35&offset=0 (990)
[0709/142739.481:INFO:CONSOLE(0)] "The SSL certificate used to load resources from https://s3.music.126.net will be distrusted in the future. Once distrusted, users will be prevented from loading these resources. See https://g.co/chrome/symantecpkicerts for more information.", source: https://music.163.com/discover/playlist/?order=hot&cat=%E5%85%A8%E9%83%A8&limit=35&offset=0 (0)
[0709/142739.719:INFO:CONSOLE(0)] "The SSL certificate used to load resources from https://p1.music.126.net will be distrusted in the future. Once distrusted, users will be prevented from loading these resources. See https://g.co/chrome/symantecpkicerts for more information.", source: https://music.163.com/discover/playlist/?order=hot&cat=%E5%85%A8%E9%83%A8&limit=35&offset=0 (0)
Traceback (most recent call last):
  File "c:/Users/jiangcheng/Documents/Python Scripts/save_data.py", line 35, in <module>
    msk = data[i].find_element_by_css_selectot("a.msk")
AttributeError: 'WebElement' object has no attribute 'find_element_by_css_selectot'

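As the traceback indicates, the method name in this line was misspelled (selectot instead of selector). The corrected line is:

msk = data[i].find_element_by_css_selector("a.msk")

After fixing that and rerunning, the next error appeared:
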
[0709/143836.831:INFO:CONSOLE(0)] "The SSL certificate used to load resources from https://p1.music.126.net will be distrusted in the future. Once distrusted, users will be prevented from loading these resources. See https://g.co/chrome/symantecpkicerts for more information.", source: https://music.163.com/discover/playlist/?order=hot&cat=%E5%85%A8%E9%83%A8&limit=35&offset=35 (0)
Traceback (most recent call last):
  File "c:/Users/jiangcheng/Documents/Python Scripts/save_data.py", line 37, in <module>
    writer.writerow([msk.get_attribute('title'),nb,msk.get_attribute('href')])
UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 7: illegal multibyte sequence

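On analysis, this looks like an encoding problem when writing data to the CSV file: on a Chinese-locale Windows system, open() defaults to the gbk code page, which cannot encode the character '\u2022'.

The code contains this line: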

csv_file = open("playlist.csv","w",newline='')

Change it to:

csv_file = open("playlist.csv","w",encoding='utf-8',newline='')
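
To see why the explicit encoding matters, here is a minimal sketch that checks which codecs can encode the character from the traceback. The helper name can_encode is my own, made up for illustration; the codec names are standard Python ones.

```python
# U+2022 (a bullet) is the character reported in the UnicodeEncodeError above.
text = "\u2022"


def can_encode(s: str, encoding: str) -> bool:
    """Hypothetical helper: return True if `s` can be encoded with `encoding`."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Compare the Windows default for Chinese locales (gbk) with the alternatives.
for enc in ("gbk", "utf-8", "gb18030", "utf_8_sig"):
    print(enc, can_encode(text, enc))
```

Only gbk should report False here, which matches the traceback; utf-8, gb18030, and utf_8_sig can all represent the character.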

After this change, running the script failed with:

DevTools listening on ws://127.0.0.1:54335/devtools/browser/00e94039-0401-48ce-8b68-6785a704e93f
Traceback (most recent call last):
  File "c:/Users/jiangcheng/Documents/Python Scripts/save_data.py", line 18, in <module>
    csv_file = open("playlist.csv","w",encoding='utf-8')
PermissionError: [Errno 13] Permission denied: 'playlist.csv'

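After some searching, it turned out the cause was that I had the CSV file open for viewing in another program at the time.

After closing it and rerunning, the file's contents appeared garbled when opened in Excel.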


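According to what I found, the garbling happens because Excel correctly recognizes Chinese text encoded as gb2312, gbk, gb18030, or UTF-8 with BOM, whereas a UTF-8 file without BOM shows up garbled when opened in Excel. So gb18030 is not the only fix; utf_8_sig also solves the problem.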
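
The difference between the two UTF-8 variants can be checked directly. The sketch below is only an illustration: csv_bytes is a hypothetical helper of my own, and it builds the CSV in memory instead of writing playlist.csv.

```python
import codecs
import csv
import io


def csv_bytes(rows, encoding):
    """Hypothetical helper: serialize rows as CSV and return the encoded bytes."""
    out = io.StringIO(newline="")
    csv.writer(out).writerows(rows)
    return out.getvalue().encode(encoding)


rows = [["标题", "播放数", "链接"]]
plain = csv_bytes(rows, "utf-8")
with_bom = csv_bytes(rows, "utf_8_sig")

# utf_8_sig prepends the 3-byte BOM (EF BB BF); plain utf-8 does not.
print(plain.startswith(codecs.BOM_UTF8))
print(with_bom.startswith(codecs.BOM_UTF8))
# Apart from the BOM, the two byte strings are identical.
print(with_bom[len(codecs.BOM_UTF8):] == plain)
```

The BOM is the marker Excel relies on to detect UTF-8, which is why passing encoding='utf_8_sig' to open() is an alternative to gb18030.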


It finally runs end to end.

The complete code is shared below:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import csv

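# URL of the first page of NetEase Cloud Music playlists
url = 'https://music.163.com/#/discover/playlist/'\
      '?order=hot&cat=%E5%85%A8%E9%83%A8&limit=35&offset=0'

# Previously a Selenium WebDriver was created through the PhantomJS interface; headless Chrome now replaces it
# driver = webdriver.PhantomJS()
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
# Note: newer Selenium releases deprecate the chrome_options keyword in favor of options=
driver = webdriver.Chrome(chrome_options=chrome_options)

# Prepare the CSV file that stores the playlists
csv_file = open("playlist.csv","w",encoding='gb18030',newline='')
writer = csv.writer(csv_file)
writer.writerow(['标题','播放数','链接'])

# Parse each page until the "next page" link is empty
while url != 'javascript:void(0)':
    # Load the page with the WebDriver
    driver.get(url)
    # Switch to the iframe that holds the content
    driver.switch_to.frame("contentFrame")
    # Locate the playlist elements
    data = driver.find_element_by_id("m-pl-container").find_elements_by_tag_name("li")
    # Parse every playlist on the page
    for i in range(len(data)):
        # Get the play count
        nb = data[i].find_element_by_class_name("nb").text
        if '万' in nb and int(nb.split("万")[0]) > 500:
            # Get the cover of playlists with more than 5 million (500万) plays
            msk = data[i].find_element_by_css_selector("a.msk")
            # Write the title and link from the cover, together with the play count, to the file
            writer.writerow([msk.get_attribute('title'),nb,msk.get_attribute('href')])
    # Locate the URL of the "next page" link
    url = driver.find_element_by_css_selector("a.zbtn.znxt").get_attribute('href')
# Shut down the headless browser once all pages are parsed
driver.quit()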
csv_file.close()
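
The play-count filter in the loop can be pulled out and tried without a browser. This is a minimal sketch: the function name is_over_threshold and the sample strings are my own assumptions, and it only mirrors the condition used in the script, where a count such as "1234万" means 1234 × 10,000 plays.

```python
def is_over_threshold(nb: str, threshold_wan: int = 500) -> bool:
    """Hypothetical helper mirroring the script's filter.

    `nb` is the play-count text. Counts without the "万" (ten thousand)
    suffix are below 10,000 plays and therefore never pass the filter.
    """
    if "万" not in nb:
        return False
    # The text before "万" is the count in units of ten thousand.
    return int(nb.split("万")[0]) > threshold_wan


# Made-up sample values, not data scraped from the site.
for sample in ("1234万", "500万", "98765"):
    print(sample, is_over_threshold(sample))
```

Note that the comparison is strict, so a playlist with exactly 500万 plays is skipped, the same as in the original condition.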

Reposted from blog.csdn.net/sky_jiangcheng/article/details/80969330