Python Web Scraping in Practice: Scraping JavaScript-Rendered Pages

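The bluntest approach is Selenium combined with PhantomJS, a headless browser. The pair effectively drives a real browser, so you can read the page data after JavaScript has rendered it.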

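Drawbacks of this combination:

Even though the browser is headless, every page is still fully loaded and rendered, so this approach is very slow and is not recommended for scraping in bulk.

Data that is loaded by asynchronous requests and is not present in the page source at the moment it is read still cannot be captured.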
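
The second drawback is largely a timing problem: the data arrives after the initial page load, so reading the source too early misses it. Below is a minimal sketch of one common workaround, polling until a condition holds instead of sleeping for a fixed time. The name `wait_until` and its parameters are hypothetical and not part of Selenium; Selenium ships its own `WebDriverWait` for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)
```

With a Selenium 3 driver this might be called as `wait_until(lambda: driver.find_elements_by_css_selector('.content'))`, which returns as soon as at least one matching element exists, because an empty list counts as false.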

Introduction to Selenium

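Selenium is a functional test automation tool for web applications. It runs directly in the browser, just as a real user would operate it.

Because of this, Selenium is also a powerful tool for collecting web data: it can have the browser load pages automatically, extract the data you need, take screenshots of pages, or check whether certain actions have occurred on a site.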

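Selenium does not ship with a browser of its own; it has to be paired with a third-party one. Supported browsers include Chrome, Firefox, IE, and PhantomJS. With Chrome, Firefox, or IE you can watch a browser window open, load the site, and carry out the actions in your code. Because Selenium with Firefox or Chrome is too slow, we use Selenium with PhantomJS here.

PhantomJS is a "headless" browser based on WebKit: it has no user interface, but it parses and renders web pages just as an ordinary browser does, which makes it very capable.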

Example 1: Scraping Toutiao (今日头条)

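# coding=utf-8
import requests
import json

url = 'http://www.toutiao.com/api/pc/focus/'
# Request the JSON feed directly; the timeout keeps a stalled request from hanging
wbdata = requests.get(url, timeout=10).text

print(wbdata)
print()
# Decode the JSON text and pick out the list of focus items
data = json.loads(wbdata)
news = data['data']['pc_feed_focus']

for n in news:
    title = n['title']
    img_url = n['image_url']
    media_url = n['media_url']  # separate name so the request URL is not overwritten
    print(media_url, title, img_url)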

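This example needs only the requests package: the data comes from a JSON endpoint, so the endpoint can be requested directly without a browser.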
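
The loop in this example indexes every key directly, so a single item without `image_url` would raise `KeyError`. Here is a minimal sketch of a more defensive extraction, run against a hand-made payload so that it works offline. The function name `extract_focus_items` is hypothetical, and the sample values are invented; only the key names (`data`, `pc_feed_focus`, `title`, `image_url`, `media_url`) come from the example.

```python
import json


def extract_focus_items(payload):
    """Return (media_url, title, image_url) tuples from a decoded response.

    Missing keys yield an empty list or None fields instead of raising KeyError.
    """
    feed = (payload.get('data') or {}).get('pc_feed_focus') or []
    return [
        (item.get('media_url'), item.get('title'), item.get('image_url'))
        for item in feed
    ]


# Invented sample that mirrors only the keys used in the example.
sample = json.loads('''
{"data": {"pc_feed_focus": [
    {"title": "Sample headline", "image_url": "//example.com/a.jpg",
     "media_url": "/c/user/1/"},
    {"title": "No image here"}
]}}
''')

for media_url, title, image_url in extract_focus_items(sample):
    print(media_url, title, image_url)
```

In a real run, `payload` would be the result of `json.loads(wbdata)`; whether the live endpoint still returns this structure is an assumption.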


Example 2: Scraping Qzone (QQ空间) posts (说说)

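Requires the selenium package and a downloaded copy of PhantomJS. The script also imports BeautifulSoup (bs4) and uses the lxml parser.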

# coding=utf-8
# This script targets Selenium 3.x: webdriver.PhantomJS is gone in Selenium 4,
# and the find_element_by_* helpers were removed in later 4.x releases.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time

# Start PhantomJS through Selenium (adjust the path to your own install)
driver = webdriver.PhantomJS(executable_path="D:\\python\\phantomjs-2.1.1\\bin\\phantomjs.exe")
driver.maximize_window()


# Log in to Qzone if required, then print the friend's posts
def get_shuoshuo(qq):
    driver.get('http://user.qzone.qq.com/{}/311'.format(qq))
    time.sleep(5)
    try:
        driver.find_element_by_id('login_div')
        need_login = True
        print("Login required...")
    except NoSuchElementException:
        need_login = False
        print("No login required...")

    if need_login:
        driver.switch_to.frame('login_frame')
        driver.find_element_by_id('switcher_plogin').click()
        driver.find_element_by_id('u').clear()  # username field
        driver.find_element_by_id('u').send_keys('your QQ number')
        driver.find_element_by_id('p').clear()  # password field
        driver.find_element_by_id('p').send_keys('your QQ password')
        driver.find_element_by_id('login_button').click()
        time.sleep(3)
        # Leave the login iframe before searching the main page
        driver.switch_to.default_content()
    driver.implicitly_wait(3)

    print("Checking access permission...")
    try:
        driver.find_element_by_id('QM_OwnerInfo_Icon')
        has_access = True
    except NoSuchElementException:
        has_access = False

    if has_access:
        print("Fetching posts...")
        driver.switch_to.frame('app_canvas_frame')
        content = driver.find_elements_by_css_selector('.content')
        stime = driver.find_elements_by_css_selector('.c_tx.c_tx3.goDetail')
        for con, sti in zip(content, stime):
            data = {
                'time': sti.text,
                'shuos': con.text
            }
            print(data)
        pages = driver.page_source
        # print(pages)
        # Parsed copy of the frame for further extraction (unused in this example)
        soup = BeautifulSoup(pages, 'lxml')

    # Join the session cookies into a single "name=value;" string
    cookie = driver.get_cookies()
    cookie_str = ''.join("{0}={1};".format(c['name'], c['value']) for c in cookie)
    print('Cookies:', cookie_str)
    print("==========Done================")

    # quit() closes every window and ends the PhantomJS process
    driver.quit()


if __name__ == '__main__':
    get_shuoshuo("friend's QQ number")
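
The script ends by printing the session cookies as one string. One plausible use for that string, given how slow the browser route is, is to reuse the logged-in session in plain HTTP requests. The sketch below shows only the conversion step, on invented cookie values. The helper name `cookies_to_header` is hypothetical, and it joins pairs with `"; "` (the conventional Cookie header separator) instead of the bare `";"` used in the script.

```python
def cookies_to_header(cookies):
    """Join Selenium-style cookie dicts into a single Cookie header value."""
    return '; '.join('{}={}'.format(c['name'], c['value']) for c in cookies)


# Invented cookies in the shape returned by driver.get_cookies().
sample_cookies = [
    {'name': 'uin', 'value': 'o0123456', 'domain': '.qq.com'},
    {'name': 'skey', 'value': '@abcdef', 'domain': '.qq.com'},
]

headers = {'Cookie': cookies_to_header(sample_cookies)}
print(headers)
```

A dict like `headers` could then be passed to `requests.get(url, headers=headers)`. Whether Qzone accepts requests made this way is not something this sketch verifies.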
