A beginner's Python crawler: scraping static pages with requests vs. scraping dynamic pages with selenium webdriver

References:

https://foofish.net/python-crawler-html2pdf.html

http://www.cnblogs.com/tuohai666/p/8718107.html

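I have recently been working on a Python 2 to 3 migration and wanted to crawl the w3cschool Python 3 tutorial and convert it to a PDF that I can read at any time.

A quick search turned up the blog posts linked above, so I followed them step by step.

This article briefly records the steps and the problems I ran into. Environment: Windows 10, Python 3.6.

Tool preparation

Python packages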

pip install requests
pip install beautifulsoup4
pip install pdfkit

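Install wkhtmltopdf

The current platform is Windows. Download the stable release from the official site https://wkhtmltopdf.org/downloads.html and add the directory containing the executable to the system PATH environment variable.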
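
Since pdfkit only wraps the wkhtmltopdf executable, it can help to confirm that the executable really is reachable through PATH before running the crawler. The following is a minimal sketch using only the standard library; find_wkhtmltopdf is a helper name I made up for illustration.

```python
import shutil


def find_wkhtmltopdf():
    """Return the full path of the wkhtmltopdf executable, or None if it is not on PATH."""
    return shutil.which('wkhtmltopdf')


path = find_wkhtmltopdf()
if path is None:
    print('wkhtmltopdf was not found on PATH; check the environment variable')
else:
    print('wkhtmltopdf found at', path)
```

If the path is None right after editing the environment variable, reopening the terminal is usually needed so that the new PATH is picked up.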

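Target page analysis

Press F12 in the browser to open the developer tools and find the locators of the target elements. The elements to collect are the following three divs:

div.sidebar-content: chapter list
div.content-top: chapter title
div.content-bg: chapter content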

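Crawler implementation

Task breakdown

Get the URLs of all chapters from the python3 page.
Use requests to download the full content of each target URL, use beautifulsoup to work on the HTML DOM and extract the main text, then save it to a local html file (in the static subdirectory of the current directory).
Use pdfkit to turn the list of html files into a single pdf document.

Step one: get the URLs of all chapters from the https://www.w3cschool.cn/python3/ page.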

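import os

import pdfkit
import requests
from bs4 import BeautifulSoup

# Site root only; the '/python3' path and each chapter href are appended to it below
base_url = 'https://www.w3cschool.cn'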

def get_url_list():
    # Get URLs list for python3 tutorial
    base_url_for_python3 = base_url + '/python3'
    response = requests.get(base_url_for_python3)

    soup = BeautifulSoup(response.content.decode(), 'html.parser')
    menu_tags = soup.find('div','sidebar-content')

    urls = []
    for link in menu_tags.find_all('a'):
        url = base_url + link.get('href')
        urls.append(url)
    return urls
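
The function above builds each chapter URL by plain string concatenation, which works as long as every href starts with a slash. As a hedged alternative, urljoin from the standard library resolves root-relative, page-relative and already absolute hrefs alike. This is only a sketch: to_absolute is a hypothetical helper, and I have not checked which href styles the site actually uses.

```python
from urllib.parse import urljoin

base_url = 'https://www.w3cschool.cn'


def to_absolute(href, page_url=base_url + '/python3/'):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


# All three forms resolve to the same chapter URL
print(to_absolute('/python3/python3-basic-syntax.html'))
print(to_absolute('python3-basic-syntax.html'))
print(to_absolute('https://www.w3cschool.cn/python3/python3-basic-syntax.html'))
```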

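Step two: use requests to download the full content of each target URL, use beautifulsoup to work on the HTML DOM and extract the main text, then save it to a local html file.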

def get_content(url):
    # Get Page Content for each url and save to html files
    print('Opening URL',url)  
    chapter_name = url.split('/')[-1]
    response = requests.get(url)
    soup = BeautifulSoup(response.content.decode(), 'html.parser')
    # Get chapter title 
    head = soup.find_all('div','content-top') 
    # Get chapter content 
    content = soup.find_all('div','content-bg') 
    html = str(head).encode() + str(content).encode() 
    with open('static/{}'.format(chapter_name), 'wb') as f: 
        f.write(html)
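
get_content assumes that the static subdirectory already exists and that the last segment of the URL is a usable file name. A small sketch of how both assumptions could be made explicit follows; chapter_path is a hypothetical helper, and the 'index.html' fallback is my own choice for URLs that end in a slash.

```python
import os
from urllib.parse import urlparse


def chapter_path(url, out_dir='static'):
    """Create out_dir if needed and return the local file path for a chapter URL."""
    os.makedirs(out_dir, exist_ok=True)
    # Use only the path component so a query string or fragment
    # does not end up in the file name
    name = os.path.basename(urlparse(url).path) or 'index.html'
    return os.path.join(out_dir, name)
```

With this helper, the open() call in get_content could take chapter_path(url) instead of formatting the path by hand.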

Step three: use pdfkit to turn the list of html files into a single pdf document.

def save_as_pdf(htmls):
    # Save the html file list content to pdf file
    options = {
        'page-size': 'Letter',
        'margin-top': '0.75in',
        'margin-right': '0.75in',
        'margin-bottom': '0.75in',
        'margin-left': '0.75in',
        'encoding': "UTF-8",
        'custom-header': [
            ('Accept-Encoding', 'gzip')
        ],
        'cookie': [
            ('cookie-name1', 'cookie-value1'),
            ('cookie-name2', 'cookie-value2'),
        ],
        'outline-depth': 10,
    }
    pdfkit.from_file(htmls,'w3c_python3_tutorial.pdf',options=options)
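
save_as_pdf hands the file list to pdfkit as it is. If a download failed, the list may contain paths that are missing or empty, so a pre-check like the one sketched below could be run first. This is my own addition rather than part of the original script, and usable_html_files is a hypothetical name.

```python
import os


def usable_html_files(paths):
    """Keep only the paths that exist and are non-empty, preserving their order."""
    return [p for p in paths if os.path.isfile(p) and os.path.getsize(p) > 0]
```

The filtered list can then be passed to save_as_pdf in place of the raw one.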

Entry function

def get_w3c_python3_tutorial():
    url_list = get_url_list()
    for url in url_list:
        get_content(url)
    ori_list = os.listdir('./static')    
    file_name_list = [ 'static/' + s for s in ori_list ]
    save_as_pdf(file_name_list)

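Troubleshooting

The code above ran into two small problems.

The content of the pdf file is not in chapter order.
Some chapters are fetched incompletely, for example https://www.w3cschool.cn/python3/python3-basic-syntax.html, because the browser runs JavaScript after the page opens and generates the content dynamically. The requests package does not execute JavaScript, so it cannot scrape dynamically generated content.

The fix for the first problem is rather crude: define a global variable that holds the chapter number and increment it after each chapter is processed. Then sort the file list by that numeric prefix and pass the sorted list to save_as_pdf.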

base_url = 'https://www.w3cschool.cn'
counter = 1

def get_content(url):
    # Get Page Content for each page and save to html files
    print('Opening URL',url)
    chapter_name = url.split('/')[-1]
    response = requests.get(url)

    soup = BeautifulSoup(response.content.decode(), 'html.parser')
    head = soup.find_all('div','content-top')
    content = soup.find_all('div','content-bg')
    html = str(head).encode() + str(content).encode()

    global counter
    file_name = '%d-%s' % (counter, chapter_name)
    with open('static/{}'.format(file_name), 'wb') as f:
        f.write(html)
    counter = counter + 1
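
As an aside, the same numbering can be had without a global variable by enumerating the URL list, and zero-padding the number makes a plain sorted() call sufficient. The sketch below is an alternative I am suggesting, not what the script above does; numbered_names and the width of 3 digits are assumptions.

```python
def numbered_names(urls, width=3):
    """Prefix each chapter file name with its zero-padded position in the URL list."""
    names = []
    for index, url in enumerate(urls, start=1):
        chapter_name = url.split('/')[-1]
        # zfill pads with leading zeros, so '2' becomes '002' and sorts before '010'
        names.append(str(index).zfill(width) + '-' + chapter_name)
    return names
```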

import re

def sorted_alphanumeric(data):
    # Sort names so that embedded numbers compare numerically (2 before 10)
    convert = lambda text: int(text) if text.isdigit() else text.lower()
    alphanum_key = lambda key: [ convert(c) for c in re.split('([0-9]+)', key) ]
    return sorted(data, key=alphanum_key)
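
To see why a numeric-aware sort is needed for the counter-prefixed file names, here is a small self-contained comparison. The three file names are made up for the example, and natural_key is just an illustrative name for the same key idea as in the helper above.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so that '10' sorts after '2'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split('([0-9]+)', name)]


files = ['10-c.html', '2-b.html', '1-a.html']
print(sorted(files))                   # ['1-a.html', '10-c.html', '2-b.html']
print(sorted(files, key=natural_key))  # ['1-a.html', '2-b.html', '10-c.html']
```

The plain sort compares character by character, which is why chapter 10 lands before chapter 2.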

The second problem calls for another Python package: selenium.

pip install selenium

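Download the stable chromedriver.exe from the official site (http://chromedriver.chromium.org/downloads) and put it in a directory the script can access. In this experiment chromedriver.exe sits in the same directory as the script.

Start the webdriver.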

import os

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
# Run chrome 'headless' so that no browser window is shown to the user.
# Recent versions of chrome support headless mode, which can replace phantomjs.
chrome_options.add_argument('--headless')
chrome_options.add_argument('--ignore-certificate-errors')
chrome_driver = os.getcwd() + r'\chromedriver.exe'

driver = webdriver.Chrome(chrome_options=chrome_options, executable_path=chrome_driver)

Get the url list with selenium webdriver

def get_url_list():
    python3_tutorial_url = base_url + '/python3'
    driver.get(python3_tutorial_url)
    sidebar_content = driver.find_element_by_css_selector('div.sidebar-content')
    soup = BeautifulSoup(sidebar_content.get_attribute('innerHTML'), 'html.parser')

    urls = []
    for link in soup.find_all('a'):
        url = base_url + link.get('href')
        urls.append(url)
    return urls

Get the page content

def get_content(url):
    print('Opening URL',url)
    global counter
    chapter_name = str(counter) + '-' + url.split('/')[-1]

    driver.get(url)
    content_top = driver.find_element_by_css_selector('div.content-top')
    content_bg = driver.find_element_by_css_selector('div.content-bg')

    content_value = content_top.get_attribute('innerHTML').encode() + content_bg.get_attribute('innerHTML').encode()

    with open('dynamic/{}'.format(chapter_name), 'wb') as f:  # this time we save to the dynamic subdirectory :)
        f.write(content_value)

    counter= counter+1
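
Because these pages are filled in by JavaScript, an element can still be empty at the moment it is looked up. Selenium ships its own explicit-wait helper for this (WebDriverWait in selenium.webdriver.support.ui). The sketch below is not that API but a library-independent illustration of the same polling idea; wait_until and its default timeout and interval are assumptions of mine.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition was not met within {} seconds'.format(timeout))
        time.sleep(interval)
```

In get_content, the condition could be a small function that reads the innerHTML of div.content-bg and returns it once it is non-empty.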

The result is shown below.