2019-11-11 Web Scraping 3: Scraping Douban Movie Info and Saving It to CSV (Python 2.7, static pages; requests/bs4/lxml/XPath)

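I came across the article "Scraping the Douban Top 250 movies with Python and writing them to a file: demo, learning notes" (《Python爬取豆瓣Top 250的电影,并输出到文件. demo,学习篇》) online and decided to work through it.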

My environment is Windows 10 + Python 2.7.
The required packages are requests, lxml, and bs4. I had already installed bs4.

C:\>pip install requests
Collecting requests
  Downloading https://files.pythonhosted.org/packages/51/bd/23c926cd341ea6b7dd0b2a00aba99ae0f828be89d72b2190f27c11d4b7fb/requests-2.22.0-py2.py3-none-any.whl (57kB)
    100% |████████████████████████████████| 61kB 28kB/s
Collecting certifi>=2017.4.17 (from requests)
  Downloading https://files.pythonhosted.org/packages/18/b0/8146a4f8dd402f60744fa380bc73ca47303cccf8b9190fd16a827281eac2/certifi-2019.9.11-py2.py3-none-any.whl (154kB)
    100% |████████████████████████████████| 163kB 14kB/s
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 (from requests)
  Downloading https://files.pythonhosted.org/packages/e0/da/55f51ea951e1b7c63a579c09dd7db825bb730ec1fe9c0180fc77bfb31448/urllib3-1.25.6-py2.py3-none-any.whl (125kB)
    100% |████████████████████████████████| 133kB 15kB/s
Collecting idna<2.9,>=2.5 (from requests)
  Downloading https://files.pythonhosted.org/packages/14/2c/cd551d81dbe15200be1cf41cd03869a46fe7226e7450af7a6545bfc474c9/idna-2.8-py2.py3-none-any.whl (58kB)
    100% |████████████████████████████████| 61kB 54kB/s
Collecting chardet<3.1.0,>=3.0.2 (from requests)
  Downloading https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl (133kB)
    100% |████████████████████████████████| 143kB 14kB/s
Installing collected packages: certifi, urllib3, idna, chardet, requests
Successfully installed certifi-2019.9.11 chardet-3.0.4 idna-2.8 requests-2.22.0 urllib3-1.25.6
You are using pip version 9.0.3, however version 19.3.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

C:\>pip install lxml
Collecting lxml
  Downloading https://files.pythonhosted.org/packages/a2/bf/6cb6118b6492104723afac0f5c149877c76f03254802448cea1ea37c3781/lxml-4.4.1-cp27-cp27m-win_amd64.whl (3.6MB)
    100% |████████████████████████████████| 3.6MB 100kB/s
Installing collected packages: lxml
Successfully installed lxml-4.4.1
You are using pip version 9.0.3, however version 19.3.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

C:\>pip install bs4
Requirement already satisfied: bs4 in c:\python27\lib\site-packages
Requirement already satisfied: beautifulsoup4 in c:\python27\lib\site-packages (from bs4)
Requirement already satisfied: soupsieve>=1.2 in c:\python27\lib\site-packages (from beautifulsoup4->bs4)
Requirement already satisfied: backports.functools-lru-cache; python_version < "3" in c:\python27\lib\site-packages (from soupsieve>=1.2->beautifulsoup4->bs4)
You are using pip version 9.0.3, however version 19.3.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

1. The requests module

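The requests library fetches HTML pages and sends HTTP requests. It is built on urllib3 and is more convenient to use than the standard-library urllib modules.
See https://requests.kennethreitz.org/en/master/
Main methods: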

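requests.get(): sends an HTTP GET request
requests.post(): sends an HTTP POST request
requests.head(): sends an HTTP HEAD request, which returns only the response headers
requests.put(): sends an HTTP PUT request
requests.delete(): sends an HTTP DELETE request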

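Example: fetching a page. The call returns a Response object; the page body is available as a string in r.text.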

r = requests.get('https://api.github.com/user', auth=('user', 'pass'))

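For the differences between GET, POST, PUT, and DELETE, see https://blog.csdn.net/haif_city/article/details/78333213
In summary:
1. POST /url creates a resource
2. DELETE /url/xxx deletes a resource
3. PUT /url/xxx updates a resource
4. GET /url/xxx retrieves a resource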
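
To make the mapping above concrete, here is a minimal sketch that builds one request per verb without sending anything. It is written for Python 3 and uses the standard library's urllib.request in place of requests, so it runs offline. The example.com endpoint and the build() helper are made up for illustration.

```python
# Python 3 sketch: build (but do not send) one request per HTTP verb.
from urllib.request import Request

BASE = "http://example.com/movies"  # hypothetical endpoint


def build(method, url, body=None):
    """Return an unsent Request object for the given HTTP method."""
    data = body.encode("utf-8") if body is not None else None
    return Request(url, data=data, method=method)


crud = [
    build("POST", BASE, '{"title": "demo"}'),       # create
    build("GET", BASE + "/1"),                      # read
    build("PUT", BASE + "/1", '{"title": "new"}'),  # update
    build("DELETE", BASE + "/1"),                   # delete
]

for req in crud:
    print(req.get_method(), req.full_url)
```

With requests, the corresponding calls would be requests.post(), requests.get(), requests.put(), and requests.delete() on the same URLs.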

2. bs4

See https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/ for very thorough documentation, which is also available in Chinese.

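BeautifulSoup4, bs4 for short, makes it quick and easy to extract specific content from a web page. The basic workflow is: first obtain the page as a string with another module (for example, requests), then pass that string to bs4 to build an object, and finally call that object's methods to extract the data.

bs4 is a Python library for pulling data out of HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document.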

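bs4 automatically converts incoming documents to Unicode and outgoing documents to UTF-8. For details, see https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id51

Every HTML or XML document has its own encoding, such as ASCII or UTF-8, but once Beautiful Soup has parsed it, the document is held as Unicode.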

markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"
soup = BeautifulSoup(markup)
soup.h1
# <h1>Sacré bleu!</h1>
soup.h1.string
# u'Sacr\xe9 bleu!'

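Beautiful Soup uses a sub-library for automatic encoding detection to identify the document's encoding and convert it to Unicode. The .original_encoding attribute of the BeautifulSoup object records the detected encoding.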

soup.original_encoding
'utf-8'

When Beautiful Soup writes a document out, the output is encoded as UTF-8 by default, regardless of the input document's encoding.
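
The same round trip can be done by hand, which may help to show what "convert to Unicode on input, encode on output" means. This is only a sketch using plain byte and string methods, not Beautiful Soup's own code; the variable names are mine.

```python
# -*- coding: utf-8 -*-
# Manual version of the decode-on-input / encode-on-output idea.
raw = b"<h1>Sacr\xc3\xa9 bleu!</h1>"  # UTF-8 bytes, as received from the network

text = raw.decode("utf-8")            # bytes -> Unicode string
print(repr(text))

utf8_out = text.encode("utf-8")       # Unicode -> UTF-8 bytes (same as the input here)
latin1_out = text.encode("latin-1")   # Unicode -> a different encoding on request

print(utf8_out == raw)
print(len(raw), len(latin1_out))      # the accented letter takes 2 bytes in UTF-8, 1 in Latin-1
```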

3. lxml

lxml is a Python parsing library. It parses both HTML and XML, supports XPath queries, and is very fast. See https://lxml.de/

4. Supplement: XPath

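XPath, short for XML Path Language, is a language for locating information in XML documents. It was originally designed for searching XML, but it works equally well on HTML documents.

XPath's selection features are very powerful. It offers concise path expressions, and XPath 2.0 and later define more than 100 built-in functions for matching strings, numbers, and dates and times, and for processing nodes and sequences. XPath 1.0, the version lxml implements, has a much smaller core set of functions. Almost any node we want to locate can be selected with XPath.

XPath became a W3C standard on 16 November 1999. It was designed for use by XSLT, XPointer, and other XML processing software. More documentation is available on the official site: https://www.w3.org/TR/xpath/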
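
As a small illustration of path expressions, the sketch below selects titles and ratings from a hand-written snippet. It uses the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath; lxml elements offer a fuller xpath() method. The sample markup is made up and only loosely imitates a movie list.

```python
import xml.etree.ElementTree as ET

# Hand-written sample markup (an assumption, not real Douban HTML).
SAMPLE = """
<ol class="grid_view">
  <li><span class="title">Movie A</span><span class="rating_num">9.7</span></li>
  <li><span class="title">Movie B</span><span class="rating_num">9.6</span></li>
</ol>
""".strip()

root = ET.fromstring(SAMPLE)

# .//li/span[@class='title'] : any <li> below the root, then its <span class="title"> children
titles = [e.text for e in root.findall(".//li/span[@class='title']")]
# .//span[@class='rating_num'] : any <span class="rating_num"> at any depth
ratings = [e.text for e in root.findall(".//span[@class='rating_num']")]

print(list(zip(titles, ratings)))
```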

The code is below; I made a few changes to the original.

# -*- coding: utf8 -*-
import codecs
import requests
from bs4 import BeautifulSoup
 
# Start URL
DOWNLOAD_URL = 'http://movie.douban.com/top250/'
 
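# def defines the download_page function (like `function` in PHP)
def download_page(url):
    # requests.get() sends the request; .content is the raw response bytes,
    # while .text is the body decoded to a Unicode string
    return requests.get(url).content
    #return requests.get(url).text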
 
# Define the parse_html function
def parse_html(html):
    # Parse with BeautifulSoup, using lxml as the parser
    soup = BeautifulSoup(html,"lxml")
    # Find the list container: <ol class="grid_view">
    movie_list_soup = soup.find('ol', attrs={'class': 'grid_view'})
    movie_name_list = []
    # Loop over each <li> and read the title and rating from its <span> tags
    for movie_li in movie_list_soup.find_all('li'):
        detail = movie_li.find('div', attrs={'class': 'hd'})

        # Rating
        star = movie_li.find('span', attrs={'class': 'rating_num'}).getText()

        # Title
        movie_name = detail.find('span', attrs={'class': 'title'}).getText()

        # append() adds an element to the end of the list
        movie_name_list.append(movie_name+','+star)

    # Find the link to the next page, if there is one
    next_page = soup.find('span', attrs={'class': 'next'}).find('a')
    if next_page:
        return movie_name_list, DOWNLOAD_URL + next_page['href']
    return movie_name_list, None
 
# Define the main function
def main():
    url = DOWNLOAD_URL
    # The with statement ensures the file is closed even if an exception is raised
    '''
    # Save as txt
    with codecs.open('movies.txt', 'wb', encoding='utf-8') as fp:
        while url:
            html = download_page(url)
            movies, url = parse_html(html)
            fp.write(u'{movies}\n'.format(movies='\n'.join(movies)))
    '''

    # Save as csv
    with codecs.open('movies.csv', 'wb', encoding='gbk') as fp:
        while url:
            html = download_page(url)
            movies, url = parse_html(html)
            fp.write(u'{movies}\n'.format(movies='\n'.join(movies)))
 
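# __name__ is the name of the current module; it is '__main__' when the module is run directly. The code below therefore runs when the file is executed directly, but not when it is imported.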
if __name__ == '__main__':
    main()
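
The script above builds each CSV line by joining the title and rating with a comma, which works as long as no title contains a comma itself. As an alternative, here is a minimal sketch using the standard csv module, which quotes such fields automatically. It targets Python 3 and writes to an in-memory buffer; rows_to_csv() and the sample rows are invented for the example.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize (title, rating) pairs to CSV text, quoting fields when needed."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["title", "rating"])  # header row
    writer.writerows(rows)
    return buf.getvalue()


# Made-up rows; the second title contains a comma on purpose.
sample = [("Movie A", "9.7"), ("Movie, with comma", "9.0")]
print(rows_to_csv(sample))
```

To write to a real file in Python 3, the same writer can be attached to a file opened with open('movies.csv', 'w', newline='', encoding=...), choosing whichever encoding your spreadsheet program expects.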

Running the script writes the results to movies.csv.

If you are interested, you can extract additional fields as well.

Also, after I ran the script too many times, my IP was flagged as abnormal, and I had to log in before I could continue.
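
One thing that might lower the chance of being flagged is pausing between page requests and capping the number of pages per run. I have not verified that this is enough for Douban, so treat the sketch below as an assumption. crawl_politely() is a hypothetical helper: fetch and parse are passed in as arguments so that download_page and parse_html from the script above could be plugged in, and the example run uses stand-in functions rather than the network.

```python
import time


def crawl_politely(start_url, fetch, parse, delay=2.0, max_pages=10):
    """Follow next-page links, sleeping `delay` seconds between requests."""
    results, url, pages = [], start_url, 0
    while url and pages < max_pages:
        html = fetch(url)
        items, url = parse(html)  # parse returns (items, next_url or None)
        results.extend(items)
        pages += 1
        if url:
            time.sleep(delay)     # pause before requesting the next page
    return results


# Stand-in "site": each key maps to (items on that page, next page key).
FAKE_SITE = {"p1": (["a", "b"], "p2"), "p2": (["c"], None)}


def fake_fetch(url):
    return url                    # pretend the page body is just its key


def fake_parse(html):
    return FAKE_SITE[html]


print(crawl_politely("p1", fake_fetch, fake_parse, delay=0))
```

Sending a browser-like User-Agent through the headers argument of requests.get() is another commonly used measure, again with no guarantee.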

没人不认识我
