Python Web Scraping

Requests to scrape data come up constantly in day-to-day work, and Python handles them for me.

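Requirement 1: Scraping an app's rating

Modules used:

1. requests: sends the HTTP request and fetches the page's HTML source

2. re: pulls the required data fields out with regular expressions

Code: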

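import re
import requests

url = 'http://app.flyme.cn/games/public/detail?package_name=xxxxx'
ret = requests.get(url)
# Capture the data-num attribute that follows the "魅友评分：" (Flyme user rating) label.
# A capture group avoids picking up unrelated digits earlier on the same line.
match = re.search(r'魅友评分：</span>\r\n.*star_bg" data-num="(\d+)"', ret.text)
# data-num is divided by 10 to get the displayed rating
rate_value = int(match.group(1)) / 10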
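
To check the regular expression without hitting the network, the extraction can be tried against a small hand-written HTML fragment. This is only a minimal sketch: the fragment below is made up to mimic the structure the pattern expects (the real page markup may differ), and extract_rating is a hypothetical helper name.

```python
import re

# Hypothetical HTML fragment, written by hand to mimic what the pattern expects;
# the real page markup may differ.
SAMPLE_HTML = (
    '<span>魅友评分：</span>\r\n'
    '<div class="star_bg" data-num="45"></div>'
)

RATING_PATTERN = re.compile(r'魅友评分：</span>\r\n.*star_bg" data-num="(\d+)"')


def extract_rating(html):
    """Return the rating as a float, or None if the pattern is not found."""
    match = RATING_PATTERN.search(html)
    if match is None:
        return None
    # Same conversion as the scraping code: data-num divided by 10
    return int(match.group(1)) / 10


print(extract_rating(SAMPLE_HTML))
print(extract_rating('<html></html>'))
```

Returning None instead of indexing into an empty result makes it easier to notice when the page layout changes and the pattern stops matching.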

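Requirement 2: Scraping an app's reviews

Difficulty: Review data is usually dynamic content loaded by JavaScript, so the data you need is not in the HTML source. Use the Network panel in the browser's developer tools to trace the endpoint that returns the reviews. For Huawei, for example, the review endpoint can be found by inspecting the official app store website: https://appgallery.cloud.huawei.com/uowap/index?method=internal.user.commenList3&serviceType=13&reqPageNum=1&maxResults=5&appid=C101886617&locale=zh_CN&LOCALE_NAME=zh_CN&version=10.0.0 .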
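
Once an endpoint like this has been found, it helps to split it into its query parameters to see which ones can be varied. The sketch below uses only the standard library; with_page is a hypothetical helper, and reading reqPageNum as the page number and maxResults as the page size is an assumption based on the parameter names.

```python
from urllib.parse import urlsplit, parse_qs, urlencode, urlunsplit

API_URL = (
    "https://appgallery.cloud.huawei.com/uowap/index?method=internal.user.commenList3"
    "&serviceType=13&reqPageNum=1&maxResults=5&appid=C101886617"
    "&locale=zh_CN&LOCALE_NAME=zh_CN&version=10.0.0"
)


def with_page(url, page_num):
    """Return the same URL with reqPageNum replaced by page_num."""
    parts = urlsplit(url)
    # parse_qs maps each parameter name to a list of values
    params = parse_qs(parts.query)
    params["reqPageNum"] = [str(page_num)]
    query = urlencode(params, doseq=True)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


# List every query parameter of the traced endpoint
for name, values in parse_qs(urlsplit(API_URL).query).items():
    print(name, "=", values[0])

print(with_page(API_URL, 2))
```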

Modules used:

1. requests: sends the HTTP request and fetches the review response

2. json: parses the JSON response data

Code:

import json
import requests

page_num = 1  # page number to request

url = "https://appgallery.cloud.huawei.com/uowap/index?method=internal.user.commenList3&serviceType=13&reqPageNum=%d&maxResults=5&appid=C101886617&locale=zh_CN&LOCALE_NAME=zh_CN&version=10.0.0" % page_num
review_json = requests.get(url).text
# Parse the JSON text into a Python dict
review_text = json.loads(review_json)
# The reviews are under the 'list' key
review_entry = review_text['list']
for review in review_entry:
    version = review['versionName']
    title = review['title']
    comment = review['commentInfo']
    rate = float(review['rating'])
    review_id = review['id']
    opertime = review['operTime']
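
The field handling can be tried offline by feeding a sample response body through the same parsing steps. This is a rough sketch, not the endpoint's documented format: only the field names are taken from the code above, the values in SAMPLE_RESPONSE are made up, and parse_reviews is a hypothetical helper.

```python
import json

# Hypothetical response body; only the field names come from the scraping code,
# the values are invented for illustration.
SAMPLE_RESPONSE = '''
{
  "list": [
    {"id": "r1", "versionName": "1.0.0", "title": "Nice", "commentInfo": "Works well",
     "rating": "5", "operTime": "2020-01-01 10:00:00"},
    {"id": "r2", "versionName": "1.0.1", "title": "", "commentInfo": "Crashes on start",
     "rating": "1", "operTime": "2020-01-02 11:30:00"}
  ]
}
'''


def parse_reviews(body):
    """Turn one response body into a list of plain dicts."""
    data = json.loads(body)
    reviews = []
    # .get() with a default keeps the loop from failing if "list" is absent
    for review in data.get("list", []):
        reviews.append({
            "id": review["id"],
            "version": review["versionName"],
            "title": review["title"],
            "comment": review["commentInfo"],
            "rate": float(review["rating"]),
            "opertime": review["operTime"],
        })
    return reviews


reviews = parse_reviews(SAMPLE_RESPONSE)
print(len(reviews), reviews[0]["rate"])
```

Collecting each review into a dict, rather than into loop-local variables, keeps the results available after the loop ends.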

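Requirement 3: Scraping content pages from a website

Difficulty: When there is no static data and no data endpoint can be traced, it is time to bring out the heavy artillery. Headless Chrome is Chrome without a user interface. Paired with selenium, it can simulate clicks and other browser interactions and return the HTML source after the dynamic content has been rendered.

Modules used:

1. selenium: drives headless Chrome to simulate browser interactions and obtain the rendered HTML source

2. bs4: BeautifulSoup, used to parse HTML tags

3. re: pulls the required data fields out with regular expressions

Code: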

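from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import re
import os

"""
# Ways to locate an element with a CSS selector (for example, before clicking it).
# find_element(By.CSS_SELECTOR, ...) replaces find_element_by_css_selector(),
# which was removed in Selenium 4.3.
# 1. Tag name combined with an id value
driver.find_element(By.CSS_SELECTOR, 'input#kw')
# 2. Tag name combined with a class value
driver.find_element(By.CSS_SELECTOR, 'input.s_ipt')
# 3. Tag name combined with an attribute and its value
driver.find_element(By.CSS_SELECTOR, 'input[name="wd"]')
# 4. Tag name combined with an attribute name
driver.find_element(By.CSS_SELECTOR, 'input[name]')
# 5. Several attributes combined
driver.find_element(By.CSS_SELECTOR, '[class="s_ipt"][name="wd"]')
"""

url = "https://zz.hnzwfw.gov.cn/zzzw/item/deptInfo.do?deptUnid=001003008002030&deptName=" + \
            "%E5%B8%82%E5%8D%AB%E7%94%9F%E5%81%A5%E5%BA%B7%E5%A7%94%E5%91%98%E4%BC%9A&areaCode=410100000000#sxqdL"

# Step 1: initialize headless Chrome
chrome_options = Options()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)
driver.set_window_size(1024, 7680)

# Step 2: open the page
driver.get(url)

cnt = 11  # 11 pages in total
bianhao = 1  # running item number across all pages
while cnt > 0:
    # Step 3: parse the rendered content
    soup = BeautifulSoup(driver.page_source, "html.parser")
    item_list = soup.find('ul', id='deptItemList')
    span_list = item_list.find_all('span')
    for i in span_list:
        # Drop the page's own numbering and print our own running number
        print(str(bianhao) + ". " + i.string.split('.')[1])
        bianhao += 1
    cnt -= 1
    if cnt == 0:
        break
    # Step 4: go to the next page (simulate clicking "next page").
    # page_source is read right after the click; add a wait here if the
    # list has not refreshed by then.
    driver.find_element(By.CSS_SELECTOR, 'a.laypage_next').click()

# Close the browser when done
driver.quit()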
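
The expression i.string.split('.')[1] in the loop raises an error when a span has no single text node (i.string is None) and drops part of the text when the item name itself contains a dot. A slightly more defensive variant might look like the sketch below. It assumes the span text has the form "N.name", which is inferred from the split in the code above, and strip_number_prefix is a hypothetical helper name.

```python
def strip_number_prefix(text):
    """Drop a leading "N." prefix; return None when there is no text."""
    if text is None:
        return None
    # partition() splits only at the first dot, so later dots survive
    head, sep, tail = text.partition('.')
    if sep and head.strip().isdigit():
        return tail.strip()
    # No numeric prefix found: return the text unchanged (minus whitespace)
    return text.strip()


print(strip_number_prefix("3.Item v2.0"))
print(strip_number_prefix("No prefix"))
print(strip_number_prefix(None))
```

Inside the loop, the print call would then use strip_number_prefix(i.string) and skip entries for which it returns None.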
