Python: Web Crawling

In my daily work there is an endless stream of data-crawling requests, and Python is what I use to handle them.

Requirement 1: Crawling an APP's rating

Related modules:

1. requests: sends HTTP requests and fetches the HTML source of the page

2. re: extracts the required data fields with regular expressions

Code:

import re
import requests

url = 'http://app.flyme.cn/games/public/detail?package_name=xxxxx'
ret = requests.get(url)
# The score is embedded in the page markup as ... star_bg" data-num="NN"
tmp_str = re.findall(r'魅友评分:</span>\r\n.*star_bg" data-num="\d+"', ret.text)
rate = re.findall(r'\d+', tmp_str[0])  # pull the digit runs out of the matched snippet
rate_value = int(rate[0]) / 10         # data-num is ten times the displayed score
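Regex scraping fails silently when the page layout changes; a slightly more defensive sketch (the helper name and the simplified regex are my own, not from the original) checks the HTTP status and guards against a missing match:

import re
import requests

def fetch_flyme_rating(package_name):
    # Hypothetical helper: fetch the detail page and read the data-num attribute.
    url = 'http://app.flyme.cn/games/public/detail?package_name=' + package_name
    ret = requests.get(url, timeout=10)
    ret.raise_for_status()          # fail fast on HTTP errors
    match = re.search(r'star_bg" data-num="(\d+)"', ret.text)
    if match is None:
        return None                 # layout changed or app not found
    return int(match.group(1)) / 10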

Requirement 2: Crawling an APP's reviews

Difficulty: Review data is usually loaded dynamically via JavaScript, so the required fields are not in the static HTML source. Use the Network panel in the browser's developer tools to trace the requests and identify the API that serves the comments. For example, Huawei's review data can be obtained from the AppGallery website through the following interface: https://appgallery.cloud.huawei.com/uowap/index?method=internal.user.commenList3&serviceType=13&reqPageNum=1&maxResults=5&appid=C101886617&locale=zh_CN&LOCALE_NAME=zh_CN&version=10.0.0.
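Before writing the parser it helps to confirm what the interface actually returns; a quick sketch (using page 1 of the endpoint quoted above) pretty-prints the JSON so the field names can be read off:

import json
import requests

url = ("https://appgallery.cloud.huawei.com/uowap/index?method=internal.user.commenList3"
       "&serviceType=13&reqPageNum=1&maxResults=5&appid=C101886617"
       "&locale=zh_CN&LOCALE_NAME=zh_CN&version=10.0.0")
# Dump the raw response with readable (non-escaped) Chinese text.
print(json.dumps(requests.get(url).json(), ensure_ascii=False, indent=2))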

Related modules:

1. requests: sends the HTTP request and fetches the review payload

2. json: parses the returned JSON data

Code:

url = "https://appgallery.cloud.huawei.com/uowap/index?method=internal.user.commenList3&serviceType=13&reqPageNum=%d&maxResults=5&appid=C101886617&locale=zh_CN&LOCALE_NAME=zh_CN&version=10.0.0" % (i)
review_json = requests.get(url).text
review_text = json.loads(review_json)
review_entry = review_text['list']
for review in review_entry:
  version = review['versionName']
  title = review['title']
  comment = review['commentInfo']
  rate = float(review['rating'])
  review_id = review['id']
  opertime = review['operTime']
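To keep what the loop extracts, a minimal sketch (the reviews.csv filename, the column order, and the page range are assumptions) writes each record out with the standard csv module:

import csv
import json
import requests

url_tpl = ("https://appgallery.cloud.huawei.com/uowap/index?method=internal.user.commenList3"
           "&serviceType=13&reqPageNum=%d&maxResults=5&appid=C101886617"
           "&locale=zh_CN&LOCALE_NAME=zh_CN&version=10.0.0")

with open('reviews.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'version', 'rating', 'title', 'comment', 'time'])
    for i in range(1, 6):                    # same page range as above
        data = json.loads(requests.get(url_tpl % i).text)
        for review in data.get('list', []):  # skip gracefully if a page is empty
            writer.writerow([review['id'], review['versionName'], review['rating'],
                             review['title'], review['commentInfo'], review['operTime']])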

Requirement 3: Crawling a website's content pages

Difficulty: When there is no static data and no data API can be traced, it is time for the heavy weapon. Headless Chrome is Chrome running without a GUI; driven by Selenium, it can simulate all kinds of browser clicks and return the HTML source as it looks after the page's JavaScript has run.

Related modules:

1. selenium: drives headless Chrome, simulating browser clicks and retrieving the HTML source after dynamic rendering

2. bs4: BeautifulSoup, used to parse HTML tags

3. re: extracts the required data fields with regular expressions

Code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import re
import os

"""
# 模拟点击的方式汇总
# 1. 标签名及id属性值组合定位
driver.find_element_by_css_selector("input#kw")
# 2.  标签名及class属性值组合定位
driver.find_element_by_css_selector("input.s_ipt")
# 3. 标签名及属性(含属性值)组合定位
driver.find_element_by_css_selector("input[name="wd"]")
# 4. 标签及属性名组合定位
driver.find_element_by_css_selector("input[name]")
# 5. 多个属性组合定位
driver.find_element_by_css_selector("[class="s_ipt"][name="wd"]")
"""

url = "https://zz.hnzwfw.gov.cn/zzzw/item/deptInfo.do?deptUnid=001003008002030&deptName=" + \
            "%E5%B8%82%E5%8D%AB%E7%94%9F%E5%81%A5%E5%BA%B7%E5%A7%94%E5%91%98%E4%BC%9A&areaCode=410100000000#sxqdL"

# Step 1: initialize headless Chrome
chrome_options = Options()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)
driver.set_window_size(1024, 7680)

# Step 2: open the page
driver.get(url)

cnt = 11     # 11 pages in total
bianhao = 1  # running item number
while cnt > 0:
    # Step 3: parse the content
    soup = BeautifulSoup(driver.page_source, "html.parser")
    item_list = soup.find('ul', id='deptItemList')
    span_list = item_list.findAll('span')
    for i in span_list:
        print(str(bianhao) + ". " + i.string.split('.')[1])
        bianhao += 1
    cnt -= 1
    if cnt == 0:
        break
    # Step 4: turn the page (simulate clicking the next-page link)
    driver.find_element_by_css_selector('a.laypage_next').click()
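One caveat about the loop above: after the next-page click the new list is rendered asynchronously, so the next iteration may parse the old page. A sketch of an explicit wait (assuming the list node is replaced on re-render; if the site mutates it in place, a different wait condition such as a text change would be needed):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

old_list = driver.find_element_by_id('deptItemList')
driver.find_element_by_css_selector('a.laypage_next').click()
# Block for up to 10 seconds until the old list is detached from the DOM;
# after that it is safe to re-read driver.page_source.
WebDriverWait(driver, 10).until(EC.staleness_of(old_list))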



