A basic crawler in practice: scraping game product data


Foreword

When you want to collect some data from a website, you can always copy and paste it by hand. When the amount of data is small that is merely tedious; when the amount is large it becomes hopeless. That is where crawlers come in. This article introduces the basics of writing a crawler.


1. What is a crawler?

A web crawler (also known as a web spider or web robot, and more often called a web chaser in the FOAF community) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Other, less commonly used names include ant, autoindexer, emulator, and worm.
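As a minimal illustration of the idea (not part of the case study below), a crawler can be as small as a few lines: download one page with requests and pull a piece of information out of the HTML. The URL here is only a placeholder.

import re
import requests

# Toy crawler: download one page and print its <title> text.
# 'https://example.com' is just a placeholder URL for illustration.
resp = requests.get('https://example.com', timeout=10)
if resp.status_code == 200:
    match = re.search(r'<title>(.*?)</title>', resp.text, re.S)
    if match:
        print(match.group(1).strip())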

2. A practical crawler case

1. Import libraries

First import the libraries that will be used below; requests, pymongo, lxml, retry, and fake_useragent are third-party packages that need to be installed first.
The code is as follows (example):

# requests is used to fetch pages
import requests
# logging is used to output information
import logging
# re is used for regular-expression parsing
import re
# pymongo is used to connect to the MongoDB database
import pymongo
# random is used to pick a random proxy
import random
from lxml import etree
# urljoin is used to join URLs
from urllib.parse import urljoin
# multiprocessing provides the process pool
import multiprocessing
# pymongo has its own connection pool and auto-reconnect mechanism, but we still
# need to catch the AutoReconnect exception and retry the request
from pymongo.errors import AutoReconnect
from retry import retry
# fake_useragent is used to get a random User-Agent
from fake_useragent import UserAgent
import time

2. Request the page

The scrape_page method imitates a browser sending a request, checks whether the page was fetched successfully, and handles exceptions.
The code is as follows (example):

# start time
start = time.time()

'''
If no handler has been defined for the root logger, debug(), info(), warning(),
error() and critical() automatically call basicConfig().

level   Sets the root logger to the specified level. The default level of the
        root logger is logging.WARNING; messages below that level are not
        output. Level order: CRITICAL > ERROR > WARNING > INFO > DEBUG.
        (To show messages of every level, use level=logging.NOTSET.)

format  The format string used by the handler, i.e. the output format.
'''
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s: %(message)s')

# request headers
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cache-control': 'max-age=0',
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'User-Agent': UserAgent().chrome
}

# proxy IP list
Proxies = [
    {"http": "http://3.221.105.1:80"},
    {"http": "http://3.221.105.1:80"},
    {"http": "http://3.221.105.1:80"},
]

def scrape_page(url):
    logging.info('scraping %s (···)', url)
    try:
        # send a GET request to the server at the url and get the response
        response = requests.get(
            url, proxies=random.choice(Proxies), headers=headers)
        # if the status code is 200, return the page source
        if response.status_code == 200:
            return response.text
        # if it is not 200, log the status code and the link
        logging.error('get invalid status code %s while scraping %s',
                      response.status_code, url)
    # exception handling for requests
    except requests.RequestException:
        # exc_info is a boolean; if True, the exception information is added to
        # the log message, otherwise None is added
        logging.error('error occurred while scraping %s', url, exc_info=True)
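One detail worth noting: requests.get() can block indefinitely if the server never responds, because no timeout is passed. A minimal sketch of how the same function could fail fast; the function name scrape_page_with_timeout and the 10-second value are my own choices, not from the original script.

def scrape_page_with_timeout(url, timeout=10):
    """Same idea as scrape_page, but give up if the server hangs."""
    try:
        response = requests.get(url, proxies=random.choice(Proxies),
                                headers=headers, timeout=timeout)
        if response.status_code == 200:
            return response.text
        logging.error('get invalid status code %s while scraping %s',
                      response.status_code, url)
    except requests.RequestException:
        logging.error('error occurred while scraping %s', url, exc_info=True)

# example call
# html = scrape_page_with_timeout('https://www.taoshouyou.com/game/zhaohuanyingxiong-24798-0-3')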

3. Generate the links to visit

The scrape_index method reads the total page count from the first listing page and yields every page URL from a generator for later use.
The code is as follows (example):

# the website link to crawl
TSY_URL = 'https://www.taoshouyou.com/game/zhaohuanyingxiong-24798-0-3'

GAME_NAME = '召唤英雄'
# initial value for the total number of pages to crawl
TOTAL_PAGE = 1

def scrape_index(url):
    html = scrape_page(url)
    # get the page count: the pagination text looks like "current/total",
    # so take the part after the '/'
    pattern = r'<li><a>(.*?)</a></li>'
    TOTAL_PAGE = re.findall(pattern, html, re.S)[0]
    TOTAL_PAGE = TOTAL_PAGE.split('/')[1]

    # generate the list of links
    for page in range(1, int(TOTAL_PAGE) + 1):
        # join the url
        game_url = url + '/0-0-0-0-0-1-0-0-0-' + str(page) + '?quotaid=0'
        yield game_url
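Because scrape_index returns a generator, the links are produced lazily as they are consumed. A quick way to inspect the generated listing-page URLs (purely an illustrative snippet, not part of the original script):

# print the listing-page URLs that would be crawled
for page_url in scrape_index(TSY_URL):
    print(page_url)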

4. Write data into MongoDB

Write the crawled data into the MongoDB database.
The code is as follows (example):


# MongoDB connection string, database name and collection name
MONGO_CONNECTION_STRING = 'mongodb://192.168.27.101:27017'
MONGO_DB_NAME = 'tsy'
MONGO_COLLECTION_NAME = 'tsy'

client = pymongo.MongoClient(MONGO_CONNECTION_STRING)
db = client[MONGO_DB_NAME]
collection = db[MONGO_COLLECTION_NAME]

'''
AutoReconnect: retry when this error is caught; this argument can also be a
               tuple containing several exception types to retry on
tries: number of retries
delay: interval between two retries
'''
@retry(AutoReconnect, tries=4, delay=1)
def save_data(data):
    """
    Save the data to MongoDB.
    update_one() modifies a record in the collection: the first argument is the
    query condition, the second argument is the fields to modify.
    upsert:
    a special kind of update; if no document matches the query condition, a new
    document is created from the condition and the update document, otherwise
    the matching document is updated normally. upsert is very convenient: there
    is no need to pre-populate the collection, and the same code can both create
    and update documents.
    """
    # update the document if it exists, insert it if it does not
    collection.update_one({
        # make sure the '标题链接' (title link) field stays unique
        '标题链接': data.get('标题链接')
    }, {
        '$set': data
    }, upsert=True)
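Because the upsert keys off the '标题链接' field, MongoDB runs a query on that field for every saved record. If the collection grows large, an index on that field keeps the upsert fast. This is an optional addition, not part of the original script; creating an index that already exists is a no-op in MongoDB, so it is safe to call on every run.

# Optional: index the title-link field so the upsert query does not scan the collection.
collection.create_index('标题链接')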

5. Extract the data

Extract the fields from each listing and do some simple processing on them.
The code is as follows (example):

def parse_detail(url):
    html = scrape_page(url)
    # parse the html into an element tree
    selector = etree.HTML(html)
    house_list = selector.xpath(
        '//*[@id="js-b-trade-list-conlist-trade-list"]/div[@class="row b-trade-list-conlist-box"]')
    # iterate over every listing
    for house in house_list:
        biaoti = house.xpath("h1/a/span/text()")
        if len(biaoti) >= 1:
            biaoti = biaoti[0]
            zhekou = house.xpath("h1/a/span[2]/text()")
            jiage = re.findall(r'(\d+\.\d{2})', biaoti.strip())
            dianpu = house.xpath('div[1]/dl/dd[1]/span[2]/a/text()')
            shangpinleixing = house.xpath('div[1]/dl/dd[2]/text()')
            kehuduan = house.xpath('div[1]/dl/dd[3]/text()')
            youxiqufu = house.xpath('div[1]/dl/dd[4]/span/text()')
            html_1 = house.xpath('h1/a/@href')
            # skip the record if any required field is missing (xpath returned an empty list)
            if zhekou == [] or dianpu == [] or shangpinleixing == [] or kehuduan == [] or youxiqufu == []:
                print("这不是我需要的")  # "this is not the data I need"
            else:

                zhekou = re.findall(r'\d+\.?\d*', zhekou[0])
                game_date = {
                    '标题': biaoti.strip(),
                    '折扣': zhekou[0],
                    '价格': jiage[0][:-3],
                    '店铺': dianpu[0],
                    '商品类型': shangpinleixing[0].split(':')[1],
                    '客户端类型': kehuduan[0].split(':')[1],
                    '游戏区服': youxiqufu[0],
                    # join the relative href into an absolute url
                    '标题链接': urljoin('https://www.taoshouyou.com', html_1[0]),
                    '游戏名': GAME_NAME
                }
                logging.info('get detail data %s', game_date)
                logging.info('saving data to mongodb')
                save_data(game_date)
                logging.info('data saved successfully')
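The repeated emptiness checks above come from the fact that xpath() always returns a list, which may be empty. One way to tighten this up is a small helper that returns the first match or a default; this is a sketch of an alternative, not the article's code, and the helper name first is my own.

def first(values, default=''):
    """Return the first element of an xpath() result list, or a default value."""
    return values[0] if values else default

# usage inside the loop, assuming `house` is one listing element:
# dianpu = first(house.xpath('div[1]/dl/dd[1]/span[2]/a/text()'))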

6. Add multiprocessing

The script is parallelized with a multiprocessing.Pool, i.e. a pool of worker processes.
The code is as follows (example):

if __name__ == '__main__':
    # create the process pool
    pool = multiprocessing.Pool()
    detail_urls = scrape_index(TSY_URL)
    # map() takes two arguments: the first is the function to call, the second
    # is an iterable; map passes the elements of the iterable one by one as the
    # argument to the function.

    # pass in the iterable of urls; parse_detail handles one url per call
    pool.map(parse_detail, detail_urls)
    # close the mongodb connection
    client.close()

    # close the process pool so it accepts no new tasks
    pool.close()
    # the main process blocks and waits for the worker processes to exit
    pool.join()

    # end time
    end = time.time()
    print('Cost time: ', end - start)
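Since the work here is I/O-bound (mostly waiting on HTTP responses), a thread pool would also do the job without spawning whole processes; multiprocessing.pool.ThreadPool exposes the same map() interface. A sketch of an alternative main block, assuming the rest of the script is unchanged; the pool size of 8 is an arbitrary choice.

from multiprocessing.pool import ThreadPool

if __name__ == '__main__':
    start = time.time()
    # threads share one interpreter, which is fine for I/O-bound scraping
    with ThreadPool(processes=8) as pool:
        pool.map(parse_detail, scrape_index(TSY_URL))
    client.close()
    print('Cost time: ', time.time() - start)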

Summary

That is all for today. This article only briefly introduces crawling pages whose data is present directly in the page source.
