[Python crawler series tutorial 29-100] Using the Scrapy framework to crawl an entire beauty-photo site (as many pictures as you want), while learning custom pipelines (images and CSV)

It's 2021 and I still haven't crawled your favorite beauty pictures; does my crawler look out of date? No matter. Although the site's interface has changed, this post walks through how to crawl the whole beauty site in 2021. It is an improved project into which I worked many of my own ideas: the parts that were hard to understand I re-implemented in a simpler way, and I think the result turned out well. Take a look; the results of the first crawl are shown at the end.

Introduction

A beauty-photo site crawler based on the Scrapy framework.

Crawler entry address: http://www.meinv.hk/?cat=2

If your crawler runs normally but returns no data, the likely reason is that the site can only be reached through a proxy/VPN.

We mainly cover two technical points here: a custom image pipeline and a custom CSV data pipeline.

Implementation process

Creating the project is trivial, so I won't go over it here.

Open the website


Clicking around the site gives us a concrete crawling plan: first crawl the popular recommendation tags, then obtain the URL of each girl's detail page, which holds her pictures.

Then we use CrawlSpider rules.

The rules attribute specifies how URLs are extracted from a response. Each extracted URL is requested again, and the new response is parsed and/or followed up according to the rule's callback function and follow attribute.

Two points are emphasized here:

  • First, URLs are extracted from every returned response, including the response to the very first request;
  • Second, every rule in the rules list is applied to each response (a minimal sketch of this follows below).
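
For illustration only, here is a minimal CrawlSpider sketch with two rules; the regular expressions are hypothetical examples, not necessarily the ones the target site needs. A rule without a callback only follows links, while a rule with a callback also parses the matched pages.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['www.meinv.hk']
    start_urls = ['http://www.meinv.hk/?cat=2']

    rules = (
        # category/pagination pages: no callback, just keep following the links found on them
        Rule(LinkExtractor(allow=r'cat=\d+'), follow=True),
        # detail pages: parse them with parse_item and keep following links from them as well
        Rule(LinkExtractor(allow=r'p=\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # both rules above are applied to every response, including this one
        yield {'detail_url': response.url}

Both rules are applied to every downloaded response, which is exactly the second point above.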

Open pipelines.py, import ImagesPipeline with from scrapy.pipelines.images import ImagesPipeline, and inherit from it.

The custom pipeline is built on top of the ImagesPipeline that ships with Scrapy.
You can override three of its methods: get_media_requests(), file_path() and item_completed().
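
A minimal skeleton of such a pipeline might look like the following (the class name and the file path returned here are placeholders; the full implementation used in this project is shown later under Specific code):

from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline

class MyImagePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # yield one download Request per URL found in item['image_urls']
        for url in item['image_urls']:
            yield Request(url)

    def file_path(self, request, response=None, info=None):
        # return the path, relative to IMAGES_STORE, where this image should be saved
        return 'full/' + request.url.split('/')[-1]

    def item_completed(self, results, item, info):
        # results is a list of (success, info_or_failure) tuples; keep the successful downloads
        item['images'] = [x for ok, x in results if ok]
        return item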


Specific code

Because the custom pipelines (images and CSV) work directly with the dicts yielded by the spider, there is no need to define anything in items.py.

mv.py

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MvSpider(CrawlSpider):
    name = 'mv'
    allowed_domains = ['www.meinv.hk']
    start_urls = ['http://www.meinv.hk/?cat=2']

    # Rule for extracting the href links of <a> tags
    # Each extracted link becomes a new Request, and a callback is set to parse the new response
    rules = (
        # allow takes a regular expression and is matched against the href of every <a> tag
        # follow=True means that after a matched page is downloaded, links are extracted from it again using these rules
        Rule(LinkExtractor(allow=r'p=\d+'), callback='parse_item', follow=True),

    )

    def parse_item(self, response):
        item = {}
        info = response.xpath('//div[@class="wshop wshop-layzeload"]/text()').extract_first()
        try:
            # the personal-info string is separated by "/"; pull hometown and birthday out of it
            item['hometown'] = info.split("/")[2].strip().split()[1]
            item['birthday'] = info.split("/")[1].strip().split()[1]
        except Exception:
            # fall back to "未知" ("unknown") when the info block is missing or malformed
            item['birthday'] = "未知"
            item['hometown'] = "未知"
        item['name'] = response.xpath('//h1[@class="title"]/text()').extract_first()
        images = response.xpath('//div[@class="post-content"]//img/@src')
        try:
            item['image_urls'] = images.extract()
        except Exception:
            item['image_urls'] = ''
        item['images'] = ''
        item['detail_url'] = response.url
        yield item

middlewares.py

import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RandomUserAgentMiddleware(UserAgentMiddleware):
    """Downloader middleware that sets a random User-Agent on every request."""

    def __init__(self, user_agent_list):
        super().__init__()
        self.user_agent_list = user_agent_list

    @classmethod
    def from_crawler(cls, crawler):
        # read the candidate User-Agent strings from USER_AGENT_LIST in settings.py
        return cls(user_agent_list=crawler.settings.get('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        # pick a random User-Agent and attach it to the outgoing request
        user_agent = random.choice(self.user_agent_list)
        if user_agent:
            request.headers['User-Agent'] = user_agent
        return None

pipelines.py

import csv
import os
from hashlib import sha1

from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline
from meinv import settings


class MvImagePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # issue one download request per image URL; carry the person's name along in meta
        for url in item['image_urls']:
            yield Request(url, meta={'name': item['name']})

    def item_completed(self, results, item, info):
        # store the paths of the successfully downloaded images back into the item
        item['images'] = [x for ok, x in results if ok]
        return item

    def file_path(self, request, response=None, info=None):
        # create one directory per person to hold all of her pictures
        author_name = request.meta['name']
        author_dir = os.path.join(settings.IMAGES_STORE, author_name)
        if not os.path.exists(author_dir):
            os.makedirs(author_dir)

        # extract the file name and extension from the URL
        try:
            filename = request.url.split("/")[-1].split(".")[0]
        except Exception:
            filename = sha1(request.url.encode(encoding='utf-8')).hexdigest()
        try:
            ext_name = request.url.split(".")[-1]
        except Exception:
            ext_name = 'jpg'

        # return the path relative to IMAGES_STORE
        return '%s/%s.%s' % (author_name, filename, ext_name)


class MeinvPipeline(object):
    def __init__(self):
        self.csv_filename = 'meinv.csv'
        self.existed_header = False

    def process_item(self, item, spider):
        # item is the dict yielded by MvSpider.parse_item()
        with open(self.csv_filename, 'a', encoding='utf-8', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=(
                'name', 'hometown', 'birthday', 'detail_url'))
            if not self.existed_header:
                # write the header row only once, on the first item processed
                writer.writeheader()
                self.existed_header = True
            data = {
                'name': item['name'].strip(),
                'hometown': item['hometown'],
                'birthday': item['birthday'].replace('年', '-').replace('月', '-').replace('日', ''),
                'detail_url': item['detail_url'],
            }
            writer.writerow(data)
        return item

settings.py

import os

BOT_NAME = 'meinv'

SPIDER_MODULES = ['meinv.spiders']
NEWSPIDER_MODULE = 'meinv.spiders'

# Do not obey robots.txt rules
ROBOTSTXT_OBEY = False

DOWNLOAD_DELAY = 1

DOWNLOADER_MIDDLEWARES = {
   'meinv.middlewares.RandomUserAgentMiddleware': 543,
}

BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
# directory where the image pipeline stores downloaded pictures
IMAGES_STORE = os.path.join(BASE_DIR, 'images')

ITEM_PIPELINES = {
    'meinv.pipelines.MeinvPipeline': 300,
    'meinv.pipelines.MvImagePipeline': 100,
}


USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1'
]
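
With everything in place, the spider is normally started from the project root with the scrapy crawl mv command. If you prefer to launch it from a Python script instead, a minimal sketch looks like this (the import path meinv.spiders.mv is an assumption based on the project layout above):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from meinv.spiders.mv import MvSpider

# load the project settings (pipelines, middlewares, IMAGES_STORE, ...) and run the spider
process = CrawlerProcess(get_project_settings())
process.crawl(MvSpider)
process.start()

The results of the first crawl are shown below.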

1. Project file directory

2. One person's downloaded pictures

3. Contents of the CSV file


Finally, here are a few of the downloaded pictures.

Code download

https://download.csdn.net/download/weixin_54707168/15902904
