Scrapy (4): a spider that fetches beautiful pictures for you

We all want to download nice pictures to decorate our desktops, but many of the sites that host them charge a fee, which is really annoying. So today I will walk everyone through building a tool that crawls good-looking pictures from a website.


Am I excited? Yes, I'm super excited. One announcement before we start: from now on, "Today's Financial Vocabulary" and "A Python Interview Question" will both be updated at the same time every day. Stay tuned, thank you for your attention, and likes, follows, and bookmarks are all welcome. If you only read without following, you're not a good person... haha, just kidding.


Haha, okay, let's get down to business.

Here is the link:


https://image.so.com/




Before creating the project, we need to analyze the site's data. Open the homepage and click the "Beauty" category; we land on a page whose data is rendered by Ajax and returned in JSONP form. The callback function name changes randomly every time the page is refreshed, which means the code we write may be time-sensitive.




Let's click on a random picture to open its detail page.



On that page, press F12 to open the developer tools. We can see that the data looks like this, with detailed information for every picture. Open this link to preview it yourself:


https://image.so.com/zjl?ch=beauty&direction=next&sn=0&pn=30&prevsn=-1

We can see each picture's details: id, title, and image URL.




Next, look at the request headers to see which parameters are required. From the screenshot, we need ch, sn, and pn.




So we can splice together a link like the one below; readers can visit it themselves:


https://image.so.com/zjl?ch=beauty&direction=next&prevsn=-1&sn=180
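
Before writing any Scrapy code, it is worth previewing the JSON by hand. Below is a minimal sketch using the requests library; it assumes the endpoint still accepts these parameters and that each list entry carries id, title, and qhimg_url fields, as seen in the browser:

import requests

params = {
    'ch': 'beauty',
    'direction': 'next',
    'prevsn': -1,
    'sn': 0,
}
resp = requests.get('https://image.so.com/zjl', params=params)
data = resp.json()
# print a few entries to confirm the field names used later in the spider
for image in data.get('list', [])[:3]:
    print(image.get('id'), image.get('title'), image.get('qhimg_url'))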


Create the project


scrapy startproject images
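
For reference, the command above generates a project layout roughly like the following; the user_agents.py file and the spider module are ones we add ourselves in the next steps:

images/
    scrapy.cfg
    images/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        user_agents.py   # we create this file ourselves below
        spiders/
            __init__.py
            images.py    # we create this spider ourselves below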



Define our items.py


When crawling pictures, we usually need each picture's download link, its id, and its title so the file can be saved, so we define the Item class like this:


# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ImagesItem(scrapy.Item):
   # define the fields for your item here like:
   # name = scrapy.Field()
    collection = table = 'images'  # MongoDB collection name used by the pipeline
    imageid = scrapy.Field()       # picture id
    title = scrapy.Field()         # picture title (the spider fills this field in)
    url = scrapy.Field()           # picture download URL
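
A quick aside on how Scrapy Items behave: only the fields declared above can be assigned, which catches typos early. For example (the values here are made up purely for illustration):

from images.items import ImagesItem

item = ImagesItem()
item['imageid'] = '12345'                   # hypothetical id
item['title'] = 'example picture'           # hypothetical title
item['url'] = 'https://example.com/a.jpg'   # hypothetical URL
print(dict(item))
# assigning an undeclared key raises KeyError:
# item['width'] = 800  # KeyError: 'ImagesItem does not support field: width'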

Write the spider


According to the analysis above, we need some fixed parameters: ch, direction, and prevsn. Of course, we could also let ch be supplied dynamically so you can crawl whatever category you need; here it is simply hard-coded (see the sketch after the spider code). sn is the starting offset, and it changes with every request.


# -*- coding: utf-8 -*-
import json
from urllib.parse import urlencode
from images.items import ImagesItem
from scrapy import Spider, Request


class ImagesSpider(Spider):
    name = 'images'
    allowed_domains = ['image.so.com']

    def start_requests(self):
        # fixed query-string parameters
        data = {
            'ch': 'beauty',
            'direction': 'next',
            'prevsn': -1,
        }
        # starting address of the crawl
        base_url = 'https://image.so.com/zjl?'
        # loop over pages 1..MAX_PAGE (MAX_PAGE is defined in settings.py)
        for page in range(1, self.settings.get('MAX_PAGE') + 1):
            data['sn'] = page * 30
            params = urlencode(data)
            # the spider actually crawls the API endpoint, e.g.
            # https://image.so.com/zjl?ch=beauty&direction=next&prevsn=-1&sn=30
            url = base_url + params
            yield Request(url, self.parse)

    def parse(self, response):
        # the response body is JSON, so parse it with json.loads
        result = json.loads(response.text)
        res = result.get('list')
        if res:
            for image in res:
                item = ImagesItem()
                item['imageid'] = image.get('id')     # picture id
                item['title'] = image.get('title')    # picture title
                item['url'] = image.get('qhimg_url')  # picture URL
                yield item
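
As mentioned above, ch is hard-coded to 'beauty'. If you would rather pick the category at crawl time, one possible variant (my own sketch, not part of the original project) is to accept it as a spider argument, since Scrapy passes -a options to the spider's constructor:

from urllib.parse import urlencode
from scrapy import Spider, Request


class ImagesSpider(Spider):
    name = 'images'
    allowed_domains = ['image.so.com']

    def __init__(self, ch='beauty', *args, **kwargs):
        # override from the command line: scrapy crawl images -a ch=wallpaper
        super().__init__(*args, **kwargs)
        self.ch = ch

    def start_requests(self):
        data = {'ch': self.ch, 'direction': 'next', 'prevsn': -1}
        base_url = 'https://image.so.com/zjl?'
        for page in range(1, self.settings.get('MAX_PAGE') + 1):
            data['sn'] = page * 30
            yield Request(base_url + urlencode(data), self.parse)

    # parse() stays exactly the same as in the version above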

Because this website has anti-scraping measures, we need to prepare a pool of user agents in a user_agents.py file. The full list is too long to show here, so I only kept three entries. If you need the complete list, follow the official account, reply "join the group", add me on WeChat, and I will send it to you.


agents = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; AcooBrowser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
]

When we write the middleware, we will need the user_agents.py file above.


Define the middleware


We want to pick a User-Agent at random for each request, so the server cannot tell that all the traffic comes from the same machine, which helps us avoid being banned. We can define a random User-Agent middleware like this:


import random
from images.user_agents import agents

class RandomUserAgentMiddleware(object):
    """
    Rotate the User-Agent on every request
    """

    def process_request(self, request, spider):
        # pick a random User-Agent from the pool before the request is sent
        request.headers['User-Agent'] = random.choice(agents)
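
To convince yourself the middleware works, a quick check (assuming you run it from the project root so the images package is importable) is to call process_request on a throwaway Request and look at the header it sets:

from scrapy import Request
from images.middlewares import RandomUserAgentMiddleware

mw = RandomUserAgentMiddleware()
req = Request('https://image.so.com/zjl?ch=beauty&sn=30')
mw.process_request(req, spider=None)
print(req.headers.get('User-Agent'))  # one of the agents, picked at random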

Next, define the pipelines


In essence, a pipeline acts as the storage layer. As I understand it, here we need to filter the items, connect to the database, and insert each item into it; a second pipeline takes care of downloading the image files.


We will use the pymongo library we covered earlier.


import pymongo
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

# store the item data in MongoDB
class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_port, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_port = mongo_port
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read the connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_port=crawler.settings.get('MONGO_PORT'),
            mongo_db=crawler.settings.get('MONGO_DB'),
        )

    # connect to the database when the spider starts
    def open_spider(self, spider):
        self.client = pymongo.MongoClient(host=self.mongo_uri, port=self.mongo_port)
        self.db = self.client[self.mongo_db]

    # insert each item into the collection
    def process_item(self, item, spider):
        self.db[item.collection].insert_one(dict(item))
        return item

    # close the connection when the spider finishes
    def close_spider(self, spider):
        self.client.close()

# download the image files
class ImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        # name the file after the last segment of its URL
        url = request.url
        file_name = url.split('/')[-1]
        return file_name

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Image Downloaded Failed')
        return item

    def get_media_requests(self, item, info):
        # each item carries a single picture URL in its 'url' field
        yield Request(item['url'])


settings.py


# -*- coding: utf-8 -*-

# Scrapy settings for images project

BOT_NAME = 'images'  # the name of the crawler

SPIDER_MODULES = ['images.spiders']   # spider modules
NEWSPIDER_MODULE = 'images.spiders'   # namespace for newly generated spiders


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36'  # default user agent (the local browser's)

# Obey robots.txt rules (disabled so the API endpoint can be crawled)
ROBOTSTXT_OBEY = False

# maximum number of pages to crawl
MAX_PAGE = 50

# MongoDB settings
MONGO_URI = 'localhost'  # local database uri
MONGO_PORT = 27017       # database port
MONGO_DB = 'images'      # database name

# image download path
IMAGES_STORE = './image1'  # where downloaded images are stored

DOWNLOAD_DELAY = 2  # delay between requests, in seconds

# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {  # enable the random user-agent middleware we just wrote; the number is its priority
    'images.middlewares.RandomUserAgentMiddleware': 543,
}

ITEM_PIPELINES = {  # pipeline priorities
    'images.pipelines.ImagePipeline': 300,
    'images.pipelines.MongoPipeline': 301,
}

At this point our crawler is essentially complete. As the saying goes, everything is ready except the east wind: all that remains is to run the code.


scrapy crawl images
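
Once the crawl finishes, a quick check with pymongo (using the MongoDB settings above) confirms the records were stored; the image files themselves should appear under ./image1:

import pymongo

client = pymongo.MongoClient('localhost', 27017)
db = client['images']
print(db['images'].count_documents({}))   # how many items were stored
for doc in db['images'].find().limit(3):
    print(doc['imageid'], doc['title'], doc['url'])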

Ta-da, so many beautiful pictures!





Origin blog.51cto.com/15067249/2574447