Crawling Douban group pictures with Scrapy

I was in no mood for work at the start of the Year of the Dog, so after some thought I decided to crawl some data from Douban to pass the time. I had basically never touched web crawlers before, so I was quite curious.

Introduction to Scrapy

First, a brief introduction to the Scrapy crawler framework, focusing on its architecture, which gives a quick picture of how Scrapy works.

The data flow in Scrapy is controlled by the execution engine (Engine); the basic process is as follows (a minimal spider exercising this loop is sketched after the list):

  1. The engine gets the initial Requests from the Spider.
  2. The engine puts those Requests into the Scheduler and asks for the next Requests to crawl.
  3. The Scheduler returns the next Requests to be crawled to the engine.
  4. The engine forwards the Requests to the Downloader through the downloader middleware.
  5. Once the page is downloaded, the Downloader generates a Response for it and sends it back to the engine through the downloader middleware (response direction).
  6. The engine receives the Response from the Downloader and sends it to the Spider for processing through the spider middleware (input direction).
  7. The Spider processes the Response and returns crawled Items and (follow-up) new Requests to the engine.
  8. The engine hands the crawled Items (returned by the Spider) to the Item Pipeline, hands the Requests (returned by the Spider) to the Scheduler, and asks for the next Requests (if any).
  9. The process repeats (from step 1) until there are no more Requests in the Scheduler.
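To make the flow concrete, here is a minimal, hypothetical spider; example.com and the selectors are placeholders, not part of this project. It yields both items and follow-up requests, exercising steps 7 and 8:

import scrapy


class MinimalSpider(scrapy.Spider):
    name = 'minimal'
    # step 1: the initial Requests are built from start_urls
    start_urls = ['https://example.com/page1']

    def parse(self, response):
        # step 7: hand a crawled item back to the engine
        yield {'title': response.xpath('//title/text()').extract_first()}
        # step 8: follow-up Requests go back to the Scheduler
        next_page = response.xpath('//a[@rel="next"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)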

A project created with Scrapy is laid out as follows (see the sample tree after this list):

  1. The spiders folder holds your own crawlers;
  2. settings.py configures the crawler's defaults, feature switches, middleware execution order, etc.;
  3. middlewares.py holds middleware, mainly for extending functionality with custom behavior such as user-agent rotation and proxies;
  4. items.py defines the fields to be extracted and processed;
  5. pipelines.py is the pipeline file that processes Items.
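For reference, scrapy startproject douban generates roughly this tree (the spider file itself is added by hand):

douban/
    scrapy.cfg                 # deploy configuration
    douban/
        __init__.py
        items.py               # field definitions
        middlewares.py         # custom middleware
        pipelines.py           # item pipelines
        settings.py            # project settings
        spiders/
            __init__.py
            douban_spider.py   # the crawler, added by hand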

Crawling Douban Group

The core content of posts in this Douban group is pictures, so the goal is to download them, saved separately for each post.

settings.py

Here the user-agent pool is defined, and the middleware and pipelines are enabled:

BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

MY_USER_AGENT = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    ]
DOWNLOADER_MIDDLEWARES = {
    # disable the built-in user-agent middleware so only the custom one runs
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'douban.middlewares.MyUserAgentMiddleware': 400,
}
COOKIES_ENABLED = True  # keep cookies so the login session persists
DOWNLOAD_DELAY = 1      # wait between requests to reduce the chance of a ban
ITEM_PIPELINES = {
   'douban.pipelines.DoubanPipeline': 1,
}
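Two details worth noting: mapping the built-in UserAgentMiddleware to None disables it, and 400 is the priority slot that middleware normally occupies in Scrapy's defaults, so the custom MyUserAgentMiddleware simply takes its place in the chain.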


items.py

Define the fields: post title, author, author homepage URL, and image URLs.

import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()            # post title
    author = scrapy.Field()           # post author
    author_homepage = scrapy.Field()  # author's homepage URL
    img_url = scrapy.Field()          # list of image URLs in the post
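A DoubanItem behaves like a dict restricted to the declared fields; for example:

item = DoubanItem()
item['title'] = ['A post title']  # fields hold whatever you assign, here a list as extract() returns
print(item['title'][0])           # -> A post title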


middlewares.py

Sets a random user-agent on each request:

import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class MyUserAgentMiddleware(UserAgentMiddleware):
    '''
    Set a random User-Agent on every request.
    '''

    def __init__(self, user_agent, ip):
        self.user_agent = user_agent
        self.ip = ip  # proxy list from PROXIES in settings; unused here

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            user_agent=crawler.settings.get('MY_USER_AGENT'),
            ip=crawler.settings.get('PROXIES')
        )

    def process_request(self, request, spider):
        # pick a random user-agent from the pool for each outgoing request
        agent = random.choice(self.user_agent)
        request.headers['User-Agent'] = agent
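The ip argument (read from a PROXIES setting) is passed in but never used. If you did want to rotate proxies, a minimal sketch could look like this, assuming PROXIES is a list of strings such as 'http://host:port' (this middleware is not part of the original project):

import random


class MyProxyMiddleware(object):
    '''Hypothetical proxy rotator; assumes a PROXIES list in settings.'''

    def __init__(self, proxies):
        self.proxies = proxies or []

    @classmethod
    def from_crawler(cls, crawler):
        return cls(proxies=crawler.settings.get('PROXIES'))

    def process_request(self, request, spider):
        if self.proxies:
            # Scrapy's built-in HttpProxyMiddleware honors request.meta['proxy']
            request.meta['proxy'] = random.choice(self.proxies)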

spiders/douban_spider.py

The crawler code: it logs in first and then crawls; if a captcha appears, it downloads the captcha image so the code can be typed in by hand.

import urllib.request

import scrapy
from scrapy import Request, FormRequest

from douban.items import DoubanItem

class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']

    start_urls = []

    def start_requests(self):
        yield Request("https://www.douban.com/login", callback=self.parse, meta={"cookiejar":1})

    def parse(self, response):
        captcha = response.xpath('//img[@id="captcha_image"]/@src').extract()
        if len(captcha) > 0:
            print("captcha required")
            localpath = "E:/spider/douban/captchar.jpg"
            urllib.request.urlretrieve(captcha[0], filename=localpath)
            print("open the captcha image saved locally and type in the code")
            captcha_value = input()

            data = {
                "form_email": "*******@126.com",
                "form_password": "*******",
                "captcha-solution": str(captcha_value),
                "redir": "https://www.douban.com/group/haixiuzu/discussion?start=0"  # page to return to after login
            }
        else:
            print("no captcha this time")
            data = {
                "form_email": "[email protected]",
                "form_password": "8296926",
                # "redir": "https://www.douban.com/group/haixiuzu/discussion?start=0"  # page to return to after login
            }
        print("logging in...")
        yield FormRequest.from_response(response, meta={"cookiejar": response.meta["cookiejar"]}, formdata=data, callback=self.parse_redirect)

    def parse_redirect(self, response):
        print("logged in to Douban")

        # walk the group's discussion list, 25 posts per page
        baseurl = 'https://www.douban.com/group/haixiuzu/discussion?start='
        for i in range(0, 625, 25):
            pageUrl = baseurl + str(i)
            yield Request(url=pageUrl, callback=self.parse_process, dont_filter=True)

    def parse_process(self, response):
        # every link containing 'topic' on a list page points at a post
        items = response.xpath('//td//a/@href').extract()
        for item in items:
            if 'topic' in item:
                yield Request(url=item, callback=self.parse_img)

    def parse_img(self, response):
        img = DoubanItem()
        title = response.xpath('//title//text()').extract()
        img['title'] = title
        author = response.xpath('//div[@class="topic-doc"]//h3//a//text()').extract()
        img['author'] = author
        author_homepage = response.xpath('//div[@class="topic-doc"]//h3//a/@href').extract()
        img['author_homepage'] = author_homepage
        img_url = response.xpath('//div[@class="image-wrapper"]//img/@src').extract()
        img['img_url'] = img_url
        yield img
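With the settings, item, middleware, and spider in place, the crawl is started from the project root:

scrapy crawl douban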


pipelines.py

This is where the post information is saved. The main reason for not using Scrapy's built-in ImagesPipeline to save the pictures is that it is not flexible enough about where files end up.

import os
import urllib.request


class DoubanPipeline(object):
    def process_item(self, item, spider):
        author = item["author"][0]
        title = item["title"][0].replace('\n', '').strip()
        author_homepage = item["author_homepage"][0]
        # one directory per post, named after the post title
        dir = "E:/spider/douban/img/"
        if not os.path.exists(dir):
            os.mkdir(dir)
        author_dir = dir + title
        if not os.path.exists(author_dir):
            os.mkdir(author_dir)
        # write the author's name and homepage to a text file
        info = open(author_dir + "/用户信息.txt", "w")
        info.write(author + '\n' + author_homepage)
        info.close()
        # download every image in the post, numbered sequentially
        count = 1
        for url in item["img_url"]:
            path = author_dir + "/" + str(count) + ".jpg"
            urllib.request.urlretrieve(url, filename=path)
            count += 1
        return item
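For comparison, here is a rough sketch of what the built-in route could look like, with file_path overridden to get a per-post directory; the class name and layout here are assumptions, not the original code:

import os

from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline


class DoubanImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # one download Request per image URL, carrying the post title along
        for url in item['img_url']:
            yield Request(url, meta={'title': item['title'][0].strip()})

    def file_path(self, request, response=None, info=None):
        # store as <post title>/<original file name> under IMAGES_STORE
        return os.path.join(request.meta['title'], os.path.basename(request.url))

This would also need IMAGES_STORE = 'E:/spider/douban/img/' in settings.py and the class registered in ITEM_PIPELINES.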

Crawler results

Problems encountered

The main problem was getting banned for crawling too frequently. Logging in to Douban was supposed to reduce the chance of a ban, but it did not help much. There are many suggested fixes online, but the essentials are still forging user-agents and using proxies. I also crawled some free proxies and stored them in a database, but they were too slow, so I gave up on that.
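One more thing worth trying on top of DOWNLOAD_DELAY is Scrapy's built-in AutoThrottle, which adapts the delay to server latency; a minimal settings.py sketch (the values are illustrative):

# settings.py: adapt the crawl speed instead of using only a fixed delay
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1        # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10         # back off up to this long under load
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # fewer parallel hits on douban.com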
