Using Python Scrapy to crawl CSDN articles

Disclaimer: This is an original article by the blogger, released under the CC 4.0 BY-SA license. When reproducing it, please include the original source link and this statement.
Original link: https://blog.csdn.net/k331922164/article/details/87859069

First, the tools used in this article.

1. Sogou browser

2. PyCharm 2018.2.3

3. Scrapy 1.6.0

Second, configure Scrapy.

Set the following parameters in settings.py:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 0.5
ITEM_PIPELINES = {
   'scrapy.pipelines.images.ImagesPipeline': 300,
}
IMAGES_STORE = r'C:\WorkDir\ProPython\example\example\download'  # raw string keeps the Windows backslashes literal

Setting USER_AGENT disguises Scrapy as the Sogou browser. You can find your own browser's UA string on any UA-checking website.

Setting ROBOTSTXT_OBEY to False lets the crawler fetch any page, ignoring robots.txt.

DOWNLOAD_DELAY adds a delay between page downloads, which further disguises the crawler.

ITEM_PIPELINES enables the built-in image pipeline.

IMAGES_STORE is the directory where downloaded images are saved.
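For reference, the image pipeline reads each item's image_urls field and writes its download results back to an images field (the custom pipeline in the next section also adds image_paths). The spider below yields plain dicts, which works fine; if you prefer a declared Item, a minimal sketch for items.py could look like this (the class name ArticleImageItem is only an illustration):

import scrapy

class ArticleImageItem(scrapy.Item):
    image_urls = scrapy.Field()   # URLs for the image pipeline to download
    images = scrapy.Field()       # filled by the stock ImagesPipeline with download results
    image_paths = scrapy.Field()  # filled by the custom MyImagesPipeline below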

Third, write the crawler code.

Modify pipelines.py as follows:

from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request

class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # issue one download request per image URL collected by the spider
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        # keep the storage paths of the images that downloaded successfully
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Item contains no images')
        item['image_paths'] = image_paths
        return item
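Note that the ITEM_PIPELINES setting above still points at the stock ImagesPipeline, so MyImagesPipeline is never actually invoked as written. To route items through the custom pipeline, register it instead — a sketch, assuming the Scrapy project package is named example (as the IMAGES_STORE path suggests):

ITEM_PIPELINES = {
    # 'example' is the assumed project package name
    'example.pipelines.MyImagesPipeline': 300,
}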

Create csdn.py with the following code:

# -*- coding: utf-8 -*-
# run with: scrapy crawl csdn -o csdn.csv --nolog
import scrapy

class CsdnSpider(scrapy.Spider):
    name = "csdn"
    start_urls = ['https://blog.csdn.net/k331922164/article/details/86251743']

    def parse(self, response):
        item = {}
        item['image_urls'] = []
        title = ''
        content = ''
        # extract the article title
        for article in response.xpath('//*[@id="mainBox"]/main/div[1]/div/div/div[1]'):
            title = article.xpath('./h1/text()').extract_first()
        # extract the article body text from the <p> nodes
        for article in response.xpath('//*[@id="content_views"]'):
            contentlist = article.xpath('./p//text()').extract()
            content = ''.join(contentlist)
        yield {
            'title': title,
            'content': content,
        }
        # collect image URLs for the image pipeline to download
        for src in response.xpath('//div[@class="blog-content-box"]'):
            item['image_urls'] = src.xpath('.//img/@src').extract()
        yield item

        # follow the link to the previous article and parse it the same way
        for next_article in response.xpath('//li[@class="widescreen-hide"]'):
            next_url = next_article.xpath('./a/@href').extract_first()
            if next_url:
                yield scrapy.Request(next_url, callback=self.parse)
                break

Run scrapy crawl csdn -o csdn.csv --nolog to crawl every article on this blog.

The principle is as follows:

Open the most recent article first and use its URL as start_url.

On each page, take the link to the previous article as next_url.

Right-click a section of interest and choose "Inspect Element" to see the corresponding HTML source.

Right-click the node in the source view and copy its XPath; adjust it as needed for your case.
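A copied XPath can also be checked interactively with scrapy shell before it goes into the spider, for example:

scrapy shell "https://blog.csdn.net/k331922164/article/details/86251743"
>>> response.xpath('//*[@id="mainBox"]/main/div[1]/div/div/div[1]/h1/text()').extract_first()
>>> response.xpath('//*[@id="content_views"]/p//text()').extract()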

In the end, all the articles are saved to the CSV file.

The images are saved under the download/full directory (the ImagesPipeline stores files in a full/ subfolder of IMAGES_STORE, named by a hash of the image URL).

Fourth, other issues.

1. Missing-module errors may occur when you run Scrapy. Install Pillow (PIL), pywin32, and so on as needed.

2. The script cannot yet convert the crawled articles to PDF or Word; this could be improved.
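As one possible direction (not part of the original script), the title/content rows in csdn.csv could be written out as Word documents with the python-docx package. A rough sketch, assuming python-docx is installed and csdn.csv has title and content columns:

import csv
from docx import Document

with open('csdn.csv', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        if not row.get('title'):
            continue  # skip the image-only rows yielded by the spider
        doc = Document()
        doc.add_heading(row['title'], level=1)
        doc.add_paragraph(row['content'])
        doc.save(row['title'][:50] + '.docx')  # filenames may need sanitizing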
