Not familiar with the Scrapy crawler framework? A simple walkthrough of using the framework to collect website data

Preface
The text and images in this article come from the internet and are for learning and communication purposes only; they are not for any commercial use. If there is any problem, please contact us for handling.

This article uses the Python crawler framework Scrapy to collect some data from a website.


Basic development environment
Python 3.6
PyCharm

How to install Scrapy

Scrapy can be installed from the cmd command line with pip install scrapy, but this usually runs into network timeouts.

It is recommended to switch to a domestic mirror source instead: pip install -i <mirror address> <package name>

For example:

pip install -i https://mirrors.aliyun.com/pypi/simple/ scrapy

Commonly used domestic mirror addresses:

Tsinghua University: https://pypi.tuna.tsinghua.edu.cn/simple
Alibaba Cloud (Aliyun): http://mirrors.aliyun.com/pypi/simple/
University of Science and Technology of China: https://pypi.mirrors.ustc.edu.cn/simple/
Huazhong University of Science and Technology: http://pypi.hustunique.com/
Shandong University of Technology: http://pypi.sdutlinux.org/
Douban: http://pypi.douban.com/simple/

Possible errors:

While installing Scrapy you may run into errors such as a missing VC++ build environment. In that case you can download the offline (wheel) package for the module that failed to build and install it locally.
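A common case on Windows, for example, is the Twisted dependency failing to compile. A minimal sketch of the offline install, assuming Python 3.6 on 64-bit Windows and a Twisted wheel file downloaded in advance (the exact filename depends on the version you download):

pip install Twisted-20.3.0-cp36-cp36m-win_amd64.whl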


How Scrapy crawls website data

This article uses Douban Movie Top250 as an example to explain the basic process of crawling data with the Scrapy framework.


The Douban Top250 data does not need much analysis: it is a static website with a clean page structure, which makes it very suitable for practicing crawlers. That is why many basic crawler tutorials are based on Douban movie data or Maoyan movie data.

Scrapy's crawler project creation process
1. Create a crawler project

In PyCharm, select Terminal and, under Local, enter:

scrapy startproject + (project name <unique>)
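For this article the project is named douban (this matches the BOT_NAME in the settings file shown later), so the command is:

scrapy startproject douban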


2. cd into the crawler project directory
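For this project:

cd douban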


3. Create a crawler file

scrapy genspider (+ crawler file name <unique>) (+ domain name restriction)
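Here the crawler file is named douban_info and requests are restricted to douban.com, so the command is:

scrapy genspider douban_info douban.com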


This completes the creation of the Scrapy project and the crawler file.

Scrapy crawler code writing
1. Turn off the robots protocol in the settings.py file (the default is True)
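Concretely, this is the ROBOTSTXT_OBEY setting in settings.py (the full settings file is shown later in this article):

ROBOTSTXT_OBEY = False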


2. Modify the starting URL in the crawler file

start_urls = ['https://movie.douban.com/top250?filter=']

Change start_urls to the Douban Top250 URL, i.e. the URL of the first page of the data you want to crawl.

3. Write the business logic for parsing the data

The content to crawl is as follows:


douban_info.py

import scrapy

from ..items import DoubanItem


class DoubanInfoSpider(scrapy.Spider):
    name = 'douban_info'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/top250?start=0&filter=']

    def parse(self, response):
        # Each movie entry is an <li> inside the .grid_view list
        lis = response.css('.grid_view li')
        print(lis)
        for li in lis:
            title = li.css('.hd span:nth-child(1)::text').get()
            movie_info = li.css('.bd p::text').getall()
            info = ''.join(movie_info).strip()
            score = li.css('.rating_num::text').get()
            number = li.css('.star span:nth-child(4)::text').get()
            summary = li.css('.inq::text').get()
            print(title)
            # Hand each movie to the item pipeline
            yield DoubanItem(title=title, info=info, score=score, number=number, summary=summary)

        # Follow the "next page" link until there are no more pages
        href = response.css('#content .next a::attr(href)').get()
        if href:
            next_url = 'https://movie.douban.com/top250' + href
            yield scrapy.Request(url=next_url, callback=self.parse)

items.py

import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    info = scrapy.Field()
    score = scrapy.Field()
    number = scrapy.Field()
    summary = scrapy.Field()

middlewares.py

import faker
import requests


def get_cookies():
    """Get cookies by requesting the first page directly."""
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'}
    response = requests.get(url='https://movie.douban.com/top250?start=0&filter=',
                            headers=headers)
    return response.cookies.get_dict()


def get_proxies():
    """Get a proxy from a locally running proxy pool API."""
    proxy_data = requests.get(url='http://127.0.0.1:5000/get/').json()
    return proxy_data['proxy']


class HeadersDownloaderMiddleware:
    """Headers middleware: sets a random user-agent on every request."""

    def process_request(self, request, spider):
        fake = faker.Faker()
        # request.headers holds the request headers as a dict-like object
        request.headers.update(
            {
                'user-agent': fake.user_agent(),
            }
        )
        return None


class CookieDownloaderMiddleware:
    """Cookie middleware: attaches fresh cookies to every request."""

    def process_request(self, request, spider):
        # request.cookies is the dict of cookies sent with the request;
        # get_cookies() fetches a fresh cookie set
        request.cookies.update(get_cookies())
        return None


class ProxyDownloaderMiddleware:
    """Proxy middleware: routes the request through a proxy."""

    def process_request(self, request, spider):
        # Set the proxy on the request's meta dict
        request.meta['proxy'] = get_proxies()
        return None
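Note that get_proxies() assumes a proxy pool service is running locally at http://127.0.0.1:5000/get/. Also, the settings file below only enables HeadersDownloaderMiddleware; if you want the cookie and proxy middlewares active as well, they would need to be registered too. A sketch, with illustrative priority numbers:

DOWNLOADER_MIDDLEWARES = {
   'douban.middlewares.HeadersDownloaderMiddleware': 543,
   'douban.middlewares.CookieDownloaderMiddleware': 544,
   'douban.middlewares.ProxyDownloaderMiddleware': 545,
}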

pipelines.py

import csv


class DoubanPipeline:
    def __init__(self):
        # Open the CSV file once and write the header row
        self.file = open('douban.csv', mode='a', encoding='utf-8', newline='')
        self.csv_file = csv.DictWriter(self.file, fieldnames=['title', 'info', 'score', 'number', 'summary'])
        self.csv_file.writeheader()

    def process_item(self, item, spider):
        dit = dict(item)
        dit['info'] = dit['info'].replace('\n', "").strip()
        self.csv_file.writerow(dit)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider automatically when the spider finishes
        self.file.close()

settings.py

# Scrapy settings for douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html


BOT_NAME = 'douban'


SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'




# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban (+http://www.yourdomain.com)'


# Obey robots.txt rules
ROBOTSTXT_OBEY = False


# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32


# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16


# Disable cookies (enabled by default)
#COOKIES_ENABLED = False


# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False


# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}


# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#    'douban.middlewares.DoubanSpiderMiddleware': 543,
# }


# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'douban.middlewares.HeadersDownloaderMiddleware': 543,
}


# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}


# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'douban.pipelines.DoubanPipeline': 300,
}


# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False


# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

4. Run the crawler program


Enter the command scrapy crawl followed by the crawler file name.
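For this project:

scrapy crawl douban_info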



Source: blog.csdn.net/chinaherolts2008/article/details/112911997