Python's scrapy framework -----> makes us more powerful and saves us from writing a lot of repetitive crawler code

Table of contents

scrapy-framework

pipeline-item-shell

scrapy simulated login

scrapy download image

Download middleware

scrapy-framework

meaning:

Composition:

Running process:

1. The scrapy framework takes start_urls and constructs request objects from them.

2. The requests are sent to the scrapy engine, passing through the crawler (spider) middleware, and the engine hands them to the scheduler (a queue that stores the requests).

3. The scheduler then sends a request back to the engine.

4. The engine sends the request to the downloader, passing through the download middleware on the way.

5. The downloader accesses the Internet and gets back a response.

6. The downloader sends the obtained response to the engine, again passing through the download middleware.

7. The engine sends the response to the crawler, passing through the crawler middleware on the way.

8. The crawler extracts data from the response (for example, URLs). If it wants to send another request, it constructs a new request, sends it to the engine, and the cycle repeats. If not, it sends the extracted data to the engine, passing through the crawler middleware.

9. The engine sends the data to the pipeline.

10. The pipeline saves the data.
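To make the loop concrete, here is a minimal spider sketch (the spider name, URL, and selectors are placeholders, not part of the original example): parse() receives the response and can either yield data, which the engine passes to the pipeline, or yield a new request, which goes back through the engine to the scheduler.

import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['https://example.com/']  # the framework builds the first requests from these

    def parse(self, response):
        # yield data -> it travels engine -> pipeline and gets saved there
        yield {"title": response.css("title::text").get()}
        # yield a new request -> it travels engine -> scheduler -> downloader and comes back here
        next_url = response.css("a::attr(href)").get()
        if next_url:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)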

Let's first create a project from the cmd window

c:/d:/e: ---> switch drive

cd folder name -----> change into the folder

scrapy startproject project name --------> create project

scrapy genspider crawler file name domain name -------> create crawler file

 scrapy crawl crawler file name ------------> run crawler file
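For example, assuming we want a project called my_scrapy with a spider called douban for douban.com (all three names are just examples):

d:
cd my_projects
scrapy startproject my_scrapy
cd my_scrapy
scrapy genspider douban douban.com
scrapy crawl douban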

We can also create a start.py file to run the crawler file (create it at the top level of the project, the same level as scrapy.cfg).

Code to run the crawler file:

from scrapy import cmdline

# cmdline.execute("scrapy crawl baidu".split())
# cmdline.execute("scrapy crawl novel".split())
cmdline.execute("scrapy crawl shiping".split())

In general: from scrapy import cmdline, then

cmdline.execute(['scrapy','crawl','crawler file name']) : run the crawler file

Let me analyze the files inside the project.

Crawler name.py file

The scrapy framework provides some class attributes (name, allowed_domains, start_urls) whose values can be changed, but parse() must keep its name and must not be given extra parameters arbitrarily.
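For reference, the spider file that scrapy genspider generates looks roughly like this (the class name, name, domain, and URL depend on what you passed to the command):

import scrapy


class DoubanSpider(scrapy.Spider):
    name = 'douban'                      # the name used with "scrapy crawl"
    allowed_domains = ['douban.com']     # requests outside these domains are dropped
    start_urls = ['http://douban.com/']  # the first urls the framework requests

    def parse(self, response):           # keep the name and the (self, response) signature
        pass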

settings.py file

Find the ITEM_PIPELINES setting and uncomment it. The smaller the value assigned to a pipeline, the earlier it is executed. If you don't enable it, the data cannot be passed to the pipelines.py file, where it arrives as the item parameter of process_item() in the MyScrapyPipeline class.
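A sketch of the block to uncomment; my_scrapy is just the example project name, and 300 is the template's default priority:

ITEM_PIPELINES = {
    'my_scrapy.pipelines.MyScrapyPipeline': 300,
}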

Let me demonstrate:

import scrapy


class BaiduSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/review/best/']

    def parse(self, response):
        print(response.text)

result:

Instead of the page content, the console shows errors. When we click the first URL printed in the output, we are taken to the site's robots page: the crawler file is obeying the robots rule. The solution is as follows: find this line in the settings.py file:

ROBOTSTXT_OBEY = True

Change True to False, then run again.

result:

It can be seen that one error is gone, but there is still a 403 error; let's solve it below:

The solution to the 403 is to add a UA (User-Agent request header).

In settings.py, find the USER_AGENT setting, which by default is something like 'my_scrapy (+http://www.yourdomain.com)', and change it to a real browser request header:
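A sketch of the two settings after the change; any real browser User-Agent string will do here:

ROBOTSTXT_OBEY = False
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'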

result:

The page can now be accessed normally.

middlewares.py file (for adding request headers)

But you may feel this is too troublesome: if the request header has to change frequently, editing settings.py every time is painful. Think about it: if we could add the request header while the request is being sent, it would not be troublesome at all. How do we add it?

Consider whether a middleware can be used for this.

The middleware lives in the middlewares.py file of the scrapy project. When we open this file we see that it contains both the crawler middleware and the download middleware:

MyScrapyDownloaderMiddleware: the download (downloader) middleware
MyScrapySpiderMiddleware: the crawler (spider) middleware

So let me explain MyScrapyDownloaderMiddleware.

Its two most commonly used methods are process_request() and process_response(); let's start with process_request.

If we add a print() inside process_request() and run the spider, nothing is printed. Why? Because our middleware has not been enabled yet, so open settings.py and remove the comments from the DOWNLOADER_MIDDLEWARES setting.
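A sketch of the block to uncomment; my_scrapy is just the example project name, and 543 is the template's default priority:

DOWNLOADER_MIDDLEWARES = {
    'my_scrapy.middlewares.MyScrapyDownloaderMiddleware': 543,
}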

Once the middleware is enabled and the spider runs again, the print inside process_request() shows up.

Then let's try process_response() in the same way.

result:

It can be seen that the output from process_request comes before the output from process_response: the request passes through the middleware before the response comes back.

You might have thought of another situation: can process_request() or process_response() itself return a Request or a Response object?

Let's try it. A careful reader will find that the result is not what they expected.

Below is the relevant part of the downloader middleware's docstring; this is where the behaviour is defined.

Let me explain it:

process_request(request, spider)

- return None: the request continues to be processed; for example, if our middleware's process_request() returns None, the process_request() of the next downloader middleware (and eventually the downloader itself) will run.
- return a Request object: the request is not passed on any further; it goes back to the engine, and the engine hands it to the scheduler to be scheduled again.
- return a Response object: the request is not passed on to the downloader at all; the response goes back to the engine, and the engine sends it straight to the crawler file (skipping the download).
- raise IgnoreRequest: the process_exception() methods of the installed downloader middlewares will be called.

process_response(request, response, spider)

- return a Response object: scrapy continues to call the process_response() methods of the other middlewares, and the response eventually reaches the spider.
- return a Request object: the middleware chain stops, and the returned request is placed in the scheduler to be scheduled for download.
- raise IgnoreRequest: the request's errback function is called to handle it; if there is no errback, the request is ignored and not even logged.

Now you may ask: can I create a middleware of my own to add request headers? (in the middlewares.py file):

from scrapy import signals
import random


class UsertMiddleware:
    User_Agent = ["Mozilla/5.0 (compatible; MSIE 9.0; AOL 9.7; AOLBuild 4343.19; Windows NT 6.1; WOW64; Trident/5.0; FunWebProducts)",
                  "Mozilla/4.0 (compatible; MSIE 8.0; AOL 9.7; AOLBuild 4343.27; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"]

    def process_request(self, request, spider):
        # add a request header, picked at random from the list above
        print(dir(request))  # just to inspect what the request object offers
        request.headers["User-Agent"] = random.choice(self.User_Agent)
        # add a proxy ip (Scrapy's HttpProxyMiddleware reads request.meta["proxy"])
        # request.meta["proxy"] = "proxy ip"
        return None


class UafgfMiddleware:
    def process_response(self, request, response, spider):
        # check whether the request header was added
        print(request.headers["User-Agent"])
        return response

result:

It runs: the randomly chosen User-Agent is printed from process_response().
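These custom middlewares only take effect after they are enabled in settings.py; a sketch of the entry, assuming the project module is named my_scrapy (the priority numbers are arbitrary examples):

DOWNLOADER_MIDDLEWARES = {
    'my_scrapy.middlewares.UsertMiddleware': 543,
    'my_scrapy.middlewares.UafgfMiddleware': 544,
}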

pipelines.py file

process_item(self, item, spider)

item: receives the data yielded by the crawler file, for example a dictionary

Let's crawl Douban.

Practice crawling pictures of Douban movies

Crawler file.py:

import scrapy


class BaiduSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com','doubanio.com']
    start_urls = ['https://movie.douban.com/review/best/']
    a=1

    def parse(self, response):

        divs=response.xpath('//div[@id="content"]//div[@class="review-list chart "]//div[@class="main review-item"]')
        for div in divs:
            # print(div.extract)
            title=div.xpath('./a/img/@title')
            src=div.xpath('./a/img/@src')
            # print(title.extract_first())
            print(src.extract_first())
            yield {
                "title": title.extract_first(),
                "src": src.extract_first(),
                "type": "csv"
            }
            # send another request to download the image
            yield scrapy.Request(
                url=src.extract_first(),
                callback=self.parse_url,
                cb_kwargs={"imgg":title.extract_first()}
            )
        # method 1: take the next-page link from the page
        # next1=response.xpath(f'//div[@class="paginator"]//a[1]/@href').extract_first()
        # method 2: build the next-page URL ourselves
        next1="/review/best?start={}".format(20*self.a)
        self.a+=1



        url11='https://movie.douban.com'+next1
        yield scrapy.Request(url=url11,callback=self.parse)
        print(url11)

    def parse_url(self,response,imgg):
        # print(response.body)

        yield {
            "title":imgg,
            "ts":response.body,
            "type":"img"
        }

pipelines.py file:

import csv


class MyScrapyPipeline:
    def open_spider(self, spider):  # called when the spider is opened
        header = ["title", "src"]
        self.f = open("move.csv", "a", encoding="utf-8", newline="")  # newline="" keeps the csv module from writing blank lines on Windows
        self.wri_t = csv.DictWriter(self.f, header)
        self.wri_t.writeheader()

    def process_item(self, item, spider):  # called once for every item the spider yields
        if item.get("type")=="csv":
            item.pop("type")
            self.wri_t.writerow(item)
        if item.get("type")=="img":
            item.pop("type")
            with open("./图片/{}.png".format(item.get("title")),"wb")as f:
                f.write(item.get("ts"))
                print("{}.png下载完毕".format(item.get("title")))

        return item

    def close_spider(self,spider):
        self.f.close()

settings.py file:

Set the log level so that the console only outputs what you want to output.

All of the settings discussed above (robots, User-Agent, item pipeline, downloader middlewares) are enabled; see the sketch below.
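A sketch of the relevant part of settings.py, assuming the project module is named my_scrapy; the User-Agent string is just an example, and the DOWNLOADER_MIDDLEWARES block is the one shown earlier:

LOG_LEVEL = "WARNING"   # hide INFO/DEBUG log noise so only warnings and your own print() output appear
ROBOTSTXT_OBEY = False
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
ITEM_PIPELINES = {
    "my_scrapy.pipelines.MyScrapyPipeline": 300,
}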

Remember: if a request sent in the crawler file fails, its callback never yields an item, so the corresponding function in the pipelines.py file will not be called.

Ways to Pause and Resume the Crawler

Is there a way to pause and resume a crawler? If so, what is it?

Let me talk about it

scrapy crawl crawler file name -s JOBDIR=file path (any directory you choose)

Pressing Ctrl+C once pauses the crawler (press it only once; a second Ctrl+C forces an unclean shutdown). Running the same command again resumes the job.

When you try to resume, you may find that the download does not run again. Why? Because the requests we build ourselves are different from the one the framework builds.

Look at the signature of scrapy.Request: it has a dont_filter parameter that controls the duplicate filter. If it is False (the default), the request is filtered, i.e. the same url is only visited once; if it is True, the request is not filtered.

You might wonder why the default parse()/start_urls request can still be sent after resuming. The reason is clear from the framework's own start_requests(): it sends the start_urls with dont_filter=True. So if you don't want a request to be filtered, pass dont_filter=True yourself; and if you override start_requests() but do want filtering, keep dont_filter=False, as sketched below.
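A minimal sketch of an overridden start_requests(); the spider name and URL are placeholders. With dont_filter=True the start url behaves like the framework default (always re-sent); with dont_filter=False it goes through the duplicate filter, so it is skipped when the job is resumed:

import scrapy


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    start_urls = ['https://movie.douban.com/review/best/']

    def start_requests(self):
        for url in self.start_urls:
            # choose dont_filter to decide whether this request is deduplicated
            yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        print(response.url)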

 

scrapy simulated login

There are two methods:

● 1. Carry the cookies directly when requesting the page (semi-automatic: obtain the cookies with selenium or copy them manually)

● 2. Find a login interface and send a POST request so that the cookies get stored (send the account and password)

Below I use https://www.1905.com/vod/list/c_178/o3u1p1.html as a case.

The first method: request the page with cookies obtained from a manual login.

Crawler file code example 1 (add the cookies in the crawler file):

import scrapy


class A17kSpider(scrapy.Spider):
    name = '17k'
    allowed_domains = ['17k.com']
    start_urls = ['https://www.17k.com/']

    # override the framework's start_requests()
    def start_requests(self):
        cook="GUID=f0f80f5e-fb00-443f-a6be-38c6ce3d4c61; __bid_n=1883d51d69d6577cf44207; BAIDU_SSP_lcr=https://www.baidu.com/link?url=v-ynoaTMtiyBil1uTWfIiCbXMGVZKqm4MOt5_xZD0q7&wd=&eqid=da8d6ae20003f26f00000006647c3209; Hm_lvt_9793f42b498361373512340937deb2a0=1684655954,1684929837,1685860878; dfxafjs=js/dfxaf3-ef0075bd.js; FPTOKEN=zLc3s/mq2pguVT/CfivS7tOMcBA63ZrOyecsnTPMLcC/fBEIx0PuIlU5HgkDa8ETJkZYoDJOSFkTHaz1w8sSFlmsRLKFG8s+GO+kqSXuTBgG98q9LQ+EJfeSHMvwMcXHd+EzQzhAxj1L9EnJuEV2pN0w7jUCYmfORSbIqRtu5kruBMV58TagSkmIywEluK5JC6FnxCXUO0ErYyN/7awzxZqyqrFaOaVWZZbYUrhCFq0N8OQ1NMPDvUNvXNDjDOLM6AU9f+eHsXFeAaE9QunHk6DLbxOb8xHIDot4Pau4MNllrBv8cHFtm2U3PHX4f6HFkEpfZXB0yVrzbX1+oGoscbt+195MLZu478g3IFYqkrB8b42ILL4iPHtj6M/MUbPcxoD25cMZiDI1R0TSYNmRIA==|U8iJ37fGc7sL3FohNPBpgau0+kHrBi2OlH2bHfhFOPQ=|10|87db5f81d4152bd8bebb5007a0f3dbc3; c_channel=0; c_csc=web; accessToken=avatarUrl%3Dhttps%253A%252F%252Fcdn.static.17k.com%252Fuser%252Favatar%252F03%252F43%252F75%252F100257543.jpg-88x88%253Fv%253D1685860834000%26id%3D100257543%26nickname%3D%25E8%2580%2581%25E5%25A4%25A7%25E5%2592%258C%25E5%258F%258D%25E5%25AF%25B9%25E6%25B3%2595%25E7%259A%2584%25E5%258F%258D%26e%3D1701413546%26s%3Db67793dfa5cea859; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%22100257543%22%2C%22%24device_id%22%3A%221883d51d52d1790-08af8c489ac963-26031a51-1638720-1883d51d52eea0%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E8%87%AA%E7%84%B6%E6%90%9C%E7%B4%A2%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22https%3A%2F%2Fwww.baidu.com%2Flink%22%2C%22%24latest_referrer_host%22%3A%22www.baidu.com%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC%22%7D%2C%22first_id%22%3A%22f0f80f5e-fb00-443f-a6be-38c6ce3d4c61%22%7D; Hm_lpvt_9793f42b498361373512340937deb2a0=1685861547"
        yield scrapy.Request(
            url=self.start_urls[0],
            callback=self.parse,
            cookies={lis.split("=", 1)[0].strip(): lis.split("=", 1)[1] for lis in cook.split(";")}  # split on the first "=" only, since cookie values may themselves contain "="
        )

    def parse(self, response):
        # print(response.text)
        yield scrapy.Request(url="https://user.17k.com/www/",callback=self.parse_url)

    def parse_url(self,response):
        print(response.text)

result:

Crawler file code example 2 (add the cookies in the downloader middleware file):

class MyaddcookieMiddleware:
    def process_request(self, request, spider):
        cook = "GUID=f0f80f5e-fb00-443f-a6be-38c6ce3d4c61; __bid_n=1883d51d69d6577cf44207; BAIDU_SSP_lcr=https://www.baidu.com/link?url=v-ynoaTMtiyBil1uTWfIiCbXMGVZKqm4MOt5_xZD0q7&wd=&eqid=da8d6ae20003f26f00000006647c3209; Hm_lvt_9793f42b498361373512340937deb2a0=1684655954,1684929837,1685860878; dfxafjs=js/dfxaf3-ef0075bd.js; FPTOKEN=zLc3s/mq2pguVT/CfivS7tOMcBA63ZrOyecsnTPMLcC/fBEIx0PuIlU5HgkDa8ETJkZYoDJOSFkTHaz1w8sSFlmsRLKFG8s+GO+kqSXuTBgG98q9LQ+EJfeSHMvwMcXHd+EzQzhAxj1L9EnJuEV2pN0w7jUCYmfORSbIqRtu5kruBMV58TagSkmIywEluK5JC6FnxCXUO0ErYyN/7awzxZqyqrFaOaVWZZbYUrhCFq0N8OQ1NMPDvUNvXNDjDOLM6AU9f+eHsXFeAaE9QunHk6DLbxOb8xHIDot4Pau4MNllrBv8cHFtm2U3PHX4f6HFkEpfZXB0yVrzbX1+oGoscbt+195MLZu478g3IFYqkrB8b42ILL4iPHtj6M/MUbPcxoD25cMZiDI1R0TSYNmRIA==|U8iJ37fGc7sL3FohNPBpgau0+kHrBi2OlH2bHfhFOPQ=|10|87db5f81d4152bd8bebb5007a0f3dbc3; c_channel=0; c_csc=web; accessToken=avatarUrl%3Dhttps%253A%252F%252Fcdn.static.17k.com%252Fuser%252Favatar%252F03%252F43%252F75%252F100257543.jpg-88x88%253Fv%253D1685860834000%26id%3D100257543%26nickname%3D%25E8%2580%2581%25E5%25A4%25A7%25E5%2592%258C%25E5%258F%258D%25E5%25AF%25B9%25E6%25B3%2595%25E7%259A%2584%25E5%258F%258D%26e%3D1701413546%26s%3Db67793dfa5cea859; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%22100257543%22%2C%22%24device_id%22%3A%221883d51d52d1790-08af8c489ac963-26031a51-1638720-1883d51d52eea0%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E8%87%AA%E7%84%B6%E6%90%9C%E7%B4%A2%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22https%3A%2F%2Fwww.baidu.com%2Flink%22%2C%22%24latest_referrer_host%22%3A%22www.baidu.com%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC%22%7D%2C%22first_id%22%3A%22f0f80f5e-fb00-443f-a6be-38c6ce3d4c61%22%7D; Hm_lpvt_9793f42b498361373512340937deb2a0=1685861547"
        cookies = {lis.split("=", 1)[0].strip(): lis.split("=", 1)[1] for lis in cook.split(";")}  # split on the first "=" only, since cookie values may themselves contain "="
        request.cookies=cookies
        return None
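As before, this middleware only takes effect after it is enabled under DOWNLOADER_MIDDLEWARES in settings.py.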

Crawler file code example 3 (obtain the cookies with selenium inside the downloader middleware):

from selenium import webdriver
import time


def sele():
    # create a browser
    driver = webdriver.Chrome()
    # open the login page
    driver.get("https://user.17k.com/www/bookshelf/")
    print("You have 15 seconds to log in")
    time.sleep(15)
    print(driver.get_cookies())
    # turn selenium's cookie list into the {name: value} dict that scrapy expects
    cookies = {i.get("name"): i.get("value") for i in driver.get_cookies()}
    driver.quit()
    return cookies


class MyaddcookieMiddleware:

    def process_request(self, request, spider):
        # attach the cookies obtained from the selenium login to the outgoing request
        request.cookies = sele()
        return None

The second method: find an interface and send a POST request so that the cookies are stored.

Code 1:

import scrapy


class A17kSpider(scrapy.Spider):
    name = '17k'
    allowed_domains = ['17k.com']
    start_urls = ['https://www.17k.com/']

    # # override start_requests (method 1, kept commented out)
    # def start_requests(self):
    #     cook="GUID=f0f80f5e-fb00-443f-a6be-38c6ce3d4c61; __bid_n=1883d51d69d6577cf44207; BAIDU_SSP_lcr=https://www.baidu.com/link?url=v-ynoaTMtiyBil1uTWfIiCbXMGVZKqm4MOt5_xZD0q7&wd=&eqid=da8d6ae20003f26f00000006647c3209; Hm_lvt_9793f42b498361373512340937deb2a0=1684655954,1684929837,1685860878; dfxafjs=js/dfxaf3-ef0075bd.js; FPTOKEN=zLc3s/mq2pguVT/CfivS7tOMcBA63ZrOyecsnTPMLcC/fBEIx0PuIlU5HgkDa8ETJkZYoDJOSFkTHaz1w8sSFlmsRLKFG8s+GO+kqSXuTBgG98q9LQ+EJfeSHMvwMcXHd+EzQzhAxj1L9EnJuEV2pN0w7jUCYmfORSbIqRtu5kruBMV58TagSkmIywEluK5JC6FnxCXUO0ErYyN/7awzxZqyqrFaOaVWZZbYUrhCFq0N8OQ1NMPDvUNvXNDjDOLM6AU9f+eHsXFeAaE9QunHk6DLbxOb8xHIDot4Pau4MNllrBv8cHFtm2U3PHX4f6HFkEpfZXB0yVrzbX1+oGoscbt+195MLZu478g3IFYqkrB8b42ILL4iPHtj6M/MUbPcxoD25cMZiDI1R0TSYNmRIA==|U8iJ37fGc7sL3FohNPBpgau0+kHrBi2OlH2bHfhFOPQ=|10|87db5f81d4152bd8bebb5007a0f3dbc3; c_channel=0; c_csc=web; accessToken=avatarUrl%3Dhttps%253A%252F%252Fcdn.static.17k.com%252Fuser%252Favatar%252F03%252F43%252F75%252F100257543.jpg-88x88%253Fv%253D1685860834000%26id%3D100257543%26nickname%3D%25E8%2580%2581%25E5%25A4%25A7%25E5%2592%258C%25E5%258F%258D%25E5%25AF%25B9%25E6%25B3%2595%25E7%259A%2584%25E5%258F%258D%26e%3D1701413546%26s%3Db67793dfa5cea859; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%22100257543%22%2C%22%24device_id%22%3A%221883d51d52d1790-08af8c489ac963-26031a51-1638720-1883d51d52eea0%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E8%87%AA%E7%84%B6%E6%90%9C%E7%B4%A2%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22https%3A%2F%2Fwww.baidu.com%2Flink%22%2C%22%24latest_referrer_host%22%3A%22www.baidu.com%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC%22%7D%2C%22first_id%22%3A%22f0f80f5e-fb00-443f-a6be-38c6ce3d4c61%22%7D; Hm_lpvt_9793f42b498361373512340937deb2a0=1685861547"
    #     yield scrapy.Request(
    #         url=self.start_urls[0],
    #         callback=self.parse,
    #         cookies={lis.split("=")[0]:lis.split("=")[1] for lis in cook.split(";")}
    #     )
    #
    # def parse(self, response):
    #     # print(response.text)
    #     # yield scrapy.Request(url="https://user.17k.com/www/bookshelf/",callback=self.parse_url)
    #     pass
    # def parse_url(self,response):
    #
    #     # print(response.text)
    #     pass


    # send a POST login request
    def parse(self, response):
        data = {
            "loginName": "15278307585",
            "password": "wasd1234"
        }
        yield scrapy.FormRequest(
            url="https://passport.17k.com/ck/user/login",
            callback=self.parse_url,
            formdata=data
        )

        # applies when the page itself contains a form
        # yield scrapy.FormRequest.from_response(response, formdata=data, callback=self.parse_url)


    def parse_url(self, response):
        print(response.text)

Besides these approaches, you can also have the downloader middleware return the response object itself:

from scrapy import signals
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
import time
from scrapy.http.response.html import HtmlResponse
class MyaaacookieMiddleware:
    def process_request(self, request, spider):
        # create a browser
        driver = webdriver.Chrome()
        # open the page
        driver.get("https://juejin.cn/")
        driver.implicitly_wait(3)
        # scroll down with a JS statement so that more content gets loaded
        for i in range(3):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
            time.sleep(3)
        html = driver.page_source
        # returning an HtmlResponse here skips the downloader and hands this response straight to the spider
        return HtmlResponse(url=driver.current_url, body=html, request=request, encoding="utf-8")

That's all for the above.

Summary

The scrapy framework exists to save us from the large amount of repeated code that crawling a lot of data would otherwise require: it solves the problem with a small amount of code.


Origin blog.csdn.net/m0_69984273/article/details/130998544