Table of contents
scrapy-framework
pipeline-item-shell
scrapy simulated login
scrapy download image
Download middleware
scrapy-framework
meaning:
Composition:
Running process:
1. The Scrapy framework takes the URLs in start_urls and constructs Request objects from them.
2. Each request goes to the Scrapy engine, passing through the spider middleware; the engine hands it to the scheduler (a queue that stores requests).
3. The scheduler sends a request back to the engine.
4. The engine sends the request to the downloader, passing through the download middleware on the way.
5. The downloader fetches the page from the Internet and gets back a response.
6. The downloader sends the response to the engine, again passing through the download middleware.
7. The engine sends the response to the spider, passing through the spider middleware.
8. The spider extracts data from the response (URLs and so on). If it needs to follow up, it constructs a new request, sends it to the engine, and the cycle repeats; otherwise it sends the extracted data to the engine, through the spider middleware.
9. The engine sends the data to the pipeline.
10. The pipeline saves it.
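The loop above can be imitated in a few lines of plain Python (no Scrapy involved) to make the flow concrete. Everything here — the toy engine, the fake downloader, the example URLs — is invented purely for illustration:

```python
from collections import deque

def toy_engine(start_urls, downloader, parse):
    """Minimal imitation of Scrapy's engine/scheduler loop."""
    scheduler = deque(start_urls)       # the scheduler is just a queue of requests
    items = []
    while scheduler:
        request = scheduler.popleft()   # engine asks the scheduler for a request
        response = downloader(request)  # the downloader fetches the page
        for result in parse(response):  # the spider parses the response
            if isinstance(result, str): # a new URL: back to the scheduler
                scheduler.append(result)
            else:                       # a data item: on to the pipeline
                items.append(result)
    return items

# Fake downloader and spider callback standing in for the network and parse().
def fake_downloader(url):
    return {"url": url, "body": "page at " + url}

def fake_parse(response):
    if response["url"] == "http://example.com/1":
        yield "http://example.com/2"    # a follow-up request
    yield {"from": response["url"]}     # a scraped item

items = toy_engine(["http://example.com/1"], fake_downloader, fake_parse)
```

New requests discovered while parsing go back into the queue, so the loop runs until the scheduler is drained — the same reason a real Scrapy crawl keeps going as long as spiders keep yielding requests.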
Let's first create a project from the command line:
c:/d:/e: ---> switch drives
cd directory_name -----> change into the directory
scrapy startproject project_name --------> create a project
scrapy genspider spider_name domain_name -------> create a spider file
scrapy crawl spider_name ------------> run the spider file
We can also create a start.py file to run the spider file (create it at the top level of the project directory).
Where the file was created:
Code to run crawler file:
from scrapy import cmdline
# cmdline.execute("scrapy crawl baidu".split())
# cmdline.execute("scrapy crawl novel".split())
cmdline.execute("scrapy crawl shiping".split())
from scrapy import cmdline
cmdline.execute(['scrapy', 'crawl', 'spider_name']) : run the spider file
Let's analyze the files inside.
Crawler name.py file
You can see that the Scrapy framework provides some class attributes whose values can be changed, but def parse() must keep its name and parameters — you cannot rename it or change its signature at will.
settings.py file
Find the ITEM_PIPELINES setting and uncomment it. The smaller the value, the earlier that pipeline is executed. If you don't enable it, data cannot be passed to the pipelines.py file.
The data arrives as the item parameter of process_item() in the MyScrapyPipeline class.
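For reference, the uncommented block in settings.py looks roughly like this (the module path depends on your own project name; `my_scrapy` is just a placeholder):

```python
# settings.py — enable the pipeline so items reach pipelines.py.
# The number is the priority: the smaller the value, the earlier it runs.
ITEM_PIPELINES = {
    'my_scrapy.pipelines.MyScrapyPipeline': 300,
}
```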
Let me demonstrate,
import scrapy

class BaiduSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/review/best/']

    def parse(self, response):
        print(response.text)
result:
When we click on the first URL in the output, we jump to the following.
That is because the crawler obeys the site's robots.txt rules. The solution: find the following code in the settings.py file (ROBOTSTXT_OBEY):
Change True to False, then run.
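After the change, the line in settings.py should read:

```python
# settings.py — Scrapy obeys robots.txt by default; turn it off for this demo
ROBOTSTXT_OBEY = False
```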
result:
You can see that one error is gone.
But there is still an error; let's solve it below:
The solution to the 403 is to add a UA (User-Agent request header).
Find it as shown:
Change the default 'My_scrapy (+http://www.yourdomain.com)' value to a real browser request header:
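For example (this particular UA string is just a sample; any current browser User-Agent works):

```python
# settings.py — replace the default project UA with a browser one
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
```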
result:
Now the page can be accessed normally.
middlewares.py file (for adding request headers)
But some of you may find this too troublesome: if the request header has to change frequently, editing settings.py every time is painful. Think about it — if we could add the request header while the request is being sent, it would not be so troublesome. So how do we add it?
Think about whether middleware can be used here:
We find the middleware in the middlewares.py file of the Scrapy project.
When we open this file we see:
This one file contains both the spider middleware and the download middleware.
MyScrapyDownloaderMiddleware is the download middleware.
MyScrapySpiderMiddleware is the spider (crawler) middleware.
So let me explain MyScrapyDownloaderMiddleware.
Its two most commonly used methods are process_request and process_response; let's start with process_request.
Code screenshot:
When we print, we find nothing is printed. Why? Because our middleware has not been enabled. Open the settings.py file and uncomment the DOWNLOADER_MIDDLEWARES block.
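The uncommented block looks roughly like this (again, `my_scrapy` stands in for your own project name; the lower the number, the closer the middleware sits to the engine):

```python
# settings.py — enable the download middleware
DOWNLOADER_MIDDLEWARES = {
    'my_scrapy.middlewares.MyScrapyDownloaderMiddleware': 543,
}
```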
Code screenshot:
Once run successfully:
Then let's try process_response again
Code screenshot:
result:
You can see that process_request runs before process_response.
Perhaps you have wondered: can process_request itself return a Request or a Response?
Let's try.
Code screenshot:
result:
If you look carefully, the result is not what you might expect.
Below is the relevant part of the download middleware:
This is the problem.
Let me explain the following:
process_request(request, spider)
# - return None: processing continues. For example, if the douban middleware's process_request() returns None, the next download middleware's process_request() will still run.
# - return a Request object: it is not passed on. If the douban middleware's process_request() returns a Request object, the remaining process_request() methods are skipped; the request goes back to the engine, and the engine hands it to the scheduler (re-queued just like a new request).
# - return a Response object: it is not passed on either. The remaining process_request() methods and the downloader itself are skipped; the response goes back to the engine, and the engine sends it straight to the crawler file (skipping levels).
# - raise IgnoreRequest: the process_exception() methods of the installed downloader middlewares are called; if none of them handles the exception, the request is ignored.
process_response(request, response, spider)
# - return a Response object: it is passed on to the next middleware's process_response() and eventually to the crawler.
# - return a Request object: the middleware chain stops; the request is placed in the scheduler to be scheduled for download again.
# - raise IgnoreRequest: the request's errback is called.
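The process_request() rules can be sketched as a toy dispatcher in plain Python. All classes and middleware functions here are invented stand-ins, not Scrapy's real ones:

```python
class ToyRequest:
    def __init__(self, url):
        self.url = url

class ToyResponse:
    def __init__(self, url):
        self.url = url

def dispatch_request(request, middlewares):
    """Imitate how the engine treats process_request() return values."""
    for mw in middlewares:
        result = mw(request)
        if result is None:
            continue                      # pass on to the next middleware
        if isinstance(result, ToyRequest):
            return ("scheduler", result)  # back to the scheduler; never downloaded
        if isinstance(result, ToyResponse):
            return ("spider", result)     # straight to the spider; downloader skipped
    return ("downloader", request)        # nobody intervened: download normally

passthrough = lambda req: None
redirect = lambda req: ToyRequest("http://example.com/other")
shortcut = lambda req: ToyResponse(req.url)
```

Returning None lets the chain continue; any other return value short-circuits it, which is exactly why a middleware that returns a Response skips the downloader entirely.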
Some of you may now wonder: can I write my own middleware to add request headers? (In the middlewares.py file:)
from scrapy import signals
import random

class UsertMiddleware:
    User_Agent = [
        "Mozilla/5.0 (compatible; MSIE 9.0; AOL 9.7; AOLBuild 4343.19; Windows NT 6.1; WOW64; Trident/5.0; FunWebProducts)",
        "Mozilla/4.0 (compatible; MSIE 8.0; AOL 9.7; AOLBuild 4343.27; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
    ]

    def process_request(self, request, spider):
        # Add a request header
        print(dir(request))
        request.headers["User-Agent"] = random.choice(self.User_Agent)
        # Add a proxy IP (note: Scrapy reads request.meta["proxy"])
        # request.meta["proxy"] = "proxy ip"
        return None

class UafgfMiddleware:
    def process_response(self, request, response, spider):
        # Check whether the request header was added
        print(request.headers["User-Agent"])
        return response
Result: it runs.
pipelines.py file
process_item(self, item, spider)
item: receives the data returned by the crawler file, such as a dictionary.
Let's crawl Douban.
Practice: crawling pictures of Douban movies.
Crawler file.py:
import scrapy

class BaiduSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com', 'doubanio.com']
    start_urls = ['https://movie.douban.com/review/best/']
    a = 1

    def parse(self, response):
        divs = response.xpath('//div[@id="content"]//div[@class="review-list chart "]//div[@class="main review-item"]')
        for div in divs:
            # print(div.extract)
            title = div.xpath('./a/img/@title')
            src = div.xpath('./a/img/@src')
            # print(title.extract_first())
            print(src.extract_first())
            yield {
                "title": title.extract_first(),
                "src": src.extract_first(),
                "type": "csv"
            }
            # Send another request to download the image
            yield scrapy.Request(
                url=src.extract_first(),
                callback=self.parse_url,
                cb_kwargs={"imgg": title.extract_first()}
            )
        # First method: take the next-page link from the page
        # next1 = response.xpath('//div[@class="paginator"]//a[1]/@href').extract_first()
        # Second method: build the next-page URL yourself
        next1 = "/review/best?start={}".format(20 * self.a)
        self.a += 1
        url11 = 'https://movie.douban.com' + next1
        yield scrapy.Request(url=url11, callback=self.parse)
        print(url11)

    def parse_url(self, response, imgg):
        # print(response.body)
        yield {
            "title": imgg,
            "ts": response.body,
            "type": "img"
        }
pipelines.py file:
import csv

class MyScrapyPipeline:
    def open_spider(self, spider):  # called once when the spider starts
        header = ["title", "src"]
        self.f = open("move.csv", "a", encoding="utf-8", newline="")
        self.wri_t = csv.DictWriter(self.f, header)
        self.wri_t.writeheader()

    def process_item(self, item, spider):  # called once for every item passed in
        if item.get("type") == "csv":
            item.pop("type")
            self.wri_t.writerow(item)
        if item.get("type") == "img":
            item.pop("type")
            with open("./图片/{}.png".format(item.get("title")), "wb") as f:
                f.write(item.get("ts"))
            print("{}.png downloaded".format(item.get("title")))
        return item

    def close_spider(self, spider):
        self.f.close()
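The CSV half of the pipeline uses only the standard library, so you can try it on its own. Here io.StringIO stands in for the move.csv file, and the row values are made up:

```python
import csv
import io

# io.StringIO stands in for open("move.csv", "a", encoding="utf-8", newline="")
header = ["title", "src"]
buf = io.StringIO()
writer = csv.DictWriter(buf, header)
writer.writeheader()  # writes the "title,src" header row
writer.writerow({"title": "some movie", "src": "https://example.com/poster.jpg"})
lines = buf.getvalue().splitlines()
```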
settings.py file:
This setting restricts the log output to only what you want to see.
_____________________________________
All of the settings above are enabled.
Remember: if a request sent from the crawler file fails, the functions in the pipelines.py file will not be called for it.
Ways to Pause and Resume the Crawler
Is there a way to pause and resume a crawler? If so, what is it?
Let me explain:
scrapy crawl spider_name -s JOBDIR=file_path (any path you choose; the crawl state is saved there)
Press Ctrl+C once to pause the crawler.
When you try to resume, you may find the crawl does not continue.
Why? Because the method we wrote differs from the one the framework provides.
The signature of scrapy.Request is as follows:
dont_filter controls the duplicate filter: if it is False, the request is filtered (the same URL is only visited once); if it is True, it is not filtered.
You may wonder why parse() can still send requests; the result is as follows:
The result makes it clear: if you want a request not to be filtered, you have to change dont_filter.
If you want the requests from an overridden method (such as start_requests) to be filtered, pass dont_filter=False:
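What the duplicate filter does can be imitated with a simple set of seen URLs. This is a toy version for illustration, not Scrapy's real request fingerprinting:

```python
class ToyDupeFilter:
    """Rough imitation of the scheduler's duplicate filter."""
    def __init__(self):
        self.seen = set()

    def accept(self, url, dont_filter=False):
        if dont_filter:
            return True   # dont_filter=True: the request is never filtered out
        if url in self.seen:
            return False  # same URL already accepted once: dropped
        self.seen.add(url)
        return True

f = ToyDupeFilter()
first = f.accept("https://movie.douban.com/review/best/")
second = f.accept("https://movie.douban.com/review/best/")
forced = f.accept("https://movie.douban.com/review/best/", dont_filter=True)
```

The second attempt on the same URL is dropped, but passing dont_filter=True bypasses the check — which is why requests sent with dont_filter=True are always re-crawled after a resume.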
scrapy simulated login
There are two methods:
● 1. Directly carry cookies to request the page (semi-automatic: obtain the cookies with selenium, or copy them manually).
Using https://www.1905.com/vod/list/c_178/o3u1p1.html as a case.
The first method requests the page with cookies obtained from a manual login.
Crawler file code example 1 (add the cookie in the crawler file):
import scrapy

class A17kSpider(scrapy.Spider):
    name = '17k'
    allowed_domains = ['17k.com']
    start_urls = ['https://www.17k.com/']

    # Override start_requests to attach the cookies
    def start_requests(self):
cook="GUID=f0f80f5e-fb00-443f-a6be-38c6ce3d4c61; __bid_n=1883d51d69d6577cf44207; BAIDU_SSP_lcr=https://www.baidu.com/link?url=v-ynoaTMtiyBil1uTWfIiCbXMGVZKqm4MOt5_xZD0q7&wd=&eqid=da8d6ae20003f26f00000006647c3209; Hm_lvt_9793f42b498361373512340937deb2a0=1684655954,1684929837,1685860878; dfxafjs=js/dfxaf3-ef0075bd.js; FPTOKEN=zLc3s/mq2pguVT/CfivS7tOMcBA63ZrOyecsnTPMLcC/fBEIx0PuIlU5HgkDa8ETJkZYoDJOSFkTHaz1w8sSFlmsRLKFG8s+GO+kqSXuTBgG98q9LQ+EJfeSHMvwMcXHd+EzQzhAxj1L9EnJuEV2pN0w7jUCYmfORSbIqRtu5kruBMV58TagSkmIywEluK5JC6FnxCXUO0ErYyN/7awzxZqyqrFaOaVWZZbYUrhCFq0N8OQ1NMPDvUNvXNDjDOLM6AU9f+eHsXFeAaE9QunHk6DLbxOb8xHIDot4Pau4MNllrBv8cHFtm2U3PHX4f6HFkEpfZXB0yVrzbX1+oGoscbt+195MLZu478g3IFYqkrB8b42ILL4iPHtj6M/MUbPcxoD25cMZiDI1R0TSYNmRIA==|U8iJ37fGc7sL3FohNPBpgau0+kHrBi2OlH2bHfhFOPQ=|10|87db5f81d4152bd8bebb5007a0f3dbc3; c_channel=0; c_csc=web; accessToken=avatarUrl%3Dhttps%253A%252F%252Fcdn.static.17k.com%252Fuser%252Favatar%252F03%252F43%252F75%252F100257543.jpg-88x88%253Fv%253D1685860834000%26id%3D100257543%26nickname%3D%25E8%2580%2581%25E5%25A4%25A7%25E5%2592%258C%25E5%258F%258D%25E5%25AF%25B9%25E6%25B3%2595%25E7%259A%2584%25E5%258F%258D%26e%3D1701413546%26s%3Db67793dfa5cea859; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%22100257543%22%2C%22%24device_id%22%3A%221883d51d52d1790-08af8c489ac963-26031a51-1638720-1883d51d52eea0%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E8%87%AA%E7%84%B6%E6%90%9C%E7%B4%A2%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22https%3A%2F%2Fwww.baidu.com%2Flink%22%2C%22%24latest_referrer_host%22%3A%22www.baidu.com%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC%22%7D%2C%22first_id%22%3A%22f0f80f5e-fb00-443f-a6be-38c6ce3d4c61%22%7D; Hm_lpvt_9793f42b498361373512340937deb2a0=1685861547"
        yield scrapy.Request(
            url=self.start_urls[0],
            callback=self.parse,
            cookies={lis.strip().split("=", 1)[0]: lis.strip().split("=", 1)[1] for lis in cook.split(";")}
        )

    def parse(self, response):
        # print(response.text)
        yield scrapy.Request(url="https://user.17k.com/www/", callback=self.parse_url)

    def parse_url(self, response):
        print(response.text)
result:
Crawler file code example 2 (add the cookie in the download middleware):
class MyaddcookieMiddleware:
    def process_request(self, request, spider):
cook = "GUID=f0f80f5e-fb00-443f-a6be-38c6ce3d4c61; __bid_n=1883d51d69d6577cf44207; BAIDU_SSP_lcr=https://www.baidu.com/link?url=v-ynoaTMtiyBil1uTWfIiCbXMGVZKqm4MOt5_xZD0q7&wd=&eqid=da8d6ae20003f26f00000006647c3209; Hm_lvt_9793f42b498361373512340937deb2a0=1684655954,1684929837,1685860878; dfxafjs=js/dfxaf3-ef0075bd.js; FPTOKEN=zLc3s/mq2pguVT/CfivS7tOMcBA63ZrOyecsnTPMLcC/fBEIx0PuIlU5HgkDa8ETJkZYoDJOSFkTHaz1w8sSFlmsRLKFG8s+GO+kqSXuTBgG98q9LQ+EJfeSHMvwMcXHd+EzQzhAxj1L9EnJuEV2pN0w7jUCYmfORSbIqRtu5kruBMV58TagSkmIywEluK5JC6FnxCXUO0ErYyN/7awzxZqyqrFaOaVWZZbYUrhCFq0N8OQ1NMPDvUNvXNDjDOLM6AU9f+eHsXFeAaE9QunHk6DLbxOb8xHIDot4Pau4MNllrBv8cHFtm2U3PHX4f6HFkEpfZXB0yVrzbX1+oGoscbt+195MLZu478g3IFYqkrB8b42ILL4iPHtj6M/MUbPcxoD25cMZiDI1R0TSYNmRIA==|U8iJ37fGc7sL3FohNPBpgau0+kHrBi2OlH2bHfhFOPQ=|10|87db5f81d4152bd8bebb5007a0f3dbc3; c_channel=0; c_csc=web; accessToken=avatarUrl%3Dhttps%253A%252F%252Fcdn.static.17k.com%252Fuser%252Favatar%252F03%252F43%252F75%252F100257543.jpg-88x88%253Fv%253D1685860834000%26id%3D100257543%26nickname%3D%25E8%2580%2581%25E5%25A4%25A7%25E5%2592%258C%25E5%258F%258D%25E5%25AF%25B9%25E6%25B3%2595%25E7%259A%2584%25E5%258F%258D%26e%3D1701413546%26s%3Db67793dfa5cea859; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%22100257543%22%2C%22%24device_id%22%3A%221883d51d52d1790-08af8c489ac963-26031a51-1638720-1883d51d52eea0%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E8%87%AA%E7%84%B6%E6%90%9C%E7%B4%A2%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22https%3A%2F%2Fwww.baidu.com%2Flink%22%2C%22%24latest_referrer_host%22%3A%22www.baidu.com%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC%22%7D%2C%22first_id%22%3A%22f0f80f5e-fb00-443f-a6be-38c6ce3d4c61%22%7D; Hm_lpvt_9793f42b498361373512340937deb2a0=1685861547"
        cookies = {lis.strip().split("=", 1)[0]: lis.strip().split("=", 1)[1] for lis in cook.split(";")}
        request.cookies = cookies
        return None
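The cookie-string-to-dict conversion in both examples is plain Python and worth getting right: a raw Cookie header has a space after every ';' and a value may itself contain '=', so strip each pair and split only on the first '='. A standalone version (the sample string here is shortened and made up):

```python
def cookie_str_to_dict(cook):
    """Turn a raw Cookie header string into the dict scrapy.Request(cookies=...) expects."""
    cookies = {}
    for pair in cook.split(";"):
        # strip the space after each ';' and split only on the FIRST '='
        name, _, value = pair.strip().partition("=")
        cookies[name] = value
    return cookies

d = cookie_str_to_dict("GUID=f0f8; c_channel=0; token=a=b")
```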
Crawler file code example 3 (obtain the cookie with selenium in the download middleware):
from selenium import webdriver
import time

def sele():
    # Create a browser
    driver = webdriver.Chrome()
    # Open the login page
    driver.get("https://user.17k.com/www/bookshelf/")
    print("You have 15 seconds to log in")
    time.sleep(15)
    print(driver.get_cookies())
    # Return the cookies as a dict usable on a scrapy request
    return {i.get("name"): i.get("value") for i in driver.get_cookies()}

class MyaddcookieMiddleware:
    def process_request(self, request, spider):
        request.cookies = sele()
        return None
You can also find a login interface and send a POST request so the framework stores the cookies itself.
Code 1:
import scrapy

class A17kSpider(scrapy.Spider):
    name = '17k'
    allowed_domains = ['17k.com']
    start_urls = ['https://www.17k.com/']
    # Send a POST request to the login interface
    def parse(self, response):
        data = {
            "loginName": "15278307585",
            "password": "wasd1234"
        }
        yield scrapy.FormRequest(
            url="https://passport.17k.com/ck/user/login",
            callback=self.parse_url,
            formdata=data
        )
        # When the page itself contains a form, you can also use:
        # yield scrapy.FormRequest.from_response(response, formdata=data, callback=self.parse_url)

    def parse_url(self, response):
        print(response.text)
Besides these methods, the download middleware can also build and return the response object itself:
from scrapy import signals
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
import time
from scrapy.http.response.html import HtmlResponse

class MyaaacookieMiddleware:
    def process_request(self, request, spider):
        # Create a browser
        driver = webdriver.Chrome()
        # Open the page
        driver.get("https://juejin.cn/")
        driver.implicitly_wait(3)
        # Scroll to the bottom with a JS statement
        for i in range(3):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
            time.sleep(3)
        html = driver.page_source
        # Hand the rendered page straight back to the engine as the response
        return HtmlResponse(url=driver.current_url, body=html, request=request, encoding="utf-8")
That's all for the above.
Summarize
The Scrapy framework exists to spare us the large amount of repeated code we would otherwise write when crawling a lot of data; it solves the problem with a small amount of code.