How to deploy web crawlers with Scrapy + Gerapy

Click " Python crawler and data mining " above to follow

Reply to " Books " to receive a total of 10 e-books of Python from beginner to advanced

now

day

Chickens

soup

He stored 10,000 books in his belly, and refused to bow his head in the grass.

Preface

Hi everyone, nice to meet you again. I'm the Python Advancer. No more small talk, let's get straight to work!

Crawler management dashboard (screenshot)

Dependencies

File: requirements.txt

The content of the file is posted directly here:

appdirs==1.4.4
APScheduler==3.5.1
attrs==20.1.0
Automat==20.2.0
beautifulsoup4==4.9.1
certifi==2020.6.20
cffi==1.14.2
chardet==3.0.4
constantly==15.1.0
cryptography==3.0
cssselect==1.1.0
Django==1.11.29
django-apscheduler==0.3.0
django-cors-headers==3.2.0
djangorestframework==3.9.2
furl==2.1.0
gerapy==0.9.5
gevent==20.6.2
greenlet==0.4.16
hyperlink==20.0.1
idna==2.10
incremental==17.5.0
itemadapter==0.1.0
itemloaders==1.0.2
Jinja2==2.10.1
jmespath==0.10.0
lxml==4.5.2
MarkupSafe==1.1.1
orderedmultidict==1.0.1
parsel==1.6.0
Protego==0.1.16
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.20
PyDispatcher==2.0.5
pyee==7.0.2
PyHamcrest==2.0.2
pymongo==3.11.0
PyMySQL==0.10.0
pyOpenSSL==19.1.0
pyppeteer==0.2.2
pyquery==1.4.1
python-scrapyd-api==2.1.2
pytz==2020.1
pywin32==228
queuelib==1.5.0
redis==3.5.3
requests==2.24.0
Scrapy==1.8.0
scrapy-redis==0.6.8
scrapy-splash==0.7.2
scrapyd==1.2.1
scrapyd-client==1.1.0
service-identity==18.1.0
six==1.15.0
soupsieve==2.0.1
tqdm==4.48.2
Twisted==20.3.0
tzlocal==2.1
urllib3==1.25.10
w3lib==1.22.0
websocket==0.2.1
websockets==8.1
wincertstore==0.2
zope.event==4.4
zope.interface==5.1.0

Project files

Project file: qiushi.zip

Implemented functionality: a crawler for Qiushibaike (Embarrassing Encyclopedia) jokes.

This is a Scrapy project; its dependency packages are listed above.
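
The project source itself is not reproduced here, but for orientation, below is a minimal sketch of what a Qiushibaike "duanzi" spider might look like. The spider name matches the crawl command used in the next section; the start URL and CSS selectors are illustrative assumptions, not the actual project code.

# Minimal illustrative spider; the start URL and selectors are assumptions.
import scrapy


class DuanziSpider(scrapy.Spider):
    name = "duanzi"  # matches the `scrapy crawl duanzi` command below
    allowed_domains = ["qiushibaike.com"]
    start_urls = ["https://www.qiushibaike.com/text/"]

    def parse(self, response):
        # Iterate over each joke block on the page; selectors are placeholders.
        for block in response.css("div.article"):
            yield {
                "author": block.css("h2::text").get(default="").strip(),
                "content": " ".join(block.css("div.content span::text").getall()).strip(),
            }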

Steps to run the project

  • Unzip the project file and install the dependency packages: pip install -r requirements.txt

  • Execute the command: scrapy crawl duanzi --nolog

Configure Scrapyd

Scrapyd can be understood as a service for managing the Scrapy projects we write. Once it is configured, crawlers can be controlled with commands such as run, stop, and so on.

I won't go into its other features, since they are not used much here; all we need to do is start it.
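
To make "controlling the crawler through commands" concrete: once a project has been deployed to Scrapyd (covered in the next sections), everything goes through Scrapyd's HTTP JSON API. Below is a small sketch using the requests package from the dependency list, assuming the default address and the project/spider names used in this article:

import requests

SCRAPYD = "http://127.0.0.1:6800"

# Check that the Scrapyd service is running.
print(requests.get(SCRAPYD + "/daemonstatus.json").json())

# Schedule a run of the "duanzi" spider of the "qiushi" project; Scrapyd returns a job id.
job = requests.post(SCRAPYD + "/schedule.json",
                    data={"project": "qiushi", "spider": "duanzi"}).json()
print(job)

# Stop that run again via the cancel endpoint.
requests.post(SCRAPYD + "/cancel.json",
              data={"project": "qiushi", "job": job["jobid"]})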

Start Scrapyd service

  1. Switch to the qiushi crawler project directory; Scrapy commands must be executed inside the crawler project directory

  2. Execute the command: scrapyd

  3. Enter http://127.0.0.1:6800/ in the browser; if the page below appears, the service is running correctly

Package the Scrapy project and upload it to Scrapyd

The steps above only start Scrapyd; they do not deploy the Scrapy project to it. To deploy the project to Scrapyd, you need to configure the scrapy.cfg file inside the Scrapy project.

The configuration is as follows:
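
The original screenshot is not reproduced here, but a typical deploy section in scrapy.cfg, matching the deploy name (qb) and project name (qiushi) used in the packaging command below and the default Scrapyd address, looks roughly like this:

[deploy:qb]
url = http://127.0.0.1:6800/
project = qiushi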

Packaging command:

scrapyd-deploy <deploy-name> -p <project-name>

For this example, the command is:

scrapyd-deploy qb -p qiushi

As shown in the figure: if the output below appears, the deployment succeeded.

Note: problems may come up during this step; the solution is given later in this article!

Go back to the browser again and there will be one more project, qiushi. At this point, the Scrapyd configuration is complete.

Configure Gerapy

After the above configuration is complete, you can configure Gerapy. Scrapyd actually offers far more functionality than shown above, but it is all driven from the command line, which is not very friendly.

Gerapy is a visual crawler management framework. When using it, Scrapyd must be started and kept running in the background; in essence Gerapy still just sends requests to the Scrapyd service, only with a visual interface on top.

It is developed on top of Scrapy, Scrapyd, Scrapyd-Client, Scrapy-Redis, Scrapyd-API, Scrapy-Splash, Jinja2, Django, and Vue.js.
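
To illustrate the point that the essence is still sending requests to Scrapyd: the python-scrapyd-api package from the dependency list wraps the same JSON API that Gerapy drives for you. A rough sketch, assuming the address and names used in this article:

from scrapyd_api import ScrapydAPI

scrapyd = ScrapydAPI("http://127.0.0.1:6800")

print(scrapyd.list_projects())                  # e.g. ['qiushi'] once the project is deployed
job_id = scrapyd.schedule("qiushi", "duanzi")   # roughly what clicking "run" in Gerapy triggers
print(scrapyd.list_jobs("qiushi"))              # pending / running / finished jobs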

Configuration steps

Gerapy is independent of the Scrapy project, so you can pick any folder for it; here I created a gerapyDemo folder.

  1. Execute the command to initialize Gerapy

    gerapy init

This will generate a gerapy folder.

  2. Enter the generated gerapy folder

  3. Execute the command to generate the database tables

    gerapy migrate

  4. Start the Gerapy service; the default port is 8000, and you can also specify a port to start on

    gerapy runserver
    gerapy runserver 127.0.0.1:9000    (start on port 9000 of this machine)

  5. Open the browser and enter http://127.0.0.1:8000/; if the interface below appears, the startup succeeded

Of course, under normal circumstances you will first be met by a login page like this, so we need to create an account and password.

Stop the service, run the command gerapy createsuperuser, follow the prompts to create an account and password, and then log in with that account.

Add a crawler project in Gerapy

After all of the above is configured, we can configure the crawler project. With just a few more steps, you can run the crawler.

Click Host Management --> Create. The IP is the host where the Scrapyd service runs, and the port is the Scrapyd port (6800 by default). Fill them in and click Create.


Then, in the host list, click Schedule and you can run the crawler.

Run crawler

Get the results; they have been written to a local file.

Package and upload the crawler

In the process above we could only run crawlers that had already been deployed, which is not the whole story. Strictly speaking there is still a packaging step; only after the crawler has been packaged can Gerapy and the project be truly combined.

Steps

  1. First, copy the crawler project into the projects folder under the gerapy directory (e.g. gerapyDemo/gerapy/projects/)

  2. Refresh the page and click Project Management; you can see that both the Configurable and Packaged columns show an x status

Click Deploy, write a description, and click Package

Back in the main interface again, you can see that the project has now been packaged successfully.


At this point, the entire process is basically complete.

Fixing "scrapyd-deploy is not recognized as an internal or external command"

Normally, when scrapyd-deploy is executed on Windows, it may report that scrapyd-deploy is not recognized as an internal or external command. Um... this is a fairly common situation.

Resolution steps

  1. Create two new files, scrapy.bat and scrapyd-deploy.bat, in the Scripts folder under the Python interpreter's directory

The content of the two files is as follows

scrapy.bat

@echo off
D:\programFiles\miniconda3\envs\hy_spider\python D:\programFiles\miniconda3\envs\hy_spider\Scripts\scrapy %*

scrapyd-deploy.bat

@echo off
D:\programFiles\miniconda3\envs\hy_spider\python D:\programFiles\miniconda3\envs\hy_spider\Scripts\scrapyd-deploy %*

Note: the paths point to the location of the interpreter; adjust them to your own environment. Each file's content is a single line, even though it wraps onto two lines when pasted here, so keep the one-to-one correspondence.

Summary of the Gerapy workflow

1. gerapy init (initialize; creates a gerapy folder under the current directory)
2. cd gerapy
3. gerapy migrate
4. gerapy runserver (defaults to 127.0.0.1:8000)
5. gerapy createsuperuser (create an account and password; none exist by default)
6. Enter 127.0.0.1:8000 in the browser and log in with the account to reach the home page
7. Perform the various operations, such as adding hosts, packaging projects, scheduled tasks, etc.

Summary

The above is an introductory walkthrough of how to deploy crawlers visually with Gerapy + Scrapyd + Scrapy.

If you run into any problems while following along, remember to leave a message below; we will deal with it as soon as we see it.

If you found the article helpful, remember to like and leave a comment; thank you for reading. If you want to learn more about Python, you can check out the learning site: http://pdcfighting.com/.

------------------- End -------------------
