[Crawler Development] Lessons from real projects, from development to deployment

Development process

1. Developing with Selenium IDE

Chrome extension

If you can reach the Chrome Web Store (from mainland China this usually requires a proxy or VPN), search for Selenium IDE there and install it.

Firefox add-on

If you cannot reach the Chrome Web Store, use Firefox instead: search for Selenium IDE on the Firefox add-ons site and install it.

Recording and exporting a script

After installation, open the extension and record your browser operations. Meaningless steps captured during recording can be deleted, which makes the exported script more concise.
When recording is finished, export the script and choose Python as the language to get a complete Python script.
The Python exported by Selenium IDE can be used as is, or streamlined further, for example by removing the pytest dependency.
Later you will find that when several crawlers run at the same time, Chrome consumes a lot of memory. At that point, consider dropping Selenium and the browser and calling the target website's API directly.
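
Dropping the browser usually means replaying the JSON request that the page itself makes. Below is a minimal sketch using only the standard library. It assumes a hypothetical endpoint https://example.com/api/items that takes a page query parameter and returns {"items": [...]}; the URL, the parameter, the headers, and the response shape are all placeholders to be replaced with what the browser's network panel shows for the real site.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint; replace it with the request seen in the browser's network panel.
API_URL = "https://example.com/api/items"


def build_request(page):
    """Build a GET request for one page of results."""
    query = urllib.parse.urlencode({"page": page})
    return urllib.request.Request(
        f"{API_URL}?{query}",
        headers={"User-Agent": "Mozilla/5.0", "Accept": "application/json"},
    )


def parse_items(body):
    """Extract the item list from a JSON response body (assumed shape)."""
    return json.loads(body).get("items", [])


def fetch_items(page, timeout=10):
    """Fetch one page over HTTP and return its items."""
    with urllib.request.urlopen(build_request(page), timeout=timeout) as resp:
        return parse_items(resp.read().decode("utf-8"))
```

Calling fetch_items(1) performs the network request; build_request and parse_items are kept separate so they can be checked without touching the network.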

2. Developing with Postman

Developing with Postman essentially means working against the site's API directly. Details to be added later.

3. Developing with Scrapy

To be added

4. Adding a proxy

Where to buy proxies

Sending a large number of requests to the target website will get your IP banned. The temporary fix is to wait a few minutes before visiting again; the permanent fix is to use proxy IPs.
SmartProxy is expensive, but really easy to use.
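
Once you have a proxy, routing requests through it takes only a few lines. Below is a minimal sketch using only the standard library. The proxy address and credentials are placeholders, not real SmartProxy values; use whatever host, port, and login your provider gives you.

```python
import urllib.request

# Hypothetical proxy address and credentials; use the values from your provider.
PROXY_URL = "http://user:password@proxy.example.com:8000"


def make_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS traffic through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_via_proxy(url, proxy_url=PROXY_URL, timeout=10):
    """Fetch a URL through the proxy and return the response body as bytes."""
    opener = make_proxy_opener(proxy_url)
    with opener.open(url, timeout=timeout) as resp:
        return resp.read()
```

If the crawler still runs through Selenium, Chrome can be pointed at a proxy with options.add_argument("--proxy-server=host:port") instead.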

Deployment process

The deployment environment is CentOS.

1. Deploying with the Pagoda panel (BT Panel)

Using Python in the Pagoda environment

CentOS ships with Python 2.7, but our development is based on Python 3. After installing Pagoda on CentOS, run the following command
btpython -V
and you will find that btpython is Python 3.8, so there is no need to install a Python version manager such as pyenv.

Installing Chrome on CentOS

  1. Add the yum repository
    Create a new file google-chrome.repo in the /etc/yum.repos.d/ directory with the following content
[google-chrome]
name=google-chrome
baseurl=http://dl.google.com/linux/chrome/rpm/stable/$basearch
enabled=1
gpgcheck=1
gpgkey=https://dl-ssl.google.com/linux/linux_signing_key.pub
  2. Install Chrome with yum
    yum -y install google-chrome-stable
    Google's official source may be unreachable from China, which makes the installation or later updates fail. In that case, install with the following parameter, which skips the GPG signature check:
    yum -y install google-chrome-stable --nogpgcheck
  3. Check the Chrome version and install the matching ChromeDriver
    google-chrome -v
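
Picking the matching ChromeDriver comes down to comparing version numbers. Below is a small sketch that extracts the major version from the command output. It assumes the output of google-chrome -v looks like "Google Chrome 101.0.4951.41" and that chromedriver --version prints a similar four-part number; the version shown is only an example.

```python
import re
import subprocess


def major_version(version_output):
    """Return the major version found in a version string, or None."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    return int(match.group(1)) if match else None


def versions_match(chrome_output, driver_output):
    """True when both outputs carry the same major version."""
    chrome = major_version(chrome_output)
    return chrome is not None and chrome == major_version(driver_output)


def command_output(argv):
    """Run a command such as ["google-chrome", "-v"] and return its stdout."""
    return subprocess.run(argv, capture_output=True, text=True, check=True).stdout
```

On the server this can be used as versions_match(command_output(["google-chrome", "-v"]), command_output(["chromedriver", "--version"])).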

Installing Firefox on CentOS

Firefox can be installed directly with yum
yum install firefox

Installing WebDrivers

  1. ChromeDriver
    Taobao mirror
    official link
  2. GeckoDriver (for Firefox)
    GitHub link
    GeckoDriver and Firefox version mapping table
  3. SafariDriver (for development on macOS)
    Safari ships with its own WebDriver, which does not need to be installed separately but must be enabled: in the Safari menu bar, choose Develop > Allow Remote Automation.

Place the downloaded WebDriver binaries in the /usr/local/bin folder so that they are on the PATH.
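
A quick way to confirm the drivers are where Selenium can find them is to look them up on the PATH. This is a minimal sketch; the default driver names are assumptions based on the list above.

```python
import shutil


def locate_drivers(names=("chromedriver", "geckodriver")):
    """Map each driver name to its full path on PATH, or None if it is missing."""
    return {name: shutil.which(name) for name in names}


def missing_drivers(names=("chromedriver", "geckodriver")):
    """Return the driver names that could not be found on PATH."""
    return [name for name, path in locate_drivers(names).items() if path is None]
```

An empty list from missing_drivers() means both drivers were found.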

Set up a headless browser

Headless mode is not used during development, but it is recommended once the crawler is deployed to a server. Sample code for headless mode:

from selenium import webdriver

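# For Firefox, use webdriver.FirefoxOptions() with webdriver.Firefox() instead
options = webdriver.ChromeOptions()
options.add_argument("--headless")    # run without a visible window
options.add_argument("--no-sandbox")  # typically required when Chrome runs as root
driver = webdriver.Chrome(options=options)
driver.get("https://www.qq.com")
driver.get_screenshot_as_file("test.png")  # screenshot confirms the page rendered
driver.quit()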

2. Monitoring with supervisor

Supervisor is used because crawler processes exit unexpectedly for all sorts of reasons. Supervisor restarts a process automatically when it exits.

Installing supervisor

  1. With Pagoda
    Use Pagoda's supervisor plug-in; installation is one-click.
  2. Without Pagoda
    pip install supervisor
  3. On macOS
    brew install supervisor
    After installing supervisor on a Mac, you can open
    http://localhost:9001
    in a browser to manage supervisor programs visually (this requires the [inet_http_server] section to be enabled in the supervisor configuration).

Starting supervisor from the command line

Pagoda's graphical interface makes supervisor convenient to use, but once the number of monitored processes reaches double digits, starting each one by hand is time-consuming. The command line is better for starting and stopping them in bulk.
The supervisor commands are as follows

supervisorctl reload
supervisorctl stop all

Under Pagoda, supervisorctl is installed at /www/server/panel/pyenv/bin/supervisorctl,
so the commands above become

# Check the version
/www/server/panel/pyenv/bin/supervisord -v
# Restart supervisord
/www/server/panel/pyenv/bin/supervisorctl reload
# Stop all processes
/www/server/panel/pyenv/bin/supervisorctl stop all
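
When there are many programs, the same commands can be driven from a small Python script. This is only a sketch: it assumes the Pagoda path given above, and program names such as crawler_a and crawler_b are hypothetical.

```python
import subprocess

# Path taken from the text above; adjust it if supervisorctl lives elsewhere.
SUPERVISORCTL = "/www/server/panel/pyenv/bin/supervisorctl"


def build_command(action, names=("all",), binary=SUPERVISORCTL):
    """Build the argument list for `supervisorctl <action> <name> ...`."""
    if action not in {"start", "stop", "restart"}:
        raise ValueError(f"unsupported action: {action}")
    return [binary, action, *names]


def run(action, names=("all",)):
    """Run the command and return supervisorctl's text output."""
    result = subprocess.run(build_command(action, names), capture_output=True, text=True)
    return result.stdout
```

For example, run("restart", ["crawler_a", "crawler_b"]) restarts two programs in one call, and run("stop") stops everything.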

Monitoring and alerting with a supervisor event listener

After deployment we always want to know how the program is doing: how many times it exited abnormally, whether a crawler failed to start, and so on. A supervisor event listener can provide this monitoring and alerting.
An event listener is itself a child process of supervisor, so you first write a listener.py monitoring script and then have supervisor run it.
How to write listener.py will be covered in a separate article.
Once it is written, find supervisor's main configuration file and add the following

[eventlistener:listener]
command=btpython listener.py   
process_name=%(program_name)s 
numprocs=1                    
events=PROCESS_STATE                  
directory=/www/wwwroot/project 
autostart=true               
autorestart=unexpected  
user=root      
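
As a rough idea of what listener.py can look like, here is a minimal sketch of supervisor's event listener protocol: the listener prints READY, reads one header line and a payload of the announced length from stdin, then answers with a RESULT message. This is not the author's script. The handler log_unexpected_exit is a hypothetical example that only writes to stderr; a real alarm (mail, chat webhook) would go in its place.

```python
import sys


def parse_tokens(line):
    """Parse 'key:value key:value' text into a dict."""
    return dict(token.split(":", 1) for token in line.split())


def handle_one_event(stdin, stdout, on_event):
    """Process one event; return False when supervisor has closed the pipe."""
    stdout.write("READY\n")  # tell supervisor we can accept an event
    stdout.flush()
    line = stdin.readline()
    if not line:
        return False
    headers = parse_tokens(line)
    payload = parse_tokens(stdin.read(int(headers["len"])))
    on_event(headers, payload)
    stdout.write("RESULT 2\nOK")  # acknowledge so the next event can be sent
    stdout.flush()
    return True


def log_unexpected_exit(headers, payload):
    """Hypothetical handler: report unexpected exits on stderr (never stdout)."""
    if headers["eventname"] == "PROCESS_STATE_EXITED" and payload.get("expected") == "0":
        print(f"{payload['processname']} exited unexpectedly", file=sys.stderr)


def main():
    """Serve events until supervisor closes stdin."""
    while handle_one_event(sys.stdin, sys.stdout, log_unexpected_exit):
        pass
```

To use this as listener.py, call main() at the end of the file. Stdout is reserved for the protocol, so any logging has to go to stderr or a file.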

Using supervisor in the Pagoda environment

  1. Install the Python project manager plug-in, install Python 3.7.8, and set it as the default version
    pyenv global 3.7.8

Common problems

  1. Supervisor under Pagoda does not start properly?
    Check the error log in /www/server/panel/plugin/supervisor/log
  2. pytest is clearly installed, but Python reports "No module named pytest"?
    The server has several Python versions, and pytest was installed for one that is not the default. To fix it,
    use which pytest to find where the pytest command lives, then run pytest through the matching interpreter, for example /root/.pyenv/shims/python3 -m pytest
  3. The SSH host key changed after the remote server was reinstalled?
    Run ssh-keygen -R "ip" locally, replacing ip with the remote server's address
  4. How to view processes on Linux
    https://www.linuxprobe.com/linux-look-process.html
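
For problem 2 above, it helps to ask each interpreter directly whether it can see the module. This is a minimal sketch; save it under any name (check_module.py is a hypothetical one) and run it once with each Python on the server.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if this interpreter can import the named module."""
    return importlib.util.find_spec(name) is not None


def describe(name):
    """Report which interpreter is running and whether it sees the module."""
    status = "found" if module_available(name) else "missing"
    return f"{sys.executable}: {name} {status}"
```

Adding print(describe("pytest")) at the end and running the file with btpython and with /root/.pyenv/shims/python3 shows which interpreter actually has pytest installed.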

Customizing Pagoda

We customized Pagoda because the repetitive debugging and installation work during development was a headache, so we analyzed Pagoda's source code and modified it to fit our own needs. Most readers probably do not need this step, so I will not write a tutorial. If you do need it, contact me privately.

Origin blog.csdn.net/weixin_42553583/article/details/124474555