1. Configure the environment
The default Python on my Alibaba Cloud server is 2.7.5, so I installed a fresh 3.6.4 environment with pyenv. After installation, run pyenv global 3.6.4 to switch to 3.6.4. I personally like this approach because the environments don't interfere with each other.
Next, following Dacai's article, pip install gerapy is all it takes; there is no problem with this step. If you have questions, you can open an issue for Dacai.
2. Start the service
First, go to the Alibaba Cloud console and set up a security group; mine opens ports 8000 and 6800.
Then, in a terminal on the server, open ports 8000 and 6800 in the firewall as well.
Then execute
gerapy init
cd gerapy
gerapy migrate
# Pay attention to the next step
gerapy runserver 0.0.0.0:8000
# Locally, plain `gerapy runserver` is enough; on Alibaba Cloud,
# bind 0.0.0.0:8000 as above so the server is reachable from outside.
Now visit http://&lt;server-ip&gt;:8000 in the browser and you should see the main interface.
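If the page doesn't load, it helps to first check whether the port is reachable at all. A small stdlib-only sketch (the host and port values below are placeholders for your own server):

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder values: substitute your server's public IP and port 8000
reachable = port_open("127.0.0.1", 8000)
```

If this returns False while the service is running, the problem is usually the security group, the firewall, or a 127.0.0.1 bind address.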
For what each part of the interface means, see Dacai's article.
3. Create a project
Create a new Scrapy crawler in the projects directory under gerapy; here I build the simplest possible one:
scrapy startproject gerapy_test
cd gerapy_test
scrapy genspider baidu www.baidu.com
This is the simplest possible crawler: set ROBOTSTXT_OBEY = False in settings.py, then optionally modify baidu.py under spiders/; all I did here was output the response.url that comes back.
4. Install scrapyd
pip install scrapyd
After installation, run:
scrapyd
Then open ip:6800 in the browser. If you haven't modified the configuration, the page will not open from outside, and the client configured in Gerapy will also show an error status.
Later I found the reason: scrapyd also binds to 127.0.0.1 by default.
So at this point we need to change the configuration. I modified it like this:
vim ~/.scrapyd.conf
[scrapyd]
bind_address = 0.0.0.0
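scrapyd reads INI-style config files, so a quick stdlib sanity check that the file parses and contains the bind address you intend (the content below mirrors the ~/.scrapyd.conf written above):

```python
import configparser

# Same content as the ~/.scrapyd.conf written above
conf_text = """\
[scrapyd]
bind_address = 0.0.0.0
"""

parser = configparser.ConfigParser()
parser.read_string(conf_text)
bind_address = parser["scrapyd"]["bind_address"]
print(bind_address)  # → 0.0.0.0
```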
Restart scrapyd and refresh, and the previous error becomes normal.
5. Packaging, deployment, scheduling
These steps are covered in detail in Dacai's article. After packaging and deploying, go to the client's scheduling interface and click the run button to start the crawler.
You can see the output result.
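Scheduling ultimately goes through scrapyd's HTTP API (the schedule.json endpoint). As a rough sketch, the equivalent request can be built with the stdlib alone; the host below is a placeholder and the request is only constructed, not sent:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder host: replace with your server's public IP
SCRAPYD_URL = "http://127.0.0.1:6800"

def schedule_request(project, spider):
    """Build the POST request scrapyd's /schedule.json endpoint expects."""
    data = urlencode({"project": project, "spider": spider}).encode()
    return Request(f"{SCRAPYD_URL}/schedule.json", data=data, method="POST")

# The project and spider names from section 3
req = schedule_request("gerapy_test", "baidu")
```

Sending this request (e.g. with urllib.request.urlopen) would start the same run the button triggers.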
6. Conclusion
I recommend giving it a try; it is very convenient. I have only used it briefly here.