When you download Python, any reasonably new version comes with pip; that is what the pip official website says. Usually we only need to upgrade pip, and sometimes not even that, because the latest Python download usually bundles the latest pip. pip official website: https://pip.pypa.io/en/stable/installing/
1. The first thing is to install Scrapy. We generally use pip, but we should upgrade pip itself first; see the pip documentation at https://pip.pypa.io/en/stable/installing/ for details. On the Windows platform the upgrade command we use is:
>python -m pip install --upgrade pip
2. Then install Scrapy with the following command. On a slow domestic connection the download is prone to timeout errors:
>pip install scrapy
——Of course, it is not all smooth sailing. On a new computer you will generally be prompted with this error.
No way around it: go to the URL in the prompt, http://landinghub.visualstudio.com/visual-cpp-build-tools, and download and install the build tools.
Note: I had VS2013 installed originally, and installing the build tools following the steps above worked fine. But after I later installed VS2015, installing the build tools in response to the error conflicted with it: the installer asked me to uninstall VS2015 first. I was stunned; aren't these both from the Microsoft family? Logically VS2015 includes the build tools, but Microsoft abruptly split the two apart, and installing them separately side by side conflicts, which is maddening. The solution, of course, is not to uninstall VS2015 but to modify it: in the VS2015 installer, add the general-purpose build tools component under Visual C++ (I forget the exact name). It reports a size of about 3 GB, but the installation is actually fairly quick.
- After installation, use the help command to check that it works:
>scrapy -h
——Currently our development environment is mainly:
Python 3.6.1
pip 9.0.1
Scrapy 1.4.0
3. It seems everything is installed. Let's try it out with the following command:
>scrapy fetch http://www.baidu.com
An error is reported; it seems that a win32api module is missing:
Let's install this module. My dear, it is 9 MB+; in short, it may take a few tries:
>pip install pypiwin32
After installation, run it again:
>scrapy fetch http://www.baidu.com
This time there is no error; not only is the content returned, the entire execution process is printed as well:
If we only want to see the content and not the execution process (also called the log), add the --nolog parameter:
>scrapy fetch http://www.baidu.com --nolog
4. Creating a project is very simple: cd into a folder of your choice and create it with the following command:
>scrapy startproject test1
——We cd into the project and execute the following command to test the theoretical maximum speed of the crawler:
>cd test1
>scrapy bench
Once inside the project we can create a concrete spider. There are several spider templates, which can be listed:
>scrapy genspider -l
Available templates:
basic
crawl
csvfeed
xmlfeed
The full form of the spider-creation command is as follows: you can specify a template, and you must specify a name and the domain to be crawled.
>scrapy genspider -t basic baidu_spider baidu.com
Created spider 'baidu_spider' using template 'basic' in module:
test1.spiders.baidu_spider
5. To check whether a spider runs correctly, use the check command as follows. If you see an OK, well, you can knock off work early.
>scrapy check baidu_spider
----------------------------------------------------------------------
Ran 0 contracts in 0.000s
OK
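The "0 contracts" in that output refers to Scrapy's contracts: annotations like @url and @returns placed in a callback's docstring, which scrapy check executes. A minimal sketch of the docstring syntax (a plain class here, just for illustration; the annotation values are examples):

```python
# Sketch of Scrapy contract annotations. `scrapy check` reads the docstring:
# @url gives a URL to fetch, @returns asserts the callback yields 1..1 items.
class BaiduSpiderSketch:
    def parse(self, response):
        """
        @url http://www.baidu.com
        @returns items 1 1
        """

# Until `scrapy check` runs them, the annotations are ordinary docstring text:
print('@returns' in BaiduSpiderSketch.parse.__doc__)  # True
```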
6. Another important command executes the spider we created:
>scrapy crawl baidu_spider --nolog
——To see which spiders are in the project, use the list command:
>scrapy list
baidu_spider
7. As for an IDE, use whatever you like; Sublime Text or PyCharm both work.
8. We threw together a spider that scrapes the title of the Baidu homepage. The content is as follows:
- File items.py:
import scrapy

class Test1Item(scrapy.Item):
    title = scrapy.Field()
- Spider file baidu_spider.py:
# -*- coding: utf-8 -*-
import scrapy
from test1.items import Test1Item

class BaiduSpiderSpider(scrapy.Spider):
    name = 'baidu_spider'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        item = Test1Item()
        item['title'] = response.xpath('//title/text()').extract()
        print(item['title'])
- Let's execute it and find that nothing is printed:
>scrapy crawl baidu_spider --nolog
——Then let's look at the log output to see what happened:
>scrapy crawl baidu_spider
We find this line in the log, which shows that sites like Baidu disallow crawlers, and our spider obeys the robots protocol by default:
2017-08-18 15:03:20 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET http://baidu.com/>
——If you must crawl anyway, that is, not abide by the protocol, modify this in settings.py:
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
- Now execute it, and you will find that there is data output:
>scrapy crawl baidu_spider --nolog
['百度一下,你就知道']
9. If, instead of print-ing the data in the spider file, we want to hand the captured data, stored in the item object, to pipelines.py for processing, we can do it with yield. Whatever is yield-ed is what gets sent to pipelines.py, as follows:
def parse(self, response):
    item = Test1Item()
    item['title'] = response.xpath('//title/text()').extract()
    yield item
——Now that the item is passed to pipelines.py for processing, let's look at the default class in that file. Its process_item method has three parameters; the second one, item, is the object yield-ed from the spider file, and our data is inside it. We can of course print it out as follows:
# -*- coding: utf-8 -*-

class Test1Pipeline(object):
    def process_item(self, item, spider):
        print(item['title'])
        return item
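Printing is just the simplest thing a pipeline can do; more typically process_item persists the data. A minimal sketch of a pipeline that appends each title to a text file (the class name and file name are hypothetical, not part of the generated project):

```python
class TitleFilePipeline(object):
    """Hypothetical pipeline: append each item's title to titles.txt."""

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.f = open('titles.txt', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # item['title'] is a list (extract() returns a list), so join it.
        self.f.write(''.join(item['title']) + '\n')
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.f.close()
```

Like Test1Pipeline, it only runs if registered in ITEM_PIPELINES in settings.py.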
——We executed it and found no output. That is because a Scrapy project does not enable pipelines.py by default; we still need to turn it on in settings.py by removing the comment. Of course, check that the class registered there is the one written in our code. The number behind it represents the order, which we will discuss later:
ITEM_PIPELINES = {
'test1.pipelines.Test1Pipeline': 300,
}
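When several pipelines are registered, that number controls the order: items flow through pipelines from the lowest value to the highest, conventionally within 0-1000. For example (CleanPipeline is a hypothetical name, just to show the ordering):

```python
# Items pass through pipelines in ascending order of the number:
ITEM_PIPELINES = {
    'test1.pipelines.CleanPipeline': 200,   # runs first (hypothetical class)
    'test1.pipelines.Test1Pipeline': 300,   # runs second
}
```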
10. Note the naming rules here: it is best not to add a Spider suffix to the spider's name, because, as seen above, the generated spider class appends that suffix automatically (hence BaiduSpiderSpider).
11. If you use a MySQL database, you also need to install a package connecting Python and MySQL. Installing mysql-python (the MySQLdb package) directly fails with an error here, so replace it with the PyMySQL package, which everyone uses.
12. When connecting to MySQL remotely, a prompt such as not allowed to connect to this server may appear. Modify the access permissions, and don't forget to restart the MySQL service afterwards, otherwise the change will not take effect and you still will not be able to connect.
13. To analyze and extract images and so on, another library, requests, is needed, so install it if necessary: pip install requests.
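The typical pattern for downloading a resource with requests is a GET followed by writing resp.content in binary mode. A sketch using the Baidu homepage from the examples above (swap in whatever image URL your spider extracted; the file name is arbitrary):

```python
import requests

# Fetch a resource and save the raw bytes; images work exactly the same way.
resp = requests.get('http://www.baidu.com', timeout=10)
with open('page.bin', 'wb') as f:
    f.write(resp.content)
print(resp.status_code)
```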
14. For image processing there is the PIL library, but when installing with pip the package name is not PIL but Pillow, so you need:
pip install Pillow
Then you can use it with from PIL import Image.
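A quick smoke test that the import works: create a small image, save it, and reopen it (the file name is arbitrary):

```python
from PIL import Image

# Create a 64x48 solid-red image, save it as PNG, reopen and check its size.
img = Image.new('RGB', (64, 48), color=(255, 0, 0))
img.save('smoke_test.png')

reopened = Image.open('smoke_test.png')
print(reopened.size)  # (64, 48)
```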
15. If we deploy on a server and want the crawler to run on a schedule: on Windows Server this can be done through a task plan (Task Scheduler), ideally via a bat script. On a Linux server the same thing is achieved with crontab, where .sh scripts can also be used; but if that seems like too much trouble, you can write the whole command directly in the crontab entry, for example:
0 */3 * * * cd /home/scrapyproject && /usr/local/bin/scrapy crawl xxx
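If you prefer the .sh route, a wrapper script keeps the crontab line short. A sketch using the same paths and spider name as the entry above (the log path is an assumption):

```shell
#!/bin/sh
# run_spider.sh - hypothetical wrapper for the crontab entry above.
cd /home/scrapyproject || exit 1
/usr/local/bin/scrapy crawl xxx >> /home/scrapyproject/spider.log 2>&1
```

The crontab entry then becomes: 0 */3 * * * /home/scrapyproject/run_spider.sh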