[Scrapy-01] An example: installation, project creation, spider creation, a simple crawl of the Baidu title, and an introduction to the workflow

When downloading Python, a relatively recent version usually comes with pip; this is what the pip official website says. We generally only need to update pip, and sometimes not even that, since a newly downloaded Python usually ships with the latest pip already. pip official website: https://pip.pypa.io/en/stable/installing/

1. The first step is to install Scrapy. We generally use pip, but we should upgrade pip itself first; see the pip documentation at https://pip.pypa.io/en/stable/installing/ for details. On the Windows platform, the upgrade command we use is shown below.
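Assuming python is on the PATH, the upgrade command is typically:

>python -m pip install --upgrade pip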

2. Then install Scrapy with the command below. Domestic network speeds are relatively slow, so timeout errors are easy to run into.
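The install itself is just:

>pip install scrapy

If a timeout does occur, retrying usually works, or you can point pip at a closer package index with its -i option.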

——Of course it is not all smooth sailing. On a new machine, you will generally be greeted by an error saying that the Microsoft Visual C++ Build Tools are missing.

No way around it: go to the URL given in the prompt, http://landinghub.visualstudio.com/visual-cpp-build-tools , and download and install the build tools.

One note: I had VS2013 installed before, so installing the build tools by the steps above was no problem. But later, after installing VS2015, installing the build tools this way produced a conflict, and the installer asked me to uninstall VS2015 first. I was stunned; aren't these both from the Microsoft family? Logically VS2015 already includes the build tools, but Microsoft split the two apart, and installing them separately makes them conflict, which is just baffling. Of course, the solution is not to uninstall VS2015 but to modify it: in the Modify dialog, add the general-purpose build-tools component under Visual C++ (I forget the exact name). It claims to be around 3 GB, but the installation is actually fairly quick.

- After installation, use the help command to check that it works:

>scrapy -h

——Our current development environment is:

Python 3.6.1
pip 9.0.1
Scrapy 1.4.0

3. It looks like everything is installed. Let's try it out with the following command:

>scrapy fetch http://www.baidu.com

An error was reported: apparently the win32api module is missing.

Let's install this module. My goodness, it is 9 MB+; if the download fails, just retry a few times:

>pip install pypiwin32

After installation, run it again:

>scrapy fetch http://www.baidu.com

No error this time: the page content comes back, and the entire execution process is printed as well.

If we only want the content and don't want the execution process (also called the log) printed, add the --nolog parameter:

>scrapy fetch http://www.baidu.com --nolog

4. Creating a project is very simple. cd into a folder of your choice and create it with the following command:

>scrapy startproject test1

——Then cd into the project and run the following command to benchmark the theoretical maximum crawl speed:

>cd test1

>scrapy bench

Once inside the project, we can create a concrete spider. There are several spider templates, which can be listed:

>scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

The full form of the spider-creation command is as follows: the template (-t) is optional, but a name and the domain to crawl must be given.

>scrapy genspider -t basic baidu_spider baidu.com
Created spider 'baidu_spider' using template 'basic' in module:
  test1.spiders.baidu_spider


5. To check whether a spider is set up correctly, use the check command as follows. If you see an OK, um, you can leave work early.

>scrapy check baidu_spider

----------------------------------------------------------------------
Ran 0 contracts in 0.000s

OK
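The "Ran 0 contracts" line is because the generated spider has no contracts yet; check runs the simple tests written in a spider's docstring. As a rough sketch (not code from this project), a contract could look like this:

import scrapy


class BaiduSpiderSpider(scrapy.Spider):
    name = 'baidu_spider'
    allowed_domains = ['baidu.com']
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        """Extract the page title.

        @url http://www.baidu.com/
        @returns items 0 1
        """
        # @url tells check which page to fetch; @returns expects 0 to 1 items.
        yield {'title': response.xpath('//title/text()').extract_first()}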

6. Another important command is to execute the crawler we created:

>scrapy crawl baidu_spider --nolog

——To see which spiders exist in the project, use the list command:

>scrapy list
baidu_spider

7. As for the IDE, use whatever you like, for example Sublime Text or PyCharm.

8. We quickly wrote a spider that scrapes the title of the Baidu homepage. The code is as follows:

- file items.py:

import scrapy

class Test1Item(scrapy.Item):
    title = scrapy.Field()

——Spider file baidu_spider.py:

# -*- coding: utf-8 -*-
import scrapy
from test1.items import Test1Item


class BaiduSpiderSpider(scrapy.Spider):
    name = 'baidu_spider'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        item = Test1Item()
        item['title'] = response.xpath('//title/text()').extract()
        print(item['title'])

- Let's execute it and find that nothing is printed:

>scrapy crawl baidu_spider --nolog

——Then let's run it with the log enabled and see what happened:

>scrapy crawl baidu_spider

We find a line of log output showing that sites like Baidu disallow crawling via robots.txt, and our crawler obeys the robots protocol by default:

2017-08-18 15:03:20 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET http://baidu.com/>

——If you must crawl anyway, that is, not obey the protocol, modify this in settings.py:

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

- Now execute it again, and you will see data in the output:

>scrapy crawl baidu_spider --nolog
['百度一下,你就知道']

9. If, instead of printing in the spider file, we want to hand the scraped data stored in the item object over to pipelines.py for processing, we use yield. Whatever is yielded is what gets sent to pipelines.py, as follows:

def parse(self, response):
    item = Test1Item()
    item['title'] = response.xpath('//title/text()').extract()
    yield item

——For example, we now pass the item to pipelines.py for processing. Look at the default class in that file: its process_item method takes three parameters, and the second one, item, is the object yielded from the spider file, with our data inside. Of course, we can print it out like this:

# -*- coding: utf-8 -*-

class Test1Pipeline(object):
    def process_item(self, item, spider):
        print(item['title'])
        return item

——We executed it and found no output. That is because a Scrapy project does not enable the item pipeline by default; we need to turn it on in settings.py by removing the comment. Also make sure the class listed there is the one we actually wrote. The number after it represents the order in which pipelines run (lower values run first); we will come back to this later:

ITEM_PIPELINES = {
   'test1.pipelines.Test1Pipeline': 300,
}
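Beyond printing, process_item is where storage logic usually lives. A minimal sketch (not this project's code) of a pipeline that appends each item to a JSON-lines file; the items.jl filename is an arbitrary choice:

import json


class JsonWriterPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for scrapy.Item objects as well as plain dicts
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

It would be enabled the same way, by adding something like 'test1.pipelines.JsonWriterPipeline': 400 to ITEM_PIPELINES.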

10. Pay attention to the naming rules here. It is best not to add a "Spider" suffix to the spider's name, because the generated spider class appends that suffix automatically, which is how we ended up with the BaiduSpiderSpider class above.
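For comparison, here is roughly what the basic template generates for a spider named just baidu (a sketch; the actual generated file may differ slightly). The template appends the Spider suffix to the class name on its own:

import scrapy


# Roughly the result of: scrapy genspider -t basic baidu baidu.com
class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        pass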

11. If you use a MySQL database, you also need to install a Python-MySQL connector package. Installing mysql-python (the MySQLdb driver) directly will error out here; replace it with the PyMySQL package instead.
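A minimal PyMySQL sketch, assuming a local MySQL server; the host, user, password and database values are placeholders:

import pymysql

# Optional: lets code written against MySQLdb run unchanged on top of PyMySQL.
pymysql.install_as_MySQLdb()

conn = pymysql.connect(host='localhost', user='root', password='your_password',
                       database='test1', charset='utf8mb4')
try:
    with conn.cursor() as cursor:
        cursor.execute('SELECT VERSION()')
        print(cursor.fetchone())
finally:
    conn.close()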

12. A prompt like "not allowed to connect to this server" may appear when connecting to MySQL remotely. You can fix it by modifying the access permissions for the user. Don't forget to restart the MySQL service afterwards, otherwise the change will not take effect and you still won't be able to connect.

13. To fetch and save images and the like, another library, requests, is needed; if necessary, install it with pip install requests.
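A minimal sketch of downloading an image with requests; the URL and filename are placeholders:

import requests

# Fetch the image and write the raw bytes to disk.
resp = requests.get('http://example.com/picture.jpg', timeout=10)
resp.raise_for_status()
with open('picture.jpg', 'wb') as f:
    f.write(resp.content)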

14. For image processing there is the PIL library; when installing it with pip, the package name is not PIL but Pillow, so you need:

pip install Pillow

Then it is ready to use via from PIL import Image.
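A minimal Pillow sketch; the filenames are placeholders:

from PIL import Image

img = Image.open('picture.jpg')
print(img.size, img.format)    # e.g. (1920, 1080) JPEG
img.thumbnail((200, 200))      # shrink in place, keeping the aspect ratio
img.save('picture_thumb.jpg')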

15. If the project is deployed on a server and should run on a schedule: on Windows Server this can be done with the Task Scheduler, preferably driven by a .bat script; on a Linux server the same is achieved with crontab, optionally via a .sh script. If that feels like too much trouble, you can write the whole command directly in the crontab entry, for example:

0 */3 * * * cd /home/scrapyproject && /usr/local/bin/scrapy crawl xxx
