Install Python 3.7.9 on CentOS 7 and set up a Scrapy 2 environment

I don't really know crawler technology. When I need to grab page information, a quick script is enough for simple requirements, and for the complex ones I have just been brute-forcing my way through with chrome-mini. I went to a crawler technology meetup two days ago and found the field very interesting, so I came back to set up a Scrapy environment and plan to learn it properly.
Since I am a novice, I went through the setup step by step. I am recording it here in the hope that it helps other newcomers.

1. Install Python 3.7.9

CentOS 7 ships with Python 2.7.5. I tried to install Scrapy 2 directly under that version; the installation itself reported no errors, but running a program failed with a complaint that Python 2 is too old, so I had to install Python 3.
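
As a quick sanity check before going further, you can ask the interpreter itself whether it is new enough; Scrapy 2.x no longer supports Python 2 at all. A minimal sketch (I am assuming Python 3.5 as the floor here, so treat the exact number as an assumption):

    import sys

    # Scrapy 2.x dropped Python 2; assume at least Python 3.5 is needed
    if sys.version_info < (3, 5):
        raise SystemExit("Python %d.%d is too old for Scrapy 2, install Python 3 first"
                         % sys.version_info[:2])
    print("Interpreter looks new enough:", sys.version.split()[0])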

1. Install the development tools in one go; building software from source requires them
    yum -y groupinstall "Development tools"

2. Install the various dependency packages
    yum -y install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel libffi-devel

3. Download the Python 3.7.9 source package
    wget "https://www.python.org/ftp/python/3.7.9/Python-3.7.9.tgz"

4. Unzip
    tar -zxvf Python-3.7.9.tgz

5. Enter the source directory
    cd Python-3.7.9

6. Configure
    ./configure --prefix=/usr/local/python3

7. Compile and install
    make && make install

8. Create soft links (do not replace /usr/bin/python itself; yum on CentOS 7 depends on Python 2)
    ln -s /usr/local/python3/bin/python3 /usr/bin/python3
    ln -s /usr/local/python3/bin/pip3 /usr/bin/pip3

9. Verify
    [root@seeker-01 ~]# python
    Python 2.7.5 (default, Aug  7 2019, 00:51:29) 
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>>

    [root@seeker-01 ~]# python3
    Python 3.7.9 (default, Aug 28 2020, 13:28:49) 
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>>

2. Install Scrapy 2

1. Install scrapy2
    pip3 install scrapy

2. Verify Scrapy 2 in the python3 shell
    [root@seeker-01 ~]# python3
    Python 3.7.9 (default, Aug 28 2020, 13:28:49) 
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import scrapy
    >>> scrapy.version_info
    (2, 3, 0)
    >>> 

3. Create a soft link for scrapy
    ln -s /usr/local/python3/bin/scrapy /usr/bin/scrapy

4. Run the scrapy command to confirm it works
    [root@seeker-01 ~]# scrapy
    Scrapy 2.3.0 - no active project

    Usage:
      scrapy <command> [options] [args]

    Available commands:
      bench         Run quick benchmark test
      commands      
      fetch         Fetch a URL using the Scrapy downloader
      genspider     Generate new spider using pre-defined templates
      runspider     Run a self-contained spider (without creating a project)
      settings      Get settings values
      shell         Interactive scraping console
      startproject  Create new project
      version       Print Scrapy version
      view          Open URL in browser, as seen by Scrapy

      [ more ]      More commands available when run from project directory

    Use "scrapy <command> -h" to see more info about a command

3. Create a Scrapy project

1. Create the project
    scrapy startproject <project name>   (my project is called mytest01)

    [root@seeker-01 ~]# scrapy startproject mytest01
    New Scrapy project 'mytest01', using template directory '/usr/local/python3/lib/python3.7/site-packages/scrapy/templates/project', created in:
        /root/mytest01

    You can start your first spider with:
        cd mytest01
        scrapy genspider example example.com

    After execution completes, a mytest01 directory is created under my /root directory.
    /root/mytest01/mytest01/spiders is where the spider .py files go; the file I will create is called test01.py
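
    For orientation, the project skeleton that startproject generates looks roughly like this (reproduced from memory of the Scrapy 2.3 template, so treat the exact file list as an assumption):

    mytest01/
        scrapy.cfg            # deploy/project configuration file
        mytest01/             # the project's Python module
            __init__.py
            items.py          # item definitions
            middlewares.py    # spider and downloader middlewares
            pipelines.py      # item pipelines
            settings.py       # project settings (BOT_NAME, ROBOTSTXT_OBEY, ...)
            spiders/          # spider files such as test01.py go here
                __init__.py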

2. Write the spider
    Create a test01.py file in the /root/mytest01/mytest01/spiders directory with the following content:

import scrapy

class mingyan(scrapy.Spider):  # the spider must inherit from scrapy.Spider

    name = "mytest01"  # spider name, used by "scrapy crawl <name>"

    def start_requests(self):  # this method yields the requests the spider starts from
        urls = [  # links to crawl
            'http://lab.scrapyd.cn/page/1/',
            'http://lab.scrapyd.cn/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)  # each downloaded page is handed to parse()

    def parse(self, response):
        """
        start_requests has already downloaded the pages; how to pull out the content
        we want is defined here. Nothing is extracted yet: the page is simply saved
        to disk. Extracting data with XPath, CSS selectors or regular expressions
        comes later. This example only shows the Scrapy workflow:
        1. define the links;
        2. download the pages through those links;
        3. define rules and extract the data.
        That is the whole process; it looks quite simple, doesn't it?
        """
        page = response.url.split("/")[-2]  # take the page number from the URL, e.g. /page/1/ -> 1
        filename = 'mingyan-%s.html' % page  # build the file name, e.g. mingyan-1.html for page 1
        with open(filename, 'wb') as f:  # ordinary Python file handling
            f.write(response.body)  # response.body is the raw page that was just downloaded
        self.log('save file: %s' % filename)
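
Once this skeleton works, the natural next step is to extract data in parse() instead of dumping raw HTML. Purely as an illustration, a parse method built on CSS selectors could look like the sketch below; the selectors (div.quote, span.text, small.author, li.next) are hypothetical and would need to be matched to the real page structure:

    def parse(self, response):
        # hypothetical selectors: adjust them to the page you are crawling
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        # follow a "next page" link if the page has one
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Items yielded this way show up in the crawl log and can also be exported, for example with: scrapy crawl mytest01 -o quotes.json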

3. Run the spider (from inside the project directory; the argument to crawl is the spider's name attribute, not the file name)
    scrapy crawl mytest01

2020-08-28 14:27:15 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: mytest01)
2020-08-28 14:27:15 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.9 (default, Aug 28 2020, 13:28:49) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 3.1, Platform Linux-3.10.0-1062.12.1.el7.x86_64-x86_64-with-centos-7.6.1810-Core
2020-08-28 14:27:15 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-08-28 14:27:15 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'mytest01',
 'NEWSPIDER_MODULE': 'mytest01.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['mytest01.spiders']}
2020-08-28 14:27:15 [scrapy.extensions.telnet] INFO: Telnet Password: 168ea9bb811b4d38
2020-08-28 14:27:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-08-28 14:27:15 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-08-28 14:27:15 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-08-28 14:27:15 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-08-28 14:27:15 [scrapy.core.engine] INFO: Spider opened
2020-08-28 14:27:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-08-28 14:27:15 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-08-28 14:27:15 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://lab.scrapyd.cn/robots.txt> (referer: None)
2020-08-28 14:27:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://lab.scrapyd.cn/page/1/> (referer: None)
2020-08-28 14:27:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://lab.scrapyd.cn/page/2/> (referer: None)
2020-08-28 14:27:15 [mytest01] DEBUG: save file: mingyan-1.html
2020-08-28 14:27:15 [mytest01] DEBUG: save file: mingyan-2.html
2020-08-28 14:27:15 [scrapy.core.engine] INFO: Closing spider (finished)
2020-08-28 14:27:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 666,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 6436,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 0.344661,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 8, 28, 6, 27, 15, 851756),
 'log_count/DEBUG': 5,
 'log_count/INFO': 10,
 'memusage/max': 47456256,
 'memusage/startup': 47456256,
 'response_received_count': 3,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2020, 8, 28, 6, 27, 15, 507095)}
2020-08-28 14:27:15 [scrapy.core.engine] INFO: Spider closed (finished)
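
One detail worth noting in the log above: the very first request is for /robots.txt, because the generated project sets ROBOTSTXT_OBEY = True. The 404 simply means the site has no robots.txt, so the crawl proceeds. The relevant part of the generated settings.py looks roughly like this (an excerpt from memory of the template, not copied from my file):

    # mytest01/settings.py (excerpt)
    BOT_NAME = 'mytest01'

    SPIDER_MODULES = ['mytest01.spiders']
    NEWSPIDER_MODULE = 'mytest01.spiders'

    # Obey robots.txt rules (set to False only if you are sure you are allowed to)
    ROBOTSTXT_OBEY = True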


At this point, two files appear in the directory where the crawl command was run:
mingyan-1.html
mingyan-2.html
With that, the environment is set up successfully.

Origin blog.csdn.net/ziele_008/article/details/108281493