Crawling neighborhood (xiaoqu) listing data with Scrapy

Scrapy installation

  • Linux
  1. pip install scrapy
  • Windows
  1. pip install wheel
  2. Download the Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
  3. Change into the download directory and pip install the downloaded .whl file
  4. pip install pywin32
  5. pip install scrapy
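
  On either platform, a quick way to confirm the installation worked (assuming pip put the scrapy command on your PATH) is to check the version from the command line:

  scrapy version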

Create a project and spider file

  • New Project

  scrapy startproject crawPro

  • New spider file

  Change into the project directory: cd crawPro

  scrapy genspider -t crawl craw5i5j www.xxx.com # www.xxx.com is a placeholder start URL; comment it out later in the spider file
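
  After these two commands, the project should have the standard Scrapy layout (shown roughly below; minor files such as __init__.py are omitted):

  crawPro/
      scrapy.cfg
      crawPro/
          items.py
          middlewares.py
          pipelines.py
          settings.py
          spiders/
              craw5i5j.py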

 Writing the spider file

  • Comment out allowed_domains and replace the placeholder start URL with the real starting URL
  • Because the URL of the first page differs from that of the following pages, add a second Rule whose link extractor uses a regular expression
  • Parameter follow=True means follow the extracted links, so every listing page gets crawled
  • Parameter callback='parse_item' is the callback that parses the response of each matched URL; the parsing happens in parse_item
  • The response supports selectors: extract_first() returns the first matched value, extract() returns all matched values as a list
  • items.py: define the item fields there, see the code below
  • Import the item class: from crawPro.items import CrawproItem; instantiate it with item = CrawproItem() and load the parsed values into it
  • Finally, yield item to hand the item to the pipeline
  • craw5i5j.py code
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor   # link extractor
from scrapy.spiders import CrawlSpider, Rule      # Rule: rule parser object
from crawPro.items import CrawproItem


class Craw5i5jSpider(CrawlSpider):
    name = 'craw5i5j'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://nj.5i5j.com/xiaoqu/pukouqu/']

    # Link extractor: with follow=False it only extracts the matching links from the start page.
    # The allow parameter is a regular expression.
    link = LinkExtractor(allow=r'^https://nj.5i5j.com/xiaoqu/pukouqu/n\d+/$')
    link1 = LinkExtractor(allow=r'^https://nj.5i5j.com/xiaoqu/pukouqu/$')

    rules = (
        # Rule is instantiated with a LinkExtractor and a callback function
        Rule(link, callback='parse_item', follow=True),
        Rule(link1, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for li in response.xpath('//div[@class="list-con-box"]/ul/li'):
            xq_name = li.xpath(".//h3[@class='listTit']/a/text()").extract_first().strip()
            xq_chengjiao = li.xpath(".//div[@class='listX']/p/span[1]/a/text()").extract_first().strip()
            xq_danjia = li.xpath(".//div[@class='listX']/div[@class='jia']/p[@class='redC']//text()").extract_first().strip()
            xq_zongjia = li.xpath(".//div[@class='listX']/div[@class='jia']/p[2]/text()").extract_first().strip()

            item = CrawproItem()
            item['xq_name'] = xq_name
            item['xq_chengjiao'] = xq_chengjiao
            item['xq_danjia'] = xq_danjia
            item['xq_zongjia'] = xq_zongjia

            yield item
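  Before committing XPath expressions to the spider, it can help to try them interactively; a minimal sketch with scrapy shell (assuming the listing page responds without extra headers):

scrapy shell 'https://nj.5i5j.com/xiaoqu/pukouqu/'
>>> lis = response.xpath('//div[@class="list-con-box"]/ul/li')
>>> lis[0].xpath(".//h3[@class='listTit']/a/text()").extract_first()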
  • items.py Code
import scrapy

class CrawproItem(scrapy.Item):
    # define the fields for your item here like:
    xq_name = scrapy.Field()
    xq_chengjiao = scrapy.Field()
    xq_danjia = scrapy.Field()
    xq_zongjia = scrapy.Field()

  Writing the pipeline file

  • Override the parent-class method def open_spider(self, spider): it opens the output file once, so the file is not reopened for every item (if writing to a database, open the database connection here instead; a database variant is sketched after the code below)
  • Override def close_spider(self, spider): close the file opened in open_spider (or close the database connection)
  • Method process_item: write each item to the file in the desired format (or perform the database write)
  • Code
class CrawproPipeline(object):
    fp = None

    # Override the parent-class method; it is called only once, to open the file.
    def open_spider(self, spider):
        self.fp = open("1.txt", 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(item["xq_name"] + "\t" + item["xq_chengjiao"] + "\t" + item["xq_danjia"] + "\t" + item["xq_zongjia"] + "\n")
        return item

    def close_spider(self, spider):
        self.fp.close()
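  The bullets above mention swapping the file for a database connection; a minimal sketch of that variant using the standard-library sqlite3 module (the database file name xiaoqu.db and table name xiaoqu are made up for illustration, not part of the original project):

import sqlite3

class CrawproSqlitePipeline(object):
    conn = None
    cursor = None

    # open the database connection once, when the spider starts
    def open_spider(self, spider):
        self.conn = sqlite3.connect("xiaoqu.db")
        self.cursor = self.conn.cursor()
        self.cursor.execute(
            "CREATE TABLE IF NOT EXISTS xiaoqu "
            "(xq_name TEXT, xq_chengjiao TEXT, xq_danjia TEXT, xq_zongjia TEXT)")

    # insert one row per item
    def process_item(self, item, spider):
        self.cursor.execute(
            "INSERT INTO xiaoqu VALUES (?, ?, ?, ?)",
            (item["xq_name"], item["xq_chengjiao"], item["xq_danjia"], item["xq_zongjia"]))
        self.conn.commit()
        return item

    # close the connection when the spider finishes
    def close_spider(self, spider):
        self.conn.close()

  To use it, register this class in ITEM_PIPELINES in settings.py, alongside or instead of CrawproPipeline.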

 Setting up middleware

  • Set up a User-Agent pool
  • In middlewares.py, find the method def process_request(self, request, spider): and pick a random User-Agent from the pool there
# requires "import random" at the top of middlewares.py
def process_request(self, request, spider):
    # Called for each request that goes through the downloader
    # middleware.

    # Must either:
    # - return None: continue processing this request
    # - or return a Response object
    # - or return a Request object
    # - or raise IgnoreRequest: process_exception() methods of
    #   installed downloader middleware will be called
    user_agents = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
        'Opera/8.0 (Windows NT 5.1; U; en)',
        'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
        'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
        'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)',
    ]
    request.headers['User-Agent'] = random.choice(user_agents)
    # print(request.headers)
    return None

 Configure the settings file

  • ROBOTSTXT_OBEY = False: do not obey the robots.txt protocol
  • Set the user agent: USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
  • Set the download delay according to the actual situation to avoid crawling too fast: DOWNLOAD_DELAY = 3 (a consolidated settings excerpt follows this section)
  • Enable the downloader middleware
DOWNLOADER_MIDDLEWARES = {
   'crawPro.middlewares.CrawproDownloaderMiddleware': 543,
}
  • Enable the item pipeline
ITEM_PIPELINES = {
   'crawPro.pipelines.CrawproPipeline': 300,
}
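  Taken together, the relevant excerpt of settings.py would look roughly like this (values from the bullets above; adjust DOWNLOAD_DELAY as needed):

# settings.py (excerpt)
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
ROBOTSTXT_OBEY = False   # do not obey robots.txt
DOWNLOAD_DELAY = 3       # seconds between requests, to avoid crawling too fast

DOWNLOADER_MIDDLEWARES = {
   'crawPro.middlewares.CrawproDownloaderMiddleware': 543,
}

ITEM_PIPELINES = {
   'crawPro.pipelines.CrawproPipeline': 300,
}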

 Running the spider

  • Run from the command line: scrapy crawl craw5i5j --nolog (suppress the log)
  • Run from the command line: scrapy crawl craw5i5j (print the log)
  • After the run, check whether the output file was generated (or whether the data arrived in the database), e.g. with the quick check below
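
  A minimal sanity-check sketch, assuming the file pipeline above wrote its output to 1.txt:

# print the first few scraped rows from the pipeline output
with open("1.txt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        print(line.rstrip("\n").split("\t"))
        if i >= 4:  # only show the first five rows
            break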
