A simple Scrapy example

The day before yesterday, a senior in my lab asked me to write a simple Scrapy crawler. I had looked at Scrapy a bit before but never really understood it, so I took this chance to deepen my understanding of the Scrapy workflow. With the end of term approaching I have a lot of homework to do (or honestly, I'm just too much of a beginner, hee hee), so I decided to find a simple example and imitate it.

Searching turned up an example that crawls the Tencent recruitment site (https://www.cnblogs.com/xinyangsdut/p/7628770.html), but after typing it out by hand it would not run, and no amount of debugging fixed it. I then looked at an example that crawls the cnblogs homepage (https://www.jianshu.com/p/78f0bc64feb8); it only crawls the first page, but with a small change it can crawl any number of pages. Making that change, I ran into a bit of trouble too. My understanding of Scrapy is still not deep enough (again... I'm just a beginner, shedding tears of ignorance), but in the end it finally worked. Next, a brief anatomy of this example.

  1. First, write the items file: define the fields to extract based on the content we want to crawl. The code is as follows:
    import scrapy


    class CnblogItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()    # the title of each post
        link = scrapy.Field()     # the link of each post
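     A side note on how the item behaves (my own illustration, not from the original post): a scrapy.Item works like a dictionary whose keys are restricted to the fields declared above, so assigning to an undeclared key raises a KeyError. A minimal sketch with made-up values:

        item = CnblogItem()
        item['title'] = ['post title 1', 'post title 2']    # hypothetical values
        item['link'] = ['/p/1.html', '/p/2.html']           # hypothetical values
        # item['author'] = 'x'    # would raise KeyError: 'author' is not a declared field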
  2. Write the spider file (this is the key part), here named cnblog_spider. The code is as follows:
    # -*- coding: utf-8 -*-
    import scrapy
    from cnblog.items import CnblogItem
    
    
    class CnblogSpiderSpider(scrapy.Spider):
        name = "cnblog_spider"
        allowed_domains = ["cnblogs.com"]
        url = 'https://www.cnblogs.com/sitehome/p/'
        offset = 1
        start_urls = [url+str(offset)]
    
        def parse(self, response):
            item = CnblogItem()

            item['title'] = response.xpath('//a[@class="titlelnk"]/text()').extract()    # extract the post titles with XPath
            item['link'] = response.xpath('//a[@class="titlelnk"]/@href').extract()      # extract the post links

            yield item

            print("page {0} crawled".format(self.offset))
            if self.offset < 10:        # how many pages to crawl
                self.offset += 1
                url2 = self.url + str(self.offset)    # splice together the next page's URL
                print(url2)
                yield scrapy.Request(url=url2, callback=self.parse)

     There is nothing particularly hard to understand in this part of the code, but working through the whole execution flow thoroughly (each yielded item is handed to the pipeline, and each yielded Request is scheduled and its response fed back into parse()) is very helpful for understanding Scrapy.
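     If you want to check the selectors before running the spider, one handy way (my own suggestion, not from the original post) is the Scrapy shell; the XPath expressions below are the same ones used in parse():

        # scrapy shell "https://www.cnblogs.com/sitehome/p/1"
        # the shell fetches the page and exposes it as `response`
        titles = response.xpath('//a[@class="titlelnk"]/text()').extract()
        links = response.xpath('//a[@class="titlelnk"]/@href').extract()
        print(len(titles), titles[:3])    # how many posts matched, plus a small sample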

  3. Write the pipelines file, which writes the crawled data to a txt file. The code is as follows:
    class FilePipeline(object):
        def process_item(self, item, spider):
            data = ''
            with open('cnblog.txt', 'a', encoding='utf-8') as f:
                titles = item['title']
                links = item['link']
                for i, j in zip(titles, links):    # pair each title with its link
                    data += i + '     ' + j + '\n'
                f.write(data)    # the with statement closes the file automatically
            return item
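     One design note: process_item() is called once for every yielded item, so the version above reopens cnblog.txt each time. For this small crawl that is fine; a common alternative (a sketch, not from the original post) is to open the file once in open_spider() and close it in close_spider(), two hooks that Scrapy calls at the start and end of the crawl:

        class FilePipeline(object):
            def open_spider(self, spider):
                # called once when the spider starts
                self.f = open('cnblog.txt', 'a', encoding='utf-8')

            def close_spider(self, spider):
                # called once when the spider finishes
                self.f.close()

            def process_item(self, item, spider):
                for i, j in zip(item['title'], item['link']):
                    self.f.write(i + '     ' + j + '\n')
                return item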
  4. Modify the settings file. The changes are as follows:
    ROBOTSTXT_OBEY = False           # this parameter must be changed to False
    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en',
        # newly added User-Agent
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    }
    # newly modified
    ITEM_PIPELINES = {
        'cnblog.pipelines.FilePipeline': 300,    # enable saving to the txt file
    }
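     For reference: ROBOTSTXT_OBEY = False tells Scrapy to stop honoring the site's robots.txt, and the 300 in ITEM_PIPELINES is the pipeline's priority (an integer from 0 to 1000; lower values run first). If the site ever throttles the crawl, one optional setting worth trying (my own suggestion, not from the original example) is a download delay:

        DOWNLOAD_DELAY = 1    # wait 1 second between requests; slower, but politer to the site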
  5. Write a main file. Scrapy cannot be debugged directly inside the IDE, but we can write a main file ourselves, and then run it inside the IDE like an ordinary project. The code is as follows:
    from scrapy import cmdline

    cmdline.execute("scrapy crawl cnblog_spider --nolog".split())    # --nolog hides the log output; remove it if you want to see more information about the run
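     To start the crawl, run main.py from the project root (the directory containing scrapy.cfg); it is equivalent to typing the crawl command in a terminal. Either of these works:

        python main.py
        scrapy crawl cnblog_spider --nolog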

     With that, our example is finished. Running main.py generates a cnblog.txt file whose contents are the titles and links we crawled.

Finally, a few words about what I gained from writing this example. While typing out the code I found that my grasp of some Python basics is still not solid enough; even the loop took me several tries to get right (don't laugh at me!). Also, sometimes there really is no motivation without pressure: before this I had never managed to write a Scrapy example that ran successfully, and this time, at the senior's request, I finally did. The road of learning is hard, but it should never be avoided. Come on!!!!!

Origin: www.cnblogs.com/liangxiyang/p/10960516.html