Specific steps to create and code a spider

To keep the project structure clear, spiders are stored in per-city folders.

For example, for the Ningbo News Network comprehensive channel, create a ningbo folder under spiders and place the spider for that channel inside it.

Project directory layout:

    (See the attached tree.jpg for the actual project tree.)

webcrawler:.
|——scrapy.cfg
|——webcrawler
    |——items.py
    |——pipelines.py
    |——settings.py
    |——__init__.py
    |——spiders
          |——__init__.py
          |——cityAAA
                |——website_channel_type_id_spider.py
          |——cityBBB
                |——website_channel_type_id_spider.py
          |——city***
                ***
    |——increment
          |——cityAAA
                |——website_channel_type_id.txt
          |——cityBBB
                |——website_channel_type_id.txt
          |——city***
                ***
    |——logs (to be implemented)
          |——cityAAA
                |——website_channel_type_id.log
          |——cityBBB
                |——website_channel_type_id.log
          |——warnlog
                |——warn.log
The storage location of the data is specified in the spider
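For the example used below, this layout puts the spider at webcrawler/spiders/quanguo/huanqiu_oversea_new_1_spider.py and its increment file at webcrawler/increment/quanguo/huanqiu_oversea_new_1.txt.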

Steps to create the spider (using Huanqiu's "Overseas View China" page, oversea.huanqiu.com, as an example):

1. Create the spider under the folder of the region it belongs to (example: Huanqiu, Overseas View China, which is a national site rather than a city site)

   Create a folder named quanguo (national) under spiders:

mkdir quanguo
   Enter the folder and create the spider file under it; the naming rule is website_channel_type_id_spider.py:

cd quanguo
vim huanqiu_oversea_new_1_spider.py
   Create another file named __init__.py, which marks the directory as a Python package so its spiders can be discovered. The creation command is:

touch __init__.py
2. Open the newly created spider file and import the packages the program requires:

#coding=utf-8
import sys
sys.path.insert(0,'..')
import scrapy
import time
import os
import hashlib
import struct
from scrapy import Spider
from webcrawler.items import TextDetailItem
reload(sys)
sys.setdefaultencoding('gbk')
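TextDetailItem is imported from webcrawler/items.py, which is not shown in the original write-up. A minimal sketch of what it would contain, assuming one scrapy.Field per field the spider fills in step 7, is:

# webcrawler/items.py -- minimal sketch; field names are taken from the spider code below
import scrapy

class TextDetailItem(scrapy.Item):
    MD = scrapy.Field()   # id derived from the MD5 of the detail url
    CL = scrapy.Field()   # category, e.g. "news"
    UR = scrapy.Field()   # detail page url
    PR = scrapy.Field()   # parent (list page) url
    FC = scrapy.Field()
    HR = scrapy.Field()   # HTTP response status
    ET = scrapy.Field()
    RT = scrapy.Field()
    TI = scrapy.Field()   # title
    SR = scrapy.Field()   # source
    IU = scrapy.Field()   # image urls, '|'-separated
    KW = scrapy.Field()   # keywords
    PT = scrapy.Field()   # publish time (unix timestamp string)
    DT = scrapy.Field()   # download time (unix timestamp string)
    TY = scrapy.Field()   # type
    TX = scrapy.Field()   # article text (html of the paragraphs)
    SP = scrapy.Field()   # save path used by the pipeline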
3. Set the module-level parameters: the location of the increment file, the incrementurl list that stores the already-collected urls, the number of items to collect on the first crawl, the default source name, and the location where collected data is stored (log storage is still to be implemented). An example of the increment file's contents follows the code.
incrementurl=[]
indexUrl = ''
fListNum = 0
firstCrawlerNum=100
incrementfile="/download/Scrapy-1.0.1/webcrawler/webcrawler/increment/quanguo/huanqiu_oversea_new_1.txt"

sourceDefault="World Wide Web--Overseas View China"
savefile="/download/datastore/"
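The increment file is a plain text file with one previously collected list url per line, for example (hypothetical urls):

http://oversea.huanqiu.com/article/aaa.html
http://oversea.huanqiu.com/article/bbb.html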
4. Create a spider class that inherits from Spider. Set the spider's name, allowed_urls, start_urls, and the name of the file the collected data is saved to, and read the increment file to get the urls that have already been collected (used for incremental crawling, to prevent duplicate data):
class HuanqiuOverseaNewSpider(Spider):  # illustrative class name; Scrapy identifies the spider by its "name" attribute
        name="huanqiu_oversea_new_1_spider"
        allowed_urls=["oversea.huanqiu.com"]
        start_urls=["http://oversea.huanqiu.com/"]
        #fetch time start
        tempa=time.strftime('%Y-%m-%d',time.localtime(time.time()))
        filename=time.mktime(time.strptime(tempa,'%Y-%m-%d'))
        filename=str(filename)[0:10]
        global savefile
        savefile+="news_doc_incoming_1_"+filename
        #fetch time end
        #Read increment file start
        global indexUrl
        global incrementfile
        global incrementurl
        indexUrl=start_urls[0]
        if os.path.isfile(incrementfile):
                print('increment file already exists...')
        else:
               f=open(incrementfile,"w")
               f.close()
        furl = open(incrementfile, "r")
        while True:
                line = furl.readline()
                if line=='':
                        break
                else:
                        incrementurl.append(line.strip('\n'))
        furl.close()
        #Read increment file end
5. Override the __init__ function. It is inherited from Spider and writing it is optional; the code from step 4 could also be placed inside it:
        def __init__(self):
                print('init...')
6. Write the parse(self, response) function to crawl the list pages: 1) get the list urls; 2) get the url of the next page; 3) decide whether there is a next page; 4) check whether each url already exists in the increment list; if it does, everything after it has already been collected, so break; 5) request the detail page of each url; 6) cap the number of items collected on the first crawl, because list data is paginated and, with no increment yet, the first crawl could collect a very large amount; 7) store the list urls of the first page as the new increment; 8) move on to the next page of the list:
        def parse(self,response):
                global indexUrl
                global fListNum
                global firstCrawlerNum
                #Get the requested url, ie (start_urls)
                pUrl=str(response.url)
                sel=scrapy.Selector(response)
                #Define get list url
                list=sel.xpath("//div[@class='leftList']/ul[@name='contentList']/li[@name='item']")
                print(str(len(list)))
                #Get the url of the next page
                nextUrl=sel.xpath("//div[@class='turnPage']/a/@href").extract()
                #Judgment logic for the next page
                if nextUrl is not None and len(nextUrl)>1:
                        nextPage=nextUrl[1]
                elif nextUrl is not None and len(nextUrl)>0:
                        nextPage=nextUrl[0].encode('utf-8')
                else:
                        nextPage=None
                #define increment flag
                flag='0'
                #Get every url in the list in turn, the title can not be extracted, just for the convenience of testing
                for single in list:
                        title=single.xpath("a/dl/dt/h3/text()").extract()
                        url=single.xpath("a/@href").extract()
                        tempurl=url[0].encode('utf-8')
                        detailurl=tempurl
                        print(detailurl)
                        #Determine whether the url exists in the increment, if it exists, it means that the following data has been collected, break
                        if detailurl in incrementurl:
                                flag='1'
                                break
                        else:
                                #Get the content of its details page according to the url
                                reqdetail=scrapy.Request(url=detailurl,meta={'pUrl':pUrl},callback=self.parse_detail)
                                yield reqdetail
                #Set the number of first collections, because many list data is paged, there is no increment for the first collection, and the collected data may be very large
                if len(incrementurl)==0:
                        fListNum+=len(list)
                        if(fListNum>firstCrawlerNum):
                                return
                #Store the list url data of the first page as an incremental url
                if indexUrl==str(response.url):
                        writeStr=''
                        for tSingle in list:
                                tUrl=tSingle.xpath("a/@href").extract()
                                ttUrl=tUrl[0].encode('utf-8')
                                writeStr+=ttUrl+'\n'
                        open(incrementfile,'w').write(writeStr+'\n')
                #Enter the collection of the next page of the list page
                if flag=='0':
                        if nextPage is not None:
                                req = scrapy.Request(url=nextPage, callback=self.parse)
                                yield req
                else:
                        return
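Once the detail-page parser in step 7 is in place, this spider can be started on its own from the project root with the standard Scrapy command:

scrapy crawl huanqiu_oversea_new_1_spider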
7. Define the detail-page collection function, set the rules for extracting the detail-page data, and put the data into the item container for pipeline storage (a sketch of such a pipeline follows the function):
        def parse_detail(self,response):
                global savefile
                global sourceDefault
                #Create a data container object to store the collected data
                textDetail=TextDetailItem()
                sel=scrapy.Selector(response)
                #Define the rules for extracting the title, source, publishing time, and text of the details page
                title=sel.xpath("//div[@class='conText']/h1/text()").extract()
                source=sel.xpath("//*[@id='source_baidu']/a/text()").extract()
                source2=sel.xpath("//*[@id='source_baidu']/text()").extract()
                tTime=sel.xpath("//strong[@id='pubtime_baidu']/text()").extract()
                contents=sel.xpath("//div[@id='text']/p").extract()
                url=response.url
                if not url:
                        return
                else:
                        #Generate MD according to the details page url
                        tmp_msg_id = hashlib.md5(url).hexdigest()[0:8]
                        msg_id = long(struct.unpack('Q',tmp_msg_id)[0])
                        MD=msg_id
                        textDetail['MD']=str(MD)
                textDetail['CL']="news"
                textDetail['UR']=str(url)
                PR=response.meta['pUrl']
                textDetail['PR']=str(PR)
                FC="0"
                textDetail['FC']=FC
                HR=str(response.status)
                textDetail['HR']=HR
                ET="0"
                textDetail['ET']=ET
                RT="1"
                textDetail['RT']=RT
                title=title[0].encode('utf-8')
                print("text title:"+str(title))
                textDetail['TI']=title
                SR= sourceDefault
                if len(source)>0:
                        SR=source[0].encode('utf-8')
                else:
                        if len(source2)>0:
                                SRstr=source2[0].encode('utf-8')
                                SRstr=SRstr.replace("\r\n","")
                                SRstr=SRstr.strip()
                                SR=SRstr[9:]
                textDetail['SR']=SR
                imageurls=''
                #Extract the image link in the text
                imagelist=sel.xpath("//div[@id='text']/*/img[@src]/@src").extract()
                for i in range(len(imagelist)):
                        if i==len(imagelist)-1:
                                imageurls+=imagelist[i]
                        else:
                                imageurls+=imagelist[i]+'|'
                imageurls=imageurls.strip()
                textDetail['IU']=str(imageurls)
                textDetail['KW']=""
                temptime=tTime[0].strip()
                temptime=time.mktime(time.strptime(temptime,'%Y-%m-%d  %H:%M:%S'))
                PT=(str(temptime))[0:10]
                textDetail['PT']=PT
                dTime=time.time()
                DT=(str(dTime))[0:10]
                textDetail['DT']=DT
                textDetail['TY']=""
                content=""
                for i in range(0, len(contents)):
                        content+=contents[i].encode('utf-8')
               # content=content[0].encode('utf-8')

                TX=content
                textDetail['TX']=TX
                textDetail['SP']=savefile
                return textDetail
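The SP field carries the target file path (savefile); the actual writing is done by a pipeline in webcrawler/pipelines.py, which the original text does not show. A minimal sketch of such a pipeline, assuming each item is appended as one JSON line to the file named by SP and that the pipeline is enabled in settings.py via ITEM_PIPELINES, could look like this:

# webcrawler/pipelines.py -- minimal sketch, not the original implementation
import json

class TextDetailPipeline(object):
    def process_item(self, item, spider):
        # append each item as one json line to the file path carried in SP
        line = json.dumps(dict(item), ensure_ascii=False)
        if isinstance(line, unicode):   # Python 2: dumps may return unicode
            line = line.encode('utf-8')
        with open(item['SP'], 'a') as f:
            f.write(line + '\n')
        return item

# enable it in settings.py
ITEM_PIPELINES = {'webcrawler.pipelines.TextDetailPipeline': 300}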
8. How to run all spiders under scrapy at once: after searching for information, write a crawlall.py. This python script finds all spiders that can be run under the scrapy project and starts them. The crawlall script code is as follows (see the registration note after the code):
from scrapy.commands import ScrapyCommand
from scrapy.crawler import CrawlerRunner
from scrapy.exceptions import UsageError
from scrapy.utils.conf import arglist_to_dict
class Command(ScrapyCommand):
  requires_project = True
  def syntax(self):
    return '[options]'
  def short_desc(self):
    return 'Runs all of the spiders'
  def add_options(self, parser):
    ScrapyCommand.add_options(self, parser)
    parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
              help="set spider argument (may be repeated)")
    parser.add_option("-o", "--output", metavar="FILE",
              help="dump scraped items into FILE (use - for stdout)")
    parser.add_option("-t", "--output-format", metavar="FORMAT",
              help="format to use for dumping items with -o")
  def process_options(self, args, opts):
    ScrapyCommand.process_options(self, args, opts)
    try:
      opts.spargs = arglist_to_dict(opts.spargs)
    except ValueError:
      raise UsageError("Invalid -a value, use -a NAME=VALUE", print_help=False)
  def run(self, args, opts):
    #settings = get_project_settings()

    spider_loader = self.crawler_process.spider_loader
    for spidername in args or spider_loader.list():
      print "*********cralall spidername************" + spidername
      self.crawler_process.crawl(spidername, **opts.spargs)
    self.crawler_process.start()
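For scrapy to recognize the new crawlall command, the script has to sit in a commands package inside the project and that package has to be registered in settings.py through Scrapy's COMMANDS_MODULE setting; a sketch, assuming the package is named webcrawler/commands:

webcrawler/commands/__init__.py    # empty file
webcrawler/commands/crawlall.py    # the script above

# add to settings.py
COMMANDS_MODULE = 'webcrawler.commands'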
9. The continuously running shell script runspider.sh: running this script keeps executing crawlall in a loop. The content of runspider.sh is as follows:
#!/bin/bash
int=1
while((1))
do
echo $int
scrapy crawlall
let "int++"
done
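A typical way to launch it and keep it running in the background:

chmod +x runspider.sh
nohup ./runspider.sh > runspider.out 2>&1 &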