To keep the project layout clear, each added spider is stored under a folder named for its city.
For example, for the Comprehensive Channel of Ningbo News Network, create a ningbo (Ningbo) folder under spiders and put that channel's spider inside it.
Project design framework diagram:
See the attached tree.jpg for the actual project tree; schematically:

```
webcrawler
|——scrapy.cfg
|——webcrawler
|    |——items.py
|    |——pipelines.py
|    |——settings.py
|    |——__init__.py
|    |——spiders
|         |——__init__.py
|         |——cityAAA
|         |    |——website_channel_type_id_spider.py
|         |——cityBBB
|         |    |——website_channel_type_id_spider.py
|         |——city***
|——increment
|    |——cityAAA
|    |    |——website_channel_type_id.txt
|    |——cityBBB
|    |    |——website_channel_type_id.txt
|    |——city***
|——logs (to be implemented)
|    |——cityAAA
|    |    |——website_channel_type_id.log
|    |——cityBBB
|    |    |——website_channel_type_id.log
|    |——warnlog
|         |——warn.log
```
The storage location of the data is specified inside each spider.
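In the example spider below, that location is derived by appending a date-stamped file name to a base directory. Distilled into a standalone helper (a sketch; the base path and name prefix simply follow the example spider):

```python
import time

def daily_savefile(basedir="/download/datastore/"):
    # Epoch seconds for midnight of today, matching the spider's filename logic
    day = time.strftime('%Y-%m-%d', time.localtime(time.time()))
    epoch = str(time.mktime(time.strptime(day, '%Y-%m-%d')))[0:10]
    return basedir + "news_doc_incoming_1_" + epoch

print(daily_savefile())
```

All data collected on the same day therefore lands in the same file.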
Steps to create a spider (taking the World Wide Web (huanqiu.com) Overseas View China channel as the example):
1. Create the spider under the folder for its city. This channel is national in scope, so create a folder named quanguo (national) under spiders:

```shell
mkdir quanguo
```

Enter the folder and create the spider file there; the naming rule is website_channel_type_id_spider.py:

```shell
cd quanguo
vim huanqiu_oversea_new_1_spider.py
```

Also create a file named __init__.py, which is required for Python to treat the folder as a package:

```shell
touch __init__.py
```

2. In the newly created spider file, import the packages the program requires:
```python
#coding=utf-8
import sys
sys.path.insert(0, '..')
import scrapy
import time
import os
import hashlib
import struct
from scrapy import Spider
from webcrawler.items import TextDetailItem
reload(sys)
sys.setdefaultencoding('gbk')
```

3. Set the parameters: the location of the increment file, the incrementurl list that holds the increment, the cap on the first collection, the data storage location, and the log storage location.
```python
incrementurl = []
indexUrl = ''
fListNum = 0
firstCrawlerNum = 100
incrementfile = "/download/Scrapy-1.0.1/webcrawler/webcrawler/increment/quanguo/huanqiu_oversea_new_1.txt"
sourceDefault = "World Wide Web--Overseas View China"
savefile = "/download/datastore/"
```

4. Create the spider class, inheriting from Spider. Set the spider's running name, allowed_urls and start_urls, build the file name for saving the collected data, and read the increment file to get the already-collected urls (used for incremental crawling, to prevent duplicate data):
```python
name = "huanqiu_oversea_new_1_spider"
allowed_urls = ["oversea.huanqiu.com"]
start_urls = ["http://oversea.huanqiu.com/"]

# fetch time start
tempa = time.strftime('%Y-%m-%d', time.localtime(time.time()))
filename = time.mktime(time.strptime(tempa, '%Y-%m-%d'))
filename = str(filename)[0:10]
global savefile
savefile += "news_doc_incoming_1_" + filename
# fetch time end

# Read increment file start
global indexUrl
global incrementfile
global incrementurl
indexUrl = start_urls[0]
if os.path.isfile(incrementfile):
    print('file have exist...')
else:
    f = open(incrementfile, "w")
    f.close()
furl = open(incrementfile, "r")
while True:
    line = furl.readline()
    if line == '':
        break
    else:
        incrementurl.append(line.strip('\n'))
furl.close()
# Read increment file end
```

5. Override the __init__ function, which is inherited from Spider; it is optional, and the code from step 4 could also be placed inside it:
```python
def __init__(self):
    print('init...')
```

6. Write the parse(self, response) function to crawl the list pages. It: 1) gets the list urls; 2) gets the url of the next page; 3) decides whether there is a next page; 4) checks whether a url already exists in the increment — if it does, the data from that point on has already been collected, so break; 5) fetches each url's detail page content; 6) caps the number of items on the first collection, because list data is paginated and a first run with no increment could collect a very large amount; 7) stores the list urls of the first page as the new increment; 8) moves on to collecting the next list page.
```python
def parse(self, response):
    global indexUrl
    global fListNum
    global firstCrawlerNum
    # Get the requested url, i.e. (start_urls)
    pUrl = str(response.url)
    sel = scrapy.Selector(response)
    # Get the list entries
    list = sel.xpath("//div[@class='leftList']/ul[@name='contentList']/li[@name='item']")
    print(str(len(list)))
    # Get the url of the next page
    nextUrl = sel.xpath("//div[@class='turnPage']/a/@href").extract()
    # Judgment logic for the next page
    if nextUrl is not None and len(nextUrl) > 1:
        nextPage = nextUrl[1]
    elif nextUrl is not None and len(nextUrl) > 0:
        nextPage = nextUrl[0].encode('utf-8')
    else:
        nextPage = None
    # Define the increment flag
    flag = '0'
    # Visit every url in the list in turn; the title is extracted only for testing convenience
    for single in list:
        title = single.xpath("a/dl/dt/h3/text()").extract()
        url = single.xpath("a/@href").extract()
        tempurl = url[0].encode('utf-8')
        detailurl = tempurl
        print(detailurl)
        # If the url exists in the increment, the data from here on has already been collected
        if detailurl in incrementurl:
            flag = '1'
            break
        else:
            # Fetch the content of the detail page for this url
            reqdetail = scrapy.Request(url=detailurl, meta={'pUrl': pUrl}, callback=self.parse_detail)
            yield reqdetail
    # Cap the first collection: with no increment yet, paginated lists could yield a huge amount of data
    if len(incrementurl) == 0:
        fListNum += len(list)
        if fListNum > firstCrawlerNum:
            return
    # Store the list urls of the first page as the increment
    if indexUrl == str(response.url):
        writeStr = ''
        for tSingle in list:
            tUrl = tSingle.xpath("a/@href").extract()
            ttUrl = tUrl[0].encode('utf-8')
            writeStr += ttUrl + '\n'
        open(incrementfile, 'w').write(writeStr + '\n')
    # Move on to collecting the next list page
    if flag == '0':
        if nextPage is not None:
            req = scrapy.Request(url=nextPage, callback=self.parse)
            yield req
        else:
            return
```

7. Define the detail-page collection function: set the rules for extracting the detail-page data and put the data into the item container for the pipeline to store.
```python
def parse_detail(self, response):
    global savefile
    global sourceDefault
    # Create an item container for the collected data
    textDetail = TextDetailItem()
    sel = scrapy.Selector(response)
    # Rules for extracting the detail page's title, source, publish time and body
    title = sel.xpath("//div[@class='conText']/h1/text()").extract()
    source = sel.xpath("//*[@id='source_baidu']/a/text()").extract()
    source2 = sel.xpath("//*[@id='source_baidu']/text()").extract()
    tTime = sel.xpath("//strong[@id='pubtime_baidu']/text()").extract()
    contents = sel.xpath("//div[@id='text']/p").extract()
    url = response.url
    if not url:
        return
    else:
        # Generate the MD id from the detail page url
        tmp_msg_id = hashlib.md5(url).hexdigest()[0:8]
        msg_id = long(struct.unpack('Q', tmp_msg_id)[0])
        MD = msg_id
        textDetail['MD'] = str(MD)
        textDetail['CL'] = "news"
        textDetail['UR'] = str(url)
        PR = response.meta['pUrl']
        textDetail['PR'] = str(PR)
        FC = "0"
        textDetail['FC'] = FC
        HR = str(response.status)
        textDetail['HR'] = HR
        ET = "0"
        textDetail['ET'] = ET
        RT = "1"
        textDetail['RT'] = RT
        title = title[0].encode('utf-8')
        print("text title:" + str(title))
        textDetail['TI'] = title
        SR = sourceDefault
        if len(source) > 0:
            SR = source[0].encode('utf-8')
        else:
            if len(source2) > 0:
                SRstr = source2[0].encode('utf-8')
                SRstr = SRstr.replace("\r\n", "")
                SRstr = SRstr.strip()
                SR = SRstr[9:]
        textDetail['SR'] = SR
        imageurls = ''
        # Extract the image links in the body
        imagelist = sel.xpath("//div[@id='text']/*/img[@src]/@src").extract()
        for i in range(len(imagelist)):
            if i == len(imagelist) - 1:
                imageurls += imagelist[i]
            else:
                imageurls += imagelist[i] + '|'
        imageurls = imageurls.strip()
        textDetail['IU'] = str(imageurls)
        textDetail['KW'] = ""
        temptime = tTime[0].strip()
        temptime = time.mktime(time.strptime(temptime, '%Y-%m-%d %H:%M:%S'))
        PT = (str(temptime))[0:10]
        textDetail['PT'] = PT
        dTime = time.time()
        DT = (str(dTime))[0:10]
        textDetail['DT'] = DT
        textDetail['TY'] = ""
        content = ""
        for i in range(0, len(contents)):
            content += contents[i].encode('utf-8')
        # content = content[0].encode('utf-8')
        TX = content
        textDetail['TX'] = TX
        textDetail['SP'] = savefile
        return textDetail
```

8. How to run all spiders under scrapy at once: after searching for information, write a crawlall.py script; it gets all the spiders that can be run under the scrapy framework and starts them. The crawlall script code is as follows:
```python
from scrapy.commands import ScrapyCommand
from scrapy.exceptions import UsageError
from scrapy.utils.conf import arglist_to_dict


class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def add_options(self, parser):
        ScrapyCommand.add_options(self, parser)
        parser.add_option("-a", dest="spargs", action="append", default=[],
                          metavar="NAME=VALUE",
                          help="set spider argument (may be repeated)")
        parser.add_option("-o", "--output", metavar="FILE",
                          help="dump scraped items into FILE (use - for stdout)")
        parser.add_option("-t", "--output-format", metavar="FORMAT",
                          help="format to use for dumping items with -o")

    def process_options(self, args, opts):
        ScrapyCommand.process_options(self, args, opts)
        try:
            opts.spargs = arglist_to_dict(opts.spargs)
        except ValueError:
            raise UsageError("Invalid -a value, use -a NAME=VALUE", print_help=False)

    def run(self, args, opts):
        # settings = get_project_settings()
        spider_loader = self.crawler_process.spider_loader
        for spidername in args or spider_loader.list():
            print "*********cralall spidername************" + spidername
            self.crawler_process.crawl(spidername, **opts.spargs)
        self.crawler_process.start()
```

9. A continuously running shell script, runspider.sh, keeps crawlall.py running by looping over it. The content of the runspider.sh script is as follows:
```shell
#!/bin/bash
int=1
while ((1))
do
    echo $int
    scrapy crawlall
    let "int++"
done
```

(Note the bash shebang: the `(( ))` and `let` constructs are bash features, not POSIX sh.)
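One detail the write-up skips: Scrapy only discovers a custom command such as crawlall if the package containing it is registered in settings.py via the COMMANDS_MODULE setting. The module path below is an assumption that crawlall.py sits in a `commands` package inside the project:

```python
# settings.py -- let scrapy find the custom crawlall command
COMMANDS_MODULE = 'webcrawler.commands'
```

With this in place, `scrapy crawlall` becomes available from the project directory.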