Scrapy installation
- Linux
  - pip install scrapy
- Windows
  - pip install wheel
  - Download the Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
  - cd into the download directory, then pip install the downloaded .whl file name
  - pip install pywin32
  - pip install scrapy
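To verify the installation (an optional check), print the installed Scrapy version:

scrapy version
python -c "import scrapy; print(scrapy.__version__)"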
Create a project and a spider file
- New project
  scrapy startproject crawPro
- New spider file
  cd into the project directory: cd crawPro
  scrapy genspider -t crawl craw5i5j www.xxx.com  # www.xxx.com is a placeholder start url; it gets commented out later in the spider file
Write the spider file
- Comment out allowed_domains (the placeholder www.xxx.com) and set the real start_urls
- Because the first page's url differs from the urls of the following pages, add a second Rule whose link extractor matches them with a regular expression
- Parameter: follow=True means keep following matched links, so every page gets crawled
- Parameter: callback='parse_item' names the callback that parses the data returned for each matched url; the method parse_item must be defined
- Parameter: response.xpath() returns data of Selector type; extract_first() takes the first value as a string, while extract() takes all values and returns a list (a short demo follows the items.py code below)
- File: items.py must define the fields to collect, see the code
- Import the item: from crawPro.items import CrawproItem; instantiate it with item = CrawproItem() and load the parsed values into it
- Finally: yield item hands the item over to the pipeline
- craw5i5j.py Code
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor  # link extractor
from scrapy.spiders import CrawlSpider, Rule     # Rule: rule parser object
from crawPro.items import CrawproItem


class Craw5i5jSpider(CrawlSpider):
    name = 'craw5i5j'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://nj.5i5j.com/xiaoqu/pukouqu/']

    # Link extractor: with follow=False it only extracts the links on the
    # start page that match; the allow parameter is a regular expression.
    link = LinkExtractor(allow=r'^https://nj.5i5j.com/xiaoqu/pukouqu/n\d+/$')
    link1 = LinkExtractor(allow=r'^https://nj.5i5j.com/xiaoqu/pukouqu/$')

    rules = (
        # Rule: instantiate the rule parser with a link extractor and a callback
        Rule(link, callback='parse_item', follow=True),
        Rule(link1, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for li in response.xpath('//div[@class="list-con-box"]/ul/li'):
            xq_name = li.xpath(".//h3[@class='listTit']/a/text()").extract_first().strip()
            xq_chengjiao = li.xpath(".//div[@class='listX']/p/span[1]/a/text()").extract_first().strip()
            xq_danjia = li.xpath(".//div[@class='listX']/div[@class='jia']/p[@class='redC']//text()").extract_first().strip()
            xq_zongjia = li.xpath(".//div[@class='listX']/div[@class='jia']/p[2]/text()").extract_first().strip()
            item = CrawproItem()
            item['xq_name'] = xq_name
            item['xq_chengjiao'] = xq_chengjiao
            item['xq_danjia'] = xq_danjia
            item['xq_zongjia'] = xq_zongjia
            yield item
- items.py Code
import scrapy


class CrawproItem(scrapy.Item):
    # define the fields for your item here like:
    xq_name = scrapy.Field()
    xq_chengjiao = scrapy.Field()
    xq_danjia = scrapy.Field()
    xq_zongjia = scrapy.Field()
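A minimal demo of the extract() vs extract_first() distinction mentioned above (the sample markup here is made up for illustration, not taken from the site):

from scrapy.selector import Selector

html = '<ul><li>a</li><li>b</li></ul>'  # hypothetical sample markup
sel = Selector(text=html)

print(sel.xpath('//li/text()').extract())        # ['a', 'b'] -- all values, as a list
print(sel.xpath('//li/text()').extract_first())  # 'a' -- only the first value, as a string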
Write the pipeline file
- Override the parent class method def open_spider(self, spider): opens the file once, so it is not reopened for every item (for a database, open the database connection here instead)
- Override the inherited method def close_spider(self, spider): closes the file opened in the first step (for a database, close the database connection)
- Method process_item: writes each item to the file in the configured format (or performs the database write); a database variant is sketched after the code below
- Code
class CrawproPipeline(object):
    fp = None

    # Override the parent class method; it is called only once: open the file.
    def open_spider(self, spider):
        self.fp = open("1.txt", 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(item["xq_name"] + "\t" + item["xq_chengjiao"] + "\t" + item["xq_danjia"] + "\t" + item["xq_zongjia"] + "\n")
        return item

    def close_spider(self, spider):
        self.fp.close()
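The database variant mentioned above could look roughly like this; a minimal sketch assuming MySQL via pymysql, with a made-up database crawdb and table xiaoqu (the connection parameters and table are illustrative, not part of the original project):

import pymysql  # assumed driver: pip install pymysql


class CrawproMysqlPipeline(object):
    conn = None

    # Open the database connection once, when the spider starts.
    def open_spider(self, spider):
        self.conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                                    password='', db='crawdb', charset='utf8')

    # Insert one row per item; values are passed as parameters, not concatenated.
    def process_item(self, item, spider):
        cursor = self.conn.cursor()
        try:
            cursor.execute(
                'insert into xiaoqu values (%s, %s, %s, %s)',
                (item['xq_name'], item['xq_chengjiao'], item['xq_danjia'], item['xq_zongjia']))
            self.conn.commit()
        except Exception:
            self.conn.rollback()
        return item

    # Close the connection when the spider finishes.
    def close_spider(self, spider):
        self.conn.close()

To switch to it, point ITEM_PIPELINES at crawPro.pipelines.CrawproMysqlPipeline instead of the file pipeline.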
Set up middleware
- Set up a User-Agent pool
- In middlewares.py, find the method def process_request(self, request, spider): and set the User-Agent pool there
# In middlewares.py: add "import random" at the top of the file.
import random

# Inside the CrawproDownloaderMiddleware class:
def process_request(self, request, spider):
    # Called for each request that goes through the downloader middleware.

    # Must either:
    # - return None: continue processing this request
    # - or return a Response object
    # - or return a Request object
    # - or raise IgnoreRequest: process_exception() methods of
    #   installed downloader middleware will be called
    user_agents = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
        'Opera/8.0 (Windows NT 5.1; U; en)',
        'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
        'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
        'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)',
    ]
    request.headers['User-Agent'] = random.choice(user_agents)
    # print(request.headers)
    return None
Set up the configuration file
- ROBOTSTXT_OBEY = False: set to False to not comply with the robots.txt protocol
- Set the user agent: USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
- Set the download delay so the crawl is not too fast; adjust it to the actual situation: DOWNLOAD_DELAY = 3
- Enable the middleware
  DOWNLOADER_MIDDLEWARES = {
      'crawPro.middlewares.CrawproDownloaderMiddleware': 543,
  }
- Enable the pipeline
  ITEM_PIPELINES = {
      'crawPro.pipelines.CrawproPipeline': 300,
  }
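Taken together, the relevant part of settings.py would look roughly like this (a consolidated sketch of the settings listed above; BOT_NAME is generated by scrapy startproject):

# settings.py -- only the settings discussed above
BOT_NAME = 'crawPro'

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
ROBOTSTXT_OBEY = False   # do not honor robots.txt
DOWNLOAD_DELAY = 3       # seconds to wait between requests

DOWNLOADER_MIDDLEWARES = {
    'crawPro.middlewares.CrawproDownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
    'crawPro.pipelines.CrawproPipeline': 300,
}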
Run the spider file
- Command-line execution: scrapy crawl craw5i5j --nolog  (do not print the log)
- Command-line execution: scrapy crawl craw5i5j  (print the log)
- After the run, check whether the output file was generated, or whether the data is in the database
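The spider can also be launched from a Python script instead of the command line; a minimal sketch, assuming it is run from the project root so the project settings are picked up:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from crawPro.spiders.craw5i5j import Craw5i5jSpider

# Load the project settings (user agent, middleware, pipeline, delay) and start the crawl.
process = CrawlerProcess(get_project_settings())
process.crawl(Craw5i5jSpider)
process.start()  # blocks until the crawl finishes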