For anyone writing crawlers, the Scrapy framework is hard to do without. The requests module is enough to get results on a small project, but once the amount of data to crawl grows, you really need a framework.
As a first hands-on exercise, let's use Scrapy to write a program that crawls the Maoyan movie board. Environment configuration and the installation of Scrapy are skipped here.
The first step is to create the crawler project and spider file from the terminal:
```shell
# Create a crawler project
scrapy startproject Maoyan
cd Maoyan
# Create a spider file
scrapy genspider maoyan maoyan.com
```
In the items.py file of the project folder, define the structure of the data to be scraped:
```python
import scrapy


class MaoyanItem(scrapy.Item):
    name = scrapy.Field()
    star = scrapy.Field()
    time = scrapy.Field()
```
Then open maoyan.py and write the spider code. Remember to import the MaoyanItem class from items.py and instantiate it:
```python
import scrapy
from ..items import MaoyanItem


class MaoyanSpider(scrapy.Spider):
    name = 'maoyan'
    allowed_domains = ['maoyan.com']
    # Drop the start_urls variable

    # Override the start_requests() method instead
    def start_requests(self):
        for offset in range(0, 91, 10):
            url = 'https://maoyan.com/board/4?offset={}'.format(offset)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Match one <dd> node per film with XPath
        dd_list = response.xpath('//dl[@class="board-wrapper"]/dd')
        # Traverse the nodes
        for dd in dd_list:
            # Instantiate the MaoyanItem class from items.py
            item = MaoyanItem()
            # Assign values to the fields defined in items.py
            item['name'] = dd.xpath('./a/@title').get().strip()
            item['star'] = dd.xpath('.//p[@class="star"]/text()').get().strip()
            item['time'] = dd.xpath('.//p[@class="releasetime"]/text()').get().strip()
            # Hand the item over to the pipelines for processing
            yield item
```
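To see what start_requests() actually generates, the URL construction can be checked on its own. This is just a sketch of the pagination logic above; the offsets 0, 10, ..., 90 cover the ten pages of the board:

```python
# Build the ten board-page URLs the spider will request
urls = ['https://maoyan.com/board/4?offset={}'.format(offset)
        for offset in range(0, 91, 10)]

print(len(urls))    # 10 pages
print(urls[0])      # https://maoyan.com/board/4?offset=0
print(urls[-1])     # https://maoyan.com/board/4?offset=90
```

Each URL is then wrapped in a scrapy.Request whose callback is self.parse, so Scrapy schedules all ten pages up front.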
Customize the pipeline file pipelines.py to persist the data:
```python
import pymysql

from .settings import *


class MaoyanPipeline(object):
    # item: the item data yielded from the spider file maoyan.py
    def process_item(self, item, spider):
        print(item['name'], item['time'], item['star'])
        return item


# Custom pipeline - MySQL database
class MaoyanMysqlPipeline(object):
    # Executed once when the spider starts running
    def open_spider(self, spider):
        print('open_spider function output')
        # Generally used to establish the database connection
        self.db = pymysql.connect(
            host=MYSQL_HOST,
            user=MYSQL_USER,
            password=MYSQL_PWD,
            database=MYSQL_DB,
            charset=MYSQL_CHAR
        )
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        ins = 'insert into filmtab values(%s,%s,%s)'
        # The second argument to execute() is a list
        L = [
            item['name'], item['star'], item['time']
        ]
        self.cursor.execute(ins, L)
        self.db.commit()
        return item

    # Executed once when the spider finishes
    def close_spider(self, spider):
        print('close_spider function output')
        # Generally used to disconnect from the database
        self.cursor.close()
        self.db.close()
```
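The MySQL pipeline assumes a filmtab table already exists in the maoyandb database. A minimal sketch of that table and the parameterized insert is shown below, using sqlite3 only so the example is self-contained (the real pipeline uses pymysql, where the placeholder is %s instead of ?, and the column types would be varchar):

```python
import sqlite3

# Hypothetical schema; in MySQL this might be:
#   create table filmtab(name varchar(100), star varchar(300), time varchar(100));
db = sqlite3.connect(':memory:')
cursor = db.cursor()
cursor.execute('create table filmtab(name text, star text, time text)')

# A sample item as the spider would yield it
item = {'name': '霸王别姬', 'star': '主演:张国荣', 'time': '上映时间:1993-01-01'}

# Same pattern as process_item(): placeholders plus a list of values,
# which lets the driver escape the data instead of string formatting
cursor.execute('insert into filmtab values (?, ?, ?)',
               [item['name'], item['star'], item['time']])
db.commit()

print(cursor.execute('select * from filmtab').fetchone())
```

Passing the values as a separate list rather than formatting them into the SQL string is what protects the insert from quoting problems in film titles.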
The next step is to modify the configuration file settings.py. Note that the pipeline with the lower priority number runs first, so MaoyanMysqlPipeline (200) processes each item before MaoyanPipeline (300):
```python
USER_AGENT = 'Mozilla/5.0'
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
ITEM_PIPELINES = {
    'Maoyan.pipelines.MaoyanPipeline': 300,
    'Maoyan.pipelines.MaoyanMysqlPipeline': 200,
}
# Define the MySQL-related variables
MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'root'
MYSQL_PWD = '123456'
MYSQL_DB = 'maoyandb'
MYSQL_CHAR = 'utf8'
```
Finally, create a run.py file, and the spider is ready to run:
```python
from scrapy import cmdline

cmdline.execute('scrapy crawl maoyan'.split())
```