Crawling the Maoyan movie chart with Scrapy

For anyone doing web scraping, the Scrapy framework is hard to avoid. For a small project you can get results with the requests module alone, but once the amount of data to crawl grows, a framework becomes necessary.

As a first hands-on exercise, let's write a Scrapy program that crawls the Maoyan movie chart (environment setup and Scrapy installation are skipped here).

The first step is to create the crawler project and spider file from the terminal:

# Create a Scrapy project
scrapy startproject Maoyan
cd Maoyan
# Create a spider file
scrapy genspider maoyan maoyan.com

In items.py, define the data structure for the fields to be crawled:

class MaoyanItem(scrapy.Item):
    name = scrapy.Field()  # movie title
    star = scrapy.Field()  # starring cast
    time = scrapy.Field()  # release date

Next, open maoyan.py and write the spider. Remember to import the MaoyanItem class from items.py and instantiate it:

import scrapy
from ..items import MaoyanItem

class MaoyanSpider(scrapy.Spider):
    name = 'maoyan'
    allowed_domains = ['maoyan.com']
    # The start_urls variable is removed

    # Override the start_requests() method instead
    def start_requests(self):
        for offset in range(0, 91, 10):
            url = 'https://maoyan.com/board/4?offset={}'.format(offset)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Instantiate the MaoyanItem class defined in items.py
        item = MaoyanItem()
        # XPath for the list of movie entries
        dd_list = response.xpath('//dl[@class="board-wrapper"]/dd')
        # Iterate over the entries and assign the fields defined in items.py
        for dd in dd_list:
            item['name'] = dd.xpath('./a/@title').get().strip()
            item['star'] = dd.xpath('.//p[@class="star"]/text()').get().strip()
            item['time'] = dd.xpath('.//p[@class="releasetime"]/text()').get().strip()
            # Hand the item over to the pipeline
            yield item
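The extraction logic inside parse() can be sketched without Scrapy using the standard library's ElementTree, which understands a limited XPath subset. The snippet below is a hypothetical, simplified stand-in for one `<dd>` entry of the board page, not the real markup:

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down version of one <dd> entry on the board page
snippet = """
<dd>
  <a title="Farewell My Concubine"></a>
  <p class="star">Leslie Cheung, Zhang Fengyi</p>
  <p class="releasetime">1993-01-01</p>
</dd>
"""

dd = ET.fromstring(snippet)
# Same idea as the spider's dd.xpath(...) calls, in ElementTree's XPath subset
name = dd.find('./a').get('title').strip()
star = dd.find(".//p[@class='star']").text.strip()
time_ = dd.find(".//p[@class='releasetime']").text.strip()
```

Scrapy's selectors support full XPath (attribute access via `@title`, `text()` nodes), which ElementTree cannot express directly, so attributes and text are read through `.get()` and `.text` here instead.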

Customize the pipeline file pipelines.py to handle persistent storage:

import pymysql
from .settings import *

class MaoyanPipeline(object):
    # item: the item data yielded from the spider file maoyan.py
    def process_item(self, item, spider):
        print(item['name'], item['time'], item['star'])
        return item

# Custom pipeline - MySQL database
class MaoyanMysqlPipeline(object):
    # Executed once when the spider starts running
    def open_spider(self, spider):
        print('I am the open_spider function output')
        # Generally used to establish the database connection
        self.db = pymysql.connect(
            host=MYSQL_HOST,
            user=MYSQL_USER,
            password=MYSQL_PWD,
            database=MYSQL_DB,
            charset=MYSQL_CHAR
        )
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        ins = 'insert into filmtab values(%s,%s,%s)'
        # The second argument to execute() is a list
        L = [
            item['name'], item['star'], item['time']
        ]
        self.cursor.execute(ins, L)
        self.db.commit()
        return item

    # Executed once when the spider finishes
    def close_spider(self, spider):
        print('I am the close_spider function output')
        # Generally used to close the database connection
        self.cursor.close()
        self.db.close()
                       
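To make the hook order concrete, here is a minimal stand-in (hypothetical, not Scrapy itself) that records the lifecycle calls the engine makes on a pipeline: open_spider once at startup, process_item once per item, close_spider once at shutdown:

```python
class RecordingPipeline(object):
    """Hypothetical stand-in that logs pipeline lifecycle calls."""
    def __init__(self):
        self.calls = []

    def open_spider(self, spider):
        self.calls.append('open_spider')

    def process_item(self, item, spider):
        self.calls.append('process_item')
        return item  # must return the item so later pipelines receive it

    def close_spider(self, spider):
        self.calls.append('close_spider')

# Simulate what the engine does for three crawled items:
p = RecordingPipeline()
p.open_spider(None)
for it in ({'name': 'a'}, {'name': 'b'}, {'name': 'c'}):
    p.process_item(it, None)
p.close_spider(None)
```

This is also why MaoyanMysqlPipeline connects to MySQL in open_spider rather than in process_item: the connection is made once, not once per item.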
 

The next step is to modify the configuration file settings.py

USER_AGENT = 'Mozilla/5.0'
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
}
ITEM_PIPELINES = {
   'Maoyan.pipelines.MaoyanPipeline': 300,
   'Maoyan.pipelines.MaoyanMysqlPipeline': 200,
}
# Define the MySQL-related variables
MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'root'
MYSQL_PWD = '123456'
MYSQL_DB = 'maoyandb'
MYSQL_CHAR = 'utf8'
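Note that lower numbers in ITEM_PIPELINES mean higher priority: each item passes through the pipelines in ascending order of their numbers, so the MySQL pipeline (200) runs before the printing pipeline (300). A quick sketch of the ordering rule:

```python
# ITEM_PIPELINES maps pipeline paths to priority numbers;
# Scrapy runs pipelines in ascending order of those numbers.
ITEM_PIPELINES = {
    'Maoyan.pipelines.MaoyanPipeline': 300,
    'Maoyan.pipelines.MaoyanMysqlPipeline': 200,
}

order = [path.rsplit('.', 1)[-1]
         for path, priority in sorted(ITEM_PIPELINES.items(),
                                      key=lambda kv: kv[1])]
# order lists the class names in the sequence each item visits them
```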

Finally, create a run.py file, and the spider can be launched:

from scrapy import cmdline
cmdline.execute('scrapy crawl maoyan'.split())
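cmdline.execute() expects an argv-style list rather than a single string, which is why the command is split; the call above is equivalent to typing scrapy crawl maoyan in the terminal:

```python
# Each whitespace-separated token becomes one argv element
argv = 'scrapy crawl maoyan'.split()
```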

 


Original post: www.cnblogs.com/lattesea/p/11756552.html