Workflow for writing a Scrapy project


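Step 1: Choose a folder, open a terminal there, and run scrapy startproject qidian

Step 2: Change into the inner spiders folder with cd qidian/qidian/spiders, then run scrapy genspider qidianyuedu qidian.com (the last argument is the domain)

Note: the spider name (qidianyuedu) must not be the same as the project name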

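Step 3: In the project directory, create a launcher script, starts.py. Run it from inside the project so that Scrapy can find scrapy.cfg.

from scrapy import cmdline

# Equivalent to running "scrapy crawl qidianyuedu" on the command line
cmdline.execute(["scrapy", "crawl", "qidianyuedu"])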

Step 4: Edit settings.py. The main changes are:

# Set the User-Agent header
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'

# Do not obey robots.txt
ROBOTSTXT_OBEY = False

# Enable the item pipeline
ITEM_PIPELINES = {
    'qidian.pipelines.QidianPipeline': 300,
}
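
The number next to each entry in ITEM_PIPELINES is its order: Scrapy runs pipelines from the lowest value to the highest, and the values are conventionally kept in the 0 to 1000 range. A minimal sketch of what that means, where QidianCleanPipeline is a hypothetical second pipeline that is not part of this project:

```python
# settings.py fragment (sketch only).
# QidianCleanPipeline is a hypothetical pipeline used for illustration.
ITEM_PIPELINES = {
    "qidian.pipelines.QidianCleanPipeline": 200,  # hypothetical: lower value, runs first
    "qidian.pipelines.QidianPipeline": 300,       # from this project: runs second
}

# Pipelines are applied in ascending order of their value.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

With a single pipeline, as in this project, the exact number does not matter.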

Step 5: Define the item fields that match the data you want to scrape

import scrapy


class QidianItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()
    author = scrapy.Field()
    category = scrapy.Field()
    status = scrapy.Field()
    bref = scrapy.Field()

Step 6: Write the pipeline. This example saves the data to MySQL.

import pymysql


class QidianPipeline(object):

    def __init__(self):
        # Placeholder connection settings; replace them with real values
        self.db = pymysql.connect(host="xx.xx.xx.xx",
                                  port=3306,
                                  user="root",
                                  password="xxx",
                                  database="xxx",
                                  charset="utf8mb4")
        self.cur = self.db.cursor()

    def process_item(self, item, spider):
        sql = """insert into qqyuedu(title,url,author,category,
                status,bref)
                VALUES (%s,%s,%s,%s,%s,%s)"""
        data = (item["title"], item["url"], item["author"], item["category"],
                item["status"], item["bref"])
        try:
            self.cur.execute(sql, data)
        except pymysql.MySQLError as e:
            # Roll back and log the failure instead of silently ignoring it
            self.db.rollback()
            spider.logger.error("Insert failed: %s", e)
        else:
            self.db.commit()
        return item

    def close_spider(self, spider):
        # Scrapy calls this hook once when the spider finishes
        self.cur.close()
        self.db.close()
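
To try the pipeline logic without a MySQL server, the same shape can be sketched with the standard-library sqlite3 module. This is a swapped-in stand-in, not the MySQL pipeline itself: the class name SqlitePipelineSketch, the in-memory database, and the plain dict used in place of a QidianItem are all assumptions made for illustration. It uses Scrapy's open_spider and close_spider hooks for setup and teardown.

```python
import sqlite3


class SqlitePipelineSketch:
    """Same shape as a Scrapy item pipeline, backed by in-memory SQLite."""

    FIELDS = ("title", "url", "author", "category", "status", "bref")

    def open_spider(self, spider):
        # Create the connection and a table with the same columns as qqyuedu.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE qqyuedu (title TEXT, url TEXT, author TEXT, "
            "category TEXT, status TEXT, bref TEXT)"
        )

    def process_item(self, item, spider):
        # sqlite3 uses "?" placeholders where pymysql uses "%s".
        sql = "INSERT INTO qqyuedu VALUES (?, ?, ?, ?, ?, ?)"
        # Missing fields are stored as empty strings instead of raising.
        data = tuple(item.get(field, "") for field in self.FIELDS)
        try:
            self.db.execute(sql, data)
        except sqlite3.Error:
            self.db.rollback()
        else:
            self.db.commit()
        return item

    def close_spider(self, spider):
        self.db.close()


# Drive the pipeline by hand; the dict and URL are made-up sample data.
pipeline = SqlitePipelineSketch()
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "demo", "url": "https://example.com/1"}, spider=None)
rows = pipeline.db.execute("SELECT title, url, bref FROM qqyuedu").fetchall()
print(rows)
pipeline.close_spider(spider=None)
```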

Step 7: Write the main spider code, in the file under spiders

Two patterns are summarized here:

1. Use start_urls, with an offset to handle pagination

import scrapy

# Assumes DoubanItem is defined in this project's items.py
from ..items import DoubanItem


class Douban250Spider(scrapy.Spider):
    name = 'douban250'
    offset = 0
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250?start=0&filter=']

    def parse(self, response):
        li_list = response.css(".grid_view li")
        for li in li_list:
            # Create a fresh item per movie so yielded items do not share state
            item = DoubanItem()
            item["name"] = li.css(".info")[0].xpath(".//span[@class=\"title\"][1]/text()")[0].extract()
            item["info"] = "".join("".join(li.css(".info .bd")[0].xpath("./p//text()").extract()).split())
            item["score"] = float(li.css(".info .star")[0].xpath("./span[@class=\"rating_num\"]/text()")[0].extract())
            item["access"] = li.css(".info .star")[0].xpath("./span[4]/text()")[0].extract()
            # An entry may lack a quote; fall back to an empty string
            item["bref"] = li.css(".info .quote span.inq::text").get(default="")
            yield item

        # 250 entries at 25 per page: the last page starts at offset 225
        if self.offset < 225:
            self.offset += 25
            url = "https://movie.douban.com/top250?start=" + str(self.offset) + "&filter="
            yield scrapy.Request(url, callback=self.parse, dont_filter=True)
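
The offset arithmetic is easy to check on its own. The list has 250 entries at 25 per page, so the valid offsets run from 0 to 225; a request for start=250 would go one page past the end. A small sketch that enumerates the page URLs (the helper name page_urls is made up for illustration):

```python
PAGE_SIZE = 25
TOTAL = 250
BASE = "https://movie.douban.com/top250?start={}&filter="


def page_urls(total=TOTAL, page_size=PAGE_SIZE):
    """Return one URL per page; offsets run 0, 25, ..., total - page_size."""
    return [BASE.format(offset) for offset in range(0, total, page_size)]


urls = page_urls()
print(len(urls))   # number of pages
print(urls[0])     # first page
print(urls[-1])    # last page
```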

2. Override start_requests

import scrapy

from ..items import QidianItem


class QidianyueduSpider(scrapy.Spider):
    name = 'qidianyuedu'
    # Covers both www.qidian.com (listing pages) and book.qidian.com
    allowed_domains = ['qidian.com']

    def start_requests(self):
        # get_page_num is defined in the complete spider later in this post
        page_num = self.get_page_num()
        for i in range(1, page_num + 1):
            url = "https://www.qidian.com/all?orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=" + str(i)
            yield scrapy.Request(url, callback=self.parse,
                                 headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"})

    def parse(self, response):
        li_list = response.css(".book-img-text li")
        for li in li_list:
            item = QidianItem()
            item["title"] = li.css(".book-mid-info h4 a::text")[0].extract()
            # The href is protocol-relative, so prepend the scheme
            item["url"] = "https:" + li.css(".book-mid-info h4 a::attr(href)")[0].extract()
            item["author"] = li.css(".book-mid-info .author a")[0].xpath("./text()")[0].extract()
            # The links after the author link hold the category names
            category = ""
            a_list = li.css(".book-mid-info .author a")[1:]
            for a in a_list:
                a_text = a.css("a::text")[0].extract()
                category += a_text
                category += " "
            item["category"] = category.strip()
            item["status"] = li.css(".book-mid-info .author span::text")[0].extract()
            yield item
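
The spider builds its URLs by string concatenation. As an alternative sketch, the standard library can do the same job: urlencode builds the query string from named parameters, and urljoin resolves a protocol-relative href against the page URL (Scrapy's response.urljoin offers the same resolution inside a callback). The helper name list_page_url and the sample href are made up for illustration.

```python
from urllib.parse import urlencode, urljoin

LIST_URL = "https://www.qidian.com/all"


def list_page_url(page):
    """Build the listing URL for a page number from named query parameters."""
    params = {
        "orderId": "",
        "style": 1,
        "pageSize": 20,
        "siteid": 1,
        "pubflag": 0,
        "hiddenField": 0,
        "page": page,
    }
    return LIST_URL + "?" + urlencode(params)


print(list_page_url(3))

# urljoin fills in the scheme for a protocol-relative href.
# The book path below is a made-up example, not a real link.
href = "//book.qidian.com/info/123"
detail_url = urljoin(list_page_url(1), href)
print(detail_url)
```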

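Step 8: Parse the data. While working out the selectors, scrapy shell followed by the URL of the site you want to scrape is a useful tool. At the prompt, first run view(response) to check that the fetched page is the one you expect, then extract the data with css/xpath selectors.

Note: when the extracted text is long and you want to join the pieces together, stray whitespace gets in the way. One fix:

"".join("".join(li.css(".info .bd")[0].xpath("./p//text()").extract()).split())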
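
The one-liner joins all text fragments, splits the result on any whitespace, and joins it again with nothing in between. That works well for Chinese text, but it also glues space-separated words together. A small sketch with made-up sample fragments, showing the original behavior next to a variant that keeps single spaces (the function names squash and normalize are hypothetical):

```python
def squash(parts):
    """Join text fragments and drop all whitespace, as in the one-liner."""
    return "".join("".join(parts).split())


def normalize(parts):
    """Join text fragments and collapse each whitespace run to one space."""
    return " ".join("".join(parts).split())


# Made-up fragments, similar in shape to what extract() returns.
fragments = ["\n   Directed by: A  ", "\n   Starring: B C\n", "  1994 / Drama  "]
print(squash(fragments))
print(normalize(fragments))
```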

Once every field extracts correctly in the shell, move the expressions into the spider code (see Step 7).

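A deeper issue: one item spread across several pages

Sometimes a single page does not contain all the data an item needs, and the rest has to come from a deeper page. In that case the item has to be passed along with the request: send it with Request(url, meta={"meta": item}, callback=self.parse_detail), unpack it in the new callback with item = response.meta["meta"], keep filling it in, and finally yield it.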

import requests
import scrapy
from bs4 import BeautifulSoup

from ..items import QidianItem

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}


class QidianyueduSpider(scrapy.Spider):
    name = 'qidianyuedu'
    # Covers both www.qidian.com (listing pages) and book.qidian.com (detail pages)
    allowed_domains = ['qidian.com']

    def get_page_num(self):
        # Fetch the first listing page once to read the total number of books
        url = "https://www.qidian.com/all?orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=1"
        res = requests.get(url, headers=HEADERS)
        html = res.content.decode("utf-8")
        soup = BeautifulSoup(html, "lxml")
        num = int(soup.select(".count-text span")[0].get_text())
        # 20 books per page; round up so a final partial page is included
        return (num + 19) // 20

    def start_requests(self):
        page_num = self.get_page_num()
        for i in range(1, page_num + 1):
            url = "https://www.qidian.com/all?orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=" + str(i)
            yield scrapy.Request(url, callback=self.parse, headers=HEADERS)

    def parse(self, response):
        li_list = response.css(".book-img-text li")
        for li in li_list:
            item = QidianItem()
            item["title"] = li.css(".book-mid-info h4 a::text")[0].extract()
            item["url"] = "https:" + li.css(".book-mid-info h4 a::attr(href)")[0].extract()
            item["author"] = li.css(".book-mid-info .author a")[0].xpath("./text()")[0].extract()
            category = ""
            a_list = li.css(".book-mid-info .author a")[1:]
            for a in a_list:
                a_text = a.css("a::text")[0].extract()
                category += a_text
                category += " "
            item["category"] = category.strip()
            item["status"] = li.css(".book-mid-info .author span::text")[0].extract()
            # Pass the partly filled item on to the detail-page callback
            yield scrapy.Request(item["url"], meta={"meta": item},
                                 callback=self.parse_detail,
                                 headers=HEADERS)

    def parse_detail(self, response):
        # Unpack the item, add the description from the detail page, and yield it
        item = response.meta["meta"]
        item["bref"] = "".join("".join(response.css(".book-intro p")[0].xpath(".//text()").extract()).split())
        yield item
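
get_page_num has to turn the total number of books into a number of listing pages. With 20 books per page, any remainder means one extra page is needed, so the division has to round up. A minimal sketch of that arithmetic on its own (the function name page_count and the sample totals are made up for illustration):

```python
import math

PAGE_SIZE = 20  # matches pageSize=20 in the listing URL


def page_count(total_items, page_size=PAGE_SIZE):
    """Number of pages needed to list total_items, rounding up."""
    return math.ceil(total_items / page_size)


for total in (40, 41, 59, 60):
    print(total, page_count(total))
```

A total of 40 fits exactly on 2 pages, while 41 already needs a third.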


Reposted from www.cnblogs.com/waws1314/p/12444080.html