The previous post used Scrapy to crawl Elong hotel reviews; this time we use XPath to crawl 爱问知识人 (iask, Sina's Q&A site) and import the results into a database.
What we need are the questions, their answers, and the corresponding URLs.
This is what an iask page looks like:
1. Define items.py
```python
import scrapy


class SwpItem(scrapy.Item):
    title = scrapy.Field()    # question title
    content = scrapy.Field()  # answer text
    url = scrapy.Field()      # question page URL
```

2. Define the spider
Now XPath comes into play. Here is a very good tutorial by Cui Qingcai: http://cuiqingcai.com/2621.html
XPath syntax:

| Expression | Description |
|---|---|
| nodename | Selects all child nodes of the named node. |
| / | Selects from the root node. |
| // | Selects nodes in the document that match the selection, no matter where they are located. |
| . | Selects the current node. |
| .. | Selects the parent of the current node. |
| @ | Selects attributes. |
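The expressions in the table can be tried out without a live crawl. A minimal sketch using the standard library's `xml.etree.ElementTree` (which supports only a subset of XPath, so attribute values are read with `.get()` rather than `@href`) on a toy fragment shaped like the iask list page; the sample HTML and hrefs here are made up for illustration:

```python
# Demonstrating the XPath ideas from the table above on a tiny document
# resembling the iask list page. ElementTree's XPath subset supports
# './/' (descendants at any depth) and [@attr='value'] predicates.
import xml.etree.ElementTree as ET

html = """
<div class="list-body-con">
  <ul>
    <li><div class="question-title"><a href="/b/1.html">Q1</a></div></li>
    <li><div class="question-title"><a href="/b/2.html">Q2</a></div></li>
  </ul>
</div>
"""
root = ET.fromstring(html)
# './/' plays the role of '//': search all descendants regardless of depth
links = root.findall(".//div[@class='question-title']/a")
# '@href' in full XPath; ElementTree reads attributes with .get()
hrefs = [a.get('href') for a in links]
print(hrefs)  # → ['/b/1.html', '/b/2.html']
```

Scrapy's own `Selector.xpath()` accepts the full XPath syntax from the table, including `@href` at the end of the expression, as used in the spider below.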
Here is the code:
```python
# -*- coding:utf-8 -*-
import sys

import scrapy
from scrapy.http import Request
from scrapy.selector import Selector

from swp.items import SwpItem

# Python 2 workaround to avoid UnicodeEncodeError when printing Chinese text
reload(sys)
sys.setdefaultencoding("utf-8")


class tto(scrapy.Spider):
    name = "yilong"
    start_urls = ['https://iask.sina.com.cn/c/167.html']

    def parse(self, response):
        selector = Selector(response)
        # relative URLs of the question pages on the current list page
        urls = selector.xpath(
            '//div[@class="list-body-con current"]/ul/li/div'
            '/div[@class="question-title"]/a/@href').extract()
        for url in urls:
            yield Request('https://iask.sina.com.cn' + url,
                          callback=self.parseContent)

        # follow the "next page" (下一页) link if there is one
        page_links = selector.xpath('//div[@class="page mt30"]/a').extract()
        for link in page_links:
            if u'下一页' in link:
                next_links = selector.xpath(
                    '//div[@class="page mt30"]/a[@class="btn-page"]/@href'
                ).extract()
                for next_link in next_links:
                    yield Request('https://iask.sina.com.cn' + next_link,
                                  callback=self.parse)

    def parseContent(self, response):
        selector = Selector(response)
        # create a fresh item per response; a single shared item stored on
        # self would be overwritten by concurrently parsed pages
        item = SwpItem()
        item['title'] = selector.xpath(
            '//div[@class="question_text"]/pre/text()').extract()[0]
        item['content'] = selector.xpath(
            '//div[@class="answer_text"]/div/span/pre/text()').extract()
        item['url'] = response.url
        yield item
```

Compared with crawling dynamic pages, a static page only requires inspecting the page source.
Just locate the tags in the source that hold the question, the answers, and the URL.
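One small detail worth noting: the spider builds absolute URLs by string concatenation. The standard library's `urljoin` does the same job more robustly (it also handles relative hrefs correctly). A sketch with made-up example hrefs:

```python
# Turning relative hrefs from the list page into absolute URLs.
# On Python 2 (which the spider above targets) the import would be
# `from urlparse import urljoin`; on Python 3 it lives in urllib.parse.
from urllib.parse import urljoin

base = 'https://iask.sina.com.cn/c/167.html'
hrefs = ['/b/1.html', '/b/2.html']  # hypothetical extracted hrefs
absolute = [urljoin(base, h) for h in hrefs]
print(absolute)
```

Scrapy later added `response.urljoin()` and `response.follow()` for exactly this purpose, depending on the version you are running.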
3. Import the crawl results into the database
```python
from twisted.enterprise import adbapi


class SwpPipeline(object):
    def __init__(self):
        dbargs = dict(
            host='127.0.0.1',
            port=3306,
            user='root',
            passwd='123',
            db='zhxy',
            charset='utf8',
        )
        # asynchronous MySQL connection pool managed by Twisted
        self.dbpool = adbapi.ConnectionPool('MySQLdb', **dbargs)

    def process_item(self, item, spider):
        # run the insert in a pool thread so the reactor is not blocked
        self.dbpool.runInteraction(self.insert_into_table, item)
        return item

    def insert_into_table(self, tx, item):
        # parameterized query: let the driver quote and escape the values
        tx.execute(
            'insert into aiask(title, content, url) values (%s, %s, %s)',
            (item['title'], item['content'], item['url']))
```

Below is the result of the database import:
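The pipeline relies on DB-API parameterized queries: the values are passed as a separate tuple, never interpolated into the SQL string, which avoids quoting bugs and SQL injection. The same pattern can be tried without a MySQL server using the standard library's `sqlite3` module (its placeholder is `?` instead of MySQLdb's `%s`); the table mirrors the `aiask` table above, and the sample item is made up:

```python
# Parameterized insert, as in the pipeline's insert_into_table, but
# against an in-memory SQLite database for easy experimentation.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE aiask (title TEXT, content TEXT, url TEXT)')

item = {'title': 'q', 'content': 'a',
        'url': 'https://iask.sina.com.cn/b/x.html'}  # hypothetical item
# placeholders + a parameter tuple: the driver does the quoting/escaping
conn.execute('INSERT INTO aiask(title, content, url) VALUES (?, ?, ?)',
             (item['title'], item['content'], item['url']))

rows = conn.execute('SELECT title, url FROM aiask').fetchall()
print(rows)
```

Remember that a pipeline only runs if it is enabled via `ITEM_PIPELINES` in the project's settings.py.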
I suddenly realize I'm not very good at writing blog posts yet, so please bear with me.