Scraping iask (爱问知识人) with Python's Scrapy framework and XPath

The previous post used Scrapy to scrape Elong hotel reviews; this time we'll use XPath to scrape iask (爱问知识人) and load the results into a database.

What I need are the questions, the corresponding answers, and their URLs.

This is what iask looks like:


1. Define items.py

import scrapy

class SwpItem(scrapy.Item):
    title = scrapy.Field()    # question title
    content = scrapy.Field()  # answer text
    url = scrapy.Field()      # page URL
2. Define the spider

This is where XPath comes in. Cui Qingcai's tutorial is a very good reference: http://cuiqingcai.com/2621.html

XPath syntax:

Expression   Description
nodename     Selects all child nodes of the named node.
/            Selects from the root node.
//           Selects matching nodes anywhere in the document, regardless of their position.
.            Selects the current node.
..           Selects the parent of the current node.
@            Selects attributes.
Anyone who has used BS4 (BeautifulSoup) should already know how to pick out tags like this.
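To get a feel for these expressions outside of Scrapy, here is a minimal sketch (the markup snippet is hypothetical) using only the standard library's xml.etree.ElementTree, which supports a limited XPath subset: paths start with .// instead of //, and the @href / text() parts are read off the matched elements instead:

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for the listing page's markup (hypothetical, not the real page)
html = """<div>
  <div class="question-title"><a href="/b/1.html">Question 1</a></div>
  <div class="question-title"><a href="/b/2.html">Question 2</a></div>
</div>"""

root = ET.fromstring(html)
# //div[@class="question-title"]/a becomes .//div[@class="question-title"]/a
links = root.findall('.//div[@class="question-title"]/a')
hrefs = [a.get('href') for a in links]  # the @href part
titles = [a.text for a in links]        # the text() part
print(hrefs)   # ['/b/1.html', '/b/2.html']
print(titles)  # ['Question 1', 'Question 2']
```

For full XPath (text(), //, axes) you need lxml or Scrapy's own selectors, but the structure of the expressions is the same.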

Here's the code:

# -*- coding:utf-8 -*-
import scrapy
from scrapy.http import Request
from swp.items import SwpItem


class tto(scrapy.Spider):
    name = "yilong"
    start_urls = ['https://iask.sina.com.cn/c/167.html']

    def parse(self, response):
        # Collect the URL of every question on the listing page
        urls = response.xpath('//div[@class="list-body-con current"]/ul/li/div/div[@class="question-title"]/a/@href').extract()
        for url in urls:
            yield Request('https://iask.sina.com.cn' + url, callback=self.parseContent)

        # Follow the "next page" (下一页) link, if there is one
        page_links = response.xpath('//div[@class="page mt30"]/a').extract()
        for link in page_links:
            if u'下一页' in link:
                for next_link in response.xpath('//div[@class="page mt30"]/a[@class="btn-page"]/@href').extract():
                    yield Request('https://iask.sina.com.cn' + next_link, callback=self.parse)

    def parseContent(self, response):
        # Create a fresh item per question page; sharing one item across
        # requests would let concurrent responses overwrite each other
        item = SwpItem()
        titles = response.xpath('//div[@class="question_text"]/pre/text()').extract()
        item['title'] = titles[0] if titles else ''
        # Join the answer fragments into one string so it can go into a text column
        item['content'] = ''.join(response.xpath('//div[@class="answer_text"]/div/span/pre/text()').extract())
        item['url'] = response.url
        yield item
Compared with dynamic pages, scraping a static page only requires a look at the page source.

Just find the tags that hold the question, the answers, and the URL in the source.
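The spider above builds absolute URLs by string concatenation. As a side note, urllib.parse.urljoin from the standard library does the same job and also handles already-absolute hrefs correctly; a small sketch (the paths are made up for illustration):

```python
from urllib.parse import urljoin

base = 'https://iask.sina.com.cn/c/167.html'

# Site-root-relative href, like those scraped from the listing page
full = urljoin(base, '/b/123.html')
print(full)  # https://iask.sina.com.cn/b/123.html

# An already-absolute href passes through unchanged
absolute = urljoin(base, 'https://iask.sina.com.cn/c/168.html')
print(absolute)  # https://iask.sina.com.cn/c/168.html
```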

3. Load the scraped results into the database

from twisted.enterprise import adbapi


class SwpPipeline(object):
    def __init__(self):
        dbargs = dict(
            host='127.0.0.1',
            port=3306,
            user='root',
            passwd='123',
            db='zhxy',
            charset='utf8',
        )
        self.dbpool = adbapi.ConnectionPool('MySQLdb', **dbargs)

    def process_item(self, item, spider):
        # Run the insert on a thread from Twisted's connection pool
        self.dbpool.runInteraction(self.insert_into_table, item)
        return item

    def insert_into_table(self, conn, item):
        # Leave the %s placeholders unquoted so the driver escapes the values
        conn.execute(
            'insert into aiask(title, content, url) values (%s, %s, %s)',
            (item['title'], item['content'], item['url']))
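The key detail in insert_into_table is letting the database driver fill the %s placeholders instead of formatting the SQL string yourself: that escapes any quotes in the scraped text and prevents SQL injection. A quick sketch of the same idea using the standard library's sqlite3 (which uses ? placeholders where MySQLdb uses %s):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('create table aiask (title text, content text, url text)')

# A scraped title containing quotes would break naive string formatting
item = {'title': 'What is "XPath"?',
        'content': 'A query language for XML/HTML.',
        'url': 'https://iask.sina.com.cn/b/1.html'}
conn.execute('insert into aiask(title, content, url) values (?, ?, ?)',
             (item['title'], item['content'], item['url']))

row = conn.execute('select title from aiask').fetchone()
print(row[0])  # What is "XPath"?
```

Also remember that a Scrapy pipeline only runs if it is registered under ITEM_PIPELINES in settings.py.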
Below are the results loaded into the database:



I suddenly realize I'm not much of a blogger — bear with me, everyone.


Reposted from blog.csdn.net/qq_40024605/article/details/78628610