Python Web Crawling-6

XPath Basics

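# XPath expressions
'''
XPath vs. regular expressions:
1. XPath is more efficient for navigating HTML structure
2. Regular expressions are more powerful (they can match arbitrary text patterns)
3. Prefer XPath in general; fall back to regular expressions for what XPath cannot handle

# A brief quick-start; a more complete version will be added later
/ select level by level, one step per tag
text() extract the text inside a tag
//A select all tags named A
//A[@B='b'] select the A tags whose attribute B has the value b
@B take the value of attribute B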

<html>
<head><title>I am the title</title></head>
<body>
<div class='tools'>
    <div class="newhead">
    <ul class="newhead_oprate">
        <li>
            <a target="_blank">I am the content</a>
        </li>
    </ul>
    </div>
</div>
<div><div class="newhead"></div></div>
</body>
</html>

Examples:
Extract the title: /html/head/title/text() -> I am the title
Select all div tags: //div
Select the <div class='tools'> element: //div[@class='tools']

Extract "I am the content": //ul[@class='newhead_oprate']/li/a/text()
'''
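
The expressions above can be tried without installing anything. The sketch below is not Scrapy's selector API: it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, so text() and @attribute are replaced by .text and .get(), and paths are written relative to the root element. The PAGE string is a well-formed copy of the sample page with its text in English.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the sample page (ElementTree needs valid XML).
PAGE = """<html>
<head><title>I am the title</title></head>
<body>
<div class='tools'>
    <div class="newhead">
    <ul class="newhead_oprate">
        <li>
            <a target="_blank">I am the content</a>
        </li>
    </ul>
    </div>
</div>
<div><div class="newhead"></div></div>
</body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# /html/head/title/text(): walk level by level, then read .text
title = root.find("head/title").text

# //div: every div at any depth
all_divs = root.findall(".//div")

# //div[@class='tools']: filter by attribute value
tools_divs = root.findall(".//div[@class='tools']")

# //ul[@class='newhead_oprate']/li/a/text()
content = root.find(".//ul[@class='newhead_oprate']/li/a").text

# @target: read an attribute with .get()
target = root.find(".//a").get("target")

print(title, len(all_divs), len(tools_divs), content, target)
```

Inside a Scrapy spider the full expressions, including text() and @attribute, can be passed to response.xpath() directly, as the spider later in this post does.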

A Simple Example with the Scrapy Crawling Framework

I. Create the project

scrapy startproject dangdang

II. Create the spider

scrapy genspider dd "dangdang.com"

III. Write the code
1. Writing the items file
items.py defines the fields to store: everything the spider scrapes is held in an Item class.

import scrapy

class DangdangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    comment = scrapy.Field()

2. Writing the spider file

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from ..items import DangdangItem

class DdSpider(scrapy.Spider):
    name = 'dd'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://search.dangdang.com/?key=%CE%C0%D2%C2&category_id=10010336&page_index=1']

    def parse(self, response):
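        # Create the item container
        item = DangdangItem()
        # Extract the fields; each one is a list with one entry per product on the page
        item["title"] = response.xpath("//a[@name='itemlist-title']/@title").extract()
        item["link"] = response.xpath("//a[@name='itemlist-title']/@href").extract()
        item["comment"] = response.xpath("//a[@name='itemlist-review']/text()").extract()
        #print(item["title"])
        # Hand the item to the pipeline. Pipelines are disabled by default;
        # uncomment ITEM_PIPELINES in the settings file to enable them.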
        yield item
        for i in range(2,6):
            url = 'http://search.dangdang.com/?key=%CE%C0%D2%C2&category_id=10010336&page_index='+str(i)
            yield Request(url, callback=self.parse)
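
Note that parse schedules pages 2 to 5 again for every page it handles; Scrapy's built-in duplicate filter drops the repeated requests, so the spider still works. As an alternative, the page URLs can be built once up front. A minimal sketch, where build_page_urls and BASE_URL are hypothetical names introduced for illustration:

```python
# Search URL from the spider above, with the page number left as a placeholder.
BASE_URL = (
    "http://search.dangdang.com/"
    "?key=%CE%C0%D2%C2&category_id=10010336&page_index={page}"
)


def build_page_urls(first_page=1, last_page=5):
    """Return the search-result URLs for pages first_page..last_page inclusive."""
    return [BASE_URL.format(page=page) for page in range(first_page, last_page + 1)]


page_urls = build_page_urls()
for url in page_urls:
    print(url)
```

The resulting list could be assigned to start_urls, in which case parse would only need to extract the fields and yield the item.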

3. Writing the pipeline file

# -*- coding: utf-8 -*-
import pymysql
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

class DangdangPipeline(object):
    def process_item(self, item, spider):
        conn = pymysql.connect(host="127.0.0.1", user="root", passwd="your_database_password", db="dd")#, charset="utf8"
        cursor = conn.cursor()
        # Parameterized statement: the driver escapes the values,
        # so a quote inside a title cannot break the SQL
        sql = "insert into goods(title,link,comment) values (%s, %s, %s)"
        # The item holds three parallel lists; zip stops at the shortest one
        for title, link, comment in zip(item["title"], item["link"], item["comment"]):
            #print(title+" : "+link+" : "+comment)
            try:
                cursor.execute(sql, (title, link, comment))
                conn.commit()
            except Exception as err:
                print(err)
        conn.close()
        return item
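
The pipeline above opens and closes a database connection for every item. Scrapy pipelines may also define the optional open_spider and close_spider hooks, which allow one connection to be shared for the whole crawl. The sketch below shows that flow under loud assumptions: it uses the standard library's sqlite3 with an in-memory database as a stand-in for MySQL so that it runs anywhere, SqliteGoodsPipeline is a hypothetical class name, and the sample row is made up. sqlite3 uses "?" placeholders where pymysql uses "%s".

```python
import sqlite3


class SqliteGoodsPipeline:
    """Hypothetical pipeline: same flow as DangdangPipeline, backed by sqlite3."""

    def open_spider(self, spider):
        # Called once when the spider starts: open a single shared connection.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE goods (title TEXT, link TEXT, comment TEXT)")

    def process_item(self, item, spider):
        # The item holds three parallel lists; zip turns them into rows.
        rows = zip(item["title"], item["link"], item["comment"])
        # Parameterized insert: values are passed separately from the SQL text.
        self.conn.executemany(
            "INSERT INTO goods (title, link, comment) VALUES (?, ?, ?)", rows
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()


# Drive the pipeline by hand with a made-up item; spider is unused here.
pipeline = SqliteGoodsPipeline()
pipeline.open_spider(spider=None)
pipeline.process_item(
    {
        "title": ["Kid's hoodie"],  # the quote would break a concatenated SQL string
        "link": ["http://example.com/1"],
        "comment": ["12 reviews"],
    },
    spider=None,
)
stored = pipeline.conn.execute("SELECT title, link, comment FROM goods").fetchall()
pipeline.close_spider(spider=None)
print(stored)
```

The title containing a single quote is stored intact, which is the reason to prefer a parameterized statement over building the SQL by string concatenation.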


Modifying the settings file
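
As the comment in the spider notes, pipelines are disabled by default. A minimal sketch of the change in settings.py, assuming the default layout generated by scrapy startproject, with the pipeline class in dangdang/pipelines.py. The value 300 is the priority used in the generated template; by convention these values fall in the 0 to 1000 range, and pipelines with lower values run first.

```python
# settings.py (fragment): enable the pipeline by uncommenting this setting.
ITEM_PIPELINES = {
    "dangdang.pipelines.DangdangPipeline": 300,
}
```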

IV. Test
Run the spider from the project root:

scrapy crawl dd

牧阳MuYoung
