A general crawler framework based on Scrapy and XSLT

Brief description of the framework

This framework is built on top of the Scrapy framework and XSLT. It matches content on a web page through an XML rule file written with XSLT and XPath syntax, and then stores the matched fields and content persistently.
The crawl depth and the links to the next level can both be defined in the XML rule file, and content matched at one depth can be passed along to pages at other depths.
When writing to the database, the framework automatically creates a table based on the fields and content matched from the page and then inserts the data.

XML rule file example

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >
    <xsl:template match="/">
        <xsl:for-each select="//ul[@class='nei03_04_08_01']/li">
            <!-- links is a fixed framework field: it always holds the links to the next depth level; depth is the depth of the current page -->
            <links depth="1">
                <xsl:value-of select="em/a/@href"/>
            </links>
        </xsl:for-each>

        <xsl:for-each select="//ul[@class='nei03_04_08_01']/li">
            <!-- type is a user-defined field to collect; depth is the depth of the current page -->
            <type depth="1">
                <xsl:value-of select="string(span/tt/a)"/>
            </type>
        </xsl:for-each>

        <xsl:for-each select="//ul[@class='nei03_04_08_01']/li">
            <!-- release_time is a user-defined field to collect -->
            <release_time depth="1">
                <xsl:value-of select="string(i)"/>
            </release_time>
        </xsl:for-each>


        <xsl:for-each select="//ul[@class='nei03_04_08_01']/li">
            <city depth="1">
                <xsl:value-of select="string(span/b/a)"/>
            </city>
        </xsl:for-each>

        <project_name depth="2">
            <xsl:value-of select="//div[@class='title']"/>
        </project_name>

        <full_text depth="2">
            <xsl:value-of select="translate(string(//div[@class='nei03_02']),'&#13;','')"/>
        </full_text>

        <money depth="2">
            <xsl:value-of select="//td[@class='tb01']/following-sibling::*[1]"/>
        </money>

    </xsl:template>

</xsl:stylesheet>

In the XML example above, links is the only fixed framework field; all of the other fields are user-defined.
When the program runs, the page content is automatically matched to each field according to the XPath expressions contained in the XML file.
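
As a quick check, a rule file can be tested offline before it is wired into the crawler. The snippet below is a minimal sketch, assuming lxml is installed, the rule file is saved as rules.xsl, and the HTML is a hypothetical stand-in for a real listing page:

from lxml import etree

# Hypothetical listing-page markup shaped like the XPath in the rule file
html = """
<html><body>
  <ul class="nei03_04_08_01">
    <li><em><a href="/gg/1.html">detail</a></em>
        <span><tt><a>采购公告</a></tt><b><a>南京</a></b></span>
        <i>2018-06-09</i></li>
  </ul>
</body></html>
"""

doc = etree.HTML(html)              # parse the page
xslt = etree.parse("rules.xsl")     # load the XML rule file
transform = etree.XSLT(xslt)        # build the XSLT transformer
print(str(transform(doc)))          # prints the matched fields as XML text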

Content analysis class

<?xml version="1.0"?>
<type depth="2">采购公告</type>
<province depth="2">江苏</province>
<city depth="2">南京</city>
<collection_url depth="2">http://www.ccgp-jiangsu.gov.cn/</collection_url>

After a web page has been processed by the XML rule file, matching produces an XML-formatted text like the one above: the tags defined in the rule file now contain the page content matched to each of them.
XML text is not convenient for the program to process directly, so a class is needed to parse the matched XML and convert it into a list or dictionary that Python can handle.

The following is the content analysis source code

import bs4
from bs4 import BeautifulSoup as bs


class ParseXml:
    """
    This class parses XML-formatted content and extracts the corresponding values from it.
    """
    def __init__(self, xml_content):

        """
        Initialize BeautifulSoup from the XML content.
        :param xml_content: the XML content
        :arg soup
        :arg depth_tag_dict: dict mapping every tag in the XML to its depth
        :arg all_tags: all tag names in the XML
        """
        self.soup = bs(xml_content, 'lxml')
        self.depth_tag_dict = self.get_depth_tag_dict()
        self.all_tags = self.get_all_tags()

    def get_item(self, item_name, isEnd=False):
        """
        Look up the content of the tag whose name is item_name.
        :param item_name: tag name
        :param isEnd: whether the end flag is set, defaults to False
        :return: if the tag has content and isEnd=False, return the matched list
                 if the tag has content and isEnd=True, return the matched string
                 if the tag has no content, return None
                 if the tag does not exist, return None
        """
        tags = self.soup.find(item_name)  # look up the tag by name
        if tags:  # the tag exists
            if tags.string:  # the tag has text content
                if isEnd:  # end flag set, so return a string
                    return str(tags.string)
                else:
                    content_list = []
                    for one in self.soup.find_all(item_name):
                        content_list.append(one.string)
                    return content_list
            else:
                # print("此标签名没有内容:",item_name)
                return None
        else:  # the tag does not exist, return None
            # print("Tag not found:", item_name)
            return None

    def get_depth_all(self, depth, isEnd=False):
        """
        Get all content at the given depth.
        :param depth: depth level, as an int
        :param isEnd: whether this is the last depth level
        :return: a dict mapping tag names to their content at this depth
        """
        depth_content_dict = {}
        tag_list = []
        for tag_k, depth_v in self.depth_tag_dict.items():
            if depth_v == str(depth):  # same depth, so collect the corresponding tag name
                tag_list.append(tag_k)
        tag_set = set(tag_list)
        for one_tag in tag_set:
            tag_content = self.get_item(one_tag, isEnd)  # get the content of this tag
            if tag_content:
                depth_content_dict[one_tag] = tag_content
        return depth_content_dict

    def get_depth_tag_dict(self):
        """
        Build a dict mapping every tag in the XML to its depth.
        :return: dict mapping tag names to their depth attributes
        """
        depth_tag_dict = {}
        for on in self.soup.body.contents:
            if isinstance(on, bs4.element.Tag):
                depth = on.attrs.get('depth')
                depth_tag_dict[on.name] = depth
        return depth_tag_dict

    def get_all_tags(self):
        """
        Get all tag names in the XML.
        :return: all tag names in the XML, as a list
        """
        tag_list = []
        for o in self.soup.body.contents:
            if isinstance(o, bs4.element.Tag):
                tag_list.append(o.name)
        return list(set(tag_list))
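
As an illustration (a usage sketch, not part of the original project), the class can be fed the matched XML shown earlier:

xml_text = '''
<type depth="2">采购公告</type>
<province depth="2">江苏</province>
<city depth="2">南京</city>
'''

parser = ParseXml(xml_text)
print(parser.all_tags)                       # e.g. ['type', 'province', 'city'] (order may vary)
print(parser.get_item('city'))               # list form: ['南京']
print(parser.get_item('city', isEnd=True))   # string form: '南京'
print(parser.get_depth_all(2, isEnd=True))   # {'type': '采购公告', 'province': '江苏', 'city': '南京'} (order may vary)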

Data persistence class

This class handles persistent storage of the crawled data. Currently only writing to a MySQL database is implemented; users can extend it to suit their own needs.
The MySQL storage class automatically builds tables and generates INSERT statements based on the crawled fields and their content.

The following is the MySQL storage class source code

import pymysql
from scrapy.conf import settings


class DataToMysql:
    def __init__(self, host, user, passwd, db, port):
        try:
            self.conn = pymysql.connect(host=host, user=user, passwd=passwd, db=db,
                                        port=port, charset='utf8')  # connect to the database
            self.cursor = self.conn.cursor()
        except pymysql.Error as e:
            print("数据库连接信息报错")
            raise e

    def write(self, table_name, info_dict):
        """
        Automatically generate CREATE TABLE and INSERT statements from table_name and info_dict.
        :param table_name: name of the table to write the data into
        :param info_dict: the content to write, as a dict
        :return:
        """
        sql_key = ''  # column names
        sql_value = ''  # column values
        for key in info_dict.keys():  # build the pieces of the INSERT statement
            sql_value = (sql_value + '"' + pymysql.escape_string(info_dict[key]) + '"' + ',')
            sql_key = sql_key + ' ' + key + ','

        try:
            self.cursor.execute(
                "INSERT INTO %s (%s) VALUES (%s)" % (table_name, sql_key[:-1], sql_value[:-1]))
            self.conn.commit()  # commit the current transaction
        except pymysql.Error as e:

            if e.args[0] == 1146:  # table does not exist: build a CREATE TABLE statement and create it
                sql_key_str = ''  # used for the CREATE TABLE statement
                columnStyle = ' text'  # column type for every field
                for key in info_dict.keys():
                    sql_key_str = sql_key_str + ' ' + key + columnStyle + ','
                self.cursor.execute("CREATE TABLE %s (%s)" % (table_name, sql_key_str[:-1]))
                self.cursor.execute("INSERT INTO %s (%s) VALUES (%s)" % 
                                    (table_name, sql_key[:-1], sql_value[:-1]))
                self.conn.commit()  # commit the current transaction
            else:
                raise
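
A usage sketch (the connection values and table name are placeholders): on the first insert the table is created automatically, with one text column per key of the dictionary.

conn = DataToMysql('127.0.0.1', 'root', 'password', 'spider_db', 3306)
conn.write('zhaobiao', {
    'type': '采购公告',
    'city': '南京',
    'release_time': '2018-06-09',
})
# If the table 'zhaobiao' does not exist yet, pymysql error 1146 triggers a
# CREATE TABLE with a text column for each key, and the row is then inserted.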

Logic code

The following is the crawling logic source code of the entire crawler program

import random
import time

import scrapy
from lxml import etree
from scrapy import Request
from scrapy.conf import settings
from jiangsu.database import DataToMysql
from guize.parse_xml import ParseXml
from urllib.parse import urljoin
from jiangsu.urlproduce import UrlProduce
from bs4 import BeautifulSoup as bs


def html_to_xml(html, xslt):
    """
    Filter the page through the XSLT rule file.
    :param html: page source, already decoded
    :param xslt: path to the XSLT file
    :return: the transformed XML content
    """
    html = etree.HTML(html)
    xslt = etree.XML(open(xslt, 'rb').read())
    translate = etree.XSLT(xslt)
    result = translate(html)
    return str(result)


URLPRODUCE = UrlProduce()


class ZhaotoubiaoSpider(scrapy.Spider):
    """
    This class implements the spider's crawl logic.
    """
    name = 'mySpider'
    xslt = settings['GUIZE']  # rule file path
    start_urls = URLPRODUCE.get_hainan_url()  # list of pages to crawl
    table = settings['TABLE']  # name of the table to store into
    num = 0  # counter of items collected so far

    # Initialize a DataToMysql instance, used to write the crawled content to the MySQL database
    mysql_conn = DataToMysql(settings['HOST'], settings['USER'],
                             settings['PASSWD'], settings['DB'], settings['PORT'])

    def start_requests(self):
        """
        This method is for testing only: it uses a few pages to check whether the rule file works.
        :return:
        """
        url = "http://www.ccgp-jiangsu.gov.cn/cgxx/cggg/nanjing/index_10.html"
        # url = "http://www.ccgp-hainan.gov.cn/cgw/cgw_list_gglx.jsp?currentPage=1"
        yield Request(url, callback=self.parse)

    def parse(self, response):
        meta = response.meta  # get the meta passed from the previous page
        this_depth = meta.get('mydepth')  # get the depth of the current page
        if not this_depth:  # if no depth is set, default to 1
            this_depth = 1
        html = response.body  # get the page source
        try:
            html = html.decode('utf-8')
        except:
            pass
        result_xml = html_to_xml(html, self.xslt)  # filter the page content through the XSLT file
        parse = ParseXml(result_xml)  # parse the filtered XML content
        links = parse.get_item('links')  # get the links tags of the current page (links holds the next-level URLs)

        if isinstance(links, list):  # links exists as a list, so the current page has next-level pages to crawl
            depth_content_dict = parse.get_depth_all(this_depth)  # get all content at the current depth
            try:
                depth_content_dict.pop('links')  # drop the links entry
            except:
                pass
            next_depth = this_depth + 1  # next depth level
            next_key = list(depth_content_dict.keys())
            len_links = len(links)  # number of links at this depth
            for num in range(len(links)):
                all_depth_meta = meta.get('all_depth_meta')  # get the content passed from the previous depth
                if not all_depth_meta:
                    all_depth_meta = {}
                all_depth_meta[this_depth] = {}  # create an entry for the current depth to store what it collects
                next_meta = None
                if len(depth_content_dict) > 0:  # skip if there is nothing to collect at this depth besides links
                    next_meta = {}
                    for k in next_key:
                        if len_links == len(depth_content_dict.get(k)):
                            next_meta[k] = depth_content_dict.get(k)[num]
                        else:
                            next_meta[k] = depth_content_dict.get(k)
                next_url = urljoin(response.url, links[num])  # join the current URL with the next-level link to form the full URL
                all_depth_meta[this_depth][num] = next_meta
                yield Request(next_url,
                              meta={"all_depth_meta": all_depth_meta,
                                    "mydepth": next_depth,
                                    "num": num},
                              callback=self.parse)
        else:  # no next-level links, so treat this as the last depth level
            soup = bs(html, 'lxml')
            all_depth_meta = meta.get('all_depth_meta')  # get all meta data passed from earlier depths
            num = meta.get("num")  # get the index of this item
            last_depth_dict = all_depth_meta.get(this_depth - 1).get(num)  # get the content passed from the previous depth
            end_depth_dict = parse.get_depth_all(this_depth, isEnd=True)  # get all tag content at the current depth
            if last_depth_dict:  # merge the previous depth's content with the current depth's content
                for k, v in last_depth_dict.items():
                    if v:
                        end_depth_dict[k] = v
            self.mysql_conn.write(self.table, end_depth_dict)  # write to the database
            self.num = self.num + 1  # increment the collection counter and print it
            print(self.num)
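
The spider reads its rule file path, target table, and database credentials from the Scrapy settings (GUIZE, TABLE, HOST, USER, PASSWD, DB, PORT). A minimal settings.py sketch with placeholder values might look like this:

# settings.py (placeholder values)
GUIZE = 'guize/jiangsu.xsl'   # path to the XML rule file (hypothetical filename)
TABLE = 'zhaobiao'            # MySQL table to write the crawled data into
HOST = '127.0.0.1'
USER = 'root'
PASSWD = 'password'
DB = 'spider_db'
PORT = 3306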

Limitations and supplements

  • Currently only static web pages can be crawled; Selenium will be added later to support crawling dynamic pages
  • The anti-anti-crawling measures are not strong enough; a proxy pool, a cookie pool, and a browser header generation class will be added later
  • There is no interactive page yet; a front-end demo will be written later to interact with the crawler backend
  • The crawling logic needs to be improved and optimized

Origin blog.csdn.net/mrliqifeng/article/details/80638974