Scrapy notes - saving to a database

Scrapy is written in Python. If you are not familiar with this language, please learn the basics first.

Create Scrapy project

Execute the following command in any directory you like

scrapy startproject coolscrapy


The coolscrapy folder will be created, and its directory structure is as follows:

coolscrapy/
    scrapy.cfg            # deployment configuration file

    coolscrapy/           # the project's Python module; all your code goes in here
        __init__.py

        items.py          # Item definition file

        pipelines.py      # pipeline definition file

        settings.py       # project settings file

        spiders/          # all spiders go in this folder
            __init__.py
            ...

Define our Item

By creating a scrapy.Item subclass and defining its attributes as scrapy.Field objects, we are ready to crawl the title, link, and summary from the Huxiu (Tiger Sniffing) news list.

import scrapy

class HuxiuItem(scrapy.Item):
    title = scrapy.Field()    # title
    link = scrapy.Field()     # link
    desc = scrapy.Field()     # summary
    posttime = scrapy.Field() # publish time


Defining an Item may seem like a bit of extra work, but it brings many benefits: once it is defined you can use other useful components and helper classes in Scrapy.
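One of those helpers is ItemLoader, which separates the extraction rules from the item itself. Below is a minimal sketch, not from the original article, of how it could be used with HuxiuItem; the XPath expressions are the same ones used in the spider later in this article, and the class name HuxiuLoader is just an example.

from scrapy.loader import ItemLoader
# Scrapy >= 2.2 ships the processors in the itemloaders package;
# older releases expose them as scrapy.loader.processors instead.
from itemloaders.processors import TakeFirst

from coolscrapy.items import HuxiuItem


class HuxiuLoader(ItemLoader):
    """Fills a HuxiuItem and keeps only the first match for each field."""
    default_item_class = HuxiuItem
    default_output_processor = TakeFirst()


# This method belongs inside a spider class:
def parse(self, response):
    for sel in response.xpath('//div[@class="mod-info-flow"]/div/div[@class="mob-ctt"]'):
        loader = HuxiuLoader(selector=sel)
        loader.add_xpath('title', 'h3/a/text()')
        loader.add_xpath('link', 'h3/a/@href')
        loader.add_xpath('desc', 'div[@class="mob-sub"]/text()')
        yield loader.load_item()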

The first Spider

A spider is a class you define that Scrapy uses to crawl information from a domain (or a group of domains). In the spider class you define an initial list of URLs to download, how to follow links, and how to parse page content to extract Items.

To define a Spider, just extend the scrapy.Spider class and set a few attributes:

  • name: the spider's name, which must be unique
  • start_urls: the initial list of URLs to download
  • parse(): parses the downloaded Response object, which is also the method's only argument. It is responsible for parsing the returned page data and extracting Items (returned as Item objects) as well as further URLs to follow (returned as Request objects).

We create a new file huxiu_spider.py under the coolscrapy/spiders folder with the following content:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
"""
Topic: sample
Desc :
"""
from coolscrapy.items import HuxiuItem
import scrapy


class HuxiuSpider(scrapy.Spider):
    name = "huxiu"
    allowed_domains = ["huxiu.com"]
    start_urls = [
        "http://www.huxiu.com/index.php"
    ]

    def parse(self, response):
        for sel in response.xpath('//div[@class="mod-info-flow"]/div/div[@class="mob-ctt"]'):
            item = HuxiuItem()
            item['title'] = sel.xpath('h3/a/text()')[0].extract()
            item['link'] = sel.xpath('h3/a/@href')[0].extract()
            url = response.urljoin(item['link'])
            item['desc'] = sel.xpath('div[@class="mob-sub"]/text()')[0].extract()
            print(item['title'], item['link'], item['desc'])

Run the crawler

Execute the following command in the project root directory, where huxiu is the spider name you defined:

scrapy crawl huxiu

If everything is working, you should see every news item printed out.
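If you prefer to launch the spider from a Python script instead of the command line, Scrapy's CrawlerProcess can run it in-process. A small sketch; the file name run_huxiu.py is just an example:

# run_huxiu.py - place it in the project root, next to scrapy.cfg
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from coolscrapy.spiders.huxiu_spider import HuxiuSpider

# Load the project settings (settings.py) and start the crawl.
process = CrawlerProcess(get_project_settings())
process.crawl(HuxiuSpider)
process.start()  # blocks until the crawl finishes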

Handling links

If you want to follow each news link and scrape its details, you can return a Request object from the parse() method and register a callback function to parse the news detail page.

from coolscrapy.items import HuxiuItem
import scrapy

class HuxiuSpider(scrapy.Spider):
    name = "huxiu"
    allowed_domains = ["huxiu.com"]
    start_urls = [
        "http://www.huxiu.com/index.php"
    ]

    def parse(self, response):
        for sel in response.xpath('//div[@class="mod-info-flow"]/div/div[@class="mob-ctt"]'):
            item = HuxiuItem()
            item['title'] = sel.xpath('h3/a/text()')[0].extract()
            item['link'] = sel.xpath('h3/a/@href')[0].extract()
            url = response.urljoin(item['link'])
            item['desc'] = sel.xpath('div[@class="mob-sub"]/text()')[0].extract()
            # print(item['title'],item['link'],item['desc'])
            yield scrapy.Request(url, callback=self.parse_article)

    def parse_article(self, response):
        detail = response.xpath('//div[@class="article-wrap"]')
        item = HuxiuItem()
        item['title'] = detail.xpath('h1/text()')[0].extract()
        item['link'] = response.url
        item['posttime'] = detail.xpath(
            'div[@class="article-author"]/span[@class="article-time"]/text()')[0].extract()
        print(item['title'],item['link'],item['posttime'])
        yield item


Now parse() only extracts the links of interest and delegates the parsing of each linked page to another method. You can build more complex crawlers on this basis.
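Note that in the version above, the title, link and desc collected on the list page are thrown away once the Request is yielded; only the detail page's data ends up in the item. If you want to carry the list-page fields over to the detail page, one option is to pass the partially filled item to the callback. Here is a sketch (not the original article's code) using Request.cb_kwargs, available since Scrapy 1.7; older versions can use the meta dict instead:

    # inside HuxiuSpider
    def parse(self, response):
        for sel in response.xpath('//div[@class="mod-info-flow"]/div/div[@class="mob-ctt"]'):
            item = HuxiuItem()
            item['title'] = sel.xpath('h3/a/text()')[0].extract()
            item['link'] = sel.xpath('h3/a/@href')[0].extract()
            item['desc'] = sel.xpath('div[@class="mob-sub"]/text()')[0].extract()
            url = response.urljoin(item['link'])
            # Hand the partially filled item to the detail-page callback.
            yield scrapy.Request(url, callback=self.parse_article, cb_kwargs={'item': item})

    def parse_article(self, response, item):
        detail = response.xpath('//div[@class="article-wrap"]')
        item['posttime'] = detail.xpath(
            'div[@class="article-author"]/span[@class="article-time"]/text()')[0].extract()
        yield item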

Export crawl data

The easiest way to save the scraped data is to export it to a local JSON file, by running:

scrapy crawl huxiu -o items.json


This approach is sufficient for a demo. But if you want to build a complex crawler system, it is best to write your own Item Pipeline.
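For reference, here is a minimal sketch of a hand-written pipeline that simply appends every item to a JSON Lines file; the file name huxiu_items.jl and the class name are just examples, and it would need to be registered in ITEM_PIPELINES just like the database pipeline in the next section:

import json

class JsonLinesWriterPipeline(object):
    """Append every scraped item to a local JSON Lines file."""

    def open_spider(self, spider):
        self.file = open('huxiu_items.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item  # pass the item on to any later pipelines

    def close_spider(self, spider):
        self.file.close()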

Save data to database

Above we showed how to export the crawled Items to a JSON file, but the most common approach is to write a Pipeline and store them in a database. We define it in coolscrapy/pipelines.py:

# -*- coding: utf-8 -*-
import datetime
import redis
import json
import logging
from contextlib import contextmanager

from scrapy import signals
from scrapy.exporters import JsonItemExporter
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from sqlalchemy.orm import sessionmaker
from coolscrapy.models import News, db_connect, create_news_table, Article


# session_scope() is used by process_item() below, but its definition was not
# shown in the original article; this is the standard SQLAlchemy
# "transactional scope" helper it most likely refers to.
@contextmanager
def session_scope(Session):
    """Provide a transactional scope around a series of operations."""
    session = Session()
    try:
        yield session
        session.commit()
    except:
        session.rollback()
        raise
    finally:
        session.close()

class ArticleDataBasePipeline(object):
    """保存文章到数据库"""

    def __init__(self):
        engine = db_connect()
        create_news_table(engine)
        self.Session = sessionmaker(bind=engine)

    def open_spider(self, spider):
        """This method is called when the spider is opened."""
        pass

    def process_item(self, item, spider):
        a = Article(url=item["url"],
                    title=item["title"].encode("utf-8"),
                    publish_time=item["publish_time"].encode("utf-8"),
                    body=item["body"].encode("utf-8"),
                    source_site=item["source_site"].encode("utf-8"))
        with session_scope(self.Session) as session:
            session.add(a)
        return item

    def close_spider(self, spider):
        pass


Above, I used SQLAlchemy to save to the database. It is a very good ORM library; I have written an introductory tutorial about it that you can refer to.
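The pipeline above imports Article, db_connect and create_news_table from coolscrapy/models.py, which this article does not show (the News model imported alongside them is omitted here as well). Below is a minimal sketch of what such a module could look like, assuming the DATABASE dict configured in settings.py in the next step; the table name and column sizes are illustrative:

# coolscrapy/models.py - a sketch, not the original file
from sqlalchemy import create_engine, Column, Integer, String, Text
from sqlalchemy.engine.url import URL
from sqlalchemy.ext.declarative import declarative_base

from coolscrapy.settings import DATABASE

Base = declarative_base()


def db_connect():
    """Build an engine from the DATABASE dict in settings.py."""
    # On SQLAlchemy 1.4+ use URL.create(**DATABASE) instead.
    return create_engine(URL(**DATABASE))


def create_news_table(engine):
    """Create all tables that do not exist yet."""
    Base.metadata.create_all(engine)


class Article(Base):
    __tablename__ = 'articles'  # table name is an assumption

    id = Column(Integer, primary_key=True)
    url = Column(String(500))
    title = Column(String(200))
    publish_time = Column(String(50))
    body = Column(Text)
    source_site = Column(String(100))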

Then configure this Pipeline in settings.py, along with the database connection information:

ITEM_PIPELINES = {
    'coolscrapy.pipelines.ArticleDataBasePipeline': 5,
}

# linux pip install MySQL-python
DATABASE = {'drivername': 'mysql',
            'host': '192.168.203.95',
            'port': '3306',
            'username': 'root',
            'password': 'mysql',
            'database': 'spider',
            'query': {'charset': 'utf8'}}


Run the crawler again

scrapy crawl huxiu


Then all news articles are stored in the database.
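To double-check, you can query the table with the same SQLAlchemy machinery; a quick sketch, assuming the models.py sketch above:

from sqlalchemy.orm import sessionmaker
from coolscrapy.models import db_connect, Article

# Open a session against the same database the pipeline wrote to.
Session = sessionmaker(bind=db_connect())
session = Session()

print(session.query(Article).count())  # number of stored articles
for article in session.query(Article).limit(5):
    print(article.title, article.url)

session.close()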
