Scrapy (3): Slam the spider on the ground and rub it

When you hear "spider", you may think of a disgusting real one, like this. Scary enough, right? It is one of the ten most venomous spiders in the world.


[image: a real spider]


But you'd be wrong; that hateful spider only lives in your imagination. You might never expect a spider to be quite cute, like this one, with big sparkling eyes, so adorable you can hardly bear to slam it on the ground and rub it.


[image: a cute cartoon spider]


Oh, wait. A thought suddenly pops into my head: Spider-Man. If I remember right, before Spider-Man was called Spider-Man, he got bitten by a spider, and that is how he became Spider-Man.



Okay, that drifted a bit far; let's get back to the topic. Today's topic is the spider in Scrapy, which is simply what Scrapy calls a web crawler.


Today we'll work through a complete example that crawls the news list of Huxiu (the "Tiger Sniff" site). Let me open the website and take a look:


https://www.huxiu.com/


It feels like I've stumbled onto some kind of treasure, as if I could even pick up a few article-writing skills from it.


Create a project


scrapy startproject coolscrapy


Did that command run smoothly for you? Let's take a look at the directory layout first:


coolscrapy/
    scrapy.cfg            # deployment configuration file

    coolscrapy/           # the project's Python module; all your code goes in here
        __init__.py

        items.py          # Item definitions

        pipelines.py      # pipeline definitions

        settings.py       # project settings

        spiders/          # all your spiders live in this folder
            __init__.py
            ...
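By the way, Scrapy can also scaffold a spider skeleton for you with its genspider command (run it inside the project directory). This is optional; in this post we will write the spider file by hand instead:

cd coolscrapy
scrapy genspider huxiu huxiu.com   # creates coolscrapy/spiders/huxiu.py from a template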

Define our own Items 


Because we want to crawl the title, brief description, link, and publish time of each entry in the Huxiu news list, we need to define a scrapy.Item subclass to hold these fields:


import scrapy

# Subclassing scrapy.Item means HuXiuItem inherits from the scrapy.Item base class
class HuXiuItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
    posttime = scrapy.Field()

You may think that defining this is a bit troublesome and unnecessary, but if you look closely, isn't it a lot like a plain Java class that just declares a bunch of attributes, roughly corresponding to the data fields of the model layer? To be honest, I don't know Java very well; my company uses a Java backend, so I've only touched it a little.
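As a quick illustration (this snippet is mine, not part of the project code), an Item behaves almost exactly like a dict, which is why the spider below can fill it in field by field:

from coolscrapy.items import HuXiuItem

item = HuXiuItem()
item['title'] = 'Some headline'      # assign declared fields like dict keys
item['link'] = 'https://www.huxiu.com/article/123.html'
print(item['title'])                 # read them back the same way
print(dict(item))                    # or convert to a plain dict
# item['foo'] = 1 would raise KeyError, because 'foo' was never declared as a Field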


Next is our spider


These spiders are really just crawling tools; at the code level they are a handful of methods, or, more abstractly, a class. Scrapy uses them to crawl information from a domain (in practice, a set of URL addresses). In the spider class you define the initial URLs, how to follow links, and how to parse page content.


To define a Spider, you only need to subclass scrapy.Spider and set a few attributes:

name: the spider's name, which must be unique

start_urls: the list of initial URLs to download

parse(): used to parse the downloaded Response object, which is also this method's only argument. It is responsible for parsing the returned page data and extracting Items (returning Item objects), as well as further link URLs to follow (returning Request objects)



Create a new file named huxiu_spider.py under the coolscrapy/spiders folder, with the following content:


#!/usr/bin/env python
# -*- encoding: utf-8 -*-
"""
Topic: sample
Desc :
"""

from coolscrapy.items import HuXiuItem
import scrapy


class HuXiuSpider(scrapy.Spider):
    name = 'huxiu'
    allowed_domains = ['huxiu.com']
    start_urls = [
        'http://www.huxiu.com/index.php'
    ]

    def parse(self, response):
        for sel in response.xpath('//div[@class="mod-info-flow"]/div/div[@class="mob-ctt"]'):
            item = HuXiuItem()
            item['title'] = sel.xpath('h3/a/text()')[0].extract()
            item['link'] = sel.xpath('h3/a/@href')[0].extract()
            url = response.urljoin(item['link'])  # absolute URL; we will follow it in a later section
            item['desc'] = sel.xpath(
                'div[@class="mob-sub"]/text()')[0].extract()
            print(item['title'], item['link'], item['desc'])

Run the crawler


Time to hug the Buddha's feet at the last minute and pray that my crawler runs safe and sound, with no bugs. So nervous.


Execute the following command in the project root directory, where huxiu is the spider name you defined:


scrapy crawl huxiu

Oh my goodness, it threw an error anyway. Well, now we have a bug to fix.


[image: the error output]

Let's leave this bug aside for now, get familiar with the overall flow first, and come back to fix it later.
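A handy way to chase down this kind of error (my tip, not part of the original flow) is Scrapy's interactive shell: it fetches the page and drops you into a Python prompt where you can try the XPath expressions before baking them into the spider. If the site's markup has changed, or it blocks the default user agent, the selectors above will simply come back empty:

scrapy shell 'https://www.huxiu.com'

# inside the shell, `response` already holds the fetched page
response.status                                                   # did the request even succeed?
response.xpath('//div[@class="mod-info-flow"]')                   # does the list container exist?
response.xpath('//div[@class="mob-ctt"]/h3/a/text()').extract()   # candidate titles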


Handle links


If you want to follow each news link and look at its details, you can yield a Request object from the parse() method and register a callback function to parse the article details.


from coolscrapy.items import HuXiuItem
import scrapy


class HuxiuSpider(scrapy.Spider):
    name = "huxiu"
    allowed_domains = ["huxiu.com"]
    start_urls = [
        "http://www.huxiu.com/index.php"
    ]

    def parse(self, response):
        for sel in response.xpath('//div[@class="mod-info-flow"]/div/div[@class="mob-ctt"]'):
            item = HuXiuItem()
            item['title'] = sel.xpath('h3/a/text()')[0].extract()
            item['link'] = sel.xpath('h3/a/@href')[0].extract()
            url = response.urljoin(item['link'])
            item['desc'] = sel.xpath('div[@class="mob-sub"]/text()')[0].extract()
            # print(item['title'], item['link'], item['desc'])
            yield scrapy.Request(url, callback=self.parse_article)

    def parse_article(self, response):
        detail = response.xpath('//div[@class="article-wrap"]')
        item = HuXiuItem()
        item['title'] = detail.xpath('h1/text()')[0].extract()
        item['link'] = response.url
        item['posttime'] = detail.xpath(
            'div[@class="article-author"]/span[@class="article-time"]/text()')[0].extract()
        print(item['title'], item['link'], item['posttime'])
        yield item

Now parse() only extracts the links we are interested in and hands the parsing of each linked page off to another method. You can build much more complex crawlers on top of this pattern.
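Building on that idea, here is a hypothetical sketch of a spider that also walks the pagination; the selector for the "next page" link is made up for illustration and would need to match Huxiu's real markup:

from coolscrapy.items import HuXiuItem
import scrapy


class HuxiuPagingSpider(scrapy.Spider):
    """Hypothetical variant that also follows a 'next page' link."""
    name = 'huxiu_paging'
    allowed_domains = ['huxiu.com']
    start_urls = ['http://www.huxiu.com/index.php']

    def parse(self, response):
        for sel in response.xpath('//div[@class="mod-info-flow"]/div/div[@class="mob-ctt"]'):
            item = HuXiuItem()
            item['title'] = sel.xpath('h3/a/text()')[0].extract()
            item['link'] = sel.xpath('h3/a/@href')[0].extract()
            yield item

        # made-up pagination selector -- adjust it to the real page structure
        next_page = response.xpath('//a[contains(@class, "next")]/@href').extract_first()
        if next_page:
            # response.follow (Scrapy 1.4+) resolves relative URLs for us
            yield response.follow(next_page, callback=self.parse)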


Export data


The easiest way to save the scraped data is to export it to a local JSON file, like this:


scrapy crawl huxiu -o items.json
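Scrapy picks the feed format from the file extension, so the same -o flag can write other formats as well, for example:

scrapy crawl huxiu -o items.csv   # CSV
scrapy crawl huxiu -o items.jl    # JSON Lines, one item per line
scrapy crawl huxiu -o items.xml   # XML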

Exporting to a file like this is good enough for the small demo here, but if you want to build a more complex crawler system, you are better off writing your own Item Pipeline.


Save data to the database


Above we showed how to export the scraped items to a JSON file, but the more common approach is to write a Pipeline that stores them in a database. We define it in coolscrapy/pipelines.py:


# -*- coding: utf-8 -*-
import datetime
import redis
import json
import logging
from contextlib import contextmanager

from scrapy import signals
from scrapy.exporters import JsonItemExporter
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from sqlalchemy.orm import sessionmaker
from coolscrapy.models import News, db_connect, create_news_table, Article
# (several of the imports above are only needed by other pipelines in the full project)


@contextmanager
def session_scope(Session):
    """Provide a transactional scope around a series of operations
    (the standard SQLAlchemy helper this pipeline relies on)."""
    session = Session()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()


class ArticleDataBasePipeline(object):
    """Save articles to the database."""

    def __init__(self):
        engine = db_connect()
        create_news_table(engine)
        self.Session = sessionmaker(bind=engine)

    def open_spider(self, spider):
        """This method is called when the spider is opened."""
        pass

    def process_item(self, item, spider):
        a = Article(url=item["url"],
                    title=item["title"].encode("utf-8"),
                    publish_time=item["publish_time"].encode("utf-8"),
                    body=item["body"].encode("utf-8"),
                    source_site=item["source_site"].encode("utf-8"))
        with session_scope(self.Session) as session:
            session.add(a)
        return item  # hand the item on to any later pipelines

    def close_spider(self, spider):
        pass

Above, I used SQLAlchemy to talk to the database from Python. It is a very good ORM library, and I have written an introductory tutorial about it that you can refer to. Note that this pipeline expects an item with url, publish_time, body, and source_site fields, so it does not line up one-to-one with the HuXiuItem defined earlier; adapt the field names to whatever your spider yields.
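The pipeline imports News, db_connect, create_news_table, and Article from coolscrapy/models.py, which never appears in this post. Purely as a guess at its shape (a minimal sketch of mine, not the actual file), it could be built on SQLAlchemy's declarative base and the DATABASE dict configured in settings.py below:

# coolscrapy/models.py -- hypothetical sketch, not the original file
from sqlalchemy import create_engine, Column, Integer, String, Text
from sqlalchemy.engine.url import URL
from sqlalchemy.ext.declarative import declarative_base

import coolscrapy.settings as settings

Base = declarative_base()


def db_connect():
    """Build a database engine from the DATABASE dict in settings.py."""
    # On SQLAlchemy 1.4+ you would use URL.create(**settings.DATABASE) instead.
    return create_engine(URL(**settings.DATABASE))


def create_news_table(engine):
    """Create all tables declared on Base (a no-op if they already exist)."""
    Base.metadata.create_all(engine)


class Article(Base):
    """Article details saved by ArticleDataBasePipeline."""
    __tablename__ = 'articles'

    id = Column(Integer, primary_key=True)
    url = Column(String(1024))
    title = Column(String(255))
    publish_time = Column(String(64))
    body = Column(Text)
    source_site = Column(String(255))


# News, also imported by the pipeline module, would be declared the same way.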

Then configure this Pipeline in settings.py, along with the database connection information:

ITEM_PIPELINES = {
    'coolscrapy.pipelines.ArticleDataBasePipeline': 5,
}

# on Linux: pip install MySQL-python
DATABASE = {'drivername': 'mysql',
            'host': '192.168.203.95',
            'port': '3306',
            'username': 'root',
            'password': 'mysql',
            'database': 'spider',
            'query': {'charset': 'utf8'}}


Run the crawler again


I'll fix that bug tomorrow.





Origin blog.51cto.com/15067249/2574450