When you hear "spider", you may picture a disgusting real one, like this: scary enough that it ranks among the ten most venomous spiders in the world.
Or maybe the hateful spider in your head is wrong, and you never expected that spiders can be quite cute. Like this one, with big watery eyes you can't bear to hurt.
Oh, wait. My brain just jumped somewhere else: Spider-Man. Come to think of it, before Spider-Man was called Spider-Man, he was just a guy who got bitten by a spider.
OK, that's drifting far off topic; back to business. Today's spider is the one in Scrapy, which refers to a web crawler.
Today we'll use a complete example to crawl the news list of Huxiu (a Chinese tech-news site). Let's take a look at the site first:
https://www.huxiu.com/
It feels like I've stumbled on some kind of treasure; maybe I can even pick up some article-writing tricks from it.
Create project
scrapy startproject coolscrapy
If this command ran without complaint, great. Let's first look at the directory layout it generated:
coolscrapy/
    scrapy.cfg            # deployment configuration file
    coolscrapy/           # the project's Python module; all your code goes in here
        __init__.py
        items.py          # Item definitions
        pipelines.py      # pipeline definitions
        settings.py       # project settings
        spiders/          # all your spiders go in this folder
            __init__.py
            ...
Define our own Items
Because we want to crawl the title, brief description, link, and release time of each entry in the Huxiu news list, we need to define a scrapy.Item subclass to hold those fields.
import scrapy

# Subclassing scrapy.Item gives us a dict-like container with declared fields
class HuXiuItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
    posttime = scrapy.Field()
You may think defining this class is a bit of unnecessary ceremony, but notice that it plays much the same role as a model/entity class in Java: it declares the attributes your data will have, roughly corresponding to the fields of a model layer. (To be honest, I don't know Java very well; our company just uses a Java backend, so I've touched it a bit.)
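An Item behaves much like a Python dict that only accepts a fixed set of keys. As a rough illustration of the idea — this is a plain-Python stand-in, not Scrapy's real implementation — you could picture it like this:

```python
class SimpleItem:
    """Toy stand-in for scrapy.Item: a dict that only accepts declared fields."""
    fields = ('title', 'link', 'desc', 'posttime')

    def __init__(self):
        self._values = {}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('%s is not a declared field' % key)
        self._values[key] = value

    def __getitem__(self, key):
        return self._values[key]


item = SimpleItem()
item['title'] = 'Some headline'
print(item['title'])      # Some headline
# item['author'] = 'x'    # would raise KeyError: author is not a declared field
```

This is why declaring fields up front is useful: a typo in a field name fails loudly instead of silently creating a new key.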
Next is our spider
These spiders are, conceptually, the actual crawling tools; at the code level, each one is a class. Scrapy uses them to crawl information from a domain (really, a set of URL addresses): in the spider class you define the initial URLs, how to follow links, and how to parse page information.
To define a spider, you only need to inherit from the scrapy.Spider class and set a few attributes:
name: the spider's name; must be unique
start_urls: the initial URLs to download
parse(): parses the downloaded Response object, which is also this method's only argument. It is responsible for parsing the returned page data and extracting the corresponding Items (returning Item objects), as well as any further links to follow (returning Request objects)
We create a new file huxiu_spider.py under the coolscrapy/spiders folder, with the following content:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
"""
Topic: sample
Desc :
"""
from coolscrapy.items import HuXiuItem
import scrapy


class HuXiuSpider(scrapy.Spider):
    name = 'huxiu'
    allowed_domains = ['huxiu.com']
    start_urls = [
        'http://www.huxiu.com/index.php'
    ]

    def parse(self, response):
        for sel in response.xpath('//div[@class="mod-info-flow"]/div/div[@class="mob-ctt"]'):
            item = HuXiuItem()
            item['title'] = sel.xpath('h3/a/text()')[0].extract()
            item['link'] = sel.xpath('h3/a/@href')[0].extract()
            url = response.urljoin(item['link'])  # absolute URL; used later when we follow links
            item['desc'] = sel.xpath('div[@class="mob-sub"]/text()')[0].extract()
            print(item['title'], item['link'], item['desc'])
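If you want a feel for what those XPath expressions match before running the spider, you can experiment on a hand-written snippet shaped like the markup the spider assumes (the structure and sample values below are made up to match the XPaths). The stdlib xml.etree.ElementTree supports enough of XPath for this sketch:

```python
import xml.etree.ElementTree as ET

# Hand-written snippet mimicking the structure the spider's XPaths assume.
snippet = """
<div class="mod-info-flow">
  <div>
    <div class="mob-ctt">
      <h3><a href="/article/1.html">First headline</a></h3>
      <div class="mob-sub">A short description</div>
    </div>
  </div>
</div>
"""

root = ET.fromstring(snippet)
# Same shape as the spider's relative XPaths, minus the //div[...] prefix,
# because here the outer div is already the root element.
for ctt in root.findall('./div/div[@class="mob-ctt"]'):
    title = ctt.find('h3/a').text          # like h3/a/text()
    link = ctt.find('h3/a').get('href')    # like h3/a/@href
    desc = ctt.find('div[@class="mob-sub"]').text
    print(title, link, desc)
```

Scrapy's response.xpath() is far more capable (full XPath 1.0 via lxml), but the matching logic is the same idea.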
Run crawler
Time to pray to the Buddha at the last minute: may my crawler run safe and sound, with no bugs. So nervous.
Execute the following command in the project root directory, where huxiu is the spider name you defined:
scrapy crawl huxiu
Oh my goodness, it threw an error anyway. Well, at least now we have a bug to solve.
Let's set this bug aside for now, get familiar with the overall flow first, and fix it later.
Handle link
If you want to follow each news link and see its details, you can yield a Request object from the parse() method and register a callback function to parse the news detail page.
from coolscrapy.items import HuXiuItem
import scrapy


class HuxiuSpider(scrapy.Spider):
    name = "huxiu"
    allowed_domains = ["huxiu.com"]
    start_urls = [
        "http://www.huxiu.com/index.php"
    ]

    def parse(self, response):
        for sel in response.xpath('//div[@class="mod-info-flow"]/div/div[@class="mob-ctt"]'):
            item = HuXiuItem()
            item['title'] = sel.xpath('h3/a/text()')[0].extract()
            item['link'] = sel.xpath('h3/a/@href')[0].extract()
            url = response.urljoin(item['link'])
            item['desc'] = sel.xpath('div[@class="mob-sub"]/text()')[0].extract()
            # print(item['title'], item['link'], item['desc'])
            yield scrapy.Request(url, callback=self.parse_article)

    def parse_article(self, response):
        detail = response.xpath('//div[@class="article-wrap"]')
        item = HuXiuItem()
        item['title'] = detail.xpath('h1/text()')[0].extract()
        item['link'] = response.url
        item['posttime'] = detail.xpath(
            'div[@class="article-author"]/span[@class="article-time"]/text()')[0].extract()
        print(item['title'], item['link'], item['posttime'])
        yield item
Now parse() only extracts the links we're interested in, and hands the analysis of each linked page to another method. You can build more complex crawlers on top of this pattern.
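This yield-a-Request-with-a-callback pattern is what lets Scrapy drive everything: the engine takes whatever your callbacks yield, downloads each Request, and hands the response to the registered callback. Here is a heavily simplified, purely synchronous sketch of that loop — fake in-memory "pages", made-up URLs, no real networking, and none of Scrapy's actual scheduling:

```python
# Fake "downloaded pages": url -> body. Purely illustrative data.
PAGES = {
    'http://example.com/index': ['http://example.com/a', 'http://example.com/b'],
    'http://example.com/a': 'article A',
    'http://example.com/b': 'article B',
}

class Request:
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback

def parse(url, body):
    # Like Spider.parse: yield one follow-up Request per extracted link.
    for link in body:
        yield Request(link, parse_article)

def parse_article(url, body):
    # Like parse_article: yield a finished item.
    yield {'url': url, 'body': body}

def run(start_url):
    """Minimal synchronous stand-in for Scrapy's engine loop."""
    queue = [Request(start_url, parse)]
    items = []
    while queue:
        req = queue.pop(0)
        body = PAGES[req.url]                # "download" the page
        for result in req.callback(req.url, body):
            if isinstance(result, Request):
                queue.append(result)         # schedule a follow-up request
            else:
                items.append(result)         # collect a scraped item
    return items

print(run('http://example.com/index'))
```

The real engine does this concurrently and with far more bookkeeping (dedup, throttling, middleware), but the callback dispatch is the same shape.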
Export data
The easiest way to save the scraped data is to dump it to a local JSON file, by running the crawler like this:
scrapy crawl huxiu -o items.json
This method is sufficient for the small system we're demonstrating. But if you want to build a complex crawler system, it's best to write your own Item Pipeline.
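The -o items.json exporter writes one JSON array of item dicts, so you can load the result back with the stdlib json module. A sketch — the file contents below are made-up sample data, not real crawl output:

```python
import json

# What an exported items.json might look like (made-up sample data).
exported = '[{"title": "Some headline", "link": "/article/1.html", "desc": "A short description"}]'

items = json.loads(exported)       # with a real file: json.load(open('items.json'))
for item in items:
    print(item['title'], item['link'])
```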
Save data to database
Above we showed how to export the scraped items to a JSON file, but the more common approach is to write a Pipeline that stores them in a database. We define it in coolscrapy/pipelines.py:
# -*- coding: utf-8 -*-
from contextlib import contextmanager
from sqlalchemy.orm import sessionmaker
from coolscrapy.models import db_connect, create_news_table, Article


@contextmanager
def session_scope(Session):
    """Provide a transactional scope around a series of operations."""
    session = Session()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()


class ArticleDataBasePipeline(object):
    """Save articles to the database."""

    def __init__(self):
        engine = db_connect()
        create_news_table(engine)
        self.Session = sessionmaker(bind=engine)

    def open_spider(self, spider):
        """This method is called when the spider is opened."""
        pass

    def process_item(self, item, spider):
        a = Article(url=item["url"],
                    title=item["title"].encode("utf-8"),
                    publish_time=item["publish_time"].encode("utf-8"),
                    body=item["body"].encode("utf-8"),
                    source_site=item["source_site"].encode("utf-8"))
        with session_scope(self.Session) as session:
            session.add(a)
        return item

    def close_spider(self, spider):
        pass
Above, I used SQLAlchemy to save to the database. It is a very good Python ORM library; I've written an introductory tutorial about it that you can refer to.
Then configure this Pipeline in settings.py, along with the database connection info:
ITEM_PIPELINES = {
    'coolscrapy.pipelines.ArticleDataBasePipeline': 5,
}

# On Linux: pip install MySQL-python
DATABASE = {'drivername': 'mysql',
            'host': '192.168.203.95',
            'port': '3306',
            'username': 'root',
            'password': 'mysql',
            'database': 'spider',
            'query': {'charset': 'utf8'}}
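Under the hood, a helper like db_connect() typically turns this settings dict into an SQLAlchemy connection URL (SQLAlchemy can build it for you from exactly these keys). Here is a plain-string sketch of the same idea, just so you can see what the settings amount to — the helper name and exact format are assumptions for illustration:

```python
from urllib.parse import urlencode

DATABASE = {'drivername': 'mysql',
            'host': '192.168.203.95',
            'port': '3306',
            'username': 'root',
            'password': 'mysql',
            'database': 'spider',
            'query': {'charset': 'utf8'}}

def database_url(cfg):
    """Build a connection URL string from the settings dict (illustrative only)."""
    base = '{drivername}://{username}:{password}@{host}:{port}/{database}'.format(**cfg)
    return base + '?' + urlencode(cfg['query'])

print(database_url(DATABASE))
# mysql://root:mysql@192.168.203.95:3306/spider?charset=utf8
```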
Run the crawler again
Fix this bug tomorrow