Scrapy framework learning (7): Combining Scrapy with scrapy-splash to load js-rendered pages

1. Introduction

When we crawl web pages with a spider, static pages are generally easy to handle, and we have covered many such cases before. But how do we crawl pages whose content is loaded dynamically with js?

There are two common approaches to crawling dynamic js pages:

  1. selenium + PhantomJS:

    • PhantomJS is a headless browser and Selenium is an automated testing framework. Selenium drives the headless browser to request the page, waits for the js to load, and then extracts the data. Because headless browsers are very resource-intensive, this approach performs poorly.

  2. The scrapy-splash framework:

    • Splash is a js rendering service: a lightweight browser engine built on Twisted and QT that exposes a direct HTTP API (a standalone sketch of the API follows this list). Its speed and light footprint make it well suited to distributed crawling.

    • Splash integrates cleanly with the Scrapy crawler framework; the two work well together and give better crawling efficiency.
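Before wiring Splash into Scrapy, its HTTP API can be exercised on its own. The sketch below is my addition, not part of the original project: it calls Splash's render.html endpoint with the requests library, assuming a Splash instance at 192.168.99.100:8050 (the address used throughout this article; adjust for your setup).

# A minimal sketch of Splash's HTTP API (my addition, not the original code).
# render.html returns the page's HTML after the js has executed.
import requests

params = {
    "url": "https://news.google.com/",  # any js-heavy page works here
    "wait": 1,                          # seconds to let js finish rendering
}
resp = requests.get("http://192.168.99.100:8050/render.html", params=params)
print(resp.text[:500])  # the first part of the rendered HTML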

2. Splash environment setup

The Splash service runs in a Docker container, so we need to install Docker first.

2.1 docker installation (windows 10 home edition)

On Windows 10 Professional (or most other operating systems) the installation is straightforward. On Windows 10 Home, Docker has to be installed through the Docker Toolbox tool.

For the installation details, refer to the document: Install Docker on Windows 10 Home Edition

2.2 splash installation

docker pull scrapinghub/splash

2.3 Start the Splash service

docker run -p 8050:8050 scrapinghub/splash
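To confirm the container is actually up before moving on, a quick request against the service is enough. This check is my addition, not part of the original post; the host assumes the Docker Toolbox IP used in this article (use localhost:8050 on a native Docker install).

# Sanity check (my addition): confirm the Splash service is reachable.
import requests

resp = requests.get("http://192.168.99.100:8050/")
print(resp.status_code)  # 200 means the Splash web UI is serving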


Now open your browser and go to 192.168.99.100:8050 (the default IP of the Docker Toolbox virtual machine; on a native Docker install use localhost:8050), and you will see the following interface:

(Screenshot: the Splash web UI)

You can enter any URL in the input box on that page and click Render me! to see what the page looks like after rendering.

2.4 Install the scrapy-splash Python package

pip install scrapy-splash

3. Testing js page loading in a Scrapy project, using Google News as an example

The business needed to crawl some foreign news sites, such as Google News, but the pages turned out to be generated by js code. So I used the scrapy-splash framework together with Splash's js rendering service to get the data. See the following code:

3.1 settings.py configuration information

# URL of the rendering service
SPLASH_URL = 'http://192.168.99.100:8050'


# Deduplication filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# Use Splash's HTTP cache
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'


SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}


# Downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}


# Request headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}

# Item pipeline
ITEM_PIPELINES = {
   'news.pipelines.NewsPipeline': 300,
}

3.2 Item field definitions

import scrapy


class NewsItem(scrapy.Item):
    # headline
    title = scrapy.Field()
    # url of the article image
    image_url = scrapy.Field()
    # news source
    source = scrapy.Field()
    # url opened on click
    action_url = scrapy.Field()

3.3 Spider code

In the spiders directory, create a new_spider.py file with the following contents:


from scrapy import Spider
from scrapy_splash import SplashRequest
from news.items import NewsItem


class GoogleNewsSpider(Spider):
    name = "google_news"

    start_urls = ["https://news.google.com/news/headlines?ned=cn&gl=CN&hl=zh-CN"]

    def start_requests(self):
        for url in self.start_urls:
            # Issue the request through SplashRequest and wait 1 second for js
            yield SplashRequest(url, self.parse, args={'wait': 1})

    def parse(self, response):
        for element in response.xpath('//div[@class="qx0yFc"]'):
            actionUrl = element.xpath('.//a[@class="nuEeue hzdq5d ME7ew"]/@href').extract_first()
            title = element.xpath('.//a[@class="nuEeue hzdq5d ME7ew"]/text()').extract_first()
            source = element.xpath('.//span[@class="IH8C7b Pc0Wt"]/text()').extract_first()
            imageUrl = element.xpath('.//img[@class="lmFAjc"]/@src').extract_first()

            item = NewsItem()
            item['title'] = title
            item['image_url'] = imageUrl
            item['action_url'] = actionUrl
            item['source'] = source

            yield item
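A fixed wait of 1 second is the simplest option. When a page needs more control (scrolling, clicking, conditional waits), scrapy-splash can also run a custom Lua script through Splash's execute endpoint. The variant below is my own illustrative sketch, not part of the original project; its Lua script simply reproduces the go-and-wait behaviour:

from scrapy import Spider
from scrapy_splash import SplashRequest

# Illustrative Lua script (my addition): navigate, wait, return the HTML.
LUA_SCRIPT = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(args.wait)
    return {html = splash:html()}
end
"""

class GoogleNewsLuaSpider(Spider):
    # Hypothetical variant of the spider above, shown only to illustrate
    # the 'execute' endpoint; not part of the original project.
    name = "google_news_lua"
    start_urls = ["https://news.google.com/news/headlines?ned=cn&gl=CN&hl=zh-CN"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                endpoint='execute',  # run LUA_SCRIPT inside Splash
                args={'lua_source': LUA_SCRIPT, 'wait': 2},
            )

    def parse(self, response):
        # response.body is populated from the 'html' key returned by the script
        self.logger.info("rendered %d bytes", len(response.body))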

3.4 pipelines.py code

Store the item data in a MySQL database.

  • Create the db_news database:
CREATE DATABASE db_news;
  • Create the tb_google_news table:
CREATE TABLE tb_google_news(
    id INT AUTO_INCREMENT,
    title VARCHAR(50),
    image_url VARCHAR(200),
    action_url VARCHAR(200),
    source VARCHAR(30),
    PRIMARY KEY(id)
)ENGINE=INNODB DEFAULT CHARSET=utf8;

NewsPipeline class

import pymysql


class NewsPipeline(object):
    def __init__(self):
        # Connect to the local MySQL instance created above
        self.conn = pymysql.connect(host='localhost', port=3306, user='root', passwd='root', db='db_news', charset='utf8')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        sql = '''insert into tb_google_news (title,image_url,action_url,source) values(%s,%s,%s,%s)'''
        self.cursor.execute(sql, (item["title"], item["image_url"], item["action_url"], item["source"]))
        self.conn.commit()
        return item

    def close_spider(self,spider):
        self.cursor.close()
        self.conn.close()
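The pipeline above commits after every row and will raise on any database error, killing the crawl on the first bad item. A slightly more defensive variant (my sketch, not the original code) rolls back failed inserts and keeps going:

import pymysql


class SafeNewsPipeline(object):
    # Hypothetical variant of NewsPipeline with basic error handling.
    def __init__(self):
        self.conn = pymysql.connect(host='localhost', port=3306, user='root',
                                    passwd='root', db='db_news', charset='utf8')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        sql = 'insert into tb_google_news (title,image_url,action_url,source) values(%s,%s,%s,%s)'
        try:
            self.cursor.execute(sql, (item["title"], item["image_url"],
                                      item["action_url"], item["source"]))
            self.conn.commit()
        except pymysql.MySQLError:
            # Undo the failed statement and log instead of crashing the spider
            self.conn.rollback()
            spider.logger.exception("failed to store item: %r", item["title"])
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()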

3.5 Run the Scrapy crawler

Run in the console:

scrapy crawl google_news

The data stored in the database is shown below:

(Screenshot: rows stored in the tb_google_news table)


Project address: https://github.com/zhang3550545/scrapy-spider/tree/master/news

Reference articles:

Splash official documentation

Install Docker on Windows 10 Home Edition

The scrapy-plugins project on GitHub

Scrapy learning notes (13): scrapy-splash
