In-depth web page analysis: using scrapy_selenium to obtain map information

Yiniu Cloud Proxy

Introduction

A web crawler is a technique for automatically retrieving web content, used in scenarios such as data collection, information analysis, and website monitoring. However, some page content is not static but is generated dynamically by JavaScript, as with complex elements like charts and maps. These elements often require user interaction before they are displayed, or take some time to finish loading. Traditional crawling tools such as requests or urllib cannot capture them, because they only fetch the page's source code and cannot execute JavaScript.
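
For example, a plain HTTP fetch of the Baidu Map page returns only the initial HTML, before any JavaScript has run. A minimal sketch of the problem (the BMap_Marker class name is the marker element class used later in this article):

# Fetch the raw page source without executing any JavaScript
import requests

html = requests.get('https://map.baidu.com/', timeout=10).text

# The marker elements are injected by JavaScript after the page loads,
# so they are absent from the static source
print('BMap_Marker' in html)  # almost certainly False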

To solve this problem, we can use scrapy_selenium, a tool that combines two powerful libraries, Scrapy and Selenium, to crawl dynamic web pages. Scrapy is a high-level web crawling framework that makes it easy to manage multiple crawler projects and provides rich middleware and pipeline components. Selenium is a browser automation tool that can simulate user behavior, such as opening a page, clicking a button, or entering text, and retrieve the rendered result. By plugging Selenium in as Scrapy's downloader middleware, we let Scrapy request and parse pages through a real browser and thus obtain dynamically generated content.
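
Conceptually, such a downloader middleware intercepts each request, drives a real browser, and hands the rendered HTML back to Scrapy. Below is a simplified sketch of the idea (the actual scrapy_selenium implementation adds driver management, waits, and screenshots on top of this):

from scrapy.http import HtmlResponse
from selenium import webdriver

class SimpleSeleniumMiddleware:
    """Simplified illustration of a Selenium downloader middleware."""

    def __init__(self):
        # One shared browser instance for all requests
        self.driver = webdriver.Chrome()

    def process_request(self, request, spider):
        # Render the page in a real browser instead of Scrapy's downloader
        self.driver.get(request.url)
        # Returning a response here makes Scrapy skip its own download
        return HtmlResponse(
            self.driver.current_url,
            body=self.driver.page_source,
            encoding='utf-8',
            request=request,
        )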

Overview

This article introduces how to use scrapy_selenium to crawl web pages that contain complex elements such as charts and maps, taking Baidu Map as an example to show how to obtain the marker information on a map. It assumes the reader is familiar with the basic usage of Scrapy and Selenium and has installed the related dependencies and drivers.

Walkthrough

Install scrapy_selenium

scrapy_selenium is an open source Python package that can be installed via the pip command:

# Install scrapy_selenium
pip install scrapy_selenium
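
Note that scrapy_selenium drives a real browser, so a matching WebDriver binary (for example chromedriver for Chrome) must be installed as well. A quick sanity check, assuming Chrome is used:

# Confirm chromedriver is installed and on PATH
chromedriver --version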

Create the Scrapy project and spider

Create a project called mapspider using the scrapy command:

# Create the mapspider project
scrapy startproject mapspider

Enter the project directory and use the genspider command to create a spider named baidumap:

# Enter the project directory
cd mapspider
# Create the baidumap spider
scrapy genspider baidumap baidu.com
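
The genspider command creates a skeleton file at mapspider/spiders/baidumap.py, roughly like the following, which the next sections flesh out:

import scrapy

class BaidumapSpider(scrapy.Spider):
    name = 'baidumap'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        pass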

Configure the settings.py file

Open the settings.py file in the project directory and modify it as follows (scrapy_selenium reads its configuration from the SELENIUM_DRIVER_* keys):

# Downloader middleware: replace the default downloader handling
# with scrapy_selenium's SeleniumMiddleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}

# Selenium settings: browser type, driver path, and startup arguments
SELENIUM_DRIVER_NAME = 'chrome'  # use the Chrome browser
SELENIUM_DRIVER_EXECUTABLE_PATH = '/path/to/chromedriver'  # adjust to your chromedriver location
SELENIUM_DRIVER_ARGUMENTS = ['--headless', '--window-size=1920,1080']  # headless, 1920x1080 window

# Yiniu Cloud: crawler proxy settings
PROXY_HOST = "www.16yun.cn"  # proxy server address
PROXY_PORT = "3111"  # proxy server port
PROXY_USER = "16YUN"  # proxy username
PROXY_PASS = "16IP"  # proxy password

# Set the log level to INFO to make runs easier to follow
LOG_LEVEL = 'INFO'
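
Note that the PROXY_* values above are plain variables: scrapy_selenium does not read them by itself. One possible way to wire them in is to extend the driver arguments defined earlier; a sketch, assuming the gateway accepts plain host:port connections (Chrome's --proxy-server flag carries no credentials, so an authenticated proxy like this one typically also needs a browser extension or a tool such as selenium-wire):

# Route the browser through the proxy gateway (host:port only;
# the username/password pair needs a separate mechanism)
SELENIUM_DRIVER_ARGUMENTS = [
    '--headless',
    '--window-size=1920,1080',
    f'--proxy-server=http://{PROXY_HOST}:{PROXY_PORT}',
]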

Write the baidumap.py file

Open the spiders folder under the project directory, find the baidumap.py file, and modify it as follows:

# Import scrapy, scrapy_selenium's SeleniumRequest, and the selenium helpers
import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

# Define the baidumap spider, inheriting from scrapy.Spider
class BaidumapSpider(scrapy.Spider):
    # Spider name
    name = 'baidumap'
    # Start URL, using Beijing as an example
    start_urls = ['https://map.baidu.com/?newmap=1&ie=utf-8&s=s%26wd%3D%E5%8C%97%E4%BA%AC%E5%B8%82']

    # Issue SeleniumRequest instead of the default Request so that
    # SeleniumMiddleware renders the page in the browser
    def start_requests(self):
        for url in self.start_urls:
            yield SeleniumRequest(url=url, callback=self.parse)

    # Parse method, receiving the rendered response
    def parse(self, response):
        # Get the selenium driver object to operate the browser
        driver = response.meta['driver']
        # Wait for the map to finish loading by checking that the map layer is visible
        WebDriverWait(driver, 10).until(
            EC.visibility_of_element_located((By.CLASS_NAME, 'BMap_mask'))
        )
        # Get all marker elements on the map as a list
        markers = driver.find_elements(By.CLASS_NAME, 'BMap_Marker')
        # Iterate over the marker elements
        for marker in markers:
            # Text content of the marker, e.g. hotel, restaurant
            text = marker.get_attribute('textContent')
            # Position of the marker, with x and y keys
            position = marker.get_attribute('position')
            # Print the marker text and position
            print(text, position)
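
As a variation, instead of printing inside parse() you can yield dictionaries, so the results flow through Scrapy's item pipelines and feed exports; a minimal sketch of the loop rewritten that way:

# Inside parse(): yield items instead of printing them
for marker in markers:
    yield {
        'text': marker.get_attribute('textContent'),
        'position': marker.get_attribute('position'),
    }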

Run the crawler

In the project directory, use the scrapy command to run the crawler:

# Run the baidumap spider
scrapy crawl baidumap
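
If the spider yields items as sketched above, Scrapy's feed exports can write them straight to a file (the -O flag, available since Scrapy 2.0, overwrites any existing output file):

# Run the spider and export the yielded items to JSON
scrapy crawl baidumap -O markers.json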

Example output

After running the crawler, you should see output like the following on the console:

酒店 {'x': '116.403119', 'y': '39.914714'}
餐厅 {'x': '116.403119', 'y': '39.914714'}
银行 {'x': '116.403119', 'y': '39.914714'}
超市 {'x': '116.403119', 'y': '39.914714'}
医院 {'x': '116.403119', 'y': '39.914714'}
学校 {'x': '116.403119', 'y': '39.914714'}
公交站 {'x': '116.403119', 'y': '39.914714'}
地铁站 {'x': '116.403119', 'y': '39.914714'}
停车场 {'x': '116.403119', 'y': '39.914714'}
加油站 {'x': '116.403119', 'y': '39.914714'}
...

These lines are the marker information crawled from the map: each marker's text is a Chinese POI category (hotel, restaurant, bank, supermarket, hospital, school, bus stop, subway station, parking lot, gas station), followed by its coordinates. This information can be used for further analysis or downstream applications.

Conclusion

This article introduced how to use scrapy_selenium to crawl web pages containing complex elements such as charts and maps, taking Baidu Map as an example to show how to obtain the marker information on a map. scrapy_selenium is a powerful and flexible tool that can handle the crawling needs of all kinds of dynamic web pages and makes data collection much easier. Hope this article helps you.
