[Scrapy Five Minutes Website] [Travel Industry News] Scrapy in Practice: Scraping Data from Beijing Travel Net

Target website introduction

Beijing Travel Net is a non-profit website supervised by the Beijing Municipal Bureau of Culture and Tourism. It provides authoritative Beijing travel information, including comprehensive Beijing travel guides, Beijing travel complaints, travel questions and answers, and Beijing...

Start Scrapy

Preparation for data collection

1. If you are not yet familiar with the idea of crawling a website in 5 minutes, first read
[Scrapy Five Minutes Website] Basic knowledge for crawling whole-site data

2. If you are not familiar with sorting out crawl targets and business categories, first read
[Scrapy Five Minutes Website] Crawler target sorting and data preparation

3. If you do not know the reusable Scrapy template, first read (required reading)
[Scrapy Five Minutes Website] General template for the data capture project framework

Data collation results

1. Screenshot of the results saved in Excel

Template application

<Project>.py file under Spider

1. Create the spider

scrapy genspider www_visitbeijing_com_cn " "

2. Organize the list styles of the whole site
Let's first look at the styles of the list pages. The whole site uses 3 styles: the JSON formats of two APIs, and a conventional HTML format.

3. Modify the content of www_visitbeijing_com_cn.py

Only the parts that need to be modified are explained here. Everything else follows the template and needs no changes.

  • Scope & custom description
    allowed_domains = []
    web_name = "北京旅游网"
  • Add the channels to crawl
    start_menu = [
        # Shopping (购物)
        [
            {"channel_name": "购物-购物攻略", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/3lz9bx5k"},
            {"channel_name": "购物-商家信息", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/19iere7b"},
        ],
        # Food (美食)
        [
            {"channel_name": "美食-潮流美食", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/2gmxkd2a"},
            {"channel_name": "美食-地方美食", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/n0bvf6k2"},
            {"channel_name": "美食-老北京美食", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/rlwk9pq8"},
            {"channel_name": "美食-美食资讯", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/18xr1f7c"},
            {"channel_name": "美食-异域美食", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/lhyv22zr"},
        ],
        # Video (视频)
        [
            {"channel_name": "视频-北京故事", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/lYIcryRt"},
            {"channel_name": "视频-京郊游玩", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/8UuPRmMl"},
            {"channel_name": "视频-特色美食", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/Hszvxnoz"},
            {"channel_name": "视频-游记攻略", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/Jx96mrZo"},
            {"channel_name": "视频-展演视频", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/m12rdtUw"},
        ],
        # Culture (文化)
        [
            {"channel_name": "文化-创新文化", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/X6ShiQDg"},
            {"channel_name": "文化-古都文化", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/UfLNxEFA"},
            {"channel_name": "文化-红色文化", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/GqJjEBWR"},
            {"channel_name": "文化-京味文化", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/MT7zkjsv"},
            {"channel_name": "文化-特色文化", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/hzuUSDSi"},
            {"channel_name": "文化-演出", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/VsGq9Qv2"},
            {"channel_name": "文化-影视", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/5qMQ0xSf"},
            {"channel_name": "文化-阅读", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/LYZP6P7M"},
            {"channel_name": "文化-展览", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/DXxAEzhZ"},
        ],
        # Sightseeing (游玩)
        [
            {"channel_name": "游玩-北京故事", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/8gez50vw"},
            {"channel_name": "游玩-城区游", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/4eek55dr"},
            {"channel_name": "游玩-京郊游", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/enrbw8do"},
            {"channel_name": "游玩-特色主题游-古都文化游", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/judrtkbd"},
            {"channel_name": "游玩-特色主题游-红色旅游", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/l5tf071h"},
            {"channel_name": "游玩-特色主题游-科教游", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/8ubc2gc4"},
            {"channel_name": "游玩-特色主题游-体育游", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/jtquifc7"},
            {"channel_name": "游玩-特色主题游-文创艺术游", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/4g6nqzpv"},
            {"channel_name": "游玩-特色主题游-休闲度假游", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/6th3pk42"},
            {"channel_name": "游玩-特色主题游-长城游", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/fcj713ds"},
            {"channel_name": "游玩-特色主题游-中医养生游", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/5c9yssil"},
            {"channel_name": "游玩-游玩资讯", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/bwzg2a22"},
        ],
        # Accommodation (住宿)
        [
            {"channel_name": "住宿-京郊度假村", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/ce682gi3"},
            {"channel_name": "住宿-酒店", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/ie1974su"},
            {"channel_name": "住宿-民宿", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/njfs2cb7"},
            {"channel_name": "住宿-农家院", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/70rbttge"},
            {"channel_name": "住宿-特色住宿", "url": "http://api-hq1712.visitbeijing.com.cn/article/list/1p2ze2jm"},
        ],
        # 环游号
        [
            {"channel_name": "环游号", "url": "http://mp.visitbeijing.com.cn/api/article/list/recommend"},
            {"channel_name": "环游号-游玩", "url": "http://mp.visitbeijing.com.cn/api/article/list/play"},
            {"channel_name": "环游号-美食", "url": "http://mp.visitbeijing.com.cn/api/article/list/food"},
            {"channel_name": "环游号-住宿", "url": "http://mp.visitbeijing.com.cn/api/article/list/house"},
            {"channel_name": "环游号-购物", "url": "http://mp.visitbeijing.com.cn/api/article/list/shopping"},
            {"channel_name": "环游号-娱展演", "url": "http://mp.visitbeijing.com.cn/api/article/list/ent"},
            {"channel_name": "环游号-行业", "url": "http://mp.visitbeijing.com.cn/api/article/list/industry"},
        ],
        # Travel pictures (旅游图片)
        [
            {"channel_name": "旅游图片", "url": "http://s.visitbeijing.com.cn/html/pic-6-1.shtml"},
        ],
    ]
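  • Style handling

There are as many parseX methods as there are list styles on the site. Add one entry to parse_list for each group in start_menu, in the same order:

        parse_list = [
            self.parse1,  # Shopping (购物)
            self.parse1,  # Food (美食)
            self.parse1,  # Video (视频)
            self.parse1,  # Culture (文化)
            self.parse1,  # Sightseeing (游玩)
            self.parse1,  # Accommodation (住宿)
            self.parse2,  # 环游号
            self.parse3,  # Travel pictures (旅游图片)
        ]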
  • Title & link & cover image
    Item_thumbImg is left unused for lists that carry no cover image.

        # Style 1 (JSON API); i is one entry of the article list in db
        html = response.body.decode('utf-8')
        db = json.loads(html)

        item['title'] = i['title'].strip()  # content title
        item['url'] = "http://www.visitbeijing.com.cn/a1/" + i['id']  # build the detail-page URL

        # Style 2 (JSON API); i is one entry of the article list in db
        html = response.body.decode('utf-8')
        db = json.loads(html)

        item['title'] = i['title'].strip()  # content title
        item['url'] = "http://mp.visitbeijing.com.cn/a1/" + i['id']  # build the detail-page URL

        # Style 3 (conventional HTML)
        Item_title = response.xpath('//div[@class="tao"]/a/@title').extract()  # list of article titles
        Item_url = response.xpath('//div[@class="tao"]/a/@href').extract()  # list of article links
        Item_thumbImg = response.xpath('//div[@class="tao"]/a/img/@src').extract()  # list of article cover images
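
The template itself is not shown in this article, so the following is only a minimal sketch of how the pieces above might fit together: each group in start_menu is paired with the parser at the same position in parse_list, and a JSON list response is turned into title/url records. The function names and the list_key default ("data") are assumptions made for illustration; the real key that holds the article list in the API response has to be checked in the browser.

```python
import json


def pair_groups_with_parsers(start_menu, parse_list):
    """Pair every channel with the parser registered for its group."""
    if len(start_menu) != len(parse_list):
        raise ValueError("start_menu and parse_list must have the same length")
    jobs = []
    for group, parser in zip(start_menu, parse_list):
        for channel in group:
            jobs.append((channel["channel_name"], channel["url"], parser))
    return jobs


def parse_json_list(body, detail_prefix, list_key="data"):
    """Turn a JSON list response into title/url records.

    list_key is hypothetical: the article does not show which key
    holds the article list in the API response.
    """
    db = json.loads(body)
    items = []
    for i in db.get(list_key, []):
        items.append({
            "title": i["title"].strip(),            # content title
            "url": detail_prefix + str(i["id"]),    # build the detail-page URL
        })
    return items
```

In a spider, each (channel_name, url, parser) tuple would typically become one request whose callback is the parser, with detail_prefix set to the prefix shown for style 1 or style 2.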

Parse_detail.py file under Spider

1. Fetch the content of the detail page

Adjust the selectors used for the detail pages linked from the lists. Two detail-page styles were found.

        # Process the detail page with its formatting kept; the whole matching block is captured
        item['content'] = ""
        if 'class="mod-content"' in response.text and len(None2Str(item['content'])) < 5:
            item['content'] = response.xpath('//div[@class="mod-content"]').extract_first()
        if 'id="Article"' in response.text and len(None2Str(item['content'])) < 5:
            item['content'] = response.xpath('//div[@id="Article"]').extract_first()
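
As the number of detail-page styles grows, the chain of if statements can be replaced by a loop over a list of XPath expressions. This is a sketch only, not the template's own code: first_matching_content and DETAIL_XPATHS are hypothetical names, and the function only assumes a response object that offers xpath(...).extract_first(), as Scrapy responses do.

```python
# XPath expressions for the two detail-page styles found so far.
DETAIL_XPATHS = [
    '//div[@class="mod-content"]',
    '//div[@id="Article"]',
]


def first_matching_content(response, xpaths, min_length=5):
    """Return the first extracted block of at least min_length characters.

    Returns an empty string when no expression matches, mirroring the
    "len(...) < 5" check used in the snippet above.
    """
    for xpath in xpaths:
        fragment = response.xpath(xpath).extract_first()
        if fragment and len(fragment) >= min_length:
            return fragment
    return ""
```

Supporting a newly discovered style then only means appending one more expression to DETAIL_XPATHS.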

2. Special instructions

Some websites are wildly inconsistent: 10 pages may use 9 different styles. Since we cannot open every page to inspect the markup of its detail page, a general approach is needed.

  • After the first crawl, open the MongoDB database and run the following command to find the records whose content contains "body". These are pages that did not match any of the specified styles, so the whole page was captured instead.
db.your_collection_name.find({content:/body/})

  • Open any of the returned links, add handling for that detail-page style, and repeat until the mongo command returns no more records.
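
The same check can be sketched in Python. FALLBACK_FILTER expresses the shell command above as a filter document using MongoDB's $regex operator, and fallback_urls applies the same test to records already loaded into memory. Both names are hypothetical, and the "url" and "content" field names are assumptions based on the item fields used earlier; adjust them to the fields actually stored.

```python
import re

# Same condition as db.<collection>.find({content: /body/}), as a filter document.
FALLBACK_FILTER = {"content": {"$regex": "body"}}


def fallback_urls(records, marker="body"):
    """Return the URLs of records whose content still contains the marker."""
    pattern = re.compile(marker)
    return [
        record["url"]
        for record in records
        if pattern.search(record.get("content") or "")
    ]
```

Each returned URL points at a detail page whose style is not yet handled; once the list is empty, every crawled page has matched one of the specified styles.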

Origin blog.csdn.net/qq_20288327/article/details/114079922