[Scrapy Five-Minute Website] [Energy Industry News] Scrapy in practice: crawling the entire China Coal Market Network site

Target website introduction

China Coal Market Network is a comprehensive coal information platform that integrates coal news, coal market analysis, coal prices, and coal data, covering coal production, coal sales, coal consumption, coal ports, coal stocks, coal import and export, coal transportation, etc.

Start Scrapy

Preparation for data collection

1. If you are not familiar with the idea of crawling a whole site in five minutes, first read
[Scrapy Five-Minute Website] Basics of whole-site data crawling

2. If you are not familiar with organizing crawl targets and preparing the data, first read
[Scrapy Five-Minute Website] Crawler target sorting and data preparation

3. If you are not familiar with the reusable Scrapy template, first read (required)
[Scrapy Five-Minute Website] General template for the data capture project framework

Data collation results

1. Screenshot of the collated data saved in Excel

Template application

<Project>.py file under Spider

1. Create the spider

scrapy genspider www_cctd_com_cn " "

2. Examine the CSS styles of the site
Start by looking at the markup of a list page. The whole site uses one consistent layout, which makes the job easy.

3. Modify the content of www_cctd_com_cn.py

Only the parts that need to be changed are explained here. Everything else follows the template and needs no modification.

  • Scope & custom description
    allowed_domains = []
    web_name = "中国煤炭市场"
  • Add the channels to crawl
    start_menu = [
        # News and information channels
        [
            {"channel_name": "煤炭资讯-新闻资讯", "url": "https://www.cctd.com.cn/list-10-1.html"},
            {"channel_name": "煤炭资讯-资讯中心", "url": "https://www.cctd.com.cn/list-9-1.html"},
            {"channel_name": "煤炭资讯-CCTD原创", "url": "https://www.cctd.com.cn/list-42-1.html"},
            {"channel_name": "煤炭资讯-宏观经济", "url": "https://www.cctd.com.cn/list-15-1.html"},
            {"channel_name": "煤炭资讯-煤炭行业", "url": "https://www.cctd.com.cn/list-17-1.html"},
            {"channel_name": "煤炭资讯-钢铁行业", "url": "https://www.cctd.com.cn/list-18-1.html"},
            {"channel_name": "煤炭资讯-焦炭行业", "url": "https://www.cctd.com.cn/list-139-1.html"},
            {"channel_name": "煤炭资讯-电力行业", "url": "https://www.cctd.com.cn/list-19-1.html"},
            {"channel_name": "煤炭资讯-建材行业", "url": "https://www.cctd.com.cn/list-20-1.html"},
            {"channel_name": "煤炭资讯-交通行业", "url": "https://www.cctd.com.cn/list-23-1.html"},
            {"channel_name": "煤炭资讯-煤化工行业", "url": "https://www.cctd.com.cn/list-21-1.html"},
            {"channel_name": "煤炭资讯-煤炭综合", "url": "https://www.cctd.com.cn/list-176-1.html"},
            {"channel_name": "煤炭资讯-煤炭运行", "url": "https://www.cctd.com.cn/list-361-1.html"},
            {"channel_name": "煤炭资讯-煤炭进出口", "url": "https://www.cctd.com.cn/list-114-1.html"},
            {"channel_name": "煤炭资讯-国际煤炭", "url": "https://www.cctd.com.cn/list-113-1.html"},
            {"channel_name": "煤炭资讯-煤炭政策", "url": "https://www.cctd.com.cn/list-11-1.html"},
            {"channel_name": "煤炭资讯-煤炭企业", "url": "https://www.cctd.com.cn/list-22-1.html"},
            {"channel_name": "煤炭资讯-煤炭安全", "url": "https://www.cctd.com.cn/list-108-1.html"},
            {"channel_name": "煤炭资讯-煤炭资源", "url": "https://www.cctd.com.cn/list-109-1.html"},
            {"channel_name": "煤炭资讯-煤炭科技", "url": "https://www.cctd.com.cn/list-112-1.html"},
            {"channel_name": "煤炭资讯-节能环保", "url": "https://www.cctd.com.cn/list-115-1.html"},
            {"channel_name": "煤炭资讯-煤炭市场分析与评论", "url": "https://www.cctd.com.cn/list-14-1.html"},
            {"channel_name": "煤炭资讯-煤炭市场周报", "url": "https://www.cctd.com.cn/list-87-1.html"},
            {"channel_name": "煤炭资讯-煤炭市场月报", "url": "https://www.cctd.com.cn/list-88-1.html"},
            {"channel_name": "煤炭资讯-煤炭价格分析", "url": "https://www.cctd.com.cn/list-91-1.html"},
            {"channel_name": "煤炭资讯-煤炭生产情况", "url": "https://www.cctd.com.cn/list-98-1.html"},
            {"channel_name": "煤炭资讯-港口煤炭市场", "url": "https://www.cctd.com.cn/list-93-1.html"},
            {"channel_name": "煤炭资讯-煤炭销售情况", "url": "https://www.cctd.com.cn/list-138-1.html"},
            {"channel_name": "煤炭资讯-钢焦煤市场", "url": "https://www.cctd.com.cn/list-92-1.html"},
            {"channel_name": "煤炭资讯-煤炭海运市场", "url": "https://www.cctd.com.cn/list-125-1.html"},
            {"channel_name": "煤炭资讯-煤炭直供电厂", "url": "https://www.cctd.com.cn/list-122-1.html"},
            {"channel_name": "煤炭资讯-国际煤市点评", "url": "https://www.cctd.com.cn/list-95-1.html"},
            {"channel_name": "市场分析-煤炭市场分析&评论 ", "url": "https://www.cctd.com.cn/list-13-1.html"},
            {"channel_name": "市场分析-煤炭市场快报", "url": "https://www.cctd.com.cn/list-128-1.html"},
            {"channel_name": "市场分析-煤炭市场分析&评论", "url": "https://www.cctd.com.cn/list-13-1.html"},
            {"channel_name": "煤炭分析-价格", "url": "https://www.cctd.com.cn/list-91-1.html"},
            {"channel_name": "煤炭分析-钢焦煤", "url": "https://www.cctd.com.cn/list-92-1.html"},
            {"channel_name": "煤炭分析-港口", "url": "https://www.cctd.com.cn/list-93-1.html"},
            {"channel_name": "煤炭分析-无烟煤", "url": "https://www.cctd.com.cn/list-94-1.html"},
            {"channel_name": "煤炭分析-点评", "url": "https://www.cctd.com.cn/list-95-1.html"},
            {"channel_name": "煤炭分析-扫描", "url": "https://www.cctd.com.cn/list-96-1.html"},
            {"channel_name": "煤炭分析-上旬", "url": "https://www.cctd.com.cn/list-89-1.html"},
            {"channel_name": "煤炭分析-中旬", "url": "https://www.cctd.com.cn/list-90-1.html"},
            {"channel_name": "煤炭分析-生产", "url": "https://www.cctd.com.cn/list-98-1.html"},
            {"channel_name": "煤炭分析-运输", "url": "https://www.cctd.com.cn/list-99-1.html"},
            {"channel_name": "煤炭分析-电力", "url": "https://www.cctd.com.cn/list-100-1.html"},
            {"channel_name": "煤炭分析-钢铁", "url": "https://www.cctd.com.cn/list-101-1.html"},
            {"channel_name": "煤炭分析-焦炭", "url": "https://www.cctd.com.cn/list-102-1.html"},
            {"channel_name": "煤炭分析-建材", "url": "https://www.cctd.com.cn/list-103-1.html"},
            {"channel_name": "煤炭分析-化工", "url": "https://www.cctd.com.cn/list-104-1.html"},
            {"channel_name": "煤炭分析-煤炭动态", "url": "https://www.cctd.com.cn/list-116-1.html"},
            {"channel_name": "煤炭分析-动力煤", "url": "https://www.cctd.com.cn/list-133-1.html"},
            {"channel_name": "煤炭分析-综述", "url": "https://www.cctd.com.cn/list-118-1.html"},
            {"channel_name": "煤炭分析-国际", "url": "https://www.cctd.com.cn/list-135-1.html"},
            {"channel_name": "煤炭分析-炼焦煤", "url": "https://www.cctd.com.cn/list-134-1.html"},
            {"channel_name": "煤炭分析-库存", "url": "https://www.cctd.com.cn/list-120-1.html"},
            {"channel_name": "煤炭分析-价格", "url": "https://www.cctd.com.cn/list-121-1.html"},
            {"channel_name": "煤炭分析-进出口", "url": "https://www.cctd.com.cn/list-123-1.html"},
            {"channel_name": "煤炭分析-直供电厂", "url": "https://www.cctd.com.cn/list-122-1.html"},
            {"channel_name": "煤炭分析-进出口", "url": "https://www.cctd.com.cn/list-123-1.html"},
            {"channel_name": "煤炭分析-港口", "url": "https://www.cctd.com.cn/list-124-1.html"},
            {"channel_name": "煤炭分析-运价", "url": "https://www.cctd.com.cn/list-125-1.html"},
            {"channel_name": "煤炭分析-冶金", "url": "https://www.cctd.com.cn/list-126-1.html"},
            {"channel_name": "煤炭分析-生产", "url": "https://www.cctd.com.cn/list-136-1.html"},
            {"channel_name": "煤炭分析-运输", "url": "https://www.cctd.com.cn/list-137-1.html"},
            {"channel_name": "煤炭分析-销售", "url": "https://www.cctd.com.cn/list-138-1.html"},
            {"channel_name": "煤炭分析-市场观察员", "url": "https://www.cctd.com.cn/list-44-1.html"},
            {"channel_name": "煤炭分析-指数报告", "url": "https://www.cctd.com.cn/list-45-1.html"},
            {"channel_name": "煤炭分析-全国", "url": "https://www.cctd.com.cn/list-91-1.html"},
            {"channel_name": "煤炭分析-CCTD秦皇岛", "url": "https://www.cctd.com.cn/list-463-1.html"},
            {"channel_name": "煤炭分析-内蒙古", "url": "https://www.cctd.com.cn/list-47-1.html"},
            {"channel_name": "煤炭分析-山西", "url": "https://www.cctd.com.cn/list-49-1.html"},
            {"channel_name": "煤炭分析-陕西", "url": "https://www.cctd.com.cn/list-48-1.html"},
            {"channel_name": "煤炭分析-湖北", "url": "https://www.cctd.com.cn/list-642-1.html"},
            {"channel_name": "煤炭分析-重庆", "url": "https://www.cctd.com.cn/list-556-1.html"},
            {"channel_name": "煤炭分析-榆林", "url": "https://www.cctd.com.cn/list-543-1.html"},
            {"channel_name": "煤炭分析-长江口", "url": "https://www.cctd.com.cn/list-423-1.html"},
            {"channel_name": "动力煤期货-新闻资讯 ", "url": "https://www.cctd.com.cn/list-609-1.html"},
            {"channel_name": "动力煤期货-投研观点 ", "url": "https://www.cctd.com.cn/list-610-1.html"},
            {"channel_name": "动力煤期货-交割情况 ", "url": "https://www.cctd.com.cn/list-611-1.html"},
            {"channel_name": "动力煤期货-动力煤期货高级分析师 ", "url": "https://www.cctd.com.cn/list-625-1.html"},
            {"channel_name": "动力煤期货-活动动态", "url": "https://www.cctd.com.cn/list-607-1.html"},
            {"channel_name": "动力煤期货-公告与通知 ", "url": "https://www.cctd.com.cn/list-606-1.html"},
            {"channel_name": "资讯中心-新闻资讯-热点数据 ", "url": "https://www.cctd.com.cn/list-576-1.html"},
        ]
    ]
  • Register the list-page parsers

Write one parseX method for each distinct list-page layout on the site and add each of them to parse_list. This site uses a single layout, so parse1 is enough.

        parse_list = [
            self.parse1,
        ]
  • Title, link and cover image
    Item_thumbImg is not used because the list pages on this site contain no pictures.
        Item_title = response.xpath('//td[@style="padding-left: 10px"]/li/a/text()').extract()  # list of article titles
        Item_url = response.xpath('//td[@style="padding-left: 10px"]/li/a/@href').extract()  # list of article links
        # Item_thumbImg = response.xpath('//tag[@class="attribute"]/li/a/img/@src').extract()  # list of article cover images
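
The two XPath queries return parallel lists, so each title has to be matched with its link by position. Below is a minimal sketch of that pairing step, assuming the hrefs may be relative and that the list page URL serves as the base. `pair_list_items` is a hypothetical helper, not part of the template, and the sample titles and `show-...` paths are made up for illustration.

```python
from urllib.parse import urljoin


def pair_list_items(titles, hrefs, base_url):
    """Pair each title with its absolute URL, skipping blank titles."""
    items = []
    for title, href in zip(titles, hrefs):
        title = title.strip()
        if not title:
            continue
        # urljoin leaves absolute hrefs untouched and resolves relative ones.
        items.append({"title": title, "url": urljoin(base_url, href)})
    return items


# Made-up values standing in for Item_title and Item_url.
sample_titles = [" Coal prices hold steady ", "Port inventories rise"]
sample_hrefs = ["/show-10-1001-1.html", "https://www.cctd.com.cn/show-10-1002-1.html"]
pairs = pair_list_items(
    sample_titles, sample_hrefs, "https://www.cctd.com.cn/list-10-1.html"
)
```

Note that `zip` stops at the shorter list, so a link without text would shift the pairing. If that is a concern, selecting the `a` elements once and reading the text and href from each element avoids the problem.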

Parse_detail.py file under Spider

1. Fetch the content of the detail page

Adjust the selectors used to extract the content of the detail pages linked from the list.

    # Handle the detail page with its formatting intact; the whole block is captured here
    item['content'] = ""
    if 'class="news_show"' in response.text and len(None2Str(item['content'])) < 5:
        item['content'] = response.xpath('//div[@class="news_show"]').extract_first()
    if 'id="Zoom"' in response.text and len(None2Str(item['content'])) < 5:
        item['content'] = response.xpath('//div[@id="Zoom"]').extract_first()
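
As more detail-page layouts turn up, the chain of `if` blocks can be kept as a table of rules instead. Below is a minimal sketch, assuming the first rule whose marker appears in the page should win and that a page matching no rule falls back to the whole `<body>`. `DETAIL_RULES`, `FALLBACK_XPATH` and `pick_content_xpath` are hypothetical names, not part of the template.

```python
# Each rule: (marker that must appear in the raw HTML, XPath of the content block)
DETAIL_RULES = [
    ('class="news_show"', '//div[@class="news_show"]'),
    ('id="Zoom"', '//div[@id="Zoom"]'),
]

# Assumption: capture the whole page when no rule matches.
FALLBACK_XPATH = "//body"


def pick_content_xpath(html, rules=DETAIL_RULES):
    """Return the XPath of the first rule whose marker occurs in the HTML."""
    for marker, xpath in rules:
        if marker in html:
            return xpath
    return FALLBACK_XPATH
```

Inside the spider this might be used as `item['content'] = response.xpath(pick_content_xpath(response.text)).extract_first()`. Unlike the original chain, this sketch does not try the next rule when the first match yields fewer than 5 characters.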

2. Special instructions

Some sites are wildly inconsistent: ten pages can use nine different layouts. Since it is impractical to open every page and inspect the markup of its detail view, here is a general approach.

  • After the first crawl, open the MongoDB database and run the following query to list the documents whose content contains body. These are the pages that matched none of the configured styles, so the whole page was stored instead.
db.your_collection_name.find({content: /body/})

  • Open any of the returned links, add handling for that page's detail layout, and repeat until the query returns nothing.
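
The same check can be run from Python on documents already loaded into memory. Below is a minimal sketch, assuming each document has `url` and `content` fields. Only `content` appears in the query above; the `url` field name, `unmatched_urls` and the sample documents are assumptions made for illustration.

```python
import re

# Same pattern as the /body/ regex in the MongoDB query.
BODY_PATTERN = re.compile(r"body")


def unmatched_urls(documents):
    """Return the URLs of documents whose content still holds the whole page."""
    return [
        doc.get("url")
        for doc in documents
        if BODY_PATTERN.search(doc.get("content") or "")
    ]


# Made-up documents standing in for the crawl results.
sample_docs = [
    {"url": "https://www.cctd.com.cn/a.html", "content": '<div class="news_show">ok</div>'},
    {"url": "https://www.cctd.com.cn/b.html", "content": "<body><p>whole page</p></body>"},
    {"url": "https://www.cctd.com.cn/c.html", "content": None},
]
todo = unmatched_urls(sample_docs)
```

An empty `todo` list corresponds to the query returning nothing, which is the stopping condition described above.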


Origin blog.csdn.net/qq_20288327/article/details/113673065