[Scrapy Five Minutes Website] [Energy Industry News] Scrapy in practice: crawling the entire China Coal News Network site

Target website introduction

China Coal News Network (www.cwestc.com) is a coal-industry portal. Its sections cover coal news, coal mines, coal prices, the coal market, coal transportation, second-hand equipment, coal supply and demand, coal mine mechanical and electrical equipment, coal mine jobs, coal industry technology, an alumni directory, coal price reports, technology...

Start Scrapy

Preparation for data collection

1. If the idea of crawling a whole website in five minutes is new to you, first read
[Scrapy Five Minutes Website] Basic knowledge of whole-site data crawling

2. If you are unsure how to analyze the business requirements and sort out the crawl targets, first read
[Scrapy Five Minutes Website] Sorting crawl targets and preparing data

3. If you have not seen the Scrapy template used to mass-produce spiders, read this first (required)
[Scrapy Five Minutes Website] General template for the data capture project framework

Data collation results

1. Screenshot of the results saved in Excel

Template application

<Project>.py file under Spider

1. Create the spider

scrapy genspider www_cwestc_com " "

2. Work out the list-page styles used across the site
First look at the markup of the list pages. The whole site uses only two list styles: one on the main site and one shared by the regional channels.

3. Modify the content of www_cwestc_com.py

Only the parts that need to be changed are explained here. Everything else is taken from the template and needs no modification.

  • Domain scope and site name
    allowed_domains = []
    web_name = "中国煤炭新闻网"  # China Coal News Network
  • Add the channels to crawl
    start_menu = [
        # Main site
        [
            {"channel_name": "煤炭新闻", "url": "http://www.cwestc.com/MroeNews.aspx"},
            {"channel_name": "政策法规-2008年政策法规", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=17&id=4"},
            {"channel_name": "政策法规-2007年政策法规", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=18&id=4"},
            {"channel_name": "政策法规-2006年政策法规", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=19&id=4"},
            {"channel_name": "政策法规-2005年政策法规", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=20&id=4"},
            {"channel_name": "政策法规-2004年政策法规", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=21&id=4"},
            {"channel_name": "政策法规-2003年政策法规", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=22&id=4"},
            {"channel_name": "政策法规-2002年政策法规", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=23&id=4"},
            {"channel_name": "政策法规-2001年政策法规", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=24&id=4"},
            {"channel_name": "政策法规-2000年政策法规", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=25&id=4"},
            {"channel_name": "政策法规-98-99年政策法规", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=26&id=4"},
            {"channel_name": "政策法规-97年前的政策法规", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=27&id=4"},
            {"channel_name": "新闻写作", "url": "http://www.cwestc.com/MroeNews.aspx?gd=35"},
            {"channel_name": "技术论文-开采方法", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=1&id=1"},
            {"channel_name": "技术论文-通风和安全", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=2&id=1"},
            {"channel_name": "技术论文-2006年煤业技术", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=6&id=1"},
            {"channel_name": "技术论文-开拓与掘进", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=7&id=1"},
            {"channel_name": "技术论文-地测与资环", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=8&id=1"},
            {"channel_name": "技术论文-矿山机械", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=9&id=1"},
            {"channel_name": "技术论文-洗选与综合利用", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=10&id=1"},
            {"channel_name": "技术论文-矿山电工", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=11&id=1"},
            {"channel_name": "技术论文-经济管理", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=14&id=1"},
            {"channel_name": "技术论文-信息与新技术", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=15&id=1"},
            {"channel_name": "技术论文-其他", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=16&id=1"},
            {"channel_name": "矿山安全-安全救护信息", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=28&id=3"},
            {"channel_name": "矿山安全-事故处理分析", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=29&id=3"},
            {"channel_name": "矿山安全-煤矿安全标准", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=30&id=3"},
            {"channel_name": "事故案例-顶板事故", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=31&id=2"},
            {"channel_name": "事故案例-瓦斯事故", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=32&id=2"},
            {"channel_name": "事故案例-运输事故", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=33&id=2"},
            {"channel_name": "事故案例-机电事故", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=34&id=2"},
            {"channel_name": "事故案例-放炮事故", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=35&id=2"},
            {"channel_name": "事故案例-水害事故", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=36&id=2"},
            {"channel_name": "事故案例-其他事故", "url": "http://www.cwestc.com/ShowAllContentMian.aspx?sendid=37&id=2"},
            {"channel_name": "煤市分析", "url": "http://www.cwestc.com/MroeNews.aspx?gd=33"},
            {"channel_name": "煤价行情", "url": "http://www.cwestc.com/MroeNews.aspx?gd=44"},
        ],
        # North China channel
        [
            {"channel_name": "华北频道-每日头条", "url": "http://huabei.cwestc.com/news/3.html"},
            {"channel_name": "华北频道-企业风采", "url": "http://huabei.cwestc.com/news/5.html"},
            {"channel_name": "华北频道-行业动态", "url": "http://huabei.cwestc.com/news/6.html"},
            {"channel_name": "华北频道-局矿快报", "url": "http://huabei.cwestc.com/news/4.html"},
            {"channel_name": "华北频道-矿山文学", "url": "http://huabei.cwestc.com/news/9.html"},
            {"channel_name": "华北频道-企业镜像", "url": "http://huabei.cwestc.com/news/8.html"},
            {"channel_name": "华北频道-人物专访", "url": "http://huabei.cwestc.com/news/11.html"},
            {"channel_name": "华北频道-行业先锋", "url": "http://huabei.cwestc.com/news/10.html"},
            {"channel_name": "华北频道-党群工作", "url": "http://huabei.cwestc.com/news/7.html"},
        ],
        # Northwest channel
        [
            {"channel_name": "西北频道-每日头条", "url": "http://sx.cwestc.com/news/3.html"},
            {"channel_name": "西北频道-企业风采", "url": "http://sx.cwestc.com/news/5.html"},
            {"channel_name": "西北频道-行业动态", "url": "http://sx.cwestc.com/news/6.html"},
            {"channel_name": "西北频道-局矿快报", "url": "http://sx.cwestc.com/news/4.html"},
            {"channel_name": "西北频道-矿山文学", "url": "http://sx.cwestc.com/news/9.html"},
            {"channel_name": "西北频道-企业镜像", "url": "http://sx.cwestc.com/news/8.html"},
            {"channel_name": "西北频道-人物专访", "url": "http://sx.cwestc.com/news/11.html"},
            {"channel_name": "西北频道-行业先锋", "url": "http://sx.cwestc.com/news/10.html"},
            {"channel_name": "西北频道-党群工作", "url": "http://sx.cwestc.com/news/7.html"},
        ],
        # Central China channel
        [
            {"channel_name": "华中频道-每日头条", "url": "http://huazhong.cwestc.com/news/3.html"},
            {"channel_name": "华中频道-企业风采", "url": "http://huazhong.cwestc.com/news/5.html"},
            {"channel_name": "华中频道-行业动态", "url": "http://huazhong.cwestc.com/news/6.html"},
            {"channel_name": "华中频道-局矿快报", "url": "http://huazhong.cwestc.com/news/4.html"},
            {"channel_name": "华中频道-矿山文学", "url": "http://huazhong.cwestc.com/news/9.html"},
            {"channel_name": "华中频道-企业镜像", "url": "http://huazhong.cwestc.com/news/8.html"},
            {"channel_name": "华中频道-人物专访", "url": "http://huazhong.cwestc.com/news/11.html"},
            {"channel_name": "华中频道-行业先锋", "url": "http://huazhong.cwestc.com/news/10.html"},
            {"channel_name": "华中频道-党群工作", "url": "http://huazhong.cwestc.com/news/7.html"},
        ],
        # Northeast channel
        [
            {"channel_name": "东北频道-每日头条", "url": "http://dongbei.cwestc.com/news/3.html"},
            {"channel_name": "东北频道-企业风采", "url": "http://dongbei.cwestc.com/news/5.html"},
            {"channel_name": "东北频道-行业动态", "url": "http://dongbei.cwestc.com/news/6.html"},
            {"channel_name": "东北频道-局矿快报", "url": "http://dongbei.cwestc.com/news/4.html"},
            {"channel_name": "东北频道-矿山文学", "url": "http://dongbei.cwestc.com/news/9.html"},
            {"channel_name": "东北频道-企业镜像", "url": "http://dongbei.cwestc.com/news/8.html"},
            {"channel_name": "东北频道-人物专访", "url": "http://dongbei.cwestc.com/news/11.html"},
            {"channel_name": "东北频道-行业先锋", "url": "http://dongbei.cwestc.com/news/10.html"},
            {"channel_name": "东北频道-党群工作", "url": "http://dongbei.cwestc.com/news/7.html"},
        ],
    ]
  • Map list styles to parse methods

Write one parseX method for each list style found on the site, then add one entry to parse_list for each group in start_menu, in the same order:

        parse_list = [
            self.parse1,  # main site
            self.parse2,  # North China channel
            self.parse2,  # Northwest channel
            self.parse2,  # Central China channel
            self.parse2,  # Northeast channel
        ]
  • Title, link, and cover image
    Item_thumbImg stays commented out because the list pages on this site have no cover images. The first group of selectors belongs in parse1 (main site), the second in parse2 (regional channels).
        # Main-site style: list content
        Item_title = response.xpath('//td[@align="left"]/b/strong/a/text()').extract()  # article titles
        Item_url = response.xpath('//td[@align="left"]/b/strong/a/@href').extract()  # article links
        # Item_thumbImg = response.xpath('//ul[@class="list-soft"]/li/a/img/@src').extract()  # article cover images

        # Regional-channel style: list content
        Item_title = response.xpath('//ul[@class="n-list"]/li/h2/a/text()').extract()  # article titles
        Item_url = response.xpath('//ul[@class="n-list"]/li/h2/a/@href').extract()  # article links
        # Item_thumbImg = response.xpath('//ul[@class="list-soft"]/li/a/img/@src').extract()  # article cover images
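The template that consumes start_menu and parse_list is not shown in this article, so the following is only a minimal sketch of how the two lists could be paired, assuming group i of start_menu is handled by parse_list[i]. The names plan_requests and pair_rows, the stand-in callbacks, and the sample link /news/1.html are hypothetical and not part of the template.

```python
from urllib.parse import urljoin


def plan_requests(start_menu, parse_list):
    """Yield (url, callback, channel_name) for every configured channel.

    Assumption: group i of start_menu is parsed by parse_list[i], so both
    lists must have the same length and order.
    """
    if len(start_menu) != len(parse_list):
        raise ValueError("start_menu and parse_list need one entry per group")
    for group, callback in zip(start_menu, parse_list):
        for channel in group:
            yield channel["url"], callback, channel["channel_name"].strip()


def pair_rows(titles, urls, page_url):
    """Zip the parallel title/link lists of one list page into records.

    Relative links are resolved against the list page's own URL.
    """
    return [
        {"title": title.strip(), "url": urljoin(page_url, href)}
        for title, href in zip(titles, urls)
    ]


# Hypothetical stand-ins for the spider's real parse methods.
def parse1(response): ...
def parse2(response): ...


demo_menu = [
    [{"channel_name": "煤炭新闻", "url": "http://www.cwestc.com/MroeNews.aspx"}],
    [{"channel_name": "华北频道-每日头条", "url": "http://huabei.cwestc.com/news/3.html"}],
]
planned = list(plan_requests(demo_menu, [parse1, parse2]))

# "/news/1.html" is a made-up link used only to show URL resolution.
rows = pair_rows(
    ["  Example title "],
    ["/news/1.html"],
    "http://huabei.cwestc.com/news/3.html",
)
```

Inside a spider, each planned tuple would typically become scrapy.Request(url, callback=callback, meta={"channel_name": name}). The length check is there because a missing parse_list entry would otherwise silently drop a whole channel group.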

Parse_detail.py file under Spider

1. Extract the content of the detail page

Adjust the selectors that locate the article body so they match the detail page markup of this site.

    # Keep the detail page formatting: the whole content container is captured as HTML
    item['content'] = ""
    if 'class="newsContent"' in response.text and len(None2Str(item['content'])) < 5:
        item['content'] = response.xpath('//td[@class="newsContent"]').extract_first()
    if 'class="content"' in response.text and len(None2Str(item['content'])) < 5:
        item['content'] = response.xpath('//div[@class="content"]').extract_first()
    if 'class="entry"' in response.text and len(None2Str(item['content'])) < 5:
        item['content'] = response.xpath('//div[@class="entry"]').extract_first()
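As more detail styles turn up, the chain of if statements above grows by two lines per style. One possible alternative, shown here only as a sketch, is to keep the rules in a list and try them in order. The rule values are copied from the snippet above; the names CONTENT_RULES and extract_content, and the idea of passing the XPath lookup in as a function, are assumptions for illustration and not part of the template. Unlike the original, this version returns an empty string when no rule yields at least five characters.

```python
# (marker that must appear in the page source, XPath of the content container)
CONTENT_RULES = [
    ('class="newsContent"', '//td[@class="newsContent"]'),
    ('class="content"', '//div[@class="content"]'),
    ('class="entry"', '//div[@class="entry"]'),
]


def extract_content(page_text, extract, rules=CONTENT_RULES, min_len=5):
    """Return the first rule's result that has at least min_len characters.

    page_text: raw HTML of the detail page (response.text in a spider).
    extract:   callable taking an XPath string and returning HTML or None.
    """
    for marker, xpath in rules:
        if marker not in page_text:
            continue  # this style is not used on the page
        content = extract(xpath) or ""
        if len(content) >= min_len:
            return content
    return ""  # no rule matched; the page needs a new rule


# Demo with a fake page and a dict standing in for the XPath lookup.
fake_page = '<div class="content"><p>Article body text</p></div>'
fake_results = {'//div[@class="content"]': fake_page}
content = extract_content(fake_page, fake_results.get)
```

In a spider the call could look like extract_content(response.text, lambda xp: response.xpath(xp).extract_first()). Supporting a new detail style then means adding one tuple to CONTENT_RULES.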

2. Special instructions

Some websites are built so inconsistently that ten pages can use nine different layouts. Since we cannot open every page to inspect the markup of its detail view, a general approach is needed.

  • After the first crawl, open the MongoDB database and run the following command to find the records whose content contains body. These are pages where none of the configured styles matched, so the whole page was captured instead. Replace your_collection_name with the name of your collection.
db.your_collection_name.find({content:/body/})


  • Open any of the returned links, add a rule for the detail style that page uses, and crawl again. Repeat until the mongo command returns no records.
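The same check can be run on records already loaded into Python, for example after exporting them from MongoDB. The sketch below applies the same body pattern as the mongo command to a list of dicts. The field name url, the function name pages_needing_rules, and the example.com sample records are assumptions for illustration; only the content field and the body pattern come from the command above.

```python
import re

# Same pattern as /body/ in the mongo command above.
BODY_PATTERN = re.compile(r"body")


def pages_needing_rules(records):
    """Return the URLs of records whose stored content still contains 'body'.

    Such records were most likely saved as a whole page because no detail
    rule matched, so each URL is a candidate for a new rule.
    """
    return [
        rec["url"]
        for rec in records
        if BODY_PATTERN.search(rec.get("content") or "")
    ]


# Made-up sample records; real ones would come from the crawl results.
sample = [
    {"url": "http://example.com/a", "content": '<td class="newsContent">text</td>'},
    {"url": "http://example.com/b", "content": "<html><body>whole page</body></html>"},
    {"url": "http://example.com/c", "content": None},
]
todo = pages_needing_rules(sample)
```

Like the mongo command, this also flags an article whose own text happens to contain the word body, so the returned links still need a manual look.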

Origin blog.csdn.net/qq_20288327/article/details/113700124