Crawling static web page information with Scrapy

1. Case description

Use Scrapy to crawl the headlines and URLs of the news list of the Hubei University of Economics at http://news.hbue.edu.cn/jyyw/list.htm.

Note: This case targets a static page. For a dynamic web page, the raw source code returned by the server may differ from the source code shown after the browser has processed it.
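
One quick way to tell the two cases apart is to check whether the element you want already appears in the raw HTML. The sketch below assumes two hypothetical HTML strings standing in for server responses; only the id wp_news_w7 comes from the spider code in this post, and the helper name is made up for illustration.

```python
def appears_in_static_source(html: str, marker: str) -> bool:
    """Return True if the marker text occurs in the raw HTML source."""
    return marker in html


# Hypothetical raw sources, standing in for what a server might return.
static_html = '<div id="wp_news_w7"><ul><li><a href="/a.htm" title="t">t</a></li></ul></div>'
dynamic_html = '<div id="app"></div><script src="bundle.js"></script>'

print(appears_in_static_source(static_html, 'id="wp_news_w7"'))
print(appears_in_static_source(dynamic_html, 'id="wp_news_w7"'))
```

If the marker is missing from the raw source but visible in the browser, the content is probably filled in by JavaScript and a plain Scrapy request will not see it.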

2. Scrapy operation

(1) In the Windows Start menu, search for cmd and open it.

(2) In cmd, switch to the location where the Scrapy project should be created. For example, to work in the python folder on the E: drive, first type e: to switch to the E: drive, then type cd python (cd, a space, then the folder name) to enter the python folder.

(3) Create a Scrapy project: scrapy startproject <project name>. If the project name is myscrapy, this creates a myscrapy subdirectory in the python folder. It contains several subdirectories and files, including the spiders directory.

(4) Still in cmd, enter the new project directory and generate the spider: scrapy genspider <spider name> <domain to crawl>, for example scrapy genspider Second news.hbue.edu.cn. You can also create the spider file manually.

(5) Rewrite the spider file that was just generated.

(6) Run it. On the command line, run scrapy crawl <spider name>. If you run the spider from PyCharm, you need to create a separate script file that launches the crawl.

3. The code

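import scrapy
from bs4 import BeautifulSoup
# Import the item class from items.py; items can be exported to any supported file format
from newscrapy.items import NewscrapyItem


class SecondSpider(scrapy.Spider):
    name = 'Second'
    start_urls = ['http://news.hbue.edu.cn/jyyw/list.htm']

    def parse(self, response):
        # Required: the item that carries the scraped data
        item = NewscrapyItem()
        # Each <li> in the news list, extracted as an HTML string
        newslist = response.xpath('//*[@id="wp_news_w7"]/ul/li').extract()
        # Lists used to collect the results
        urllist = []
        titlelist = []
        for news in newslist:
            bs = BeautifulSoup(news, 'lxml')
            a = bs.find('a')
            theurl = a.attrs['href']
            # Prefix relative links with the site root.
            # Note: an href that starts with "/" ends up with a double slash here.
            if 'http://news.hbue.edu.cn/' not in theurl:
                url = 'http://news.hbue.edu.cn/' + theurl
            else:
                url = theurl
            urllist.append(url)
            title = a.attrs['title']
            titlelist.append(title)
        # The whole page is stored as one item holding two lists
        item['url'] = urllist
        item['title'] = titlelist
        return item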
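
The prefixing step can also be handled by urljoin from the standard library, which avoids the double slash for hrefs that begin with "/". The sketch below is not the spider itself: it swaps in the standard-library html.parser so that it runs without Scrapy or BeautifulSoup, and the list markup in sample_html is hypothetical (the href of the first entry is inferred from the crawl result, not copied from the page). Inside a Scrapy spider, response.urljoin(href) plays the same role.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = 'http://news.hbue.edu.cn/'


class LinkCollector(HTMLParser):
    """Collect one {'title', 'url'} record for every <a> tag with an href."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attributes = dict(attrs)
        href = attributes.get('href')
        if href is None:
            return
        # urljoin resolves relative paths and leaves absolute URLs untouched.
        self.records.append({
            'title': attributes.get('title', ''),
            'url': urljoin(self.base_url, href),
        })


# Hypothetical markup; titles and paths follow the crawl result in this post.
sample_html = '''
<ul>
  <li><a href="/51/89/c7592a217481/page.htm" title="学生座谈会召开">学生座谈会召开</a></li>
  <li><a href="http://news.hbue.edu.cn/50/b1/c8154a217265/page.htm" title="【抗疫故事】抗疫事迹——校医院">【抗疫故事】抗疫事迹——校医院</a></li>
</ul>
'''

collector = LinkCollector(BASE_URL)
collector.feed(sample_html)
for record in collector.records:
    print(record['title'], record['url'])
```

Each record pairs one title with one URL, which is also a convenient shape if you later want one row per article.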

4. Store as a CSV file

If you edit the SecondSpider file in PyCharm, right-click the newscrapy directory and choose Mark Directory as → Sources Root in the pop-up menu. This turns the directory into a Python source root, so Python searches it when importing packages such as newscrapy.items. After you click the command, the newscrapy directory icon turns blue. Then modify the items.py file so that the item defines the two fields the spider fills in:

import scrapy


class NewscrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()    # list of article URLs
    title = scrapy.Field()  # list of article titles

Then create a new .py file anywhere in the newscrapy subdirectory (preferably not inside the spiders directory).

For example, create a script file named executeSecond.py in the directory where the items.py file is located:

from scrapy import cmdline
cmdline.execute('scrapy crawl Second -o news.csv'.split())
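
The .split() call only turns the command string into the argument list that cmdline.execute expects. As a small sketch, the list can also be built explicitly, which stays correct when the output path contains spaces. The helper name build_crawl_argv is hypothetical and not part of Scrapy.

```python
import shlex


def build_crawl_argv(spider_name, output_path):
    """Build the argument list for `scrapy crawl <spider> -o <file>`."""
    return ['scrapy', 'crawl', spider_name, '-o', output_path]


argv = build_crawl_argv('Second', 'news.csv')
# Same list that the string .split() in the script produces.
print(argv == 'scrapy crawl Second -o news.csv'.split())
# shlex.join renders the list back as a shell-style command line.
print(shlex.join(argv))

# Inside the Scrapy project, the list would be passed on like this:
# from scrapy import cmdline
# cmdline.execute(argv)
```

The last two lines are left as comments so that the sketch runs without Scrapy installed.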

5. Result

title,url
"学生座谈会召开,【抗疫故事】抗疫事迹——校医院,【人民日报】铭记历史 砥砺奋进——写在中国人民抗日战争暨世界反法西斯战争胜利75周年之际,【湖北日报】图书馆门口排长队 这所高校恢复372门次课堂,疫后重启课堂  师生精神饱满,【抗疫故事】抗疫事迹——亿优物业,党风廉政建设宣传教育月活动启动,学校疫情防控工作指挥部研究部署秋季开学后疫情防控工作,我校学子获中国大学生计算机设计大赛一等奖,经院社区党员干部下沉工作动员大会举行,我校迎来2020年秋季学期返校学生,2020年暑期辅导员培训会举办,秋季开学中层干部会议召开,学校召开校党委中心组扩大学习暨《谈治国理政》第三卷学习宣讲会","http://news.hbue.edu.cn//51/89/c7592a217481/page.htm,http://news.hbue.edu.cn/50/b1/c8154a217265/page.htm,http://news.hbue.edu.cn//51/32/c7592a217394/page.htm,http://news.hbue.edu.cn//51/1c/c7592a217372/page.htm,http://news.hbue.edu.cn//50/bb/c7592a217275/page.htm,http://news.hbue.edu.cn/50/a0/c8154a217248/page.htm,http://news.hbue.edu.cn//50/97/c7592a217239/page.htm,http://news.hbue.edu.cn//50/95/c7592a217237/page.htm,http://news.hbue.edu.cn//50/4b/c7592a217163/page.htm,http://news.hbue.edu.cn//50/48/c7592a217160/page.htm,http://news.hbue.edu.cn//50/26/c7592a217126/page.htm,http://news.hbue.edu.cn//50/03/c7592a217091/page.htm,http://news.hbue.edu.cn//4f/ee/c7592a217070/page.htm,http://news.hbue.edu.cn//4f/e9/c7592a217065/page.htm

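The result has a single data row because the spider returns one item whose title and url fields are lists, so all titles land in one cell and all URLs in another. If one row per article is preferred, the two lists can be paired up before writing. The sketch below uses only the standard csv module and a shortened copy of the data above; the helper name item_to_rows is hypothetical. In a Scrapy spider, yielding one item per article should have the same effect on the exported file.

```python
import csv
import io

# A shortened item in the shape the spider returns: two parallel lists.
item = {
    'title': ['学生座谈会召开', '【抗疫故事】抗疫事迹——校医院'],
    'url': [
        'http://news.hbue.edu.cn//51/89/c7592a217481/page.htm',
        'http://news.hbue.edu.cn/50/b1/c8154a217265/page.htm',
    ],
}


def item_to_rows(item):
    """Pair each title with its URL so every article becomes one row."""
    return [{'title': t, 'url': u} for t, u in zip(item['title'], item['url'])]


rows = item_to_rows(item)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['title', 'url'])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```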
Origin blog.csdn.net/sgsdsdd/article/details/109325080