Crawling 1,907 "course learning" records to analyze which types of learning resources college students favor most

1. Introduction


The previous article, [Take "Station B" as a practical case! Teach you to master the necessary framework for crawlers "Scrapy"], used Scrapy to crawl Bilibili ("Station B") data. This article improves that code to crawl more fields and save the results to a CSV file.

A total of 1,907 "course learning" records were crawled to analyze which types of learning resources college students favor most, and the results are displayed visually!

2. Data acquisition

The program builds on the one from [Take "Station B" as a practical case! Teach you to master the necessary framework for crawlers "Scrapy"].

1. The Scrapy project files

items.py file


import scrapy


class BiliItem(scrapy.Item):
    # Video title
    title = scrapy.Field()
    # Link
    url = scrapy.Field()
    # View count
    watchnum = scrapy.Field()
    # Danmaku (bullet-comment) count
    dm = scrapy.Field()
    # Upload time
    uptime = scrapy.Field()
    # Uploader
    upname = scrapy.Field()

Four new fields have been added: view count (watchnum), danmaku count (dm), upload time (uptime), and uploader (upname).

lyc.py spider file


import scrapy

from ..items import BiliItem  # adjust if your items module lives elsewhere


class LycSpider(scrapy.Spider):
    name = 'lyc'
    allowed_domains = ['bilibili.com']
    start_urls = ['https://search.bilibili.com/all?keyword=大学课程&page=40']

    # Parse one page of search results
    def parse(self, response):
        # Match each result entry on the page
        for jobs_primary in response.xpath('//*[@id="all-list"]/div[1]/ul/li'):
            item = BiliItem()  # a fresh item per entry
            item['title'] = (jobs_primary.xpath('./a/@title').extract())[0]
            item['url'] = (jobs_primary.xpath('./a/@href').extract())[0]
            item['watchnum'] = (jobs_primary.xpath('./div/div[3]/span[1]/text()').extract())[0].replace("\n", "").replace(" ", "")
            item['dm'] = (jobs_primary.xpath('./div/div[3]/span[2]/text()').extract())[0].replace("\n", "").replace(" ", "")
            item['uptime'] = (jobs_primary.xpath('./div/div[3]/span[3]/text()').extract())[0].replace("\n", "").replace(" ", "")
            item['upname'] = (jobs_primary.xpath('./div/div[3]/span[4]/a/text()').extract())[0]

            # yield, not return, so the loop keeps producing items
            yield item

        # URL of the current page
        url = response.request.url
        # Increment the page number (the last query parameter); splitting on
        # the final "=" also handles multi-digit page numbers correctly
        base, page = url.rsplit("=", 1)
        new_link = base + "=" + str(int(page) + 1)
        # Request the next page with the same callback
        yield scrapy.Request(new_link, callback=self.parse)

XPath extraction is added for the four new fields after inspecting the page's tag structure.
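
One more note on parse(): it follows the next page unconditionally, and the original post does not show a stop condition. One option (an assumption on my part, not the author's code) is Scrapy's built-in CLOSESPIDER_PAGECOUNT setting, which shuts the spider down after a fixed number of responses:


# settings.py: stop the crawl after a fixed number of pages
# (one possible stop condition; not shown in the original post)
CLOSESPIDER_PAGECOUNT = 100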

pipelines.py file


import csv


class BiliPipeline:

    def __init__(self):
        # Open the output file for appending; newline="" stops the csv module
        # from writing blank lines between rows on Windows
        self.f = open("lyc大学课程.csv", "a", newline="")
        # Field names must match the keys of the items the spider yields
        self.fieldnames = ["title", "url", "watchnum", "dm", "uptime", "upname"]
        # Write rows as dictionaries keyed by field name
        self.writer = csv.DictWriter(self.f, fieldnames=self.fieldnames)
        # Write the header row once, which is why it lives in __init__
        self.writer.writeheader()

    def process_item(self, item, spider):
        print("title:", item['title'])
        print("url:", item['url'])
        print("watchnum:", item['watchnum'])
        print("dm:", item['dm'])
        print("uptime:", item['uptime'])
        print("upname:", item['upname'])

        # Write the values received from the spider
        self.writer.writerow(dict(item))
        # Return the item so any later pipelines receive it
        return item

    # Scrapy calls close_spider (not close) when the spider finishes
    def close_spider(self, spider):
        self.f.close()

The crawled content is saved to a CSV file (lyc大学课程.csv).
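
For Scrapy to actually run this pipeline, it has to be registered in the project's settings.py. A minimal sketch, assuming the project module is named Bili (the folder name that appears in the pandas read path later):


# settings.py: register the pipeline (module path "Bili" is assumed)
ITEM_PIPELINES = {
    "Bili.pipelines.BiliPipeline": 300,
}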

2. Start Scrapy


scrapy crawl lyc

The above command starts the Scrapy crawl.

3. Crawl results

A total of 1,914 records were crawled; after simple cleaning, 1,907 usable records remained!
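
The cleaning step itself is not shown in the original post. A minimal sketch of what it might look like with pandas, assuming the cleaning amounts to dropping duplicate URLs and incomplete rows:


import pandas as pd

# Hypothetical cleaning step (not the author's code): drop duplicate
# URLs and rows with missing fields, then write the file back.
df = pd.read_csv('Bili\\lyc大学课程.csv', encoding="gbk")
df = df.drop_duplicates(subset="url").dropna()
df.to_csv('Bili\\lyc大学课程.csv', index=False, encoding="gbk")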

3. Data analysis

1. Ranking of college students' learning videos by view count

Read the data


import pandas as pd

# The CSV was written with the Windows default (GBK) encoding
dataset = pd.read_csv('Bili\\lyc大学课程.csv', encoding="gbk")
title = dataset['title'].tolist()
url = dataset['url'].tolist()
watchnum = dataset['watchnum'].tolist()
dm = dataset['dm'].tolist()
uptime = dataset['uptime'].tolist()
upname = dataset['upname'].tolist()

Data processing


# Analysis 1 (view counts) & Analysis 2 (danmaku counts)
def getdata1_2():
    watchnum_dict = {}
    dm_dict = {}
    for i in range(0, len(watchnum)):
        # "万" means 10,000, so e.g. "202万" becomes 2020000
        if "万" in watchnum[i]:
            watchnum[i] = int(float(watchnum[i].replace("万", "")) * 10000)
        else:
            watchnum[i] = int(watchnum[i])

        if "万" in dm[i]:
            dm[i] = int(float(dm[i].replace("万", "")) * 10000)
        else:
            dm[i] = int(dm[i])

        watchnum_dict[title[i]] = watchnum[i]
        dm_dict[title[i]] = dm[i]

    # Sort ascending by count (ties broken by title)
    watchnum_dict = sorted(watchnum_dict.items(), key=lambda kv: (kv[1], kv[0]))
    dm_dict = sorted(dm_dict.items(), key=lambda kv: (kv[1], kv[0]))
    # Analysis 1: ranking of learning videos by view count
    analysis1(watchnum_dict, "大学生学习视频播放量排名")
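
analysis1() is called here, but its definition does not appear in the post. A plausible sketch, assuming it takes the ascending (title, count) pairs, keeps the top ten, and hands them to the pie() helper shown below:


# Hypothetical analysis1 (not shown in the original post): take the
# ascending (title, count) pairs, keep the N largest, and draw a pie chart.
def analysis1(sorted_items, tips, top_n=10):
    top = sorted_items[-top_n:]  # list is sorted ascending, so the tail holds the top N
    names = [t for t, _ in top]
    values = [v for _, v in top]
    pie(names, values, tips, tips)  # reuse the pie() helper defined below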

Data visualization


from pyecharts import options as opts
from pyecharts.charts import Pie


# Pie chart helper
def pie(name, value, picname, tips):
    c = (
        Pie()
        .add(
            "",
            [list(z) for z in zip(name, value)],
            # Center of the pie: the first value is the horizontal position,
            # the second the vertical, both as percentages of the container
            center=["35%", "50%"],
        )
        .set_colors(["blue", "green", "yellow", "red", "pink", "orange", "purple"])  # slice colors
        .set_global_opts(
            title_opts=opts.TitleOpts(title="" + str(tips)),
            legend_opts=opts.LegendOpts(type_="scroll", pos_left="70%", orient="vertical"),  # legend position
        )
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
        .render(str(picname) + ".html")
    )

Analysis

1. [Film] "Classroom on Earth" has the highest broadcast volume, with 2.02 million views.
2. Learning from the content of university courses at station B is far less attractive than some interesting topics in the classroom content.

2. Ranking of college students' learning videos by danmaku count

Data processing


The processing is identical to getdata1_2() above (the view-count and danmaku dictionaries are built and sorted in the same pass); the only new step is passing the sorted danmaku dictionary to the chart:


# Analysis 2: ranking of learning videos by danmaku count
analysis1(dm_dict, "大学生学习视频弹幕量排名")

Data visualization


###饼状图
def pie(name,value,picname,tips):
    c = (
        Pie()
            .add(
            "",
            [list(z) for z in zip(name, value)],
            # 饼图的中心(圆心)坐标,数组的第一项是横坐标,第二项是纵坐标
            # 默认设置成百分比,设置成百分比时第一项是相对于容器宽度,第二项是相对于容器高度
            center=["35%", "50%"],
        )
            .set_colors(["blue", "green", "yellow", "red", "pink", "orange", "purple"])  # 设置颜色
            .set_global_opts(
            title_opts=opts.TitleOpts(title=""+str(tips)),
            legend_opts=opts.LegendOpts(type_="scroll", pos_left="70%", orient="vertical"),  # 调整图例位置
        )
            .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
            .render(str(picname)+".html")
    )

Analysis

1. In the danmaku ranking, "Data Structure and Algorithm Foundation" ranks highest, with 33,000 danmaku.
2. The danmaku ranking shows which kinds of classroom videos viewers most like to comment on.
3. Compared with view counts, college students are much more willing to post comments on course-content learning videos!

3. Number of learning videos per uploader (up主)

Data processing


# Analysis 3: number of learning videos per uploader (up主)
def getdata3():
    upname_dict = {}
    # Tally how many videos each uploader has
    for key in upname:
        upname_dict[key] = upname_dict.get(key, 0) + 1
    # Sort ascending by count (ties broken by name)
    upname_dict = sorted(upname_dict.items(), key=lambda kv: (kv[1], kv[0]))
    itemNames = []
    datas = []
    # Walk the tail of the ascending list to collect the top 20
    for i in range(len(upname_dict) - 1, len(upname_dict) - 21, -1):
        itemNames.append(upname_dict[i][0])
        datas.append(upname_dict[i][1])
    # Draw the bar chart
    bars(itemNames, datas)
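
As an aside, the manual tally above can be written more compactly with collections.Counter; an equivalent sketch:


from collections import Counter

# Equivalent tally: most_common(20) returns the top 20 (name, count)
# pairs in descending order, replacing the sort-and-slice above.
top20 = Counter(upname).most_common(20)
itemNames = [name for name, _ in top20]
datas = [count for _, count in top20]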

Data visualization


from pyecharts import options as opts
from pyecharts.charts import Bar
from pyecharts.globals import ThemeType


# Bar chart helper
def bars(name, dict_values):
    # Chained calls
    c = (
        Bar(
            init_opts=opts.InitOpts(  # initial options
                theme=ThemeType.MACARONS,
                animation_opts=opts.AnimationOpts(
                    animation_delay=1000, animation_easing="cubicOut"  # initial animation delay and easing
                ))
        )
        .add_xaxis(xaxis_data=name)  # x axis
        .add_yaxis(series_name="up主昵称", y_axis=dict_values)  # y axis (keyword is y_axis in pyecharts v1)
        .set_global_opts(
            title_opts=opts.TitleOpts(title='李运辰', subtitle='up视频数',  # title text and position
                                      title_textstyle_opts=opts.TextStyleOpts(
                                          font_family='SimHei', font_size=25, font_weight='bold', color='red',
                                      ), pos_left="90%", pos_top="10",
                                      ),
            xaxis_opts=opts.AxisOpts(name='up主昵称', axislabel_opts=opts.LabelOpts(rotate=45)),
            # rotate the x labels 45 degrees so long uploader names stay readable
            yaxis_opts=opts.AxisOpts(name='大学生学习视频视频数'),
        )
        .render("up主大学生学习视频视频数.html")
    )

Analysis

1. This chart ranks uploaders (up主) of university course videos by how many college learning videos each has posted.
2. The uploader with the most university course videos is "Xiaobai is studying".

4. Word cloud of university course titles

Data processing


text = "".join(title)
with open("stopword.txt","r", encoding='UTF-8') as f:
    stopword = f.readlines()
for i in stopword:
    print(i)
    i = str(i).replace("\r\n","").replace("\r","").replace("\n","")
    text = text.replace(i, "")
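
A note on ordering: removing stopwords by substring replacement before tokenizing can clip parts of longer words. An alternative (my suggestion, not the author's approach) is to tokenize first and then filter the tokens:


import jieba

# Alternative: tokenize first, then drop tokens that are stopwords
with open("stopword.txt", encoding="UTF-8") as f:
    stopwords = {line.strip() for line in f}
tokens = [w for w in jieba.cut(text) if w.strip() and w not in stopwords]
result = " ".join(tokens)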

Data visualization


import jieba
from stylecloud import gen_stylecloud

# Tokenize with jieba and separate the words with spaces
word_list = jieba.cut(text)
result = " ".join(word_list)
# Build the Chinese word cloud; icon_name picks the mask shape.
# Other options: 'fas fa-dragon' (dragon), 'fas fa-dog' (dog),
# 'fas fa-cat' (cat), 'fas fa-dove' (dove)
icon_name = 'fab fa-qq'  # QQ penguin
# A Chinese font must be supplied, otherwise the characters render incorrectly
gen_stylecloud(text=result, icon_name=icon_name, font_path='simsun.ttc',
               output_name="大学课程名称词云化.png")

Analysis

1. Courses from Peking University and Tsinghua University dominate; most course titles mention one of these two universities.
2. Most video titles revolve around keywords such as basics, open courses, courseware, the postgraduate entrance examination, and college physics.

4. Summary

1. Used the Scrapy framework to crawl 1,907 records of "Station B" university course learning resources.
2. Visualized the data and summarized the findings in a brief analysis.
3. The data may hold more information that has not yet been analyzed or mined; please leave a message below with your valuable suggestions!
