Python crawls 44,130 user viewing records from Douban and mines the hidden relationships between users and movies

1. Introduction



Before watching a movie, many people like to check the reviews on "Douban" first, so I crawled 44,130 user viewing records from the Douban platform to analyze the relationships between users, the connections between movies, and the hidden relationships between users and movies.


2. Crawl viewing data

Data Sources


https://movie.douban.com/


Crawl user viewing data on the "Douban" platform.

Crawl user list

Page analysis


To obtain a list of users, I chose the reviews of one movie and collected the names of the users who commented (only the user name is needed later to crawl that user's viewing records).

https://movie.douban.com/subject/24733428/reviews?start=0

The start parameter in the URL is the page offset (page * 20, with 20 records per page), so start = 0, 20, 40..., always a multiple of 20. By changing the start value, the names of these 4,614 users can be collected.
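This pagination scheme can be sketched as a small helper (the URL is the one above; the helper name is invented for illustration):

```python
# Hypothetical helper: build the review-page URLs. Each page holds 20
# reviews, so the start parameter advances in steps of 20.
BASE = "https://movie.douban.com/subject/24733428/reviews?start={}"

def review_urls(pages, per_page=20):
    return [BASE.format(page * per_page) for page in range(pages)]

urls = review_urls(3)  # first three pages: start=0, 20, 40
```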


Inspecting the page's HTML, you can find the tag attribute that holds the user name.

Implementation


import requests
from lxml import etree

# headers: request headers with a browser User-Agent, defined elsewhere
i = 0  # page index; the start parameter is i * 20
url = "https://movie.douban.com/subject/24733428/reviews?start=" + str(i * 20)
r = requests.get(url, headers=headers)
r.encoding = 'utf8'
selector = etree.HTML(r.content)

for item in selector.xpath('//*[@class="review-list  "]/div'):
    # The second <a> inside .main-hd links to the reviewer's profile,
    # so the user id can be cut out of its href
    userid = item.xpath('.//*[@class="main-hd"]/a[2]/@href')[0] \
        .replace("https://www.douban.com/people/", "").replace("/", "")
    username = item.xpath('.//*[@class="main-hd"]/a[2]/text()')[0]
    print(userid)
    print(username)
    print("-----")

Crawl users' viewing records

The previous step collected the user names; next, each user name is used to crawl that user's viewing records.

Page analysis


# https://movie.douban.com/people/{user_name}/collect?start=15&sort=time&rating=all&filter=all&mode=grid
https://movie.douban.com/people/mumudancing/collect?start=15&sort=time&rating=all&filter=all&mode=grid

By changing the user name, you can obtain different users' viewing records.

The start parameter in the URL is the page offset (page * 15, with 15 records per page), so start = 0, 15, 30..., always a multiple of 15. By changing the start value, all 1,768 viewing records of this user can be collected.
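The per-user URL template can be sketched the same way (the helper name is invented for illustration):

```python
# Hypothetical helper: build a user's collect-page URL. Each page holds
# 15 records, so the start parameter advances in steps of 15.
COLLECT = ("https://movie.douban.com/people/{user}/collect"
           "?start={start}&sort=time&rating=all&filter=all&mode=grid")

def collect_url(user, page, per_page=15):
    return COLLECT.format(user=user, start=page * per_page)

url = collect_url("mumudancing", 1)  # second page: start=15
```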


Inspecting the page's HTML, you can find the tag attribute that holds the movie title.

Implementation


import requests
from lxml import etree

# headers: request headers with a browser User-Agent, defined elsewhere
url = ("https://movie.douban.com/people/mumudancing/collect"
       "?start=15&sort=time&rating=all&filter=all&mode=grid")
r = requests.get(url, headers=headers)
r.encoding = 'utf8'
selector = etree.HTML(r.content)

for item in selector.xpath('//*[@class="grid-view"]/div[@class="item"]'):
    # <em> holds the main title; the following text node holds the alias
    text1 = item.xpath('.//*[@class="title"]/a/em/text()')[0].replace(" ", "")
    text2 = item.xpath('.//*[@class="title"]/a/text()')[1].replace(" ", "").replace("\n", "")
    print(text1 + text2)
    print("-----")

Save to excel

Define header


import xlwt

# Initialize the Excel workbook
def initexcel(filename):
    # Create a workbook with UTF-8 encoding
    workbook = xlwt.Workbook(encoding='utf-8')
    # Create a worksheet
    worksheet = workbook.add_sheet('sheet1')
    workbook.save(str(filename) + '.xls')
    # Write the header row: user, review
    value1 = [["用户", "影评"]]
    book_name_xls = str(filename) + '.xls'
    write_excel_xls_append(book_name_xls, value1)

The Excel sheet has two header columns: 用户 (user) and 影评 (movie review).

Write to excel


import xlrd
from xlutils.copy import copy

# Append rows to an existing .xls file
def write_excel_xls_append(path, value):
    index = len(value)  # number of rows to append
    workbook = xlrd.open_workbook(path)  # open the workbook
    sheets = workbook.sheet_names()  # names of all sheets in the workbook
    worksheet = workbook.sheet_by_name(sheets[0])  # take the first sheet
    rows_old = worksheet.nrows  # rows already present
    new_workbook = copy(workbook)  # copy the read-only xlrd object into a writable xlwt object
    new_worksheet = new_workbook.get_sheet(0)  # first sheet of the copy
    for i in range(0, index):
        for j in range(0, len(value[i])):
            new_worksheet.write(i + rows_old, j, value[i][j])  # write after the existing rows
    new_workbook.save(path)  # save the workbook

With this append function defined, each page of scraped data can be saved to Excel as soon as it is collected.
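The same append-per-page pattern can be sketched with only the standard library, using csv in place of xls (the file name and helper are hypothetical, not from the article):

```python
import csv
import os

def append_rows(path, rows):
    # Append rows to a CSV file, writing the header only when the file is new
    is_new = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["用户", "影评"])  # header: user, review
        writer.writerows(rows)

append_rows("douban_demo.csv", [["user1", "心灵奇旅/Soul"]])
append_rows("douban_demo.csv", [["user2", "信条/Tenet"]])
```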


In the end, 44,130 records were collected. (There are 4,614 users with roughly 500 to 1,000 records each, an estimated 4 million records in total, but to keep the demonstration manageable only each user's 30 most recent viewing records were crawled.)

3. Data analysis and mining

Read data set


import xlrd

def read_excel():
    # Open the workbook
    data = xlrd.open_workbook('豆瓣.xls')
    # Get the sheet
    table = data.sheet_by_name('sheet1')
    # Number of populated rows
    nrows = table.nrows
    datalist = []
    for row in range(nrows):
        temp_list = table.row_values(row)
        # Skip the header row (用户/user, 影评/review)
        if temp_list[0] != "用户" and temp_list[1] != "影评":
            datalist.append([[str(temp_list[0]), str(temp_list[1])]])

    return datalist

This reads all the data from 豆瓣.xls into the datalist collection; each entry has the shape [[user, review]].

Analysis 1: Movie view-count ranking


import matplotlib.pyplot as plt

# Analysis 1: movie view-count ranking
def analysis1():
    dict = {}  # movie title -> view count (shadows the builtin; name kept from the original)
    # Read the records from Excel
    movie_data = read_excel()
    for i in range(0, len(movie_data)):
        key = str(movie_data[i][0][1])
        try:
            dict[key] = dict[key] + 1
        except KeyError:
            dict[key] = 1
    # Sort ascending by count
    dict = sorted(dict.items(), key=lambda kv: (kv[1], kv[0]))
    name = []
    num = []
    # Take the top 15, highest first
    for i in range(len(dict) - 1, len(dict) - 16, -1):
        print(dict[i])
        name.append(dict[i][0].split("/")[0])
        num.append(dict[i][1])

    plt.figure(figsize=(16, 9))
    plt.title('电影观看次数排行(高->低)')  # "Movie view-count ranking (high -> low)"
    plt.bar(name, num, facecolor='lightskyblue', edgecolor='white')
    plt.savefig('电影观看次数排行.png')
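The try/except tally above can be written more compactly with `collections.Counter`; a minimal sketch using made-up records shaped like `read_excel()`'s output:

```python
from collections import Counter

# Hypothetical records in the [[user, review]] shape returned by read_excel()
records = [[["u1", "心灵奇旅/Soul"]],
           [["u2", "心灵奇旅/Soul"]],
           [["u2", "信条/Tenet"]]]
counts = Counter(rec[0][1] for rec in records)  # title -> view count
top = counts.most_common()  # sorted from most to least viewed
```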

Analysis

1. Since the user list came from the reviews of "Soul" (心灵奇旅), that movie naturally has the most views.
2. Among recent hits, the runner-up is "A Little Red Flower" (送你一朵小红花), followed by "Tenet" and "Shock Wave 2".

Analysis 2: User profiling (users with the highest same-movie rate)


# Analysis 2: user profiling (users with the highest same-movie rate)
def analysis2():
    dict = {}  # user -> comma-joined movie titles
    # Read the records from Excel
    movie_data = read_excel()

    userlist = []
    for i in range(0, len(movie_data)):
        user = str(movie_data[i][0][0])
        moive = str(movie_data[i][0][1]).split("/")[0]
        try:
            dict[user] = dict[user] + "," + str(moive)
        except KeyError:
            dict[user] = str(moive)
            userlist.append(user)

    num_dict = {}
    # User to profile (take the first one)
    flag_user = userlist[0]
    movies = dict[flag_user].split(",")
    for i in range(0, len(userlist)):
        # Skip the user being profiled
        if flag_user != userlist[i]:
            num_dict[userlist[i]] = 0
            # Walk through every movie of the profiled user
            for j in range(0, len(movies)):
                # Count the movies shared with the profiled user
                if movies[j] in dict[userlist[i]]:
                    num_dict[userlist[i]] = num_dict[userlist[i]] + 1
    # Sort ascending
    num_dict = sorted(num_dict.items(), key=lambda kv: (kv[1], kv[0]))
    # User names
    username = []
    # Number of movies watched in common
    num = []
    for i in range(len(num_dict) - 1, len(num_dict) - 9, -1):
        username.append(num_dict[i][0])
        num.append(num_dict[i][1])

    plt.figure(figsize=(25, 9))
    plt.title('用户画像(用户观影相同率最高)')  # "User profile (highest same-movie rate)"
    plt.scatter(username, num, color='r')
    plt.plot(username, num)
    plt.savefig('用户画像(用户观影相同率最高).png')

Analysis

Taking the user "mumudancing" as an example of a user profile:
1. As the figure shows, the user with the highest same-movie rate with "mumudancing" is "Please take me back to Prague", followed by "Li Xiaowei".
2. The users "Desperate Solitaire", "Stupid Kid", "Private History", "Wen Heng", "Shen Tang", and "Xiu Zuo" follow, with decreasing same-movie rates.
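Note that the substring test `movies[j] in dict[userlist[i]]` can over-count when one title is contained in another; set intersection avoids this. A sketch with hypothetical users and viewing sets:

```python
# Hypothetical viewing sets; real data would come from the Excel file
watched = {
    "mumudancing": {"心灵奇旅", "信条", "送你一朵小红花"},
    "userA": {"心灵奇旅", "信条"},
    "userB": {"信条"},
}
target = watched["mumudancing"]
# Exact overlap counts via set intersection
overlap = {u: len(target & movies)
           for u, movies in watched.items() if u != "mumudancing"}
ranked = sorted(overlap.items(), key=lambda kv: kv[1], reverse=True)
```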

Analysis 3: Movie recommendation among users


# Analysis 3: movie recommendation between users (movies co-watched with other users)
def analysis3():
    dict = {}  # user -> comma-joined movie titles
    # Read the records from Excel
    movie_data = read_excel()

    userlist = []
    for i in range(0, len(movie_data)):
        user = str(movie_data[i][0][0])
        moive = str(movie_data[i][0][1]).split("/")[0]
        try:
            dict[user] = dict[user] + "," + str(moive)
        except KeyError:
            dict[user] = str(moive)
            userlist.append(user)

    num_dict = {}
    # Target user (the first one, as in analysis 2)
    flag_user = userlist[0]
    print(flag_user)
    movies = dict[flag_user].split(",")
    for i in range(0, len(userlist)):
        # Skip the target user
        if flag_user != userlist[i]:
            num_dict[userlist[i]] = 0
            # Walk through every movie of the target user
            for j in range(0, len(movies)):
                # Count the movies shared with the target user
                if movies[j] in dict[userlist[i]]:
                    num_dict[userlist[i]] = num_dict[userlist[i]] + 1
    # Sort ascending
    num_dict = sorted(num_dict.items(), key=lambda kv: (kv[1], kv[0]))

    # Recommend the most similar user's movies, minus those the target has already watched
    user_movies = dict[flag_user]
    new_movies = dict[num_dict[len(num_dict) - 1][0]].split(",")
    for i in range(0, len(new_movies)):
        if new_movies[i] not in user_movies:
            print("给用户(" + str(flag_user) + ")推荐电影:" + str(new_movies[i]))  # "Recommend to user (...): ..."

Analysis

Taking the user "mumudancing" as an example of recommending movies between users:
1. Find the user (call them A) whose viewing overlap with "mumudancing" is the highest, then get all of A's viewing records.
2. Recommend A's movies to "mumudancing", after removing the movies the two users share.
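The two steps above can be sketched as one small function (names and data are hypothetical; the article's code works on comma-joined strings rather than sets):

```python
def recommend_between_users(target, watched):
    # Step 1: find the user with the largest viewing overlap with `target`
    others = [u for u in watched if u != target]
    best = max(others, key=lambda u: len(watched[target] & watched[u]))
    # Step 2: suggest that user's movies the target has not seen yet
    return sorted(watched[best] - watched[target])

watched = {
    "mumudancing": {"心灵奇旅", "信条"},
    "userA": {"心灵奇旅", "信条", "送你一朵小红花"},
    "userB": {"信条"},
}
suggestions = recommend_between_users("mumudancing", watched)
```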

Analysis 4: Movie recommendation between movies


# Analysis 4: movie recommendation between movies (movies co-watched with a given movie)
def analysis4():
    dict = {}  # user -> comma-joined movie titles
    # Read the records from Excel
    movie_data = read_excel()

    userlist = []
    for i in range(0, len(movie_data)):
        user = str(movie_data[i][0][0])
        moive = str(movie_data[i][0][1]).split("/")[0]
        try:
            dict[user] = dict[user] + "," + str(moive)
        except KeyError:
            dict[user] = str(moive)
            userlist.append(user)

    movie_list = []
    # Movie to build recommendations for: "A Little Red Flower"
    flag_movie = "送你一朵小红花"
    for i in range(0, len(userlist)):
        if flag_movie in dict[userlist[i]]:
            moives = dict[userlist[i]].split(",")
            for j in range(0, len(moives)):
                if moives[j] != flag_movie:
                    movie_list.append(moives[j])

    data_dict = {}
    for key in movie_list:
        data_dict[key] = data_dict.get(key, 0) + 1

    # Sort ascending, then print the top 15
    data_dict = sorted(data_dict.items(), key=lambda kv: (kv[1], kv[0]))
    for i in range(len(data_dict) - 1, len(data_dict) - 16, -1):
        print("根据电影[" + str(flag_movie) + "]推荐:" + str(data_dict[i][0]))  # "Based on movie [...], recommend: ..."

Analysis

Taking the movie "A Little Red Flower" (送你一朵小红花) as an example of recommending movies between movies:
1. Get all users who watched "A Little Red Flower", then get each of those users' viewing records.
2. Tally those records (excluding "A Little Red Flower" itself) and sort from high to low, yielding the movies most strongly associated with it.
3. Recommend the 15 most associated movies to users.
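These steps can be sketched with `collections.Counter` (the data here is hypothetical):

```python
from collections import Counter

def co_watch_recommend(flag_movie, watched, top_n=15):
    # Tally every other movie seen by users who watched `flag_movie`
    counts = Counter()
    for movies in watched.values():
        if flag_movie in movies:
            counts.update(m for m in movies if m != flag_movie)
    # Highest co-watch counts first
    return [m for m, _ in counts.most_common(top_n)]

watched = {
    "u1": {"送你一朵小红花", "心灵奇旅", "信条"},
    "u2": {"送你一朵小红花", "心灵奇旅"},
    "u3": {"信条"},
}
recs = co_watch_recommend("送你一朵小红花", watched)
```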

4. Summary

1. Worked through an approach for crawling data from the Douban platform and implemented it in code.
2. Analyzed the crawled data: movie view-count ranking, user profiling, movie recommendation between users, and movie recommendation between movies.

Origin blog.csdn.net/pythonxuexi123/article/details/114581290