Using pandas to find the 50 hottest books and the 10 hottest tags in a large dataset

Task 1: find the titles of the 50 books that the most users want to read
Task 2: find the 10 most popular tags attached to those 50 books

Available data:
File 1: to_read.csv
Two fields per row: a user id and the id of a book that user wants to read
File 2: books.csv
Each book's various ids, title, author and other information
File 3: tags.csv
Two fields per row: a tag id and the tag name
File 4: book_tags.csv
Three fields per row: goodreads_book_id (its mapping to the book id used in to_read can be found in books.csv), a tag id, and the number of times the tag was applied
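
Before writing any logic, it helps to peek at each file's columns. A quick inspection sketch, assuming the four CSVs sit in the working directory:

import pandas as pd

for name in ['to_read.csv', 'books.csv', 'tags.csv', 'book_tags.csv']:
    df = pd.read_csv(name, encoding='ISO-8859-1', nrows=5)  # read only a few rows to inspect the header
    print(name, list(df.columns))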

Approach:
Files 1 and 2 provide all the information needed for task 1.
First, on file 1 we use .value_counts() to count how many users want to read each book, then sort with sort_values() and slice off the top 50 ([:50]). Note that what we have at this point is a Series: its index holds the book ids and its values hold the counts (the counts are not strictly required, they just serve as a reference). We extract both to build a multi-column DataFrame as a base table, in which book_id is the key. We then join against another table on that key to look up the title that corresponds to each id. File 2 contains the title information, but it also carries far more columns than we need, so to keep things concise we select only the relevant columns from file 2 into a small id-to-title table (books_id_and_title) and merge it with the base table on the key. Note that on='book_id' sets the join key and how='left' keeps every row of the base table.

The implementation:

import pandas as pd

to_read = pd.read_csv('to_read.csv')
# value_counts() already returns counts in descending order; sort_values() just makes that explicit
to_read_counts_series = to_read['book_id'].value_counts().sort_values(ascending=False)
hottest_50_books_id_series = to_read_counts_series[:50]
print(hottest_50_books_id_series)

# the Series index holds the book ids, the values hold the counts
hottest_50_books_id_col = hottest_50_books_id_series.index
hottest_50_books_counts = hottest_50_books_id_series.values
print(hottest_50_books_id_col)
print(hottest_50_books_counts)

# base table: book_id is the key we will merge on
to_read_counts_df = pd.DataFrame({
    'book_id': hottest_50_books_id_col,
    'counts': hottest_50_books_counts
})
print(to_read_counts_df)

# keep only the columns we need from books.csv (explicit encoding because the file is not plain UTF-8)
books = pd.read_csv('books.csv', encoding="ISO-8859-1")
books_id_and_title = books[['book_id', 'goodreads_book_id', 'title']]
print(books_id_and_title)

# a left merge on book_id keeps every row of the base table and attaches the title
hottest_books_list = pd.merge(to_read_counts_df, books_id_and_title, on='book_id', how='left')
hottest_books_list.to_csv('hottest_50_books.csv')
print(hottest_books_list)

The final output:

  book_id  ...                                              title
0        47  ...                                     The Book Thief
1       143  ...                        All the Light We Cannot See
2       113  ...                                           Catch-22
3        13  ...                                               1984
4        11  ...                                    The Kite Runner
5        45  ...                                         Life of Pi
6       139  ...  Miss Peregrine’s Home for Peculiar Children ...
7        39  ...     A Game of Thrones (A Song of Ice and Fire, #1)
8        65  ...                                Slaughterhouse-Five
9        35  ...                                      The Alchemist
10      342  ...                                 The Casual Vacancy
11      185  ...                                   The Night Circus
12      119  ...                                The Handmaid's Tale
13        8  ...                             The Catcher in the Rye
14        6  ...                             The Fault in Our Stars
15        4  ...                              To Kill a Mockingbird
16       94  ...                      One Hundred Years of Solitude
17       89  ...                                The Princess Bride 
18       55  ...                                    Brave New World
19       61  ...                              The Girl on the Train
20      109  ...                                    Les Misérables
21       16  ...   The Girl with the Dragon Tattoo (Millennium, #1)
22       31  ...                                           The Help
23       67  ...                           A Thousand Splendid Suns
24      146  ...                                      The Goldfinch
25       54  ...  The Hitchhiker's Guide to the Galaxy (Hitchhik...
26       46  ...                                Water for Elephants
27      121  ...                                             Lolita
28        5  ...                                   The Great Gatsby
29      173  ...                                 A Clockwork Orange
30      115  ...                                          Middlesex
31       68  ...                    The Perks of Being a Wallflower
32       36  ...                          The Giver (The Giver, #1)
33       95  ...                         The Picture of Dorian Gray
34      167  ...                  American Gods (American Gods, #1)
35      129  ...                    One Flew Over the Cuckoo's Nest
36      265  ...                           A Tree Grows in Brooklyn
37      137  ...                          Outlander (Outlander, #1)
38      277  ...                   The Ocean at the End of the Lane
39       66  ...                                 Gone with the Wind
40      267  ...                                    The Nightingale
41      268  ...                                    Never Let Me Go
42       28  ...                                  Lord of the Flies
43       38  ...                           The Time Traveler's Wife
44       60  ...  The Curious Incident of the Dog in the Night-Time
45       14  ...                                        Animal Farm
46      225  ...                                       East of Eden
47       10  ...                                Pride and Prejudice
48      233  ...                        Love in the Time of Cholera
49      252  ...                  Cinder (The Lunar Chronicles, #1)

[50 rows x 4 columns]

That completes task 1. To sum up, the general idea is to extract just the data we need from each big table into a small table, then merge the two small tables to combine the useful information. Doing this by hand would be practically impossible, while a computer finishes it in a few seconds.
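
As a side note, the Series returned by value_counts() can be turned into the two-column base table in a single step with reset_index(). A minimal sketch of this shortcut, assuming the same to_read.csv as above; the rest of the merge with books_id_and_title stays exactly the same:

import pandas as pd

to_read = pd.read_csv('to_read.csv')
# reset_index() converts the value_counts() Series into a two-column DataFrame
top50 = to_read['book_id'].value_counts().head(50).reset_index()
top50.columns = ['book_id', 'counts']  # name the columns explicitly (robust across pandas versions)
print(top50)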

Task 2 requires some simple data filtering. We need to filter the book_tags table down to the tags that belong to the hottest books. This uses the isin() method to build a boolean mask, and boolean indexing then keeps the True rows and drops the False rows; that one line is fairly convoluted and is the main difficulty of this task (a toy illustration follows below). The result is a new DataFrame, in which we total the count for each tag and then sort and slice to get the top 10 tags. That gives a hottest_10_tags table, which we convert into a DataFrame and merge with the DataFrame loaded from tags.csv, so that the merge attaches the tag_name corresponding to each tag_id in the new table.
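
If the isin() line feels opaque, here is a toy illustration with made-up data (the real files are not needed for this):

import pandas as pd

book_tags_demo = pd.DataFrame({
    'goodreads_book_id': [1, 1, 2, 3, 3],
    'tag_id':            [10, 11, 10, 12, 10],
    'count':             [5, 2, 7, 1, 4],
})
hot_ids = pd.Series([1, 3])                               # pretend these are the hottest books
mask = book_tags_demo['goodreads_book_id'].isin(hot_ids)  # boolean Series: True where the book is hot
print(book_tags_demo[mask])                               # keeps only the rows whose book is in hot_ids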
The code:

import pandas as pd
to_read = pd.read_csv('to_read.csv')
to_read_counts_series = to_read['book_id'].value_counts().sort_values(ascending=False)
hottest_50_books_id_series = to_read_counts_series[:50]
hottest_50_books_id_col = hottest_50_books_id_series.index
hottest_50_books_counts = hottest_50_books_id_series.values
to_read_counts_df = pd.DataFrame({
		'book_id':hottest_50_books_id_col,
		'counts':hottest_50_books_counts
	})
books = pd.read_csv('books.csv',encoding="ISO-8859-1")
books_id_and_title = books[['book_id','goodreads_book_id','title']]
hottest_books_list = pd.merge(to_read_counts_df,books_id_and_title,on ='book_id',how='left')
hottest_books_list.to_csv('hottest_50_books.csv')
book_tags = pd.read_csv('book_tags.csv')  # a DataFrame with 3 columns: goodreads_book_id, tag_id, count
# the key step: boolean indexing with isin() keeps only the tag rows that belong to the hottest 50 books
book_tags_in_hottest_books = book_tags[book_tags['goodreads_book_id']
                                       .isin(hottest_books_list['goodreads_book_id'])]
# dropping the id column is optional; it just keeps the table lean before grouping
book_tags_in_hottest_books = book_tags_in_hottest_books.drop(columns='goodreads_book_id')
# group by tag_id and sum the count column, i.e. the total number of times each tag was applied
hottest_10_book_tags = book_tags_in_hottest_books.groupby('tag_id').sum()
# sort by the summed count and keep the top 10
hottest_10_book_tags = hottest_10_book_tags.sort_values(by='count', ascending=False)[:10]
hottest_tag_id_list = hottest_10_book_tags.index
hottest_tag_id_count= hottest_10_book_tags['count']
hottest_10_tag_id_df = pd.DataFrame({
		'hottest_tag_id':hottest_tag_id_list,
		'count': hottest_tag_id_count
	}) # store the 10 hottest tag_ids and their counts in a DataFrame, ready for the merge below
book_tags_names = pd.read_csv('tags.csv')
hottest_10_tags_with_names = pd.merge(hottest_10_tag_id_df,book_tags_names,
	left_on='hottest_tag_id',right_on='tag_id',how='left')
hottest_10_tags_with_names.to_csv("hottest_10_tags_with_names.csv")
print(hottest_10_tags_with_names)

The output:

   hottest_tag_id    count  tag_id            tag_name
0           30574  6061902   30574             to-read
1            8717   412677    8717   currently-reading
2           11557   404973   11557           favorites
3           11743   299256   11743             fiction
4            7457   279578    7457            classics
5           14487    80671   14487  historical-fiction
6           11305    71863   11305             fantasy
7            5207    71629    5207         books-i-own
8           22743    65001   22743               owned
9           33114    56829   33114         young-adult
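
For reference, the same top-10 table can be produced a little more compactly with nlargest(). This is just a sketch, assuming the same files and column names as above:

import pandas as pd

book_tags = pd.read_csv('book_tags.csv')
tags = pd.read_csv('tags.csv')
hottest_books = pd.read_csv('hottest_50_books.csv')  # the file saved in task 1

top10 = (book_tags[book_tags['goodreads_book_id'].isin(hottest_books['goodreads_book_id'])]
         .groupby('tag_id')['count'].sum()
         .nlargest(10)            # sort descending and keep the top 10 in one step
         .reset_index()
         .merge(tags, on='tag_id', how='left'))
print(top10)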