Crawling Maoyan comments for the film Nobodies (《无名之辈》) with Python, and analyzing the data

Foreword

The text and images in this article come from the Internet and are for learning and exchange only, not for any commercial use. Copyright belongs to the original author. If there is any problem, please contact us to resolve it.

Author: Luozhao Cheng

PS: If you need Python learning materials, you can click the link below to get them yourself

http://note.youdao.com/noteshare?id=3054cce4add8a909e784ad934f956cef

Fetching data from the Maoyan interface

As a programmer who stays home for long stretches, all kinds of packet-capture skills come in handy. Inspecting the page in Chrome's mobile device mode makes the interface easy to spot; its address is:

http://m.maoyan.com/mmdb/comments/movie/1208282.json?_v_=yes&offset=15

In Python, we can easily use the requests library to send the request and get the returned result:

import requests

def getMoveinfo(url):
    session = requests.Session()
    headers = {
        "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X)"
    }
    response = session.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    return None
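For example, fetching one page of comments with the interface address above (a quick usage sketch):

url = "http://m.maoyan.com/mmdb/comments/movie/1208282.json?_v_=yes&offset=15"
html = getMoveinfo(url)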

 

With the request above, we can fetch what this interface returns. The response contains a lot of information, much of which we don't need. Let's first look at the overall shape of the returned data:

{
    "cmts":[
        {
            "approve":0,
            "approved":false,
            "assistAwardInfo":{
                "avatar":"",
                "celebrityId":0,
                "celebrityName":"",
                "rank":0,
                "title":""
            },
            "authInfo":"",
            "cityName":"Guiyang",
            "content":"A must-see; worth borrowing money to watch this movie.",
            "filmView":false,
            "id":1045570589,
            "isMajor":false,
            "juryLevel":0,
            "majorType":0,
            "movieId":1208282,
            "nick":"nick",
            "nickName":"nickName",
            "oppose":0,
            "pro":false,
            "reply":0,
            "score":5,
            "spoiler":0,
            "startTime":"2018-11-22 23:52:58",
            "supportComment":true,
            "supportLike":true,
            "sureViewed":1,
            "tagList":{
                "fixed":[
                    {
                        "id":1,
                        "name":"好评"
                    },
                    {
                        "id":4,
                        "name":"购票"
                    }
                ]
            },
            "time":"2018-11-22 23:52",
            "userId":1871534544,
            "userLevel":2,
            "videoDuration":0,
            "vipInfo":"",
            "vipType":0
        }
    ]
}

 

So much data, yet we are only interested in the following fields:

nickName, cityName, content, startTime, score

Next we process the data, parsing the fields we need out of the JSON:

import json

def parseInfo(data):
    comments = json.loads(data)['cmts']
    for item in comments:
        yield {
            'date': item['startTime'],
            'nickname': item['nickName'],
            'city': item['cityName'],
            'rate': item['score'],
            'comment': item['content']
        }
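A quick way to sanity-check the parser is to feed it the raw JSON text fetched earlier (a usage sketch):

for item in parseInfo(html):
    print(item['date'], item['city'], item['rate'])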

 

After getting the data, we can begin analysis. But to avoid requesting data from Maoyan too frequently, the data should be stored first; here the author uses SQLite3. With the data in a database, subsequent processing is easier. The storage code is as follows:

import sqlite3

def saveCommentInfo(moveId, nikename, comment, rate, city, start_time):
    conn = sqlite3.connect('unknow_name.db')
    conn.text_factory = str
    cursor = conn.cursor()
    # Assumes a six-column "comments" table already exists (see the sketch below)
    ins = "insert into comments values (?,?,?,?,?,?)"
    v = (moveId, nikename, comment, rate, city, start_time)
    cursor.execute(ins, v)
    cursor.close()
    conn.commit()
    conn.close()
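The original post does not show the table being created or the crawl loop that drives these functions, so here is a minimal sketch under two assumptions: the comments table simply mirrors the insert order above, and the API pages in steps of 15 via the offset parameter seen in the URL:

import time
import sqlite3

# Hypothetical schema matching the insert order in saveCommentInfo
conn = sqlite3.connect('unknow_name.db')
conn.execute("CREATE TABLE IF NOT EXISTS comments "
             "(moveId, nikename, comment, rate, city, start_time)")
conn.close()

for offset in range(0, 1000, 15):  # hypothetical page range
    url = ("http://m.maoyan.com/mmdb/comments/movie/1208282.json"
           "?_v_=yes&offset=%d" % offset)
    html = getMoveinfo(url)
    if not html:
        break
    for c in parseInfo(html):
        saveCommentInfo(1208282, c['nickname'], c['comment'],
                        c['rate'], c['city'], c['date'])
    time.sleep(1)  # throttle requests to be polite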

 


Data processing

Because we stored the data in a database earlier, we can query for the results we want directly with SQL, for example the five cities with the most comments:

SELECT city, count(*) AS rate_count FROM comments GROUP BY city ORDER BY rate_count DESC LIMIT 5

 

The results are as follows:

[Figure: the five cities with the most comments]

From the data above, we can see that the largest number of comments comes from Beijing.

And not only that: more SQL queries can answer other questions, for example how many comments each score received and what proportion of the total each accounts for. If you are interested, try it yourself; it is that simple.
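As a sketch, a query along these lines gives the count and share per score, assuming the rate column follows the insert order used in saveCommentInfo above:

SELECT rate, count(*) AS rate_count,
       round(count(*) * 100.0 / (SELECT count(*) FROM comments), 1) AS percent
FROM comments GROUP BY rate ORDER BY rate DESC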

To display the data more effectively, we use the Pyecharts library for visualization.

Based on the geographic information in the Maoyan data, we can use Pyecharts to plot the data directly on a map of China:

import pandas as pd
from pyecharts import Geo  # pyecharts 0.5.x API

# f is the path of the text file the comments were exported to,
# one '{'-separated record per line
data = pd.read_csv(f, sep='{', header=None, encoding='utf-8',
                   names=['date', 'nickname', 'city', 'rate', 'comment'])
city = data.groupby(['city'])
city_com = city['rate'].agg(['mean', 'count'])
city_com.reset_index(inplace=True)
data_map = [(city_com['city'][i], city_com['count'][i])
            for i in range(0, city_com.shape[0])]
geo = Geo("GEO 地理位置分析", title_pos="center", width=1200, height=800)
while True:
    try:
        attr, val = geo.cast(data_map)
        geo.add("", attr, val, visual_range=[0, 300], visual_text_color="#fff",
                symbol_size=10, is_visualmap=True, maptype='china')
    except ValueError as e:
        # Geo raises "No coordinate is specified for <city>";
        # drop that city from the data and retry
        missing = str(e).split("No coordinate is specified for ")[1]
        data_map = [item for item in data_map if item[0] != missing]
    else:
        break
geo.render('geo_city_location.html')

 

Note: some cities in the Maoyan data cannot be matched against the coordinate data that ships with Pyecharts, so whenever Geo complains about an unknown city, the code above filters it out of the data. Quite a lot of records get dropped this way.

With Python, generating the map below really is that simple:

[Figure: comment counts plotted on a map of China]

The visualization makes it clear that both moviegoing and commenting are concentrated in eastern China, with Beijing, Shanghai, Chengdu, and Shenzhen standing out. But even though the chart shows a lot of data, it is still not very intuitive. If we want the distribution by province or municipality, the data needs further processing.

The city field in the Maoyan data goes down to the county level, so a conversion is needed: map every county to its province, then sum the comment counts of entries belonging to the same province to get the final per-province result.

import json
import pandas as pd

def getRealName(name, jsonObj):
    # Look the county-level name up in the citys.json mapping table
    for item in jsonObj:
        if item.startswith(name):
            return jsonObj[item]
    return name

def realKeys(name):
    # Strip administrative suffixes so names match Pyecharts' province labels;
    # the first two suffixes were lost in the original post and are presumed
    # to be 省 (province) and 市 (municipality)
    return (name.replace(u"省", "").replace(u"市", "")
                .replace(u"回族自治区", "").replace(u"维吾尔自治区", "")
                .replace(u"壮族自治区", "").replace(u"自治区", ""))

data = pd.read_csv(f, sep='{', header=None, encoding='utf-8',
                   names=['date', 'nickname', 'city', 'rate', 'comment'])
city = data.groupby(['city'])
city_com = city['rate'].agg(['mean', 'count'])
city_com.reset_index(inplace=True)
with open("citys.json", 'r') as fo:
    citysJson = json.loads(fo.readlines()[0])
data_map_all = [(getRealName(city_com['city'][i], citysJson), city_com['count'][i])
                for i in range(0, city_com.shape[0])]
# Sum the comment counts of entries that map to the same province
data_map_list = {}
for item in data_map_all:
    if item[0] in data_map_list:
        data_map_list[item[0]] += item[1]
    else:
        data_map_list[item[0]] = item[1]
data_map = [(realKeys(key), data_map_list[key]) for key in data_map_list.keys()]

 

 

After the processing above, we use the Map chart provided by Pyecharts to generate a map displayed by province/municipality:

from pyecharts import Map  # pyecharts 0.5.x API

def generateMap(data_map):
    map = Map("城市评论数", width=1200, height=800, title_pos="center")
    while True:
        try:
            # The original called geo.cast here; cast belongs to the chart itself
            attr, val = map.cast(data_map)
            map.add("", attr, val, visual_range=[0, 800],
                    visual_text_color="#fff", symbol_size=5,
                    is_visualmap=True, maptype='china',
                    is_map_symbol_show=False, is_label_show=True, is_roam=False)
        except ValueError as e:
            # Drop provinces the map has no entry for, then retry
            missing = str(e).split("No coordinate is specified for ")[1]
            data_map = [item for item in data_map if item[0] != missing]
        else:
            break
    map.render('city_rate_count.html')
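With the province-level data prepared above, rendering the HTML is then a single call (usage sketch):

generateMap(data_map)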

 

[Figure: comment counts shown per province on a map of China]

Of course, we can also visualize the number of people behind each rating; a bar chart works well here:

import pandas as pd
from pyecharts import Bar  # pyecharts 0.5.x API

data = pd.read_csv(f, sep='{', header=None, encoding='utf-8',
                   names=['date', 'nickname', 'city', 'rate', 'comment'])
# Group by rating
rateData = data.groupby(['rate'])
rateDataCount = rateData["date"].agg(["count"])
rateDataCount.reset_index(inplace=True)
count = rateDataCount.shape[0] - 1
# Reverse the order so the highest rating comes first
attr = [rateDataCount["rate"][count - i] for i in range(0, rateDataCount.shape[0])]
v1 = [rateDataCount["count"][count - i] for i in range(0, rateDataCount.shape[0])]
bar = Bar("评分数量")
bar.add("数量", attr, v1, is_stack=True, xaxis_rotate=30, yaxis_min=4.2,
        xaxis_interval=0, is_splitline_show=True)
bar.render("html/rate_count.html")

 

The resulting chart is shown below. In the Maoyan data, five-star reviews account for more than 50%, much better than the 34.8% five-star share on Douban.
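The five-star share can also be checked straight from the DataFrame loaded above (a minimal sketch using the same rate column):

rate_share = data['rate'].value_counts(normalize=True)
print("5-star share: %.1f%%" % (rate_share.get(5, 0) * 100))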

[Figure: bar chart of the number of comments per rating]

From the audience distribution and rating data above, you can see that viewers really liked this film. We already have the audience comments from Maoyan, so now the author will segment the comments with jieba and then build a word cloud with Wordcloud, to see the audience's overall verdict on 《无名之辈》:

import jieba
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, STOPWORDS

data = pd.read_csv(f, sep='{', header=None, encoding='utf-8',
                   names=['date', 'nickname', 'city', 'rate', 'comment'])
comment = jieba.cut(str(data['comment']), cut_all=False)
wl_space_split = " ".join(comment)
backgroudImage = np.array(Image.open(r"./unknow_3.png"))
stopword = STOPWORDS.copy()
wc = WordCloud(width=1920, height=1080, background_color='white',
               mask=backgroudImage,
               font_path="./Deng.ttf",  # a Chinese font is needed for CJK text
               stopwords=stopword, max_font_size=400,
               random_state=50)
wc.generate_from_text(wl_space_split)
plt.imshow(wc)
plt.axis("off")
wc.to_file('unknow_word_cloud.png')

 

Output:

[Figure: word cloud generated from the comments]

Original post: www.cnblogs.com/Qqun821460695/p/11953720.html