Author: Luozhao Cheng
Fetching Comment Data from the Maoyan API
As a programmer who spends a lot of time at home, I find all kinds of packet-capture skills handy. Viewing the page in Chrome's mobile emulation mode, the comment API is easy to spot. Its address is:
http://m.maoyan.com/mmdb/comments/movie/1208282.json?_v_=yes&offset=15
In Python, we can easily send the request over the network and read back the result:
```python
import requests

def getMoveinfo(url):
    session = requests.Session()
    headers = {
        "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X)"
    }
    response = session.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    return None
```
The request above returns the API's data. It carries a lot of information, much of which we do not need. Let's first look at the overall structure of the response:
```json
{
    "cmts": [
        {
            "approve": 0,
            "approved": false,
            "assistAwardInfo": {
                "avatar": "",
                "celebrityId": 0,
                "celebrityName": "",
                "rank": 0,
                "title": ""
            },
            "authInfo": "",
            "cityName": "Guiyang",
            "content": "A must-see; worth borrowing money for a ticket.",
            "filmView": false,
            "id": 1045570589,
            "isMajor": false,
            "juryLevel": 0,
            "majorType": 0,
            "movieId": 1208282,
            "nick": "nick",
            "nickName": "nickName",
            "oppose": 0,
            "pro": false,
            "reply": 0,
            "score": 5,
            "spoiler": 0,
            "startTime": "2018-11-22 23:52:58",
            "supportComment": true,
            "supportLike": true,
            "sureViewed": 1,
            "tagList": {
                "fixed": [
                    { "id": 1, "name": "好评" },
                    { "id": 4, "name": "购票" }
                ]
            },
            "time": "2018-11-22 23:52",
            "userId": 1871534544,
            "userLevel": 2,
            "videoDuration": 0,
            "vipInfo": "",
            "vipType": 0
        }
    ]
}
```
Of all this data, we are interested in only a few fields: nickName, cityName, content, startTime, and score.
Next, we do some light processing and parse the fields we need out of the JSON:
```python
import json

def parseInfo(data):
    comments = json.loads(data)['cmts']  # the original parsed an undefined `html` variable
    for item in comments:
        yield {
            'date': item['startTime'],
            'nickname': item['nickName'],
            'city': item['cityName'],
            'rate': item['score'],
            'comment': item['content']
        }
```
With the data in hand, we can start analyzing it. However, to avoid sending Maoyan repeated requests, the data should be stored first. Here the author uses SQLite3; once the data is in a database, subsequent processing is easier. The storage code is as follows:
```python
import sqlite3

def saveCommentInfo(moveId, nikename, comment, rate, city, start_time):
    conn = sqlite3.connect('unknow_name.db')
    conn.text_factory = str
    cursor = conn.cursor()
    ins = "insert into comments values (?,?,?,?,?,?)"
    v = (moveId, nikename, comment, rate, city, start_time)
    cursor.execute(ins, v)
    cursor.close()
    conn.commit()
    conn.close()
```
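The insert above assumes a `comments` table already exists. A one-time setup might look like the sketch below; the column names are my guesses, chosen to match the insert order and the `city`/`rate` columns queried later:

```python
import sqlite3

def createCommentTable(db_path='unknow_name.db'):
    # Column names are assumptions; only their order must match the
    # (?,?,?,?,?,?) insert used in saveCommentInfo.
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS comments (
            movieId    INTEGER,
            nickname   TEXT,
            comment    TEXT,
            rate       REAL,
            city       TEXT,
            start_time TEXT
        )
    """)
    conn.commit()
    conn.close()
```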
Data Processing
Because we stored the data in a database, we can use SQL to query the results we want directly. For example, which five cities have the most comments:
SELECT city, count(*) rate_count FROM comments GROUP BY city ORDER BY rate_count DESC LIMIT 5
The results are as follows:
From the data above, we can see that Beijing contributes the greatest number of comments.
Not only that, more SQL queries can answer other questions, such as the number of comments at each score and their share of the total. If you are interested, try a few queries yourself; it really is that simple.
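As a sketch of how such queries can be run from Python, here is the top-cities query wrapped in a helper. It assumes the same database file and a `city` column as in the SQL above:

```python
import sqlite3

def topCities(db_path='unknow_name.db', limit=5):
    """Return (city, comment_count) pairs, busiest cities first."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT city, count(*) AS rate_count FROM comments "
        "GROUP BY city ORDER BY rate_count DESC LIMIT ?", (limit,)
    ).fetchall()
    conn.close()
    return rows
```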
To present the data better, we use the Pyecharts library for visualization.
Based on the geographic information in the Maoyan data, Pyecharts can plot the comments directly onto a map of China:
```python
import pandas as pd
from pyecharts import Geo

# f is the comment data file saved earlier
data = pd.read_csv(f, sep='{', header=None, encoding='utf-8',
                   names=['date', 'nickname', 'city', 'rate', 'comment'])
city = data.groupby(['city'])
city_com = city['rate'].agg(['mean', 'count'])
city_com.reset_index(inplace=True)
data_map = [(city_com['city'][i], city_com['count'][i])
            for i in range(0, city_com.shape[0])]
geo = Geo("GEO 地理位置分析", title_pos="center", width=1200, height=800)
while True:
    try:
        attr, val = geo.cast(data_map)
        geo.add("", attr, val, visual_range=[0, 300], visual_text_color="#fff",
                symbol_size=10, is_visualmap=True, maptype='china')
    except ValueError as e:
        # drop cities Pyecharts has no coordinates for, then retry
        missing = str(e).split("No coordinate is specified for ")[1]
        data_map = [item for item in data_map if item[0] != missing]
    else:
        break
geo.render('geo_city_location.html')
```
Note: Pyecharts' built-in map data does not include every city that appears in the Maoyan data, so the code above filters out the cities GEO cannot locate, which discards a fair amount of data.
With Python, generating the following map is just that simple:
The visualization shows that both moviegoing and commenting are concentrated in eastern China, with Beijing, Shanghai, Chengdu, and Shenzhen leading. Although the chart reveals a lot, it is still not very intuitive; to see the distribution per province or municipality, the data needs further processing.
The city field in the Maoyan data includes county-level entries, so we need one conversion pass: map every county to its province, then sum the comment counts within each province to get the final result.
```python
import json
import pandas as pd

def getRealName(name, jsonObj):
    for item in jsonObj:
        if item.startswith(name):
            return jsonObj[item]
    return name

def realKeys(name):
    return name.replace(u"省", "").replace(u"市", "") \
               .replace(u"回族自治区", "").replace(u"维吾尔自治区", "") \
               .replace(u"壮族自治区", "").replace(u"自治区", "")

# f is the comment data file saved earlier
data = pd.read_csv(f, sep='{', header=None, encoding='utf-8',
                   names=['date', 'nickname', 'city', 'rate', 'comment'])
city = data.groupby(['city'])
city_com = city['rate'].agg(['mean', 'count'])
city_com.reset_index(inplace=True)
fo = open("citys.json", 'r')
citys_info = fo.readlines()
citysJson = json.loads(str(citys_info[0]))
data_map_all = [(getRealName(city_com['city'][i], citysJson), city_com['count'][i])
                for i in range(0, city_com.shape[0])]
data_map_list = {}
for item in data_map_all:
    if item[0] in data_map_list:  # dict.has_key() is Python 2 only
        data_map_list[item[0]] += item[1]
    else:
        data_map_list[item[0]] = item[1]
data_map = [(realKeys(key), data_map_list[key]) for key in data_map_list.keys()]
```
After the processing above, we use the map chart provided by Pyecharts to generate a map broken down by province/municipality:
```python
from pyecharts import Map

def generateMap(data_map):
    city_map = Map("城市评论数", width=1200, height=800, title_pos="center")
    while True:
        try:
            # the original called geo.cast(), but no Geo object exists here;
            # cast() is available on Map as well
            attr, val = city_map.cast(data_map)
            city_map.add("", attr, val, visual_range=[0, 800],
                         visual_text_color="#fff", symbol_size=5,
                         is_visualmap=True, maptype='china',
                         is_map_symbol_show=False, is_label_show=True,
                         is_roam=False)
        except ValueError as e:
            # drop provinces the map has no entry for, then retry
            missing = str(e).split("No coordinate is specified for ")[1]
            data_map = [item for item in data_map if item[0] != missing]
        else:
            break
    city_map.render('city_rate_count.html')
```
Of course, we can also visualize how many people gave each score; a bar chart works well here:
```python
import pandas as pd
from pyecharts import Bar

# f is the comment data file saved earlier
data = pd.read_csv(f, sep='{', header=None, encoding='utf-8',
                   names=['date', 'nickname', 'city', 'rate', 'comment'])
# group by score
rateData = data.groupby(['rate'])
rateDataCount = rateData["date"].agg(["count"])
rateDataCount.reset_index(inplace=True)
count = rateDataCount.shape[0] - 1
attr = [rateDataCount["rate"][count - i] for i in range(0, rateDataCount.shape[0])]
v1 = [rateDataCount["count"][count - i] for i in range(0, rateDataCount.shape[0])]
bar = Bar("评分数量")
bar.add("数量", attr, v1, is_stack=True, xaxis_rotate=30, yaxis_min=4.2,
        xaxis_interval=0, is_splitline_show=True)
bar.render("html/rate_count.html")
```
The resulting chart is shown below. In the Maoyan data, five-star reviews make up more than 50% of the total, far better than the 34.8% five-star share on Douban.
From the audience distribution and the score data, it is clear that viewers really liked this film. Having already pulled the comment text from Maoyan, the author now segments the comments with jieba and builds a word cloud with Wordcloud to see the audience's overall impression of 《无名之辈》:
```python
import jieba
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, STOPWORDS

# f is the comment data file saved earlier
data = pd.read_csv(f, sep='{', header=None, encoding='utf-8',
                   names=['date', 'nickname', 'city', 'rate', 'comment'])
comment = jieba.cut(str(data['comment']), cut_all=False)
wl_space_split = " ".join(comment)
backgroudImage = np.array(Image.open(r"./unknow_3.png"))
stopword = STOPWORDS.copy()
wc = WordCloud(width=1920, height=1080, background_color='white',
               mask=backgroudImage,
               font_path="./Deng.ttf",
               stopwords=stopword, max_font_size=400,
               random_state=50)
wc.generate_from_text(wl_space_split)
plt.imshow(wc)
plt.axis("off")
wc.to_file('unknow_word_cloud.png')
```
Output: