New Year's Day is coming soon, and with it a rare three-day holiday. A trip is a must, but where to go is the question. The author therefore took Xiamen, a popular tourist city, as an example and used Python to scrape attraction data from Qunar.com, including the attraction name, district, rating, sales, price, coordinates and other fields, then visualized the data and ran a simple analysis to find attractions with high cost performance.
Data collection
Collecting data from Qunar.com is relatively simple: find the real URL, construct the request parameters, use the requests library to fetch the JSON data, and append the records to a CSV file.
The core code of the crawler is as follows:
# -*- coding: utf-8 -*-
# @Time: 2020/12/25 9:47 PM
# @Author: CaiJ learns Python (public account)
# @File: Where to go.py
import requests
import random
from time import sleep
import csv
import pandas as pd
from fake_useragent import UserAgent

def get_data(keyword, page):
    ua = UserAgent(verify_ssl=False)
    headers = {"User-Agent": ua.random}
    url = (f'http://piao.qunar.com/ticket/list.json?keyword={keyword}'
           f'&region=&from=mpl_search_suggest&page={page}')
    res = requests.request("GET", url, headers=headers)
    sleep(random.uniform(1, 2))
    try:
        res_json = res.json()
        #print(res_json)
        sight_list = res_json['data']['sightList']
        print(sight_list)
    except Exception:
        pass

if __name__ == '__main__':
    keyword = "Xiamen"
    for page in range(1, 100):  # control the number of pages
        print(f"Extracting page {page}")
        sleep(random.uniform(1, 2))
        get_data(keyword, page)
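The text above mentions storing the data to CSV in append mode, a step the snippet omits. A minimal sketch of such a saving helper is shown below; the function name `save_rows` and the sample record are illustrative, not from the original post, and the field names are assumed from the columns used later in the analysis.

```python
import csv

def save_rows(sight_list, path="xiamen_sights.csv"):
    """Append one page of attraction records to a CSV file.
    A sketch, not the author's original code."""
    fields = ['name', 'star', 'score', 'qunarPrice', 'saleCount',
              'districts', 'point', 'intro']
    with open(path, 'a', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        for sight in sight_list:
            # .get() tolerates keys missing from individual JSON records
            writer.writerow([sight.get(k, '') for k in fields])

# Illustrative call with a fabricated record:
save_rows([{'name': 'Gulangyu', 'star': '5A', 'score': 4.9,
            'qunarPrice': 50.0, 'saleCount': 1230,
            'districts': 'Siming', 'point': '118.06,24.44',
            'intro': 'island scenery'}])
```

Appending page by page means a crash midway does not lose the pages already fetched.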
Data processing
Import related packages
First, import data processing and data visualization related third-party libraries to facilitate subsequent operations.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']  # set a font that can render Chinese
plt.rcParams['axes.unicode_minus'] = False    # keep the minus sign from rendering as a square
import jieba
import re
from pyecharts.charts import *
from pyecharts import options as opts
from pyecharts.globals import ThemeType
import stylecloud
from IPython.display import Image
Import attractions data
Use pandas to read and preview the crawled csv format scenic spot data.
df = pd.read_csv("/caiJ learn Python/tourism/Xiamen tourist attractions.csv",
                 names=['name', 'star', 'score', 'qunarPrice', 'saleCount',
                        'districts', 'point', 'intro'])
df.head()
Remove duplicate data
There is a certain amount of duplicate data in the website, which needs to be eliminated.
df = df.drop_duplicates()
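A quick sanity check on a toy DataFrame (not the scraped data) shows what `drop_duplicates` does: only rows that are identical across every column are removed.

```python
import pandas as pd

# Toy frame with one fully duplicated row
df = pd.DataFrame({'name': ['Gulangyu', 'Gulangyu', 'Xiamen University'],
                   'qunarPrice': [50.0, 50.0, 0.0]})
print(df.duplicated().sum())  # number of duplicate rows before cleaning
df = df.drop_duplicates()
print(len(df))                # rows remaining after cleaning
```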
View data information
View the field type and missing value situation, which meets the needs of analysis, without additional processing.
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 422 entries, 0 to 423
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   name        422 non-null    object
 1   star        422 non-null    object
 2   score       422 non-null    float64
 3   qunarPrice  422 non-null    float64
 4   saleCount   422 non-null    int64
 5   districts   422 non-null    object
 6   point       422 non-null    object
 7   intro       377 non-null    object
dtypes: float64(2), int64(1), object(5)
memory usage: 29.7+ KB
Descriptive statistics
The descriptive statistics table shows that, after removing duplicates, the average ticket price of the remaining 422 attractions is about 40 yuan.
color_map = sns.light_palette('orange', as_cmap=True)  # light_palette colormap
df.describe().style.background_gradient(color_map)
Visual analysis
Attractions
A word cloud of the attraction introduction texts makes Xiamen's character easy to see. As a typical coastal leisure city, terms such as sailing boat, Gulangyu Island and yacht appear frequently, while terms such as architecture and museum also show up, reflecting Xiamen's strong cultural atmosphere.
# Draw the word cloud
text1 = get_cut_words(content_series=df['intro'])
stylecloud.gen_stylecloud(text=' '.join(text1),
                          max_words=100,
                          collocations=False,
                          font_path='simhei.ttf',
                          icon_name='fas fa-heart',
                          size=653,
                          #palette='matplotlib.Inferno_9',
                          output_name='./xiamen.png')
Image(filename='./xiamen.png')
Attractions distribution
Use kepler.gl to map the distribution of Xiamen's tourist attractions, with circle size indicating monthly ticket sales. The attractions are clearly concentrated in Siming District and Huli District, while those in other districts are more scattered. Siming District in particular leads ticket sales by a wide margin.
df["lon"] = df["point"].str.split(",", expand=True)[0]
df["lat"] = df["point"].str.split(",", expand=True)[1]
df.to_csv("/菜J学Python/data.csv")
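One detail worth noting: `str.split` leaves the coordinates as strings, but mapping tools want numbers. A small sketch on a toy frame (fabricated coordinates, mimicking the scraped schema) shows splitting once and casting to float in a single step.

```python
import pandas as pd

# Toy frame; 'point' holds "lon,lat" strings as in the scraped data
df = pd.DataFrame({'name': ['Gulangyu', 'Xiamen University'],
                   'point': ['118.06,24.44', '118.09,24.43'],
                   'saleCount': [1230, 0]})
# split once with expand=True, then cast so mapping tools see numbers
df[['lon', 'lat']] = df['point'].str.split(',', expand=True).astype(float)
print(df[['name', 'lon', 'lat']])
```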
Rating TOP10 attractions
Judging from the ratings, Xiamen University scored highest with a full 5 points, followed by Gulangyu Island and Nanputuo Temple with 4.9 and 4.6 points respectively. No wonder some say that never having visited Xiamen University and Gulangyu is as good as never having visited Xiamen.
df_score = df.pivot_table(index='name', values='score')
df_score.sort_values('score', inplace=True, ascending=False)
df_score[:10]
Monthly sales TOP10 attractions
In terms of monthly ticket sales, Gulangyu ranks first with 1,230 tickets per month, followed by the Xiamen Botanical Garden and the Gulangyu round-trip ferry. Xiamen Fantawild Fantasy Kingdom also sells more than 600 tickets per month.
df_saleCount = df.pivot_table(index='name', values='saleCount')
df_saleCount.sort_values('saleCount', inplace=True, ascending=False)
df_saleCount[:10]
Price TOP20 attractions
In terms of ticket price, activities such as yachts, helicopters and sailing are expensive, and Xiamen Fantawild is not cheap either. If you are not price-sensitive you can consider them; if you are travelling on a budget, you can steer clear in advance.
dfdf_qunarPrice = df.pivot_table(index='name',values='qunarPrice') df_qunarPrice.sort_values('qunarPrice',inplace=True,ascending=False) df_qunarPrice[:20]
Monthly sales revenue TOP20 attractions
Since monthly sales volumes vary less across Xiamen's attractions than prices do, revenue is driven mainly by price. The figure below confirms that the attractions with the highest monthly revenue are still the yachts and Fantawild.
df["saleTotal"] = df["qunarPrice"] * df["saleCount"]
df_saleTotal = df.pivot_table(index='name', values='saleTotal')
df_saleTotal.sort_values('saleTotal', inplace=True, ascending=False)
df_saleTotal[:20]
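The revenue column is simply price × volume; the pivot/sort steps can be verified on a toy frame (the names and numbers below are fabricated for illustration).

```python
import pandas as pd

# Toy frame to illustrate the revenue (price x volume) ranking
df = pd.DataFrame({'name': ['Yacht tour', 'Fantawild', 'Nanputuo Temple'],
                   'qunarPrice': [500.0, 280.0, 0.0],
                   'saleCount': [100, 600, 900]})
df['saleTotal'] = df['qunarPrice'] * df['saleCount']
# one row per name, so pivot_table's default mean leaves values unchanged
df_saleTotal = df.pivot_table(index='name', values='saleTotal')
df_saleTotal.sort_values('saleTotal', ascending=False, inplace=True)
print(df_saleTotal)
```

Note how a cheaper attraction with large volume can out-earn an expensive one with few sales, and a free attraction contributes nothing regardless of volume.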
Attraction level distribution
Judging from the distribution of attraction levels, scenic spots rated 3A or above account for less than 5% of Xiamen's attractions.
df_star = df["star"].value_counts()
df_star = df_star.sort_values(ascending=False)
#print(df_star)
c = (
    Pie(init_opts=opts.InitOpts(theme=ThemeType.WALDEN))
    .add(
        "",
        [list(z) for z in zip(df_star.index.to_list(), df_star.to_list())]
    )
    .set_global_opts(
        legend_opts=opts.LegendOpts(is_show=False),
        title_opts=opts.TitleOpts(title="Attraction level distribution",
                                  subtitle="Data source: Qunar.com\nChart: CaiJ learns Python",
                                  pos_top="0.5%", pos_left='left')
    )
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{d}%", font_size=16))
)
c.render_notebook()
df[df["star"] != '无'].sort_values("star", ascending=False)  # '无' marks unrated attractions
The following are selected 3A and above attractions:
summary
From the simple analysis above, we can draw roughly the following conclusions:
1. Xiamen is a typical coastal leisure city with rich ocean and cultural landscape;
2. Xiamen tourist attractions are mainly concentrated in Siming District, and other areas are scattered;
3. Xiamen University has the highest reputation, followed by Gulangyu;
4. Gulangyu ticket sales are far ahead of other attractions in Xiamen;
5. The more expensive attractions and activities include yachts, sailing boats and Fantawild.
Reminder: the epidemic has not fully subsided; try to avoid risk areas when travelling over New Year's Day.