Use Python to analyze popular cities on New Year's Day! There are so many people in Changsha!

New Year's Day is coming soon, and it is a rare three-day holiday. A trip is a must, but where to go is the question. The author therefore takes Xiamen, a popular tourist city, as an example and uses Python to collect scenic spot data from Qunar, including each spot's name, district, rating, sales, price, coordinates and other fields. The data is then visualized and briefly analyzed to find attractions that offer good value for money.

Data collection

Collecting data from Qunar.com is relatively simple. After finding the real URL, build the query string from the parameters, use requests to fetch the JSON data, and store the data in a csv file in append mode.

 

 

The core code of the crawler is as follows:

# -*- coding: utf-8 -*-
# @Time: 2020/12/25 9:47 PM
# @Author: Public number dish J learn Python
# @File: Where to go.py
import requests
import random
from time import sleep
import csv
import pandas as pd
from fake_useragent import UserAgent

def get_data(keyword, page):
    ua = UserAgent(verify_ssl=False)
    headers = {"User-Agent": ua.random}  # use a random User-Agent for each request
    url = f'http://piao.qunar.com/ticket/list.json?keyword={keyword}&region=&from=mpl_search_suggest&page={page}'
    res = requests.request("GET", url, headers=headers)
    sleep(random.uniform(1, 2))  # pause 1-2 seconds between requests
    try:
        res_json = res.json()
        # print(res_json)
        sight_List = res_json['data']['sightList']
        print(sight_List)
    except Exception:
        # Skip pages that return no JSON or no sightList
        pass

if __name__ == '__main__':
    keyword = "Xiamen"
    for page in range(1, 100):  # Control the number of pages
        print(f"Extracting page {page}")
        sleep(random.uniform(1, 2))
        get_data(keyword, page)
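
The crawler code above prints each page's sightList rather than saving it. As a rough sketch of the query-string building and append-mode csv storage described earlier, something like the following could be used. The helper names build_url and append_sights are hypothetical, and the sketch assumes that each sightList entry is a dict keyed by the same field names later passed to pd.read_csv.

```python
import csv
from urllib.parse import urlencode

BASE_URL = "http://piao.qunar.com/ticket/list.json"

# Column order matches the names passed to pd.read_csv later in the article.
FIELDS = ['name', 'star', 'score', 'qunarPrice', 'saleCount',
          'districts', 'point', 'intro']

def build_url(keyword, page):
    """Build the request URL; urlencode also escapes non-ASCII keywords."""
    params = {"keyword": keyword, "region": "",
              "from": "mpl_search_suggest", "page": page}
    return f"{BASE_URL}?{urlencode(params)}"

def append_sights(path, sight_list):
    """Append one header-less row per sight (assumed to be a dict)."""
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for sight in sight_list:
            # .get() leaves a blank cell when a field is missing
            writer.writerow([sight.get(field, '') for field in FIELDS])
```

Opening the file with mode 'a' means every page adds rows to the same file, so rerunning the crawler would also add duplicates, which is one reason to deduplicate before analysis.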

Data processing

Import related packages

First, import data processing and data visualization related third-party libraries to facilitate subsequent operations.

import pandas as pd   
import numpy as np  
import matplotlib.pyplot as plt  
import seaborn as sns  
%matplotlib inline  
plt.rcParams['font.sans-serif'] = ['SimHei'] # Set the loaded font name  
plt.rcParams['axes.unicode_minus'] = False # Solve the problem that the minus sign'-' is displayed as a square in the saved image   
import jieba  
import re  
from pyecharts.charts import *  
from pyecharts import options as opts   
from pyecharts.globals import ThemeType    
import stylecloud  
from IPython.display import Image 

Import attractions data

Use pandas to read and preview the crawled csv format scenic spot data.

df = pd.read_csv("/caiJ learn Python/tourism/Xiamen tourist attractions.csv",names=['name','star','score','qunarPrice','saleCount','districts','point','intro'])
df.head() 

 

 

Remove duplicate data

The data scraped from the website contains some duplicates, which need to be removed.

df = df.drop_duplicates()

View data information

Check the field types and missing values. Only the intro field has missing values (377 of 422 entries are non-null), which does not affect this analysis, so no additional processing is needed.

df.info()    

<class 'pandas.core.frame.DataFrame'>  
   Int64Index: 422 entries, 0 to 423  
   Data columns (total 8 columns):  
    #   Column      Non-Null Count  Dtype    
   ---  ------      --------------  -----    
    0   name        422 non-null    object   
    1   star        422 non-null    object   
    2   score       422 non-null    float64  
    3   qunarPrice  422 non-null    float64  
    4   saleCount   422 non-null    int64    
    5   districts   422 non-null    object   
    6   point       422 non-null    object   
    7   intro       377 non-null    object   
   dtypes: float64(2), int64(1), object(5)  
   memory usage: 29.7+ KB 

Descriptive statistics

It can be seen from the descriptive statistics table that, after excluding duplicate data, the average ticket price for the remaining 422 scenic spots is 40 yuan.

color_map = sns.light_palette('orange', as_cmap=True) # light_palette palette  
df.describe().style.background_gradient(color_map) 

 

 

Visual analysis

Attractions

Drawing a word cloud from the introduction text of Xiamen's attractions makes the city's character easy to see. As befits a typical coastal leisure city, terms such as sailing boat, Gulangyu Island, and yacht appear frequently, and terms such as architecture and museum also appear to some extent, reflecting Xiamen's strong cultural atmosphere.

# Draw the word cloud
# get_cut_words is the author's word-segmentation helper; its definition is not included here
text1 = get_cut_words(content_series=df['intro'])
stylecloud.gen_stylecloud(text=' '.join(text1), max_words=100,
                          collocations=False,
                          font_path='simhei.ttf',
                          icon_name='fas fa-heart',
                          size=653,
                          #palette='matplotlib.Inferno_9',
                          output_name='./xiamen.png')  # must match the file displayed below
Image(filename='./xiamen.png')
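
The get_cut_words helper called above is not defined in the article. The following is only a guess at what such a helper might look like, not the author's implementation: it assumes jieba.lcut is used for segmentation, that punctuation is stripped with a regular expression, and that single-character tokens are dropped. The cutter and stop_words parameters are additions made for this sketch.

```python
import re

def get_cut_words(content_series, cutter=None, stop_words=frozenset()):
    """Tokenize an iterable of texts into one flat list of words."""
    if cutter is None:
        import jieba  # imported lazily so a custom cutter works without jieba
        cutter = jieba.lcut
    words = []
    for text in content_series:
        if not isinstance(text, str):
            continue  # skip NaN values from missing intros
        # Keep CJK characters, letters and digits; replace the rest with spaces
        cleaned = re.sub(r'[^\u4e00-\u9fa5A-Za-z0-9]+', ' ', text)
        for word in cutter(cleaned):
            word = word.strip()
            # Drop empty or single-character tokens and stop words
            if len(word) >= 2 and word not in stop_words:
                words.append(word)
    return words
```

Skipping non-string values matters here because the intro column contains missing entries, which pandas represents as float NaN.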

 

 

Attractions distribution

Use kepler.gl to draw the distribution map of Xiamen's tourist attractions, with the size of each circle indicating monthly ticket sales. Xiamen's attractions are clearly concentrated in Siming District and Huli District, while those in other districts are more scattered. Siming District in particular is far ahead of the other districts in ticket sales.

df["lon"] = df["point"].str.split(",",expand=True)[0]   
df["lat"] = df["point"].str.split(",",expand=True)[1]   
df.to_csv("/菜J学Python/data.csv") 
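
Before plotting, it may be worth checking that each point string really parses into a valid coordinate pair. A minimal sketch, assuming the "lon,lat" order used in the split above; parse_point is a hypothetical helper, not part of the original code.

```python
def parse_point(point):
    """Parse a 'lon,lat' string into floats, or return None if invalid."""
    try:
        lon_text, lat_text = point.split(",")
        lon, lat = float(lon_text), float(lat_text)
    except (AttributeError, ValueError):
        return None  # not a string, wrong number of parts, or not numeric
    # Reject values outside the valid longitude/latitude ranges
    if -180 <= lon <= 180 and -90 <= lat <= 90:
        return lon, lat
    return None
```

With pandas, this could be applied as df["point"].apply(parse_point) to spot rows that would not plot correctly.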

 

 

Rating TOP10 attractions

Judging by rating, Xiamen University scores highest with a full 5 points, followed by Gulangyu Island and Nanputuo Temple with 4.9 and 4.6 points respectively. No wonder some people say that if you have never been to Xiamen University and Gulangyu, you have never really been to Xiamen.

df_score = df.pivot_table(index='name',values='score')
df_score.sort_values('score',inplace=True,ascending=False)  
df_score[:10] 

 

 

Monthly sales TOP10 attractions

From the perspective of monthly ticket sales, Gulangyu ranked first, with monthly sales of 1,230, followed by Xiamen Botanical Garden and Gulangyu round-trip ferry. Xiamen Fantawild Fantasy Kingdom also has monthly sales of more than 600.

df_saleCount = df.pivot_table(index='name',values='saleCount')
df_saleCount.sort_values('saleCount',inplace=True,ascending=False)  
df_saleCount[:10] 

 

 

Price TOP20 attractions

In terms of price, activities such as yachts, helicopters, and sailing are expensive, and Xiamen Fantawild is not cheap either. If you are not price-sensitive, you can consider them. If you are traveling on a tight budget, you may want to rule them out in advance.

df_qunarPrice = df.pivot_table(index='name',values='qunarPrice')
df_qunarPrice.sort_values('qunarPrice',inplace=True,ascending=False)  
df_qunarPrice[:20] 

 

 

Monthly sales revenue TOP20 attractions

Because monthly sales volumes vary less across Xiamen's scenic spots than prices do, sales revenue (price multiplied by monthly sales volume) is driven more by price. The ranking below likewise shows that the attractions with the highest monthly sales revenue are still the yachts and Fantawild.

df["saleTotal"] = df["qunarPrice"]*df["saleCount"]  
df_saleTotal = df.pivot_table(index='name',values='saleTotal')
df_saleTotal.sort_values('saleTotal',inplace=True,ascending=False)
df_saleTotal[:20]
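
To see why price can outweigh volume in the saleTotal ranking, here is a small worked example using only the standard library. The names and numbers are made up purely for illustration and are not taken from the Qunar data; revenue_ranking is a hypothetical helper.

```python
# Made-up (price in yuan, monthly tickets sold) pairs, for illustration only
spots = {
    "cheap_popular": (10.0, 1000),
    "mid_range": (100.0, 300),
    "expensive_niche": (800.0, 50),
}

def revenue_ranking(spots):
    """Return (name, price * count) pairs sorted by revenue, highest first."""
    totals = {name: price * count for name, (price, count) in spots.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

print(revenue_ranking(spots))
```

In this toy data the spot with the fewest tickets sold still ranks first, because its price is 80 times that of the cheapest spot while its volume is only 20 times smaller.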

 

 

Attraction level distribution

Judging from the distribution of attraction levels, scenic spots rated 3A or above account for less than 5% of Xiamen's attractions.

df_star = df["star"].value_counts()
df_star = df_star.sort_values(ascending=False)
#print(df_star)
c = (
        Pie(init_opts=opts.InitOpts(theme=ThemeType.WALDEN))
        .add(
            "",
            [list(z) for z in zip(df_star.index.to_list(),df_star.to_list())]
        )
        .set_global_opts(legend_opts = opts.LegendOpts(is_show = False),title_opts=opts.TitleOpts(title="attraction level distribution",subtitle="data source: Qunar.com\nMap: CaiJ learns Python",pos_top="0.5%",pos_left ='left'))
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{d}%",font_size=16))
    )
c.render_notebook()
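
The pie chart receives [label, count] pairs, and the {d} placeholder in the label formatter displays each level's percentage share. The same numbers can be checked by hand with a short standard-library sketch. The sample labels below are made up for illustration, and level_shares is a hypothetical helper.

```python
from collections import Counter

def level_shares(stars):
    """Return [label, count, percent] rows, most common level first."""
    counts = Counter(stars)
    total = sum(counts.values())
    return [[label, n, round(100 * n / total, 2)]
            for label, n in counts.most_common()]

# Made-up labels for illustration; "none" stands in for unrated spots
sample = ["none"] * 18 + ["4A"] + ["5A"]
print(level_shares(sample))
```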

 

 

The following are the selected attractions rated 3A and above:

df[df["star"]!='无'].sort_values("star",ascending=False)  # '无' means "none", i.e. no level

 

 

Summary

From the simple analysis above, we can draw roughly the following conclusions:

1. Xiamen is a typical coastal leisure city with rich ocean and cultural landscapes;

2. Xiamen's tourist attractions are mainly concentrated in Siming District, while those in other districts are scattered;

3. Xiamen University has the highest rating, followed by Gulangyu;

4. Gulangyu's ticket sales are far ahead of other attractions in Xiamen;

5. The more expensive scenic spots and activities include yachts, sailing boats and Fantawild.

Reminder: The epidemic has not completely dissipated. Try to avoid risk areas when traveling over the New Year's Day holiday.


Origin blog.csdn.net/weixin_43881394/article/details/112242760