Python data analysis case 03: K-means cluster analysis of weather data

K-means is one of the most commonly used clustering algorithms. In this case, weather data from ten regions in Shaanxi are used to construct features and perform cluster analysis.

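First, the data is stored in the folder '天气数据' (weather data), one Excel file per region.

Each Excel file includes the columns 日期 (date), 最高气温 (maximum temperature), 最低气温 (minimum temperature), and 天气 (weather), which are the ones used below.

Data processing starts below.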


Data preprocessing

Import packages

import os
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import datetime as dt
import re
#from sklearn.preprocessing import MinMaxScaler
%matplotlib inline

pd.options.display.float_format = '{:,.4f}'.format
np.set_printoptions(precision=4)
plt.rcParams['font.sans-serif'] = 'SimHei'    # font that can render the Chinese labels
plt.rcParams['axes.unicode_minus'] = False    # render minus signs correctly with this font

Get the file names and region names

file_name=os.listdir('./天气数据')
print(file_name)
region_name=[i[:2] for i in file_name]   # the first two characters of each file name are the region name
region_name
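
As a small illustration of the slicing above, the sketch below uses hypothetical file names (the real folder contents are not listed in this article) and assumes each name starts with a two-character region name. The extension filter is an extra precaution that the original code does not have.

```python
# Hypothetical file names; the real ones come from os.listdir('./天气数据')
sample_files = ['西安天气.xlsx', '延安天气.xlsx', '汉中天气.xlsx', '.DS_Store']

# Keep only Excel files, then take the first two characters as the region name
excel_files = [f for f in sample_files if f.endswith('.xlsx')]
sample_regions = [f[:2] for f in excel_files]
print(sample_regions)
```

Filtering first keeps stray files in the folder from turning into bogus region names.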

Define some functions to process the data

def date_transform(x):
    # keep the part before the first space and parse it as a date
    a = x.split(' ')[0]
    a = pd.to_datetime(a, format='%Y-%m-%d')
    return a
def C_check(C):
    # strip the '℃' unit and convert to int, e.g. '25℃' -> 25
    a = C.split('℃')
    return int(a[0])
def tianqi_check1(txt):
    # for 'A转B' (A turning to B) or 'A~B', keep only the first condition A
    if '转' in txt:
        a = re.findall(r'\w{1,5}转', txt)
        a = a[0].split('转')
        a = a[0]
    elif '~' in txt:
        a = re.findall(r'\w{1,5}~', txt)
        a = a[0].split('~')
        a = a[0]
    else:
        a = txt
    return a
def tianqi_check2(txt):
    # for 'A到B' (A to B), keep only the part after '到'
    if '到' in txt:
        a = re.findall(r'到\w{1,5}', txt)
        a = a[0].split('到')
        a = a[1]
    else:
        a = txt
    return a
# one data frame per feature: rows are months, columns are regions
df_最高气温=pd.DataFrame()
df_最低气温=pd.DataFrame()
df_天气=pd.DataFrame()
# hand-made numerical score for each weather condition
dic_天气={'晴':0,'晴到多云':0.5,'晴间多云':0.5,'局部多云':0.5,'多云':1,'少云':1.5,'阴':2,'阴天':2,'雾':2.5,'霾':2.5,'小雨':3,'雨':3,'阴到小雨':2.5,
            '小到中雨':3.5,'小雨到中雨':3.5,'阵雨':3.5,'中雨':4,'小雨到大雨':4,'雷阵雨':4,'雷雨':4,'中到大雨':4.5,'大雨':5,'大到暴雨':5.5,
            '暴雨':6,'暴风雨':6.5,'小雪':7,'雨夹雪':7,'雪':7,'中雪':8,'大雪':9,'浮尘':2.5,'扬沙':2.5,'风':2.5}
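
To make the string cleaning concrete, here is a minimal sketch that applies the same ideas to a few sample strings. The helper names `first_condition`, `after_dao`, and `parse_celsius` and the sample values are made up for illustration; they mirror the logic of `tianqi_check1`, `tianqi_check2`, and `C_check` rather than replace them.

```python
import re

def first_condition(txt):
    # Same idea as tianqi_check1: keep what precedes '转' or '~'
    for sep in ('转', '~'):
        if sep in txt:
            return re.findall(r'\w{1,5}' + sep, txt)[0].split(sep)[0]
    return txt

def after_dao(txt):
    # Same idea as tianqi_check2: keep what follows '到'
    if '到' in txt:
        return re.findall(r'到\w{1,5}', txt)[0].split('到')[1]
    return txt

def parse_celsius(c):
    # Same idea as C_check: drop the unit and convert to int
    return int(c.split('℃')[0])

# Hypothetical raw values, not taken from the article's files
samples = ['多云转晴', '小雨~中雨', '小到中雨', '晴']
cleaned = [after_dao(first_condition(s)) for s in samples]
print(cleaned)
print(parse_celsius('-3℃'))
```

One consequence worth noting: a value such as '小到中雨' is reduced to '中雨' before the dictionary lookup, so the dictionary keys that contain '到' are likely never matched after this cleaning.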

Read and process each file

for i,f in enumerate(file_name):
    #print(i)
    file_path = f'./天气数据/{f}'
    data=pd.read_excel(file_path,usecols=['日期','最高气温','最低气温','天气'])
    data['日期']=data['日期'].apply(date_transform)
    data['最高气温']=data['最高气温'].apply(C_check)
    data['最低气温']=data['最低气温'].apply(C_check)
    data['天气']=data['天气'].astype(str).apply(tianqi_check1)
    data['天气']=data['天气'].astype(str).apply(tianqi_check2)
    data['天气']=data['天气'].map(dic_天气)    # conditions missing from the dictionary become NaN
    data['天气']=data['天气'].fillna(data['天气'].mean())    # call mean() and assign the result back
    data=data.set_index('日期').resample('M').mean()    # monthly means; pandas >= 2.2 prefers the alias 'ME'
    #print(len(data))
    df_最高气温[region_name[i]]=data['最高气温']
    df_最低气温[region_name[i]]=data['最低气温']
    df_天气[region_name[i]]=data['天气']
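
The map, fill, and monthly-average steps can be tried on synthetic data. The sketch below is an illustration under stated assumptions: the daily records and the two-entry mapping are made up, and it groups by a monthly period with `to_period('M')` as an alternative to `resample` that does not depend on the resample alias.

```python
import numpy as np
import pandas as pd

# Hypothetical daily records for one region (not the article's data)
dates = pd.date_range('2021-01-01', '2021-02-28', freq='D')
daily = pd.DataFrame({
    '日期': dates,
    '最高气温': np.where(dates.month == 1, 5, 9),
    '天气': ['晴' if d.day % 2 == 0 else '多云' for d in dates],
})
daily.loc[0, '天气'] = '冰雹'   # a condition missing from the mapping below

mapping = {'晴': 0, '多云': 1}   # tiny stand-in for dic_天气
daily['天气'] = daily['天气'].map(mapping)                    # unmapped value becomes NaN
daily['天气'] = daily['天气'].fillna(daily['天气'].mean())    # fill NaN with the mean score

# Average each calendar month
indexed = daily.set_index('日期')
monthly = indexed.groupby(indexed.index.to_period('M')).mean()
print(monthly)
```

Each row of `monthly` is one month, which matches the shape of the per-region columns stored in the three data frames.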

Finally, three data frames are formed: the monthly mean maximum temperature, the monthly mean minimum temperature, and the monthly mean weather condition (for example, rainy or sunny). The weather conditions are mapped to numbers with the dictionary above, so all three features are numerical variables.


Descriptive statistics

df_最高气温.plot(title='各地区每月最高温变化图',figsize=(14,5),xlabel='日期',ylabel='最高温')

 

df_最低气温.plot(title='各地区每月最低温变化图',figsize=(14,5),xlabel='日期',ylabel='最低温')

 

df_天气.plot(title='各地区每月天气变化图',figsize=(14,5),xlabel='日期',ylabel='天气')

All three show a clear periodic (seasonal) pattern. The weather curves are noisier because the values come from a hand-made numerical mapping.

Then draw box plots of the maximum temperature:

column = df_最高气温.columns.tolist()  # column names (one per region)
fig = plt.figure(figsize=(20, 8), dpi=128)  # figure width and height
for i in range(len(column)):
    plt.subplot(2,5, i + 1)  # grid of 2 rows x 5 columns for the ten regions
    sns.boxplot(data=df_最高气温[column[i]], orient="v",width=0.5)  # box plot
    plt.ylabel(column[i], fontsize=16)
    plt.title(f'{region_name[i]}每月最高温箱线图',fontsize=16)
plt.tight_layout()
plt.show()

 

The minimum temperature and the weather can be plotted in the same way; just change the name of the data frame.

Next, draw kernel density plots of the minimum temperature (the same works for the maximum temperature and the weather):

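fig = plt.figure(figsize=(20, 8), dpi=128)  # figure width and height
for i in range(len(column)):
    plt.subplot(2,5, i + 1)  # grid of 2 rows x 5 columns for the ten regions
    ax = sns.kdeplot(data=df_最低气温[column[i]],color='blue',fill=True)  # fill replaces the deprecated shade (seaborn >= 0.11)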
    plt.ylabel(column[i], fontsize=16)
    plt.title(f'{region_name[i]}每月最低温核密度图',fontsize=16)
plt.tight_layout()
plt.show()

 Draw a correlation heat map of the weather

fig = plt.figure(figsize=(8, 8), dpi=128) 
corr= sns.heatmap(df_天气[column].corr(),annot=True,square=True)

The same applies to the maximum and minimum temperatures; just change the name of the data frame. The heat map shows which regions have highly correlated weather.
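
To show what the heat map is built from, here is a small sketch of `DataFrame.corr()` on made-up monthly scores. The region names A, B, and C and their values are hypothetical.

```python
import pandas as pd

# Hypothetical monthly weather scores for three regions
scores = pd.DataFrame({
    'A': [0.0, 1.0, 2.0, 3.0],
    'B': [1.0, 2.0, 3.0, 4.0],   # A shifted by a constant
    'C': [3.0, 2.0, 1.0, 0.0],   # A reversed
})

# Pairwise Pearson correlation between columns (the pandas default)
corr_matrix = scores.corr()
print(corr_matrix.round(2))
```

Columns that move together score close to 1 and columns that move in opposite directions score close to -1; `sns.heatmap` simply colors this matrix.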


 K-means clustering

Because three features were constructed, K-means clustering can be performed three times and the results compared. First, cluster on the maximum temperature:

Maximum temperature K-means clustering

from sklearn.cluster import KMeans 
kmeans_model = KMeans(n_clusters=3, random_state=123, n_init=20)
kmeans_model.fit(df_最高气温.T)   # transpose so that each region is one sample and each month is one feature
kmeans_model.inertia_   # within-cluster sum of squares

# kmeans_cc=kmeans_model.cluster_centers_   # cluster centers
# kmeans_cc

kmeans_labels = kmeans_model.labels_   # cluster label of each sample
kmeans_labels 

pd.Series(kmeans_labels).value_counts()   # number of samples in each cluster
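
The role of the transpose is easier to see on a toy table. This is a minimal sketch with made-up numbers and hypothetical region names: rows are months, columns are regions, and `.T` turns each region into one sample for `KMeans`.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical monthly maximum temperatures: rows are months, columns are regions
toy = pd.DataFrame({
    'north_1': [0, 10, 25, 8],
    'north_2': [1, 11, 26, 9],
    'south_1': [8, 18, 32, 16],
    'south_2': [9, 19, 33, 17],
}, index=['m1', 'm2', 'm3', 'm4'])

# After the transpose each row is a region described by its four monthly values
model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(toy.T)
labels = pd.Series(model.labels_, index=toy.columns)
print(labels)
print(model.inertia_)   # within-cluster sum of squares
```

Regions with similar monthly curves end up with the same label; which number each group receives is arbitrary.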

Map each region to its cluster label

dic_rusult=dict(zip(df_最高气温.columns,kmeans_labels))   # region -> label, for any number of regions
dic_rusult

Group the regions by cluster and print the results

第一类地区=[]
第二类地区=[]
第三类地区=[]
for k,v in dic_rusult.items():
    if v==0:
        第一类地区.append(k)
    elif v==1:
        第二类地区.append(k)
    elif v==2:
        第三类地区.append(k)
print(f'从最高气温来看的聚类的结果,将地区分为三个地区,\n第一个地区为:{第一类地区},\n第二个地区为:{第二类地区},\n第三个地区为:{第三类地区}')
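
The three hand-written lists can also be built in one pass. The sketch below is one possible alternative, not the article's code; the `region_labels` dictionary is a made-up stand-in for `dic_rusult`.

```python
from collections import defaultdict

# Hypothetical region -> label mapping, standing in for dic_rusult
region_labels = {'region_a': 0, 'region_b': 2, 'region_c': 0, 'region_d': 1}

# Collect the regions of each cluster without one list variable per cluster
groups = defaultdict(list)
for region, label in region_labels.items():
    groups[label].append(region)

for label in sorted(groups):
    print(f'cluster {label}: {groups[label]}')
```

This form keeps working if `n_clusters` is changed, because no list is tied to a fixed label number.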

Checking the result against a map shows that it is quite reasonable: the regions within each cluster are geographically close to one another.


Minimum temperature K-means clustering

kmeans_model = KMeans(n_clusters=3, random_state=123, n_init=20)
kmeans_model.fit(df_最低气温.T)

kmeans_labels = kmeans_model.labels_   # cluster label of each sample
kmeans_labels 

pd.Series(kmeans_labels).value_counts()   # number of samples in each cluster

dic_rusult2=dict(zip(df_最低气温.columns,kmeans_labels))   # region -> label
dic_rusult2

第一类地区=[]
第二类地区=[]
第三类地区=[]
for k,v in dic_rusult2.items():
    if v==2:
        第一类地区.append(k)
    elif v==1:
        第二类地区.append(k)
    elif v==0:
        第三类地区.append(k)
print(f'从最低气温来看的聚类的结果,将地区分为三个地区,\n第一个地区为:{第一类地区},\n第二个地区为:{第二类地区},\n第三个地区为:{第三类地区}')

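The result is similar to the clustering on the maximum temperature. K-means assigns label numbers arbitrarily, so the code above matches the labels to the three regions by hand:

The first region corresponds to Guanzhong

The second region corresponds to Northern Shaanxi

The third region corresponds to Southern Shaanxi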
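
Since label numbers differ between runs, one way to check that two clusterings agree without matching labels by hand is the adjusted Rand index from scikit-learn. This is a suggestion outside the original workflow, and the label vectors below are made up for illustration.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels for the same five regions from different clusterings
labels_max = [0, 0, 1, 1, 2]
labels_min = [2, 2, 1, 1, 0]     # same grouping, different label numbers
labels_other = [0, 1, 1, 2, 2]   # a different grouping

# The score ignores how the clusters are numbered
same = adjusted_rand_score(labels_max, labels_min)
different = adjusted_rand_score(labels_max, labels_other)
print(same, different)
```

A score of 1.0 means the two clusterings put the regions into identical groups; lower values mean they disagree on some regions.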


Weather K-Means Clustering

kmeans_model = KMeans(n_clusters=3, random_state=123, n_init=20)
kmeans_model.fit(df_天气.T)
kmeans_labels = kmeans_model.labels_   # cluster label of each sample
pd.Series(kmeans_labels).value_counts()   # number of samples in each cluster
dic_rusult3=dict(zip(df_天气.columns,kmeans_labels))   # region -> label
dic_rusult3
第一类地区=[]
第二类地区=[]
第三类地区=[]
for k,v in dic_rusult3.items():
    if v==1:
        第一类地区.append(k)
    elif v==2:
        第二类地区.append(k)
    elif v==0:
        第三类地区.append(k)
print(f'从天气来看的聚类的结果,将地区分为三个地区,\n第一个地区为:{第一类地区},\n第二个地区为:{第二类地区},\n第三个地区为:{第三类地区}')

Clustering on the weather gives a result similar to clustering on the temperature:

The first region corresponds to Guanzhong

The second region corresponds to Northern Shaanxi

The third region corresponds to Southern Shaanxi

According to the algorithm, this indicates that geographically close regions have more similar weather.


Origin blog.csdn.net/weixin_46277779/article/details/126401866