Business district analysis based on base station location data


This book often does not provide the raw data. Most of the data it ships has already been partly processed, so much of the practice in feature engineering is lost.
In the future, I will pay more attention to analyzing data features in Kaggle competitions.

Experiment introduction

Experimental background

With the spread of personal mobile phones and the Internet, the mobile phone has become something almost everyone carries. By combining the geographic coverage of mobile signals with time-stamped phone positioning data, the real movement trajectories of a population can be largely reconstructed, which yields feature information about the spatial distribution of the population and its activity.

A business district is an important space for corporate activity in the modern market. One purpose of delineating business districts is to study the distribution of potential customers so that appropriate business strategies can be formulated. The data used here was provided by a communication operator and consists of user positioning data obtained by parsing a specific interface.

Experimental goal

  1. Based on users' historical positioning data, use data mining techniques to group the base stations.
  2. Analyze the characteristics of the resulting business district groups, compare the value of the different categories, and select suitable areas for targeted marketing activities.

Experimental Analysis Methods and Processes

From the data provided by the communication operator, it can be inferred that positioning data is generated whenever a mobile phone user uses services such as short messages, calls, or Internet access. Since locations in the data are identified by base station, each base station can be regarded as a "business district". By summarizing the crowd characteristics within each base station's range and using a clustering algorithm to identify different types of base station ranges, we effectively identify different categories of business districts. The crowd characteristics of an area can be measured in terms of people flow and per capita stay time.

Analysis process


1) The user positioning data is obtained by parsing, processing, and filtering user attributes from the specific interface provided by the mobile communication operator.
2) Taking a single user as an example, explore and analyze the data, study the stay time at different base stations, and then preprocess the data, including data reduction and data transformation.
3) Using the preprocessed modeling data from step 2), cluster the business districts based on the people-flow characteristics of each base station's coverage area, analyze the characteristics of each group, and select suitable areas for the operator's promotional activities.

Data extraction and analysis

Data extraction

The positioning data is obtained by parsing, processing, and filtering data from the specific interface provided by the mobile communication operator. The observation window runs from 2014-1-1 to 2014-6-30, and the positioning data of one area of one city within this window is extracted to form the modeling data.

Data analysis

To make the data easier to inspect, we first extract a single user, the one whose EMASI number is "55555", and then look at that user's positioning data on January 1, 2014.

Looking at the data, two consecutive records may refer to the same base station at different times.
[Table: positioning records of user 55555 on January 1, 2014]

As the table shows, the user is within range of base station 36908 at 00:31:48 on January 1, 2014, and the next record places the user within range of base station 36902 at 00:53:46 on the same day. This means the user stayed at base station 36908 from 00:31:48 to 00:53:46, a stay of 21 minutes and 58 seconds, and entered the range of base station 36902 at 00:53:46. Determining the stay time at each base station therefore relies on comparing the current record with the next one: a change of base station means the user has moved, and the stay time at the previous base station can then be recorded.
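The comparison of each record with the next one can be sketched in pandas. This is only an illustration, not the book's code: the column names `time` and `station` are assumptions, and the middle row of the sample is made up to show a repeated station.

```python
import pandas as pd

# Toy records for one user. The first and last rows mirror the example in the
# text; the middle row is hypothetical (an extra record at the same station).
records = pd.DataFrame({
    "time": pd.to_datetime([
        "2014-01-01 00:31:48",
        "2014-01-01 00:40:00",
        "2014-01-01 00:53:46",
    ]),
    "station": [36908, 36908, 36902],
})

def dwell_times(df):
    """Return one row per continuous stay: station, start, end, seconds."""
    df = df.sort_values("time").reset_index(drop=True)
    # A new stay starts whenever the station differs from the previous record.
    changed = df["station"] != df["station"].shift()
    stay_id = changed.cumsum().rename("stay_id")
    stays = df.groupby(stay_id).agg(station=("station", "first"),
                                    start=("time", "first"))
    # A stay ends when the next stay begins; the last stay has no known end.
    stays["end"] = stays["start"].shift(-1)
    stays["seconds"] = (stays["end"] - stays["start"]).dt.total_seconds()
    return stays.dropna(subset=["end"]).reset_index(drop=True)

stays = dwell_times(records)
print(stays)
```

The final stay of the day is dropped here because its end time is unknown; a real pipeline would need a rule for closing it, for example at the end of the observation window.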

Data preprocessing

Data reduction

The raw data contains many attributes. Three of them, network type, LOC number, and signaling type, are irrelevant to the mining goal and are removed. The user's stay time does not need millisecond precision, so the millisecond field is removed as well. Stay time is computed simply as the time difference between two consecutive records. To reduce the data dimension, year, month, and day are merged into a date, and hour, minute, and second are merged into a time, giving the reduced data.

Data transformation

The goal of the mining is to find high-value business districts. From the users' positioning data we need to extract people-flow features of the area within each base station's range, such as per capita stay time and people flow. A high-value business district has a large flow of people and a long per capita stay. However, office workers stay within a fixed base station range during the day, for a long time and in large numbers, and residential areas likewise show a fixed range, long stays, and large flows. The type of business district therefore cannot be judged from stay time alone. In modern working life, a week is a small work cycle divided into weekdays and weekends, and a day is divided into working hours and non-working hours.

To sum up, four people-flow indicators are designed: the per capita stay time during working hours on weekdays, the per capita stay time in the early morning, the per capita stay time on weekends, and the average daily people flow. The per capita stay time during working hours on weekdays is the average time that all users spend within the base station's range during working hours on weekdays. The per capita stay time in the early morning is the average time that all users spend within the base station's range between 00:00 and 07:00. The per capita stay time on weekends is defined in the same way for weekends. The average daily people flow is the average number of people who are within the base station's range each day.

These four indicators are complicated to compute directly from the raw data. The raw data first needs to be processed into intermediate data, from which the four indicators can then be calculated. The formulas are given for base station 1 and are then applied to every other base station to obtain the result.
[Figure: formulas for the four indicators at base station 1]
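The formula images did not survive, so the sketch below is only one plausible reading of the four indicators, not the book's formulas. It assumes a hypothetical intermediate table with one row per base station, user, and day, holding the seconds spent in each time window. All column names and numbers are invented, and dividing by the number of users times the number of observed days is itself an assumption.

```python
import pandas as pd

# Hypothetical intermediate data: one row per (station, user, date).
# The *_s columns are seconds spent inside the station's range in each
# time window; every value here is made up for illustration.
visits = pd.DataFrame({
    "station": [1, 1, 1, 2],
    "user": ["a", "b", "a", "a"],
    "date": pd.to_datetime(["2014-01-02", "2014-01-02",
                            "2014-01-04", "2014-01-04"]),
    "weekday_work_s": [3600, 1800, 0, 0],
    "early_morning_s": [0, 600, 0, 7200],
    "weekend_s": [0, 0, 5400, 9000],
})

def station_indicators(df, n_days):
    """Compute the four indicators per station under the stated assumptions."""
    grouped = df.groupby("station")
    n_users = grouped["user"].nunique()
    totals = grouped[["weekday_work_s", "early_morning_s", "weekend_s"]].sum()
    # Per capita stay: total seconds divided by (distinct users x observed days).
    out = totals.div(n_users * n_days, axis=0)
    # Daily flow: number of user-day visits divided by observed days.
    out["daily_flow"] = grouped.size() / n_days
    return out

indicators = station_indicators(visits, n_days=2)
print(indicators)
```

Each row of `indicators` then corresponds to one base station, which is the shape the clustering step expects.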

The four attributes differ greatly in magnitude. To remove the effect of these order-of-magnitude differences, min-max (dispersion) normalization is applied before clustering. The code below performs the normalization and produces the sample data used for modeling.

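#-*- coding: utf-8 -*-
# Normalize the data to [0, 1]
import pandas as pd

# Parameter initialization
filename = '../data/business_circle.xls' # raw data file
standardizedfile = '../tmp/standardized.xls' # where the normalized data is saved
# index_col makes the base station number ('基站编号') the index, so it is not part of data
data = pd.read_excel(filename, index_col = u'基站编号') # read the data

data = (data - data.min())/(data.max() - data.min()) # min-max (dispersion) normalization
data = data.reset_index()

# Save the result (writing .xls needs the xlwt engine, which newer pandas releases have dropped)
data.to_excel(standardizedfile, index = False)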

Normalized data:
[Table: normalized sample data]
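To see what the normalization does, here is a small self-contained sketch on made-up numbers. The English column names are hypothetical stand-ins for the indicators. The helper also guards against a constant column, which the one-line formula would turn into NaN because max - min is zero.

```python
import pandas as pd

def min_max_normalize(df):
    """Scale every column to [0, 1]; constant columns become 0."""
    span = df.max() - df.min()
    # Replace a zero span by 1 so constant columns map to 0 instead of NaN.
    span = span.replace(0, 1)
    return (df - df.min()) / span

# Made-up values for three hypothetical base stations
toy = pd.DataFrame(
    {
        "weekday_stay": [10.0, 20.0, 30.0],
        "daily_flow": [500.0, 1500.0, 1000.0],
        "constant": [7.0, 7.0, 7.0],
    },
    index=pd.Index([101, 102, 103], name="station"),
)

normalized = min_max_normalize(toy)
print(normalized)
```

After scaling, the smallest value in each varying column is 0 and the largest is 1, so no single indicator dominates the Euclidean distances used by the clustering.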

Model Construction - Hierarchical Clustering Algorithms

Hierarchical clustering

Hierarchical clustering methods decompose a given dataset hierarchically until a certain condition is met.

Once the distances between elements have been obtained, the elements can be linked together, and a structure can be built by splitting and merging. This structure is traditionally represented as a tree. A hierarchical clustering algorithm is either bottom-up (agglomerative), starting from the leaf nodes and finally converging at the root, or top-down (divisive), starting from the root and splitting recursively.

1. Agglomerative hierarchical clustering: the AGNES algorithm (bottom-up)

First treat each object as its own cluster, then merge these atomic clusters into larger and larger clusters until some termination condition is satisfied.

2. Divisive hierarchical clustering: the DIANA algorithm (top-down)

First place all objects in a single cluster, then gradually subdivide it into smaller and smaller clusters until some termination condition is reached.
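The bottom-up process can be observed directly in the linkage matrix that scipy returns. The sketch below uses made-up one-dimensional points and single linkage so that the merge distances are easy to read; the article's own code uses Ward linkage.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four made-up points on a line: two close pairs that are far from each other
points = np.array([[0.0], [1.0], [5.0], [6.5]])

# Each row of Z records one merge, from the closest pair up to the root
Z = linkage(points, method="single", metric="euclidean")

for left, right, dist, size in Z:
    print(f"merge {int(left)} and {int(right)} at distance {dist:.1f} "
          f"-> cluster of {int(size)} points")
```

Each row holds the two cluster indices being merged, the distance at which they are joined, and the size of the new cluster. Indices equal to or greater than the number of points refer to clusters created by earlier merges.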

After preprocessing, the modeling data is ready. Here a hierarchical clustering algorithm is applied to the base station modeling data, and a dendrogram (pedigree clustering diagram) is drawn. The code is as follows.

#-*- coding: utf-8 -*-
# Dendrogram (pedigree clustering diagram)
import pandas as pd

# Parameter initialization
standardizedfile = '../data/standardized.xls' # normalized data file
data = pd.read_excel(standardizedfile, index_col = u'基站编号') # read the data, indexed by base station number

import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (15.0, 4.0)
plt.rcParams['figure.dpi'] = 100

from scipy.cluster.hierarchy import linkage,dendrogram
# Use scipy's hierarchical clustering functions here

Z = linkage(data, method = 'ward', metric = 'euclidean') # compute the linkage matrix
P = dendrogram(Z, 0) # draw the dendrogram
plt.savefig(u'cluster.png')
plt.show()

Running the code gives the dendrogram shown in the figure.
[Figure: dendrogram of the base stations]

The figure suggests that 3 is a suitable number of clusters. The hierarchical clustering model can then be trained with that number. The code is as follows.

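#-*- coding: utf-8 -*-
# Hierarchical clustering
import pandas as pd

# Parameter initialization
standardizedfile = '../data/standardized.xls' # normalized data file
k = 3 # number of clusters
data = pd.read_excel(standardizedfile, index_col = u'基站编号') # read the data, indexed by base station number

from sklearn.cluster import AgglomerativeClustering # sklearn's hierarchical clustering class
model = AgglomerativeClustering(n_clusters = k, linkage = 'ward') # agglomerative hierarchical clustering
model.fit(data) # train the model

# Output the original data together with its cluster label
r = pd.concat([data, pd.Series(model.labels_, index = data.index)], axis = 1)  # attach each sample's cluster label
r.columns = list(data.columns) + [u'聚类类别'] # rename the header; '聚类类别' means "cluster category"
r.to_excel('../tmp/r.xls')

import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei'] # font that renders the Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False # render the minus sign correctly

style = ['ro-', 'go-', 'bo-']
# Tick labels: weekday per capita stay, early-morning per capita stay, weekend per capita stay, average daily flow
xlabels = [u'工作日人均停留时间', u'凌晨人均停留时间', u'周末人均停留时间', u'日均人流量']
pic_output = '../tmp/type_' # file name prefix for the cluster plots

for i in range(k): # plot each cluster in its own figure and style
  plt.figure()
  tmp = r[r[u'聚类类别'] == i].iloc[:,:4] # select the samples of cluster i
  for j in range(len(tmp)):
    plt.plot(range(1, 5), tmp.iloc[j], style[i])
  
  plt.xticks(range(1, 5), xlabels, rotation = 20) # axis tick labels
  plt.title(u'商圈类别%s' %(i+1)) # title "business district category N"; we count from 1
  plt.subplots_adjust(bottom=0.15) # adjust the bottom margin
  plt.savefig(u'%s%s.png' %(pic_output, i+1)) # save the figure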

In this section, two Python libraries, scipy and sklearn, are used for hierarchical clustering. Based on the linkage result from scipy, the number of clusters is set to 3. The hierarchical clustering model from sklearn then divides the data into three categories for analysis.
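As a side note, scipy can also assign the cluster labels itself by cutting the tree, which avoids fitting a second model. A minimal sketch on random stand-in data (not the base station data), using `fcluster` with `criterion="maxclust"`:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Three well-separated blobs of made-up 4-dimensional points, 10 points each
blobs = np.vstack([
    rng.normal(loc=center, scale=0.05, size=(10, 4))
    for center in (0.0, 0.5, 1.0)
])

Z = linkage(blobs, method="ward", metric="euclidean")
# Cut the tree so that at most 3 flat clusters remain; labels start at 1
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.unique(labels, return_counts=True))
```

Because both approaches use Ward linkage on Euclidean distances, the partition is expected to match the sklearn result up to label numbering, although this sketch does not verify that on the real data.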

Model analysis

For the clustering results, a line graph of the four features is drawn for each of the 3 categories:
[Figures: line graphs of the four indicators for business district categories 1, 2, and 3]

  1. Business district category 1: the per capita stay time on weekdays and in the early morning is very low, the per capita stay time on weekends is medium, and the average daily people flow is extremely high, which matches the characteristics of commercial areas.
  2. Business district category 2: the per capita stay time on weekdays is medium, the per capita stay time in the early morning and on weekends is very long, and the average daily people flow is low, which matches the characteristics of residential areas.
  3. Business district category 3: the per capita stay time on weekdays is long, the stay time in the early morning and on weekends is short, and the average daily people flow is moderate, which closely matches an office business district.
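The reading of the three categories above amounts to comparing the per-cluster averages of the four indicators. A hedged sketch of that comparison on a tiny made-up table follows; the English column names are stand-ins for the Chinese ones in `r`, and the values are invented to resemble the three profiles.

```python
import pandas as pd

# Made-up normalized indicators for six hypothetical base stations
r_toy = pd.DataFrame({
    "weekday_stay": [0.1, 0.2, 0.5, 0.4, 0.9, 0.8],
    "early_morning_stay": [0.1, 0.0, 0.9, 0.8, 0.1, 0.2],
    "weekend_stay": [0.5, 0.4, 0.9, 1.0, 0.1, 0.0],
    "daily_flow": [1.0, 0.9, 0.1, 0.2, 0.5, 0.4],
    "cluster": [0, 0, 1, 1, 2, 2],
})

# Average profile of each cluster, plus the number of stations in it
profile = r_toy.groupby("cluster").mean()
profile["size"] = r_toy.groupby("cluster").size()
print(profile.round(2))
```

A table like this makes the comparison explicit: the cluster with the highest daily flow reads as commercial, the one with the longest early-morning stay as residential, and the one with the longest weekday stay as office.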
