Epidemic prevention and control cannot slacken: using data mining to predict population density in key areas

The Spring Festival of 2021 is approaching. Last year, because of COVID-19, everyone had to stay at home and could not visit relatives or gather with friends. Thanks to the efforts of people across the country, the epidemic has been effectively controlled, and everyone is looking forward to going home for a good New Year. In recent days, however, sporadic COVID-19 cases have appeared in various places. As the holiday approaches, people are increasingly worried that the epidemic will rebound with the Spring Festival travel season and family visits. Population movement objectively increases both the risk of the epidemic spreading and the difficulty of prevention and control. To keep track of how people are moving and to prepare for emergency epidemic prevention and control, it is particularly important to predict population density in key areas related to the epidemic.

This case is based on Smartbi Mining, the data mining platform of Sematic Software, and uses a regression algorithm (the gradient boosting regression tree of section 2.4) to predict the population density in key areas. The goals are as follows:

(1) Using the historical population density in key areas, compute statistical features of the flow index and migration index;

(2) Build a model to predict the future population density in key areas and track the trend of crowd flow;

(3) Carry out emergency epidemic prevention and control in areas with high population density.

The overall process of population density prediction in key areas in this case is shown in Figure 1-1.

7ea126c1e92d1e6a11da8d048de58925.webp

Figure 1-1


(1) Obtain the data, which comes from the key-area crowd density prediction data of the competition;

(2) Perform basic processing on the acquired data and compute grouped statistics of the flow index and migration index to serve as the input features of the model;

(3) Build a population density prediction model for key areas from the statistical feature data;

(4) Evaluate the model results.

Implementation process

This case uses 3 data sets in total, taking the data on the flow of people from 20200117 to 20200215, during last year's epidemic, as an example. The fields of each data set are described below.

Table 2-1 The flow of people in key areas

1701e12a0efea85e393ed41f807b226f.webp

Table 2-2 Key area information table

7f05cd31edab35fdb71c35e4a5923d34.webp

Table 2-3 Beijing Migration Index Table

5ece93fe68e8b4424a8c33b00b2176d5.webp

Data description:

●In the key-area flow table, the flow index is proportional to the number of people in the area during a given hour of a given day. The larger the flow index of area A, the more people are present in area A, and vice versa.

●In the Beijing Migration Index table, the migration index reflects the number of people moving between Beijing and other cities on a given day. The greater the migration index from city A to city B, the more people migrate from city A to city B, and vice versa.

2.1. Data access

Add a data source node in the experiment, and read in the data from the above three tables. Part of the data is shown in Figure 2-1.

37fd0389bee564f7378aa4ad44060052.webp

Figure 2-1


To make the meaning of each field more intuitive, use the metadata editing node to add a Chinese alias to each field. The changed output is shown in Figure 2-2, and the flowchart is shown in Figure 2-3.

9f7d1ced2d1f2c6adcecf23d644bad40.webp

Figure 2-2


a586c8f9d3e9f18c900ff4b00d609daf.webp

Figure 2-3


2.2. Data exploration

The exploratory analysis in this case consists of missing-value analysis and data distribution analysis. Inspecting the data shows that the date and time fields in the key-area flow table and the Beijing migration index table use inconsistent formats (Figure 2-4 and Figure 2-5), which would prevent the tables from being merged. The date and time format of the two tables therefore needs to be unified.

5eec95c1f89006e8f0fc37ee9ca21a07.webp

Figure 2-4 The flow of people in key areas


a4c123b65b95385f726d75d6d53206b0.webp

Figure 2-5 Beijing Migration Index Table


To view the numerical data of the entire data set, connect a full table statistics node and select all the numerical fields, as shown in Figure 2-6. The output is shown in Figure 2-7; none of the fields has missing values.

0bfc380d5c84cc8d25fab8264fb796e9.webp

Figure 2-6 Select all numeric fields


89c1d39078b741a2f31f0bbfbda66e2e.webp

Figure 2-7 Missing-value statistics
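Outside the platform, the same missing-value check can be approximated in a few lines. The following is only a minimal sketch using the Python standard library; the field names `flow_index` and `migration_index` and the sample rows are hypothetical and do not reflect the real table schema.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def count_missing(rows: List[Row], fields: List[str]) -> Dict[str, int]:
    """Count, per field, how many rows have no value (None or absent key)."""
    return {f: sum(1 for r in rows if r.get(f) is None) for f in fields}


# Hypothetical sample rows; the field names are illustrative only.
sample = [
    {"flow_index": 3.2, "migration_index": 1.5},
    {"flow_index": 4.1, "migration_index": 1.7},
]
missing = count_missing(sample, ["flow_index", "migration_index"])
print(missing)
```

A count of zero for every field corresponds to the "no missing values" result reported by the full table statistics node.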


2.3. Data preprocessing

This case mainly uses feature derivation and data transformation preprocessing methods.

2.3.1. Feature Derivation

Data exploration showed that the date and time field formats in the two tables are inconsistent, so the tables cannot be merged until the formats are unified. The processing method is to connect a derived column node to each table, extract the year, month, and day from the date and time field, and bring the fields into one format. The derived column configuration is shown in Figure 2-8.

2f2c2369ee9b4468d6e9e23cb595ffd4.webp

Figure 2-8 Derived column configuration


The result of the derived column is shown in Figure 2-9.

23029d5747b61995c6613d08c6834be3.webp

Figure 2-9 The converted date and time format
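As a rough illustration of what the derived column does, the sketch below normalizes two date formats to a shared `yyyymmdd` key. The raw formats are assumptions made for this example (the actual formats appear only in Figures 2-4 and 2-5), and the function names are hypothetical.

```python
from datetime import datetime
from typing import Tuple

# Assumed raw formats (hypothetical; the real ones appear only in the figures):
#   flow table:      "2020011708"  -> yyyymmddHH
#   migration table: "2020-01-17"  -> yyyy-mm-dd


def flow_date_hour(raw: str) -> Tuple[str, int]:
    """Split a flow-table timestamp into a yyyymmdd date key and an hour."""
    parsed = datetime.strptime(raw, "%Y%m%d%H")
    return parsed.strftime("%Y%m%d"), parsed.hour


def migration_date(raw: str) -> str:
    """Convert a migration-table date into the same yyyymmdd date key."""
    return datetime.strptime(raw, "%Y-%m-%d").strftime("%Y%m%d")


print(flow_date_hour("2020011708"))
print(migration_date("2020-01-17"))
```

Once both tables carry the same date key, they can be joined on it.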


From the converted date, a "weekday" field can be derived, indicating which day of the week the date falls on. Connect another derived column node, configured as shown in Figure 2-10.

9e0a3822449d2dd1a18aa75d5350392d.webp

Figure 2-10 Derived column configuration


The derived "weekday" field is shown in Figure 2-11.

b64c329153fbe269fef13314c44674ac.webp

Figure 2-11 The weekday field
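The weekday derivation can be sketched as follows. This assumes a `yyyymmdd` date string and the ISO numbering (Monday = 1 through Sunday = 7); the numbering the platform actually uses may differ, and `weekday_of` is a hypothetical helper name.

```python
from datetime import datetime


def weekday_of(date_key: str) -> int:
    """Return the ISO weekday (Monday=1 ... Sunday=7) of a yyyymmdd string."""
    return datetime.strptime(date_key, "%Y%m%d").isoweekday()


# First and last day of the period covered by the data set.
print(weekday_of("20200117"))  # Friday -> 5
print(weekday_of("20200215"))  # Saturday -> 6
```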


2.3.2. Data transformation

The original key-area flow table only provides the historical hourly flow of people for each day from 20200117 to 20200215, so features must be constructed before the target value can be predicted by regression. The construction method is to compute aggregate statistics (minimum, maximum, mean, sum, and so on) of the flow index and migration index, grouped by date, hour, weekday, area, and area type.

Connect aggregation nodes that group by date, hour, weekday, area, and area type, and apply the Min, Max, Avg, and Sum operations to the flow index and migration index, as shown in Figures 2-12 to 2-16.

bdb67a8026188b38e8e19a773e356068.webp

Figure 2-12 Aggregate flow index by area


31cdd423fb0c7fd79f682cf2caaeb293.webp

Figure 2-13 Aggregate flow index by area type


1eff6921093a7614456153baa14a02c4.webp

Figure 2-14 Aggregate flow index by hour


d98fced18e81e415713bc2367be4668e.webp

Figure 2-15 Aggregate flow index by weekday


81182a3a17832e56b9ef2d0a68d626fe.webp

Figure 2-16 Aggregate migration index by date


Use the JOIN node to merge the aggregated features, and then access the full table statistics node to view the distribution of all feature fields, as shown in Figure 2-17.

fe6ff787a38245045fe4080799dd5241.webp

Figure 2-17 Distribution of index values
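The group-then-join pattern used by the aggregation and JOIN nodes can be sketched in plain Python. This is an illustrative approximation, not the platform's implementation; the helper names `aggregate` and `attach`, the column names, and the sample rows are all hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, List

Row = Dict[str, Any]


def aggregate(rows: List[Row], key: str, value: str) -> Dict[Any, Dict[str, float]]:
    """Group rows by `key` and compute min/max/avg/sum of `value` per group."""
    groups: Dict[Any, List[float]] = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {
        k: {"min": min(v), "max": max(v), "avg": mean(v), "sum": sum(v)}
        for k, v in groups.items()
    }


def attach(rows: List[Row], key: str, stats: Dict[Any, Dict[str, float]],
           prefix: str) -> List[Row]:
    """Left-join group statistics back onto each row as prefixed columns."""
    joined = []
    for row in rows:
        merged = dict(row)
        for name, val in stats.get(row[key], {}).items():
            merged[f"{prefix}_{name}"] = val
        joined.append(merged)
    return joined


# Hypothetical rows from the key-area flow table.
flow_rows = [
    {"area": "A", "hour": 8, "flow_index": 2.0},
    {"area": "A", "hour": 9, "flow_index": 4.0},
    {"area": "B", "hour": 8, "flow_index": 6.0},
]
by_area = aggregate(flow_rows, "area", "flow_index")
by_hour = aggregate(flow_rows, "hour", "flow_index")
features = attach(attach(flow_rows, "area", by_area, "area"), "hour", by_hour, "hour")
print(features[0])
```

Each row ends up carrying the statistics of its own area and hour, which is the shape of feature table the model step consumes.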


2.3.3. Flow chart of preprocessing

The entire preprocessing flow chart is shown in Figure 2-18.

c874681870934a2a9827406b56071da3.webp

Figure 2-18


2.4. Build a model

We use a regression algorithm, specifically the gradient boosting regression tree algorithm. The overall experimental process is shown in Figure 2-19.

8928bfbc2c36ba5d26f594398b8c21e7.webp

Figure 2-19 Population density regression prediction model


In the feature selection node, the feature columns are the aggregated features output by the data transformation step, as shown in Figure 2-20.

bb16cafc663f3ae4ccbb173f813c21c0.webp

Figure 2-20 Feature selection feature column


The target column of the feature selection node is the flow index, as shown in Figure 2-21.

412f12dd3bbcc694c0876b5e7f2ef978.webp

Figure 2-21 Select the target column


The split node uses the default parameter configuration, with a 7:3 ratio of training set to test set.

The parameter configuration of the gradient boosting regression tree is shown in Figure 2-22.

2ee95c0c8e54b18ac793e6aaa9154bde.webp

Figure 2-22 Parameter configuration of gradient boosting regression tree


The output of the evaluation node is shown in Figure 2-23; R2 is about 0.96.

98127b72331d4a7c0314873a86fde9d0.webp

Figure 2-23 Model evaluation results
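To make the modelling step more concrete, here is a toy sketch of the same workflow: a 7:3 split, a gradient boosting regressor, and an R2 evaluation. It is not the platform's algorithm or configuration. It uses depth-1 trees (stumps) with squared-error loss, synthetic data in place of the real aggregated features, and parameter values (`rounds`, `learning_rate`, the random seed) chosen only for illustration.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Stump = Tuple[int, float, float, float]  # feature, threshold, left, right


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    avg = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - avg) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_stump(X: List[Vector], residuals: List[float]) -> Stump:
    """Find the single feature/threshold split that minimises squared error."""
    best = None
    for j in range(len(X[0])):
        # Dropping the largest value keeps the right-hand side non-empty.
        for thr in sorted({row[j] for row in X})[:-1]:
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lv, rv = mean(left), mean(right)
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, thr, lv, rv)
    if best is None:
        raise ValueError("every feature is constant; no split is possible")
    return best[1:]


def fit_gbrt(X: List[Vector], y: List[float], rounds: int = 200,
             learning_rate: float = 0.1) -> Callable[[Vector], float]:
    """Boost stumps on the residuals of the running prediction."""
    base = mean(y)
    pred = [base] * len(y)
    stumps: List[Stump] = []
    for _ in range(rounds):
        residuals = [t - p for t, p in zip(y, pred)]
        j, thr, lv, rv = fit_stump(X, residuals)
        stumps.append((j, thr, lv, rv))
        pred = [p + learning_rate * (lv if row[j] <= thr else rv)
                for row, p in zip(X, pred)]

    def predict(row: Vector) -> float:
        return base + sum(learning_rate * (lv if row[j] <= thr else rv)
                          for j, thr, lv, rv in stumps)

    return predict


# Synthetic stand-in for the aggregated features and the flow index target.
X = [[i / 10] for i in range(100)]
y = [2 * row[0] + 1 for row in X]

# 7:3 split of shuffled row indices into training and test sets.
order = list(range(len(X)))
random.Random(0).shuffle(order)
cut = len(order) * 7 // 10
train, test = order[:cut], order[cut:]

model = fit_gbrt([X[i] for i in train], [y[i] for i in train])
score = r2_score([y[i] for i in test], [model(X[i]) for i in test])
print(f"test R2 = {score:.3f}")
```

An R2 close to 1 means the predictions explain most of the variance in the held-out target, which is how the value of about 0.96 reported by the evaluation node should be read.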


This case uses the prediction of population density in key areas during the epidemic to show how regression prediction analysis applies to a real problem. It uses the historical population density of key areas to compute statistical features of the flow index and migration index, builds a model to predict the future population density in key areas and track the trend of crowd flow, and supports emergency epidemic prevention and control in areas with high population density.


Origin blog.51cto.com/15047075/2594963