The 2021 Spring Festival is approaching. Last year, because of COVID-19, most people stayed at home and could not visit relatives or gather with friends. Thanks to a nationwide effort, the epidemic has been brought under effective control in China, and everyone is looking forward to going home for the holiday. In recent days, however, sporadic COVID-19 cases have appeared in several places. As the holiday draws nearer, people increasingly worry that the epidemic will rebound with the Spring Festival travel season and family visits. Population movement objectively raises both the risk of transmission and the difficulty of prevention and control. To track how people are moving and to support emergency epidemic prevention and control, it is particularly important to predict population density in key epidemic-related areas.
This case is built on Smartbi Mining, the data mining platform of Sematic software, and uses a regression algorithm (the gradient boosting regression tree configured in section 2.4) to predict population density in key areas. The goals are as follows:
(1) Use the historical population density of key areas to compute statistical features of the people flow index and the migration index;
(2) Build a model to predict future population density in key areas and track crowd-flow trends;
(3) Support emergency epidemic prevention and control in areas with high population density.
The overall process of population density prediction in key areas in this case is shown in Figure 1-1.
Figure 1-1
(1) Obtain the data, which comes from a competition on crowd density prediction in key areas;
(2) Apply basic processing to the data and compute grouped statistics on the people flow index and the migration index to serve as the model's input features;
(3) Build a population density prediction model for key areas from these statistical features;
(4) Evaluate the model results.
Implementation process
This case uses three data sets and takes the people-flow data from 20200117 to 20200215, during last year's epidemic, as its example. The fields of each data set are described below.
Table 2-1 The flow of people in key areas
Table 2-2 Key area information table
Table 2-3 Beijing Migration Index Table
Notes on the data:
● In the people flow table for key areas, the flow index is proportional to the number of people in an area during a given hour of a given day. The larger the flow index in area A, the more people are present in area A, and vice versa.
● In the Beijing Migration Index table, the migration index measures the number of people moving between Beijing and other cities on a given day. The greater the migration index from city A to city B, the more people migrate from city A to city B, and vice versa.
2.1. Data access
Add a data source node in the experiment, and read in the data from the above three tables. Part of the data is shown in Figure 2-1.
Figure 2-1
To make the fields easier to understand at a glance, use the metadata editing node to add Chinese aliases to the fields. The resulting output is shown in Figure 2-2, and the flowchart is shown in Figure 2-3.
Figure 2-2
Figure 2-3
2.2. Data exploration
Exploratory analysis in this case consists of missing value analysis and data distribution analysis. Inspecting the data shows that the date and time fields in the key area people flow table and the Beijing migration index table use inconsistent formats (Figure 2-4 and Figure 2-5), which would prevent the tables from being merged. The two tables therefore need a unified date and time format.
Figure 2-4 The flow of people in key areas
Figure 2-5 Beijing Migration Index Table
To view the numeric data across the entire data set, connect a full table statistics node and select all numeric fields, as shown in Figure 2-6. The output is shown in Figure 2-7: none of the fields has missing values.
Figure 2-6 Select all numeric fields
Figure 2-7 Data missing
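The missing-value check performed by the full table statistics node can be approximated outside the platform. The sketch below is a minimal standard-library illustration, not the platform's implementation; the field names (`flow_index`, `migration_index`) and the sample rows are hypothetical.

```python
import math

def missing_counts(rows, fields):
    """Count missing values (None, empty string, or NaN) per field."""
    counts = {field: 0 for field in fields}
    for row in rows:
        for field in fields:
            value = row.get(field)
            is_nan = isinstance(value, float) and math.isnan(value)
            if value is None or value == "" or is_nan:
                counts[field] += 1
    return counts

# Hypothetical sample rows; the real tables are not reproduced here.
rows = [
    {"flow_index": 3.2, "migration_index": 1.1},
    {"flow_index": 4.0, "migration_index": 0.9},
]
counts = missing_counts(rows, ["flow_index", "migration_index"])
print(counts)
```

A count of zero for every field corresponds to the "no missing values" result reported above.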
2.3. Data preprocessing
This case mainly uses two preprocessing methods: feature derivation and data transformation.
2.3.1. Feature Derivation
Data exploration showed that the date and time fields in the two tables use different formats, so the tables cannot be merged until the formats are unified. To do this, connect a derived column node to each table, extract the year, month, and day from the date and time field, and write them out in a common format. The derived column configuration is shown in Figure 2-8.
Figure 2-8 Derived column configuration
The output of the derived column node is shown in Figure 2-9.
Figure 2-9 The converted date and time format
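The date unification described above can be sketched in a few lines. The raw formats used here (`YYYYMMDDHH` for the people flow table and `YYYY-MM-DD` for the migration index table) are assumptions made for illustration, since the actual formats appear only in Figures 2-4 and 2-5; the function name is also hypothetical.

```python
from datetime import datetime

def split_date_hour(value, fmt):
    """Return (YYYYMMDD, hour) parsed from a timestamp string."""
    parsed = datetime.strptime(value, fmt)
    return parsed.strftime("%Y%m%d"), parsed.hour

# Assumed raw formats; adjust the format strings to match the real tables.
flow_day, flow_hour = split_date_hour("2020011708", "%Y%m%d%H")
migration_day, _ = split_date_hour("2020-01-17", "%Y-%m-%d")

# Both tables now share the same day key and can be joined on it.
print(flow_day, flow_hour, migration_day)
```

Parsing into a real date object before formatting, rather than slicing the string, has the side benefit of rejecting malformed values early.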
From the converted date field, a "weekday" field can be derived that records the day of the week for each date. Connect another derived column node, configured as shown in Figure 2-10.
Figure 2-10 Derived column configuration
The derived "weekday" field is shown in Figure 2-11.
Figure 2-11 The weekday field
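As an illustration of the weekday derivation, the sketch below maps a `YYYYMMDD` string to a day-of-week number. It uses the ISO convention (1 = Monday through 7 = Sunday), which is an assumption; the numbering produced by the platform's derived column node may differ.

```python
from datetime import datetime

def weekday_of(day):
    """Return the ISO weekday (1 = Monday ... 7 = Sunday) for a YYYYMMDD string."""
    return datetime.strptime(day, "%Y%m%d").isoweekday()

# First and last days of the period covered by the data.
first_day = weekday_of("20200117")
last_day = weekday_of("20200215")
print(first_day, last_day)
```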
2.3.2. Data transformation
Because the original people flow table for key areas only provides the historical hourly flow for each day from 20200117 to 20200215, features must be constructed before the target value can be predicted by regression. Specifically, the people flow index and the migration index are aggregated by date, hour, weekday, area, and area type, using statistics such as the minimum, maximum, mean, and sum.
Connect aggregation nodes that group by date, hour, weekday, area, and area type, and apply the Min, Max, Avg, and Sum operations to the flow index and the migration index, as shown in Figures 2-12, 2-13, 2-14, 2-15, and 2-16.
Figure 2-12 Aggregate people flow index by area
Figure 2-13 Aggregate people flow index by area type
Figure 2-14 Aggregate people flow index by hour
Figure 2-15 Aggregate people flow index by weekday
Figure 2-16 Aggregate migration index by date
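The grouped Min, Max, Avg, and Sum statistics described above can be sketched as follows. This is a minimal standard-library illustration of the idea, not the aggregation node itself; the field names (`area`, `flow_index`) and the sample rows are hypothetical.

```python
from collections import defaultdict

def aggregate(rows, key, value):
    """Group rows by `key` and compute min, max, avg, and sum of `value`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {
        group: {
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
            "sum": sum(values),
        }
        for group, values in groups.items()
    }

# Hypothetical sample rows for two areas.
rows = [
    {"area": "A", "flow_index": 2.0},
    {"area": "A", "flow_index": 4.0},
    {"area": "B", "flow_index": 1.0},
]
by_area = aggregate(rows, "area", "flow_index")
print(by_area["A"])
```

The same function can be called once per grouping key (hour, weekday, area type, date) to produce the other feature sets.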
Use the JOIN node to merge the aggregated features, then connect a full table statistics node to view the distribution of all feature fields, as shown in Figure 2-17.
Figure 2-17 Distribution of index values
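To make the merge step concrete, the sketch below attaches per-group statistics back onto each row by key. It assumes left-join semantics (every original row is kept), which may not match the JOIN node's configured join type; all names and values are hypothetical.

```python
def left_join(rows, features, key, prefix):
    """Attach per-group statistics to each row; rows without a match are kept unchanged."""
    joined = []
    for row in rows:
        merged = dict(row)
        for name, value in features.get(row[key], {}).items():
            merged[f"{prefix}_{name}"] = value
        joined.append(merged)
    return joined

# Hypothetical rows and pre-computed aggregate features.
rows = [{"area": "A", "hour": 8}, {"area": "B", "hour": 8}]
area_stats = {"A": {"avg": 3.0, "max": 4.0}, "B": {"avg": 1.0, "max": 1.0}}
hour_stats = {8: {"avg": 2.0, "max": 4.0}}

# Chain one join per feature set to build the final feature table.
features = left_join(rows, area_stats, "area", "area_flow")
features = left_join(features, hour_stats, "hour", "hour_flow")
print(features[0])
```

Prefixing the joined columns keeps statistics from different groupings distinguishable in the merged table.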
2.3.3. Flow chart of preprocessing
The entire preprocessing flow chart is shown in Figure 2-18.
Figure 2-18
2.4. Build a model
The model is a regression model; this case uses the gradient boosting regression tree algorithm. The overall experimental process is shown in Figure 2-19.
Figure 2-19 Population density regression prediction model
In the feature selection node, the feature columns are the aggregated features output by the data transformation step, as shown in Figure 2-20.
Figure 2-20 Feature selection feature column
The target column in the feature selection node is the flow index, as shown in Figure 2-21.
Figure 2-21 Select the target column
The split node uses the default parameter configuration, with a training set to test set ratio of 7:3.
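A 7:3 split like the one above can be sketched as follows. Whether the platform's split node shuffles the rows, and how it seeds the shuffle, is not stated in this case, so the shuffling and the `seed` parameter here are assumptions.

```python
import random

def train_test_split(rows, train_ratio=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Ten hypothetical row ids split 7:3.
train, test = train_test_split(range(10))
print(len(train), len(test))
```

Using a dedicated `random.Random` instance keeps the split reproducible without touching the global random state.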
The parameter configuration of the gradient boosting regression tree is shown in Figure 2-21.
Figure 2-21 Parameter configuration of gradient boosting regression tree
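For readers unfamiliar with the algorithm, the sketch below shows the core idea of gradient boosting for regression with squared-error loss: start from the mean, then repeatedly fit a small tree to the current residuals and add a damped version of it to the prediction. It is a toy, standard-library illustration with one feature and depth-1 trees (stumps), not the platform's implementation; the parameter names `n_trees` and `learning_rate` and the sample data are hypothetical and do not reflect the configuration in the figure.

```python
def fit_stump(xs, residuals):
    """Pick the single threshold on x that best fits the residuals (squared error)."""
    mean = sum(residuals) / len(residuals)
    best = (float("inf"), xs[0], mean, mean)  # fallback when no split is possible
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_value, right_value)

def fit_gbrt(xs, ys, n_trees=50, learning_rate=0.1):
    """Fit an additive model of stumps, each trained on the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # For squared-error loss, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, learning_rate, stumps

def predict(model, x):
    """Sum the base value and the damped contribution of every stump."""
    base, learning_rate, stumps = model
    return base + sum(
        learning_rate * (lv if x <= t else rv) for t, lv, rv in stumps
    )

# Hypothetical toy data: one feature (hour) and a made-up flow index.
hours = [0, 3, 6, 9, 12, 15, 18, 21]
flow = [1.0, 0.8, 1.5, 3.0, 4.2, 4.0, 3.5, 2.0]
model = fit_gbrt(hours, flow)
print(predict(model, 12))
```

A smaller learning rate generally needs more trees to reach the same fit but tends to generalize better, which is why the two parameters are usually tuned together.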
The output of the evaluation node is shown in Figure 2-22; the R² value is about 0.96.
Figure 2-22 Model evaluation results
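For reference, the R² (coefficient of determination) reported by the evaluation node compares the model's squared error with that of always predicting the mean. The sketch below computes it from scratch; the function name and the sample values are hypothetical and are not the case's actual results.

```python
def r2_score(actual, predicted):
    """R² = 1 - SS_res / SS_tot; 1.0 is a perfect fit, 0.0 matches the mean baseline."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical values for illustration only.
perfect = r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
baseline = r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
print(perfect, baseline)
```

An R² close to 1 means the model explains most of the variance of the flow index on the evaluated data.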
This case uses the prediction of population density in key areas during the epidemic to show how regression prediction analysis is applied in practice. It uses the historical population density of key areas to compute statistical features of the people flow index and the migration index, builds a model to predict future population density in key areas and track crowd-flow trends, and thereby supports emergency epidemic prevention and control in areas with high population density.