[PyTorch deep learning] PVCGN: clustering, visualizing, and predicting the inflow and outflow of private cars (with source code and data set)

If you need the source code and data set, please like, follow, bookmark, and leave a private message in the comment area~~~

1. Data preprocessing

Raw private car trajectory data suffers from a series of data quality problems, such as missing and redundant records. Before the mobile trajectory data can be analyzed and mined, the original data should be effectively preprocessed according to the application scenario and research objectives. This section introduces the preprocessing of private car travel data and selects 7 main fields: ObjectID, StartTime, StartLon, StartLat, StopTime, StopLon, StopLat

Because the vehicle collects data dynamically in real time and the driving environment (traffic conditions, road networks, etc.) is complex, the positioning device can produce position data that deviates from the true value, for example when the vehicle drives near tall buildings that block the signal or cause strong electromagnetic interference, or when a faulty positioning device is not repaired in time

The preprocessing of the private car trajectory data consists of three steps: removing records with missing values in the main fields, removing records with identical trip start times, and removing records whose start and end points are less than 3 m apart

First, remove records whose main fields are 0. Second, trips shorter than 1 minute are marked as invalid records and deleted. Finally, delete trips whose start and end positions are too close together: the distance between the two points is computed with the Haversine formula, the threshold is set to 3 meters, and records whose start and end positions are less than 3 meters apart are deleted
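A minimal sketch of these cleaning rules with pandas is given below; the file name trips.csv, the haversine helper, and the exact column handling are illustrative assumptions, not the original code.

import numpy as np
import pandas as pd

def haversine(lon1, lat1, lon2, lat2):
    # great-circle distance in meters between two (lon, lat) points
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * np.arcsin(np.sqrt(a))

fields = ['ObjectID', 'StartTime', 'StartLon', 'StartLat', 'StopTime', 'StopLon', 'StopLat']
df = pd.read_csv('trips.csv', usecols=fields)           # hypothetical file name

# 1. drop records whose main fields are missing or zero
df = df.dropna(subset=fields)
df = df[(df[['StartLon', 'StartLat', 'StopLon', 'StopLat']] != 0).all(axis=1)]

# 2. drop trips shorter than 1 minute
start = pd.to_datetime(df['StartTime'])
stop = pd.to_datetime(df['StopTime'])
df = df[(stop - start).dt.total_seconds() >= 60]

# 3. drop trips whose start and end points are less than 3 m apart
dist = haversine(df['StartLon'], df['StartLat'], df['StopLon'], df['StopLat'])
df = df[dist >= 3].reset_index(drop=True)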

2. Problem Statement and Model Framework

Based on the time-series CSV files of private car traffic obtained above, and taking an open-source deep learning forecasting model as the example, this section walks through a practical PyTorch application for private car traffic forecasting. The city is divided into multiple sub-regions based on the private car trajectory data, and the historical private car flow in each sub-region is counted. Taking the Physical-Virtual Collaboration Graph Network (PVCGN) model as an example, the inflow and outflow of private cars are predicted for several consecutive future time slices

PVCGN model

As shown in the left figure below, the core component of the model is a collaborative gated recurrent network. Multi-view graph structures are built from historical data, and these graphs are fed into graph-convolution gated recurrent units (GC-GRU) for spatiotemporal representation learning. In addition, fully connected gated recurrent units (FC-GRU) are applied to capture the global traffic evolution trend. As shown on the right, the GC-GRU and FC-GRU are combined into the Collaborative Gated Recurrent Module (CGRM) to predict future traffic

 

3. Data preparation 

This practice uses the Shenzhen private car trajectory data open-sourced by Hunan University. The time span is September 1, 2018 to September 15, 2018, with about 211,000 records in total. The data set contains 7 fields: the desensitized unique vehicle ID, departure time, departure longitude, departure latitude, arrival time, arrival longitude, and arrival latitude. Each record represents one private car trip, and data examples are shown in the table below

The specific fields of the data are described in the following table

4. Data Modeling

Trajectory clustering

This step is a simple unsupervised learning task. The K-means clustering algorithm is first used to cluster the private car trajectory data. The purpose of clustering is to label the places frequented by private car users and group them into location clusters, thereby dividing the city into sub-regions. The spatial division of urban areas can be achieved in many ways; to simplify the process, the goal here is simply to obtain a region label for each location

Because of the randomness of the clustering algorithm, the departure points and arrival points are clustered together rather than separately. To support the later construction of the similarity, correlation, and distance graphs, the latitude and longitude of each cluster center and the cluster label of each point are output during the computation. The K value is set to 80

Use the read_csv function of the pandas library to load the CSV file, then use the concat function to stack the departure and arrival points together. Use the KMeans class from the sklearn.cluster module to construct an estimator, and read its labels_ and cluster_centers_ attributes to obtain the cluster labels and cluster centers. Use the scatter function of the matplotlib.pyplot module to draw the clustering result as a scatter plot, where points of the same color belong to the same cluster
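A short sketch of this clustering step is shown below; the file name and column names are assumptions for illustration, while KMeans, labels_, and cluster_centers_ are the scikit-learn API mentioned above.

import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

df = pd.read_csv('trips_clean.csv')                      # hypothetical cleaned file

# stack departure and arrival points so both are clustered together
starts = df[['StartLon', 'StartLat']].rename(columns={'StartLon': 'lon', 'StartLat': 'lat'})
stops = df[['StopLon', 'StopLat']].rename(columns={'StopLon': 'lon', 'StopLat': 'lat'})
points = pd.concat([starts, stops], ignore_index=True)

clusterer = KMeans(n_clusters=80, random_state=0).fit(points[['lon', 'lat']])
labels = clusterer.labels_                 # cluster label of every point
centers = clusterer.cluster_centers_       # 80 cluster centers (lon, lat)

plt.scatter(points['lon'], points['lat'], c=labels, s=1, cmap='tab20')
plt.scatter(centers[:, 0], centers[:, 1], c='black', marker='x')
plt.show()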

Trajectory clustering results

The number of sample points contained in each cluster

 cluster center point

Time and space distribution statistics of traffic flow

A statistical algorithm is designed to count the traffic flow of the location clusters obtained above; the city is divided into 80 sub-regions in total. According to the time span of the data set, with the time slice set to 1 hour, a day is divided into 24 time slices. The algorithm records each departure as one vehicle outflow and each arrival as one private car inflow: if cluster m contains n departure (or arrival) records within a time slice, the outflow (or inflow) of region m in that period is n. The algorithm uses the count function in each loop iteration to extract the spatiotemporal distribution of private car traffic

For convenience, the clustered data set is first sorted in chronological order. The strftime function of the time module formats a date into a readable string, and the strptime function parses it back into a time tuple so that time offsets can be added. The data set is then traversed: a list infos is maintained for each one-hour window, and append stores the cluster labels of all records falling in that hour. For each distinct value in infos, the count function counts its occurrences; when the hour is finished, clear empties infos and the statistics for that window are complete
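A condensed sketch of this hourly counting loop is given below; the labeled file and the StartCluster column (the cluster label of the departure point) are hypothetical names, and the inflow matrix is built the same way from the arrival time and arrival cluster.

import numpy as np
import pandas as pd

df = pd.read_csv('trips_labeled.csv')                    # hypothetical file with cluster labels
df = df.sort_values('StartTime')                         # sort in chronological order first

num_regions, num_slices = 80, 15 * 24                    # 80 clusters, 15 days of hourly slices
outflow = np.zeros((num_slices, num_regions), dtype=int)

t0 = pd.to_datetime('2018-09-01 00:00:00')
infos = []
current_slice = 0
for _, row in df.iterrows():
    slice_idx = int((pd.to_datetime(row['StartTime']) - t0).total_seconds() // 3600)
    if slice_idx != current_slice:
        # one hour finished: count how often each label occurs, then clear the list
        for label in set(infos):
            outflow[current_slice, label] = infos.count(label)
        infos.clear()
        current_slice = slice_idx
    infos.append(row['StartCluster'])                    # cluster label of the departure point
# flush the last hour
for label in set(infos):
    outflow[current_slice, label] = infos.count(label)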

Statistical results of the spatio-temporal distribution of traffic flow: the inflow of private cars in 80 areas per hour

From September 1, 2018 to September 15, 2018, there were 210,942 vehicle inflows and 210,980 vehicle outflows across the 80 regions, an average of 14,062 inflows/day and 14,065 outflows/day. The daily inflow and daily outflow are counted separately and drawn as a histogram. The figure below shows that traffic on weekdays is relatively stable and generally higher than on rest days, which matches the travel patterns of urban residents

 Statistical Summary of Time and Space Distribution of Traffic Flow

This section selects the data from September 1 to September 5 and further counts private car traffic in different periods of the day. As shown in the figure above, traffic from 0:00 to 6:00 trends downward, with few private car trips, reaching the trough of the distribution between 4:00 and 6:00. Traffic rises sharply from 6:00 to 10:00, and 8:00-10:00 is the morning peak, when urban residents travel mainly for work and school. Traffic from 10:00 to 12:00 is relatively stable, with small fluctuations. 12:00-14:00 is the residents' lunch break, and traffic drops slightly in this period. Traffic rises gently from 14:00 to 16:00 and reaches the daily peak between 16:00 and 18:00, the evening peak, when residents leave work and travel for shopping and other activities. After 18:00, private car trips show an overall downward trend over time

Construction of multi-view space-time graph 

In this section, we design algorithms to construct a distance graph, a similarity graph, and a correlation graph, and use the multiple graphs to extract multi-view spatiotemporal associations between regions. Each graph has the same 80 nodes, representing the 80 cluster centers, while the edges of each graph are defined differently. First, a distance matrix P is constructed, where P(i, j) is the actual distance from cluster center i to cluster center j, i.e. the spherical distance computed from the latitude and longitude values with the Haversine formula. The algorithm only computes the values above the main diagonal, sets the diagonal values P(i, i) to 0, and fills the remaining entries by symmetry, which avoids repeated computation and saves time.
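A sketch of this symmetric distance matrix is shown below; it assumes the haversine helper from the preprocessing sketch and the centers array from the clustering step.

import numpy as np
# centers: (80, 2) array of cluster centers (lon, lat) from the clustering step

num_nodes = len(centers)
P = np.zeros((num_nodes, num_nodes))
for i in range(num_nodes):
    for j in range(i + 1, num_nodes):                    # only compute values above the main diagonal
        P[i, j] = haversine(centers[i, 0], centers[i, 1], centers[j, 0], centers[j, 1])
        P[j, i] = P[i, j]                                # fill the rest by symmetry
# P[i, i] stays 0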

An example of constructing an undirected distance graph from a multi-view space-time graph

To construct the similarity matrix S, first compute the mean of P(i, j) (0 ≤ i, j ≤ 79), i.e. the mean actual distance between nodes, distance_mean. If P(i, j) is greater than distance_mean, S(i, j) is set to 1; otherwise S(i, j) is 0

Then construct the correlation matrix C, where C(i, j) is the total number of trips from cluster center i to cluster center j over the entire data set, describing the dynamic interaction between the two regions. C(i, i) also has to be computed, because some private cars start and end trips in the same area. A two-dimensional array trans_matrix[i][j] is defined to accumulate the counts

Example of building a directed association graph from a multi-view space-time graph

Use the dump function in the pickle library to encapsulate the matrices P, S, and C into graph_sz_conn.pkl, graph_sz_sml.pkl, and graph_sz_cor.pkl, respectively, representing the distance graph, similarity graph, and correlation graph
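Assuming the labeled trip table from earlier (with hypothetical StartCluster and StopCluster columns) and the distance matrix P above, the similarity and correlation matrices and the three pkl files could be produced roughly as follows.

import pickle
import numpy as np

num_nodes = P.shape[0]

# similarity matrix S: threshold P against the mean pairwise distance, as described above
distance_mean = P.mean()
S = (P > distance_mean).astype(float)

# correlation matrix C (trans_matrix in the text): count trips from cluster i to cluster j
C = np.zeros((num_nodes, num_nodes))
for _, row in df.iterrows():
    C[int(row['StartCluster']), int(row['StopCluster'])] += 1   # C[i, i] is accumulated as well

# dump the three graphs into the pkl files the model expects
for name, mat in [('graph_sz_conn.pkl', P), ('graph_sz_sml.pkl', S), ('graph_sz_cor.pkl', C)]:
    with open(name, 'wb') as f:
        pickle.dump(mat, f)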

Data format conversion

The input of the prediction model consists of 3 pkl files: the training set, the validation set, and the test set, all of which store the private car inflow and outflow data. This section divides the private car traffic data (2018/9/1-2018/9/15) into three parts: the training set (2018/9/1-2018/9/10), the validation set (2018/9/11-2018/9/12), and the test set (2018/9/13-2018/9/15). Following the model's input data format, an algorithm is designed to store the three data sets in train.pkl, val.pkl, and test.pkl respectively; each pkl file is a dictionary containing four multidimensional arrays. Taking train.pkl as an example, there are 233 sample groups in total, each containing 4 input time slices. The private car traffic of the previous 4 time slices (1 h * 4 = 4 h, i.e. the inflow and outflow of private cars in the 80 regions) is used to predict the private car traffic of the next 4 time slices, that is, x_train is used to predict y_train, for example 0:00-04:00 → 04:00-08:00
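A rough sketch of this sliding-window conversion is given below; the flow array and the dictionary keys are assumptions about the expected format, not the exact fields used by the released code.

import pickle
import numpy as np

# flow: array of shape (num_slices, 80, 2) with hourly inflow/outflow per region from the statistics step
window = 4                                               # 4 input slices predict 4 output slices

def make_samples(part):
    x, y = [], []
    for t in range(len(part) - 2 * window + 1):
        x.append(part[t:t + window])                     # e.g. 0:00-04:00
        y.append(part[t + window:t + 2 * window])        # e.g. 04:00-08:00
    return np.asarray(x), np.asarray(y)

splits = {'train.pkl': flow[:240],                       # 2018/9/1-9/10, 10 days * 24 h = 240 slices
          'val.pkl':   flow[240:288],                    # 2018/9/11-9/12
          'test.pkl':  flow[288:]}                       # 2018/9/13-9/15
for name, part in splits.items():
    x, y = make_samples(part)                            # train.pkl yields 240 - 8 + 1 = 233 groups
    with open(name, 'wb') as f:
        pickle.dump({'x': x, 'y': y}, f)                 # time-stamp arrays can be added the same way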

Data format conversion: pkl file field description

Data format conversion: parameter description

 Data format conversion: pkl file specific information

PVCGN model

The CGRM module and the Seq2Seq framework are used to build the PVCGN model, which predicts the private car flow of every region in each future time slice. PVCGN consists of an encoder and a decoder, each containing two CGRM modules. In the encoder, the flow data is fed into the lower CGRM module in sequence to accumulate historical information, and the hidden state it outputs is fed into the upper CGRM module for high-level feature learning. In the decoder, the input is set to 0 for the first iteration, and the final hidden state of the encoder initializes the hidden state of the decoder. The hidden state output by the upper CGRM module is then passed through a fully connected layer to predict the future traffic flow. In each subsequent iteration, the lower CGRM module takes the prediction from the previous iteration as input, and the upper CGRM module again predicts through the fully connected layer, finally producing the sequence of future predictions
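The encoder-decoder wiring described above can be illustrated with the following simplified sketch, in which a plain nn.GRUCell stands in for the CGRM module; it only shows the Seq2Seq structure (two stacked recurrent layers, a zero first decoder input, hidden-state handover, and a fully connected output) and is not the PVCGN implementation.

import torch
from torch import nn

class Seq2SeqSketch(nn.Module):
    def __init__(self, num_nodes=80, in_dim=2, hidden=32, horizon=4):
        super().__init__()
        self.horizon = horizon
        self.num_nodes, self.in_dim = num_nodes, in_dim
        # two stacked recurrent units in both encoder and decoder (CGRM stand-ins)
        self.enc1 = nn.GRUCell(num_nodes * in_dim, hidden)
        self.enc2 = nn.GRUCell(hidden, hidden)
        self.dec1 = nn.GRUCell(num_nodes * in_dim, hidden)
        self.dec2 = nn.GRUCell(hidden, hidden)
        self.fc = nn.Linear(hidden, num_nodes * in_dim)  # predicts one future time slice

    def forward(self, x):                                # x: (batch, T, num_nodes, in_dim)
        b, T = x.shape[0], x.shape[1]
        h1 = h2 = x.new_zeros(b, self.enc1.hidden_size)
        for t in range(T):                               # encoder accumulates historical information
            h1 = self.enc1(x[:, t].reshape(b, -1), h1)
            h2 = self.enc2(h1, h2)
        y_prev = x.new_zeros(b, self.num_nodes * self.in_dim)   # first decoder input is 0
        outputs = []
        for _ in range(self.horizon):                    # decoder feeds back its own predictions
            h1 = self.dec1(y_prev, h1)                   # hidden states carried over from the encoder
            h2 = self.dec2(h1, h2)
            y_prev = self.fc(h2)
            outputs.append(y_prev.view(b, self.num_nodes, self.in_dim))
        return torch.stack(outputs, dim=1)               # (batch, horizon, num_nodes, in_dim)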

5. Model training and test results

The 6 packaged pkl files are fed into the PVCGN deep learning model; the model.train() statement enables Batch Normalization and Dropout before training starts. To make model loading easier and save time, the state_dict dictionary object and torch.save are used to save the model parameters after each training epoch. In the test part, model.eval() switches from training mode to evaluation mode, and the with torch.no_grad() statement disables gradient tracking to speed up inference and save GPU memory. The offline training and testing code ggnn_train.py for PVCGN has been open-sourced on GitHub

Models were evaluated using root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). All three metrics describe the error between the predicted and true values; the smaller the value, the more accurate the model.

The experimental evaluation uses the scaler's inverse_transform function to restore the standardized data to the original scale, and then a custom function computes each evaluation metric
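A minimal version of these three metrics might look like the following; the released code uses masked variants that additionally ignore zero-valued entries.

import numpy as np

def evaluate_errors(y_pred, y_true):
    # root mean square error, mean absolute error, mean absolute percentage error
    rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
    mae = np.mean(np.abs(y_pred - y_true))
    mask = y_true != 0                                   # avoid division by zero in MAPE
    mape = np.mean(np.abs((y_pred[mask] - y_true[mask]) / y_true[mask]))
    return rmse, mae, mape

# restore the standardized predictions to the original scale before computing the errors
# y_pred = scaler.inverse_transform(y_pred_standardized)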

In the prediction part, the operating system of the experimental machine is Ubuntu 18.04 and the GPU is an NVIDIA RTX 2080 Ti. The software used and its version numbers are shown in the figure below

The parameter settings mainly consider factors such as data size and computing power. Batch Size is the number of training samples in each batch; to balance memory efficiency and memory capacity, this chapter sets the Batch Size to 32 and the number of epochs to 200. As the number of epochs increases, the number of update iterations of the neural network grows, and the model gradually moves from its initial underfitting state toward the optimal fit.

The optimal model is determined by tuning the hyperparameter rnn_units (the number of hidden-layer units), recording the influence of the number of hidden units on MAE, MAPE, and RMSE during training. The experimental results are shown in the figure below

As shown in the figure below, when the number of hidden-layer units is 32, both the MAE and MAPE curves are at their trough, i.e. they reach their minimum values

Although RMSE is lowest when the number of hidden-layer units is 16, the corresponding MAE and MAPE values are both higher. Therefore the optimal number of hidden-layer units is 32, where the prediction ability of the model is best; the optimal prediction result is reached at epoch 83.

 

6. Code

Finally, the code is attached below

ggnn_train.py

import random
import argparse
import time
import yaml
import numpy as np
import torch
import os

from torch import nn
from torch.nn.utils import clip_grad_norm_
from torch import optim
from torch.optim.lr_scheduler import MultiStepLR
from torch.nn.init import xavier_uniform_
from lib import utils
from lib import metrics
from lib.utils import collate_wrapper
from ggnn.multigraph import Net

import torch.backends.cudnn as cudnn


try:
    from yaml import CLoader as Loader, CDumper as Dumper
except ImportError:
    from yaml import Loader, Dumper

# fix the random seeds so runs are reproducible
seed = 0  # example seed value
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
cuda = True
cudnn.benchmark = True

# Read the configuration file
def read_cfg_file(filename):
    with open(filename, 'r') as ymlfile:
        cfg = yaml.load(ymlfile, Loader=Loader)
    return cfg

torch.cuda.empty_cache()

def run_model(model, data_iterator, edge_index, edge_attr, device, output_dim):
    """
    return a list of (horizon_i, batch_size, num_nodes, output_dim)
    """
    # while evaluation, we need model.eval and torch.no_grad
    model.eval()
    y_pred_list = []
    for _, (x, y, xtime, ytime) in enumerate(data_iterator):
        y = y[..., :output_dim]
        sequences, y = collate_wrapper(x=x, y=y,
                                       edge_index=edge_index,
                                       edge_attr=edge_attr,
                                       device=device)
        # (T, N, num_nodes, num_out_channels)
        with torch.no_grad():
            y_pred = model(sequences)
            y_pred_list.append(y_pred.cpu().numpy())
    return y_pred_list


# Evaluate model performance
def evaluate(model,
             dataset,
             dataset_type,
             edge_index,
             edge_attr,
             device,
             output_dim,
             logger,
             detail=True,
             cfg=None,
             format_result=False):
    if detail:
        logger.info('Evaluation_{}_Begin:'.format(dataset_type))
    scaler = dataset['scaler']
    y_preds = run_model(
        model,
        data_iterator=dataset['{}_loader'.format(dataset_type)].get_iterator(),
        edge_index=edge_index,
        edge_attr=edge_attr,
        device=device,
        output_dim=output_dim)

    y_preds = np.concatenate(y_preds, axis=0)  # concat in batch_size dim.
    mae_list = []
    mape_list = []
    rmse_list = []
    mae_sum = 0
    mape_sum = 0
    rmse_sum = 0
    # horizon = dataset['y_{}'.format(dataset_type)].shape[1]
    horizon = cfg['model']['horizon']
    for horizon_i in range(horizon):
        y_truth = scaler.inverse_transform(
            dataset['y_{}'.format(dataset_type)][:, horizon_i, :, :output_dim])

        y_pred = scaler.inverse_transform(
            y_preds[:y_truth.shape[0], horizon_i, :, :output_dim])
        mae = metrics.masked_mae_np(y_pred, y_truth, null_val=0, mode='dcrnn')
        mape = metrics.masked_mape_np(y_pred, y_truth, null_val=0)
        rmse = metrics.masked_rmse_np(y_pred, y_truth, null_val=0)
        mae_sum += mae
        mape_sum += mape
        rmse_sum += rmse
        mae_list.append(mae)
        mape_list.append(mape)
        rmse_list.append(rmse)
        msg = "Horizon {:02d}, MAE: {:.2f}, MAPE: {:.4f}, RMSE: {:.2f}"
        if detail:
            logger.info(msg.format(horizon_i + 1, mae, mape, rmse))
    if detail:
        logger.info('Evaluation_{}_End:'.format(dataset_type))
    if format_result:
        for i in range(len(mape_list)):
            print('{:.2f}'.format(mae_list[i]))
            print('{:.2f}%'.format(mape_list[i] * 100))
            print('{:.2f}'.format(rmse_list[i]))
            print()
    else:
        return mae_sum / horizon, mape_sum / horizon, rmse_sum / horizon


class StepLR2(MultiStepLR):
    """StepLR with min_lr"""

    def __init__(self,
                 optimizer,
                 milestones,
                 gamma=0.1,
                 last_epoch=-1,
                 min_lr=2.0e-6):
        """

        :optimizer: TODO
        :milestones: TODO
        :gamma: TODO
        :last_epoch: TODO
        :min_lr: TODO

        """
        self.optimizer = optimizer
        self.milestones = milestones
        self.gamma = gamma
        self.last_epoch = last_epoch
        self.min_lr = min_lr
        super(StepLR2, self).__init__(optimizer, milestones, gamma)

    def get_lr(self):
        lr_candidate = super(StepLR2, self).get_lr()
        if isinstance(lr_candidate, list):
            for i in range(len(lr_candidate)):
                lr_candidate[i] = max(self.min_lr, lr_candidate[i])

        else:
            lr_candidate = max(self.min_lr, lr_candidate)

        return lr_candidate


def adjacency_to_edge_index(A):
    node_x, node_y = A.nonzero()
    return np.asarray(list(zip(node_x, node_y))).transpose(1, 0)


def adjacency_to_edge_weight(A):
    sources, targets = A.nonzero()
    assert (len(sources) == len(targets))
    edge_weight = []
    for i in range(len(sources)):
        w = A[sources[i], targets[i]]
        edge_weight.append(w)
    return np.asarray(edge_weight)


def _get_log_dir(kwargs):
    log_dir = kwargs['train'].get('log_dir')
    if log_dir is None:
        batch_size = kwargs['data'].get('batch_size')
        learning_rate = kwargs['train'].get('base_lr')
        num_rnn_layers = kwargs['model'].get('num_rnn_layers')
        rnn_units = kwargs['model'].get('rnn_units')
        structure = '-'.join(['%d' % rnn_units for _ in range(num_rnn_layers)])
        others = ''
        if kwargs['model'].get('global_fusion', False) is True:
            others = others + '_' + 'gf'
        if kwargs['model'].get('use_input', False) is True:
            others = others + '_' + 'input'

        K = kwargs['model'].get('K')
        graph_type = kwargs['model'].get('graph_type')
        run_id = 'ggnn_%s_%s_k%d%s_lr%g_bs%d_%s/' % (
            structure,
            graph_type,
            K,
            others,
            learning_rate,
            batch_size,
            time.strftime('%m%d%H%M%S'))
        base_dir = kwargs.get('base_dir')
        log_dir = os.path.join(base_dir, run_id)
    if not os.path.exists(log_dir):
        os.makedirs(log_dir)
    return log_dir


def init_weights(m):
    if type(m) == nn.Linear:
        xavier_uniform_(m.weight.data)
        xavier_uniform_(m.bias.data)


def main(args):
    cfg = read_cfg_file(args.config_filename)
    log_dir = _get_log_dir(cfg)
    log_level = cfg.get('log_level', 'INFO')

    logger = utils.get_logger(log_dir, __name__, 'info.log', level=log_level)

    device = torch.device(
        'cuda') if torch.cuda.is_available() else torch.device('cpu')
    #  all edge_index in same dataset is same
    # edge_index = adjacency_to_edge_index(adj_mx)  # already added self-loop
    logger.info(cfg)
    batch_size = cfg['data']['batch_size']
    test_batch_size = cfg['data']['test_batch_size']
    # edge_index = utils.load_pickle(cfg['data']['edge_index_pkl_filename'])
    sz = cfg['data'].get('name', 'notsz') == 'sz'

    adj_mx_list = []
    graph_pkl_filename = cfg['data']['graph_pkl_filename']

    if not isinstance(graph_pkl_filename, list):
        graph_pkl_filename = [graph_pkl_filename]

    src = []
    dst = []
    for g in graph_pkl_filename:
        if sz:
            adj_mx = utils.load_graph_data_sz(g)
        else:
            _, _, adj_mx = utils.load_graph_data(g)

        for i in range(len(adj_mx)):
            adj_mx[i, i] = 0
        adj_mx_list.append(adj_mx)

    adj_mx = np.stack(adj_mx_list, axis=-1)
    if cfg['model'].get('norm', False):
        print('row normalization')
        adj_mx = adj_mx / (adj_mx.sum(axis=0) + 1e-18)
    src, dst = adj_mx.sum(axis=-1).nonzero()
    edge_index = torch.tensor([src, dst], dtype=torch.long, device=device)
    edge_attr = torch.tensor(adj_mx[adj_mx.sum(axis=-1) != 0],
                             dtype=torch.float,
                             device=device)

    output_dim = cfg['model']['output_dim']
    for i in range(adj_mx.shape[-1]):
        logger.info(adj_mx[..., i])

    #  print(adj_mx.shape) (207, 207)

    if sz:
        dataset = utils.load_dataset_sz(**cfg['data'],
                                        scaler_axis=(0,
                                                     1,
                                                     2,
                                                     3))
    else:
        dataset = utils.load_dataset(**cfg['data'])
    for k, v in dataset.items():
        if hasattr(v, 'shape'):
            logger.info((k, v.shape))

    scaler = dataset['scaler']
    scaler_torch = utils.StandardScaler_Torch(scaler.mean,
                                              scaler.std,
                                              device=device)
    logger.info('scaler.mean:{}, scaler.std:{}'.format(scaler.mean,
                                                       scaler.std))

    model = Net(cfg).to(device)
    # model.apply(init_weights)
    criterion = nn.L1Loss(reduction='mean')
    optimizer = optim.Adam(model.parameters(),
                           lr=cfg['train']['base_lr'],
                           eps=cfg['train']['epsilon'])
    scheduler = StepLR2(optimizer=optimizer,
                        milestones=cfg['train']['steps'],
                        gamma=cfg['train']['lr_decay_ratio'],
                        min_lr=cfg['train']['min_learning_rate'])

    max_grad_norm = cfg['train']['max_grad_norm']
    train_patience = cfg['train']['patience']
    val_steady_count = 0
    last_val_mae = 1e6
    horizon = cfg['model']['horizon']

    for epoch in range(cfg['train']['epochs']):
        total_loss = 0
        i = 0
        begin_time = time.perf_counter()
        train_iterator = dataset['train_loader'].get_iterator()
        model.train()
        for _, (x, y, xtime, ytime) in enumerate(train_iterator):
            optimizer.zero_grad()
            y = y[:, :horizon, :, :output_dim]
            sequences, y = collate_wrapper(x=x, y=y,
                                           edge_index=edge_index,
                                           edge_attr=edge_attr,
                                           device=device)
            y_pred = model(sequences)
            y_pred = scaler_torch.inverse_transform(y_pred)
            y = scaler_torch.inverse_transform(y)
            loss = criterion(y_pred, y)
            loss.backward()
            clip_grad_norm_(model.parameters(), max_grad_norm)
            optimizer.step()
            total_loss += loss.item()
            i += 1

        val_result = evaluate(model=model,
                              dataset=dataset,
                              dataset_type='val',
                              edge_index=edge_index,
                              edge_attr=edge_attr,
                              device=device,
                              output_dim=output_dim,
                              logger=logger,
                              detail=False,
                              cfg=cfg)
        val_mae, _, _ = val_result
        time_elapsed = time.perf_counter() - begin_time

        logger.info(('Epoch:{}, train_mae:{:.2f}, val_mae:{},'
                     'r_loss={:.2f},lr={},  time_elapsed:{}').format(
                         epoch,
                         total_loss / i,
                         val_mae,
                         0,
                         str(scheduler.get_lr()),
                         time_elapsed))
        if last_val_mae > val_mae:
            logger.info('val_mae decreased from {:.2f} to {:.2f}'.format(
                last_val_mae,
                val_mae))
            last_val_mae = val_mae
            val_steady_count = 0
        else:
            val_steady_count += 1

        #  after per epoch, run evaluation on test dataset.
        if (epoch + 1) % cfg['train']['test_every_n_epochs'] == 0:
            evaluate(model=model,
                     dataset=dataset,
                     dataset_type='test',
                     edge_index=edge_index,
                     edge_attr=edge_attr,
                     device=device,
                     output_dim=output_dim,
                     logger=logger,
                     cfg=cfg)

        if (epoch + 1) % cfg['train']['save_every_n_epochs'] == 0:
            save_dir = log_dir
            if not os.path.exists(save_dir):
                os.mkdir(save_dir)
            config_path = os.path.join(save_dir,
                                       'config-{}.yaml'.format(epoch + 1))
            epoch_path = os.path.join(save_dir,
                                      'epoch-{}.pt'.format(epoch + 1))
            torch.save(model.state_dict(), epoch_path)
            with open(config_path, 'w') as f:
                from copy import deepcopy
                save_cfg = deepcopy(cfg)
                save_cfg['model']['save_path'] = epoch_path
                f.write(yaml.dump(save_cfg, Dumper=Dumper))

        if train_patience <= val_steady_count:
            logger.info('early stopping.')
            break
        scheduler.step()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--config_filename',
                        default=None,
                        type=str,
                        help='Configuration filename for restoring the model.')
    args = parser.parse_args()
    main(args)

Creating is not easy; if you find this helpful, please like, follow, and bookmark~~~

Origin blog.csdn.net/jiebaoshayebuhui/article/details/130471502