TimeSeriesDataSet: a detailed explanation of the PyTorch dataset class for time series forecasting

When doing time series forecasting or analysis, the data usually needs to be preprocessed and transformed to improve model accuracy. The TimeSeriesDataSet class is a PyTorch dataset class created for this purpose: it provides automated features that make preprocessing and transformation more convenient and efficient. It can be used for a variety of time series forecasting tasks, such as predicting stock prices, traffic flow, or energy consumption.

PyTorch Dataset for fitting timeseries models.

The dataset automates common tasks such as

* scaling and encoding of variables
* normalizing the target variable
* efficiently converting timeseries in pandas dataframes to torch tensors
* holding information about static and time-varying variables known and unknown in the future
* holding information about related categories (such as holidays)
* downsampling for data augmentation
* generating inference, validation and test datasets
* etc.

In this class, some automated functions include:

  • Variable scaling and encoding: for each variable you can specify how it is scaled and encoded, normalizing the inputs and reducing differences in magnitude between variables, which improves model performance.
  • Target normalization: the target variable of the time series can be normalized so that the model fits it more easily.
  • Data conversion: the class provides methods to convert time series data from pandas DataFrames to PyTorch tensors, so it can be fed directly to PyTorch models.
  • Preservation of variable information: the class records which variables are static and which are time-varying, and whether they are known or unknown in the future, which is important when handling time series with multiple time steps.
  • Storage of related category information: category information related to the time series, such as holidays, can be stored so that it can be used when processing the data.
  • Data augmentation: the class supports downsampling, which can be used for data augmentation and to better handle long series.
  • Dataset generation: the class can automatically generate training, validation, and test datasets for model evaluation and testing.

These automated functions can help users better process time series data and improve the effect and accuracy of models.
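
As a concrete starting point, here is a minimal usage sketch (not from the source; the dataframe and its column names "time_idx", "series" and "value" are made up for illustration):

    import numpy as np
    import pandas as pd
    from pytorch_forecasting import TimeSeriesDataSet

    # two toy series ("a" and "b") with an integer time index and one target column
    df = pd.DataFrame(
        {
            "time_idx": np.tile(np.arange(100), 2),
            "series": np.repeat(["a", "b"], 100),
            "value": np.random.randn(200).cumsum(),
        }
    )

    dataset = TimeSeriesDataSet(
        df,
        time_idx="time_idx",
        target="value",
        group_ids=["series"],
        max_encoder_length=30,          # history fed to the encoder
        max_prediction_length=7,        # forecast horizon
        time_varying_unknown_reals=["value"],
    )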

    # todo: refactor:
    # - creating base class with minimal functionality
    # - "outsource" transformations -> use pytorch transformations as default

    # todo: integrate graphs
    # - add option to pass networkx graph to the dataset -> clearly defined
    # - create method to create networkx graph for hierachies -> clearly defined
    # - convert networkx graph to pytorch geometric graph
    # - create sampler to sample from the graph
    # - create option in `to_dataloader` method to use a graph sampler
    #     -> automatically changing collate function which returns graphs
    #     -> should incorporate entire dataset but be compatible with current approach
    # - integrate hierachical loss somehow into loss metrics

    # how to get there:
    # - add networkx and pytorch_geometric to requirements BUT as extras
    #     -> do we also need torch_sparse, etc.? -> can we avoid this? probably not
    # - networkx graph: define what makes sense from user perspective
    # - define conversion into pytorch geometric graph? is this a two-step process of
    #     - encoding networkx graph and converting it into "unfilled" pytorch geometric graph
    #     - then creating full graph in collate function on the fly?
    #     - or is data already stored in pytorch geometric graph and we only cut through it?
    #     - dataformat would change? Is is all timeseries data? + mask when valid?
    #     - then making cuts through the graph in sampling?
    #     - would it be best in this case to re-think the timeseries class and design it as series of transformations?
    #     - what is the new master data? very off current state or very similar?
    #     - current approach is storing data in long format which is memory efficient and using the index object to
    #       make sense of it when accessing. graphs would require wide format?
    # - do NOT overengineer, i.e. support only usecase of single static graph, but only subset might be relevant
    #     -> however, should think what happens if we want a dynamic graph. would this completely change the
    #        data format?

    # decisions:
    # - stay with long format and create graph on the fly even if hampering efficiency and performance
    # - go with pytorch_geometric approach for future proofing
    # - directly convert networkx into pytorch_geometric graph
    # - sampling: support only time-synchronized.
    #     - sample randomly an instance from index as now.
    #     - then get additional samples as per graph (that has been created) and available data
    #     - then collate into graph object

This comment is a task list and thinking record left by a programmer, mainly so that they or other developers can understand the logic and design of the code more clearly when maintaining and improving it in the future. Here is a detailed explanation of each part:

  • "todo: refactor": The code needs to be refactored, that is, to improve the design, structure, and implementation of the code to improve the readability, maintainability, and scalability of the code.
  • "outsource transformations": abstract the function of data transformation, and use PyTorch's built-in data transformation function as the default option.
  • "integrate graphs": Integrate graph structures into datasets so that they can be used for sampling and loss computation when training neural networks.
  • "add option to pass networkx graph to the dataset": Add option to allow users to pass NetworkX graph to the dataset, so that the graph structure for sampling and loss calculation can be clearly defined.
  • "create method to create networkx graph for hierarchies": create method to create NetworkX graph for hierarchical datasets.
  • "convert networkx graph to pytorch geometric graph": converts a NetworkX graph to a PyTorch Geometric graph to use the PyTorch Geometric library when training a neural network.
  • "create sampler to sample from the graph": Create a sampler for sampling data from the graph structure.
  • "create option in to_dataloader method to use a graph sampler": Create option in to_dataloader method to use a graph sampler for data loading.
  • “incorporate entire dataset but be compatible with current approach”: It is necessary to ensure that the new graph structure method is compatible with the existing dataset and can be applied to the entire dataset.
  • "integrate hierarchical loss somehow into loss metrics": Integrate hierarchical loss structures into loss metrics.
  • "add networkx and pytorch_geometric to requirements BUT as extras": Adds the NetworkX and PyTorch Geometric libraries to the project's dependencies, but as optional extensions, not required.
  • "define what makes sense from user perspective": Define the graph structure design requirements from the user's perspective.
  • "define conversion into pytorch geometric graph": Defines the process of converting a NetworkX graph into a PyTorch Geometric graph.
  • "re-think the timeseries class and design it as series of transformations": Re-think the design of the timeseries class and design it as a series of transformation operations.
  • "do NOT overengineer": don't overengineer, i.e. don't support overly complex use cases, keep the code simple and usable.
  • "stay with long format and create graph on the fly": keep the long format of the data and dynamically generate the graph structure.
  • "go with pytorch_geometric approach for future proofing": Adopt the design method of PyTorch Geometric library to ensure the maintainability and scalability of the code.

    def __init__(
        self,
        data: pd.DataFrame,
        time_idx: str,
        target: Union[str, List[str]],
        group_ids: List[str],
        weight: Union[str, None] = None,
        max_encoder_length: int = 30,
        min_encoder_length: int = None,
        min_prediction_idx: int = None,
        min_prediction_length: int = None,
        max_prediction_length: int = 1,
        static_categoricals: List[str] = [],
        static_reals: List[str] = [],
        time_varying_known_categoricals: List[str] = [],
        time_varying_known_reals: List[str] = [],
        time_varying_unknown_categoricals: List[str] = [],
        time_varying_unknown_reals: List[str] = [],
        variable_groups: Dict[str, List[int]] = {},
        constant_fill_strategy: Dict[str, Union[str, float, int, bool]] = {},
        allow_missing_timesteps: bool = False,
        lags: Dict[str, List[int]] = {},
        add_relative_time_idx: bool = False,
        add_target_scales: bool = False,
        add_encoder_length: Union[bool, str] = "auto",
        target_normalizer: Union[NORMALIZER, str, List[NORMALIZER], Tuple[NORMALIZER]] = "auto",
        categorical_encoders: Dict[str, NaNLabelEncoder] = {},
        scalers: Dict[str, Union[StandardScaler, RobustScaler, TorchNormalizer, EncoderNormalizer]] = {},
        randomize_length: Union[None, Tuple[float, float], bool] = False,
        predict_mode: bool = False,
    ):
  • data: DataFrame containing the time series.
  • group_ids: identifier(s) of each time series, used to distinguish the series.
  • target: the name of the target variable (or list of names) to predict.
  • static_categoricals: names of static categorical features.
  • static_reals: names of static continuous features.
  • time_varying_known_categoricals: names of time-varying categorical features whose future values are known.
  • time_varying_known_reals: names of time-varying continuous features whose future values are known.
  • time_varying_unknown_categoricals: names of time-varying categorical features whose future values are unknown.
  • time_varying_unknown_reals: names of time-varying continuous features whose future values are unknown.
  • max_encoder_length: maximum encoder (history) length.
  • max_prediction_length: maximum prediction (decoder) length.
  • samplers, batch size and number of worker processes: note that these are not constructor arguments; they are configured later when creating dataloaders from the dataset.
  • scalers: a dictionary of scikit-learn style scalers used to scale the data.
  • randomize_length: controls whether and how sequence lengths are randomized.
  • predict_mode: whether to iterate over each time series only once, i.e. use only the last subsequence of each series as a prediction sample.

This is a Python class constructor used to construct a time series dataset. This dataset is used to train the time series model. The following is an explanation of the parameters of this constructor:

  • data: pd.DataFrame. A DataFrame holding the time series data; each row is identified by the time index (time_idx) together with the group_ids.
  • time_idx: str. The column name representing the time. This column is used to determine the time series of the samples.
  • target: Union[str, List[str]]. The target column or list of target columns, which can be categorical or continuous.
  • group_ids: List[str]. A list of column names representing the time series. This means that group_ids together with time_idx determine the samples. If there is only one time series, setting it to a constant column name will do the trick.
  • weight: Union[str, None]. Column names for weights. The default is None.
  • max_encoder_length: int. Maximum encoding length. This is the maximum history length used by time series datasets.
  • min_encoder_length: int. The minimum encoding length allowed. Defaults to max_encoder_length.
  • min_prediction_idx: int. The time index from which to start forecasting. This parameter can be used to create validation or test sets.
  • max_prediction_length: int. Maximum prediction/decoding length (don't choose a length that is too short as it may cause difficulty in convergence).
  • min_prediction_length: int. Minimum prediction/decoding length. Defaults to max_prediction_length.
  • static_categoricals: List[str]. A list of categorical variables that do not change over time; entries can themselves be lists, in which case they are encoded together (useful e.g. for product categories).
  • static_reals: List[str]. A list of continuous variables that do not change over time.
  • time_varying_known_categoricals: List[str]. A list of categorical variables that change over time but whose future values are known; entries can themselves be lists, in which case they are encoded together (useful e.g. for special dates or promotion categories).
  • time_varying_known_reals: List[str]. A list of continuous variables that change over time and whose future values are known (for example, the price of a product, but not the demand for the product).
  • time_varying_unknown_categoricals: List[str]. A list of categorical variables that change over time and whose future values are unknown; entries can themselves be lists, in which case they are encoded together.
  • time_varying_unknown_reals: List[str]. A list of continuous variables that change over time and whose future is unknown.
  • variable_groups: Dict[str, List[int]]. A dictionary that groups variables together; keys are group names and values are lists identifying the variables that belong to each group.
  • constant_fill_strategy: Dict[str, Union[str, float, int, bool]]. Dictionary of constant fill strategies, where keys are column names and values are the constants to fill with, or the strings "ffill" or "bfill".
  • allow_missing_timesteps (bool): Whether to allow missing time steps in the data and fill them automatically in the dataset. Missing time steps are gaps in the time series: for example, if a series only contains time steps 1, 2, 4, and 5, time step 3 will be generated automatically when building the dataset. Note that this parameter does not handle missing values (NA values); NA values should be filled before the dataframe is passed to the TimeSeriesDataSet.
  • lags (Dict[str, List[int]]): Dictionary defining the lagged time steps of variables. Lags can be used to indicate seasonality to the model; if you know the seasonality of your data, you should add at least the target variable with its corresponding lags to improve performance. Lags cannot be larger than the shortest time series, and all series will be truncated by the maximum lag to prevent NA values. Lagged variables must appear among the time-varying variables. If you only want lagged values and not the current value, lag the variable manually in the input data. (See the instantiation sketch after this list.)
  • add_relative_time_idx (bool): Whether to add relative time indices as features to the dataset. For each sample sequence, the index will range from -encoder_length to prediction_length.
  • add_target_scales (bool): Whether to add the center and scale of the target as static real-valued features, i.e. the center and scale of the unnormalized time series are added to the dataset as features.
  • add_encoder_length (bool): Whether to add the encoder length to the list of static real-valued variables. "auto" by default, i.e. "True" when "min_encoder_length != max_encoder_length".
  • target_normalizer (Union[TorchNormalizer, NaNLabelEncoder, EncoderNormalizer, str, list, tuple]) : Transformer used to normalize the target. Choose from TorchNormalizer, GroupNormalizer, NaNLabelEncoder, EncoderNormalizer, or None. For multiple targets, use MultiNormalizer. By default, an appropriate normalizer will be chosen automatically.
  • categorical_encoders (Dict[str, NaNLabelEncoder]): A dictionary of scikit-learn label encoders. If there are unobserved categories / cold-start issues in the future, you can use NaNLabelEncoder with "add_nan=True". Defaults to scikit-learn's "LabelEncoder()". Pre-fitted encoders will not be fitted again.
  • scalers: A dictionary mapping variable names to normalization methods, such as StandardScaler and RobustScaler from scikit-learn, or EncoderNormalizer and GroupNormalizer from pytorch_forecasting. By default, scikit-learn's StandardScaler is used. To use another method, specify it in the dictionary; you can also use None for no normalization, or a normalizer with center=0 and scale=1 (method="identity"). Pre-fitted scalers are not fitted again, with the exception of EncoderNormalizer, which has to be fitted on every encoder sequence.
  • randomize_length: Controls whether and how the sequence length is randomly sampled.
  • predict_mode: If True, each time series is iterated over only once, i.e. only the last provided samples of each series are used for prediction.
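To make the parameter descriptions above concrete, here is a hedged instantiation sketch; `df` and the column names ("date_idx", "store", "product", "price", "volume") are hypothetical and only illustrate how the arguments fit together:

    from pytorch_forecasting import TimeSeriesDataSet
    from pytorch_forecasting.data import GroupNormalizer

    training_cutoff = df["date_idx"].max() - 7   # last 7 steps held out

    training = TimeSeriesDataSet(
        df[df["date_idx"] <= training_cutoff],
        time_idx="date_idx",
        target="volume",
        group_ids=["store", "product"],
        max_encoder_length=60,
        max_prediction_length=7,
        static_categoricals=["store", "product"],        # constant per series
        time_varying_known_reals=["date_idx", "price"],  # known in the future
        time_varying_unknown_reals=["volume"],           # unknown in the future
        lags={"volume": [7, 14]},                        # hint at weekly seasonality
        add_relative_time_idx=True,
        add_target_scales=True,
        add_encoder_length="auto",
        target_normalizer=GroupNormalizer(groups=["store", "product"]),
        allow_missing_timesteps=True,
    )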
        super().__init__()
        self.max_encoder_length = max_encoder_length
        assert isinstance(self.max_encoder_length, int), "max encoder length must be integer"
        if min_encoder_length is None:
            min_encoder_length = max_encoder_length
        self.min_encoder_length = min_encoder_length
        assert (
            self.min_encoder_length <= self.max_encoder_length
        ), "max encoder length has to be larger equals min encoder length"
        assert isinstance(self.min_encoder_length, int), "min encoder length must be integer"
        self.max_prediction_length = max_prediction_length
        assert isinstance(self.max_prediction_length, int), "max prediction length must be integer"
        if min_prediction_length is None:
            min_prediction_length = max_prediction_length
        self.min_prediction_length = min_prediction_length
        assert (
            self.min_prediction_length <= self.max_prediction_length
        ), "max prediction length has to be larger equals min prediction length"
        assert self.min_prediction_length > 0, "min prediction length must be larger than 0"
        assert isinstance(self.min_prediction_length, int), "min prediction length must be integer"
        assert data[time_idx].dtype.kind == "i", "Timeseries index should be of type integer"
        self.target = target
        self.weight = weight
        self.time_idx = time_idx
        self.group_ids = [] + group_ids
        self.static_categoricals = [] + static_categoricals
        self.static_reals = [] + static_reals
        self.time_varying_known_categoricals = [] + time_varying_known_categoricals
        self.time_varying_known_reals = [] + time_varying_known_reals
        self.time_varying_unknown_categoricals = [] + time_varying_unknown_categoricals
        self.time_varying_unknown_reals = [] + time_varying_unknown_reals
        self.add_relative_time_idx = add_relative_time_idx

        # set automatic defaults
        if isinstance(randomize_length, bool):
            if not randomize_length:
                randomize_length = None
            else:
                randomize_length = (0.2, 0.05)
        self.randomize_length = randomize_length
        if min_prediction_idx is None:
            min_prediction_idx = data[self.time_idx].min()
        self.min_prediction_idx = min_prediction_idx
        self.constant_fill_strategy = {} if len(constant_fill_strategy) == 0 else constant_fill_strategy
        self.predict_mode = predict_mode
        self.allow_missing_timesteps = allow_missing_timesteps
        self.target_normalizer = target_normalizer
        self.categorical_encoders = {} if len(categorical_encoders) == 0 else categorical_encoders
        self.scalers = {} if len(scalers) == 0 else scalers
        self.add_target_scales = add_target_scales
        self.variable_groups = {} if len(variable_groups) == 0 else variable_groups
        self.lags = {} if len(lags) == 0 else lags
  • super().__init__(): call the parent class constructor to initialize the object.

  • self.max_encoder_length = max_encoder_length: A parameter that defines the maximum encoder length, that is, the maximum length of the time series data used to train the model.

  • assert isinstance(self.max_encoder_length, int), “max encoder length must be integer”: assert that the value of the maximum encoder length must be an integer type, otherwise an exception will be thrown.

  • if min_encoder_length is None: min_encoder_length = max_encoder_length: If the parameter for the minimum encoder length is not specified, it will default to the maximum encoder length.

  • self.min_encoder_length = min_encoder_length: A parameter that defines the minimum encoder length, that is, the minimum length of time series data used to train the model.

  • assert (self.min_encoder_length <= self.max_encoder_length), “max encoder length has to be larger equals min encoder length”: assert that the maximum encoder length must be greater than or equal to the minimum encoder length, otherwise an exception will be thrown.

  • assert isinstance(self.min_encoder_length, int), “min encoder length must be integer”: asserts that the value of the minimum encoder length must be an integer type, otherwise an exception will be thrown.

  • self.max_prediction_length = max_prediction_length: The parameter that defines the maximum prediction length, that is, the maximum length of the time series data predicted by the model.

  • assert isinstance(self.max_prediction_length, int), “max prediction length must be integer”: assert that the value of the maximum prediction length must be an integer type, otherwise an exception will be thrown.

  • if min_prediction_length is None: min_prediction_length = max_prediction_length: If the parameter for the minimum prediction length is not specified, it will default to the maximum prediction length.

  • self.min_prediction_length = min_prediction_length: The parameter that defines the minimum prediction length, that is, the minimum length of the time series data predicted by the model.

  • assert (self.min_prediction_length <= self.max_prediction_length), “max prediction length has to be larger equals min prediction length”: assert that the maximum prediction length must be greater than or equal to the minimum prediction length, otherwise an exception will be thrown.

  • assert self.min_prediction_length > 0, “min prediction length must be larger than 0”: assert that the minimum prediction length must be greater than 0, otherwise an exception will be thrown.

  • assert isinstance(self.min_prediction_length, int), “min prediction length must be integer”: assert that the value of the minimum prediction length must be an integer type, otherwise an exception will be thrown.

  • assert data[time_idx].dtype.kind == "i", "Timeseries index should be of type integer": assert that the index of time series data must be of type integer, otherwise an exception will be thrown.

  • self.target = target: defines the name of the target variable.

  • self.weight = weight: Defines the name of the weight variable.

  • self.time_idx = time_idx: store the name of the column used as the time index of the input time series data in self.time_idx.

  • self.group_ids = [] + group_ids: Copy the input group ID list group_ids to self.group_ids.

  • self.static_categoricals = [] + static_categoricals: Copy the input static categorical variable list static_categoricals into self.static_categoricals.

  • self.static_reals = [] + static_reals: Copy the input static real variable list static_reals to self.static_reals.

  • self.time_varying_known_categoricals = [] + time_varying_known_categoricals: Copy the input time_varying_known_categoricals list of categorical variables known to vary over time into self.time_varying_known_categoricals.

  • self.time_varying_known_reals = [] + time_varying_known_reals: Copy the input known time varying real variable list time_varying_known_reals to self.time_varying_known_reals.

  • self.time_varying_unknown_categoricals = [] + time_varying_unknown_categoricals: Copy the input unknown time varying categorical variable list time_varying_unknown_categoricals to self.time_varying_unknown_categoricals.

  • self.time_varying_unknown_reals = [] + time_varying_unknown_reals: Copy the input unknown time varying real variable list time_varying_unknown_reals to self.time_varying_unknown_reals.

  • self.add_relative_time_idx = add_relative_time_idx : If True, add the relative time index of the time variable to the model.

In addition, the following piece of code is to set the automatic default value:

  • if isinstance(randomize_length, bool): if randomize_length is a boolean:
  • if not randomize_length: if randomize_length is False, set randomize_length to None (no randomization).
  • else: otherwise:
  • randomize_length = (0.2, 0.05): set randomize_length to the default tuple (0.2, 0.05) used for sampling sequence lengths.
  • self.randomize_length: controls whether and how the length of each sampled sequence is randomized. If None, sequence lengths are not randomized; if a tuple is given, its two values parameterize the distribution from which shortened sequence lengths are sampled.
  • self.min_prediction_idx: the first time index to predict; if not specified, it defaults to the minimum time index in the data. It constrains the prediction time index of each series so that the model does not forecast times before this index.
  • self.constant_fill_strategy: constant fill strategy, a dictionary specifying which constants to fill in for missing time steps. If the dictionary is empty, no constant filling is done.
  • self.predict_mode: if True, each time series is iterated over only once, using only the last possible subsequence of each series; this is typically used to build prediction datasets.
  • self.allow_missing_timesteps: a boolean controlling whether missing time steps are allowed. If True, gaps in the time index are allowed and are filled automatically (using self.constant_fill_strategy where configured); if False, gaps in the time index raise an error.
  • self.target_normalizer: The normalizer of the target variable, used to standardize the value range of the target variable.
  • self.categorical_encoders: Encoders for categorical variables, which is a dictionary type, used to encode categorical variables.
  • self.scalers: Variable scalers, a dictionary type, used to scale numeric variables.
  • self.add_target_scales: A Boolean value used to control whether the target variable needs to be scaled.
  • self.variable_groups: variable group, which is a dictionary type, used to group different variables.
  • self.lags: a dictionary specifying the lag steps to create for individual variables.
        # add_encoder_length
        if isinstance(add_encoder_length, str):
            assert (
                add_encoder_length == "auto"
            ), f"Only 'auto' allowed for add_encoder_length but found {add_encoder_length}"
            add_encoder_length = self.min_encoder_length != self.max_encoder_length
        assert isinstance(
            add_encoder_length, bool
        ), f"add_encoder_length should be boolean or 'auto' but found {add_encoder_length}"
        self.add_encoder_length = add_encoder_length

        # target normalizer
        self._set_target_normalizer(data)

        # overwrite values
        self.reset_overwrite_values()

        for target in self.target_names:
            assert (
                target not in self.time_varying_known_reals
            ), f"target {target} should be an unknown continuous variable in the future"

add_encoder_length
If add_encoder_length is a string, it must be "auto", otherwise an assertion error is raised; 'auto' means the dataset decides automatically whether to add the encoder length, setting add_encoder_length to True when self.min_encoder_length is not equal to self.max_encoder_length. After this resolution, add_encoder_length must be a boolean, and self.add_encoder_length is set to its value.

target normalizer
_set_target_normalizer(data) chooses and sets the normalizer for the target variable.

overwrite values
reset_overwrite_values() resets overwrite_values, i.e. clears any previously overwritten values.

assert
For each target, assert that it does not appear in self.time_varying_known_reals; a target must be an unknown continuous variable in the future.

        # add time index relative to prediction position
        if self.add_relative_time_idx or self.add_encoder_length:
            data = data.copy()  # only copies indices (underlying data is NOT copied)
        if self.add_relative_time_idx:
            assert (
                "relative_time_idx" not in data.columns
            ), "relative_time_idx is a protected column and must not be present in data"
            if "relative_time_idx" not in self.time_varying_known_reals and "relative_time_idx" not in self.reals:
                self.time_varying_known_reals.append("relative_time_idx")
            data.loc[:, "relative_time_idx"] = 0.0  # dummy - real value will be set dynamiclly in __getitem__()

add time index relative to prediction position
If self.add_relative_time_idx or self.add_encoder_length is True, the data is copied (only the indices are copied, not the underlying data). If self.add_relative_time_idx is True, the code asserts that a "relative_time_idx" column does not already exist in data, appends "relative_time_idx" to self.time_varying_known_reals if it is not already listed, and creates the column in data with all values set to 0.0 as a dummy (the real values are set dynamically in __getitem__()).
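
As a rough illustration of the range described above (an assumption based on the parameter description; the library sets the actual values dynamically in __getitem__() and may scale them, so this only shows the span):

    import numpy as np

    encoder_length, prediction_length = 3, 2
    # span implied by "from -encoder_length to prediction_length" for one sample
    relative_time_idx = np.arange(-encoder_length, prediction_length)
    # array([-3, -2, -1,  0,  1]): encoder steps are negative, decoder steps start at 0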

        # add decoder length to static real variables
        if self.add_encoder_length:
            assert (
                "encoder_length" not in data.columns
            ), "encoder_length is a protected column and must not be present in data"
            if "encoder_length" not in self.time_varying_known_reals and "encoder_length" not in self.reals:
                self.static_reals.append("encoder_length")
            data.loc[:, "encoder_length"] = 0  # dummy - real value will be set dynamiclly in __getitem__()

        # validate
        self._validate_data(data)
        assert data.index.is_unique, "data index has to be unique"


add decoder length to static real variables
If self.add_encoder_length is True, the code asserts that an "encoder_length" column does not already exist in data, appends "encoder_length" to self.static_reals if it is not already listed, and creates the column in data with all values set to 0 as a dummy (the real values are set dynamically in __getitem__()).

validate
_validate_data(data) verifies that the data meets the requirements, and the data index must be unique.

        # add lags
        assert self.min_lag > 0, "lags should be positive"
        if len(self.lags) > 0:
            # add variables
            for name in self.lags:
                lagged_names = self._get_lagged_names(name)
                for lagged_name in lagged_names:
                    assert (
                        lagged_name not in data.columns
                    ), f"{lagged_name} is a protected column and must not be present in data"
                # add lags
                if name in self.time_varying_known_reals:
                    for lagged_name in lagged_names:
                        if lagged_name not in self.time_varying_known_reals:
                            self.time_varying_known_reals.append(lagged_name)
                elif name in self.time_varying_known_categoricals:
                    for lagged_name in lagged_names:
                        if lagged_name not in self.time_varying_known_categoricals:
                            self.time_varying_known_categoricals.append(lagged_name)
                elif name in self.time_varying_unknown_reals:
                    for lagged_name, lag in lagged_names.items():
                        if lag < self.max_prediction_length:  # keep in unknown as if lag is too small
                            if lagged_name not in self.time_varying_unknown_reals:
                                self.time_varying_unknown_reals.append(lagged_name)
                        else:
                            if lagged_name not in self.time_varying_known_reals:
                                # switch to known so that lag can be used in decoder directly
                                self.time_varying_known_reals.append(lagged_name)
                elif name in self.time_varying_unknown_categoricals:
                    for lagged_name, lag in lagged_names.items():
                        if lag < self.max_prediction_length:  # keep in unknown as if lag is too small
                            if lagged_name not in self.time_varying_unknown_categoricals:
                                self.time_varying_unknown_categoricals.append(lagged_name)
                        else:
                            if lagged_name not in self.time_varying_known_categoricals:
                                # switch to known so that lag can be used in decoder directly
                                self.time_varying_known_categoricals.append(lagged_name)
                else:
                    raise KeyError(f"lagged variable {name} is not a known nor unknown time-varying variable")

add lags
First, assert that all lags are positive. If lags is non-empty, lagged variable names are generated for each lagged variable, and every lagged name is asserted not to already exist in data.columns. The lagged names are then registered: if the original variable is in self.time_varying_known_reals or self.time_varying_known_categoricals, the lagged names are added to the same list; if it is in self.time_varying_unknown_reals or self.time_varying_unknown_categoricals, a lagged name stays in the corresponding "unknown" list when its lag is smaller than self.max_prediction_length, and is added to the corresponding "known" list otherwise, so that the lag can be used directly in the decoder. If the variable is neither a known nor an unknown time-varying variable, a KeyError is raised.
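
A small, hedged recreation of this classification for an unknown real variable ("volume" and its lags are hypothetical): lags shorter than the forecast horizon stay "unknown", longer lags move to the "known" list so the decoder can use them directly.

    max_prediction_length = 7
    lags = {"volume": [1, 7, 14]}

    time_varying_unknown_reals, time_varying_known_reals = [], []
    for lag in lags["volume"]:
        lagged_name = f"volume_lagged_by_{lag}"              # naming scheme of _get_lagged_names
        if lag < max_prediction_length:
            time_varying_unknown_reals.append(lagged_name)   # lag too small: still unknown while decoding
        else:
            time_varying_known_reals.append(lagged_name)     # lag covers the horizon: usable by the decoder
    # unknown -> ["volume_lagged_by_1"]
    # known   -> ["volume_lagged_by_7", "volume_lagged_by_14"]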

        # filter data
        if min_prediction_idx is not None:
            # filtering for min_prediction_idx will be done on subsequence level ensuring
            # minimal decoder index is always >= min_prediction_idx
            data = data[lambda x: x[self.time_idx] >= self.min_prediction_idx - self.max_encoder_length - self.max_lag]
        data = data.sort_values(self.group_ids + [self.time_idx])

        # preprocess data
        data = self._preprocess_data(data)
        for target in self.target_names:
            assert target not in self.scalers, "Target normalizer is separate and not in scalers."

        # create index
        self.index = self._construct_index(data, predict_mode=self.predict_mode)

        # convert to torch tensor for high performance data loading later
        self.data = self._data_to_tensors(data)

This code filters and preprocesses the data and converts it to PyTorch tensors for fast data loading later. The explanation is as follows:

  • If min_prediction_idx is not None, the data is filtered so that the smallest decoder index is always greater than or equal to min_prediction_idx. The actual enforcement happens at the subsequence level; here, rows earlier than min_prediction_idx - max_encoder_length - max_lag are dropped because they can never be part of a valid sample. The filtering is done with a lambda expression on the DataFrame.
  • The filtered data is sorted by group_ids and time_idx so that it can be grouped and processed more efficiently in subsequent operations.
  • The data is preprocessed (scaling, encoding, handling of missing values). For each target variable, assert that it is not listed in the scalers dictionary, because the target normalizer is kept separate.
  • The index is built from the preprocessed data, for subsequent batching and training.
  • The data is converted to PyTorch tensors for high-performance data loading and stored in self.data for later use.
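
Building on the hypothetical `training` example earlier, a hedged sketch of how min_prediction_idx and predict_mode can be combined to carve out a validation set whose forecasts start only after the training data:

    validation = TimeSeriesDataSet(
        df,                                       # full (hypothetical) dataframe
        time_idx="date_idx",
        target="volume",
        group_ids=["store", "product"],
        max_encoder_length=60,
        max_prediction_length=7,
        time_varying_unknown_reals=["volume"],
        min_prediction_idx=training_cutoff + 1,   # decoder may only start after the cutoff
        predict_mode=True,                        # one (last) sample per time series
    )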
    @property
    def dropout_categoricals(self) -> List[str]:
        """
        list of categorical variables that are unknown when making a
        forecast without observed history
        """
        return [name for name, encoder in self.categorical_encoders.items() if encoder.add_nan]

    def _get_lagged_names(self, name: str) -> Dict[str, int]:
        """
        Generate names for lagged variables

        Args:
            name (str): name of variable to lag

        Returns:
            Dict[str, int]: dictionary mapping new variable names to lags
        """
        return {f"{name}_lagged_by_{lag}": lag for lag in self.lags.get(name, [])}

    @property
    @lru_cache(None)
    def lagged_variables(self) -> Dict[str, str]:
        """
        Lagged variables.

        Returns:
            Dict[str, str]: dictionary of variable names corresponding to lagged variables
                mapped to variable that is lagged
        """
        vars = {}
        for name in self.lags:
            vars.update({lag_name: name for lag_name in self._get_lagged_names(name)})
        return vars

    @property
    @lru_cache(None)
    def lagged_targets(self) -> Dict[str, str]:
        """Subset of `lagged_variables` but only includes variables that are lagged targets."""
        vars = {}
        for name in self.lags:
            vars.update(
                {lag_name: name for lag_name in self._get_lagged_names(name) if name in self.target_names}
            )
        return vars

    @property
    @lru_cache(None)
    def min_lag(self) -> int:
        """
        Minimum number of time steps variables are lagged.

        Returns:
            int: minimum lag
        """
        if len(self.lags) == 0:
            return 1e9
        else:
            return min([min(lag) for lag in self.lags.values()])

    @property
    @lru_cache(None)
    def max_lag(self) -> int:
        """
        Maximum number of time steps variables are lagged.

        Returns:
            int: maximum lag
        """
        if len(self.lags) == 0:
            return 0
        else:
            return max([max(lag) for lag in self.lags.values()])

Here are some attributes and methods in a Python class that are mainly used for feature engineering of time series data. Each property and method is explained one by one below:

@property: This is a decorator in Python used to convert a method into a class property.

dropout_categoricals: This is a class attribute method that returns a list containing the names of categorical variables for which no historical data was observed during the forecasting process. Specifically, this list contains all categorical variable names that satisfy encoder.add_nan=True.
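
For example (hedged; "product" is a hypothetical categorical column), an encoder constructed with add_nan=True is exactly what this property looks for:

    from pytorch_forecasting.data import NaNLabelEncoder

    # passed via the categorical_encoders argument of the constructor, this makes the
    # variable tolerant to unseen categories and lets it appear in dropout_categoricals
    categorical_encoders = {"product": NaNLabelEncoder(add_nan=True)}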

_get_lagged_names: This is an instance method for generating the names of lagged variables. It takes a variable name as an argument and returns a dictionary mapping each lagged variable name to its lag (number of time steps).

lagged_variables: This is a class attribute method that returns a dictionary containing all lagged variable names and their corresponding raw variable names. Specifically, this dictionary contains the lagged versions of all variable names generated by the method _get_lagged_names.

lagged_targets: This is a class attribute method that returns a dictionary containing all lagged target variable names and their corresponding original variable names. Specifically, this dictionary contains only the lagged versions of the target variable names satisfying name in self.target_names, which are also generated by the method _get_lagged_names.

min_lag: This is a class attribute method that returns the smallest lag across all lagged variables, i.e. the minimum of min(lag) over every entry in self.lags (or a very large number if no lags are configured).

max_lag: This is a class attribute method that returns the largest lag across all lagged variables, i.e. the maximum of max(lag) over every entry in self.lags (or 0 if no lags are configured).
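
A hedged, standalone recreation of these helpers for lags={"volume": [1, 7]}, with "volume" also being the target (names are illustrative):

    lags = {"volume": [1, 7]}
    target_names = ["volume"]

    def get_lagged_names(name):
        # same naming scheme as TimeSeriesDataSet._get_lagged_names
        return {f"{name}_lagged_by_{lag}": lag for lag in lags.get(name, [])}

    lagged_variables = {ln: n for n in lags for ln in get_lagged_names(n)}
    lagged_targets = {ln: n for n in lags for ln in get_lagged_names(n) if n in target_names}
    min_lag = min(min(l) for l in lags.values())   # 1
    max_lag = max(max(l) for l in lags.values())   # 7
    # lagged_variables == {"volume_lagged_by_1": "volume", "volume_lagged_by_7": "volume"}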

 def _set_target_normalizer(self, data: pd.DataFrame):
        """
        Determine target normalizer.

        Args:
            data (pd.DataFrame): input data
        """
        if isinstance(self.target_normalizer, str) and self.target_normalizer == "auto":
            normalizers = []
            for target in self.target_names:
                if data[target].dtype.kind != "f":  # category
                    normalizers.append(NaNLabelEncoder())
                    if self.add_target_scales:
                        warnings.warn("Target scales will be only added for continous targets", UserWarning)
                else:
                    data_positive = (data[target] > 0).all()
                    if data_positive:
                        if data[target].skew() > 2.5:
                            transformer = "log"
                        else:
                            transformer = "relu"
                    else:
                        transformer = None
                    if self.max_encoder_length > 20 and self.min_encoder_length > 1:
                        normalizers.append(EncoderNormalizer(transformation=transformer))
                    else:
                        normalizers.append(GroupNormalizer(transformation=transformer))
            if self.multi_target:
                self.target_normalizer = MultiNormalizer(normalizers)
            else:
                self.target_normalizer = normalizers[0]
        elif isinstance(self.target_normalizer, (tuple, list)):
            self.target_normalizer = MultiNormalizer(self.target_normalizer)
        elif self.target_normalizer is None:
            self.target_normalizer = TorchNormalizer(method="identity")
        assert (
            not isinstance(self.target_normalizer, EncoderNormalizer)
            or self.min_encoder_length >= self.target_normalizer.min_length
        ), "EncoderNormalizer is only allowed if min_encoder_length > 1"
        assert isinstance(
            self.target_normalizer, (TorchNormalizer, NaNLabelEncoder)
        ), f"target_normalizer has to be either None or of class TorchNormalizer but found {self.target_normalizer}"
        assert not self.multi_target or isinstance(self.target_normalizer, MultiNormalizer), (
            "multiple targets / list of targets requires MultiNormalizer as target_normalizer "
            f"but found {
      
      self.target_normalizer}"

This code determines the target normalizer; the target is the variable to be predicted. The function takes a dataframe containing the target variables as its argument.

First, the function checks the type of the target normalizer. If it is the string "auto", a normalizer is chosen automatically based on the target variable. If the target is categorical, a NaNLabelEncoder is used. Otherwise the distribution of the target is inspected: if all values are positive, a "log" transformation is used when the skewness is greater than 2.5 and a "relu" transformation otherwise; if the values are not all positive, no transformation is applied. The normalizer itself is an EncoderNormalizer when max_encoder_length > 20 and min_encoder_length > 1, and a GroupNormalizer otherwise. For multi-target prediction, the chosen normalizers are wrapped in a MultiNormalizer.

MultiNormalizer is used if the target normalizer is a tuple or list. If target normalizer is None, TorchNormalizer is used. Finally, the function checks and asserts the type of the target normalizer to ensure that the normalizer is correct.
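
A hedged, simplified restatement of the "auto" branch above, for a single target (this is not the library code itself, just the decision logic condensed into one function):

    import pandas as pd
    from pytorch_forecasting.data import EncoderNormalizer, GroupNormalizer, NaNLabelEncoder

    def pick_target_normalizer(target: pd.Series, max_encoder_length: int, min_encoder_length: int):
        if target.dtype.kind != "f":                      # categorical target
            return NaNLabelEncoder()
        if (target > 0).all():
            transformer = "log" if target.skew() > 2.5 else "relu"
        else:
            transformer = None
        if max_encoder_length > 20 and min_encoder_length > 1:
            return EncoderNormalizer(transformation=transformer)
        return GroupNormalizer(transformation=transformer)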

    @property
    @lru_cache(None)
    def _group_ids_mapping(self) -> Dict[str, str]:
        """
        Mapping of group id names to group ids used to identify series in dataset -
        group ids can also be used for target normalizer.
        The former can change from training to validation and test dataset while the later must not.
        """
        return {name: f"__group_id__{name}" for name in self.group_ids}

    @property
    @lru_cache(None)
    def _group_ids(self) -> List[str]:
        """
        Group ids used to identify series in dataset.

        See :py:meth:`~TimeSeriesDataSet._group_ids_mapping` for details.
        """
        return list(self._group_ids_mapping.values())

    def _validate_data(self, data: pd.DataFrame):
        """
        Validate that data will not cause hick-ups later on.
        """
        # check for numeric categoricals which can cause hick-ups in logging in tensorboard
        category_columns = data.head(1).select_dtypes("category").columns
        object_columns = data.head(1).select_dtypes(object).columns
        for name in self.flat_categoricals:
            if name not in data.columns:
                raise KeyError(f"variable {
      
      name} specified but not found in data")
            if not (
                name in object_columns
                or (name in category_columns and data[name].cat.categories.dtype.kind not in "bifc")
            ):
                raise ValueError(
                    f"Data type of category {
      
      name} was found to be numeric - use a string type / categorified string"
                )
        # check for "." in column names
        columns_with_dot = data.columns[data.columns.str.contains(r"\.")]
        if len(columns_with_dot) > 0:
            raise ValueError(
                f"column names must not contain '.' characters. Names {
      
      columns_with_dot.tolist()} are invalid"
            )

    def save(self, fname: str) -> None:
        """
        Save dataset to disk

        Args:
            fname (str): filename to save to
        """
        torch.save(self, fname)

    @classmethod
    def load(cls, fname: str):
        """
        Load dataset from disk

        Args:
            fname (str): filename to load from

        Returns:
            TimeSeriesDataSet
        """
        obj = torch.load(fname)
        assert isinstance(obj, cls), f"Loaded file is not of class {cls}"
        return obj

This code defines a class TimeSeriesDataSet, including the following methods:

The _group_ids_mapping and _group_ids methods, decorated with @property and @lru_cache(None), return the group ids and their mapping, which identify the series in the dataset.
The _validate_data method verifies that the dataset meets some requirements, for example that category columns are of string type.
The save and load methods save and load the dataset to and from disk, using PyTorch's torch.save and torch.load.
The docstrings explain the purpose of these methods: _group_ids_mapping returns a dictionary mapping each group id name to the internal column name used to identify series in the dataset; _validate_data checks, for example, that category columns are strings (numeric categories cause problems when logging to TensorBoard) and that column names do not contain '.' characters; save and load persist the dataset to disk for later training and inference.
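
A short usage sketch for the save/load pair above ("my_dataset.pt" is an arbitrary filename):

    dataset.save("my_dataset.pt")                       # serializes the whole dataset via torch.save
    restored = TimeSeriesDataSet.load("my_dataset.pt")  # torch.load plus an isinstance check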

    def _preprocess_data(self, data: pd.DataFrame) -> pd.DataFrame:
        """
        Scale continuous variables, encode categories and set aside target and weight.

        Args:
            data (pd.DataFrame): original data

        Returns:
            pd.DataFrame: pre-processed dataframe
        """
        # add lags to data
        for name in self.lags:
            # todo: add support for variable groups
            assert (
                name not in self.variable_groups
            ), f"lagged variables that are in {
      
      self.variable_groups} are not supported yet"
            for lagged_name, lag in self._get_lagged_names(name).items():
                data[lagged_name] = data.groupby(self.group_ids, observed=True)[name].shift(lag)

        # encode group ids - this encoding
        for name, group_name in self._group_ids_mapping.items():
            # use existing encoder - but a copy of it not too loose current encodings
            encoder = deepcopy(self.categorical_encoders.get(group_name, NaNLabelEncoder()))
            self.categorical_encoders[group_name] = encoder.fit(data[name].to_numpy().reshape(-1), overwrite=False)
            data[group_name] = self.transform_values(name, data[name], inverse=False, group_id=True)

        # encode categoricals first to ensure that group normalizer for relies on encoded categories
        if isinstance(
            self.target_normalizer, (GroupNormalizer, MultiNormalizer)
        ):  # if we use a group normalizer, group_ids must be encoded as well
            group_ids_to_encode = self.group_ids
        else:
            group_ids_to_encode = []
        for name in dict.fromkeys(group_ids_to_encode + self.categoricals):
            if name in self.lagged_variables:
                continue  # do not encode here but only in transform
            if name in self.variable_groups:  # fit groups
                columns = self.variable_groups[name]
                if name not in self.categorical_encoders:
                    self.categorical_encoders[name] = NaNLabelEncoder().fit(data[columns].to_numpy().reshape(-1))
                elif self.categorical_encoders[name] is not None:
                    try:
                        check_is_fitted(self.categorical_encoders[name])
                    except NotFittedError:
                        self.categorical_encoders[name] = self.categorical_encoders[name].fit(
                            data[columns].to_numpy().reshape(-1)
                        )
            else:
                if name not in self.categorical_encoders:
                    self.categorical_encoders[name] = NaNLabelEncoder().fit(data[name])
                elif self.categorical_encoders[name] is not None and name not in self.target_names:
                    try:
                        check_is_fitted(self.categorical_encoders[name])
                    except NotFittedError:
                        self.categorical_encoders[name] = self.categorical_encoders[name].fit(data[name])

        # encode them
        for name in dict.fromkeys(group_ids_to_encode + self.flat_categoricals):
            # targets and its lagged versions are handled separetely
            if name not in self.target_names and name not in self.lagged_targets:
                data[name] = self.transform_values(
                    name, data[name], inverse=False, ignore_na=name in self.lagged_variables
                )

        # save special variables
        assert "__time_idx__" not in data.columns, "__time_idx__ is a protected column and must not be present in data"
        data["__time_idx__"] = data[self.time_idx]  # save unscaled
        for target in self.target_names:
            assert (
                f"__target__{target}" not in data.columns
            ), f"__target__{target} is a protected column and must not be present in data"
            data[f"__target__{target}"] = data[target]
        if self.weight is not None:
            data["__weight__"] = data[self.weight]

        # train target normalizer
        if self.target_normalizer is not None:

            # fit target normalizer
            try:
                check_is_fitted(self.target_normalizer)
            except NotFittedError:
                if isinstance(self.target_normalizer, EncoderNormalizer):
                    self.target_normalizer.fit(data[self.target])
                elif isinstance(self.target_normalizer, (GroupNormalizer, MultiNormalizer)):
                    self.target_normalizer.fit(data[self.target], data)
                else:
                    self.target_normalizer.fit(data[self.target])

            # transform target
            if isinstance(self.target_normalizer, EncoderNormalizer):
                # we approximate the scales and target transformation by assuming one
                # transformation over the entire time range but by each group
                common_init_args = [
                    name
                    for name in inspect.signature(GroupNormalizer.__init__).parameters.keys()
                    if name in inspect.signature(EncoderNormalizer.__init__).parameters.keys()
                    and name not in ["data", "self"]
                ]
                copy_kwargs = {name: getattr(self.target_normalizer, name) for name in common_init_args}
                normalizer = GroupNormalizer(groups=self.group_ids, **copy_kwargs)
                data[self.target], scales = normalizer.fit_transform(data[self.target], data, return_norm=True)

            elif isinstance(self.target_normalizer, GroupNormalizer):
                data[self.target], scales = self.target_normalizer.transform(data[self.target], data, return_norm=True)

            elif isinstance(self.target_normalizer, MultiNormalizer):
                transformed, scales = self.target_normalizer.transform(data[self.target], data, return_norm=True)

                for idx, target in enumerate(self.target_names):
                    data[target] = transformed[idx]

                    if isinstance(self.target_normalizer[idx], NaNLabelEncoder):
                        # overwrite target because it requires encoding (continuous targets should not be normalized)
                        data[f"__target__{
      
      target}"] = data[target]

            elif isinstance(self.target_normalizer, NaNLabelEncoder):
                data[self.target] = self.target_normalizer.transform(data[self.target])
                # overwrite target because it requires encoding (continuous targets should not be normalized)
                data[f"__target__{
      
      self.target}"] = data[self.target]
                scales = None

            else:
                data[self.target], scales = self.target_normalizer.transform(data[self.target], return_norm=True)

            # add target scales
            if self.add_target_scales:
                if not isinstance(self.target_normalizer, MultiNormalizer):
                    scales = [scales]
                for target_idx, target in enumerate(self.target_names):
                    if not isinstance(self.target_normalizers[target_idx], NaNLabelEncoder):
                        for scale_idx, name in enumerate(["center", "scale"]):
                            feature_name = f"{
      
      target}_{
      
      name}"
                            assert (
                                feature_name not in data.columns
                            ), f"{
      
      feature_name} is a protected column and must not be present in data"
                            data[feature_name] = scales[target_idx][:, scale_idx].squeeze()
                            if feature_name not in self.reals:
                                self.static_reals.append(feature_name)

        # rescale continuous variables apart from target
        for name in self.reals:
            if name in self.target_names or name in self.lagged_variables:
                # lagged variables are only transformed - not fitted
                continue
            elif name not in self.scalers:
                self.scalers[name] = StandardScaler().fit(data[[name]])
            elif self.scalers[name] is not None:
                try:
                    check_is_fitted(self.scalers[name])
                except NotFittedError:
                    if isinstance(self.scalers[name], GroupNormalizer):
                        self.scalers[name] = self.scalers[name].fit(data[name], data)
                    else:
                        self.scalers[name] = self.scalers[name].fit(data[[name]])

        # encode after fitting
        for name in self.reals:
            # targets are handled separately
            transformer = self.get_transformer(name)
            if (
                name not in self.target_names
                and transformer is not None
                and not isinstance(transformer, EncoderNormalizer)
            ):
                data[name] = self.transform_values(name, data[name], data=data, inverse=False)

        # encode lagged categorical targets
        for name in self.lagged_targets:
            # normalizer only now available
            if name in self.flat_categoricals:
                data[name] = self.transform_values(name, data[name], inverse=False, ignore_na=True)

        # encode constant values
        self.encoded_constant_fill_strategy = {}
        for name, value in self.constant_fill_strategy.items():
            if name in self.target_names:
                self.encoded_constant_fill_strategy[f"__target__{
      
      name}"] = value
            self.encoded_constant_fill_strategy[name] = self.transform_values(
                name, np.array([value]), data=data, inverse=False
            )[0]

        # shorten data by maximum of lagged sequences to avoid NA values - shorten only after encoding
        if self.max_lag > 0:
            # negative tail implementation as .groupby().tail(-self.max_lag) is not implemented in pandas
            g = data.groupby(self._group_ids, observed=True)
            data = g._selected_obj[g.cumcount() >= self.max_lag]
        return data

This function is the data preprocessing step of the TimeSeriesDataSet class (it is not tied to any particular forecasting model). It consists of the following steps:
Scaling continuous variables.
Adding lagged variables to the data.
Encoding the group IDs.
Encoding categorical features.
Setting aside special variables (__time_idx__, __target__<name> and __weight__).
Fitting a normalizer on the target variable.
Transforming the target variable.
Let's explain the steps in detail, one by one:

Continuous variables are scaled, categorical variables are coded, and targets and weights are set aside.

Add lags to the data: for each variable specified in the lags parameter, lagged versions of that variable are added to the data as new columns. The number and size of the lags are determined by the lags parameter, and the _get_lagged_names method creates the column names for the lagged versions. Each new column contains the original variable shifted by the corresponding lag within each group.

Encode group IDs: each group ID column specified in group_ids is encoded with its categorical encoder (by default a NaNLabelEncoder), mapping every group value to an integer and storing the result under the corresponding internal group id name.

Encode categorical features: categorical variables are encoded before group normalization so that any subsequent group normalizers operate on the encoded values. The encoding maps each category to an integer using the fitted categorical encoders (by default a NaNLabelEncoder); it is a label encoding, not one-hot encoding.

Save special variables: the special columns (__time_idx__, __target__<name> and, if a weight is given, __weight__) are created. Before that, it is checked that no column with these reserved prefixes is already present in the data. The target and, if present, the weights are copied into these columns for use in later steps.

Normalize the target variable: if a target normalizer is configured, it is fitted first and then applied to the data. With a GroupNormalizer, targets are normalized within each group. With an EncoderNormalizer, the scales are approximated by fitting a GroupNormalizer over the entire time range of each group, as the code above shows. With a NaNLabelEncoder, the (categorical) target is encoded instead of normalized.

Train a normalizer for the target variable: the normalizer is fitted on the target values (__target__<name>) so that the target can be scaled consistently. For continuous covariates other than the target, a StandardScaler is fitted by default if no scaler was specified.

Transform the target variable: the fitted normalizer is applied to the target column. If add_target_scales is set, the per-group center and scale are additionally stored as static real features named <target>_center and <target>_scale.

The input to this function is a Pandas DataFrame object, and the output is also a Pandas DataFrame object. After preprocessing the data, the output DataFrame can be used for model training.
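To make the target-normalization step concrete, here is a minimal sketch outside the class, assuming a toy dataframe with made-up column names ("group", "volume"); it mirrors the GroupNormalizer.fit_transform call used in the code above.

    import pandas as pd
    from pytorch_forecasting.data import GroupNormalizer

    # toy data: two groups with very different scales
    df = pd.DataFrame({"group": ["a", "a", "b", "b"], "volume": [1.0, 3.0, 10.0, 30.0]})

    normalizer = GroupNormalizer(groups=["group"])
    # fit per group and transform; return_norm=True also returns the per-row (center, scale)
    df["volume"], scales = normalizer.fit_transform(df["volume"], df, return_norm=True)
    # with add_target_scales=True, TimeSeriesDataSet exposes these scales as
    # "<target>_center" and "<target>_scale" static real features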

def get_transformer(self, name: str, group_id: bool = False):
    """
    Get transformer for variable.

    Args:
        name (str): variable name
        group_id (bool, optional): If the passed name refers to a group id (different encoders are used for these).
            Defaults to False.

    Returns:
        transformer
    """
    if group_id:
        name = self._group_ids_mapping[name]
    elif name in self.lagged_variables:  # recover transformer fitted on non-lagged variable
        name = self.lagged_variables[name]

    if name in self.flat_categoricals + self.group_ids + self._group_ids:
        name = self.variable_to_group_mapping.get(name, name)  # map name to encoder

        # take target normalizer if required
        if name in self.target_names:
            transformer = self.target_normalizers[self.target_names.index(name)]
        else:
            transformer = self.categorical_encoders.get(name, None)
        return transformer

    elif name in self.reals:
        # take target normalizer if required
        if name in self.target_names:
            transformer = self.target_normalizers[self.target_names.index(name)]
        else:
            transformer = self.scalers.get(name, None)
        return transformer
    else:
        return None

This code is the get_transformer method of the TimeSeriesDataSet class, used to look up the transformer (encoder, scaler or target normalizer) of a variable. The following explains it line by line:

  • def get_transformer(self, name: str, group_id: bool = False):: Define the function get_transformer, with two input parameters name and group_id, group_id is False by default.
  • if group_id:: If group_id is True, you need to map the name to the corresponding encoder.
  • name = self._group_ids_mapping[name]: Map name to the corresponding encoder.
  • elif name in self.lagged_variables:: If name is a lagged variable, the transformer fitted on the corresponding non-lagged variable is reused.
    name = self.lagged_variables[name]: Map the lagged name back to the original (non-lagged) variable name.
  • if name in self.flat_categoricals + self.group_ids + self._group_ids:: If name is one of flat categorical variables, grouping variables or group variables, you need to map name to the corresponding encoder.
    name = self.variable_to_group_mapping.get(name, name): Map name to its variable group, if it belongs to one, so that the right encoder is looked up.
  • if name in self.target_names:: If name is one of the target variables, it needs to be mapped to the target normalizer.
    transformer = self.target_normalizers[self.target_names.index(name)]: maps name to the target normalizer.
  • else:: If it is not one of the target variables, it needs to be mapped to the corresponding encoder.
  • transformer = self.categorical_encoders.get(name, None): map the name to the corresponding encoder.
  • return transformer: The transformer that returns the variable.
  • elif name in self.reals:: If name is in the self.reals list, it is a continuous variable. If it is one of the target variables (in self.target_names), the corresponding target normalizer from self.target_normalizers is returned; otherwise the scaler for the continuous variable is taken from self.scalers.
  • else: return None: If name is not a known variable of any type, None is returned.
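A hedged usage sketch (the dataset instance and the variable names "price" and "agency" are assumptions, not taken from the text above): get_transformer simply returns whichever encoder, scaler or target normalizer has been fitted for a variable.

    # assumes `dataset` is an already constructed TimeSeriesDataSet with a
    # continuous covariate "price" and a group id column "agency"
    price_scaler = dataset.get_transformer("price")                     # e.g. a StandardScaler
    agency_encoder = dataset.get_transformer("agency", group_id=True)   # e.g. a NaNLabelEncoder
    unknown = dataset.get_transformer("does_not_exist")                 # returns None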
    def transform_values(
        self,
        name: str,
        values: Union[pd.Series, torch.Tensor, np.ndarray],
        data: pd.DataFrame = None,
        inverse=False,
        group_id: bool = False,
        **kwargs,
    ) -> np.ndarray:
        """
        Scale and encode values.

        Args:
            name (str): name of variable
            values (Union[pd.Series, torch.Tensor, np.ndarray]): values to encode/scale
            data (pd.DataFrame, optional): extra data used for scaling (e.g. dataframe with groups columns).
                Defaults to None.
            inverse (bool, optional): if to conduct inverse transformation. Defaults to False.
            group_id (bool, optional): If the passed name refers to a group id (different encoders are used for these).
                Defaults to False.
            **kwargs: additional arguments for transform/inverse_transform method

        Returns:
            np.ndarray: (de/en)coded/(de)scaled values
        """
        transformer = self.get_transformer(name, group_id=group_id)
        if transformer is None:
            return values
        if inverse:
            transform = transformer.inverse_transform
        else:
            transform = transformer.transform

        if group_id:
            name = self._group_ids_mapping[name]
        # remaining categories
        if name in self.flat_categoricals + self.group_ids + self._group_ids:
            return transform(values, **kwargs)

        # reals
        elif name in self.reals:
            if isinstance(transformer, GroupNormalizer):
                return transform(values, data, **kwargs)
            elif isinstance(transformer, EncoderNormalizer):
                return transform(values, **kwargs)
            else:
                if isinstance(values, pd.Series):
                    values = values.to_frame()
                    return np.asarray(transform(values, **kwargs)).reshape(-1)
                else:
                    values = values.reshape(-1, 1)
                    return transform(values, **kwargs).reshape(-1)
        else:
            return values

This code is used to normalize and encode the input data.

The function transform_values takes six parameters: name is the variable name, values are the values to transform, data is an optional dataframe with extra columns needed for scaling (for example the group columns), inverse indicates whether to apply the inverse transformation, group_id indicates whether name refers to a group id (which uses a different encoder), and **kwargs are additional arguments passed on to the transform/inverse_transform method.

First, the normalizer or encoder of the variable is obtained via get_transformer. If the result is None, the variable needs no transformation and the original values are returned unchanged. Otherwise, depending on the inverse flag, either the transformer's transform or its inverse_transform method is selected. Categorical variables and group ids are passed to the transformer directly. Continuous variables are passed together with data when a GroupNormalizer is used; for an EncoderNormalizer the values are passed directly; for plain scalers the values are first reshaped to a single column, transformed, and then flattened back before being returned.

In short, this code implements the core logic of normalizing and encoding operations on data in the model.
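A short usage sketch (the dataset instance, the dataframe df and the column name "price" are assumptions): the same method handles both directions of the transformation.

    # assumes `dataset` is a fitted TimeSeriesDataSet and `df` the raw dataframe
    scaled = dataset.transform_values("price", df["price"], data=df, inverse=False)
    restored = dataset.transform_values("price", scaled, data=df, inverse=True)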

 def _data_to_tensors(self, data: pd.DataFrame) -> Dict[str, torch.Tensor]:
        """
        Convert data to tensors for faster access with :py:meth:`~__getitem__`.

        Args:
            data (pd.DataFrame): preprocessed data

        Returns:
            Dict[str, torch.Tensor]: dictionary of tensors for continuous, categorical data, groups, target and
                time index
        """

        index = check_for_nonfinite(
            torch.tensor(data[self._group_ids].to_numpy(np.int64), dtype=torch.int64), self.group_ids
        )
        time = check_for_nonfinite(
            torch.tensor(data["__time_idx__"].to_numpy(np.int64), dtype=torch.int64), self.time_idx
        )

        # categorical covariates
        categorical = check_for_nonfinite(
            torch.tensor(data[self.flat_categoricals].to_numpy(np.int64), dtype=torch.int64), self.flat_categoricals
        )

        # get weight
        if self.weight is not None:
            weight = check_for_nonfinite(
                torch.tensor(
                    data["__weight__"].to_numpy(dtype=np.float64),
                    dtype=torch.float,
                ),
                self.weight,
            )
        else:
            weight = None

        # get target
        if isinstance(self.target_normalizer, NaNLabelEncoder):
            target = [
                check_for_nonfinite(
                    torch.tensor(data[f"__target__{
      
      self.target}"].to_numpy(dtype=np.int64), dtype=torch.long),
                    self.target,
                )
            ]
        else:
            if not isinstance(self.target, str):  # multi-target
                target = [
                    check_for_nonfinite(
                        torch.tensor(
                            data[f"__target__{
      
      name}"].to_numpy(
                                dtype=[np.float64, np.int64][data[name].dtype.kind in "bi"]
                            ),
                            dtype=[torch.float, torch.long][data[name].dtype.kind in "bi"],
                        ),
                        name,
                    )
                    for name in self.target_names
                ]
            else:
                target = [
                    check_for_nonfinite(
                        torch.tensor(data[f"__target__{
      
      self.target}"].to_numpy(dtype=np.float64), dtype=torch.float),
                        self.target,
                    )
                ]

        # continuous covariates
        continuous = check_for_nonfinite(
            torch.tensor(data[self.reals].to_numpy(dtype=np.float64), dtype=torch.float), self.reals
        )

        tensors = dict(
            reals=continuous, categoricals=categorical, groups=index, target=target, weight=weight, time=time
        )

        return tensors

This code implements converting preprocessed data into tensors for faster data access during model training and prediction. Its input is a preprocessed Pandas DataFrame, which contains all the input variables required by the model, such as categorical variables, continuous variables, group IDs, timestamps, target variables, and sample weights. The function converts the data into the corresponding PyTorch tensor (tensor), stores it in a dictionary and returns it, each key value corresponds to a type of input variable.

Specifically, the function first converts the group IDs, the time index, and the categorical variables into integer (int64) tensors, checking them for non-finite values to ensure data correctness. If sample weights are present, they are converted to a float tensor. The target is converted to a list with one tensor per target variable; depending on the normalizer and the column dtype, each becomes a long (categorical) or float tensor. Finally, the continuous variables are converted to a float tensor. All tensors are stored in a dictionary whose keys name the type of input variable, for easier use during training and prediction.
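As a minimal illustration of the conversion itself (outside the class, with a made-up toy dataframe), the same pattern looks like this:

    import numpy as np
    import pandas as pd
    import torch

    df = pd.DataFrame({"group": [0, 0, 1], "time_idx": [0, 1, 0], "value": [1.0, 2.0, 3.0]})

    groups = torch.tensor(df[["group"]].to_numpy(np.int64), dtype=torch.int64)  # (n_rows, n_group_ids)
    time = torch.tensor(df["time_idx"].to_numpy(np.int64), dtype=torch.int64)   # (n_rows,)
    target = [torch.tensor(df["value"].to_numpy(dtype=np.float64), dtype=torch.float)]  # one tensor per target

    tensors = dict(groups=groups, time=time, target=target)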

@property
    def categoricals(self) -> List[str]:
        """
        Categorical variables as used for modelling.

        Returns:
            List[str]: list of variables
        """
        return self.static_categoricals + self.time_varying_known_categoricals + self.time_varying_unknown_categoricals

    @property
    def flat_categoricals(self) -> List[str]:
        """
        Categorical variables as defined in input data.

        Returns:
            List[str]: list of variables
        """
        categories = []
        for name in self.categoricals:
            if name in self.variable_groups:
                categories.extend(self.variable_groups[name])
            else:
                categories.append(name)
        return categories

    @property
    def variable_to_group_mapping(self) -> Dict[str, str]:
        """
        Mapping from categorical variables to variables in input data.

        Returns:
            Dict[str, str]: dictionary mapping from :py:meth:`~categorical` to :py:meth:`~flat_categoricals`.
        """
        groups = {}
        for group_name, sublist in self.variable_groups.items():
            groups.update({name: group_name for name in sublist})
        return groups

    @property
    def reals(self) -> List[str]:
        """
        Continuous variables as used for modelling.

        Returns:
            List[str]: list of variables
        """
        return self.static_reals + self.time_varying_known_reals + self.time_varying_unknown_reals

    @property
    @lru_cache(None)
    def target_names(self) -> List[str]:
        """
        List of targets.

        Returns:
            List[str]: list of targets
        """
        if self.multi_target:
            return self.target
        else:
            return [self.target]

    @property
    def multi_target(self) -> bool:
        """
        If dataset encodes one or multiple targets.

        Returns:
            bool: true if multiple targets
        """
        return isinstance(self.target, (list, tuple))

    @property
    def target_normalizers(self) -> List[TorchNormalizer]:
        """
        List of target normalizers aligned with ``target_names``.

        Returns:
            List[TorchNormalizer]: list of target normalizers
        """
        if isinstance(self.target_normalizer, MultiNormalizer):
            target_normalizers = self.target_normalizer.normalizers
        else:
            target_normalizers = [self.target_normalizer]
        return target_normalizers

    def get_parameters(self) -> Dict[str, Any]:
        """
        Get parameters that can be used with :py:meth:`~from_parameters` to create a new dataset with the same scalers.

        Returns:
            Dict[str, Any]: dictionary of parameters
        """
        kwargs = {
            name: getattr(self, name)
            for name in inspect.signature(self.__class__.__init__).parameters.keys()
            if name not in ["data", "self"]
        }
        kwargs["categorical_encoders"] = self.categorical_encoders
        kwargs["scalers"] = self.scalers
        return kwargs

    @classmethod
    def from_dataset(
        cls, dataset, data: pd.DataFrame, stop_randomization: bool = False, predict: bool = False, **update_kwargs
    ):
        """
        Generate dataset with different underlying data but same variable encoders and scalers, etc.

        Calls :py:meth:`~from_parameters` under the hood.

        Args:
            dataset (TimeSeriesDataSet): dataset from which to copy parameters
            data (pd.DataFrame): data from which new dataset will be generated
            stop_randomization (bool, optional): If to stop randomizing encoder and decoder lengths,
                e.g. useful for validation set. Defaults to False.
            predict (bool, optional): If to predict the decoder length on the last entries in the
                time index (i.e. one prediction per group only). Defaults to False.
            **kwargs: keyword arguments overriding parameters in the original dataset

        Returns:
            TimeSeriesDataSet: new dataset
        """
        return cls.from_parameters(
            dataset.get_parameters(), data, stop_randomization=stop_randomization, predict=predict, **update_kwargs
        )

categoricals attribute: Get a list of all categorical variables used for modeling, including static categorical variables, categorical variables with known temporal changes, and categorical variables with unknown temporal changes.

flat_categoricals attribute: Gets the list of categorical columns as they appear in the input data, expanding each variable group into its member columns.

variable_to_group_mapping attribute: Get a mapping dictionary from model categorical variables to input data categorical variables.

reals attribute: Get a list of all continuous variables used for modeling, including static continuous variables, time-varying continuous variables known in the future, and time-varying continuous variables unknown in the future.

target_names attribute: Get a list of all target variables, if the dataset has only one target, convert it to a list.

multi_target attribute: Check if the dataset contains multiple target variables.

target_normalizers attribute: Get a list of target variable normalizers, corresponding to target_names.

get_parameters method: Get the parameters of the current dataset in order to create a new dataset with the same scaler using the from_parameters method.

from_dataset class method: create a new dataset with different underlying data but the same variable encoders and scalers etc. This method internally calls the from_parameters method. The new dataset can be customized by passing stop_randomization and predict parameters, as well as keyword arguments to override those in the original dataset.
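A typical usage sketch (the names training and data are assumptions): from_dataset is the standard way to build validation or test datasets that reuse the encoders and scalers fitted on the training set.

    # one prediction per series, using the last entries of the time index as decoder
    validation = TimeSeriesDataSet.from_dataset(training, data, predict=True, stop_randomization=True)
    val_dataloader = validation.to_dataloader(train=False, batch_size=64)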

  @classmethod
    def from_parameters(
        cls,
        parameters: Dict[str, Any],
        data: pd.DataFrame,
        stop_randomization: bool = None,
        predict: bool = False,
        **update_kwargs,
    ):
        """
        Generate dataset with different underlying data but same variable encoders and scalers, etc.

        Args:
            parameters (Dict[str, Any]): dataset parameters which to use for the new dataset
            data (pd.DataFrame): data from which new dataset will be generated
            stop_randomization (bool, optional): If to stop randomizing encoder and decoder lengths,
                e.g. useful for validation set. Defaults to False.
            predict (bool, optional): If to predict the decoder length on the last entries in the
                time index (i.e. one prediction per group only). Defaults to False.
            **kwargs: keyword arguments overriding parameters

        Returns:
            TimeSeriesDataSet: new dataset
        """
        parameters = deepcopy(parameters)
        if predict:
            if stop_randomization is None:
                stop_randomization = True
            elif not stop_randomization:
                warnings.warn(
                    "If predicting, no randomization should be possible - setting stop_randomization=True", UserWarning
                )
                stop_randomization = True
            parameters["min_prediction_length"] = parameters["max_prediction_length"]
            parameters["predict_mode"] = True
        elif stop_randomization is None:
            stop_randomization = False

        if stop_randomization:
            parameters["randomize_length"] = None
        parameters.update(update_kwargs)

        new = cls(data, **parameters)
        return new

This code defines the class method from_parameters, which generates a dataset on different underlying data while reusing the same variable encoders, scalers, and other parameters.

This method has several parameters:

parameters: A dictionary containing the parameters used to generate the new dataset.
data: A Pandas DataFrame used to generate new datasets.
stop_randomization: A boolean indicating whether to stop randomizing the encoder and decoder lengths. The default is None.
predict: A boolean indicating whether to predict the decoder length on the last entry of the time index (i.e. predict only once per group). The default is False.
**update_kwargs: Variable number of keyword arguments to override values in parameters.
The main steps of this method are:

Deep-copy the passed parameters dictionary.
If predict is True: set stop_randomization to True if it is None; if it was explicitly set to False, issue a warning and force it to True. Also set "min_prediction_length" to "max_prediction_length" and "predict_mode" to True.
Otherwise, if stop_randomization is None, set it to False.
If stop_randomization is True, set the parameter "randomize_length" to None.
Apply update_kwargs, create a new TimeSeriesDataSet with the updated parameters and the passed-in data, and return it.
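A sketch of calling from_parameters directly (the names params, training and new_data are assumptions), for example when dataset parameters have been persisted and reloaded:

    params = training.get_parameters()   # could be pickled to disk and loaded later
    inference = TimeSeriesDataSet.from_parameters(params, new_data, predict=True, stop_randomization=True)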

    def _construct_index(self, data: pd.DataFrame, predict_mode: bool) -> pd.DataFrame:
        """
        Create index of samples.

        Args:
            data (pd.DataFrame): preprocessed data
            predict_mode (bool): if to create one sample per group with prediction length equal to ``max_decoder_length``

        Returns:
            pd.DataFrame: index dataframe for timesteps and index dataframe for groups.
                It contains a list of all possible subsequences.
        """
        g = data.groupby(self._group_ids, observed=True)

        df_index_first = g["__time_idx__"].transform("nth", 0).to_frame("time_first")
        df_index_last = g["__time_idx__"].transform("nth", -1).to_frame("time_last")
        df_index_diff_to_next = -g["__time_idx__"].diff(-1).fillna(-1).astype(int).to_frame("time_diff_to_next")
        df_index = pd.concat([df_index_first, df_index_last, df_index_diff_to_next], axis=1)
        df_index["index_start"] = np.arange(len(df_index))
        df_index["time"] = data["__time_idx__"]
        df_index["count"] = (df_index["time_last"] - df_index["time_first"]).astype(int) + 1
        sequence_ids = g.ngroup()
        df_index["sequence_id"] = sequence_ids

        min_sequence_length = self.min_prediction_length + self.min_encoder_length
        max_sequence_length = self.max_prediction_length + self.max_encoder_length

        # calculate maximum index to include from current index_start
        max_time = (df_index["time"] + max_sequence_length - 1).clip(upper=df_index["count"] + df_index.time_first - 1)

        # if there are missing timesteps, we cannot say directly what is the last timestep to include
        # therefore we iterate until it is found
        if (df_index["time_diff_to_next"] != 1).any():
            assert (
                self.allow_missing_timesteps
            ), "Time difference between steps has been idenfied as larger than 1 - set allow_missing_timesteps=True"

        df_index["index_end"], missing_sequences = _find_end_indices(
            diffs=df_index.time_diff_to_next.to_numpy(),
            max_lengths=(max_time - df_index.time).to_numpy() + 1,
            min_length=min_sequence_length,
        )
        # add duplicates but mostly with shorter sequence length for start of timeseries
        # while the previous steps have ensured that we start a sequence on every time step, the missing_sequences
        # ensure that there is a sequence that finishes on every timestep
        if len(missing_sequences) > 0:
            shortened_sequences = df_index.iloc[missing_sequences[:, 0]].assign(index_end=missing_sequences[:, 1])

            # concatenate shortened sequences
            df_index = pd.concat([df_index, shortened_sequences], axis=0, ignore_index=True)

        # filter out where encode and decode length are not satisfied
        df_index["sequence_length"] = df_index["time"].iloc[df_index["index_end"]].to_numpy() - df_index["time"] + 1

        # filter too short sequences
        df_index = df_index[
            # sequence must be at least of minimal prediction length
            lambda x: (x.sequence_length >= min_sequence_length)
            &
            # prediction must be for after minimal prediction index + length of prediction
            (x["sequence_length"] + x["time"] >= self.min_prediction_idx + self.min_prediction_length)
        ]

        if predict_mode:  # keep longest element per series (i.e. the first element that spans to the end of the series)
            # filter all elements that are longer than the allowed maximum sequence length
            df_index = df_index[
                lambda x: (x["time_last"] - x["time"] + 1 <= max_sequence_length)
                & (x["sequence_length"] >= min_sequence_length)
            ]
            # choose longest sequence
            df_index = df_index.loc[df_index.groupby("sequence_id").sequence_length.idxmax()]

        # check that all groups/series have at least one entry in the index
        if not sequence_ids.isin(df_index.sequence_id).all():
            missing_groups = data.loc[~sequence_ids.isin(df_index.sequence_id), self._group_ids].drop_duplicates()
            # decode values
            for name, id in self._group_ids_mapping.items():
                missing_groups[id] = self.transform_values(name, missing_groups[id], inverse=True, group_id=True)
            warnings.warn(
                "Min encoder length and/or min_prediction_idx and/or min prediction length and/or lags are "
                "too large for "
                f"{
      
      len(missing_groups)} series/groups which therefore are not present in the dataset index. "
                "This means no predictions can be made for those series. "
                f"First 10 removed groups: {
      
      list(missing_groups.iloc[:10].to_dict(orient='index').values())}",
                UserWarning,
            )
        assert (
            len(df_index) > 0
        ), "filters should not remove entries all entries - check encoder/decoder lengths and lags"

        return df_index

This function builds the index of samples that the dataset uses to generate sequence predictions. It takes a pandas DataFrame and a boolean predict_mode and returns a DataFrame that lists all possible subsequences, together with the group (sequence_id) each subsequence belongs to. Specifically, this function implements the following steps:

1. Group the data according to the group identifier in the data.

2. Using the first and last timestep of the grouped data, and the time difference between adjacent timesteps, build a DataFrame with the index start, index end, time, and count columns.

3. Calculate the minimum and maximum length of the sequence and calculate the maximum index that should be included starting from the current index.

4. If there are missing timesteps, the last index to include in each subsequence cannot be computed directly, so it is determined iteratively (via _find_end_indices); this requires allow_missing_timesteps=True.

5. Add shortened duplicate entries for the start of each series, so that some subsequence ends on every timestep.

6. Filter sequences of insufficient length.

7. If predict_mode is True, only keep the longest element in each sequence.

8. Make sure all groups are included in the index.

9. Return the indexed data frame.
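The windowing idea can be illustrated with a toy sketch (this is not the library code, just the principle for one group without missing timesteps):

    import numpy as np

    time_idx = np.arange(6)                      # time indices 0..5 of one series
    min_sequence_length, max_sequence_length = 3, 4

    # enumerate candidate subsequences: each window is capped at max_sequence_length
    # and kept only if it is at least min_sequence_length long
    windows = [
        (start, min(start + max_sequence_length, len(time_idx)))
        for start in time_idx
        if min(start + max_sequence_length, len(time_idx)) - start >= min_sequence_length
    ]
    # windows -> [(0, 4), (1, 5), (2, 6), (3, 6)]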

    def filter(self, filter_func: Callable, copy: bool = True) -> "TimeSeriesDataSet":
        """
        Filter subsequences in dataset.

        Uses interpretable version of index :py:meth:`~decoded_index`
        to filter subsequences in dataset.

        Args:
            filter_func (Callable): function to filter. Should take :py:meth:`~decoded_index`
                dataframe as only argument which contains group ids and time index columns.
            copy (bool): if to return copy of dataset or filter inplace.

        Returns:
            TimeSeriesDataSet: filtered dataset
        """
        # calculate filter
        filtered_index = self.index[np.asarray(filter_func(self.decoded_index))]
        # raise error if filter removes all entries
        if len(filtered_index) == 0:
            raise ValueError("After applying filter no sub-sequences left in dataset")
        if copy:
            dataset = _copy(self)
            dataset.index = filtered_index
            return dataset
        else:
            self.index = filtered_index
            return self

    @property
    def decoded_index(self) -> pd.DataFrame:
        """
        Get interpretable version of index.

        DataFrame contains
        - group_id columns in original encoding
        - time_idx_first column: first time index of subsequence
        - time_idx_last columns: last time index of subsequence
        - time_idx_first_prediction columns: first time index which is in decoder

        Returns:
            pd.DataFrame: index that can be understood in terms of original data
        """
        # get dataframe to filter
        index_start = self.index["index_start"].to_numpy()
        index_last = self.index["index_end"].to_numpy()
        index = (
            # get group ids in order of index
            pd.DataFrame(self.data["groups"][index_start].numpy(), columns=self.group_ids)
            # to original values
            .apply(lambda x: self.transform_values(name=x.name, values=x, group_id=True, inverse=True))
            # add time index
            .assign(
                time_idx_first=self.data["time"][index_start].numpy(),
                time_idx_last=self.data["time"][index_last].numpy(),
                # prediction index is last time index - decoder length + 1
                time_idx_first_prediction=lambda x: x.time_idx_last
                - self.calculate_decoder_length(
                    time_last=x.time_idx_last, sequence_length=x.time_idx_last - x.time_idx_first + 1
                )
                + 1,
            )
        )
        return index

This code is the filter method of the TimeSeriesDataSet class, used to filter subsequences in a time series dataset. The method accepts two parameters: filter_func and copy. filter_func is a callable that receives the decoded_index DataFrame (with group id and time index columns) as its only argument; copy is a boolean specifying whether to return a copy of the dataset or modify it in place.

The return value of the method is a filtered time series dataset object. If copy is True, returns a copy of the dataset, otherwise modifies the original dataset. If no subsequences remain after filtering, a ValueError is raised.

This class also defines a decoded_index attribute method, which is used to get the interpreted index, which contains the group ID, the start and end time indexes of the subsequence, and the first time index of the decoder. Specifically, this method converts the index of the dataset into a form that can be understood as raw data. The return value of this method is a DataFrame object.
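A usage sketch (the group id column "agency", its value and the time cutoff are assumptions): filter_func receives the decoded_index dataframe and must return a boolean mask.

    # keep only the subsequences of one series
    subset = dataset.filter(lambda index: index["agency"] == "Agency_01")

    # keep only subsequences whose first predicted step is at or after time index 100
    late = dataset.filter(lambda index: index["time_idx_first_prediction"] >= 100)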

    def plot_randomization(
        self, betas: Tuple[float, float] = None, length: int = None, min_length: int = None
    ) -> Tuple[plt.Figure, torch.Tensor]:
        """
        Plot expected randomized length distribution.

        Args:
            betas (Tuple[float, float], optional): Tuple of betas, e.g. ``(0.2, 0.05)`` to use for randomization.
                Defaults to ``randomize_length`` of dataset.
            length (int, optional): maximum encoder length. Defaults to ``max_encoder_length``.
            min_length (int, optional): minimum encoder length. Defaults to ``min_encoder_length``.

        Returns:
            Tuple[plt.Figure, torch.Tensor]: tuple of figure and histogram based on 1000 samples
        """
        if betas is None:
            betas = self.randomize_length
        if length is None:
            length = self.max_encoder_length
        if min_length is None:
            min_length = self.min_encoder_length
        probabilities = Beta(betas[0], betas[1]).sample((1000,))

        lengths = ((length - min_length) * probabilities).round() + min_length

        fig, ax = plt.subplots()
        ax.hist(lengths)
        return fig, lengths

    def __len__(self) -> int:
        """
        Length of dataset.

        Returns:
            int: length
        """
        return self.index.shape[0]

    def set_overwrite_values(
        self, values: Union[float, torch.Tensor], variable: str, target: Union[str, slice] = "decoder"
    ) -> None:
        """
        Convenience method to quickly overwrite values in decoder or encoder (or both) for a specific variable.

        Args:
            values (Union[float, torch.Tensor]): values to use for overwrite.
            variable (str): variable whose values should be overwritten.
            target (Union[str, slice], optional): positions to overwrite. One of "decoder", "encoder" or "all" or
                a slice object which is directly used to overwrite indices, e.g. ``slice(-5, None)`` will overwrite
                the last 5 values. Defaults to "decoder".
        """
        values = torch.tensor(self.transform_values(variable, np.asarray(values).reshape(-1), inverse=False)).squeeze()
        assert target in [
            "all",
            "decoder",
            "encoder",
        ], f"target has be one of 'all', 'decoder' or 'encoder' but target={
      
      target} instead"

        if variable in self.static_categoricals or variable in self.static_reals:
            target = "all"

        if variable in self.target_names:
            raise NotImplementedError("Target variable is not supported")
        if self.weight is not None and self.weight == variable:
            raise NotImplementedError("Weight variable is not supported")
        if isinstance(self.scalers.get(variable, self.categorical_encoders.get(variable)), TorchNormalizer):
            raise NotImplementedError("TorchNormalizer (e.g. GroupNormalizer) is not supported")

        if self._overwrite_values is None:
            self._overwrite_values = {}
        self._overwrite_values.update(dict(values=values, variable=variable, target=target))

    def reset_overwrite_values(self) -> None:
        """
        Reset values used to override sample features.
        """
        self._overwrite_values = None

    def calculate_decoder_length(
        self,
        time_last: Union[int, pd.Series, np.ndarray],
        sequence_length: Union[int, pd.Series, np.ndarray],
    ) -> Union[int, pd.Series, np.ndarray]:
        """
        Calculate length of decoder.

        Args:
            time_last (Union[int, pd.Series, np.ndarray]): last time index of the sequence
            sequence_length (Union[int, pd.Series, np.ndarray]): total length of the sequence

        Returns:
            Union[int, pd.Series, np.ndarray]: decoder length(s)
        """
        if isinstance(time_last, int):
            decoder_length = min(
                time_last - (self.min_prediction_idx - 1),  # not going beyond min prediction idx
                self.max_prediction_length,  # maximum prediction length
                sequence_length - self.min_encoder_length,  # sequence length - min decoder length
            )
        else:
            decoder_length = np.min(
                [
                    time_last - (self.min_prediction_idx - 1),
                    sequence_length - self.min_encoder_length,
                ],
                axis=0,
            ).clip(max=self.max_prediction_length)
        return decoder_length

These methods belong to the TimeSeriesDataSet class. The following describes each of them in detail:

plot_randomization: This method is used to draw a histogram of expected random length distribution. A range of randomization lengths can be specified by passing the betas parameter. By default, the randomize_length value of the dataset will be used as the betas parameter. The length and min_length parameters specify the maximum and minimum length of the encoder sequence, respectively. The method returns a tuple including the plotted graph and the histogram based on 1000 samples.

__len__: This method returns the number of subsequences (samples) in the dataset.

set_overwrite_values: This method is used to quickly overwrite the values of a specific variable in the encoder or decoder (or both). The values parameter specifies the values to use, the variable parameter specifies which variable to overwrite, and the target parameter specifies which positions to overwrite: one of "decoder", "encoder" or "all", or a slice object that is applied directly to the indices. The method returns nothing but stores the overwrite configuration in the dataset object.

reset_overwrite_values: This method resets (clears) the values used to overwrite sample features.

calculate_decoder_length: This method is used to calculate the length of the decoder. It calculates the length of the decoder based on the last time index, the sequence length, and the minimum and maximum decoder lengths of the dataset. What is returned is an integer or array representing the length of the decoder.
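A usage sketch of the overwrite mechanism (the dataset instance and the variable name "price" are assumptions; the variable must not use a TorchNormalizer scaler):

    # pretend the future price is a constant 1.0 in the decoder part of every sample
    dataset.set_overwrite_values(values=1.0, variable="price", target="decoder")
    x, y = dataset[0]                   # "price" is overwritten at the decoder positions of x_cont
    dataset.reset_overwrite_values()    # restore the original values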

    def __getitem__(self, idx: int) -> Tuple[Dict[str, torch.Tensor], torch.Tensor]:
        """
        Get sample for model

        Args:
            idx (int): index of prediction (between ``0`` and ``len(dataset) - 1``)

        Returns:
            Tuple[Dict[str, torch.Tensor], torch.Tensor]: x and y for model
        """
        index = self.index.iloc[idx]
        # get index data
        data_cont = self.data["reals"][index.index_start : index.index_end + 1].clone()
        data_cat = self.data["categoricals"][index.index_start : index.index_end + 1].clone()
        time = self.data["time"][index.index_start : index.index_end + 1].clone()
        target = [d[index.index_start : index.index_end + 1].clone() for d in self.data["target"]]
        groups = self.data["groups"][index.index_start].clone()
        if self.data["weight"] is None:
            weight = None
        else:
            weight = self.data["weight"][index.index_start : index.index_end + 1].clone()
        # get target scale in the form of a list
        target_scale = self.target_normalizer.get_parameters(groups, self.group_ids)
        if not isinstance(self.target_normalizer, MultiNormalizer):
            target_scale = [target_scale]

        # fill in missing values (if not all time indices are specified)
        sequence_length = len(time)
        if sequence_length < index.sequence_length:
            assert self.allow_missing_timesteps, "allow_missing_timesteps should be True if sequences have gaps"
            repetitions = torch.cat([time[1:] - time[:-1], torch.ones(1, dtype=time.dtype)])
            indices = torch.repeat_interleave(torch.arange(len(time)), repetitions)
            repetition_indices = torch.cat([torch.tensor([False], dtype=torch.bool), indices[1:] == indices[:-1]])

            # select data
            data_cat = data_cat[indices]
            data_cont = data_cont[indices]
            target = [d[indices] for d in target]
            if weight is not None:
                weight = weight[indices]

            # reset index
            if self.time_idx in self.reals:
                time_idx = self.reals.index(self.time_idx)
                data_cont[:, time_idx] = torch.linspace(
                    data_cont[0, time_idx], data_cont[-1, time_idx], len(target[0]), dtype=data_cont.dtype
                )

            # make replacements to fill in categories
            for name, value in self.encoded_constant_fill_strategy.items():
                if name in self.reals:
                    data_cont[repetition_indices, self.reals.index(name)] = value
                elif name in [f"__target__{
      
      target_name}" for target_name in self.target_names]:
                    target_pos = self.target_names.index(name[len("__target__") :])
                    target[target_pos][repetition_indices] = value
                elif name in self.flat_categoricals:
                    data_cat[repetition_indices, self.flat_categoricals.index(name)] = value
                elif name in self.target_names:  # target is just not an input value
                    pass
                else:
                    raise KeyError(f"Variable {
      
      name} is not known and thus cannot be filled in")

            sequence_length = len(target[0])

        # determine data window
        assert (
            sequence_length >= self.min_prediction_length
        ), "Sequence length should be at least minimum prediction length"
        # determine prediction/decode length and encode length
        decoder_length = self.calculate_decoder_length(time[-1], sequence_length)
        encoder_length = sequence_length - decoder_length
        assert (
            decoder_length >= self.min_prediction_length
        ), "Decoder length should be at least minimum prediction length"
        assert encoder_length >= self.min_encoder_length, "Encoder length should be at least minimum encoder length"

        if self.randomize_length is not None:  # randomization improves generalization
            # modify encode and decode lengths
            modifiable_encoder_length = encoder_length - self.min_encoder_length
            encoder_length_probability = Beta(self.randomize_length[0], self.randomize_length[1]).sample()

            # subsample a new/smaller encode length
            new_encoder_length = self.min_encoder_length + int(
                (modifiable_encoder_length * encoder_length_probability).round()
            )

            # extend decode length if possible
            new_decoder_length = min(decoder_length + (encoder_length - new_encoder_length), self.max_prediction_length)

            # select subset of sequence of new sequence
            if new_encoder_length + new_decoder_length < len(target[0]):
                data_cat = data_cat[encoder_length - new_encoder_length : encoder_length + new_decoder_length]
                data_cont = data_cont[encoder_length - new_encoder_length : encoder_length + new_decoder_length]
                target = [t[encoder_length - new_encoder_length : encoder_length + new_decoder_length] for t in target]
                encoder_length = new_encoder_length
                decoder_length = new_decoder_length

            # switch some variables to nan if encode length is 0
            if encoder_length == 0 and len(self.dropout_categoricals) > 0:
                data_cat[
                    :, [self.flat_categoricals.index(c) for c in self.dropout_categoricals]
                ] = 0  # zero is encoded nan

        assert decoder_length > 0, "Decoder length should be greater than 0"
        assert encoder_length >= 0, "Encoder length should be at least 0"

        if self.add_relative_time_idx:
            data_cont[:, self.reals.index("relative_time_idx")] = (
                torch.arange(-encoder_length, decoder_length, dtype=data_cont.dtype) / self.max_encoder_length
            )

        if self.add_encoder_length:
            data_cont[:, self.reals.index("encoder_length")] = (
                (encoder_length - 0.5 * self.max_encoder_length) / self.max_encoder_length * 2.0
            )

        # rescale target
        for idx, target_normalizer in enumerate(self.target_normalizers):
            if isinstance(target_normalizer, EncoderNormalizer):
                target_name = self.target_names[idx]
                # fit and transform
                target_normalizer.fit(target[idx][:encoder_length])
                # get new scale
                single_target_scale = target_normalizer.get_parameters()
                # modify input data
                if target_name in self.reals:
                    data_cont[:, self.reals.index(target_name)] = target_normalizer.transform(target[idx])
                if self.add_target_scales:
                    data_cont[:, self.reals.index(f"{
      
      target_name}_center")] = self.transform_values(
                        f"{
      
      target_name}_center", single_target_scale[0]
                    )[0]
                    data_cont[:, self.reals.index(f"{
      
      target_name}_scale")] = self.transform_values(
                        f"{
      
      target_name}_scale", single_target_scale[1]
                    )[0]
                # scale needs to be numpy to be consistent with GroupNormalizer
                target_scale[idx] = single_target_scale.numpy()

        # rescale covariates
        for name in self.reals:
            if name not in self.target_names and name not in self.lagged_variables:
                normalizer = self.get_transformer(name)
                if isinstance(normalizer, EncoderNormalizer):
                    # fit and transform
                    pos = self.reals.index(name)
                    normalizer.fit(data_cont[:encoder_length, pos])
                    # transform
                    data_cont[:, pos] = normalizer.transform(data_cont[:, pos])

        # also normalize lagged variables
        for name in self.reals:
            if name in self.lagged_variables:
                normalizer = self.get_transformer(name)
                if isinstance(normalizer, EncoderNormalizer):
                    pos = self.reals.index(name)
                    data_cont[:, pos] = normalizer.transform(data_cont[:, pos])

        # overwrite values
        if self._overwrite_values is not None:
            if isinstance(self._overwrite_values["target"], slice):
                positions = self._overwrite_values["target"]
            elif self._overwrite_values["target"] == "all":
                positions = slice(None)
            elif self._overwrite_values["target"] == "encoder":
                positions = slice(None, encoder_length)
            else:  # decoder
                positions = slice(encoder_length, None)

            if self._overwrite_values["variable"] in self.reals:
                idx = self.reals.index(self._overwrite_values["variable"])
                data_cont[positions, idx] = self._overwrite_values["values"]
            else:
                assert (
                    self._overwrite_values["variable"] in self.flat_categoricals
                ), "overwrite values variable has to be either in real or categorical variables"
                idx = self.flat_categoricals.index(self._overwrite_values["variable"])
                data_cat[positions, idx] = self._overwrite_values["values"]

        # weight is only required for decoder
        if weight is not None:
            weight = weight[encoder_length:]

        # if user defined target as list, output should be list, otherwise tensor
        if self.multi_target:
            encoder_target = [t[:encoder_length] for t in target]
            target = [t[encoder_length:] for t in target]
        else:
            encoder_target = target[0][:encoder_length]
            target = target[0][encoder_length:]
            target_scale = target_scale[0]

        return (
            dict(
                x_cat=data_cat,
                x_cont=data_cont,
                encoder_length=encoder_length,
                decoder_length=decoder_length,
                encoder_target=encoder_target,
                encoder_time_idx_start=time[0],
                groups=groups,
                target_scale=target_scale,
            ),
            (target, weight),
        )

This is the __getitem__ method, which is used to access a single sample of the dataset by index.

It accepts an integer idx, the index of a sample (subsequence) in the dataset, prepares the corresponding data, and returns a tuple of two elements. The first element is a dictionary x containing all input variables required by the model; the second element is y, a tuple of the decoder target(s) and the (optional) weights.

Specifically, the method first looks up the index entry of the sample. It then slices the continuous data data_cont, the categorical data data_cat, the time index time, the target(s) target, the group ids groups and, if present, the weights weight from the stored tensors. The target scale target_scale is obtained from the target normalizer; for a single target it is wrapped in a list (a MultiNormalizer already returns one scale per target).

Next, missing timesteps are handled. If the subsequence has gaps and allow_missing_timesteps is True, the missing steps are filled by repeating the previous observation; if the time index is among the continuous features it is re-interpolated linearly over the sequence, and variables with a constant_fill_strategy are set to their encoded constant value at the repeated positions.

Finally, the encoder and decoder lengths are computed and, if randomize_length is set, the encoder length is randomly shortened (and the decoder possibly extended). Targets and covariates with an EncoderNormalizer are fitted and transformed on the encoder part, any configured overwrite values are applied, and the encoder and decoder parts are returned as the model's input and output.
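A sketch of pulling one sample directly (the shapes in the comments assume a single target):

    x, (y, weight) = dataset[0]

    x["x_cont"].shape        # (sequence_length, number of reals)
    x["x_cat"].shape         # (sequence_length, number of categoricals)
    x["encoder_length"]      # length of the encoder part
    x["decoder_length"]      # length of the decoder part
    y.shape                  # (decoder_length,) for a single target
    weight                   # None unless a weight column was configured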

 @staticmethod
    def _collate_fn(
        batches: List[Tuple[Dict[str, torch.Tensor], torch.Tensor]]
    ) -> Tuple[Dict[str, torch.Tensor], torch.Tensor]:
        """
        Collate function to combine items into mini-batch for dataloader.

        Args:
            batches (List[Tuple[Dict[str, torch.Tensor], torch.Tensor]]): List of samples generated with
                :py:meth:`~__getitem__`.

        Returns:
            Tuple[Dict[str, torch.Tensor], Tuple[Union[torch.Tensor, List[torch.Tensor]], torch.Tensor]: minibatch
        """
        # collate function for dataloader
        # lengths
        encoder_lengths = torch.tensor([batch[0]["encoder_length"] for batch in batches], dtype=torch.long)
        decoder_lengths = torch.tensor([batch[0]["decoder_length"] for batch in batches], dtype=torch.long)

        # ids
        decoder_time_idx_start = (
            torch.tensor([batch[0]["encoder_time_idx_start"] for batch in batches], dtype=torch.long) + encoder_lengths
        )
        decoder_time_idx = decoder_time_idx_start.unsqueeze(1) + torch.arange(decoder_lengths.max()).unsqueeze(0)
        groups = torch.stack([batch[0]["groups"] for batch in batches])

        # features
        encoder_cont = rnn.pad_sequence(
            [batch[0]["x_cont"][:length] for length, batch in zip(encoder_lengths, batches)], batch_first=True
        )
        encoder_cat = rnn.pad_sequence(
            [batch[0]["x_cat"][:length] for length, batch in zip(encoder_lengths, batches)], batch_first=True
        )

        decoder_cont = rnn.pad_sequence(
            [batch[0]["x_cont"][length:] for length, batch in zip(encoder_lengths, batches)], batch_first=True
        )
        decoder_cat = rnn.pad_sequence(
            [batch[0]["x_cat"][length:] for length, batch in zip(encoder_lengths, batches)], batch_first=True
        )

        # target scale
        if isinstance(batches[0][0]["target_scale"], torch.Tensor):  # stack tensor
            target_scale = torch.stack([batch[0]["target_scale"] for batch in batches])
        elif isinstance(batches[0][0]["target_scale"], (list, tuple)):
            target_scale = []
            for idx in range(len(batches[0][0]["target_scale"])):
                if isinstance(batches[0][0]["target_scale"][idx], torch.Tensor):  # stack tensor
                    scale = torch.stack([batch[0]["target_scale"][idx] for batch in batches])
                else:
                    scale = torch.from_numpy(
                        np.array([batch[0]["target_scale"][idx] for batch in batches], dtype=np.float32),
                    )
                target_scale.append(scale)
        else:  # convert to tensor
            target_scale = torch.from_numpy(
                np.array([batch[0]["target_scale"] for batch in batches], dtype=np.float32),
            )

        # target and weight
        if isinstance(batches[0][1][0], (tuple, list)):
            target = [
                rnn.pad_sequence([batch[1][0][idx] for batch in batches], batch_first=True)
                for idx in range(len(batches[0][1][0]))
            ]
            encoder_target = [
                rnn.pad_sequence([batch[0]["encoder_target"][idx] for batch in batches], batch_first=True)
                for idx in range(len(batches[0][1][0]))
            ]
        else:
            target = rnn.pad_sequence([batch[1][0] for batch in batches], batch_first=True)
            encoder_target = rnn.pad_sequence([batch[0]["encoder_target"] for batch in batches], batch_first=True)

        if batches[0][1][1] is not None:
            weight = rnn.pad_sequence([batch[1][1] for batch in batches], batch_first=True)
        else:
            weight = None

        return (
            dict(
                encoder_cat=encoder_cat,
                encoder_cont=encoder_cont,
                encoder_target=encoder_target,
                encoder_lengths=encoder_lengths,
                decoder_cat=decoder_cat,
                decoder_cont=decoder_cont,
                decoder_target=target,
                decoder_lengths=decoder_lengths,
                decoder_time_idx=decoder_time_idx,
                groups=groups,
                target_scale=target_scale,
            ),
            (target, weight),
        )

This static method combines a list of individual samples into a mini-batch for the dataloader. Its input and output are as follows:

Input:

* batches: a list of samples produced by __getitem__, each consisting of an input dictionary and an output label.

Output: a tuple with two elements.

The first element is a dictionary with the following key-value pairs:

* encoder_cat: categorical features for the encoder, shape (batch_size, max_encoder_length, num_cat_features).
* encoder_cont: continuous features for the encoder, shape (batch_size, max_encoder_length, num_cont_features).
* encoder_target: target values over the encoder window, shape (batch_size, max_encoder_length), or a list of such tensors for multiple targets.
* encoder_lengths: effective encoder length of each sample, shape (batch_size,).
* decoder_cat: categorical features for the decoder, shape (batch_size, max_decoder_length, num_cat_features).
* decoder_cont: continuous features for the decoder, shape (batch_size, max_decoder_length, num_cont_features).
* decoder_target: target values over the decoder window, shape (batch_size, max_decoder_length), or a list of such tensors for multiple targets.
* decoder_lengths: effective decoder length of each sample, shape (batch_size,).
* decoder_time_idx: time index of each decoder step, shape (batch_size, max_decoder_length).
* groups: encoded group ids identifying the time series each sample belongs to, shape (batch_size, number_of_ids).
* target_scale: scaling parameters of the target, shape (batch_size, scale_size), or a list of such tensors for multiple targets.

The second element is a tuple (target, weight):

* target: the decoder target (or list of decoder targets), identical to decoder_target in the dictionary above.
* weight: sample weights with the same shape as target, or None.

The function proceeds in the following steps:

1. Extract the encoder length (encoder_lengths) and decoder length (decoder_lengths) of every sample in the batches list.
2. Compute decoder_time_idx from the decoder start index, where decoder_time_idx_start = encoder_time_idx_start + encoder_lengths.
3. Stack the group ids (groups) of all samples.
4. Pad the encoder inputs, i.e. the categorical features (encoder_cat) and continuous features (encoder_cont), obtained by slicing each sample's x_cat and x_cont up to its encoder length (see the sketch below).
5. Pad the decoder inputs, i.e. the categorical features (decoder_cat) and continuous features (decoder_cont), obtained from the remainder of each sample after the encoder length.
6. Process the target scale (target_scale): stack it if it is a tensor (or a list/tuple of tensors), otherwise convert it to a tensor.
7. Pad the targets (target and encoder_target) and, if present, the weights (weight).
8. Return the dictionary and the (target, weight) tuple described above.
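
The following is a small, self-contained sketch (not part of the class) that mirrors the two core operations of _collate_fn: padding variable-length encoder sequences with rnn.pad_sequence and stacking per-sample target scales along the batch dimension. All values are made up for illustration.

    import torch
    from torch.nn.utils import rnn

    # two hypothetical samples with encoder lengths 3 and 5 and two continuous features each
    encoder_lengths = torch.tensor([3, 5], dtype=torch.long)
    x_cont = [torch.randn(3, 2), torch.randn(5, 2)]

    # pad to the longest sequence -> shape (batch_size, max_encoder_length, num_cont_features)
    encoder_cont = rnn.pad_sequence(x_cont, batch_first=True)
    print(encoder_cont.shape)  # torch.Size([2, 5, 2])

    # per-sample target scales (e.g. center and scale) stacked along the batch dimension
    target_scales = [torch.tensor([0.0, 1.0]), torch.tensor([2.5, 0.3])]
    target_scale = torch.stack(target_scales)
    print(target_scale.shape)  # torch.Size([2, 2])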

    def to_dataloader(
        self, train: bool = True, batch_size: int = 64, batch_sampler: Union[Sampler, str] = None, **kwargs
    ) -> DataLoader:
        """
        Get dataloader from dataset.

        Args:
            train (bool, optional): if dataloader is used for training or prediction
                Will shuffle and drop last batch if True. Defaults to True.
            batch_size (int): batch size for training model. Defaults to 64.
            batch_sampler (Union[Sampler, str]): batch sampler or string. One of

                * "synchronized": ensure that samples in decoder are aligned in time. Does not support missing
                  values in dataset. This makes only sense if the underlying algorithm makes use of values aligned
                  in time.
                * PyTorch Sampler instance: any PyTorch sampler, e.g. the WeightedRandomSampler()
                * None: samples are taken randomly from times series.

            **kwargs: additional arguments to ``DataLoader()``

        Returns:
            DataLoader: dataloader that returns Tuple.
                First entry is ``x``, a dictionary of tensors with the entries (and shapes in brackets)

                * encoder_cat (batch_size x n_encoder_time_steps x n_features): long tensor of encoded
                  categoricals for encoder
                * encoder_cont (batch_size x n_encoder_time_steps x n_features): float tensor of scaled continuous
                  variables for encoder
                * encoder_target (batch_size x n_encoder_time_steps or list thereof with each entry for a different
                  target):
                  float tensor with unscaled continuous target or encoded categorical target,
                  list of tensors for multiple targets
                * encoder_lengths (batch_size): long tensor with lengths of the encoder time series. No entry will
                  be greater than n_encoder_time_steps
                * decoder_cat (batch_size x n_decoder_time_steps x n_features): long tensor of encoded
                  categoricals for decoder
                * decoder_cont (batch_size x n_decoder_time_steps x n_features): float tensor of scaled continuous
                  variables for decoder
                * decoder_target (batch_size x n_decoder_time_steps or list thereof with each entry for a different
                  target):
                  float tensor with unscaled continuous target or encoded categorical target for decoder
                  - this corresponds to first entry of ``y``, list of tensors for multiple targets
                * decoder_lengths (batch_size): long tensor with lengths of the decoder time series. No entry will
                  be greater than n_decoder_time_steps
                * group_ids (batch_size x number_of_ids): encoded group ids that identify a time series in the dataset
                * target_scale (batch_size x scale_size or list thereof with each entry for a different target):
                  parameters used to normalize the target.
                  Typically these are mean and standard deviation. Is list of tensors for multiple targets.


                Second entry is ``y``, a tuple of the form (``target``, `weight`)

                * target (batch_size x n_decoder_time_steps or list thereof with each entry for a different target):
                  unscaled (continuous) or encoded (categories) targets, list of tensors for multiple targets
                * weight (None or batch_size x n_decoder_time_steps): weight

        Example:

            Weight by samples for training:

            .. code-block:: python

                from torch.utils.data import WeightedRandomSampler

                # length of probabilities for sampler has to be equal to the length of the index
                probabilities = np.sqrt(1 + data.loc[dataset.index, "target"])
                sampler = WeightedRandomSampler(probabilities, len(probabilities))
                dataset.to_dataloader(train=True, sampler=sampler, shuffle=False)
        """
        default_kwargs = dict(
            shuffle=train,
            drop_last=train and len(self) > batch_size,
            collate_fn=self._collate_fn,
            batch_size=batch_size,
            batch_sampler=batch_sampler,
        )
        default_kwargs.update(kwargs)
        kwargs = default_kwargs
        if kwargs["batch_sampler"] is not None:
            sampler = kwargs["batch_sampler"]
            if isinstance(sampler, str):
                if sampler == "synchronized":
                    kwargs["batch_sampler"] = TimeSynchronizedBatchSampler(
                        self, batch_size=kwargs["batch_size"], shuffle=kwargs["shuffle"], drop_last=kwargs["drop_last"]
                    )
                else:
                    raise ValueError(f"batch_sampler {
      
      sampler} unknown - see docstring for valid batch_sampler")
            del kwargs["batch_size"]
            del kwargs["shuffle"]
            del kwargs["drop_last"]

        return DataLoader(
            self,
            **kwargs,
        )

    def x_to_index(self, x: Dict[str, torch.Tensor]) -> pd.DataFrame:
        """
        Decode dataframe index from x.

        Returns:
            dataframe with time index column for first prediction and group ids
        """
        index_data = {self.time_idx: x["decoder_time_idx"][:, 0].cpu()}
        for id in self.group_ids:
            index_data[id] = x["groups"][:, self.group_ids.index(id)].cpu()
            # decode if possible
            index_data[id] = self.transform_values(id, index_data[id], inverse=True, group_id=True)
        index = pd.DataFrame(index_data)
        return index

    def __repr__(self) -> str:
        return repr_class(self, attributes=self.get_parameters(), extra_attributes=dict(length=len(self)))

This is the to_dataloader method of the dataset class. Here is a block-by-block explanation:

def to_dataloader(
        self, train: bool = True, batch_size: int = 64, batch_sampler: Union[Sampler, str] = None, **kwargs
    ) -> DataLoader:

This method returns a PyTorch DataLoader for iterating over the dataset. It takes three parameters: an optional boolean train (whether the dataloader is used for training; if True, batches are shuffled and an incomplete last batch is dropped), an integer batch_size (the batch size, default 64), and an optional batch_sampler that may be either a Sampler instance or a string selecting how the data is sampled. In addition, the method accepts arbitrary keyword arguments, which are passed through to DataLoader().

default_kwargs = dict(
            shuffle=train,
            drop_last=train and len(self) > batch_size,
            collate_fn=self._collate_fn,
            batch_size=batch_size,
            batch_sampler=batch_sampler,
        )
        default_kwargs.update(kwargs)
        kwargs = default_kwargs

The method first defines a dictionary of default arguments and then updates it with the keyword arguments passed by the caller, so user-supplied values override the defaults. The kwargs variable is then set to this merged dictionary.
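
The merge behaves like an ordinary dict.update, so anything the caller passes wins over the defaults. A tiny illustration with made-up values:

    # defaults that to_dataloader would pick for train=True
    default_kwargs = dict(shuffle=True, drop_last=True, batch_size=64)

    # suppose the caller passed shuffle=False and num_workers=4
    default_kwargs.update(dict(shuffle=False, num_workers=4))

    print(default_kwargs)
    # {'shuffle': False, 'drop_last': True, 'batch_size': 64, 'num_workers': 4}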

if kwargs["batch_sampler"] is not None:
            sampler = kwargs["batch_sampler"]
            if isinstance(sampler, str):
                if sampler == "synchronized":
                    kwargs["batch_sampler"] = TimeSynchronizedBatchSampler(
                        self, batch_size=kwargs["batch_size"], shuffle=kwargs["shuffle"], drop_last=kwargs["drop_last"]
                    )
                else:
                    raise ValueError(f"batch_sampler {
      
      sampler} unknown - see docstring for valid batch_sampler")
            del kwargs["batch_size"]
            del kwargs["shuffle"]
            del kwargs["drop_last"]

If batch_sampler is not None, the code checks whether it is a string: the value "synchronized" creates a TimeSynchronizedBatchSampler with the current batch_size, shuffle, and drop_last settings, while any other string raises a ValueError. In either case the now-redundant batch_size, shuffle, and drop_last entries are removed from kwargs, because DataLoader does not accept them together with a batch_sampler.

return DataLoader(
            self,
            **kwargs,
        )

Finally, the method returns a DataLoader constructed with the dataset itself (self) as the first argument and the prepared keyword arguments in kwargs as the remaining arguments.

In addition to to_dataloader, two helper methods are shown above. x_to_index decodes the time index of the first prediction step and the group ids from a batch x into a pandas DataFrame, reversing the group-id encoding where possible. __repr__ returns a string representation of the dataset, including its parameters and its length.
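
A short usage sketch follows; it assumes that training and validation are TimeSeriesDataSet instances constructed elsewhere (they are not defined in this excerpt), and the batch sizes and num_workers values are arbitrary.

    # dataloaders for fitting and evaluating a model; extra keyword arguments such as
    # num_workers are forwarded to torch.utils.data.DataLoader
    train_dataloader = training.to_dataloader(train=True, batch_size=64, num_workers=2)
    val_dataloader = validation.to_dataloader(train=False, batch_size=320, num_workers=2)

    # time-synchronized batches (decoder samples aligned in time); requires a dataset without missing values
    sync_dataloader = training.to_dataloader(train=True, batch_size=64, batch_sampler="synchronized")

    # inspect one batch and map it back to the time index and group ids
    x, (y, weight) = next(iter(val_dataloader))
    index = validation.x_to_index(x)  # DataFrame with the time index and group id columns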

Origin blog.csdn.net/m0_46413065/article/details/130176034