Details of the structured problem-solving process

1. Knowledge points of structured competition questions

(figure omitted; see the original post)

2. Data analysis of competition questions

1. resp field analysis

(figure omitted; see the original post)

2. feature.csv data


(figure omitted; see the original post)
According to the original URL, the purpose of feature.csv is to show the relationships between the anonymous features: tag_0 to tag_28 are anonymous shared components/concepts used in feature derivation. For example, if the value of (feature_i, tag_j) is True, it means that tag_j was used to derive feature_i.
(figures omitted; see the original post)
How are tags actually distributed?
Reference URL

  1. Import libraries
    (figure omitted; see the original post)

  2. Download the data
    (figure omitted; see the original post)

  3. Analyze example_sample_submission
    (figures omitted; see the original post)

  4. Analyze feature.csv and count the True/False values of each feature in the table (a minimal sketch follows this list)
    (figures omitted; see the original post)
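The following is a minimal sketch of that count, assuming the tag table is loaded from feature.csv (named features.csv in the competition data) with a feature column and boolean tag_0 to tag_28 columns:

import pandas as pd

# each row is one feature, each tag column is True/False
tags = pd.read_csv("features.csv").set_index("feature")
true_counts = tags.sum(axis=1)      # number of True tags per feature
false_counts = (~tags).sum(axis=1)  # number of False tags per feature
print(true_counts.sort_values(ascending=False).head())
print(tags.sum(axis=0))             # how often each tag is used overall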

(figure omitted; see the original post)

It can be seen that some features are similar to others.

Features 1 to 40 form set 1
Features 61 to 70 form set 2
Features 71 to 120 form set 3
Features 121 to 129 form set 4

(figures omitted; see the original post)

  5. Analyze example_test data
    (figures omitted; see the original post)
    Group the data by ID and date (see the sketch below)
    (figures omitted; see the original post)
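A minimal sketch of this grouping, assuming example_test.csv has ts_id and date columns as in the competition data:

import pandas as pd

test = pd.read_csv("example_test.csv")
rows_per_day = test.groupby("date")["ts_id"].count()  # trades per day
print(rows_per_day.describe())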

3. The original baseline filters the data with date > 85. Why?

The original website uses data analysis: from the cumulative daily return chart it looks for trends over time, and finds that many features seem to change around day 85.
There are several opinions on the Internet:

1. The first 85 days are a time window: in rolling calculations, the first window cannot be used because its data are incomplete.
My opinion: when the date is small, the features have more missing values, which may be the real reason. As for how to fill in the missing data, doubts remain. Common approaches such as filling with the mean or median are not feasible here. If a random-forest model is used for imputation, then, given how the data are distributed, future data would be used to predict past data for the fill, which is also problematic. So deleting the early days with many missing values is worth considering (see the sketch after this list).

2. Some people think the first 85 days and the later dates may reflect different trading regimes. If the data trend of the first 85 days differs from that of the following days because of market fluctuations, then a few such days should be kept so that this situation can be modeled. You cannot guarantee that market fluctuations will not occur within the time period covered by the test data; if the market fluctuates but such data were deleted when training the model, the model's generalization ability will be poor!
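A minimal sketch of both the cumulative return chart and the filter, assuming train.csv has date and resp columns as in the competition data; the threshold follows the original baseline:

import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")

# cumulative daily return: look for the regime change around day 85
train.groupby("date")["resp"].sum().cumsum().plot()
plt.xlabel("date")
plt.ylabel("cumulative resp")
plt.show()

# the baseline simply drops the early days instead of imputing them
train = train.query("date > 85").reset_index(drop=True)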

4. Analyze feature_0

Original URL

The values of feature_0 are 1 and -1; let's analyze whether feature_0 corresponds to a category for each row of data. Below, a UMAP of all the features produces two main "lungs"/clusters, revealing some internal structure.

  1. TriMap of features 1-130

Perform dimensionality reduction on all features except feature_0 to find the two main clusters seen in the lung-shaped UMAP, and analyze whether these two clusters correspond to the value of feature_0. TriMap is used here instead of UMAP, in the hope of seeing similar clusters in TriMap space. A minimal sketch follows the figures below.

(figures omitted; see the original post)
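A minimal sketch of the TriMap projection, assuming train is loaded as above and using the trimap package (pip install trimap); the subsample size is an arbitrary choice of mine to keep the runtime manageable:

import pandas as pd
import trimap
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")
sample = train.sample(20000, random_state=0)
feature_cols = [f"feature_{i}" for i in range(1, 130)]  # everything except feature_0
X = sample[feature_cols].fillna(0).to_numpy()           # TriMap cannot handle NaNs

emb = trimap.TRIMAP(n_dims=2).fit_transform(X)
plt.scatter(emb[:, 0], emb[:, 1], c=sample["feature_0"], s=1, cmap="coolwarm")
plt.title("TriMap of the features, colored by feature_0")
plt.show()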

As expected, features 1-130 split into two different distributions, and these distributions match `feature_0` exactly. Whether this corresponds to buying/selling or some other financial concept I don't know, but clearly the value of feature_0 affects the values of some (if not all) of the other features in the dataset.

  2. Find the features related to feature_0

Obviously, feature_0 divides the remaining features into two data distributions, but the figure only tells us so much. To study the influence of feature_0 on the other features further, I try the inverse problem: predicting feature_0 from features 1-130. Then I delete the most important feature (according to feature importance) and check how well the remaining features still predict feature_0. By iteratively deleting the most important feature, we can understand how many features are needed to predict feature_0, and thus how many features are related to it.

The reason for adopting this method, rather than just looking at the correlation between "feature_0" and "feature_x", is that we also want to capture possible non-linear effects: "feature_i" and "feature_j" may predict "feature_0" in combination even though neither can on its own. A minimal sketch of the procedure follows.
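In this sketch the model choice (a random forest) and the subsampling are my own assumptions, not necessarily the original author's setup:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

train = pd.read_csv("train.csv").sample(50000, random_state=0)
feature_cols = [f"feature_{i}" for i in range(1, 130)]
X = train[feature_cols].fillna(0)
y = (train["feature_0"] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

remaining = list(feature_cols)
for step in range(10):  # repeatedly delete the most important feature
    clf = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0)
    clf.fit(X_tr[remaining], y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te[remaining])[:, 1])
    top = remaining[int(clf.feature_importances_.argmax())]
    print(f"step {step}: AUC = {auc:.3f}, deleting {top}")
    remaining.remove(top)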

(figures omitted; see the original post)

We can delete many of the most important features and still classify feature_0 perfectly. So it is not just a single feature whose distribution depends on the value of feature_0; rather, a large number of features have distributions associated with it.

  3. Interrelationships between features

Above, the feature most predictive of feature_0 was repeatedly removed in order to find the minimal feature set related to feature_0. In this experiment I use another method and check which features are, on their own, sufficient to predict the value of feature_0.

(figures omitted; see the original post)
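A minimal sketch of the single-feature experiment, reusing X_tr, X_te, y_tr, y_te and feature_cols from the previous sketch; the model choice and the 0.6 cutoff applied at the end follow the text, the rest is my own assumption:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

single_auc = {}
for col in feature_cols:  # one tiny model per feature
    clf = RandomForestClassifier(n_estimators=20, max_depth=3, n_jobs=-1, random_state=0)
    clf.fit(X_tr[[col]], y_tr)
    single_auc[col] = roc_auc_score(y_te, clf.predict_proba(X_te[[col]])[:, 1])

# combine all the individually "poor" features (AUC < 0.6) into one model
poor = [c for c, a in single_auc.items() if a < 0.6]
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X_tr[poor], y_tr)
print("AUC of poor features combined:",
      roc_auc_score(y_te, clf.predict_proba(X_te[poor])[:, 1]))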

To measure this better, let's see what AUC we get from a single model that combines all the features whose individual score is < 0.6 (the last step of the sketch above).

(figure omitted; see the original post)

>> AUC score of poor features combined: 0.97

Pattern 1: features 17 to 40 separate feature_0 very well; looking at the tags file, this does not seem to match the given tags.
Pattern 2: features 65 to 68 seem related to feature_0, yet none of these features has any tag.
Pattern 3: features 72 to 106 seem to show a repeating pattern related to feature_0.
Pattern 4: even though the single-feature model scores suggest that some features are unrelated to feature_0, when all these "poor" features are combined they classify feature_0 well, which shows that the value of feature_0 really does affect the distributions of many other features.

  4. Inspecting feature patterns

Using the different patterns identified above, let's look at the actual feature values split by feature_0 to see whether the difference in distributions can be observed directly. A minimal sketch follows the figures below.

(figures omitted; the graphics are not fully displayed here, please check the original URL!)
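A minimal sketch of this comparison, assuming train is loaded as above; the plotted feature is an arbitrary pick from pattern 1:

import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")
col = "feature_17"
for v, color in [(1, "tab:blue"), (-1, "tab:red")]:
    train.loc[train["feature_0"] == v, col].plot.hist(
        bins=100, density=True, alpha=0.5, color=color, label=f"feature_0 = {v}")
plt.legend()
plt.title(f"{col} split by feature_0")
plt.show()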

5. What is the use of visualization?

(figure omitted; see the original post)

3. Model training and verification

1. Data division method

(figure omitted; see the original post)

2. Time series data division method

On this website, the author not only did a detailed analysis but also proposed a new split method for time series data.
(figures omitted; see the original post)

  1. train_test_split—holdout method
    Use shuffle=False; otherwise we lose all temporal ordering.

There is a problem with this approach: by repeatedly validating against the same data, we slowly start to overfit ("leak") the test set. Splitting off a separate validation set solves this: tune and validate the model's hyperparameters on the validation set, and once that is fully done, evaluate on the test set (only once!). A minimal sketch of the holdout split follows.
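The sketch uses a toy time-ordered array:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)
y = np.arange(10)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=False)  # keep the temporal order
print(y_train, y_test)  # [0 1 2 3 4 5 6] [7 8 9]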

(figures omitted; see the original post)
2. Cross-validation (CV)

In cross-validation (CV), multiple validation sets are derived from the training set.
Each fold uses a new part of the data as its validation set, while the data previously used for validation becomes part of the training set again.

(figures omitted; see the original post)

From the results it can be seen that, in the k-fold splits, test-set indices can be smaller than training-set indices. In a time series prediction problem, it is meaningless to use the future to predict the past! A minimal sketch illustrating this follows.
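import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)
for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=5).split(X)):
    print(f"fold {fold}: TRAIN {train_idx} TEST {test_idx}")
# fold 0: TRAIN [2 3 4 5 6 7 8 9] TEST [0 1]
# the very first fold already trains on the future to predict the past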

  3. TimeSeriesSplit—time series split

Compared with plain cross-validation, TimeSeriesSplit solves the problem of test-set indices being smaller than training-set indices. A minimal sketch follows.

(figure omitted; see the original post)
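import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12)
for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=3).split(X)):
    print(f"fold {fold}: TRAIN {train_idx} TEST {test_idx}")
# fold 0: TRAIN [0 1 2] TEST [3 4 5]
# fold 1: TRAIN [0 1 2 3 4 5] TEST [6 7 8]
# fold 2: TRAIN [0 1 2 3 4 5 6 7 8] TEST [9 10 11]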

However, TimeSeriesSplit does not consider the groups present in the data. Although it is not obvious in the figure, a group can end up partly in the training set and partly in the test set.
(figure omitted; see the original post)
From the running results we can see that data from day 44 appears in both the training and the test set, which means we are training on half of a day's trades only to validate on the other half of the same day. Of course we want to train on all the trades of a given day and validate on the following days! Otherwise there is leakage again.

  4. GroupKFold—cross-validation by group (a minimal sketch follows the figures)
    (figures omitted; see the original post)
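The sketch uses hypothetical date groups of my own choosing:

import numpy as np
from sklearn.model_selection import GroupKFold

dates = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3])  # hypothetical "date" of each row
X = np.arange(len(dates)).reshape(-1, 1)
for fold, (train_idx, test_idx) in enumerate(
        GroupKFold(n_splits=2).split(X, groups=dates)):
    print(f"fold {fold}: TRAIN dates {np.unique(dates[train_idx])}"
          f" TEST dates {np.unique(dates[test_idx])}")
# no date is split across train and test, but a later date can appear in
# the training set while an earlier date is in the test set: time travel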

The GroupKFold iterator does respect grouping: one day's data never appears in the training set and the test set at the same time. But it scrambles the temporal order and causes "time travel". We need to combine GroupKFold and TimeSeriesSplit: GroupTimeSeriesSplit.

  5. GroupTimeSeriesSplit—grouped time series split
import numpy as np

from sklearn.model_selection._split import _BaseKFold, indexable, _num_samples
from sklearn.utils.validation import _deprecate_positional_args

# https://github.com/getgaurav2/scikit-learn/blob/d4a3af5cc9da3a76f0266932644b884c99724c57/sklearn/model_selection/_split.py#L2243
class GroupTimeSeriesSplit(_BaseKFold):
    """Time Series cross-validator variant with non-overlapping groups.
    Provides train/test indices to split time series data samples
    that are observed at fixed time intervals according to a
    third-party provided group.
    In each split, test indices must be higher than before, and thus shuffling
    in cross validator is inappropriate.
    This cross-validation object is a variation of :class:`KFold`.
    In the kth split, it returns first k folds as train set and the
    (k+1)th fold as test set.
    The same group will not appear in two different folds (the number of
    distinct groups has to be at least equal to the number of folds).
    Note that unlike standard cross-validation methods, successive
    training sets are supersets of those that come before them.
    Read more in the :ref:`User Guide <cross_validation>`.
    Parameters
    ----------
    n_splits : int, default=5
        Number of splits. Must be at least 2.
    max_train_size : int, default=None
        Maximum size for a single training set.
    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import GroupTimeSeriesSplit
    >>> groups = np.array(['a', 'a', 'a', 'a', 'a', 'a',\
                           'b', 'b', 'b', 'b', 'b',\
                           'c', 'c', 'c', 'c',\
                           'd', 'd', 'd'])
    >>> gtss = GroupTimeSeriesSplit(n_splits=3)
    >>> for train_idx, test_idx in gtss.split(groups, groups=groups):
    ...     print("TRAIN:", train_idx, "TEST:", test_idx)
    ...     print("TRAIN GROUP:", groups[train_idx],\
                  "TEST GROUP:", groups[test_idx])
    TRAIN: [0, 1, 2, 3, 4, 5] TEST: [6, 7, 8, 9, 10]
    TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a']\
    TEST GROUP: ['b' 'b' 'b' 'b' 'b']
    TRAIN: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] TEST: [11, 12, 13, 14]
    TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b']\
    TEST GROUP: ['c' 'c' 'c' 'c']
    TRAIN: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]\
    TEST: [15, 16, 17]
    TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b' 'c' 'c' 'c' 'c']\
    TEST GROUP: ['d' 'd' 'd']
    """
    @_deprecate_positional_args
    def __init__(self,
                 n_splits=5,
                 *,
                 max_train_size=None
                 ):
        super().__init__(n_splits, shuffle=False, random_state=None)
        self.max_train_size = max_train_size

    def split(self, X, y=None, groups=None):
        """Generate indices to split data into training and test set.
        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data, where n_samples is the number of samples
            and n_features is the number of features.
        y : array-like of shape (n_samples,)
            Always ignored, exists for compatibility.
        groups : array-like of shape (n_samples,)
            Group labels for the samples used while splitting the dataset into
            train/test set.
        Yields
        ------
        train : ndarray
            The training set indices for that split.
        test : ndarray
            The testing set indices for that split.
        """
        if groups is None:
            raise ValueError(
                "The 'groups' parameter should not be None")
        X, y, groups = indexable(X, y, groups)
        n_samples = _num_samples(X)
        n_splits = self.n_splits
        n_folds = n_splits + 1
        group_dict = {}
        u, ind = np.unique(groups, return_index=True)
        unique_groups = u[np.argsort(ind)]
        n_groups = _num_samples(unique_groups)
        for idx in np.arange(n_samples):
            if (groups[idx] in group_dict):
                group_dict[groups[idx]].append(idx)
            else:
                group_dict[groups[idx]] = [idx]
        if n_folds > n_groups:
            raise ValueError(
                ("Cannot have number of folds={0} greater than"
                 " the number of groups={1}").format(n_folds,
                                                     n_groups))
        group_test_size = n_groups // n_folds
        group_test_starts = range(n_groups - n_splits * group_test_size,
                                  n_groups, group_test_size)
        for group_test_start in group_test_starts:
            train_array = []
            test_array = []
            for train_group_idx in unique_groups[:group_test_start]:
                train_array_tmp = group_dict[train_group_idx]
                train_array = np.sort(np.unique(
                                      np.concatenate((train_array,
                                                      train_array_tmp)),
                                      axis=None), axis=None)
            train_end = train_array.size
            if self.max_train_size and self.max_train_size < train_end:
                train_array = train_array[train_end -
                                          self.max_train_size:train_end]
            for test_group_idx in unique_groups[group_test_start:
                                                group_test_start +
                                                group_test_size]:
                test_array_tmp = group_dict[test_group_idx]
                test_array = np.sort(np.unique(
                                              np.concatenate((test_array,
                                                              test_array_tmp)),
                                     axis=None), axis=None)
            yield [int(i) for i in train_array], [int(i) for i in test_array]

The result of the grouped time series split:
(figures omitted; see the original post)
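For reference, a usage sketch of the class defined above on the competition data, assuming a DataFrame train with a date column as in train.csv:

import pandas as pd

train = pd.read_csv("train.csv")
cv = GroupTimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(cv.split(train, groups=train["date"])):
    tr_dates = train["date"].iloc[train_idx]
    te_dates = train["date"].iloc[test_idx]
    # every training date precedes every test date, and no date is split
    print(f"fold {fold}: train dates {tr_dates.min()}-{tr_dates.max()},"
          f" test dates {te_dates.min()}-{te_dates.max()}")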

4. Baseline strengthening route

(figures omitted; see the original post)

Origin blog.csdn.net/weixin_46649052/article/details/112852134