Article Directory
1. Knowledge points of structured competition questions
2. Data analysis of competition questions
1. resp field analysis
2. feature.csv data
feature.csv describes the relationships between the anonymous features: tag0~tag28 are anonymous shared components/concepts used in feature derivation. For example, if the value of (feature_i, tag_j) is True, then tag_j was used to derive feature_i.
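As a minimal sketch of how such a boolean feature/tag table can be read, the snippet below builds a tiny made-up table with pandas. The column names (`feature`, `tag_0`, ...) and all values are assumptions for illustration, not the real contents of feature.csv.

```python
import pandas as pd

# Hypothetical miniature of the feature/tag table (made-up values).
tags = pd.DataFrame(
    {
        "feature": ["feature_1", "feature_2", "feature_3"],
        "tag_0": [True, False, True],
        "tag_1": [False, False, True],
        "tag_2": [True, False, False],
    }
).set_index("feature")

# Tags used to derive one feature: the columns that are True in its row.
row = tags.loc["feature_1"]
tags_of_feature_1 = row[row].index.tolist()

# Number of tags per feature and number of features per tag (True counts as 1).
tags_per_feature = tags.sum(axis=1)
features_per_tag = tags.sum(axis=0)

print(tags_of_feature_1)
print(tags_per_feature.to_dict())
print(features_per_tag.to_dict())
```

Features with identical rows in this table share the same derivation components, which is one way to look for groups of similar features.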
How are tags actually distributed?
Reference URL
- Import library
- Download Data
- Analyze Example_sample_submission
- Analyze feature.csv and count the True and False values of each feature in the table
It can be seen that some features are similar to others.
Features 1 to 40 form set 1
Features 61 to 70 form set 2
Features 71 to 120 form set 3
Features 121 to 129 form set 4
- Analyze example_test data
Data grouped by ID and Date
3. The original baseline filters the data with date>85. How was this threshold chosen?
The original source uses data analysis: it plots the cumulative daily returns to look for trends over time, and finds that many features seem to change around day 85.
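The cumulative-by-day view and the date>85 filter can be sketched as follows. The `date` and `resp` column names follow the text above; the data itself is synthetic, so the numbers carry no meaning beyond showing the mechanics.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the training table: 120 days, 50 rows per day.
train = pd.DataFrame(
    {
        "date": np.repeat(np.arange(120), 50),
        "resp": rng.normal(loc=0.0, scale=0.01, size=120 * 50),
    }
)

# Sum resp within each day, then accumulate over days.
daily_resp = train.groupby("date")["resp"].sum()
cumulative_resp = daily_resp.cumsum()

# The baseline's filter: keep only rows after day 85.
train_late = train[train["date"] > 85]

print(cumulative_resp.tail())
print(len(train), "->", len(train_late))
```

Plotting `cumulative_resp` against the day index is what makes a change of regime around a given day visible.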
There are several opinions on the Internet:
1. The first 85 days form a time window. In rolling calculations, the first window cannot be used because its data is incomplete.
My opinion: when the date is small, the features contain more missing data, so this may be the reason. How to fill in the missing data is still an open question. Common approaches such as filling with the mean or median are not feasible. If a random forest is used for imputation based on the data distribution, future data would be used to predict and fill past data, which is also problematic. So one option is to delete the early days that have many missing values.
2. Some people think the first 85 days and the later dates may follow different trading models. If the trend in the first 85 days differs from the later trend because of market fluctuations, then some of those days should be kept so that the model can learn this situation, because you cannot guarantee that such fluctuations will not occur in the period covered by the test data. If the market fluctuates but that data was deleted before training, the model will generalize poorly!
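One way to check the missing-data argument is to compute the fraction of missing feature values per day. The sketch below uses synthetic data in which the first three days are deliberately given more missing values; that pattern, the column names and the 0.2 threshold are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_days, rows_per_day = 10, 100

# Synthetic data: early days get more missing values (illustrative assumption).
dates = np.repeat(np.arange(n_days), rows_per_day)
values = rng.normal(size=(n_days * rows_per_day, 3))
missing_prob = np.where(dates < 3, 0.5, 0.05)[:, None]
values[rng.random(values.shape) < missing_prob] = np.nan

df = pd.DataFrame(values, columns=["feature_1", "feature_2", "feature_3"])
df["date"] = dates

# Fraction of missing feature cells per day, averaged over the feature columns.
per_column = df.drop(columns="date").isna().groupby(df["date"]).mean()
missing_by_date = per_column.mean(axis=1)

# Days whose missing fraction exceeds a chosen threshold are candidates to drop.
threshold = 0.2
days_to_drop = missing_by_date[missing_by_date > threshold].index.tolist()

print(missing_by_date.round(2))
print(days_to_drop)
```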
4. Analyze feature_0
feature_0 takes only the values 1 and -1, so let's analyze whether it corresponds to a category for each row of data. A UMAP of all the features has a lung-like shape with two main "lungs"/clusters and some internal structure.
- TriMap of features 1-130
Perform dimensionality reduction on all features except feature_0 to check whether the two main clusters seen in the lung-shaped UMAP correspond to the value of feature_0. TriMAP is used here instead of UMAP, in the hope that similar clusters will appear in TriMAP space.
As expected, features 1-130 split into two different distributions, and these distributions exactly match feature_0. Whether this means buying/selling or some other financial concept I don't know, but clearly the value of feature_0 affects the values of some (if not all) other features in the data set.
- Find the feature related to feature_0
Clearly, feature_0 divides the remaining features into two data distributions, but the figure can only show so much. To study the influence of feature_0 on the other features further, here I solve the inverse problem of predicting feature_0 from features 1-130. Then I delete the most important feature (according to feature importance) and check again how well the remaining features predict feature_0. By iteratively deleting the most important feature, we can see how many features are needed to predict feature_0, and thus how many features are related to feature_0.
The reason for this approach is that we want to look not only at the correlation between feature_0 and feature_x, but also at possible non-linear effects: feature_i and feature_j may predict feature_0 in combination even though neither can predict it alone.
We can delete many of the most important features and still classify feature_0 perfectly. So it is not the distribution of a single feature that depends on the value of feature_0; rather, the distributions of a large number of features are associated with the value of feature_0.
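The iterative removal procedure can be sketched as below. This is a minimal illustration on synthetic data, not the original notebook's code: the random forest, the hold-out split and the synthetic ±1 target that shifts every feature are all assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_rows, n_features = 600, 8

# Synthetic stand-in: a +/-1 target that shifts every feature.
target = rng.choice([-1, 1], size=n_rows)
X = rng.normal(size=(n_rows, n_features)) + 1.5 * target[:, None]

remaining = list(range(n_features))  # column indices still in play
history = []                         # (removed column, AUC before removal)

for _ in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[:, remaining], target, test_size=0.3, random_state=0
    )
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

    # Drop the feature the model relied on most, then repeat.
    top = remaining[int(np.argmax(model.feature_importances_))]
    history.append((top, auc))
    remaining.remove(top)

for col, auc in history:
    print(f"AUC {auc:.3f} before removing column {col}")
```

If the AUC stays high while the most important columns keep disappearing, the target is encoded redundantly across many features, which is the situation described above.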
- Interrelationships between features
Above, the most predictive feature for feature_0 was repeatedly removed to understand the minimum feature set related to feature_0. In this experiment, I use another method: checking which features are sufficient on their own to predict the value of feature_0.
To get a better measure, let's see what AUC we obtain when combining all features whose single-feature model scores below 0.6.
>> AUC score of poor features combined: 0.97
Pattern 1: Features 17 to 40 separate feature_0 very well. Looking at the tags file, this does not seem to match any of the given tags.
Pattern 2: Features 65-68 seem to be related to feature_0; none of these features have any tags.
Pattern 3: Features 72-106 seem to have a repetitive pattern.
Pattern 4: Even though the single-feature model scores suggest that some features are unrelated to feature_0, all these "poor" features combined classify feature_0 well, which shows that the value of feature_0 does affect the distribution of many other features.
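The "poor features combined" check can be sketched like this. The data is synthetic (each feature carries only a weak trace of the target), and the logistic regression with 3-fold cross-validated AUC is an assumed stand-in for whatever model the original analysis used; only the 0.6 cutoff comes from the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_rows, n_features = 2000, 40

# Synthetic data: every feature is only weakly shifted by the +/-1 target.
target = rng.choice([-1, 1], size=n_rows)
X = rng.normal(size=(n_rows, n_features)) + 0.1 * target[:, None]


def cv_auc(features):
    """Mean 3-fold cross-validated AUC of a logistic regression."""
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, features, target, cv=3, scoring="roc_auc").mean()


# One model per single feature.
single_auc = np.array([cv_auc(X[:, [j]]) for j in range(n_features)])

# Features that look "poor" on their own, then one model on all of them together.
poor = np.where(single_auc < 0.6)[0]
combined_auc = cv_auc(X[:, poor])

print(f"best single poor feature AUC: {single_auc[poor].max():.2f}")
print(f"AUC of poor features combined: {combined_auc:.2f}")
```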
- Inspecting feature patterns
Using the patterns identified above, let's look at the actual feature values split by feature_0 to see whether the difference in distribution is really observable.
The figures are not all shown here; please check the original URL if you are interested.
5. What is the use of visualization?
3. Model training and validation
1. Data splitting methods
2. Time series data splitting methods
- train_test_split: hold-out method
Use shuffle=False; otherwise the time ordering is lost.
There is a problem with this approach: by repeatedly validating against the same data, we slowly start to overfit ("leak") the test set. Splitting off a separate validation set solves this problem: tune and validate the model's hyperparameters on the validation set, and once that is completely finished, evaluate on the test set (only once!).
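A minimal sketch of an ordered train/validation/test split with `train_test_split(..., shuffle=False)`. The split fractions are arbitrary choices for illustration, and the row indices simply stand in for time-ordered samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Row indices stand in for time-ordered samples.
X = np.arange(100).reshape(-1, 1)

# First cut: hold out the most recent 20% as the final test set.
X_trainval, X_test = train_test_split(X, test_size=0.2, shuffle=False)

# Second cut: hold out the most recent part of the remainder for validation.
X_train, X_val = train_test_split(X_trainval, test_size=0.25, shuffle=False)

print(len(X_train), len(X_val), len(X_test))
print(X_train.max(), X_val.min(), X_val.max(), X_test.min())
```

Because nothing is shuffled, every training row precedes every validation row, which in turn precedes every test row.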
- Cross-validation: k-fold cross-validation
In cross-validation (CV), multiple validation sets are derived from the training set.
Each fold will use a new part of the training set as the validation set, and the data previously used for validation will now become part of the training set
The results show that, with k-fold cross-validation, some test set indices are smaller than the training set indices. In a time series prediction problem, it makes no sense to use the future to predict the past!
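The index problem is easy to reproduce on a toy array with scikit-learn's `KFold`; the six-sample array is just an illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(6).reshape(-1, 1)

splits = list(KFold(n_splits=3).split(X))
for train_idx, test_idx in splits:
    print("TRAIN:", train_idx, "TEST:", test_idx)
# TRAIN: [2 3 4 5] TEST: [0 1]
# TRAIN: [0 1 4 5] TEST: [2 3]
# TRAIN: [0 1 2 3] TEST: [4 5]

# A fold "looks into the future" when some training index is later than a test index.
uses_future = [bool(train_idx.max() > test_idx.min()) for train_idx, test_idx in splits]
print(uses_future)
```

Only the last fold trains strictly on the past; the first two train on samples that come after their test samples.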
- TimeSeriesSplit: time series split
Compared with k-fold cross-validation, TimeSeriesSplit fixes the problem of test set indices being smaller than training set indices.
However, TimeSeriesSplit does not take the groups in the data into account. Although it is not obvious in the figure, a group can end up partly in the training set and partly in the test set.
The run results show data from day 44 in both the training and test sets, which means we train on half of that day's transactions only to validate on the other half. We want to train on all transactions of a given day and validate on the following days; otherwise we leak again.
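The same effect can be seen on a toy example in which several rows share a day. The `dates` array below is made up; `TimeSeriesSplit` only sees row positions, so a day can straddle the train/test boundary.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten rows spread over three days (made-up example).
dates = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
X = np.arange(len(dates)).reshape(-1, 1)

shared_days = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    # Days that appear on both sides of the split.
    overlap = sorted(set(dates[train_idx]) & set(dates[test_idx]))
    shared_days.append([int(d) for d in overlap])
    print("TRAIN days:", dates[train_idx], "TEST days:", dates[test_idx])

print(shared_days)
```

In the first and third split a day is cut in two, so part of it is used for training and the rest for testing.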
- GroupKFold: cross-validation by group
The GroupKFold iterator does respect grouping: one day's data never appears in both the training set and the test set. But it scrambles the time order and causes time travel (training on the future to predict the past). We need to combine GroupKFold and TimeSeriesSplit: GroupTimeSeriesSplit.
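A small sketch of both properties of `GroupKFold`, using a made-up `dates` array as the groups: no day is shared between train and test, but most folds train on later days than they test on.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Eight rows over four days (made-up example); the day is the group.
dates = np.array([0, 0, 1, 1, 2, 2, 3, 3])
X = np.arange(len(dates)).reshape(-1, 1)

shares_day = []
time_travel = []
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, groups=dates):
    # Grouping is respected: train and test never share a day.
    shares_day.append(bool(set(dates[train_idx]) & set(dates[test_idx])))
    # Time travel: the training set contains a day later than a test day.
    time_travel.append(bool(dates[train_idx].max() > dates[test_idx].min()))

print(shares_day)
print(sum(time_travel), "of", len(time_travel), "folds train on the future")
```

Only the fold that tests on the last day is free of time travel.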
- GroupTimeSeriesSplit: time series split by group
import numpy as np
from sklearn.model_selection._split import _BaseKFold, indexable, _num_samples
from sklearn.utils.validation import _deprecate_positional_args


# https://github.com/getgaurav2/scikit-learn/blob/d4a3af5cc9da3a76f0266932644b884c99724c57/sklearn/model_selection/_split.py#L2243
class GroupTimeSeriesSplit(_BaseKFold):
    """Time Series cross-validator variant with non-overlapping groups.

    Provides train/test indices to split time series data samples
    that are observed at fixed time intervals according to a
    third-party provided group.
    In each split, test indices must be higher than before, and thus shuffling
    in cross validator is inappropriate.
    This cross-validation object is a variation of :class:`KFold`.
    In the kth split, it returns first k folds as train set and the
    (k+1)th fold as test set.
    The same group will not appear in two different folds (the number of
    distinct groups has to be at least equal to the number of folds).
    Note that unlike standard cross-validation methods, successive
    training sets are supersets of those that come before them.
    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=5
        Number of splits. Must be at least 2.
    max_train_size : int, default=None
        Maximum size for a single training set.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import GroupTimeSeriesSplit
    >>> groups = np.array(['a', 'a', 'a', 'a', 'a', 'a',\
                           'b', 'b', 'b', 'b', 'b',\
                           'c', 'c', 'c', 'c',\
                           'd', 'd', 'd'])
    >>> gtss = GroupTimeSeriesSplit(n_splits=3)
    >>> for train_idx, test_idx in gtss.split(groups, groups=groups):
    ...     print("TRAIN:", train_idx, "TEST:", test_idx)
    ...     print("TRAIN GROUP:", groups[train_idx],\
                  "TEST GROUP:", groups[test_idx])
    TRAIN: [0, 1, 2, 3, 4, 5] TEST: [6, 7, 8, 9, 10]
    TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a']\
    TEST GROUP: ['b' 'b' 'b' 'b' 'b']
    TRAIN: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] TEST: [11, 12, 13, 14]
    TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b']\
    TEST GROUP: ['c' 'c' 'c' 'c']
    TRAIN: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]\
    TEST: [15, 16, 17]
    TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b' 'c' 'c' 'c' 'c']\
    TEST GROUP: ['d' 'd' 'd']
    """

    @_deprecate_positional_args
    def __init__(self,
                 n_splits=5,
                 *,
                 max_train_size=None
                 ):
        super().__init__(n_splits, shuffle=False, random_state=None)
        self.max_train_size = max_train_size

    def split(self, X, y=None, groups=None):
        """Generate indices to split data into training and test set.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data, where n_samples is the number of samples
            and n_features is the number of features.
        y : array-like of shape (n_samples,)
            Always ignored, exists for compatibility.
        groups : array-like of shape (n_samples,)
            Group labels for the samples used while splitting the dataset into
            train/test set.

        Yields
        ------
        train : ndarray
            The training set indices for that split.
        test : ndarray
            The testing set indices for that split.
        """
        if groups is None:
            raise ValueError(
                "The 'groups' parameter should not be None")
        X, y, groups = indexable(X, y, groups)
        n_samples = _num_samples(X)
        n_splits = self.n_splits
        n_folds = n_splits + 1
        group_dict = {}
        # Unique groups, kept in order of first appearance (i.e. time order).
        u, ind = np.unique(groups, return_index=True)
        unique_groups = u[np.argsort(ind)]
        n_groups = _num_samples(unique_groups)
        # Map each group label to the list of row indices that belong to it.
        for idx in np.arange(n_samples):
            if (groups[idx] in group_dict):
                group_dict[groups[idx]].append(idx)
            else:
                group_dict[groups[idx]] = [idx]
        if n_folds > n_groups:
            raise ValueError(
                ("Cannot have number of folds={0} greater than"
                 " the number of groups={1}").format(n_folds,
                                                     n_groups))
        # Number of groups per test fold and the first test group of each split.
        group_test_size = n_groups // n_folds
        group_test_starts = range(n_groups - n_splits * group_test_size,
                                  n_groups, group_test_size)
        for group_test_start in group_test_starts:
            train_array = []
            test_array = []
            # Train on every group that comes before the test groups.
            for train_group_idx in unique_groups[:group_test_start]:
                train_array_tmp = group_dict[train_group_idx]
                train_array = np.sort(np.unique(
                                      np.concatenate((train_array,
                                                      train_array_tmp)),
                                      axis=None), axis=None)
            train_end = train_array.size
            # Optionally keep only the most recent max_train_size samples.
            if self.max_train_size and self.max_train_size < train_end:
                train_array = train_array[train_end -
                                          self.max_train_size:train_end]
            for test_group_idx in unique_groups[group_test_start:
                                                group_test_start +
                                                group_test_size]:
                test_array_tmp = group_dict[test_group_idx]
                test_array = np.sort(np.unique(
                                      np.concatenate((test_array,
                                                      test_array_tmp)),
                                      axis=None), axis=None)
            yield [int(i) for i in train_array], [int(i) for i in test_array]
The result of the grouped time series split (figure in the original post).
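To make the output of such a split concrete, here is a simplified, self-contained sketch of the same idea. The function `group_time_series_split` is a hypothetical helper written for this illustration (it omits `max_train_size` and the scikit-learn base class), and the `dates` array is made up.

```python
import numpy as np


def group_time_series_split(groups, n_splits=3):
    """Hypothetical simplified splitter: expanding train window over whole groups."""
    groups = np.asarray(groups)
    # Unique groups in order of first appearance (rows are assumed time-ordered).
    uniq, first_idx = np.unique(groups, return_index=True)
    ordered = uniq[np.argsort(first_idx)]
    n_groups = len(ordered)
    n_folds = n_splits + 1
    if n_folds > n_groups:
        raise ValueError("need at least n_splits + 1 distinct groups")
    test_size = n_groups // n_folds
    for start in range(n_groups - n_splits * test_size, n_groups, test_size):
        # Train on all earlier groups, test on the next block of groups.
        train_idx = np.flatnonzero(np.isin(groups, ordered[:start]))
        test_idx = np.flatnonzero(np.isin(groups, ordered[start:start + test_size]))
        yield train_idx, test_idx


# 18 rows over 4 days: 6, 5, 4 and 3 rows per day.
dates = np.array([0] * 6 + [1] * 5 + [2] * 4 + [3] * 3)
splits = list(group_time_series_split(dates, n_splits=3))

for train_idx, test_idx in splits:
    print("TRAIN days:", np.unique(dates[train_idx]),
          "TEST days:", np.unique(dates[test_idx]))
```

Each split tests on a whole day that comes strictly after all training days, so no day is cut in two and no fold trains on the future.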
4. Baseline strengthening route