Newsletter | The Weekly Star (Second Week) ranking of the Big Data Challenge has been released, and experience sharing is here!


On July 10th, the China University Computer Contest - Big Data Challenge entered the preliminary-stage Weekly Star award selection. Based on real-time evaluation of contestants' online submissions (using the Public leaderboard results at 12:00 on the 10th), the second-week Weekly Star winners with the best results are both current student teams (see the picture below). Congratulations to the two winning student teams!

[Image: second-week Weekly Star ranking list]

What practical experience did the two winning teams gain in the competition? Let's hear their sharing!

01

Team Echoch's award-winning experience sharing

Problem understanding

To be honest, there are still many parts of this problem I don't understand well, and it is difficult. With some luck, I won the Weekly Star. My solution only uses trace and log features; metric is barely used at all, apart from a feature indicating whether metric data is present. The following mainly analyzes log and trace.

Trace

timestamp: features for samples whose count is less than 100 and whose timestamp.min equals 1199999

end_time - start_time: for finer-grained statistics, only mysql-type samples are counted

status_code: for finer-grained statistics, only web-type samples are counted

Other categorical features: count / n_unique
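As an illustration of the per-sample trace statistics above, a minimal pandas sketch follows. The column names (`service_name`, `start_time`, `end_time`, `status_code`) and the toy values are assumptions for demonstration, not the competition's actual schema:

```python
import pandas as pd

# Hypothetical trace records for one sample (column names are assumptions).
trace = pd.DataFrame({
    "service_name": ["mysql", "mysql", "web", "web"],
    "start_time":   [0, 10, 20, 30],
    "end_time":     [5, 25, 22, 31],
    "status_code":  [200, 200, 500, 200],
})

feats = {}

# Duration statistics computed only on mysql-type spans, as described above.
mysql = trace[trace["service_name"] == "mysql"]
dur = mysql["end_time"] - mysql["start_time"]
feats["mysql_duration_mean"] = dur.mean()
feats["mysql_duration_max"] = dur.max()

# status_code statistics computed only on web-type spans.
web = trace[trace["service_name"] == "web"]
feats["web_error_rate"] = (web["status_code"] != 200).mean()

# Generic count / n_unique features for categorical columns.
feats["trace_length"] = len(trace)
feats["service_nunique"] = trace["service_name"].nunique()
```

In practice this would run inside a loop over samples, accumulating one `feats` dict per sample id.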

Log

timestamp: last week's Weekly Star winners did diff and groupby on the timestamps, which was not very useful for me, so I do not use log timestamp features at all

service: count and n_unique of the service

message: mean and variance of the message length

Plus some feature crosses

Part of the feature engineering code is as follows:

```python
feats['log_length'] = len(log)
feats['log_service_nunique'] = log['service'].nunique()
feats['message_length_std'] = log['message'].fillna("").map(len).std()
feats['message_length_ptp'] = log['message'].fillna("").map(len).agg('ptp')
feats['log_info_length'] = log['message'].fillna("").map(lambda x: x.split("INFO")).map(len).agg('ptp')

# keyword features
# text_list = ['abnormal', 'error', 'user', 'mysql', 'true', 'failure']
text_list = ['user', 'mysql']
for text in text_list:
    # feats[f'message_{text}_sum'] = log['message'].str.contains(text, case=False).sum()
    feats[f'message_{text}_mean'] = log['message'].str.contains(text, case=False).mean()

# binarize: keep only whether "mysql" appears at all
feats['message_mysql_mean'] = 1 if feats['message_mysql_mean'] > 0 else 0
```

Metric

After digging for a long time, it was completely useless, and including it loses points offline. The difficulty lies in the wide spread of the data: the same tag can have many kinds of value ranges, and the ranges differ between samples.

I think if nothing can be dug out of this part at all, the result will come down to how much the A leaderboard is overfit.
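One hedged way to attack the varying-range problem described above is to min-max normalize each metric within its own (sample, tag) group before aggregating, so curve shapes become comparable even when absolute scales differ. This is only a sketch; the column names `sample_id`, `tag`, and `value` are assumptions:

```python
import pandas as pd

# Hypothetical metric records: same tag, very different value ranges per sample.
metric = pd.DataFrame({
    "sample_id": [1, 1, 1, 2, 2, 2],
    "tag":       ["cpu", "cpu", "cpu", "cpu", "cpu", "cpu"],
    "value":     [0.1, 0.5, 0.9, 100.0, 500.0, 900.0],
})

# Min-max normalize within each (sample, tag) group so the two samples'
# curves line up despite ranges differing by orders of magnitude.
grp = metric.groupby(["sample_id", "tag"])["value"]
metric["value_norm"] = (metric["value"] - grp.transform("min")) / (
    grp.transform("max") - grp.transform("min")
)
```

After normalization both samples carry the same [0, 1] shape, which is the kind of cross-sample comparability the raw values lack.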

Model selection

1. I use XGBoost, without splitting into 9 binary classifiers or OVR. I don't want to spend a long time tuning the model in the preliminary round; I'll do that after the data volume increases in the semi-finals.

2. My NN reaches 82 at most; I feel my skills are not enough and the offline score won't go any higher.
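The contrast above is between a single multi-class model and one binary classifier per class (OVR). As a hedged illustration of the two setups, here is a toy comparison using scikit-learn stand-ins rather than the author's XGBoost; the dataset is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy multi-class problem standing in for the fault-type classification.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=6,
    n_classes=3, random_state=0,
)

# Option 1: a single multi-class model (the route the author took).
clf_multi = LogisticRegression(max_iter=1000).fit(X, y)

# Option 2: one binary classifier per class (the OVR alternative).
clf_ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

proba_multi = clf_multi.predict_proba(X)  # rows sum to 1
proba_ovr = clf_ovr.predict_proba(X)      # per-class binary scores, renormalized
```

Either setup yields one probability column per class; OVR trades one joint model for per-class flexibility at the cost of more training runs.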

Iteration

After finding a clear decision boundary, don't worry even if the offline score drops; the online score will go up. You can also look at

- `submit.groupby('source').max()`

It is worth submitting when the maximum AUC of the No Fault class increases.

In addition, when the offline score drops by one or two points, you can visualize the features grouped by label, because some splits exist in the training data but do not appear in the test set at all.

Around 82 and 83 you need to add features; at 85 you need to remove them.

02

Team "Quickly Move into Zhang Xianzhong Building" award-winning experience sharing

1. Data understanding

The organizers provide 3 data sources. metric is monitoring measurement data; the measurement values under different services can be read directly from the table, but what exactly is being measured is not reflected in it. log is the system log, containing the running output of various systems; the same service can also contain different systems. trace is tracing data; its content is relatively complete, and most ids appear only once.

2. Feature engineering

The most obvious feature of the trace data is end_time - start_time, the time a trace takes from start to finish; its length can indicate whether there is a fault. You can build derivatives around this feature, such as mean, std, max, min, etc., and you can also consider crossing it with other variables. status_code also directly indicates whether there is a fault and can likewise be crossed with other variables.
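The mean/std/max/min derivatives above can be computed in one pass per sample with a groupby. A minimal sketch, assuming hypothetical column names `id`, `start_time`, and `end_time`:

```python
import pandas as pd

# Hypothetical trace table covering several samples.
trace = pd.DataFrame({
    "id":         [1, 1, 2, 2, 2],
    "start_time": [0, 10, 0, 5, 9],
    "end_time":   [4, 30, 2, 6, 29],
})
trace["duration"] = trace["end_time"] - trace["start_time"]

# Derive mean / std / max / min of the duration per sample id.
dur_feats = trace.groupby("id")["duration"].agg(["mean", "std", "max", "min"])
dur_feats.columns = [f"duration_{c}" for c in dur_feats.columns]
```

The resulting `dur_feats` table (one row per sample id) can be joined with log and metric features into the final training matrix.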

The log data mainly requires deciding how to process the message information. Keywords such as error, warning, etc. can be selected for word-frequency statistics, and NLP techniques can also be applied, such as tf-idf or pre-trained models like GPT-2 and BERT, to compute text representations.
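Both options above (keyword counts and tf-idf) can be sketched in a few lines with scikit-learn; the messages below are invented examples, not real competition logs:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical log messages.
messages = [
    "connection established to mysql",
    "error: connection refused",
    "warning: retrying request",
]

# Simple keyword-frequency features, as described above.
keywords = ["error", "warning"]
kw_counts = {kw: sum(kw in m for m in messages) for kw in keywords}

# TF-IDF representation of the same messages (one row per message).
vec = TfidfVectorizer()
tfidf = vec.fit_transform(messages)
```

The keyword counts go straight into the feature table; the tf-idf matrix would typically be reduced (e.g. with SVD) or aggregated per sample before joining.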

No clear features have been found in the metric data for now; it has been useless.

3. Model

The data after feature engineering contains a large number of missing values, which tree-based models can handle directly; xgboost, lightgbm, catboost, random forest, etc. are all options. Imputing missing values is not recommended, as it may cause a severe shift between the training and test distributions.

If you use the raw data directly, you can also consider neural network architectures that handle sequences, such as CNN, LSTM, Transformer, etc.; they also work well.

4. Unresolved issues

The metric data looks like time series, but different services do not seem comparable. Feature engineering on it does not look effective, and neural networks learn nothing useful from it either. It feels like this data was included just to cause trouble.

The order of the timestamps in the log is inconsistent with the times inside the messages; which order is correct still needs to be resolved.

The organizing committee has prepared exquisite and practical gifts for the Weekly Star winners, including Tsinghua campus-fashion print T-shirts, a Mechanical Revolution Yao C510 tri-mode wireless game controller, a Logitech ultra-thin silent wireless Bluetooth keyboard, and a Logitech wireless Bluetooth mouse. Don't hesitate any longer; call your friends and compete together!


The student teams performed well this week, and the professional teams need to step up! A quiet tip: the QQ group is already discussing leaderboard strategies, and the strong keep getting stronger. Come and join this high-level contest! Contestants can continue to download the training and test data from the designated website and submit results online. The registration deadline is 12:00 on July 24.


Welcome to learn more (data pie THU menu bar - competition entrance)

• Competition official website: http://nercbds.tsinghua.edu.cn/bdc/

• Contest mini-program: Kesai

• Contest email: [email protected]

• Contest QQ group: 762146461 / 901317172


Review of past Weekly Star experience sharing:



Origin blog.csdn.net/tMb8Z9Vdm66wH68VX1/article/details/131650777