Newsletter | The Big Data Challenge Weekly Star (Week 3) list has been released, with experience sharing from the winners!

On July 17, the China University Computer Contest - Big Data Challenge entered the final Weekly Star selection of the preliminary stage. Based on real-time evaluation of the contestants' online submissions (using the public leaderboard as of 12:00 on the 17th), the best-performing student team and in-service team for the third Weekly Star week were announced (see the picture below). Congratulations to the two winning teams!

19f1bdd98811048deb14c878bc7564b4.png

What practical experience do the two winning teams have to share? Let's hear from them!

Team Lin's award-winning experience sharing

Understanding the task

This competition provides three data sources. Our solution mainly uses trace and log; metric is not used for now (our attempts with it did not help). Early on we did only simple mining of the trace table and focused on the log table. Later, mining the trace table further improved the score again. In hindsight, the trace table plus a few simple log-table features is enough to reach 0.85+ online.

Feature engineering

Trace: We kept using statistical features such as mean/std/ptp, as shared by earlier Weekly Star winners. Inspecting the data showed a large number of records where endtime-starttime is 0, 1, 2, and so on, so more detailed statistics can be computed on these. We also grouped by ip and service and built statistical features on the timestamp diff within each group. In addition, the categorical features can be sorted by timestamp to build sequences, then processed with techniques such as w2v and tfidf (see how sequences are handled in the open-source solutions to the 2021 WeChat Big Data Challenge), which took the score from 83 to 84+.
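The duration statistics and grouped timestamp diffs described above can be sketched roughly as follows. This is a minimal illustration with pandas on a toy table; the column names (id, host_ip, service, start_time, end_time) and the output feature names are assumptions, not the competition's actual schema.

```python
import pandas as pd

# Toy trace table; column names are assumptions for illustration only.
trace = pd.DataFrame({
    "id": ["a", "a", "a", "b", "b"],
    "host_ip": ["h1", "h1", "h2", "h1", "h1"],
    "service": ["s1", "s1", "s2", "s1", "s1"],
    "timestamp": [100, 130, 110, 200, 260],
    "start_time": [100, 130, 110, 200, 260],
    "end_time": [100, 132, 111, 205, 260],
})

# Span duration, plus flags for the very short durations (0, 1, 2).
trace["duration"] = trace["end_time"] - trace["start_time"]
short_cols = []
for v in (0, 1, 2):
    col = f"duration_eq_{v}"
    trace[col] = (trace["duration"] == v).astype(int)
    short_cols.append(col)

# Per-sample statistics: mean / std / ptp (max minus min) of the duration.
agg = trace.groupby("id")["duration"].agg(
    duration_mean="mean",
    duration_std="std",
    duration_ptp=lambda s: s.max() - s.min(),
)

# Share of each short duration value per sample.
agg = agg.join(trace.groupby("id")[short_cols].mean())

# Timestamp gaps inside each (id, ip, service) group, after sorting by time.
trace = trace.sort_values(["id", "host_ip", "service", "timestamp"])
trace["ts_diff"] = trace.groupby(["id", "host_ip", "service"])["timestamp"].diff()
diff_stats = trace.groupby("id")["ts_diff"].agg(ts_diff_mean="mean", ts_diff_max="max")

features = agg.join(diff_stats)
print(features)
```

The first row of every group has no previous timestamp, so its diff is NaN; the aggregations skip those values by default.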

Log: Similar to what last week's Weekly Star shared, we extracted the frequency and count of various keywords in message, and tried training w2v on the corpus. After long runs the effect turned out to be unstable (the score changed little across two runs; the 0.003 gain seen in the first run is not well understood and could not always be reproduced). BERT and similar models are worth trying in the future.
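One possible way to compute keyword counts and frequencies from log messages is sketched below. The keyword list and the column names (id, message) are hypothetical; in practice the keywords would come from inspecting the actual log text.

```python
import re

import pandas as pd

# Toy log table; column names and keywords are assumptions for illustration.
log = pd.DataFrame({
    "id": ["a", "a", "b"],
    "message": [
        "ERROR timeout while calling service",
        "INFO request ok",
        "WARN retry after timeout; ERROR connection refused",
    ],
})
keywords = ["error", "timeout", "warn"]

lowered = log["message"].str.lower()
count_cols = []
for kw in keywords:
    col = f"kw_{kw}_count"
    # Number of occurrences of the keyword in each message.
    log[col] = lowered.str.count(re.escape(kw))
    count_cols.append(col)

# Total count per sample, and frequency relative to the number of log lines.
kw_sum = log.groupby("id")[count_cols].sum()
lines_per_id = log.groupby("id").size()
kw_freq = kw_sum.div(lines_per_id, axis=0).add_suffix("_per_line")

log_features = kw_sum.join(kw_freq)
print(log_features)
```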

Metric: The metrics differ in scale and in direction. We tried clustering after normalization, which gave only a slight improvement online (possibly just noise). We do not yet know how to make good use of this table...
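A minimal sketch of "normalize, then cluster" is shown below, using scikit-learn's StandardScaler and KMeans on synthetic data. The choice of KMeans, the number of clusters, and the two synthetic metrics are all assumptions; the source does not say which clustering method was used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic metrics on very different scales (e.g. a ratio and a byte count).
X = np.column_stack([
    np.concatenate([rng.normal(0.2, 0.02, 50), rng.normal(0.8, 0.02, 50)]),
    np.concatenate([rng.normal(1e6, 1e4, 50), rng.normal(5e6, 1e4, 50)]),
])

# Standardize each metric to zero mean and unit variance so that
# the large-valued metric does not dominate the distance computation.
X_scaled = StandardScaler().fit_transform(X)

# Cluster the standardized metrics; the cluster id can be used as a feature.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_id = kmeans.fit_predict(X_scaled)
print(np.bincount(cluster_id))
```

Standardization handles the scale mismatch only; metrics whose direction differs (higher is worse for some, better for others) would still need a sign convention chosen by hand.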

Model selection

We used the baseline shared by top contestants: ovr (one-vs-rest) plus lgb. We tried nn, but its online score was low, only 0.825+. We also tried tabnet; the gap was small, but it brought no improvement.

Feature screening

We filtered features by correlation. Only about 20 features had a pairwise correlation greater than 0.8, and deleting them gave a 0.001 improvement online.

Overall, these are all fairly common practices, and more needs to be added to reach 86.

Team "Go to Internet Cafe to Steal Headphones" award-winning experience sharing

Understanding the task

Like most contestants, of the three data sources provided in this competition we use only trace and log; metric is not used for now.

The content shared by earlier Weekly Star winners was very valuable, and we basically reproduced the features and methods they mentioned.

Feature engineering

Trace: Besides the features most contestants have mentioned, such as endtime-starttime, start_time diff, and timestamp diff, we compute cross statistics over id combined with service, host, and endpoint. We also build features that relate the per-service, per-host, and per-endpoint statistics to the global per-id statistics, such as host_timestamp_max/timestamp_max.
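A rough sketch of the cross statistics and ratio features, using host as the example dimension, might look like this. The toy table and the generated column names are assumptions; only the host_timestamp_max/timestamp_max pattern comes from the text.

```python
import pandas as pd

# Toy trace table; column names are assumptions for illustration only.
trace = pd.DataFrame({
    "id": ["a", "a", "a", "b", "b"],
    "host": ["h1", "h1", "h2", "h1", "h2"],
    "timestamp": [100, 130, 150, 200, 260],
})

# Global statistic per id.
global_max = trace.groupby("id")["timestamp"].max().rename("timestamp_max")

# Cross statistic per (id, host), pivoted to one column per host.
host_max = (
    trace.groupby(["id", "host"])["timestamp"].max()
    .unstack("host")
    .add_prefix("host_")
    .add_suffix("_timestamp_max")
)

features = host_max.join(global_max)

# Ratio of each per-host statistic to the global one.
for col in host_max.columns:
    features[f"{col}_ratio"] = features[col] / features["timestamp_max"]

print(features)
```

The same pattern applies to service and endpoint, and to other aggregations such as min, mean, or count.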

Log: As with trace, we compute global and cross statistics and build related features. We tried tfidf+svd and word2vec, but neither has helped so far.
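For reference, a tfidf+svd pipeline over per-sample log text could be sketched as follows with scikit-learn. The column names, the toy messages, and the choice of two SVD components are assumptions for illustration.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy log table; column names are assumptions for illustration only.
log = pd.DataFrame({
    "id": ["a", "a", "b", "c"],
    "timestamp": [2, 1, 1, 1],
    "message": ["retry connection", "connection timeout", "request ok", "timeout error"],
})

# One "document" per sample: its messages joined in time order.
docs = (
    log.sort_values(["id", "timestamp"])
    .groupby("id")["message"]
    .agg(" ".join)
)

# Sparse tf-idf matrix with one row per sample and one column per token.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)

# Compress the sparse matrix into a few dense components.
svd = TruncatedSVD(n_components=2, random_state=0)
X_svd = svd.fit_transform(X_tfidf)

svd_features = pd.DataFrame(
    X_svd, index=docs.index, columns=["tfidf_svd_0", "tfidf_svd_1"]
)
print(svd_features)
```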

Metric: To avoid being affected by the absolute values, we tried many ratio-based features, but none of them worked.

Model selection

We currently use ovr (one-vs-rest) plus catboost and have not tried nn yet.

Feature screening

After constructing about 1500 features, we filtered them (dropping features whose nunique is 1, whose missing rate is greater than 0.95, or whose correlation is greater than 0.98), leaving about 1000 features; the online score increased by 0.005+.

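def correlation(data, threshold):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr_matrix = data.corr().abs()
    col_corr = set()
    # Scan the lower triangle so that each pair of columns is checked once.
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if corr_matrix.iloc[i, j] > threshold:
                # Flag the later column of the pair as a candidate to drop.
                col_corr.add(corr_matrix.columns[i])
                break
    return list(col_corr)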
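The other two filters mentioned above (nunique equal to 1 and missing rate greater than 0.95) are not shown in the original code. A minimal sketch is given below; the function name basic_filter and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd


def basic_filter(data, max_missing_rate=0.95):
    """Return columns that are constant or mostly missing (candidates to drop)."""
    # Columns with at most one distinct non-missing value carry no signal.
    nunique = data.nunique()
    constant = nunique[nunique <= 1].index.tolist()
    # Fraction of missing values per column.
    missing_rate = data.isna().mean()
    mostly_missing = missing_rate[missing_rate > max_missing_rate].index.tolist()
    return sorted(set(constant) | set(mostly_missing))


# Toy feature table: one constant column, one sparse column, one useful column.
sparse = np.full(100, np.nan)
sparse[:2] = [1.0, 2.0]  # 98% of the values are missing
data = pd.DataFrame({
    "f_const": np.ones(100),
    "f_sparse": sparse,
    "f_ok": np.arange(100, dtype=float),
})

to_drop = basic_filter(data)
filtered = data.drop(columns=to_drop)
print(to_drop)
```

The columns returned by correlation(filtered, 0.98) could then be dropped in the same way.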

For more details on improving the score, please refer to the feature extraction section of "Machine Learning Algorithm Competition Actual Combat".

The organizing committee of the competition has prepared practical gifts for the Weekly Star winners, including Tsinghua campus fashion printed T-shirts, the Mechanical Revolution Yao·C510 three-mode wireless game controller, a Logitech ultra-thin quiet wireless Bluetooth keyboard, and a Logitech wireless Bluetooth mouse. Don't hesitate any longer; invite your friends to compete together!

After a week of optimization and tuning, new teams keep appearing on this week's leaderboard, and registration has entered its countdown. Keep pushing toward the semi-finals! Contestants can still download the training and test data from the designated website and submit results online. The registration deadline is 12:00 on July 24.

Welcome to learn more (data pie THU menu bar - competition entrance)

• Competition official website: http://nercbds.tsinghua.edu.cn/bdc/

• Contest mini program: Kesai

• Contest email: [email protected]

• Contest QQ group: 762146461 / 901317172


Origin blog.csdn.net/tMb8Z9Vdm66wH68VX1/article/details/131798871