Looking back on my first CCF data algorithm competitions (results: 1st place, top 2%, top 8%). See you again next year, CCF!

I only started learning machine learning and deep learning this year, mostly ML rather than DL, and I did not get into the theory until the summer vacation. Along the way I discovered DataWhale, a fairly large open-source community in China; I learned a great deal from them and got to know many experienced players. I gradually came to understand competitions, moving from zero-background Kaggle introductions to Tianchi and then on to the CCF contests. In just a few months I have built up a reasonable understanding of data science competitions, and I hope to place near the front in later contests and in next year's CCF. This season I entered six CCF competitions in total: five official competitions and one practice competition.

Four were structured-data competitions and two were NLP competitions. I only followed three of them through the whole process; one more took up some of my time, and for the remaining two I merely clicked to enter and never submitted a result.

Below I introduce, one by one, the competitions I followed through from start to finish, and summarize some of the knowledge I picked up from them.



1. Classification of indoor user motion time series data

Problem page: Classification of indoor user motion time series data.
My teammates and I worked on this problem together and took first place. I had also written a baseline for this competition beforehand.

1. Data introduction

Motivated by practical needs and by the progress of deep learning, this practice competition aims to build a general time series classification algorithm. The task is to establish an accurate time series classification model, and participants are encouraged to explore more robust ways of expressing time series features.

In truth, the problem statement is rather vague and gives no concrete explanation of the data, so at the beginning everyone simply threw the raw data into a model and submitted the result, which already earned a decent score.

2. Data description

[Figure: data field descriptions]

3. Takeaways

This was also my first contact with a time series problem of this kind, and at the start I had no idea where to begin. With the practical material our teammates shared, we began constructing features, appended them to the original data, and trained on the combined set. This worked well, and the score improvement was considerable.

So when a competition gives you few raw features, you have to learn to dig new features out of the data you are given, for example:

# summary-statistic features on one sensor axis (here data.x)
max_X = data.x.max()
min_X = data.x.min()
range_X = max_X - min_X
var_X = data.x.var()
std_X = data.x.std()
mean_X = data.x.mean()
median_X = data.x.median()
kurtosis_X = data.x.kurtosis()
skewness_X = data.x.skew()
Q25_X = data.x.quantile(q=0.25)
Q75_X = data.x.quantile(q=0.75)

# aggregation features (omitted here)

# first-difference features: the same statistics on the 1-step difference
diff1_x = data.x.diff(1)
max_diff1_x = diff1_x.max()
min_diff1_x = diff1_x.min()
range_diff1_x = max_diff1_x - min_diff1_x
var_diff1_x = diff1_x.var()
std_diff1_x = diff1_x.std()
mean_diff1_x = diff1_x.mean()
median_diff1_x = diff1_x.median()
kurtosis_diff1_x = diff1_x.kurtosis()
skewness_diff1_x = diff1_x.skew()
Q25_diff1_X = diff1_x.quantile(q=0.25)
Q75_diff1_X = diff1_x.quantile(q=0.75)

And so on; in total we constructed more than 30 new features.

I also learned model stacking from this competition. Combining the newly constructed features with stacking produced a particularly large score boost. I cannot speak for other competitions, but it certainly worked here! A minimal sketch follows.
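For reference, here is a minimal stacking sketch built on scikit-learn's StackingClassifier. This is not my competition code; the synthetic data stands in for the real feature matrix, and the choice of base learners is only illustrative.

# minimal stacking sketch (synthetic data as a stand-in for the real features)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=1000, n_features=30, random_state=42)

# level-0 base learners; their out-of-fold predictions feed the meta-learner
estimators = [
    ('rf', RandomForestClassifier(n_estimators=200, random_state=42)),
    ('lgb', LGBMClassifier(n_estimators=200, random_state=42)),
]
stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),  # level-1 meta-learner
    cv=5,  # out-of-fold predictions come from 5-fold cross-validation
)
print(cross_val_score(stack, X, y, cv=3).mean())

The key point is that the meta-learner trains on out-of-fold predictions of the base models, which keeps it from simply memorizing their training fit.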

2. Risk prediction of illegal fund-raising by enterprises

This competition provided quite a lot of official data, and how to extract useful features from the many tables was the key to it! I also shared a baseline earlier, improved on top of Shui Ge's baseline; it ranked 36 on leaderboard A and 79 on leaderboard B, which is decent overall.

1. Data introduction

The dataset covers about 25,000 enterprises, of which roughly 15,000 have labels and form the training set, with the remainder used as the test set. The data consists of basic enterprise information, annual reports, tax records, and so on, and includes many (desensitized) data types such as numeric, character, and date fields. Some fields are missing for some enterprises. The first column, id, is the unique identifier of the enterprise.

2. Data description

As an example, take the first table, base_info.csv, which contains the basic information of all enterprises involved in datasets 7 and 8. Each row holds the basic data of one enterprise and has 33 columns; the id column is the enterprise's unique identifier, and the columns are separated by commas.

The data format is as follows:
[Figure: sample rows of base_info.csv]

import pandas as pd
import missingno as msno  # missing-value visualisation package

# read the data
base_info = pd.read_csv(PATH + 'base_info.csv')
# print the shape and the number of distinct enterprise ids
print(base_info.shape, base_info['id'].nunique())
# peek at the first row
base_info.head(1)
# visualise missing values per column
msno.bar(base_info)

Result graph:
[Figure: missingno bar chart of base_info]
This chart makes it clear which columns contain missing values: the horizontal axis lists the features, the vertical axis shows the count of non-missing values, and the white area in each bar represents the missing values.

3. Takeaways

3.1 Feature selection and construction

Many of the tables provided in this competition contain missing values. You need a working understanding of missing-value handling, feature selection and construction, feature crossing, binning, and so on, because good features are the foundation of the model training that follows. For feature processing, you can also refer to an article I wrote earlier.

from datetime import datetime  # needed for datetime.now() below

# orgid: organisation id; oplocdistrict: administrative division code; jobid: position id
# flag whether the 6-digit district prefixes of these ids agree
base_info['district_FLAG1'] = (base_info['orgid'].fillna('').apply(lambda x: str(x)[:6]) == \
    base_info['oplocdistrict'].fillna('').apply(lambda x: str(x)[:6])).astype(int)
base_info['district_FLAG2'] = (base_info['orgid'].fillna('').apply(lambda x: str(x)[:6]) == \
    base_info['jobid'].fillna('').apply(lambda x: str(x)[:6])).astype(int)
base_info['district_FLAG3'] = (base_info['oplocdistrict'].fillna('').apply(lambda x: str(x)[:6]) == \
    base_info['jobid'].fillna('').apply(lambda x: str(x)[:6])).astype(int)

# parnum: number of partners; exenum: number of executors; empnum: number of employees
base_info['person_SUM'] = base_info[['empnum', 'parnum', 'exenum']].sum(1)
base_info['person_NULL_SUM'] = base_info[['empnum', 'parnum', 'exenum']].isnull().astype(int).sum(1)

# regcap: registered capital; congro: total investment
# (ratio features that were tried and left commented out)
# base_info['regcap_DIVDE_empnum'] = base_info['regcap'] / base_info['empnum']
# base_info['regcap_DIVDE_exenum'] = base_info['regcap'] / base_info['exenum']

# base_info['reccap_DIVDE_empnum'] = base_info['reccap'] / base_info['empnum']
# base_info['regcap_DIVDE_exenum'] = base_info['regcap'] / base_info['exenum']

# base_info['congro_DIVDE_empnum'] = base_info['congro'] / base_info['empnum']
# base_info['regcap_DIVDE_exenum'] = base_info['regcap'] / base_info['exenum']

base_info['opfrom'] = pd.to_datetime(base_info['opfrom'])  # opfrom: operating period start
base_info['opto'] = pd.to_datetime(base_info['opto'])      # opto: operating period end
base_info['opfrom_TONOW'] = (datetime.now() - base_info['opfrom']).dt.days
base_info['opfrom_TIME'] = (base_info['opto'] - base_info['opfrom']).dt.days

# opscope: business scope; count the number of listed business items
base_info['opscope_COUNT'] = base_info['opscope'].apply(lambda x: len(x.replace("\t", ",").replace("\n", ",").split('、')))

# process the categorical features
cat_col = ['oplocdistrict', 'industryphy', 'industryco', 'enttype',
           'enttypeitem', 'enttypeminu', 'enttypegb',
           'dom', 'oploc', 'opform', 'townsign']
# add a frequency-count feature, and map categories seen fewer than 10 times to -1
for col in cat_col:
    base_info[col + '_COUNT'] = base_info[col].map(base_info[col].value_counts())
    col_idx = base_info[col].value_counts()
    for idx in col_idx[col_idx < 10].index:
        base_info[col] = base_info[col].replace(idx, -1)

# TF-IDF over opscope (tried and left commented out; it needs
# from sklearn.feature_extraction.text import TfidfVectorizer)
# base_info['opscope'] = base_info['opscope'].apply(lambda x: x.replace("\t", " ").replace("\n", " ").replace(",", " "))
# clf_tfidf = TfidfVectorizer(max_features=200)
# tfidf = clf_tfidf.fit_transform(base_info['opscope'])
# tfidf = pd.DataFrame(tfidf.toarray())
# tfidf.columns = ['opscope_' + str(x) for x in range(200)]
# base_info = pd.concat([base_info, tfidf], axis=1)

base_info = base_info.drop(['opfrom', 'opto'], axis=1)  # drop the raw date columns

# integer-encode the remaining string columns
for col in ['industryphy', 'dom', 'opform', 'oploc']:
    base_info[col] = pd.factorize(base_info[col])[0]

I added the meaning of each field as a comment in the code, which makes the fields easier to understand and to process; I recommend doing the same. Binning was mentioned above but not shown, so a small sketch follows.
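As a minimal, hypothetical binning sketch (the choice of the regcap column and of four quantile bins is my own illustration, not taken from the competition code):

# hypothetical example: equal-frequency binning of registered capital
# labels=False returns integer bin codes; duplicates='drop' guards against
# repeated quantile edges in heavily skewed columns
base_info['regcap_BIN'] = pd.qcut(base_info['regcap'], q=4, labels=False, duplicates='drop')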

3.2 Model selection

As for model choice, most participants pick several of the popular gradient-boosting models for ensembling, such as XGBoost, LightGBM, and CatBoost.

Each of these models on its own already does well in many competitions, and combining their results can work even better. Of course, their relative performance differs from competition to competition, and conveniently these boosting models handle missing values in the features by themselves!

Still, I recommend studying the theory behind these ensemble methods; it will be a great help in your later study and competitions. A simple blending sketch follows.
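As a minimal sketch of blending the three boosters (synthetic data stands in for the engineered features, and the equal weights are my assumption, not a tuned competition blend):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# placeholder data; in the competition this would be the engineered base_info features
X, y = make_classification(n_samples=2000, n_features=40, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = [
    LGBMClassifier(n_estimators=300, random_state=0),
    XGBClassifier(n_estimators=300, random_state=0),
    CatBoostClassifier(iterations=300, verbose=0, random_state=0),
]

# average the predicted probabilities of the three boosters
preds = [m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in models]
blend = np.mean(preds, axis=0)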

3. Intelligent discovery and classification of data content for data security governance

This was an NLP competition. To be honest, I have not read much NLP theory, so for me this competition counts as an introductory learning exercise. I did not use deep learning methods, only traditional machine learning; the laboratory does not have the hardware for deep learning, and I have only a superficial understanding of those models.

1. Data introduction

(1) Labeled data: 7,000 documents in total, covering 7 categories (finance, real estate, home furnishing, education, technology, fashion, and current affairs), with 1,000 documents per category.
(2) Unlabeled data: 33,000 documents in total.
(3) Classification test data: 20,000 documents in total, covering 10 categories: finance, real estate, home furnishing, education, technology, fashion, current affairs, games, entertainment, and sports.

2. Data description

Broadly this competition is a classification task, but one difficulty is that the organizers only provided training data for 7 categories while you must predict 10, so the labels for the other three categories have to be produced by you and then trained together with the rest! One illustrative way to bootstrap them is sketched below.
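This is my own sketch of one possible bootstrap, not necessarily what the top teams did: use simple keyword rules over the unlabeled corpus to harvest pseudo-labels for the three unseen classes, then add the confident matches to the training set. The keyword lists are illustrative assumptions.

# bootstrap pseudo-labels for the three unseen classes with keyword rules
# (the keyword lists below are illustrative assumptions, not competition code)
KEYWORDS = {
    "games": ["游戏", "玩家", "电竞"],
    "entertainment": ["电影", "明星", "综艺"],
    "sports": ["比赛", "球队", "夺冠"],
}

def bootstrap_label(doc):
    """Return a pseudo-label only when exactly one class's keywords match."""
    hits = [cls for cls, words in KEYWORDS.items() if any(w in doc for w in words)]
    return hits[0] if len(hits) == 1 else None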

3. Takeaways

This was my first contact with NLP, and as a complete beginner I did not understand much of the theory, so I got through this competition somewhat opportunistically. Even so, I learned some new concepts for handling text data.

Idea 1: TF-IDF + machine learning classifier
Extract features from the text directly with TF-IDF and classify with a traditional classifier; the classifier can be an SVM, logistic regression, XGBoost, and so on. A minimal sketch follows.
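A minimal sketch of this idea with scikit-learn; the toy English texts stand in for the real Chinese corpus, which you would first tokenize (for example with jieba):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy corpus standing in for the tokenized competition documents
texts = ["stocks rallied on the earnings report", "the new flat sold above asking price"]
labels = ["finance", "real estate"]

clf = make_pipeline(TfidfVectorizer(max_features=5000), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["mortgage rates for new apartments fell"]))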

Idea 2: FastText
FastText is an entry-level word-vector approach; with the FastText tool open-sourced by Facebook, you can build a classifier quickly. A minimal sketch follows.
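A minimal sketch, assuming the fasttext Python package and a train.txt in fastText's supervised format, where each line reads "__label__<class> <tokenized text>":

import fasttext

# train a supervised text classifier; epoch and wordNgrams are common knobs
model = fasttext.train_supervised(input="train.txt", epoch=25, wordNgrams=2)
print(model.predict("tokenized document text"))  # returns (labels, probabilities)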

Idea 3: Word2Vec + deep learning classifier
Word2Vec is a more advanced word vector, and classification is completed by building a deep learning classifier on top of it; the network structure can be a TextCNN, TextRNN, or BiLSTM. A sketch of the word-vector half follows.
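A minimal gensim sketch of the word-vector half (gensim 4.x API; the deep classifier on top is not shown):

from gensim.models import Word2Vec

# toy tokenized corpus standing in for the real one
sentences = [["股市", "上涨"], ["球队", "夺冠"]]
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
print(w2v.wv["股市"][:5])  # first 5 dimensions of a learned word vector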

Idea 4: BERT word vectors
BERT provides top-tier contextual word vectors, with powerful modeling and learning capability. A minimal feature-extraction sketch follows.
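A minimal sketch of extracting contextual vectors from a pretrained BERT with Hugging Face transformers; bert-base-chinese is an example model choice for this Chinese corpus:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

# encode one example document and pull the contextual token vectors
inputs = tokenizer("示例文本", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size=768)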

4. Summary

This was my first time taking part in data science competitions of this kind, and I gained a great deal. Competitions let you grow quickly, exchange ideas with other students, and pick up a lot of new knowledge. Some things you may have only seen in theory and never tried in practice; through competitions you learn skills and methods that theory alone never shows you. My first CCF season has ended successfully. On to other competitions, and I will be back for next year's CCF!

Recording time: December 7, 2020
