Tencent anti black/gray market operations: from new word discovery to black jargon and polysemous word recognition

Author: lorenzwang, Tencent TEG Security Engineer

Common Chinese NLP downstream tasks generally start from word segmentation (except for algorithms built around the transformer), taking the embedding of each word as the model input. In the black/gray-market domain, however, this approach runs into a problem: a large number of black-market/jargon words matter for the downstream task but do not appear in common dictionaries, so the segmenter cannot reliably cut them out as separate words. Examples include the "714" high-interest loan schemes exposed at this year's 315 Gala, the slang word rendered here as "hole," and some post-00s slang that a new-hire training instructor mentioned, such as "扩列" ("expand the friend list") and "暖说说" ("warm posts").

 

The author's Waterproof Wall (防水墙) team integrates multiple heterogeneous data sources and has industry-leading capabilities in identifying black-market groups, penetrating threats, and confronting black/gray-market operations. As the foundation of these capabilities, identifying jargon / black-market words is critical. This article describes some of our solutions for new word discovery and polysemy.

 

1 New word discovery

There are many methods for new word discovery; this article describes a relatively simple and effective scheme: cohesion + freedom (how solidly the characters of a string stick together internally, plus how freely the string combines with its surrounding context).

 

Let us first define the problem:

What kind of character string can form a "word" in Chinese semantics?

 

1.1 A word has rich left and right context

A Chinese word, as a basic semantic unit, has a prominent feature: it is flexible and can be used in many different scenarios. For example, "人工智能" (artificial intelligence) is a word, and on both sides it can combine with many verbs and nouns: "learn artificial intelligence knowledge", "work in the artificial intelligence industry", "take an artificial intelligence course", "xx based on artificial intelligence", "artificial intelligence gives xx", "artificial intelligence identifies xx", "artificial intelligence achieves xx". But for the fragment "人工智" (the word minus its final character "能"), although many words can still precede it, what follows it is almost always "能".

 

Another example, from the author's hometown: Linyi (临沂). "Living in Linyi", "building a beautiful Linyi", "Linyi pancakes", "Linyi airport" — the word has abundant left and right context. But for the single character "沂" (Tips: yí, second tone), what follows can be "pancake", "airport", "people", and so on, while what precedes it is, with high probability, only "临", plus a handful of other rare words.

 

Readers can pause here and check for themselves whether some other commonly used words fit this observation.

 

The richness of the left and right context can be measured with information entropy:

 

Entropy is a measure of information content: H(X) = -∑ p(x) · log p(x). The higher the entropy, the higher the uncertainty and the harder it is to predict the outcome.

 

Let's get a feel for this with a concrete example. Suppose we have a coin whose probability of landing heads up after a toss is x, with x taking values 0, 0.1, ..., 1:

[Figure: the relationship between the value of x and the entropy]

 

As the figure shows, when x = 0 or x = 1 the entropy is 0: at that point we can perfectly predict whether a casually tossed coin will land heads up. The entropy is maximal at x = 0.5, where the probability of correctly guessing the outcome of a flip is at its lowest.
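To get the same curve numerically, here is a minimal sketch (the original figure is not reproduced here, so treat the printed values as illustrative):

import numpy as np

def coin_entropy(p):
    """Shannon entropy (in bits) of a coin that lands heads up with probability p."""
    if p == 0.0 or p == 1.0:
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

for x in np.linspace(0.0, 1.0, 11):   # x = 0, 0.1, ..., 1
    print('x = %.1f  entropy = %.3f bits' % (x, coin_entropy(x)))
# the entropy is 0 at x = 0 and x = 1, and peaks at 1 bit when x = 0.5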

 

Mapping this back to a word's left and right context: for strings whose context is not rich, we can guess the preceding or following character with fairly high probability. For example, for "珠穆朗" (the first three characters of 珠穆朗玛峰, Mount Everest), the next character is "玛" with high probability, so we say its right information entropy is small (the smaller the entropy, the higher the certainty). For "沂", the preceding character is "临" with high probability, so we say its left information entropy is small.

 

The richer a string's left and right context, the larger its left (and right) information entropy. In general, we take the minimum of the left entropy and the right entropy (the reason for this will become clear below).

 

1.2 A word's internal cohesion must be high enough

We said above that a qualified word needs rich left and right context (otherwise there is no need to treat it as an indivisible basic semantic unit). But is satisfying that condition enough? Let's look at an example.

小明在学校的演唱会上看到了小红 (roughly: Xiao Ming saw Xiao Hong at the school's concert)

 

Among them, "the school" almost wild, "concert" is, do not believe look at:

xx / school / xx: He often trouble at school, art exhibitions held at the Museum School

xx / concert / xx: Jay's concert great success of the concert too hard

 

However, and "concert" Obviously not a word on our intuition, why is there "school" this situation? Because "the" and "and" that appears in the Chinese too frequent ( "too often" is not also in line with rich above, below this?)

 

Clearly, a qualified word needs more than rich external context; its interior must also satisfy certain conditions. As mentioned above, a word is a basic semantic unit, which means that under normal circumstances it should not be subdivided further; in other words, the inside of the word must be relatively stable, i.e. have a relatively high degree of internal cohesion. Stable means hard to split apart — but how do we measure "hard to split apart"?

 

Starting with the conclusion: we use (pointwise) mutual information to measure the internal cohesion of a word:

PMI(x, y) = log [ p(x, y) / (p(x) · p(y)) ] = log [ p(x|y) / p(x) ] = log [ p(y|x) / p(y) ]. If x and y are completely independent, then p(x, y) = p(x) · p(y) and the expression above equals 0.

 

For example, suppose there is an article introducing Linyi with a total of 100 characters, in which "临沂" appears once, "临" appears once, and "沂" appears once. Then p(临, 沂) = 1/100, p(临) = 1/100, p(沂) = 1/100, so PMI(临, 沂) = log[(1/100) / (1/100 × 1/100)] = log 100, which is large.

 

From this we can see that the inside of "临沂" is quite stable: the two characters have a strong "centripetal force" pulling them together and little "centrifugal force" pulling them apart.

 

1.3 Summary

Based on the two points above, we can draw the following conclusion: for a string to count as a word, it needs relatively rich external left and right context, and its internal cohesion generally needs to be high enough that it cannot be split apart.

 

Therefore, we can design an indicator along the following lines:

Choose a score appropriate to the actual text that combines the two quantities (for example, PMI plus the minimum of the left and right entropy); alternatively, you can set separate thresholds on the left entropy, the right entropy, and the PMI to filter candidate words.
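As a concrete illustration (not our production code), the sketch below scores one candidate string on a toy corpus using exactly these two quantities; the corpus, the candidate, and any thresholds you would apply are all made up for the example:

import math
from collections import Counter

def entropy(counter):
    """Shannon entropy of a Counter of neighbouring characters."""
    total = sum(counter.values())
    return -sum(c / total * math.log(c / total) for c in counter.values())

def score_candidate(text, w):
    """PMI and left/right entropy of the candidate string w inside text.
    Toy version: probabilities are estimated over the characters of a single text,
    and the PMI splits w into its first character vs. the rest (a fuller version
    would take the min/max over every split point)."""
    n = len(text)
    p_w = text.count(w) / n
    p_a, p_b = text.count(w[0]) / n, text.count(w[1:]) / n
    pmi = math.log(p_w / (p_a * p_b))
    # characters immediately to the left / right of every occurrence of w
    left = Counter(text[i - 1] for i in range(1, n) if text.startswith(w, i))
    right = Counter(text[i + len(w)] for i in range(n - len(w)) if text.startswith(w, i))
    return pmi, entropy(left), entropy(right)

text = '住在临沂，大美临沂，临沂煎饼，临沂机场，临沂人'   # toy corpus
pmi, left_e, right_e = score_candidate(text, '临沂')
print(pmi, min(left_e, right_e))
# a real pipeline sweeps all k-grams (k <= 5) over a large corpus and keeps those
# whose PMI and min(left entropy, right entropy) both exceed tuned thresholds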

 

1.4 New word discovery process

  • Generate candidate words

In this step we split the text into character bigrams, trigrams, ..., k-grams (usually k <= 5). For example, a phrase such as "新词发现与多义词的解决方案" ("new word discovery and a solution to polysemy") yields the bigrams: ["新词", "词发", "发现", "现与", "与多", "多义", "义词", "词的", "的解", "解决", "决方", "方案"].

If an existing segmentation tool were used in this step, many candidate words would already be split apart here, and the corresponding terms could not be identified later.

  • Calculate candidate word scores

For each candidate string above, we calculate the corresponding score (the PMI and left/right entropy from section 1.3), then filter against a suitable threshold to obtain the new words.

  • Add the new words to the segmentation dictionary

Add the words obtained in the steps above to the segmenter's word dictionary; in jieba this is done as follows:

jieba.add_word("德玛西亚")

 

When segmentation is performed afterwards, the corresponding new words are kept intact.

  • Calculate the embedding vector of each word

 

Any method can be used here, such as word2vec, GloVe, BERT, and so on.

  • Based on black seed words, calculate the similarity between the new words (or all words) and the black seed words, and screen out black words

 

For example, with drug-related seed words, we eventually found that "溜冰" ("ice skating"), a word that looks harmless to humans and animals alike, has high similarity to the drug-related seeds, so we can guess that it is a newly discovered jargon word in that domain. A minimal sketch of this screening step follows.
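The screening itself is just cosine similarity between embeddings. A minimal sketch, assuming word_vectors is a dict mapping each word to its numpy vector (e.g. exported from the word2vec model mentioned above) and with hand-picked, purely illustrative seed words:

import numpy as np

def cosine(v1, v2):
    return float(v1.dot(v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

seed_words = ['吸毒', '贩毒', '毒品']   # illustrative drug-related seeds

def screen_black_words(word_vectors, candidates, threshold=0.6):
    """Keep candidates whose best similarity to any seed word exceeds the threshold."""
    hits = []
    for w in candidates:
        if w not in word_vectors:
            continue
        sims = [cosine(word_vectors[w], word_vectors[s])
                for s in seed_words if s in word_vectors]
        if sims and max(sims) > threshold:
            hits.append((w, max(sims)))
    return sorted(hits, key=lambda x: -x[1])

# screen_black_words(word_vectors, new_words) might then surface '溜冰'
# even though the word looks harmless on the surface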

 

2 Polysemous jargon words

First, let us look at an example:

溜冰 (ice skating)

For innocent students, 溜冰 is simply gliding on ice.

For students who know the trade, 溜冰 is "gliding" of a different kind: slang for taking methamphetamine ("ice").

In other words, "this ice is not that ice, and this gliding is not that gliding."

 

Unlike "714", "扩列", and "暖说说", which are new words that need to be discovered (strictly speaking, "扩列" has by now become a common word), "溜冰" is an existing word whose meaning changes only in particular scenes, as in the following two sentences:

Going 溜冰 (skating) with friends at the weekend

溜冰 with friends in the rented apartment at the weekend

 

The input to an NLP task is generally the embedding vector of each word. The new word discovery step in the previous section ensures that, after segmentation, the black words / jargon in a text can be found as tokens at all.

Static word embeddings generated by word2vec cannot solve polysemy; BERT and the like can, but for a pure new-word-discovery / black-word-screening task they feel like overkill. So we chose to try ELMo, even though by now BERT dominates the field.

 

2.1 What is a static word vector

The generation process of static word vectors: train a language model, and use the model's hidden state for a word as that word's representation. Given a sequence of N tokens (t1, t2, ..., tN), a forward language model predicts the k-th token from the preceding k-1 tokens, i.e. the training objective is p(t1, t2, ..., tN) = ∏_{k=1}^{N} p(tk | t1, ..., t(k-1)).

 

Word vectors generated by common methods such as word2vec and GloVe are static: somewhat counter-intuitively, these methods assign each word a single fixed embedding vector, regardless of the context it appears in.

 

For the "溜冰" example above, it clearly ought to have different vectors in different contexts.

 

2.2 ELMo's solution

With ELMo, a word no longer gets a fixed embedding vector; instead, training produces a language model (LM below always refers to the language model, not likelihood maximization). Given the context, this LM dynamically generates an embedding for each word.

 

A few more words of explanation here.

From the point of view of the result, the finally generated embedding vector is a deterministic quantity, not a random one.

"Dynamic" here means that for each different input context the embedding is different; it is dynamic from the point of view of the process. ELMo's core idea is precisely to adjust a word's embedding dynamically according to its context.

 

Therefore, ELMo uses a typical two-stage process: in the first stage, a language model is pre-trained; in the second stage, when doing the downstream task, the word embeddings extracted from the pre-trained language model are added to the downstream task as new features.

 

The reason for spelling this out is that when we first used ELMo, we ignored the distinction between the first and second stages and assumed that the word embeddings obtained directly from the first-stage language model training could be used as ELMo's output. They cannot. After the first stage finishes, each word does have an embedding, but it is only an intermediate product; using it directly gives noticeably worse results, as we learned the painful way!

 

This article does not go into ELMo theory in detail; the next two sections cover ELMo pre-training and its practical application.

 

2.3 ELMo stage one - ELMo pre-training

Working from the source code, the steps are as follows:

  1. New word discovery, word segmentation, and word2vec training: obtain the word vectors (emb), the corpus, and the vocabulary (vocab)

  2. Modify the code

  3. Train the model and obtain vocab_embedding.hdf5

  4. Dump the model weights (weights.hdf5)

 

2.3.1 New word discovery and word segmentation

Analogous to the first part of this article: first run new word discovery, add the discovered new words to the segmentation dictionary, then segment the corpus.

 

1) Word vector part

Pseudo-code:

 

import jieba
from pyspark.ml.feature import Word2Vec

for item in new_words:          # new_words: words found by the new word discovery step
    jieba.add_word(item)

stopwords = ...                 # load the stopword list here

# df: a DataFrame with a single text column (pseudo-code, adapted);
# segment each text and drop stopwords, so each row becomes one list of tokens
df_mid = df.rdd.map(
    lambda x: ([w for w in jieba.lcut(x[0]) if w not in stopwords],)
).toDF(['sentence'])

w2v = Word2Vec(vectorSize=64, inputCol='sentence')   # 64 is an illustrative dimension
model = w2v.fit(df_mid)
word_df = model.getVectors()

 

The above gives the word2vec vector for each word. Since the vocabulary built later must be ordered by word frequency from highest to lowest, and the words in the vocabulary must correspond one-to-one with the vectors in the word vector file, after obtaining the word vectors here we sort the word vector file in descending order of word frequency.

 

The first line of the word vector file contains the number of words and the vector dimension, in the form below; all separators are spaces, not '\t'.
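As a purely illustrative example (made-up words, counts, and numbers), such a file could start like this:

100000 64
的 0.1275 -0.0331 0.2014 ... 0.0457
临沂 0.0921 0.1187 -0.0775 ... 0.1502
溜冰 -0.0412 0.2233 0.0168 ... -0.0893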

 

2) Corpus. The corpus here is the segmented corpus: each line is one piece of text, with words separated by spaces:

 

 

3) Vocabulary. To correspond with the word vector file, the words must be arranged in descending order of their frequency in the corpus. Given that the word vector file in 1) has already been sorted by word frequency, here we simply write the words out in the same order.

 

The vocabulary must begin with <S>, </S>, <UNK>, and it is case sensitive, in the following form:
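That is, a vocab.txt would begin like this (the words after the three special tokens are illustrative and must be in descending frequency order, matching the vector file):

<S>
</S>
<UNK>
的
临沂
溜冰
...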

The code for parts 1) and 3) is as follows:

 

# word_count and dim must already hold the vocabulary size and the vector dimension;
# each input line has the form: word \t [v1, v2, ..., vdim]
vocab = []
with open('trans_data/word_vectors', 'r', encoding='utf-8') as f:
    with open('trans_data/vectors.txt', 'w', encoding='utf-8') as fout:
        fout.write(str(word_count) + ' ' + str(dim) + '\n')
        with open('trans_data/vocab.txt', 'w', encoding='utf-8') as fvocab:
            fvocab.write('<S>')
            fvocab.write('\n')
            fvocab.write('</S>')
            fvocab.write('\n')
            fvocab.write('<UNK>')
            fvocab.write('\n')
            for line in f:
                x = line.split('\t')
                tmp = x[1]
                tmp = tmp.strip()
                tmp = tmp.lstrip('[')
                tmp = tmp.rstrip(']\n')
                tmp = tmp.replace(',', ' ')
                vocab.append(x[0])
                item = x[0] + ' ' + tmp + '\n'
                fout.write(item)
                fvocab.write(x[0] + '\n')

 

2.3.2 Modify the code

 

├── bilm                    # model code directory
│   ├── __init__.py
│   ├── data.py             # data preparation entry point
│   ├── elmo.py             # weighted sum of the different ELMo layer outputs
│   ├── model.py            # bidirectional language model structure
│   └── training.py         # model architecture / training
├── bin                     # training scripts directory
│   ├── dump_weights.py
│   ├── restart.py
│   ├── run_test.py
│   └── train_elmo.py       # training entry point
├── tests
│   ├── test_data.py
│   ├── test_elmo.py
│   ├── test_model.py
│   └── test_training.py
└── usage_token.py          # usage example

 

The most important files are bilm/training.py and bin/train_elmo.py.

 

1) Modify the training parameters in bin/train_elmo.py

 

In practice, the main parameters to modify are the following (a sketch of what these edits can look like is given right after the list):

1. batch_size

2. epoch (the number of training epochs)

3. n_gpus and CUDA_VISIBLE_DEVICES

4. n_train_tokens

5. projection_dim, which determines the dimension of the ELMo output vector
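As an illustration only: the snippet below follows the shape of the options dict in the open-source bilm-tf bin/train_elmo.py, but both the field names and the values should be checked against the version of the code you actually use.

# inside bin/train_elmo.py -- illustrative values, not recommendations
n_gpus = 4                                    # together with CUDA_VISIBLE_DEVICES
batch_size = 128
options['n_epochs'] = 80
options['n_train_tokens'] = n_train_tokens    # total number of tokens in the corpus
options['lstm']['projection_dim'] = 64        # final ELMo dim = 2 * projection_dim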

 

2) Modify the LanguageModel class in bilm/training.py

 

Pass in the word2vec word vectors generated above via initializer=tf.constant_initializer(tmp_embed), and then look up the embedding of each word in the batch with tf.nn.embedding_lookup.
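A sketch of what that change looks like inside the word-embedding code of the LanguageModel class in bilm/training.py; the surrounding variable names are approximate and should be checked against the actual source, and tmp_embed is assumed to be the (vocab size, projection_dim) numpy array loaded from the sorted word vector file built earlier.

# inside LanguageModel._build_word_embeddings (approximate)
self.embedding_weights = tf.get_variable(
    "embedding", [n_tokens_vocab, projection_dim],
    dtype=DTYPE,
    initializer=tf.constant_initializer(tmp_embed),   # w2v vectors instead of random init
)
# look up the embedding of every token id in the batch
self.embedding = tf.nn.embedding_lookup(self.embedding_weights, self.token_ids)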

 

3) Save the vocabulary embedding for the second stage (this produces vocab_embedding.hdf5)

 

4) Output the loss during training

Initially, the code does not print a loss for each batch, so early on it is impossible to tell whether the model is converging. A simple fix is to periodically print train_perplexity to the log, for example along the lines sketched below.
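For example, something like this inside the batch loop of bilm/training.py (illustrative; the variable holding the perplexity depends on what the surrounding sess.run call actually fetches):

# inside the training batch loop -- illustrative
if batch_no % 100 == 0:
    print('Batch %d, train_perplexity = %.4f' % (batch_no, train_perplexity_))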

 

5) Print more information

The original code outputs little detail about training; we recommend printing key information at each important step so that subsequent tuning can be targeted.

 

2.3.3 Train the model and get vocab_embedding.hdf5

nohup python3 -u bin/train_elmo.py \
    --train_prefix='/data/home/xxxx/elmo_data/trans_data/corpus.txt' \
    --vocab_file /data/home/xxxx/elmo_data/trans_data/vocab.txt \
    --save_dir /data/home/xxxx/bilm-tf/output_dir \
    > /data/home/xxxx/bilm-tf/output_dir/bilm_out.txt 2>&1 &

 

where:

nohup: keep the process running after the shell exits
train_elmo.py: the main program entry point
train_prefix: path to the corpus
vocab_file: path to the vocabulary
save_dir: output path for the training logs, checkpoints, and options.json

 

The output files are as follows:

Besides these, after training completes we have also obtained one of the most important outputs (see item 3) in section 2.3.2): vocab_embedding.hdf5

 

2.3.4 get weights.hdf5

The previous step produced the checkpoint (ckpt) files; next we dump the model weights:

nohup python3 -u bin/dump_weights.py \
    --save_dir /data1/home/xxxx/bilm-tf/output_dir \
    --outfile /data1/home/xxxx/elmo_data/trans_out/weights.hdf5 \
    > /data1/home/xxxx/elmo_data/trans_out/bilm_out_weights.txt 2>&1 &

 

save_dir: the path where the checkpoints above were saved

outfile: the output path for the model weights

 

2.3.5 Summary

vocab_embedding is only an initial embedding of the vocabulary, not the final ELMo output! Not the final ELMo output! Not the final ELMo output!

 

weights.hdf5 holds the weights of the trained language model

 

With:

  • vocab_embedding.hdf5

  • weights.hdf5

  • options.json 

 

ELMo is ready to run!

 

BTW: for this training run we used a total of 93,246,334 pieces of corpus text (the texts are short; after segmentation and filtering, the average length is about 10 words). Peak CPU memory usage was around 50 GB, and running 80 epochs on 4 Tesla K40m cards took about 10 hours.

 

2.4 ELMo stage two - obtaining ELMo embeddings for the corpus

2.4.1 Code and running process

The source code ships usage_token.py as an example of the second stage, though it is not a complete example; the code below is adapted from it and should be read as pseudo-code:

import tensorflow as tf
import os
import numpy as np
from bilm import TokenBatcher, BidirectionalLanguageModel, weight_layers, dump_token_embeddings

# modify these paths according to your actual setup
vocab_file = '/data/home/xxxx/elmo_data/trans_data/vocab4.txt'
options_file = '/data/home/xxxx/bilm-tf/output_dir/options.json'
weight_file = '/data/home/xxxx/elmo_data/trans_out/weights_8.hdf5'
token_embedding_file = '/data/home/xxxx/elmo_data/trans_out/vocab_embedding_8.hdf5'

tokenized_context = [
    ['吸毒', '溜冰', '贩毒', '吸毒', '贩毒', '吸毒', '毒品', '吸毒'],
    ['定期', '组织', '吸毒', '活动', '贩毒', '制毒', '毒品', '情况', '溜冰', '吸毒'],
    ['星期天', '中午', '组队', '体育场', '文化宫', '溜冰', '热爱', '轮滑', '溜友', '踊跃报名', '参加']
]

 

# Create a TokenBatcher to map text to token ids.
batcher = TokenBatcher(vocab_file)

# Input placeholders to the biLM.
context_token_ids = tf.placeholder('int32', shape=(None, None))

# Build the biLM graph.
bilm = BidirectionalLanguageModel(
    options_file,
    weight_file,
    use_character_inputs=False,
    embedding_weight_file=token_embedding_file
)

# Get ops to compute the LM embeddings.
context_embeddings_op = bilm(context_token_ids)
elmo_context_output = weight_layers('output', context_embeddings_op, l2_coef=0.0)

with tf.Session() as sess:
    # It is necessary to initialize variables once before running inference.
    sess.run(tf.global_variables_initializer())

    # Create batches of data.
    context_ids = batcher.batch_sentences(tokenized_context)

    # Compute ELMo representations (here for the output).
    elmo_context_output_ = sess.run(
        elmo_context_output['weighted_op'],
        feed_dict={context_token_ids: context_ids}
    )
    print('elmo_context_output_:')
    print(elmo_context_output_.shape)
    print(elmo_context_output_)

    # ---------------- elmo_context_output_ is the real ELMo output ----------------

    # sentence similarities
    d1, d2, d3 = elmo_context_output_.shape
    # d1 = 3, d2 = 11, d3 = 128; d2 is the length of the longest sentence in the batch
    # 128 = projection_dim * 2 (ELMo concatenates the forward and backward language
    # models, so the final dimension is 128)
    sentence_vector_output = np.array([]).reshape(0, 128)
    for i in range(d1):
        # sum the embeddings of all tokens in sentence i
        tmp_vec_out = np.sum(elmo_context_output_[i, :, :], axis=0)
        sentence_vector_output = np.vstack([sentence_vector_output, tmp_vec_out])
        print(str(i) + "th sentence_vector_output: ")
        print(sentence_vector_output)
    print('output result')

    # pairwise similarities between the three sentences
    for i in range(d1):
        vec1 = sentence_vector_output[i, :]
        for j in range(i + 1, d1):
            vec2 = sentence_vector_output[j, :]
            num = vec1.dot(vec2.T)
            denom = np.linalg.norm(vec1) * np.linalg.norm(vec2)
            cos = num / denom
            print(str(i) + ' ' + str(j) + ' ' + str(cos))

    # similarity of the word "溜冰" across the three sentences
    # elmo_context_output_[0, 1, :] is the 2nd token of the 1st sentence,
    # elmo_context_output_[1, 8, :] is the 9th token of the 2nd sentence,
    # elmo_context_output_[2, 5, :] is the 6th token of the 3rd sentence,
    # which are exactly the positions of "溜冰" in each sentence
    print('0 1')
    num = elmo_context_output_[0, 1, :].dot(elmo_context_output_[1, 8, :].T)
    denom = np.linalg.norm(elmo_context_output_[0, 1, :]) * np.linalg.norm(elmo_context_output_[1, 8, :])
    print(num / denom)
    print('1 2')
    num = elmo_context_output_[1, 8, :].dot(elmo_context_output_[2, 5, :].T)
    denom = np.linalg.norm(elmo_context_output_[1, 8, :]) * np.linalg.norm(elmo_context_output_[2, 5, :])
    print(num / denom)
    print('0 2')
    num = elmo_context_output_[0, 1, :].dot(elmo_context_output_[2, 5, :].T)
    denom = np.linalg.norm(elmo_context_output_[0, 1, :]) * np.linalg.norm(elmo_context_output_[2, 5, :])
    print(num / denom)

 

Output:

 

The first output above shows the pairwise similarities between the three sentences:

 

The second output shows the similarity of "溜冰" across the three sentences: the similarity of 溜冰 between sentences 1 and 2 is the highest, while its similarity between sentences 1 and 3 and between 2 and 3 is lower, which at first glance matches our expectation.

 

If the code above is saved as sen2vec.py, running it takes a single step:

python3 sen2vec.py

 

2.4.2 Summary

In practical applications, you can run all candidate text through ELMo once in advance and save the generated embeddings into a Hadoop table, so that they can be looked up at any time; this is more efficient (a rough sketch follows).
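A rough sketch of that offline pass, reusing the session and ops from the code above and assuming a Spark environment (the table and column names are made up):

# run every candidate sentence through the ELMo graph, then persist one row per
# (sentence_id, token, vector) to a Hive table for later lookup
rows = []
for sid, tokens in enumerate(tokenized_corpus):      # tokenized_corpus: list of token lists
    ids = batcher.batch_sentences([tokens])
    emb = sess.run(elmo_context_output['weighted_op'],
                   feed_dict={context_token_ids: ids})   # shape (1, len(tokens), 128)
    for pos, tok in enumerate(tokens):
        rows.append((sid, tok, emb[0, pos, :].tolist()))

spark.createDataFrame(rows, ['sentence_id', 'token', 'elmo_vec']) \
     .write.mode('overwrite').saveAsTable('security.elmo_token_emb')   # hypothetical table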

 

This can also be combined with the new word discovery described above, with better results.

 

"Water wall" is built by a team of Tencent security covering finance, advertising, electricity providers, new retail and other sectors of security products, to create in the financial sector covering anti-fraud, anti-money laundering, counter-intelligence collection and early warning of the risk of the whole process product matrix; in the field of advertising, to provide anti-cheat flow, Kingsman, content monitoring and KOL selection and other services; new retail sector, covering production, distribution, sales and other core aspects of risk control for the super, shoes, cosmetic and other KA security services provide the whole process, the depth of the problem solving wool party. Waterproof wall can provide protection such as registration, login protection, authentication code, anti-brush and other services activities, internal and external customers currently provide 50 billion daily security + times, more details can be found: https://007.qq.com



Origin blog.csdn.net/sinat_26811377/article/details/104579556