[Kaggle] Text Normalization Challenge - English Language

非關內文

月底了,心中一直想著要再記錄一下,於是我終於抽空寫了十天前就結束的Text Normalization Challenge 的心得(其實只是懶 =3=)。
這項競賽一樣是在Kaggle上發起的競賽,根據競賽描述,我們要設計能將文章語句轉換成口說語法的機器學習模型,舉個例子:

Example 1.
原文: A baby giraffe is 6ft tall and weighs 150lb.
轉換: A baby giraffe is six feet tall and weighs one hundred fifty pounds sil

Example 2.
原文: $22,750
轉換: twenty two thousand seven hundred fifty dollars

Example 3.
原文: September 5, 1895
轉換: september fifth eighteen ninety five

心得

根據競賽的定義,在英文項目底下的文字總共可以分成17個類別,比如日期、數字、地址…,這些類別裡,文字的轉換是有跡可循的,但是偶爾會出現不明確的轉換,比如同樣是1972,在日期類別底下會轉換成nineteen seventy two,但是在數值類別的話就是one thousand nine hundred seventy two。

因為有用英文寫了一下大概的解題思路,因此這邊也就懶的翻譯了,直接上原文說明:


My solution is based on BingQing Wei’s public kernel, then I use several step to optimized it:

1. Use xgboost to predict test cases’ class:
The model is similar to XGboost With Context Label Data (ACC: 99.637%) (the author is also BingQing Wei, big thanks to his work)

In addition, I use extra xgboost model to predict a 4 digit number is ‘DATE’ or ‘CARDINAL’.

2. For some class, use customized normalize function to deal with it:
I treat MEASURE, DATE, MONEY, DECIMAL, CARDINAL, and DIGIT. Because they have specific form. Each customized normalize function can reach from 98.9% to 99.7% acc. (But my customized normalize function can’t handle some rare case, such like Sept. 21th 2017. I’m wondering that did the top team have smarter way.)

For example, to deal with the ‘DECIMAL’ class. I will use a function to normalized it.

       
       
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
       
       
def (key):
# 100% acc if change
if(len(key.split()) == 2):
# e.g. 0.21 million
unit_words = [ 'hundred', 'thousand', 'million', 'billion']
if( not is_decimal(key.split()[ 0])):
return key
else:
if(((key.split()[ 1]).lower() in unit_words):
return decimal2word(key.split()[ 0]) + ' ' + (key.split()[ 1]).lower()
else:
return key
else:
if( not is_decimal(key)):
return key
digit_dict = { '0': 'o', '1': 'one', '2': 'two', '3': 'three', '4': 'four', '5': 'five', '6': 'six', '7': 'seven', '8': 'eight', '9': 'nine'}
out = []
if(key[ 0] == '.'):
# e.g. .021 to point o two one
out.append( 'point')
for v in key.replace( '.', ''):
out.append(digit_dict[v])
else:
n1, n2 = str(int(key.split( '.')[ 0])), key.split( '.')[ 1]
out.append(digit2word(n1))
out.append( 'point')
if(len(n2) == 1 and n2[ 0] == '0'):
out.append( 'zero')
else:
for v in n2:
out.append(digit_dict[v])
word = ' '.join(out)
return word

3. Use xgboost to deal with binary ambiguous case:
Binary ambiguous case such like the ‘-‘ and the ‘:’, which have two target norm, original char and ‘to’. With xgboost model, it’s able to handel ~98% precision and ~99.3% recall.

最後我以99.32%的accuacy完成了英文項目的競賽,算是滿意了,而且透過閱讀前十名參賽者的解題思路,也學習到了很多新的、有趣的技術..不過還是要等下個機會實際應用,才能更得其精隨。

改天有空再發俄文版的解題思路,俄文的語法真的有夠難 (╯°Д°)╯ ┻━┻ 。

完結灑花!
To be continued..

原文:大专栏  [Kaggle] Text Normalization Challenge - English Language


猜你喜欢

转载自www.cnblogs.com/chinatrump/p/11597058.html