A roundup of open-source NLP dictionaries and tools

Foreword

With the popularity of pre-trained models such as BERT, ERNIE, and XLNet, solving an NLP problem without a pre-trained model can feel a bit outdated. That impression is obviously wrong.

As we all know, pre-trained models consume a lot of computing power in both training and inference, and they depend heavily on GPU resources. Yet many NLP problems can be solved with nothing more than a dictionary plus rules, and forcing a heavyweight model onto them is like shooting mosquitoes with an anti-aircraft gun.

So Xiao Xi dug through a rather crazy GitHub repo and carefully picked out 45 of the more practical open-source gadgets and dictionaries, so that when you build NLP systems or need a helper during "alchemy" (model training), you can rely a little less on models and compute and write more small, beautiful code.

Repo address:

https://github.com/fighting41love/funNLP

Note: This repo is clearly a labor of love and contains more than 300 items, but the quality is mixed, so remember to compare alternatives before picking one.


Come, get a feel for it m(_ _)m

Heaven knows how I made it through all 300 entries
(╯ ° □ °) ╯︵ ┻━┻

1. textfilter: Chinese and English sensitive word filtering

repo: observerss/textfilter

 >>> f = DFAFilter()
 >>> f.add("sexy")
 >>> f.filter("hello sexy baby")
 hello **** baby

The sensitive words cover topics such as politics and profanity. Filtering works mainly by dictionary lookup (the keyword files are in the project), and the content is very "halal".
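The dictionary-lookup principle can be sketched with a small trie that is walked like a DFA. This is a minimal illustration only: the class name `TrieFilter` and its details are assumptions made here, not the code from observerss/textfilter.

```python
class TrieFilter:
    """Toy keyword filter: a trie of lowercase keywords, walked character by character."""

    _END = object()  # sentinel key marking the end of a keyword

    def __init__(self):
        self.root = {}

    def add(self, word):
        """Insert one keyword into the trie (case-insensitive)."""
        node = self.root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node[self._END] = True

    def filter(self, text, mask="*"):
        """Replace every keyword occurrence with mask characters (longest match wins)."""
        out = []
        i = 0
        while i < len(text):
            node = self.root
            match_len = 0
            j = i
            # Follow trie edges as long as the next character has a transition.
            while j < len(text) and text[j].lower() in node:
                node = node[text[j].lower()]
                j += 1
                if self._END in node:
                    match_len = j - i
            if match_len:
                out.append(mask * match_len)
                i += match_len
            else:
                out.append(text[i])
                i += 1
        return "".join(out)


f = TrieFilter()
f.add("sexy")
print(f.filter("hello sexy baby"))
```

Each position in the text is scanned once against the trie, so the cost depends on the text length and the longest keyword rather than on the number of keywords.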

2. langid: language detection for 97 languages

repo: saffsd/langid.py

pip install langid

>>> import langid
>>> langid.classify("This is a test")
('en', -54.41310358047485)

3. langdetect: another language detector

Address: https://code.google.com/archive/p/language-detection

pip install langdetect

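from langdetect import detect
from langdetect import detect_langs

s1 = "本篇博客主要介绍两款语言探测工具,用于区分文本到底是什么语言,"
s2 = 'We are pleased to introduce today a new technology'
s3 = s1 + s2    # s3 was not defined in the original snippet; a mixed-language string is assumed here
print(detect(s1))
print(detect(s2))
print(detect_langs(s3))    # detect_langs() returns every detected language with its probability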

Note: The language codes in the output mainly follow the ISO 639-1 language code standard; see the ISO 639-1 entry on Baidu Encyclopedia for details.

Compared with langid above, its accuracy is lower but it runs faster.

4. phone: home-region lookup for Chinese mobile numbers:

repo: ls0f/phone

Already integrated into the Python package cocoNLP

from phone import Phone
p  = Phone()
p.find(18100065143)
# returns {'phone': '18100065143', 'province': '上海', 'city': '上海', 'zip_code': '200000', 'area_code': '021', 'phone_type': '电信'}

Supported number prefixes: 13*, 15*, 18*, 14[5,7], 17[0,6,7,8]

Number of records: 360569 (updated: April 2017)

The author also provides the data file phone.dat so that non-Python users can load the data.

5. phone: international mobile and landline number lookup:

repo: AfterShip/phone

npm install phone

import phone from 'phone';
phone('+852 6569-8900'); // return ['+85265698900', 'HKG']
phone('(817) 569-8900'); // return ['+18175698900', 'USA']

6. ngender: guess gender from a name:

repo: observerss/ngender

The probability is computed with Naive Bayes

pip install ngender

>>> import ngender
>>> ngender.guess('赵本山')
('male', 0.9836229687547046)
>>> ngender.guess('宋丹丹')
('female', 0.9759486128949907)

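7. Regular expression for extracting email addresses

Already integrated into the Python package cocoNLP

import re
# ^ and $ anchor the pattern, so `text` must be a single candidate string;
# the inner group is non-capturing so findall returns the whole match.
email_pattern = r'^[*#\u4e00-\u9fa5 a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\.[a-zA-Z0-9]{2,6}$'
emails = re.findall(email_pattern, text, flags=0)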

8. Regular expression for extracting mobile phone numbers

Already integrated into the Python package cocoNLP

import re
# Non-capturing group so findall returns the whole number rather than tuples of groups.
cellphone_pattern = r'^(?:13[0-9]|14[0-9]|15[0-9]|17[0-9]|18[0-9])\d{8}$'
phoneNumbers = re.findall(cellphone_pattern, text, flags=0)

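9. Regular expression for extracting ID card numbers

import re
# Month and day groups are non-capturing so findall returns the whole 18-character ID.
IDCards_pattern = r'^[1-9]\d{5}[12]\d{3}(?:0[1-9]|1[012])(?:0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX]$'
IDs = re.findall(IDCards_pattern, text, flags=0)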
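The three patterns in items 7 to 9 are anchored with ^ and $, so each matches only when the whole input is one email, phone number, or ID. To pull such items out of running text, one option is to drop the anchors and guard the digits with lookarounds. The sketch below makes that assumption; the simplified email local part and the name `extract_contacts` are choices made here, not cocoNLP's implementation.

```python
import re

# Unanchored variants (assumptions for illustration, not cocoNLP's patterns).
EMAIL_RE = re.compile(
    r'[A-Za-z0-9_.-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z0-9]{2,6}'
)
# 11 digits starting with 13/14/15/17/18, not embedded in a longer digit run.
CELLPHONE_RE = re.compile(r'(?<!\d)(?:13|14|15|17|18)\d{9}(?!\d)')
# 18-character ID: region, birth date (YYYYMMDD), sequence, check character.
ID_RE = re.compile(
    r'(?<!\d)[1-9]\d{5}[12]\d{3}(?:0[1-9]|1[012])'
    r'(?:0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX](?![0-9xX])'
)


def extract_contacts(text):
    """Return all emails, mobile numbers and ID numbers found in free text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": CELLPHONE_RE.findall(text),
        "ids": ID_RE.findall(text),
    }


sample = "Contact alice@example.com or 13812345678; ID 11010519491231002X."
print(extract_contacts(sample))
```

All groups are non-capturing, so `findall` returns whole matches. The patterns check format only; they do not verify that a number or ID is actually valid.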

10. Name corpus:

repo: wainshine/Chinese-Names-Corpus

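Name extraction has been added to the Python package cocoNLP

Chinese names (modern and ancient), Japanese names, Chinese surnames and given names, forms of address (eldest maternal aunt, youngest maternal aunt, etc.), English names rendered in Chinese (e.g. 李约翰), and an idiom dictionary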

(Can be used for Chinese word segmentation and name recognition)

11. Chinese abbreviation library:

repo: zhangyics/Chinese-abbreviation-dataset

全国人大: 全国/n 人民/n 代表大会/n
中国: 中华人民共和国/ns
女网赛: 女子/n 网球/n 比赛/vn
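Lines in the format shown above are easy to load into a lookup table. The parser below is a small sketch that assumes every entry looks like `abbreviation: word/pos word/pos ...`; the function name is hypothetical and the actual files in the dataset may be laid out differently.

```python
def parse_abbreviation_line(line):
    """Split 'abbr: w1/pos1 w2/pos2' into (abbr, [(w1, pos1), (w2, pos2)])."""
    abbr, _, expansion = line.partition(":")
    tokens = []
    for item in expansion.split():
        # rpartition keeps any '/' inside the word and takes the last part as the POS tag.
        word, _, pos = item.rpartition("/")
        tokens.append((word, pos))
    return abbr.strip(), tokens


abbr, tokens = parse_abbreviation_line("女网赛: 女子/n 网球/n 比赛/vn")
print(abbr, "".join(word for word, _ in tokens))
```

Joining the words gives the full form, so a dictionary from abbreviation to full form can be built in one pass over the file.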

12. Chinese character decomposition dictionary:

repo: kfcd/chaizi

Character    Decomposition 1    Decomposition 2    Decomposition 3
拆    手 斥    扌 斥    才 斥

13. Sentiment values for words:

repo: rainarch/SentiBridge

山泉水    充沛    0.400704566541    0.370067395878
视野    宽广    0.305762728932    0.325320747491
大峡谷    惊险    0.312137906517    0.378594957281

14. Chinese word lists: stop words and sensitive words

repo: dongxiexidian/Chinese

The sensitive-word lists in this package are categorized in more detail:

Reactionary word list, sensitive word list statistics, terrorism word list, livelihood-issues word list, pornography word list

15. Chinese character to pinyin:

repo: mozillazg/python-pinyin

Useful for text correction

16. Traditional and Simplified Chinese conversion:

repo: skydark/nstools

17. Engine that simulates Chinese pronunciation with English

repo: tinyfool/ChineseWithEnglish

say wo i ni
# says: 我爱你 (I love you)

In effect, it uses English phonetic spelling to approximate Chinese pronunciation.

18. Synonym, antonym and negation word lists:

repo: guotong1988/chinese_dictionary

19. Chinese character data

repo: skishore/makemeahanzi

  • Simplified / Traditional Chinese Character Stroke Order

  • Vector strokes

20. Splitting English strings without spaces into words:

repo: keredson/wordninja

>>> import wordninja
>>> wordninja.split('derekanderson')
['derek', 'anderson']
>>> wordninja.split('imateapot')
['im', 'a', 'teapot']

21. Regular expression of IP address:

(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)
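As a usage sketch, the octet pattern above can be compiled once and applied with `re.fullmatch` to validate a whole string. The helper name `is_ipv4` is an assumption made here for illustration.

```python
import re

# One octet, taken from the pattern above: 250-255, 200-249, 000-199, or 0-99.
OCTET = r'(?:25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)'
# Four octets separated by dots; re.ASCII keeps \d to the digits 0-9.
IPV4_RE = re.compile(rf'{OCTET}(?:\.{OCTET}){{3}}', re.ASCII)


def is_ipv4(candidate):
    """Return True if the whole string is a dotted-quad IPv4 address."""
    return IPV4_RE.fullmatch(candidate) is not None


for value in ("192.168.1.1", "256.1.1.1", "1.2.3"):
    print(value, is_ipv4(value))
```

Using `fullmatch` matters: with `search`, an input such as 256.1.1.1 would still yield a partial match on 56.1.1.1.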

22. Regular expression of Tencent QQ number:

[1-9]([0-9]{5,11})

23. Regular expression of domestic fixed-line number:

[0-9-()()]{7,18}

24. User name regular expression:

[A-Za-z0-9_\-\u4e00-\u9fa5]+
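The three patterns in items 22 to 24 can be used the same way for form validation. A minimal sketch, assuming whole-string validation is what is wanted; the helper name `is_valid` is hypothetical, and the hyphen in the landline character class is escaped here only for readability.

```python
import re

QQ_RE = re.compile(r'[1-9][0-9]{5,11}')            # 6 to 12 digits, no leading zero
LANDLINE_RE = re.compile(r'[0-9\-()()]{7,18}')   # digits, hyphens, ASCII or full-width parentheses
USERNAME_RE = re.compile(r'[A-Za-z0-9_\-\u4e00-\u9fa5]+')  # letters, digits, _, -, CJK characters


def is_valid(pattern, value):
    """Return True if the whole value matches the compiled pattern."""
    return pattern.fullmatch(value) is not None


print(is_valid(QQ_RE, "123456"))
print(is_valid(LANDLINE_RE, "010-12345678"))
print(is_valid(USERNAME_RE, "xiao_xi-01"))
```

Note that the landline pattern only restricts the character set and length, so it also accepts strings such as a run of hyphens; it works as a coarse first filter rather than a strict check.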

25. g2pC: context-aware grapheme-to-phoneme conversion module for Chinese

repo: Kyubyong/g2pC

26. Time extraction:

Already integrated into the Python package cocoNLP

Test run at 9:44 on June 7, 2016, with the following results

Hi,all。下周一下午三点开会

>> 2016-06-13 15:00:00-false

周一开会

>> 2016-06-13 00:00:00-true

下下周一开会

>> 2016-06-20 00:00:00-true

Java version:
https://github.com/shinyke/Time-NLP

Python version:
https://github.com/zhanzecheng/Time_NLP

27. Fast conversion between Chinese numerals and Arabic numerals

repo: HaveTwoBrush/cn2an

  • Converts between Chinese and Arabic numerals

  • Mixed Chinese and Arabic numerals: under development
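For a sense of what the conversion involves, here is a toy converter for positive integers written with 十/百/千/万/亿. It is a sketch under simplifying assumptions (well-formed input, no decimals, no financial numerals) and is not cn2an's implementation; the function name is made up for this example.

```python
DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
BIG_UNITS = {"万": 10 ** 4, "亿": 10 ** 8}


def chinese_to_int(text):
    """Convert a well-formed Chinese numeral such as 一百二十三 to an int."""
    total = 0    # value already scaled by 万 / 亿
    section = 0  # value of the current group below 10^4
    digit = 0    # last digit read, waiting for its unit
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare 十 (as in 十五) counts as one ten.
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch in BIG_UNITS:
            section += digit
            if ch == "亿":
                # 亿 scales everything read so far (handles 一万亿).
                total = (total + section) * BIG_UNITS[ch]
            else:
                total += section * BIG_UNITS[ch]
            section = 0
            digit = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit


for s in ("一百二十三", "两千零二十", "一亿二千万"):
    print(s, chinese_to_int(s))
```

The three-level accumulator (digit, section, total) mirrors how the numerals are read aloud: digits attach to 十/百/千 within a group, and each group attaches to 万 or 亿.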

28. Company name corpus

repo: wainshine/Company-Names-Corpus

29. Ancient poetry corpus

repo: panhaiqi/AncientPoetry

A more comprehensive library of ancient poems:
https://github.com/chinese-poetry/chinese-poetry

30. Word lists compiled by THU

repo: http://thuocl.thunlp.org/

These have been collected in the data folder of the funNLP repo.

Word lists for IT, finance, idioms, place names, historical figures, poetry, medicine, food, law, automobiles, and animals

31. PDF table data extraction tool

repo: camelot-dev/camelot

32. Regular expressions for domestic mobile phone numbers (the three major carriers, virtual operators, etc.)

repo: VincentSit/ChinaMobilePhoneNumberRegex

33. Username blacklist:

repo: marteinn/The-Big-Username-Blacklist

Contains a list of prohibited user names, such as:

administrator
administration
autoconfig
autodiscover
broadcasthost
domain
editor
guest
host
hostmaster
info
keybase.txt
localdomain
localhost
master
mail
mail0
mail
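Using such a list at sign-up time is a simple set lookup. The sketch below hard-codes the sample entries shown above rather than loading the repo's file, and the function name `is_username_allowed` is an assumption for illustration.

```python
# Sample entries copied from the excerpt above; a real check would load the full list.
BLACKLIST = frozenset({
    "administrator", "administration", "autoconfig", "autodiscover",
    "broadcasthost", "domain", "editor", "guest", "host", "hostmaster",
    "info", "keybase.txt", "localdomain", "localhost", "master",
    "mail", "mail0",
})


def is_username_allowed(username, blacklist=BLACKLIST):
    """Return False if the normalized username appears in the blacklist."""
    return username.strip().lower() not in blacklist


print(is_username_allowed("Administrator"))
print(is_username_allowed("xiaoxi"))
```

Normalizing case and surrounding whitespace before the lookup keeps trivial variants such as "Admin " from slipping past the check.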

34. Microsoft's multilingual package for recognizing numbers, units, dates and times:

repo: Microsoft/Recognizers-Text

35. chinese-xinhua: Xinhua dictionary database and API, including common xiehouyu (two-part allegorical sayings), idioms, words and Chinese characters

repo: pwxcoo/chinese-xinhua

36. Automatic generation of document graphs

repo: liuhuanyong/TextGrapher

  • TextGrapher: a text content grapher based on key-information extraction with NLP methods. Given a document, it extracts the key information, structures it, and organizes it into a graph that displays the semantic content of the article visually

37. Library of number names in 186 languages

repo: google/UniNum

38. Traditional and Simplified Conversion

repo: berniey/hanziconv

39. Chinese character feature extractor (featurizer), which extracts features of Chinese characters (pronunciation features, glyph features) for use as deep learning inputs

repo: howl-anderson/hanzi_char_featurizer

40. Chinese Abbreviation Data Set

repo: zhangyics/Chinese-abbreviation-dataset

41. Wudao-dict: a command-line version of Youdao Dictionary that supports English-Chinese lookup in both directions and online queries

repo: ChestnutHeng/Wudao-dict

42. The best Chinese numeral to Arabic numeral conversion tool

repo: Wall-ee/chinese2digits

43. LineFlow: an efficient NLP data loader for all deep learning frameworks

repo: tofunlp/lineflow

44. Parsing and converting natural language numeric strings to integers and floating-point numbers

repo: jaidevd/numerizer

45. Large list of English swear words

repo: zacanger/profane-words

In addition, the repo contains many datasets, but they are also rather messy. Xiao Xi skips them here; readers who need them can browse the repo.

Also, a GitHub repo for following the Stanford CS224n (Winter 2020) course has been set up. Anyone struggling with the homework or with keeping up with the videos can find notes, summaries, assignments and projects contributed by others in that repo.

It also serves as an open assignment-review platform: submit your own assignments and course notes to the repo, and teaching assistants and other participants will help review them. Hurry up and submit & star

Origin blog.csdn.net/xixiaoyaoww/article/details/105037084