Foreword
With the popularity of pre-trained models such as BERT, ERNIE, and XLNet, an NLP solution that doesn't involve a pre-trained model can feel a bit outdated. But that impression is clearly wrong.
As we all know, pre-trained models consume a great deal of compute for both training and inference and depend heavily on GPU resources. Yet plenty of NLP problems can be solved with nothing more than a dictionary plus some rules; forcing a heavyweight model onto them is like shooting a mosquito with an anti-aircraft gun.
So Xiao Xi dug through GitHub repos and hand-picked these 45 practical open-source gadgets and dictionaries, so that when building NLP systems, or assisting the "alchemy" of model training, you can depend a little less on models and compute, and keep the code small and beautiful.
Repo address:
https://github.com/fighting41love/funNLP
Note: funNLP is a painstakingly assembled repo containing more than 300 items, but the quality is mixed, so remember to compare alternatives before choosing.
Come, feel it m(_ _)m
Heaven knows how I plowed through those 300 repos
(╯ ° □ °) ╯︵ ┻━┻
1. textfilter: Chinese and English sensitive word filtering
repo: observerss/textfilter
>>> f = DFAFilter()
>>> f.add("sexy")
>>> f.filter("hello sexy baby")
hello **** baby
The sensitive words cover topics such as politics and profanity. The filtering is essentially a dictionary lookup (see the keyword file in the project), which keeps the output squeaky clean.
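The dictionary-lookup ("DFA") approach can be sketched in a few lines of pure Python. This is only an illustration of the idea, not the library's actual code; the trie layout and function names here are my own:

```python
# Minimal sketch of a dictionary/trie ("DFA") sensitive-word filter.
# Not textfilter's real implementation -- just the core idea.

def build_trie(words):
    """Build a nested-dict trie; '\0' marks the end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node['\0'] = True
    return trie

def filter_text(text, trie, repl='*'):
    """Replace every dictionary word found in text with repl characters."""
    chars = list(text)
    i = 0
    while i < len(chars):
        node, j, end = trie, i, -1
        while j < len(chars) and chars[j].lower() in node:
            node = node[chars[j].lower()]
            j += 1
            if '\0' in node:
                end = j          # remember the longest match so far
        if end > 0:
            chars[i:end] = repl * (end - i)
            i = end
        else:
            i += 1
    return ''.join(chars)

trie = build_trie(['sexy'])
print(filter_text('hello sexy baby', trie))  # hello **** baby
```

Matching against a trie this way is linear in the text length regardless of how many keywords the dictionary holds, which is why the technique scales to large sensitive-word lists.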
2. langid: detects 97 languages
repo: saffsd/langid.py
pip install langid
>>> import langid
>>> langid.classify("This is a test")
('en', -54.41310358047485)
3. langdetect: another language detection
Address: https://code.google.com/archive/p/language-detection
pip install langdetect
from langdetect import detect
from langdetect import detect_langs

s1 = "本篇博客主要介绍两款语言探测工具,用于区分文本到底是什么语言,"
s2 = 'We are pleased to introduce today a new technology'
print(detect(s1))
print(detect(s2))
print(detect_langs(s1))  # detect_langs() outputs all detected languages and their proportions
Note: the returned language codes follow the ISO 639-1 standard; see the ISO 639-1 reference for details.
Compared with langid above, langdetect is less accurate but faster.
4. phone: China mobile phone number attribution lookup:
repo: ls0f/phone
Already integrated into the Python package cocoNLP.
from phone import Phone

p = Phone()
p.find(18100065143)
# return {'phone': '18100065143', 'province': '上海', 'city': '上海', 'zip_code': '200000', 'area_code': '021', 'phone_type': '电信'}
Supported number segments: 13*, 14[5,7], 15*, 17[0,6,7,8], 18*
Number of records: 360,569 (updated April 2017)
The author also provides the raw data file phone.dat so that non-Python users can load the data themselves.
5. phone: international mobile and landline number attribution lookup:
repo: AfterShip/phone
npm install phone
import phone from 'phone';
phone('+852 6569-8900'); // return ['+85265698900', 'HKG']
phone('(817) 569-8900'); // return ['+18175698900', 'USA']
6. ngender: guesses gender from a Chinese name:
repo: observerss/ngender
Computes probabilities with Naive Bayes.
pip install ngender
>>> import ngender
>>> ngender.guess('赵本山')
('male', 0.9836229687547046)
>>> ngender.guess('宋丹丹')
('female', 0.9759486128949907)
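The character-level Naive Bayes idea is easy to sketch. The frequency table below is entirely invented for illustration; ngender learns its probabilities from a real name corpus:

```python
# Toy sketch of the character-level Naive Bayes behind ngender.
# The tiny P(char | gender) table below is made up for illustration only.
import math

char_probs = {
    'male':   {'军': 0.020, '强': 0.018, '丹': 0.001, '娜': 0.0005},
    'female': {'军': 0.002, '强': 0.001, '丹': 0.015, '娜': 0.020},
}
priors = {'male': 0.5, 'female': 0.5}

def guess(given_name):
    # sum log-probabilities of each character under each gender
    scores = {}
    for g in priors:
        logp = math.log(priors[g])
        for ch in given_name:
            logp += math.log(char_probs[g].get(ch, 1e-6))  # smoothing for unseen chars
        scores[g] = logp
    best = max(scores, key=scores.get)
    # normalize the two log scores into a probability for the winner
    total = sum(math.exp(s) for s in scores.values())
    return best, math.exp(scores[best]) / total

print(guess('丹娜'))
```

The real package works the same way in spirit: per-character likelihoods from a name corpus, a prior, and a normalized posterior; only the dictionary is far larger.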
7. Regular expression for extracting email addresses
Already integrated into the Python package cocoNLP.
email_pattern = '^[*#\u4e00-\u9fa5 a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*\.[a-zA-Z0-9]{2,6}$'
emails = re.findall(email_pattern, text, flags=0)
8. Regular expression for extracting mobile phone numbers
Already integrated into the Python package cocoNLP.
cellphone_pattern = '^((13[0-9])|(14[0-9])|(15[0-9])|(17[0-9])|(18[0-9]))\d{8}$'
phoneNumbers = re.findall(cellphone_pattern, text, flags=0)
9. Regular expression for extracting ID-card numbers
IDCards_pattern = r'^([1-9]\d{5}[12]\d{3}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX])$'
IDs = re.findall(IDCards_pattern, text, flags=0)
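A quick runnable demo of the three patterns above. Note that they are anchored with ^...$, so they validate whole strings; to scan free-running text you would drop the anchors. The token list here is invented for the demo:

```python
import re

# The three patterns from items 7-9, verbatim. Because of the ^...$ anchors
# they match whole strings only, so we validate a list of candidate tokens.
email_pattern = r'^[*#\u4e00-\u9fa5 a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*\.[a-zA-Z0-9]{2,6}$'
cellphone_pattern = r'^((13[0-9])|(14[0-9])|(15[0-9])|(17[0-9])|(18[0-9]))\d{8}$'
IDCards_pattern = r'^([1-9]\d{5}[12]\d{3}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX])$'

tokens = ['foo@example.com', '13912345678', '110101199003077654', 'not-a-match']
emails = [t for t in tokens if re.match(email_pattern, t)]
phones = [t for t in tokens if re.match(cellphone_pattern, t)]
ids    = [t for t in tokens if re.match(IDCards_pattern, t)]
print(emails)  # ['foo@example.com']
print(phones)  # ['13912345678']
print(ids)     # ['110101199003077654']
```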
10. Name corpus:
repo: wainshine/Chinese-Names-Corpus
The name-extraction function has been added to the Python package cocoNLP.
Chinese names (modern and ancient), Japanese names, Chinese surnames and given names, kinship terms (大姨妈, 小姨妈, etc.), English-to-Chinese transliterated names (e.g. 李约翰), and an idiom dictionary
(Can be used for Chinese word segmentation and name recognition)
11. Chinese abbreviation library:
repo: zhangyics/Chinese-abbreviation-dataset
全国人大: 全国/n 人民/n 代表大会/n
中国: 中华人民共和国/ns
女网赛: 女子/n 网球/n 比赛/vn
12. Chinese character decomposition (拆字) dictionary:
repo: kfcd/chaizi
漢字  拆法(一)  拆法(二)  拆法(三)
拆    手 斥      扌 斥      才 斥
13. Vocabulary sentiment value:
repo: rainarch / SentiBridge
山泉水 充沛 0.400704566541 0.370067395878
视野 宽广 0.305762728932 0.325320747491
大峡谷 惊险 0.312137906517 0.378594957281
14. Chinese thesaurus, stop words, sensitive words
repo: dongxiexidian/Chinese
The sensitive-word lists in this package are categorized in finer detail:
political, sensitive-word statistics, terrorism, people's livelihood, and pornography lexicons
15. Chinese character to pinyin:
repo: mozillazg/python-pinyin
Useful in text correction.
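At its core, pinyin conversion is dictionary lookup (plus a phrase table for heteronyms). A toy sketch of the idea; the mini mapping here is invented, while the real python-pinyin ships full character and phrase dictionaries and handles heteronyms and tone styles:

```python
# Minimal sketch of dictionary-based hanzi -> pinyin conversion, the idea
# behind python-pinyin. The tiny mapping below is for illustration only.
PINYIN = {'中': 'zhōng', '文': 'wén', '拼': 'pīn', '音': 'yīn'}

def to_pinyin(text, sep=' '):
    # unknown characters pass through unchanged
    return sep.join(PINYIN.get(ch, ch) for ch in text)

print(to_pinyin('中文拼音'))  # zhōng wén pīn yīn
```

The hard part the real library solves is heteronyms (e.g. 重 as chóng vs zhòng), which requires phrase-level lookup rather than the per-character lookup shown here.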
16. Chinese Traditional-Simplified conversion:
repo: skydark/nstools
17. Engine that simulates Chinese pronunciation with English spelling
repo: tinyfool/ChineseWithEnglish
say wo i ni
# 说:我爱你 ("say: I love you")
In effect it uses English spellings as phonetic notation to approximate Chinese pronunciation.
18. Synonym, antonym, and negation lexicons:
repo: guotong1988/chinese_dictionary
19. Chinese character data
repo: skishore/makemeahanzi
- Stroke order for Simplified/Traditional Chinese characters
- Vector strokes
20. English string segmentation and word extraction without spaces:
repo: keredson/wordninja
>>> import wordninja
>>> wordninja.split('derekanderson')
['derek', 'anderson']
>>> wordninja.split('imateapot')
['im', 'a', 'teapot']
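wordninja-style splitting is a small dynamic program over word costs derived from a frequency-ranked wordlist. A toy sketch with an invented mini wordlist (the real tool ships an English list of ~125k words, but the algorithm is the same shape):

```python
# Sketch of the frequency-based dynamic-programming split used by tools
# like wordninja. The mini wordlist and Zipf-style costs are invented.
import math

WORDS = ['derek', 'anderson', 'im', 'a', 'teapot', 'tea', 'pot']
# earlier (more frequent) words get a lower cost
COST = {w: math.log((i + 1) * math.log(len(WORDS) + 1)) for i, w in enumerate(WORDS)}
MAXLEN = max(len(w) for w in WORDS)

def split(s):
    # best[i] = (total cost, split point) for the prefix s[:i]
    best = [(0.0, 0)] + [(float('inf'), 0)] * len(s)
    for i in range(1, len(s) + 1):
        for j in range(max(0, i - MAXLEN), i):
            w = s[j:i]
            if w in COST and best[j][0] + COST[w] < best[i][0]:
                best[i] = (best[j][0] + COST[w], j)
    # walk the split points backwards to recover the words
    out, i = [], len(s)
    while i > 0:
        j = best[i][1]
        out.append(s[j:i])
        i = j
    return out[::-1]

print(split('derekanderson'))  # ['derek', 'anderson']
print(split('imateapot'))      # ['im', 'a', 'teapot']
```

Because costs come from word frequency, the program prefers 'teapot' over 'tea' + 'pot' automatically; that is the whole trick.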
21. Regular expression of IP address:
(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)
22. Regular expression of Tencent QQ number:
[1-9]([0-9]{5,11})
23. Regular expression of domestic fixed-line number:
[0-9-()()]{7,18}
24. User name regular expression:
[A-Za-z0-9_\-\u4e00-\u9fa5]+
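A quick sanity check of the IPv4 pattern from item 21, wrapped in ^...$ here since the pattern as printed is unanchored:

```python
import re

# One octet of the IPv4 pattern from item 21 (0-255, no leading-zero quirks).
octet = r'(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)'
# Join four octets with literal dots and anchor for whole-string validation.
ipv4 = re.compile(r'^' + r'\.'.join([octet] * 4) + r'$')

print(bool(ipv4.match('192.168.0.1')))  # True
print(bool(ipv4.match('256.1.1.1')))    # False
```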
25. g2pC: context-aware pronunciation (grapheme-to-pinyin) tagger for Chinese
repo: Kyubyong/g2pC
26. Time extraction:
Already integrated into the Python package cocoNLP.
A test was run at 9:44 on 2016-06-07; the results are as follows:
Hi,all。下周一下午三点开会 ("meeting next Monday at 3 pm")
>> 2016-06-13 15:00:00-false
周一开会 ("meeting on Monday")
>> 2016-06-13 00:00:00-true
下下周一开会 ("meeting the Monday after next")
>> 2016-06-20 00:00:00-true
java version:
https://github.com/shinyke/Time-NLP
python version:
https://github.com/zhanzecheng/Time_NLP
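The relative-date arithmetic behind outputs like the above can be sketched with the standard library. This toy handles only the "next Monday" / "Monday after next" cases and is not Time-NLP's actual logic:

```python
# Toy sketch of rule-based relative-date resolution, the kind of logic a
# time-extraction tool implements after its rules match "下周一" etc.
from datetime import date, timedelta

def next_weekday(base, weekday, weeks_ahead=1):
    """Monday=0 ... Sunday=6; weeks_ahead=1 means 'next <weekday>',
    weeks_ahead=2 'the <weekday> after next'."""
    days = (weekday - base.weekday()) % 7 + 7 * (weeks_ahead - 1)
    if days == 0:
        days = 7   # 'next Monday' on a Monday still moves forward a week
    return base + timedelta(days=days)

base = date(2016, 6, 7)                      # the test date from the example
print(next_weekday(base, 0))                 # 下周一      -> 2016-06-13
print(next_weekday(base, 0, weeks_ahead=2))  # 下下周一    -> 2016-06-20
```

The real libraries pair such date arithmetic with a large battery of regex rules for absolute dates, times of day, durations, and holidays.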
27. Fast conversion between Chinese numerals and Arabic numerals
repo: HaveTwoBrush/cn2an
- Conversion between Chinese numerals and Arabic numerals
- Mixed Chinese and Arabic numerals (under development)
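The core of a Chinese-numeral converter like cn2an fits in a few lines. This sketch of the technique handles only simple positive integers up to 千 (no 万/亿 chaining, negatives, or decimals), and is not the library's code:

```python
# Minimal Chinese-numeral -> integer converter sketching the place-value
# accumulation a tool like cn2an performs. Simple cases only.
DIGITS = {'零': 0, '一': 1, '二': 2, '三': 3, '四': 4,
          '五': 5, '六': 6, '七': 7, '八': 8, '九': 9}
UNITS = {'十': 10, '百': 100, '千': 1000}

def cn_to_int(s):
    total, num = 0, 0
    for ch in s:
        if ch in DIGITS:
            num = DIGITS[ch]
        elif ch in UNITS:
            total += (num or 1) * UNITS[ch]   # bare 十 means 10
            num = 0
    return total + num

print(cn_to_int('一百二十三'))  # 123
print(cn_to_int('十五'))        # 15
```

The full problem needs two extra layers this sketch skips: large units (万, 亿) that multiply everything accumulated so far, and disambiguating colloquial forms like 两 and trailing 半.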
28. Company names corpus
repo: wainshine/Company-Names-Corpus
29. Ancient Poetry Thesaurus
repo: panhaiqi/AncientPoetry
A more comprehensive library of ancient poems:
https://github.com/chinese-poetry/chinese-poetry
30. Lexicons compiled by THU (THUOCL)
repo: http://thuocl.thunlp.org/
They have been organized into this repo's data folder.
Lexicons for IT, finance, idioms, place names, historical figures, poetry, medicine, food, law, automobiles, and animals
31. PDF table data extraction tool
repo: camelot-dev/camelot
32. Regular matching of domestic phone numbers (three major operators + virtual etc.)
repo: VincentSit/ChinaMobilePhoneNumberRegex
33. User name blacklist list:
repo: marteinn/The-Big-Username-Blacklist
Contains a list of prohibited user names, such as:
administrator
administration
autoconfig
autodiscover
broadcasthost
domain
editor
guest
host
hostmaster
info
keybase.txt
localdomain
localhost
master
mail
mail0
34. Microsoft's multilingual recognition package for numbers, units, dates, and times:
repo: Microsoft/Recognizers-Text
35. chinese-xinhua: Xinhua Dictionary database and API, covering xiehouyu (two-part allegorical sayings), idioms, words, and Chinese characters
repo: pwxcoo/chinese-xinhua
36. Automatic document graph generation
repo: liuhuanyong/TextGrapher
TextGrapher: a text-content grapher based on key-information extraction. Given a document, it extracts the key information, structures it, and organizes it into a graph to visualize the article's semantics.
37. A database of how numbers are written in 186 languages
repo: google/UniNum
38. Traditional and Simplified Conversion
repo: berniey/hanziconv
39. Chinese character featurizer: extracts pronunciation and glyph features of Chinese characters, for use as deep learning features
repo: howl-anderson/hanzi_char_featurizer
40. Chinese Abbreviation Data Set
repo: zhangyics/Chinese-abbreviation-dataset
41. Wudao-dict: a command-line version of Youdao Dictionary, supporting English-Chinese lookup in both directions and online queries
repo: ChestnutHeng/Wudao-dict
42. An excellent tool for converting between Chinese numerals and Arabic numerals
repo: Wall-ee/chinese2digits
43. LineFlow: NLP data efficient loader for all deep learning frameworks
repo: tofunlp/lineflow
44. Parsing and converting natural language numeric strings to integers and floating-point numbers
repo: jaidevd/numerizer
45. Large list of English swear words
repo: zacanger/profane-words
In addition, funNLP also contains quite a few datasets, though they are likewise a bit disorganized; Xiao Xi won't carry them all over here, so friends who need them can browse the repo themselves.
Also, a GitHub repo for following Stanford CS224n (winter 2020) has been set up. Friends who have trouble with the assignments or keeping up with the videos can find notes, summaries, reflections, assignments, and projects organized by fellow learners there.
It is also an open assignment-review platform: submit your own assignments and course notes to the repo, and teaching assistants and other learners will help review them. Hurry up and submit & star!