This article introduces three Python libraries that run locally and detect the language of a piece of text without any network access.
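1. Chardet
Chardet is a character-encoding auto-detection library for Python; for some encodings it also reports the language. In the example, encode() converts the string to bytes in the given encoding, because chardet.detect() expects bytes rather than str.
Install: pip install chardet
Detect the language of a local string: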
import chardet
print(chardet.detect("Я люблю вкусные пампушки".encode('cp1251')))
Output:
{'encoding': 'windows-1251', 'confidence': 0.9787849417942193, 'language': 'Russian'}
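The result dictionary can also be used to turn the raw bytes back into text. The following is a minimal sketch, not part of chardet itself: the helper names decode_detected and detect_and_decode, the 0.5 confidence threshold and the UTF-8 fallback are all assumptions chosen for illustration.

```python
def decode_detected(raw, result, min_confidence=0.5, fallback="utf-8"):
    """Decode raw bytes using a chardet-style result dict.

    `result` is expected to look like the output of chardet.detect():
    {'encoding': ..., 'confidence': ..., 'language': ...}.
    The threshold and the fallback encoding are illustrative assumptions.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if not encoding or confidence < min_confidence:
        encoding = fallback
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name, or bytes that do not fit the guess:
        # fall back and replace undecodable bytes instead of failing.
        return raw.decode(fallback, errors="replace")


def detect_and_decode(raw):
    """Run chardet on raw bytes and decode them with the guessed encoding."""
    import chardet  # third-party; imported lazily so decode_detected has no dependency

    return decode_detected(raw, chardet.detect(raw))
```

With chardet installed, detect_and_decode("Я люблю вкусные пампушки".encode('cp1251')) should give back the original string whenever the detection matches the output shown above.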
Detect the encoding and language of every XML file in the current directory:
import glob
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
for filename in glob.glob('*.xml'):
    print(filename.ljust(60), end='')
    detector.reset()  # clear the state left over from the previous file
    with open(filename, 'rb') as f:
        for line in f:
            detector.feed(line)
            if detector.done:  # the detector is confident, stop reading early
                break
    detector.close()
    print(detector.result)
2. Langdetect
A very practical Python library for small-scale needs.
Install: pip install langdetect
Detection:
from langdetect import detect, DetectorFactory, detect_langs
# DetectorFactory.seed = 0
print(detect('今一はお前さん'))
Output:
ja
The language detection algorithm is non-deterministic: on text that is too short or too ambiguous, it may return different results on different runs. To get consistent results, set DetectorFactory.seed = 0 before calling detect().
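One way to see whether a given text is affected by this is to run the detection several times without a fixed seed and count how often the answers agree. The sketch below is only an illustration: vote_language and vote_with_langdetect are hypothetical helper names, and the default of 5 runs is an arbitrary choice.

```python
from collections import Counter


def vote_language(text, detect, runs=5):
    """Call `detect` several times and return (most common label, agreement share).

    `detect` is any callable mapping a string to a language code,
    for example langdetect.detect. `runs` is an illustrative default.
    """
    labels = [detect(text) for _ in range(runs)]
    label, count = Counter(labels).most_common(1)[0]
    return label, count / runs


def vote_with_langdetect(text, runs=5):
    """Same as vote_language, using langdetect.detect as the detector."""
    from langdetect import detect  # third-party, imported lazily

    return vote_language(text, detect, runs)
```

An agreement share below 1.0 suggests the text is too short or too ambiguous for a stable answer. If DetectorFactory.seed has been set, every run is expected to return the same label, so the check is only informative without a seed.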
If you want to output the probabilities of the top languages:
from langdetect import detect_langs
print(detect_langs("Otec matka syn."))
Output:
[sk:0.572770823327, pl:0.292872522702, cs:0.134356653968]
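Each entry in that list is an object with a lang and a prob attribute, so the probabilities can be used to reject uncertain guesses. Here is a minimal sketch under assumptions: confident_language is a hypothetical helper, and the 0.9 threshold is an illustrative value, not a langdetect default.

```python
def confident_language(candidates, min_prob=0.9):
    """Return the top language code if its probability reaches min_prob, else None.

    `candidates` is a list of objects with `.lang` and `.prob` attributes,
    such as the list returned by langdetect.detect_langs().
    """
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c.prob)
    return best.lang if best.prob >= min_prob else None


# Typical use, with langdetect installed:
#   from langdetect import detect_langs
#   confident_language(detect_langs("Otec matka syn."))
```

For the example above, where the best candidate scores about 0.57, this helper would return None with the default threshold, which signals that the guess should not be trusted.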
3. Langid
Install: pip install langid
Detection:
import langid
print(langid.classify("今一はお前さん"))
Output:
('ja', -143.23792815208435)
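The second value in the tuple is an unnormalized score, not a probability, so it is hard to compare across inputs. As far as the langid README describes, an identifier built with norm_probs=True returns a probability between 0 and 1 instead. The sketch below relies on that; make_normalized_identifier and classify_or_none are hypothetical helper names, and the 0.9 threshold is an assumption.

```python
def make_normalized_identifier():
    """Build a langid identifier whose scores are probabilities in [0, 1]."""
    # Third-party import kept local so classify_or_none has no hard dependency.
    from langid.langid import LanguageIdentifier, model

    return LanguageIdentifier.from_modelstring(model, norm_probs=True)


def classify_or_none(text, identifier, min_prob=0.9):
    """Return the language code if the normalized probability reaches min_prob.

    `identifier` is any object with a classify(text) -> (lang, prob) method,
    such as the one built by make_normalized_identifier().
    """
    lang, prob = identifier.classify(text)
    return lang if prob >= min_prob else None
```

With langid installed, classify_or_none("今一はお前さん", make_normalized_identifier()) returns either a language code or None, depending on how confident the model is.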
That's all.