Python libraries for detecting the language of text

This article introduces three Python libraries that run locally and can identify the language of a piece of text without a network connection.

1. Chardet

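chardet is a character encoding auto-detection library for Python; for some encodings it also reports the associated language. In the example below, encode converts the string to bytes, because chardet.detect takes bytes as input.
Installation: pip install chardet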

Detecting the language of a local string:

import chardet

print(chardet.detect("Я люблю вкусные пампушки".encode('cp1251')))


Output:
{'encoding': 'windows-1251', 'confidence': 0.9787849417942193, 'language': 'Russian'}
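
The dictionary returned by chardet.detect can also be used to decode the bytes back into text. The following is a minimal sketch, not part of chardet itself: the helper names decode_with_guess and detect_and_decode, the 0.5 confidence threshold, and the UTF-8 fallback are assumptions made for illustration. The language field may be empty when the detected encoding is not tied to a single language (UTF-8, for example), so the helper returns None for the language in that case.

```python
def decode_with_guess(raw, guess, min_confidence=0.5, fallback="utf-8"):
    """Decode raw bytes using a chardet-style result dictionary.

    guess is expected to look like the output of chardet.detect(raw):
    {'encoding': ..., 'confidence': ..., 'language': ...}.
    min_confidence and fallback are illustrative defaults, not chardet settings.
    Returns (text, language); language is None when it is unknown.
    """
    encoding = guess.get("encoding")
    confidence = guess.get("confidence") or 0.0
    if not encoding or confidence < min_confidence:
        # No usable guess: decode with the fallback and report no language.
        return raw.decode(fallback, errors="replace"), None
    try:
        text = raw.decode(encoding, errors="replace")
    except LookupError:
        # The reported encoding name is not known to Python's codecs.
        return raw.decode(fallback, errors="replace"), None
    return text, guess.get("language") or None


def detect_and_decode(raw):
    """Run chardet on raw bytes, then decode them with its guess."""
    import chardet  # imported here so decode_with_guess works without chardet
    return decode_with_guess(raw, chardet.detect(raw))
```

With chardet installed, detect_and_decode("Я люблю вкусные пампушки".encode('cp1251')) should return the original string together with the language chardet reports for it.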

Detecting the encoding and language of every XML file in the current directory:

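import glob
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
for filename in glob.glob('*.xml'):
    print(filename.ljust(60), end='')
    detector.reset()  # clear state left over from the previous file
    with open(filename, 'rb') as f:  # read raw bytes and close the file afterwards
        for line in f:
            detector.feed(line)
            if detector.done:  # stop early once the detector is confident
                break
    detector.close()  # finalize the guess for this file
    print(detector.result)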

2. Langdetect

A handy Python library for lightweight needs.
Installation: pip install langdetect

Detection:

from langdetect import detect, DetectorFactory, detect_langs
# DetectorFactory.seed = 0  # uncomment to make results reproducible
print(detect('今一はお前さん'))


Output:
ja

The detection algorithm is non-deterministic: on text that is too short or too ambiguous, it may return a different result from one run to the next. To get consistent results, set DetectorFactory.seed = 0 before calling the detection functions.
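
As a rough illustration of the seeding step, the sketch below fixes the seed before each call and also handles input that contains nothing to detect. The wrapper name detect_stable and the choice to return None on failure are assumptions for this example; DetectorFactory.seed and LangDetectException come from langdetect.

```python
def detect_stable(text, seed=0):
    """Detect the language of text with a fixed seed.

    Returns a language code such as 'ja', or None when langdetect
    cannot find any usable features in the text (e.g. an empty string).
    """
    # Imported inside the function so this module loads without langdetect.
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = seed  # same seed, same result on every run
    try:
        return detect(text)
    except LangDetectException:
        return None
```

Calling detect_stable twice on the same short string should then give the same answer both times.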

To output the probabilities of the most likely languages:

from langdetect import detect_langs
print(detect_langs("Otec matka syn."))


Output:
[sk:0.572770823327, pl:0.292872522702, cs:0.134356653968]
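
When the top probabilities are close together, as in the output above, it can be safer to treat the result as undecided. The following is a minimal sketch under that assumption: pick_language, detect_confident, and the 0.9 threshold are hypothetical names and values chosen for illustration. It relies only on the lang and prob attributes of the objects returned by detect_langs.

```python
def pick_language(candidates, min_prob=0.9):
    """Return the most probable language code, or None if it is below min_prob.

    candidates is any iterable of objects with .lang and .prob attributes,
    such as the list returned by langdetect.detect_langs.
    """
    candidates = list(candidates)
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c.prob)
    return best.lang if best.prob >= min_prob else None


def detect_confident(text, min_prob=0.9):
    """Run detect_langs on text and keep the answer only if it is confident."""
    from langdetect import detect_langs  # imported lazily; needs langdetect
    return pick_language(detect_langs(text), min_prob)
```

For the three candidates shown above, pick_language returns None with the default threshold and 'sk' if min_prob is lowered to 0.5.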

3. Langid

Installation: pip install langid

Detection:

import langid
print(langid.classify("今一はお前さん"))


Output:
('ja', -143.23792815208435)
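
The second value in this tuple is a score rather than a probability; the langid documentation describes it as an unnormalized log-probability, which is why it is negative. To get a value between 0 and 1, langid provides a LanguageIdentifier that can be built with norm_probs=True. The sketch below wraps that call; the helper names _identifier and classify_normalized are assumptions for this example.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def _identifier():
    """Build the langid identifier once and reuse it (loading is slow)."""
    # Imported inside the function so this module loads without langid.
    from langid.langid import LanguageIdentifier, model
    return LanguageIdentifier.from_modelstring(model, norm_probs=True)


def classify_normalized(text):
    """Return (language_code, probability) with probability in [0, 1]."""
    lang, prob = _identifier().classify(text)
    return lang, float(prob)
```

If only a few languages are expected, langid.set_languages(['ja', 'zh', 'en']) restricts the module-level langid.classify to that list; it does not affect the separate identifier built in this sketch.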

That's all.

Origin blog.csdn.net/qq_41608408/article/details/128210466