Win11 environment Mecab Japanese word segmentation and part-of-speech analysis and dynamic library DLL not found problem (Python3.10)

Insert image description here

Due to the existence of pseudonyms in Japanese, translation software will cause the translation problem to be too heavy when translating. For example, most software will translate the word "accumulation" (つんどく) into: "accumulation of reading", but in fact, it means that I bought the book but did not read it. , meaning to put light. Sometimes you need to look up the definitions of the words in the sentence separately to understand the meaning of the sentence, but at first glance they are all pseudonyms, and it is impossible to perform simple word segmentation operations like Chinese or English.

This time we use the third-party library Mecab of Python 3.10 to perform word segmentation and part-of-speech analysis on Japanese.

Install and configure Mecab

First download the latest 64-bit installation package of Mecab0.996:

https://github.com/ikegami-yukino/mecab/releases

Then double-click to install, pay attention to select the national standard code utf-8 for encoding:

The default Shift_JIS is a coding table commonly used in Japanese computer systems, which can accommodate full-width and half-width Latin letters, hiragana, katakana, symbols and Japanese kanji.

Of course, if your computer is a Japanese system, choose Shift_JIS, but utf-8 is universal.

After successful installation, it is best to add the bin directory to the system environment variables.

Dynamic library DLL not found problem

Then install the corresponding Python dependencies:

pip install mecab-python3

Subsequent import into the Mecab library may report DLL not found.

This is because the system cannot find Mecab’s runtime library libmecab.dll

At this time, you can consider copying libmecab.dll in the bin directory in the Mecab installation directory to the system's C:/windows/system32 directory.

Because in the Windows operating system, DLL files are dynamic link library files, which contain many functions that can be called by other programs. If you want a program to use a DLL file, you need to ensure that the DLL file has been correctly installed in the system directory, and system32 is the dynamic library installation directory of the Win11 system.

In short, placing the DLL file in the C:\Windows\System32 directory can make it visible to other programs, but you need to pay attention to user permission issues.

Mecab Japanese word segmentation and part-of-speech analysis

Then write the code test.py:

import MeCab  
  
CONTENT = "私はpythonを使用して、プログラミングを勉強しています。積ん読"  
  
tagger = MeCab.Tagger()  
parse = tagger.parse(CONTENT)  
  
print(parse)

operation result:

PS D:\jiyun\积云\boo3_public> python -u "d:\jiyun\积云\boo3_public\mecab_test.py"  
私      ワタクシ        ワタクシ        私-代名詞       代名詞                  0  
は      ワ      ハ      は      助詞-係助詞  
python  python  python  python  名詞-普通名詞-一般                      0  
を      オ      ヲ      を      助詞-格助詞  
使用    シヨー  シヨウ  使用    名詞-普通名詞-サ変可能                  0  
し      シ      スル    為る    動詞-非自立可能 サ行変格        連用形-一般     0  
て      テ      テ      て      助詞-接続助詞  
、                      、      補助記号-読点  
プログラミング  プログラミング  プログラミング  プログラミング-programming      名詞-普通名詞-サ変可能                  4  
を      オ      ヲ      を      助詞-格助詞  
勉強    ベンキョー      ベンキョウ      勉強    名詞-普通名詞-サ変可能                  0  
し      シ      スル    為る    動詞-非自立可能 サ行変格        連用形-一般     0  
て      テ      テ      て      助詞-接続助詞  
い      イ      イル    居る    動詞-非自立可能 上一段-ア行     連用形-一般     0  
ます    マス    マス    ます    助動詞  助動詞-マス     終止形-一般  
。                      。      補助記号-句点  
積ん読  ツンドク        ツンドク        積ん読  名詞-普通名詞-一般

You can see that the private python is used here, and the プログラミングを reluctantly is しています. The complete Japanese sentence "氓読" is divided into words, and the parts of speech are marked, such as the word "氓読" mentioned above.

If it is a large text, it can also be segmented and interpreted by reading a file:

import MeCab  
  
FILE_NAME = "sample.txt"  
  
with open(FILE_NAME, "r", encoding="utf-8") as f:  
    CONTENT = f.read()  
  
tagger = MeCab.Tagger()  
parse = tagger.parse(CONTENT)  
  
print(parse)

Note that when reading the file here, you need to declare that the encoding is utf-8.

The program returns:

私      名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ  
は      助詞,係助詞,*,*,*,*,は,ハ,ワ  
python  名詞,一般,*,*,*,*,*  
を      助詞,格助詞,一般,*,*,*,を,ヲ,ヲ  
使用    名詞,サ変接続,*,*,*,*,使用,シヨウ,シヨー  
し      動詞,自立,*,*,サ変・スル,連用形,する,シ,シ  
て      助詞,接続助詞,*,*,*,*,て,テ,テ  
、      記号,読点,*,*,*,*,、,、,、  
プログラミング  名詞,サ変接続,*,*,*,*,プログラミング,プログラミング,プログラミング  
を      助詞,格助詞,一般,*,*,*,を,ヲ,ヲ  
勉強    名詞,サ変接続,*,*,*,*,勉強,ベンキョウ,ベンキョー  
し      動詞,自立,*,*,サ変・スル,連用形,する,シ,シ  
て      助詞,接続助詞,*,*,*,*,て,テ,テ  
い      動詞,非自立,*,*,一段,連用形,いる,イ,イ  
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス  
。      記号,句点,*,*,*,*,。,。,。 

Conclusion

Mecab was originally developed at Nara Advanced Institute of Science and Technology University and is currently maintained by Taku Kudou as part of the Google Japanese input project. MeCab's name comes from the developer's favorite food, "mekabu," a Japanese dish made from wakame leaves.

MeCab's advantages include accurate analysis of Japanese, fast analysis speed, and cross-platform support for different operating systems. MeCab is an important tool for Japanese text processing, providing powerful support for Japanese text analysis and processing.

Guess you like

Origin blog.csdn.net/zcxey2911/article/details/135338822