Introduction
The original jieba is a Python library for Chinese word segmentation. It can also be used for keyword extraction, part-of-speech tagging, and locating words in a text.
The Java version supports three modes:
- Precise mode: tries to cut the sentence as accurately as possible; suitable for text analysis.
- Full mode: scans out every fragment of the sentence that can form a word; very fast, but it cannot resolve ambiguity.
- Search engine mode: builds on precise mode and splits long words again to improve recall; suitable for search engine word segmentation.
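To make the "split long words again" step more concrete, here is a minimal sketch in plain Java. It follows the approach the upstream Python jieba describes for its search mode (emit the 2- and 3-character dictionary words inside a long token, then the token itself). Whether the Java port does exactly this is an assumption, and the class name `SearchModeSketch`, the method `resegment`, and the toy data are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Simplified illustration of re-segmenting long words; not the library's own code. */
final class SearchModeSketch {

    /**
     * For every token from a precise-mode pass, first emit the 2- and
     * 3-character substrings that are dictionary words, then the token itself.
     */
    static List<String> resegment(List<String> preciseTokens, Set<String> dictionary) {
        List<String> out = new ArrayList<>();
        for (String token : preciseTokens) {
            for (int n = 2; n <= 3; n++) {
                if (token.length() <= n) {
                    continue; // token is too short to contain a shorter n-character word
                }
                for (int i = 0; i + n <= token.length(); i++) {
                    String gram = token.substring(i, i + n);
                    if (dictionary.contains(gram)) {
                        out.add(gram);
                    }
                }
            }
            out.add(token); // the original long word is always kept
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical data: each letter stands for one Chinese character.
        List<String> precise = List.of("ABCDE");
        Set<String> dict = Set.of("AB", "CD", "DE", "CDE");
        System.out.println(resegment(precise, dict)); // prints [AB, CD, DE, CDE, ABCDE]
    }
}
```

The extra short words are what raise recall: a query for "CD" can now match a document that only contained the long word "ABCDE".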
Usage
Import the Maven dependency
Project address: huaban/jieba-analysis on GitHub (jieba word segmentation, Java version)
<dependency>
    <groupId>com.huaban</groupId>
    <artifactId>jieba-analysis</artifactId>
    <version>1.0.2</version>
</dependency>
Using the three modes
Prepare a piece of text: "Ollie gives me a lighting fixture. Ordinary safety exit sign lamp DC36V 6W wall type". Then compare the keywords extracted by the three modes.
- code
- Effect
Precise mode: ["Oli", "Give", "I", "Yes", "Lighting", "Tool", "Normal", "Safety", "Exit", "Marking Light", "DC36V6W", "Wall type"]
INDEX mode: ["Oli", "Give", "I", "Yes", "Lighting", "Bright light", "Lighting light", "Touch", "Normal", "Normal type", "Safety", "Exit", "Sign", "Mark light", "dc36v6w", "Wall type"]
SEARCH mode: ["Oli", "Give", "I", "Yes", "lighting", "tool", "ordinary", "safety", "exit", "sign light", "dc36v6w", "wall"]
As the results show, there is little difference between SEARCH mode and precise mode.
Custom dictionary
Jieba ships with a built-in dictionary of commonly used words: the dict.txt file in the source code directory.
When the built-in dictionary does not cover your business scenario, you can define a custom dictionary. The format is the same as dict.txt: one word per line, and each line has three parts separated by spaces: the word, the word frequency (optional), and the part of speech (optional). The order cannot be changed.
For example, to treat "OLI GIVE" and "I am a lighting fixture" from the text as keywords, define them like this:
OLI GIVE 50
I am a lighting fixture 50
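The line format described above (word, optional frequency, optional part of speech, in that order) can be sketched as a small parser. This is only an illustration of the format, not how jieba reads its dictionaries: the class `DictLine` and its fields are hypothetical, and it assumes the word itself contains no spaces, as is the case for Chinese words.

```java
import java.util.OptionalInt;

/** Illustrative parser for one dictionary line: word [frequency] [partOfSpeech]. */
final class DictLine {
    final String word;
    final OptionalInt frequency; // empty when the frequency is omitted
    final String partOfSpeech;   // null when the part of speech is omitted

    private DictLine(String word, OptionalInt frequency, String partOfSpeech) {
        this.word = word;
        this.frequency = frequency;
        this.partOfSpeech = partOfSpeech;
    }

    static DictLine parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts[0].isEmpty()) {
            throw new IllegalArgumentException("empty dictionary line");
        }
        OptionalInt freq = OptionalInt.empty();
        String pos = null;
        int next = 1;
        // A numeric second field is the frequency.
        if (next < parts.length && parts[next].matches("\\d+")) {
            freq = OptionalInt.of(Integer.parseInt(parts[next]));
            next++;
        }
        // Whatever follows is the part of speech.
        if (next < parts.length) {
            pos = parts[next];
            next++;
        }
        // Anything left over means extra fields or fields in the wrong order.
        if (next < parts.length) {
            throw new IllegalArgumentException("unexpected field order: " + line);
        }
        return new DictLine(parts[0], freq, pos);
    }

    public static void main(String[] args) {
        DictLine entry = DictLine.parse("lamp 50 n"); // hypothetical entry
        System.out.println(entry.word + " / " + entry.frequency + " / " + entry.partOfSpeech);
    }
}
```

Validating lines like this before handing the file to the segmenter is one way to catch a reversed frequency and part of speech early.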
- Create a new custom dictionary file:
Create a new jiebaCon directory under the resource directory and create the custom dictionary file in it.
- Load the user dictionary file
- Effect
Dynamically load user dictionaries
Idea: read the dictionary data from an external source and generate a temporary file for the jieba word segmentation component to load.
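The temporary-file half of this idea can be sketched with the standard library alone. The class `TempUserDict`, the method `write`, and the sample entries are hypothetical. The final step of handing the file to jieba is left as a comment, because the exact loading call (assumed here to be `WordDictionary.getInstance().loadUserDict(path)` in jieba-analysis) should be checked against the version in use.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Writes externally sourced dictionary entries to a temporary file. */
final class TempUserDict {

    /** Writes one entry per line (UTF-8) to a temp file and returns its path. */
    static Path write(List<String> entries) throws IOException {
        Path file = Files.createTempFile("jieba-user-", ".dict");
        file.toFile().deleteOnExit(); // clean up when the JVM exits
        Files.write(file, entries, StandardCharsets.UTF_8);
        return file;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical entries; in practice they might come from a database or a config service.
        Path dict = write(List.of("lamp 50", "sign 50 n"));
        System.out.println("user dictionary written to " + dict);
        // Assumed next step with jieba-analysis on the classpath:
        // WordDictionary.getInstance().loadUserDict(dict);
    }
}
```

Writing UTF-8 explicitly matters here, since the dictionary entries are Chinese words and the platform default charset may differ.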
- code
- Effect