Jieba Word Segmentation: Powerful Chinese Word Segmentation in Practice (Java version)

Introduction

The original jieba is a powerful Python word segmentation component that can be used for keyword extraction, part-of-speech tagging and locating the position of words in a text.

The Java version supports three modes:

  • Precise mode: tries to cut the sentence into the most accurate segmentation; suitable for text analysis.
  • Full mode: scans out every fragment of the sentence that forms a dictionary word; very fast, but it cannot resolve ambiguities.
  • Search engine mode: builds on precise mode and segments long words again to improve recall; suitable for search engine indexing.
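
To make the difference between the modes concrete, here is a toy sketch using only the Java standard library. It is not how jieba-analysis is implemented: the class `SegModesSketch`, its methods and the tiny dictionary are all hypothetical, and unspaced Latin strings stand in for Chinese text. Full mode emits every dictionary word found anywhere in the sentence, while search engine mode takes an already precise segmentation and additionally emits shorter dictionary words found inside each long word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy illustration only; this is NOT the jieba-analysis implementation.
class SegModesSketch {

    // Full mode: emit every substring of the sentence that is a dictionary word.
    static List<String> fullMode(String sentence, Set<String> dict) {
        List<String> words = new ArrayList<>();
        for (int start = 0; start < sentence.length(); start++) {
            for (int end = start + 1; end <= sentence.length(); end++) {
                String candidate = sentence.substring(start, end);
                if (dict.contains(candidate)) {
                    words.add(candidate);
                }
            }
        }
        return words;
    }

    // Search engine mode: start from a precise segmentation and also emit
    // the shorter dictionary words contained in each word.
    static List<String> searchMode(List<String> preciseWords, Set<String> dict) {
        List<String> words = new ArrayList<>();
        for (String word : preciseWords) {
            for (int start = 0; start < word.length(); start++) {
                for (int end = start + 1; end <= word.length(); end++) {
                    String part = word.substring(start, end);
                    // The whole word is skipped here and added once below.
                    if (part.length() < word.length() && dict.contains(part)) {
                        words.add(part);
                    }
                }
            }
            words.add(word);
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("exit", "sign", "lamp", "exitsign", "signlamp");
        // Full mode keeps overlapping candidates such as "exitsign" and "signlamp".
        System.out.println(fullMode("exitsignlamp", dict));
        // Search engine mode adds "exit" and "sign" in front of "exitsign".
        System.out.println(searchMode(List.of("exitsign", "lamp"), dict));
    }
}
```

Full mode returns overlapping words and leaves the ambiguity unresolved, which is why it suits recall-oriented scanning rather than text analysis.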

Usage

Import maven dependencies

Project address: GitHub - huaban/jieba-analysis: jieba word segmentation (Java version)

    <dependency>
        <groupId>com.huaban</groupId>
        <artifactId>jieba-analysis</artifactId>
        <version>1.0.2</version>
    </dependency>

Using the three modes

Prepare a piece of text, for example: "Ollie gives me a lighting fixture. Ordinary safety exit sign lamp DC36V 6W wall type." Then compare the keywords that the three modes extract from it.

  • code

     

  • Effect
    Precise mode: ["Oli", "Give", "I", "Yes", "Lighting", "Tool", "Normal", "Safety", "Exit", "Marking Light", "DC36V6W", "Wall type"]
    INDEX mode: ["Oli", "Give", "I", "Yes", "Lighting", "Bright light", "Lighting light", "Touch", "Normal", "Normal type", "Safety", "Exit", "Sign", "Mark light", "dc36v6w", "Wall type"]
    SEARCH mode: ["Oli", "Give", "I", "Yes", "lighting", "tool", "ordinary", "safety", "exit", "sign light", "dc36v6w", "wall"]
    As the results show, there is little difference between SEARCH mode and precise mode.

Custom dictionary

Jieba ships with a built-in dictionary of commonly used words: the dict.txt file in the source directory.

When the built-in dictionary does not cover your business scenario, you can define a custom dictionary. Its format is the same as dict.txt: one word per line, and each line has up to three parts separated by spaces, in a fixed order: the word, the word frequency (optional), and the part of speech (optional).

For example, to treat "Ollie gives" and "I am a lighting fixture" from the text as keywords, define them like this (shown in translation; in the dictionary file each entry is a single word with no spaces inside it):
    Ollie gives 50
    I am a lighting fixture 50
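
The line format described above can be sketched as a small parser. This is a minimal illustration under stated assumptions, not the parser jieba-analysis uses: `DictEntrySketch` is a hypothetical helper, the frequency is assumed to be a non-negative integer, and ASCII words stand in for Chinese entries.

```java
import java.util.Optional;

// Hypothetical helper for illustration; not part of jieba-analysis.
class DictEntrySketch {
    final String word;
    final Optional<Integer> frequency;
    final Optional<String> partOfSpeech;

    private DictEntrySketch(String word, Optional<Integer> frequency, Optional<String> partOfSpeech) {
        this.word = word;
        this.frequency = frequency;
        this.partOfSpeech = partOfSpeech;
    }

    // Parses "word [frequency] [partOfSpeech]" with whitespace-separated fields.
    static DictEntrySketch parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields[0].isEmpty()) {
            throw new IllegalArgumentException("empty dictionary line");
        }
        Optional<Integer> frequency = Optional.empty();
        Optional<String> partOfSpeech = Optional.empty();
        int next = 1;
        // The frequency comes first when present; it is the only numeric field.
        if (next < fields.length && fields[next].matches("\\d+")) {
            frequency = Optional.of(Integer.parseInt(fields[next]));
            next++;
        }
        // Whatever follows is taken as the part of speech.
        if (next < fields.length) {
            partOfSpeech = Optional.of(fields[next]);
            next++;
        }
        if (next < fields.length) {
            throw new IllegalArgumentException("too many fields: " + line);
        }
        return new DictEntrySketch(fields[0], frequency, partOfSpeech);
    }

    public static void main(String[] args) {
        DictEntrySketch entry = parse("exitsign 50 n");
        System.out.println(entry.word + " " + entry.frequency + " " + entry.partOfSpeech);
    }
}
```

Because the order of the fields is fixed, a line such as "50 exitsign" would be read as the word "50" with part of speech "exitsign", which is why the order must not be reversed.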

  • Create a custom dictionary file:
    Create a jiebaCon directory under the resource directory and add the custom dictionary file to it

  • Load user dictionary file

  • Effect

Dynamically load user dictionaries

Idea: read the dictionary data from an external source and write it to a temporary file that the jieba word segmentation component can load.

  • code

  • Effect
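
The original listing for this step is not available, so the following is only a sketch of the temporary-file part of the idea, using the Java standard library. `TempDictSketch` and its methods are hypothetical names, and the dictionary lines are ASCII stand-ins. Handing the resulting path to the segmenter is left as a comment: in jieba-analysis this is presumably done with `WordDictionary.getInstance().loadUserDict(path)`, but treat that call as an assumption and check it against the library version you use.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper for illustration; not part of jieba-analysis.
class TempDictSketch {

    // Writes dictionary lines (for example fetched from a database or a
    // remote configuration service) to a temporary file and returns its path.
    static Path writeTempDict(List<String> dictLines) {
        try {
            Path tempFile = Files.createTempFile("jieba-user-", ".dict");
            tempFile.toFile().deleteOnExit(); // clean up when the JVM exits
            Files.write(tempFile, dictLines, StandardCharsets.UTF_8);
            return tempFile;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the file back, only to verify what was written.
    static List<String> readDict(Path dictFile) {
        try {
            return Files.readAllLines(dictFile, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dictFile = writeTempDict(List.of("exitsign 50", "signlamp 50"));
        // Assumption: the path would now be passed to the segmenter's
        // user-dictionary loader, e.g. WordDictionary.getInstance().loadUserDict(dictFile).
        System.out.println(readDict(dictFile));
    }
}
```

Writing the file as UTF-8 explicitly avoids depending on the platform default encoding, which matters for Chinese dictionary entries.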

Origin blog.csdn.net/2301_78834737/article/details/131990541