Android (Android Studio): a simple walkthrough of word segmentation with part-of-speech tagging using HanLP

Table of contents

1. Brief introduction

2. Implementation principle

3. Matters needing attention

4. Effect preview

5. Implementation steps

6. Key code

Appendix: In HanLP, the nature field of the Term object indicates the part of speech

1. Brief introduction

This post collects some basic Android development techniques for later reference.

This section shows how to use HanLP on Android to segment sentences and paragraphs into words and to read each word's part of speech.

HanLP is not the only option for Chinese word segmentation on Android. Some other common algorithms and tools are listed below, followed by some advantages of HanLP:

Common Chinese word segmentation algorithms and tools:

ansj_seg: ansj_seg is a Chinese word segmentation tool for the Java platform, based on CRF and HMM models. It supports fine-grained and coarse-grained segmentation, and offers custom dictionaries and part-of-speech tagging.

jieba: jieba is a Chinese word segmentation library widely used in Python, and Java ports are also available. It uses a segmentation method based on a prefix dictionary, and performs well in both speed and accuracy.

lucene-analyzers-smartcn: This is a Chinese analyzer from the Apache Lucene project. It segments text with a probabilistic approach based on a hidden Markov model, and is widely used with Lucene-based search engines.

ictclas4j: ictclas4j is a Java port of ICTCLAS, the HMM-based Chinese word segmentation tool developed by the Institute of Computing Technology, Chinese Academy of Sciences. It supports custom dictionaries and part-of-speech tagging.

Advantages of HanLP word segmentation:

Multi-domain applicability: HanLP is designed as a multi-domain Chinese natural language processing toolkit. Besides word segmentation, it supports tasks such as part-of-speech tagging, named entity recognition, and dependency parsing.

Performance and effect: HanLP has been trained and optimized on multiple standard datasets, and offers good segmentation accuracy and speed.

Flexible dictionary support: HanLP supports custom dictionaries, and you can add vocabulary in professional fields as needed to improve word segmentation.

Open source: HanLP is open source, so you can use, modify, and distribute it under its license terms, which makes it easy to customize and integrate into your projects.

Multi-language support: HanLP supports not only Chinese but also other languages, such as English and Japanese, which helps with cross-language processing.

Active community: HanLP has an active community and maintenance team that helps with problem solving and support.

In short, HanLP is a feature-rich, high-performance Chinese natural language processing tool that suits many application scenarios, especially multi-domain text processing. The final choice still depends on your specific needs and project context.

HanLP Official Website: HanLP | Online Demo

HanLP GitHub: GitHub - hankcs/HanLP: Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style conversion, semantic similarity, new word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin and Simplified/Traditional Chinese conversion, natural language processing

2. Implementation principle

1. Call StandardTokenizer.segment(text) with the text to segment; it returns a list of Term objects

2. For each Term, read term.word to get the segmented word and term.nature.toString() to get its part-of-speech tag

3. Matters needing attention

1. Chinese words usually receive an accurate part-of-speech tag, but English words may not
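One possible way to act on this caveat is to check whether a segmented word is written entirely in Chinese characters before relying on its tag. The sketch below uses only the Java standard library; the class and method names (HanCheck, isAllHan) are hypothetical helpers for illustration and are not part of HanLP.

```java
/** Hypothetical helper: checks whether a word consists only of Han (Chinese) characters. */
final class HanCheck {

    /** Returns true if the word is non-empty and every code point belongs to the Han script. */
    static boolean isAllHan(String word) {
        if (word == null || word.isEmpty()) {
            return false;
        }
        return word.codePoints()
                .allMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    public static void main(String[] args) {
        // The escapes below spell "Beijing" in Chinese characters.
        System.out.println(isAllHan("\u5317\u4eac"));
        System.out.println(isAllHan("Android"));
        System.out.println(isAllHan("\u5317\u4eacAndroid"));
    }
}
```

In the segmentation loop, a word that fails this check could be given a generic label instead of a label derived from its tag. Whether that is appropriate depends on the kind of text your app processes.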

4. Effect preview

5. Implementation steps

1. Open Android Studio, create an empty project, and add the HanLP dependency to the module's build.gradle:

implementation 'com.hankcs:hanlp:portable-1.7.5'

Remember to click Sync Now afterwards.

2. Create the class ChineseSegmentationExample to implement the word segmentation function

3. Call it from MainActivity, passing in the text to be segmented

4. Build and run on an Android device. The result is printed with System.out, which shows up in Logcat, as in the effect preview

6. Key code

1. ChineseSegmentationExample

package com.xxxx.testchinesesegmentationexample;

import com.hankcs.hanlp.seg.common.Term;
import com.hankcs.hanlp.tokenizer.StandardTokenizer;

import java.util.List;

public class ChineseSegmentationExample {

    /**
     * Segments the given text and prints each word with its part of speech.
     * @param wordsContent the text to segment
     */
    public static void SegmentWords(String wordsContent) {
        // Run HanLP's standard tokenizer on the input text
        List<Term> terms = StandardTokenizer.segment(wordsContent);

        // Walk the result and print each word, its POS tag, and a readable label
        for (Term term : terms) {
            String word = term.word;
            String pos = term.nature.toString();

            String posInfo = getPosInfo(pos); // map the tag to a readable label

            System.out.println("Word: " + word + ", POS: " + pos + ", Attribute: " + posInfo);
        }
    }

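    /**
     * Maps a part-of-speech tag to a readable label.
     * @param pos the POS tag, e.g. "n" or "ns"
     * @return the label for the tag, or "其他" (other) if the tag is not handled
     */
    static String getPosInfo(String pos) {
        // Add more branches here as needed to cover other tags
        if (pos.equals("n")) {
            return "名词"; // noun
        } else if (pos.equals("v")) {
            return "动词"; // verb
        } else if (pos.equals("ns")) {
            return "地名"; // place name
        } else if (pos.equals("t")) {
            return "时间"; // time
        } else {
            return "其他"; // other
        }
    }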

}
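As the tag list grows, the if/else chain in getPosInfo can become hard to maintain. A table-driven lookup is one alternative. The sketch below is a standalone illustration, not part of HanLP: the class name PosLabels, the method describe, and the English labels are hypothetical. It also assumes that a sub-tag such as "vshi" or "nz" can fall back to the label of its shorter prefix ("v", "n"), which fits the tags listed in the appendix but may not hold for every tag.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical table-driven replacement for an if/else chain over POS tags. */
final class PosLabels {

    // Tag-to-label table; entries follow the tag list in the appendix.
    private static final Map<String, String> LABELS = new HashMap<>();
    static {
        LABELS.put("n", "noun");
        LABELS.put("nr", "person name");
        LABELS.put("ns", "place name");
        LABELS.put("nt", "organization name");
        LABELS.put("v", "verb");
        LABELS.put("t", "time word");
        LABELS.put("a", "adjective");
        LABELS.put("d", "adverb");
        LABELS.put("w", "punctuation");
    }

    /**
     * Returns a label for the tag. If the exact tag is not in the table, tries
     * progressively shorter prefixes (assumption: "vshi" -> "vsh" -> "vs" -> "v").
     */
    static String describe(String pos) {
        if (pos == null) {
            return "other";
        }
        for (int end = pos.length(); end > 0; end--) {
            String label = LABELS.get(pos.substring(0, end));
            if (label != null) {
                return label;
            }
        }
        return "other";
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"ns", "vshi", "nz", "xyz"}) {
            System.out.println(tag + " -> " + describe(tag));
        }
    }
}
```

Adding a new tag then only needs one more LABELS.put line. If the prefix fallback is too coarse for your use case, drop the loop and look up the exact tag only.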

2. MainActivity

package com.xxxx.testchinesesegmentationexample;

import androidx.appcompat.app.AppCompatActivity;

import android.os.Bundle;

public class MainActivity extends AppCompatActivity {

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);

        ChineseSegmentationExample.SegmentWords("现在几号，几点钟，今天明天后天昨天北京深圳的天气如何。");
    }
}

Appendix: In HanLP, the nature field of the Term object indicates the part of speech

In HanLP, the nature field of a Term object represents the word's part of speech (POS). HanLP uses a standard Chinese part-of-speech tagging system, and each part of speech has a unique identifier. Here are some common Chinese part-of-speech tags and their meanings:

Noun class:

n: common noun

nr: person name

ns: place name

nt: organization name

nz: other proper noun

nl: nominal idiomatic expression

ng: noun morpheme

Time word class:

t: time word

Verb class:

v: verb

vd: adverbial verb

vn: nominal verb

vshi: verb "to be" (是)

vyou: verb "to have" (有)

Adjective class:

a: adjective

ad: adverbial adjective

Adverb class:

d: adverb

Pronoun class:

r: pronoun

rr: personal pronoun

rz: demonstrative pronoun

rzt: time demonstrative pronoun

Conjunction class:

c: conjunction

Particle class:

u: particle

Numeral class:

m: numeral

Quantifier class:

q: quantifier (measure word)

Modal particle class:

y: modal particle

Interjection class:

e: interjection

Onomatopoeia class:

o: onomatopoeia

Locality word class:

f: locality word

Status word class:

z: status word

Preposition class:

p: preposition

Prefix class:

h: prefix

Suffix class:

k: suffix

Punctuation class:

w: punctuation

Note that these are only some of the common part-of-speech tags; the full tag set is larger and more fine-grained. Consult the HanLP documentation for the complete list. Based on these tags, you can write code that identifies the category of each word (verb, noun, place name, and so on) and processes it accordingly.
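As a small illustration of processing words by tag, the sketch below picks place names and time words out of a tagged word list. It is self-contained and does not call HanLP: TaggedWord is a hypothetical stand-in for Term (a word plus its tag string), and the sample data is hand-written and romanized, not actual HanLP output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that selects words from a tagged list by POS tag prefix. */
final class PosFilter {

    /** Stand-in for HanLP's Term: a word plus its POS tag as a string. */
    static final class TaggedWord {
        final String word;
        final String pos;

        TaggedWord(String word, String pos) {
            this.word = word;
            this.pos = pos;
        }
    }

    /** Returns the words whose tag starts with the given prefix, in input order. */
    static List<String> wordsWithTagPrefix(List<TaggedWord> terms, String prefix) {
        List<String> result = new ArrayList<>();
        for (TaggedWord t : terms) {
            if (t.pos != null && t.pos.startsWith(prefix)) {
                result.add(t.word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hand-written sample data for illustration only.
        List<TaggedWord> terms = Arrays.asList(
                new TaggedWord("today", "t"),
                new TaggedWord("Beijing", "ns"),
                new TaggedWord("Shenzhen", "ns"),
                new TaggedWord("weather", "n"),
                new TaggedWord(".", "w"));
        System.out.println("Places: " + wordsWithTagPrefix(terms, "ns"));
        System.out.println("Times: " + wordsWithTagPrefix(terms, "t"));
    }
}
```

With real HanLP output, the same filter could be applied by building each TaggedWord from term.word and term.nature.toString(). Matching on a prefix means that "n" selects every noun sub-tag (n, nr, ns, and so on), while "ns" selects place names only.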