A Brief Overview of Word Segmentation (with Part-of-Speech Handling) in Unity Android Using HanLP


Table of contents


1. Brief introduction

2. Implementation principle

3. Notes

4. Effect preview

5. Implementation steps

6. Key code

Appendix: In HanLP, the nature field of the Term object indicates the part of speech


1. Brief introduction

This series briefly organizes technologies used in Unity Android development for easy future reference; hopefully it helps others as well.

This article describes how to wrap HanLP's word segmentation functionality on the Android side and call it from Unity. This is not the only approach; corrections are welcome.

On the Android platform, besides HanLP, other algorithms and tools are available for Chinese word segmentation. Below are some common ones, followed by some of HanLP's advantages:

Common Chinese word segmentation algorithms and tools:

    ansj_seg: a Chinese word segmentation tool for the Java platform, based on CRF and HMM models. It supports both fine-grained and coarse-grained segmentation and provides custom dictionaries and part-of-speech tagging.

    jieba: a Chinese word segmentation library widely used in Python, which also has a Java port. It uses a prefix-dictionary-based segmentation method and performs well in both speed and accuracy.

    lucene-analyzers-smartcn: the Chinese tokenizer in the Apache Lucene project, widely used with the Lucene search engine.

    ictclas4j: a Chinese word segmentation tool based on the HMM model, derived from ICTCLAS developed by the Institute of Computing Technology, Chinese Academy of Sciences. It supports custom dictionaries and part-of-speech tagging.

Advantages of HanLP word segmentation:

    Multi-domain applicability: HanLP is designed as a multi-domain Chinese natural language processing toolkit; beyond word segmentation, it supports tasks such as part-of-speech tagging, named entity recognition, and dependency parsing.

    Performance and effect: HanLP has been trained and optimized on multiple standard datasets, and has good word segmentation effect and performance.

    Flexible dictionary support: HanLP supports custom dictionaries, so you can add domain-specific vocabulary as needed to improve segmentation accuracy.

    Open Source: HanLP is open source, you can use, modify and distribute it freely, which facilitates customization and integration into your projects.

    Multilingual support: HanLP not only supports Chinese, but also supports other languages, such as English, Japanese, etc., which facilitates cross-language processing.

    Active community: HanLP has an active community and maintenance team that helps with problem solving and support.

In short, HanLP is a feature-rich, high-performance Chinese natural language processing toolkit suited to a wide range of scenarios, especially multi-domain text processing. The final choice, however, depends on your specific needs and project context.
 

HanLP Official Website: HanLP | Online Demo

HanLP GitHub: GitHub - hankcs/HanLP: Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style conversion, semantic similarity, new word discovery, keyphrase extraction, automatic summarization, text classification, clustering, and pinyin and simplified/traditional Chinese conversion for natural language processing

2. Implementation principle

1. On the Android side, call StandardTokenizer.segment(text) to segment the input text; use Term.word to get each segmented word and Term.nature.toString() to get its part-of-speech attribute

2. Expose the functional interface encapsulated on the Android side for Unity to call:

    /**
     * Start segmentation
     * @param wordsContent text to segment
     * @return the segmentation results and their part-of-speech attributes
     */
    public String segmentWork(String wordsContent)

3. On the Unity side, obtain the Android-side object and call its interface, then lightly process the returned data so it is easier to use in Unity:

MAndroidJavaObject.Call<string>("segmentWork", wordsContent)

3. Notes

1. Chinese words are generally assigned a fairly accurate part-of-speech attribute; English words may not be

2. In general, only basic data types can be passed between Android and Unity; complex objects such as lists may not transfer directly. Here, the list data is assembled into a single string and passed to Unity, and Unity parses the corresponding information back out of the string
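The pack-and-parse round trip described above can be sketched in plain Java (class and method names here are illustrative, not from the original project). Note that this simple comma scheme assumes neither words nor tags contain commas; a rarer delimiter or escaping would be more robust:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the string-based data exchange:
// the Android side joins (word, POS) pairs with commas, and the
// Unity side splits them back into alternating entries.
class SegPayload {

    // Pack alternating word/POS entries into a single comma-separated string.
    static String pack(List<String> wordAttrPairs) {
        return String.join(",", wordAttrPairs);
    }

    // Split the payload back into alternating word/POS entries.
    static String[] unpack(String payload) {
        return payload.split(",");
    }

    public static void main(String[] args) {
        List<String> pairs = new ArrayList<>();
        pairs.add("今天"); pairs.add("t");
        pairs.add("深圳"); pairs.add("ns");

        String payload = pack(pairs);
        System.out.println(payload);               // 今天,t,深圳,ns

        String[] parts = unpack(payload);
        System.out.println(parts.length);          // 4
        System.out.println(parts[2] + "/" + parts[3]); // 深圳/ns
    }
}
```

On the Unity side, the same split-then-iterate-in-pairs logic reconstructs the word/attribute structs, as shown in the key code below.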

4. Effect preview

(Here, only a few parts of speech are mapped; for more mappings, see the part-of-speech information in the appendix)

 

5. Implementation steps

The HanLP package (hanlp-portable-1.7.5.jar) can be downloaded here and added directly to the Android Studio project


1. Open Android Studio, create a module project, and add the hanlp-portable-1.7.5.jar package

Note: remember to add it as a library
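As a sketch, adding the jar as a library typically amounts to a Gradle dependency like the following (the libs/ path is illustrative; Android Studio's "Add as Library" action generates an equivalent entry in the module's build.gradle):

```groovy
// Hypothetical module build.gradle fragment: reference the HanLP portable jar
// placed in the module's libs/ directory.
dependencies {
    implementation files('libs/hanlp-portable-1.7.5.jar')
}
```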

2. Create a script and add the word segmentation function

3. Create a Unity project and add the compiled AAR to Unity

4. Create a script in Unity that calls the interface encapsulated on the Android side, and write a script to test the function

5. Add the test script to the scene

6. Build, install, and run on a device; the effect is shown in the preview above

 

6. Key code

1. TestChineseSegmentationHanlpHelper.cs

using System.Collections.Generic;
using UnityEngine;

public class TestChineseSegmentationHanlpHelper : MonoBehaviour
{
    // Start is called before the first frame update
    void Start()
    {
        List<SegWordStrct> segWordStrcts = ChineseSegmentationHanlpHelper.SegmentWords("今天深圳的天气如何");
        foreach (var word in segWordStrcts)
        {
            Debug.Log($"{word.word},{word.WordAttribute}");
        }
    }
}

2. ChineseSegmentationHanlpHelper.cs



using System.Collections.Generic;
using UnityEngine;

/// <summary>
/// Chinese word segmentation helper for sentences and paragraphs
/// </summary>
public class ChineseSegmentationHanlpHelper
{

    #region Data

    static AndroidJavaObject _mAndroidJavaObject;
    protected static AndroidJavaObject MAndroidJavaObject
    {
        get
        {
            if (_mAndroidJavaObject == null)
            {
                _mAndroidJavaObject = new AndroidJavaObject("com.ffalcon.chinesesegmentationhanlphelper.HanlpHelper");
            }

            return _mAndroidJavaObject;
        }
    }

    #endregion
    /// <summary>
    /// Segment a Chinese sentence into words
    /// </summary>
    /// <param name="wordsContent">text to segment</param>
    /// <returns>list of segmentation results</returns>
    public static List<SegWordStrct> SegmentWords(string wordsContent)
    {
        wordsContent = wordsContent.Trim();
        if (string.IsNullOrEmpty(wordsContent)) return null;

        string result = MAndroidJavaObject.Call<string>("segmentWork", wordsContent);
        if (string.IsNullOrEmpty(result)) return null;

        string[] dataArray = result.Split(',');
        List<SegWordStrct> rltLst = new List<SegWordStrct>();
        // Entries alternate: word, POS tag, word, POS tag, ...
        for (int i = 0; i + 1 < dataArray.Length; i += 2)
        {
            rltLst.Add(new SegWordStrct()
            {
                word = dataArray[i],
                WordAttribute = GetPosInfo(dataArray[i + 1])
            });
        }
        return rltLst;
    }

    /// <summary>
    /// Simple lookup of a word's attribute text
    /// </summary>
    /// <param name="pos">part-of-speech tag</param>
    /// <returns>human-readable attribute</returns>
    private static string GetPosInfo(string pos)
    {
        // Add more branches here as needed to map additional part-of-speech tags
        if (pos.Equals("n"))
        {
            return WordAttributeStrDefine.Noun;
        }
        else if (pos.Equals("v"))
        {
            return WordAttributeStrDefine.Verb;
        }
        else if (pos.Equals("ns"))
        {
            return WordAttributeStrDefine.PlaceName;
        }
        else if (pos.Equals("t"))
        {
            return WordAttributeStrDefine.Time;
        }
        else
        {
            return WordAttributeStrDefine.Other;
        }
    }
}

/// <summary>
/// Segmentation result structure
/// </summary>
public struct SegWordStrct
{
    public string word;
    public string WordAttribute;
}

/// <summary>
/// Word attribute text definitions.
/// There are many part-of-speech attributes; only a few are defined here.
/// </summary>
public class WordAttributeStrDefine {
    public const string Noun = "名词";      // noun
    public const string Verb = "动词";      // verb
    public const string PlaceName = "地名"; // place name
    public const string Time = "时间";      // time
    public const string Other = "其他";     // other
}

3. HanlpHelper.java

package com.xxxx.chinesesegmentationhanlphelper;

import com.hankcs.hanlp.seg.common.Term;
import com.hankcs.hanlp.tokenizer.StandardTokenizer;

import java.util.ArrayList;
import java.util.List;

/**
 * Word segmentation helper
 */
public class HanlpHelper {

    /**
     * Start segmentation
     * @param wordsContent text to segment
     * @return segmentation results and their POS tags, joined by commas
     */
    public String segmentWork(String wordsContent){

        // Perform segmentation
        List<Term> terms = StandardTokenizer.segment(wordsContent);

        if(terms!=null && terms.size()>0){

            List<String> rltWordAttr = new ArrayList<>();

            // Iterate over the segmentation results, determine the POS attribute, and print it
            for (Term term : terms) {
                String word = term.word;
                String pos = term.nature.toString();

                String posInfo = getPosInfo(pos); // 判断词性属性

                System.out.println("Word: " + word + ", POS: " + pos + ", Attribute: " + posInfo);

                rltWordAttr.add(word);
                rltWordAttr.add(pos);
            }
            String[] dataArray = rltWordAttr.toArray(new String[0]);

            return String.join(",", dataArray);
        }

        return null;
    }

    /**
     * Determine the word attribute from a POS tag
     * @param pos part-of-speech tag
     * @return human-readable attribute text (Chinese)
     */
    public String getPosInfo(String pos) {
        // Add more branches here as needed to map additional POS tags
        if (pos.equals("n")) {
            return "名词";   // noun
        } else if (pos.equals("v")) {
            return "动词";   // verb
        } else if (pos.equals("ns")) {
            return "地名";   // place name
        } else if (pos.equals("t")) {
            return "时间";   // time
        } else {
            return "其他";   // other
        }
    }
}

Appendix: In HanLP, the nature field of the Term object represents the part of speech

In HanLP, the nature field of the Term object represents the Part of Speech (POS). HanLP uses a standard Chinese part-of-speech tagging system, and each part of speech has a unique identifier. Here are some common Chinese part-of-speech tags and their meanings:

    Nouns:
        n: common noun
        nr: personal name
        ns: place name
        nt: organization name
        nz: other proper noun
        nl: noun idiom
        ng: noun morpheme

    Time class:
        t: time word

    Verb categories:
        v: verb
        vd: adverbial verb (verb used adverbially)
        vn: nominal verb (verb used as a noun)
        vshi: the verb "是" (to be)
        vyou: the verb "有" (to have)

    Adjective class:
        a: adjective
        ad: adverbial adjective (adjective used as an adverb)

    Adverb class:
        d: adverb

    Pronoun class:
        r: pronoun
        rr: personal pronoun
        rz: demonstrative pronoun
        rzt: time demonstrative pronoun

    Conjunction class:
        c: conjunction

    Particle class:
        u: particle

    Numeral class:
        m: Numeral

    Quantifier class:
        q: quantifier

    Modal particle class:
        y: Modal particle

    Interjection class:
        e: interjection

    Onomatopoeic part of speech:
        o: onomatopoeia

    Position part of speech:
        f: position word

    State part of speech:
        z: state word

    Preposition class:
        p: preposition

    Prefix class:
        h: prefix

    Suffix class:
        k: suffix

    Punctuation class:
        w: punctuation

Please note that these are only some of the common part-of-speech tags; the full tag set is more extensive. Consult the HanLP documentation for the complete part-of-speech tagging details. Based on these tags, you can write code that determines a word's attribute (verb, noun, place name, etc.) and processes it accordingly.
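As an illustration of working with these tags (a hypothetical helper, not part of HanLP), note that the noun subtags above all begin with "n" and the verb subtags with "v", so a prefix check can coarsen the tag set when only broad categories matter:

```java
// Hypothetical coarsening of HanLP POS tags into broad categories,
// based on the tag conventions listed above: noun subtags start with "n",
// verb subtags with "v", adjective subtags with "a".
class PosCoarsener {

    static String coarsen(String pos) {
        if (pos.equals("t"))     return "time";      // exact tags first
        if (pos.startsWith("n")) return "noun";      // n, nr, ns, nt, nz, nl, ng
        if (pos.startsWith("v")) return "verb";      // v, vd, vn, vshi, vyou
        if (pos.startsWith("a")) return "adjective"; // a, ad
        return "other";
    }

    public static void main(String[] args) {
        System.out.println(coarsen("ns"));   // place name -> noun
        System.out.println(coarsen("vyou")); // verb "有"  -> verb
        System.out.println(coarsen("t"));    // -> time
    }
}
```

This kind of prefix-based grouping is one way to extend the simple exact-match mapping in getPosInfo above without enumerating every subtag.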
 

Origin blog.csdn.net/u014361280/article/details/132355363