Using HanLP for Chinese word segmentation: automatically recognizing and extracting text

Requirement: A customer sends a salesperson their personal details, and the salesperson places the order on the customer's behalf. Today the salesperson has to copy and paste the shipping address, phone number, name, and so on by hand. An intelligent word segmentation system would let the salesperson extract this information with one click.

After some research, I found the following open source tokenizers:

1, word tokenizer
2, ansj tokenizer
3, mmseg4j tokenizer
4, ik-analyzer tokenizer
5, jcseg tokenizer
6, FudanNLP tokenizer
7, smartcn tokenizer
8, jieba tokenizer
9, Stanford tokenizer
10, HanLP tokenizer

In the end I chose HanLP. The setup steps are available on the official website; the following demonstrates how it recognizes an address:

List<Term> list = HanLP.newSegment().seg("汤姆江西省南昌市红谷滩新区111号电话12023232323");
System.out.println(list);

Output:

[汤姆/nrf, 江西省/ns, 南昌市/ns, 红谷滩/nz, 新区/n, 111/m, 号/q, 电话/n, 12023232323/m]

As you can see, the text has been roughly segmented, but what we need is the complete address. The tags after each word (ns, nz, and so on) evidently denote the word's type, so the first idea that comes to mind is to check each term's type and concatenate the terms that belong to the address into one string. But someone in the all-powerful open source community must have solved this long ago. Checking the official website turns up this entry:
NLP segmentation: NLPTokenizer performs full named entity recognition and part-of-speech tagging.
That seems to be what I want.
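
For comparison, the manual approach mentioned above (check each term's tag and concatenate the ones that look like address parts) could be sketched roughly as follows. This is only an illustration: `Token` is a hypothetical stand-in for HanLP's `Term` so the example runs without the library, the tags are copied from the first output above, and treating `nz` as address-like is an assumption made for this sample.

```java
import java.util.List;

class AddressSplice {
    /** Hypothetical stand-in for HanLP's Term: a word plus its nature tag. */
    static final class Token {
        final String word;
        final String nature;

        Token(String word, String nature) {
            this.word = word;
            this.nature = nature;
        }
    }

    /** Concatenates the first run of consecutive tokens tagged ns* or nz*. */
    static String spliceAddress(List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Token t : tokens) {
            // Assumption: ns (place name) and nz (other proper noun) count as address parts.
            boolean placeLike = t.nature.startsWith("ns") || t.nature.startsWith("nz");
            if (placeLike) {
                sb.append(t.word);
            } else if (sb.length() > 0) {
                break; // the run of place-like tokens has ended
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Words and tags copied from the segmentation output shown earlier.
        List<Token> tokens = List.of(
                new Token("汤姆", "nrf"),
                new Token("江西省", "ns"),
                new Token("南昌市", "ns"),
                new Token("红谷滩", "nz"),
                new Token("新区", "n"),
                new Token("111", "m"),
                new Token("号", "q"),
                new Token("电话", "n"),
                new Token("12023232323", "m"));
        System.out.println(spliceAddress(tokens));
    }
}
```

On this sample the sketch returns only 江西省南昌市红谷滩: the terms 新区/n, 111/m and 号/q carry generic tags, so a purely tag-based rule cannot tell them apart from non-address words such as 电话/n.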

List<Term> terms = NLPTokenizer.segment("汤姆江西省南昌市红谷滩新区111号电话12023232323");
System.out.println(terms);

Result:

[汤姆/nr, 江西省南昌市红谷滩新区/nt, 111/m, 号/q, 电话/n, 12023232323/m]

Success! The catch is that the data package of more than 600 MB must be downloaded and set up before the address can be recognized; otherwise you only get the preliminary segmentation.

The complete code:

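import com.hankcs.hanlp.seg.common.Term;
import com.hankcs.hanlp.tokenizer.NLPTokenizer;

import java.util.List;

public class AddressDemo {
    public static void main(String[] args) {
        String str = "汤姆   江西省南昌市红谷滩新区111号     12023232323";
        String address = "";
        String phone = "";
        String name = "";
        List<Term> terms = NLPTokenizer.segment(str);
        System.out.println(terms);
        for (Term term : terms) {
            if (term.nature.startsWith("nr")) {
                // nr marks a person's name
                name = term.word;
                System.out.println("name: " + term.word);
            } else if (term.nature.startsWith("m") && term.word.length() == 11) {
                // m marks a number; an 11-character number is treated as the phone number
                phone = term.word;
                System.out.println("phone: " + term.word);
            }
        }

        // The address contains digits, which the tokenizer splits off as separate terms,
        // so instead of joining terms we take the difference: remove the name and the
        // phone number from the input and treat what is left as the address.
        address = str.replace(phone, "").replace(name, "").trim();
        System.out.println("address: " + address);
    }
}

Output:

name: 汤姆
phone: 12023232323
address: 江西省南昌市红谷滩新区111号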
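
The subtraction step at the end does not depend on HanLP, so it can be tried in isolation. Below is a minimal sketch under a few assumptions: the class `ContactSplitter` and its methods `findPhone` and `remainingAddress` are hypothetical names, the phone number is assumed to be a standalone run of exactly 11 digits (found here with a regular expression rather than the tokenizer's m tag), and the name is passed in as already known.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContactSplitter {
    // Assumption: a phone number is a run of exactly 11 digits with no digit on either side.
    private static final Pattern PHONE = Pattern.compile("(?<!\\d)\\d{11}(?!\\d)");

    /** Returns the first standalone 11-digit run in the text, or "" if there is none. */
    static String findPhone(String text) {
        Matcher m = PHONE.matcher(text);
        return m.find() ? m.group() : "";
    }

    /** Removes the known name and phone from the text; the trimmed remainder is treated as the address. */
    static String remainingAddress(String text, String name, String phone) {
        String rest = text;
        if (!phone.isEmpty()) {
            rest = rest.replace(phone, "");
        }
        if (!name.isEmpty()) {
            rest = rest.replace(name, "");
        }
        return rest.trim();
    }

    public static void main(String[] args) {
        String str = "汤姆   江西省南昌市红谷滩新区111号     12023232323";
        String phone = findPhone(str);
        // The name would normally come from the tokenizer; it is hard-coded here.
        String address = remainingAddress(str, "汤姆", phone);
        System.out.println("phone: " + phone);
        System.out.println("address: " + address);
    }
}
```

One caveat of this approach: `String.replace` removes every occurrence, so a name that also appears inside the address would be stripped from it as well.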
