Learning natural language processing (NLP) and a first look at HanLP

Table of contents

Preface

 1. Basic knowledge of natural language processing

1. NLP category

2. Core tasks

2. Brief introduction to HanLP

3. HanLP cloud service capabilities

1. New cloud native 2.x

2. Python API call

3. Go API call

4. Java API call

4. HanLP native service

1. Local development

Summary


Preface

        Riding the wave set off by ChatGPT, and with the continued development of artificial intelligence, many everyday applications now use AI. Computers help translate foreign-language documents, and machines can draw pictures and generate video footage on their own. But sometimes AI has small bugs and behaves far less intelligently than expected, which people jokingly call "artificial stupidity". Content that is easy for humans to understand can be very hard for a computer to express or understand, especially in language processing.

        So how can we make artificial intelligence more "intelligent"? Natural language processing is one important route. Natural language processing (NLP) is an important branch of artificial intelligence whose purpose is to process natural language intelligently by computer. Basic NLP techniques work at seven levels of language: phonology (the pronunciation patterns of the language), morphology (how letters form words, and how words inflect), lexicon (the relationships between words), syntax (how words form sentences), semantics (the meaning of language expressions), pragmatics (how meaning is interpreted in different contexts), and discourse (how sentences combine into paragraphs). These basic techniques are used in downstream NLP tasks such as machine translation, dialogue, question answering, and document summarization.

        Scientists study natural language processing so that machines can understand human language, communicate with humans in natural language, and ultimately possess "intelligence." In the AI era, we hope computers will have the capabilities of vision, hearing, language, and action. Language is one of the most important features that distinguish humans from animals. It is the carrier of human thought and the medium through which knowledge is condensed and passed on. In artificial intelligence, the purpose of studying natural language processing is to let machines understand and generate human language, so that they can communicate with humans smoothly and on equal terms.

        This article briefly introduces the basics of natural language processing, focuses on the functions of the HanLP component, and introduces its two modes, cloud and local deployment. I hope it is helpful to readers who need it.

 1. Basic knowledge of natural language processing

1. NLP category

        1. Text mining: mainly covers text classification, clustering, information extraction, summarization, and sentiment analysis, plus visual and interactive presentation of the mined information and knowledge. These are collectively called text mining tasks.

        2. Information retrieval: indexes large collections of documents. The index can be built simply by assigning weights to the words in each document, or with algorithms that build a deeper index. At query time, the input is analyzed first, matching candidate documents are looked up in the index, the candidates are ranked by a ranking mechanism, and the documents with the highest ranking scores are returned.

        3. Syntactic and semantic analysis: performs various kinds of analysis on the target sentence, such as word segmentation, part-of-speech tagging, named entity recognition, syntactic parsing, semantic role labeling, and word sense disambiguation.

        4. Machine translation: a branch of natural language processing that automatically generates text in one natural language from text in another. With the rapid development of communication and Internet technology, the rapid growth of information, and closer international connections, the challenge of letting everyone obtain information across language barriers has exceeded what human translation can handle. Thanks to its high efficiency and low cost, machine translation meets the need for fast translation of multilingual information around the world. From the earliest rule-based methods, to the statistics-based methods of twenty years ago, to today's deep learning (encoder-decoder) methods, a relatively rigorous body of methods has gradually formed. Translation platforms from AI industry giants, such as Google Translate, Baidu Translate, and Sogou Translate, have gradually taken a dominant position in the translation industry thanks to their efficiency and accuracy.

        5. Question answering system: as the Internet grows, the amount of online information keeps increasing, and people need more accurate information. Traditional search engine technology can no longer meet these growing needs, and automatic question answering has become an effective way to address the problem. Automatic question answering is the task of having a computer automatically answer questions raised by users in order to meet their knowledge needs. To answer a question, the system must first understand it correctly and extract the key information, then search and match against an existing corpus or knowledge base, and finally return the answer it finds to the user.

        6. Dialogue system: the system chats with users, answers their questions, and completes certain tasks through multi-turn dialogue. It mainly involves user intent recognition, a general chat engine, a question answering engine, a dialogue management system, and related technologies. To stay relevant to the context, the system must be able to carry a dialogue over multiple turns. To be personalized, it also needs to tailor its replies to the user's profile.
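The information retrieval steps in item 2 (weight the words, build an index, look up candidates, rank them) can be sketched in a few lines of Python. This is only a toy illustration: the TF-IDF weighting, the whitespace tokenization, the function names `build_index` and `search`, and the three sample documents are all assumptions made for this sketch and are not taken from HanLP or any real search engine.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build an inverted index: term -> {doc_id: TF-IDF weight}."""
    # Count term occurrences per document (naive whitespace tokenization).
    term_counts = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    n_docs = len(docs)
    index = defaultdict(dict)
    for doc_id, counts in term_counts.items():
        for term, tf in counts.items():
            idf = math.log(n_docs / doc_freq[term]) + 1.0
            index[term][doc_id] = tf * idf
    return index


def search(index, query, top_k=3):
    """Analyze the query, collect candidate documents, and rank them by score."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    # Highest score first; ties are broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical sample documents, for illustration only.
docs = {
    "d1": "machine translation with deep learning",
    "d2": "question answering over a knowledge base",
    "d3": "deep learning for question answering",
}
index = build_index(docs)
results = search(index, "deep learning translation")
print([doc_id for doc_id, _ in results])  # d1 ranks ahead of d3; d2 does not match
```

For Chinese text the whitespace split would have to be replaced by a real word segmenter, which is exactly where a tool such as HanLP comes in.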

2. Core tasks

        Generally speaking, natural language processing has two core tasks: natural language understanding (NLU) and natural language generation (NLG). Understanding language comes naturally to humans but is very difficult for machines. The main difficulties in natural language understanding come from the flexibility of language itself: its diversity, its ambiguity, and its dependence on knowledge and context. Generation brings its own difficulties in practice, such as whether the grammatical structure and meaning of a generated sentence are accurate and whether information is repeated.

        To address these problems, a set of basic NLP tasks has emerged, including word segmentation, part-of-speech tagging, lemmatization, dependency parsing, named entity recognition, sequence labeling, and sentence relation recognition.
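To make the ambiguity problem concrete, here is a minimal sketch of dictionary-based word segmentation using forward maximum matching. This is a classic textbook technique chosen purely for illustration; it is not how HanLP segments text, and the function name `forward_max_match` and the tiny vocabulary are assumptions made up for this example.

```python
def forward_max_match(text, vocab, max_len=3):
    """Greedy segmentation: at each position take the longest word found in vocab."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary.
vocab = {"研究", "研究生", "生命", "起源"}
tokens = forward_max_match("研究生命起源", vocab)
print(tokens)  # prints ['研究生', '命', '起源']
```

The intended reading is 研究 / 生命 / 起源 ("study the origin of life"), but the greedy matcher grabs 研究生 ("graduate student") first. Resolving this kind of ambiguity needs context and knowledge, which is why production toolkits rely on statistical or neural models rather than simple dictionary lookup.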

2. Brief introduction to HanLP

        HanLP is a multilingual natural language processing toolkit for production environments. It is built on two engines, PyTorch and TensorFlow 2.x, and its goal is to popularize cutting-edge NLP technology and put it into practice. HanLP offers complete functionality, high accuracy, efficient performance, up-to-date corpora, a clear architecture, and customizability.

        To fit different scenarios and project needs, HanLP provides two kinds of API, RESTful and native, aimed at lightweight and large-scale scenarios respectively. Whichever API or programming language is used, the HanLP interface stays semantically consistent, and the code remains open source.

        HanLP supports the following functions; see the figure below for details:

3. HanLP cloud service capabilities

        HanLP can be used in two main ways: online calls and local programming. When Internet access is allowed, an application can call HanLP's cloud service directly in a RESTful manner, which leaves the loading, training, and learning of data and models to the server and is a very convenient way to work. In an offline environment, you can build on the native local API (this article uses Java as the example).

1. New cloud native 2.x

Introduction to the cloud environment: HanLP official website.

Cloud environment web page address: HanLP GitHub address.

 2. Python API call

        The RESTful client is only a few KB, which suits agile development, mobile apps, and similar scenarios. It is simple to use, installs quickly, and needs no GPU configuration. The server side has more corpora and larger models, so accuracy is higher. It is highly recommended. Server GPU capacity is limited and the quota for anonymous users is small, so it is recommended to apply for a free public-interest API key (auth).

Install the Python client:

pip install hanlp_restful

 Create a client and fill in the server address and API key:

from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://www.hanlp.com/api', auth=None, language='zh')  # leave auth as None for anonymous access; 'zh' is Chinese, 'mul' is multilingual
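Once the client exists, a call might look like the sketch below. It assumes that the client exposes a `tokenize(text)` method returning one list of tokens per sentence, which is my understanding of the `hanlp_restful` package and should be checked against its documentation. The helper names `flat_tokens` and `demo` are made up for this example, and the sample sentence is taken from the output shown later in this article.

```python
from typing import List


def flat_tokens(client, text: str) -> List[str]:
    """Tokenize text with a HanLPClient-like object and flatten the per-sentence lists."""
    sentences = client.tokenize(text)  # assumed shape: [[token, ...], [token, ...], ...]
    return [token for sentence in sentences for token in sentence]


def demo() -> None:
    """Needs network access and `pip install hanlp_restful`; not called automatically."""
    from hanlp_restful import HanLPClient

    # auth=None means anonymous access, which has a small quota.
    client = HanLPClient('https://www.hanlp.com/api', auth=None, language='zh')
    print(flat_tokens(client, '区长庄木弟新年致辞'))
```

Call `demo()` from a machine with Internet access to try it. Keeping the network call inside a function means the helper can be exercised with a stub client when the server is unreachable or the anonymous quota is used up.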

 3. Go API call

Install the Go client:

go get -u github.com/hankcs/gohanlp@main

 Create a client and fill in the server address and API key:

HanLP := hanlp.HanLPClient(hanlp.WithAuth(""), hanlp.WithLanguage("zh")) // leave auth empty for anonymous access; "zh" is Chinese, "mul" is multilingual

4. Java API call

Add dependencies in pom.xml:

<dependency>
    <groupId>com.hankcs.hanlp.restful</groupId>
    <artifactId>hanlp-restful</artifactId>
    <version>0.0.12</version>
</dependency>

 Create a client and fill in the server address and API key:

HanLPClient HanLP = new HanLPClient("https://www.hanlp.com/api", null, "zh"); // pass null as auth for anonymous access; "zh" is Chinese, "mul" is multilingual

 4. HanLP native service

        Besides its rich cloud capabilities, HanLP also supports local calls. The official open-source 1.x version provides local development capabilities for Java. Note that you need to switch tags on GitHub: the project source code is only visible after switching to the 1.x version. It is a standard Java project.

        After downloading the package, you can see the complete project source code.

 1. Local development

        Download the HanLP code to debug it locally. If you only need to call HanLP from your own code, you can depend on the packaged artifact directly. Here we use Eclipse as the example, create a HanLP sample function, and debug it. The sample prints segmentation results with part-of-speech tags, as follows:

[签约/v, 仪式/n, 前/f, ,/w, 秦光荣/nr, 、/w, 李纪恒/nr, 、/w, 仇和/nr, 等/u, 一同/d, 会见/v, 了/ul, 参加/v, 签约/v, 的/uj, 企业家/n, 。/w]
[武大靖/nr, 创/vg, 世界纪录/nz, 夺冠/v, ,/w, 中国代表团/nt, 平昌/ns, 首金/n]
[区长/n, 庄木弟/nr, 新年/t, 致辞/v]
[朱立伦/nr, :/w, 两岸/n, 都/d, 希望/v, 共/d, 创/vg, 双/m, 赢/v,  /w, 习/ng, 朱/nr, 历史/n, 会晤/v, 在即/v]
[陕西/ns, 首富/n, 吴一坚/nr, 被/p, 带走/v,  /w, 与/p, 令计划/nr, 妻子/n, 有/v, 交集/n]
[据/p, 美国之音/n, 电台/n, 网站/n, 4/m, 月/q, 28/m, 日/j, 报道/v, ,/w, 8/m, 岁/q, 的/uj, 凯瑟琳/nrf, ·/w, 克罗尔/nrf, (/w, 凤甫娟/nr, )/w, 和/c, 很多/m, 华裔/n, 美国/ns, 小朋友/n, 一样/u, ,/w, 小小年纪/n, 就/d, 开始/v, 学/v, 小提琴/n, 了/ul, 。/w, 她/r, 的/uj, 妈妈/n, 是/v, 位/q, 虎/n, 妈/n, 么/y, ?/w]
[凯瑟琳/nrf, 和/c, 露西/nrf, (/w, 庐瑞媛/nr, )/w, ,/w, 跟/p, 她们/r, 的/uj, 哥哥/n, 们/k, 有/v, 一些/m, 不同/a, 。/w]
[王国强/nr, 、/w, 高峰/n, 、/w, 汪洋/n, 、/w, 张朝阳/nr, 光着头/l, 、/w, 韩寒/nr, 、/w, 小四/nr]
[张浩和/nr, 胡健康/nr, 复员/vn, 回家/v, 了/ul]
[王总/nr, 和/c, 小丽/nr, 结婚/v, 了/ul]
[编剧/n, 邵钧林/nr, 和/c, 稽道青/nr, 说/v]
[这里/r, 有/v, 关天培/nr, 的/uj, 有关/vn, 事迹/n]
[龚学平/nr, 等/u, 领导/n, 说/v, ,/w, 邓颖超/nr, 生前/t, 杜绝/v, 超生/vn, ,/w, 2023/m, 年/q, 在/p, 湖南省/ns, 长沙市/ns, 天心区/ns, 暮云镇/ns, 开启/v, 的/uj, 互联网/n, 大会/n, ,/w, 首次/mq, 提出/v]
+++++++++++++++++++++++++++++++++++++++++++
[签约/v, 仪式/n, 前/f, ,/w, 秦光荣/nr, 、/w, 李纪恒/nr, 、/w, 仇和/nr, 等/u, 一同/d, 会见/v, 了/ul, 参加/v, 签约/v, 的/uj, 企业家/n, 。/w]
[武大靖/nr, 创/vg, 世界纪录/nz, 夺冠/v, ,/w, 中国代表团/nt, 平昌/ns, 首金/n]
[区长/n, 庄木弟/nr, 新年/t, 致辞/v]
[朱立伦/nr, :/w, 两岸/n, 都/d, 希望/v, 共/d, 创/vg, 双/m, 赢/v,  /w, 习/ng, 朱/nr, 历史/n, 会晤/v, 在即/v]
[陕西/ns, 首富/n, 吴一坚/nr, 被/p, 带走/v,  /w, 与/p, 令计划/nr, 妻子/n, 有/v, 交集/n]
[据/p, 美国之音/n, 电台网站/nt, 4/m, 月/q, 28/m, 日/j, 报道/v, ,/w, 8/m, 岁/q, 的/uj, 凯瑟琳/nrf, ·/w, 克罗尔/nrf, (/w, 凤甫娟/nr, )/w, 和/c, 很多/m, 华裔/n, 美国/ns, 小朋友一/nrj, 样/q, ,/w, 小小年纪/n, 就/d, 开始/v, 学/v, 小提琴/n, 了/ul, 。/w, 她/r, 的/uj, 妈妈/n, 是/v, 位/q, 虎/n, 妈/n, 么/y, ?/w]
[凯瑟琳/nrf, 和/c, 露西/nrf, (/w, 庐瑞媛/nr, )/w, ,/w, 跟/p, 她们/r, 的/uj, 哥哥/n, 们/k, 有/v, 一些/m, 不同/a, 。/w]
[王国强/nr, 、/w, 高峰/n, 、/w, 汪洋/n, 、/w, 张朝阳/nr, 光着头/l, 、/w, 韩寒/nr, 、/w, 小四/nr]
[张浩和/nr, 胡健康/nr, 复员/vn, 回家/v, 了/ul]
[王总/nr, 和/c, 小丽/nr, 结婚/v, 了/ul]
[编剧/n, 邵钧林/nr, 和/c, 稽道青/nr, 说/v]
[这里/r, 有/v, 关天培/nr, 的/uj, 有关/vn, 事迹/n]
[龚学平/nr, 等/u, 领导/n, 说/v, ,/w, 邓颖超/nr, 生前/t, 杜绝/v, 超生/vn, ,/w, 2023/m, 年/q, 在/p, 湖南省/ns, 长沙市/ns, 天心区/ns, 暮云镇/ns, 开启/v, 的/uj, 互联网大会/nt, ,/w, 首次/mq, 提出/v]
over......
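The two groups of output above, separated by the line of plus signs, segment the same sentences slightly differently. The article does not say which segmenter or configuration produced each group. As a rough sketch, the `[word/tag, ...]` lines can be parsed and compared in Python as follows. The function names `parse_tagged_line` and `only_in_first` are made up for this example, and the two sample lines are excerpts from the last sentence of each group.

```python
def parse_tagged_line(line):
    """Parse '[word/tag, word/tag, ...]' into a list of (word, tag) pairs."""
    body = line.strip()
    if body.startswith("[") and body.endswith("]"):
        body = body[1:-1]
    pairs = []
    for item in body.split(", "):
        # Split on the last slash so a word that itself contains '/' stays intact.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs


def only_in_first(first, second):
    """Return the (word, tag) pairs of first that do not occur in second, in order."""
    seen = set(second)
    return [pair for pair in first if pair not in seen]


# Excerpts from the last sentence of each group above.
run_a = parse_tagged_line("[开启/v, 的/uj, 互联网/n, 大会/n, ,/w, 首次/mq, 提出/v]")
run_b = parse_tagged_line("[开启/v, 的/uj, 互联网大会/nt, ,/w, 首次/mq, 提出/v]")

print(only_in_first(run_a, run_b))  # prints [('互联网', 'n'), ('大会', 'n')]
print(only_in_first(run_b, run_a))  # prints [('互联网大会', 'nt')]
```

Here the first run splits 互联网 and 大会 into two common nouns, while the second merges them into a single organization name (nt). The same kind of difference shows up for 电台网站 in the sixth sentence.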

 Summary

        That is the main content of this article. It briefly introduced the basics of natural language processing, focused on the functions of the HanLP component, and introduced the cloud and local deployment modes. I hope it helps readers who need it. The article was written in a hurry; if anything is inappropriate, please leave a comment with your criticism and corrections.


Origin blog.csdn.net/yelangkingwuzuhu/article/details/133353684