Natural language processing (NLP) tasks

1. What is NLP

Natural language processing (NLP) is a subfield of artificial intelligence that studies language problems in human-human and human-computer interaction. The main application fields of artificial intelligence include data mining, recommendation algorithms, intelligent search, advertising recommendation, natural language processing, computer vision, autonomous driving, and so on.

2. Main research directions of NLP

  1. Information extraction: Extract important information from a given text, such as time, place, person, event, cause, result, number, date, currency, proper nouns, etc. In layman's terms, it means understanding who did what to whom, when, why, and with what consequences.
  2. Text generation: Machines express themselves and write like humans using natural language. Depending on the input, text generation techniques include data-to-text generation and text-to-text generation. Data-to-text generation converts data containing key-value pairs into natural language text; text-to-text generation transforms and processes input text to produce new text (a minimal data-to-text sketch appears after this list).
  3. Question answering system: For a question expressed in natural language, the question answering system gives an accurate answer. It needs to perform some degree of semantic analysis on the natural language query, including entity linking, relation identification, and construction of a logical expression, then search the knowledge base for candidate answers and select the best one through a ranking mechanism.
  4. Dialogue system: The system chats with the user, answers questions, and completes certain tasks through a series of dialogues. It involves technologies such as user intention understanding, a general-purpose chat engine, and dialogue management. In addition, to reflect contextual relevance, it must be capable of multi-turn dialogue.
  5. Text mining: Includes text clustering, classification, sentiment analysis, and visual, interactive interfaces for presenting the mined information and knowledge. The current mainstream techniques are based on statistical machine learning.
  6. Speech recognition and generation: Speech recognition converts spoken input fed into the computer into a written-language representation. Speech generation, also known as text-to-speech conversion or speech synthesis, automatically converts written text into the corresponding speech representation.
  7. Information filtering: Automatically identify and filter documents that meet specific conditions through a computer system. It usually refers to the automatic identification and filtering of harmful information on the Internet, and is mainly used for information security and protection, network content management, etc.
  8. Public opinion analysis: Refers to collecting and processing massive amounts of information and automatically analyzing online public opinion, so that it can be responded to in a timely manner.
  9. Information retrieval: Index large-scale documents. You can simply assign different weights to the words in a document to build the index, or build a deeper index. At query time, the input query expression, such as a search term or a sentence, is analyzed; matching candidate documents are found in the index, ranked by a scoring mechanism, and the highest-scoring documents are returned (see the retrieval sketch after this list).
  10. Machine translation: Automatically translate input text in a source language into text in another language. Machine translation has gradually formed a fairly rigorous methodology, from the earliest rule-based methods, to the statistics-based methods of twenty years ago, to today's neural network (encoder-decoder) based methods.
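
To make the data-to-text case in item 2 concrete, here is a minimal template-based sketch; the record and its keys (city, date, high, low) are made-up examples, and real systems use far richer templates or learned generators.

```python
# Minimal data-to-text generation: render a key-value record with a template.
# The record fields below are hypothetical toy data.
def weather_report(record: dict) -> str:
    template = "In {city} on {date}, the high will be {high} C and the low {low} C."
    return template.format(**record)

record = {"city": "Beijing", "date": "2021-07-09", "high": 31, "low": 24}
print(weather_report(record))
# -> In Beijing on 2021-07-09, the high will be 31 C and the low 24 C.
```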
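And here is the information-retrieval idea from item 9 as a runnable sketch: TF-IDF scores act as the per-word index weights, and cosine similarity acts as the ranking mechanism. It assumes scikit-learn is installed; the documents and query are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy document collection (made-up examples).
docs = [
    "neural machine translation with encoder decoder models",
    "statistical methods for machine translation",
    "text classification and sentiment analysis",
]

vectorizer = TfidfVectorizer()          # per-word weights = TF-IDF
index = vectorizer.fit_transform(docs)  # the "index": one weighted vector per document

query = "machine translation"
scores = cosine_similarity(vectorizer.transform([query]), index)[0]

# Rank candidate documents by score, best first.
for rank, i in enumerate(scores.argsort()[::-1], start=1):
    print(rank, round(float(scores[i]), 3), docs[i])
```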

3. Real skills an NLP algorithm engineer needs to master

  1. Regular expressions. Beyond simple text-matching scenarios, document structuring of all kinds and the cold start of information extraction basically depend on them. In real business you cannot get high-quality annotated data right away, let alone train various SOTA models, so mastering all sorts of regular expressions will undoubtedly pay off greatly. For example: there is no xx character before a match but there is an xx character after it, or the xx character appears several times and must be followed by the xx character (see the regex sketch after this list). (From Zhihu: I have actually always wanted to build a model that automatically recommends regular expressions from a plain-language description. After researching it for a while I found it very difficult, which, from another angle, also shows that writing regular expressions well is itself very difficult.)
  2. Common syntactic analysis tools. These are generally used in key-phrase extraction and information extraction. When annotated data is lacking, SOTA models cannot be used, and since regular expressions alone cannot cover every case, it is sometimes necessary to combine them with syntactic analysis tools, using simple grammatical patterns such as subject-predicate, verb-object, and complement structures to pick out the information units we need. However, training a syntax tool from scratch usually cannot meet the business launch deadline, so existing tools such as Harbin Institute of Technology's LTP are typically used. They have some problems, but are sufficient in most cases (see the parsing sketch after this list).
  3. Awareness of performance optimization. This does not refer to one specific optimization method, but to keeping the awareness while writing everyday code: how to make loops more efficient, where operations can be parallelized, and how to optimize data loading for the model (see the vectorization sketch after this list). When the model is deployed online for inference, check whether it meets the business's performance requirements; if not, find where to optimize. For example: whether redundant or overly complex layers were designed into the model; whether an efficient deployment method (such as tensorflow-serving) is used; whether TensorRT can optimize some ops; and whether tensorflow can be recompiled for the hardware environment of the deployment platform. These are all pits you climb out of in real projects, haha.
  4. Linux. This skill is also indispensable. It does not require you to be proficient in every Linux operation, but you must at least know the common commands: monitoring server status, configuring firewall policies, simple docker operations, working with common databases (mysql, postgresql, etc.), compression and decompression, ftp/sftp, chmod, vi, cat, ps, and so on. Beyond the skills above, everyone is assumed to already have the baseline technical competence for the NLP field, so no extra notes are given here; if you do not understand basic statistical machine learning, RNNs, transformers, BERT, etc., none of this makes sense. The skills above are aimed at algorithm engineering that must actually ship; if you specialize in algorithm research with no delivery requirements, you can ignore them. In that case what you need is to grind out papers and competitions, and to lean on your research group, your advisor, and your senior classmates...
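
Here is the lookaround pattern type from item 1 as a minimal sketch, using Python's re module; the literal characters A, B, and C stand in for the "xx" placeholders above.

```python
import re

# Match "B" only when it is NOT preceded by "A" and IS followed by "C".
# (?<!A) is a negative lookbehind; (?=C) is a positive lookahead.
pattern = re.compile(r"(?<!A)B(?=C)")

print(pattern.findall("ABC"))    # []     -- "B" is preceded by "A"
print(pattern.findall("XBC"))    # ['B']  -- not preceded by "A", followed by "C"
print(pattern.findall("XBX"))    # []     -- not followed by "C"

# "the xx character appears several times and must be followed by xx":
# two or more "B"s immediately followed by a "C".
repeated = re.compile(r"B{2,}(?=C)")
print(repeated.findall("BBBC"))  # ['BBB']
```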
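Item 2's idea of pulling out subject-predicate and verb-object units can be sketched with a dependency parser. LTP targets Chinese; as an illustrative stand-in, this uses spaCy's small English model instead (assumes spacy and its en_core_web_sm model are installed).

```python
import spacy

# Stand-in for a tool like LTP: spaCy's dependency parser on an English sentence.
nlp = spacy.load("en_core_web_sm")
doc = nlp("The company acquired a small startup last year.")

for token in doc:
    # nsubj = subject of a verb, dobj = direct object of a verb
    if token.dep_ in ("nsubj", "dobj"):
        print(token.dep_, ":", token.text, "->", token.head.text)

# Expected output (roughly):
#   nsubj : company -> acquired
#   dobj : startup -> acquired
```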
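For item 3, a tiny illustration of the loop-efficiency point: replacing a Python-level loop with a single vectorized numpy call, a common first step when optimizing data processing (assumes numpy is installed; exact timings depend on your machine).

```python
import time
import numpy as np

x = np.random.rand(1_000_000)

# Python-level loop: one interpreter iteration per element.
t0 = time.perf_counter()
total_loop = 0.0
for v in x:
    total_loop += v * v
t1 = time.perf_counter()

# Vectorized: the same computation in a single C-level call.
total_vec = float(np.dot(x, x))
t2 = time.perf_counter()

print(f"loop: {t1 - t0:.3f}s  vectorized: {t2 - t1:.3f}s")
print(np.isclose(total_loop, total_vec))  # True: same result, different speed
```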

4. Basic tasks

Efficient vector representations of characters, words, or sentences greatly reduce our dependence on manual feature engineering; on this basis, natural language processing has a series of basic tasks (see the embedding sketch after the list below).
If a piece of text is understood as a sequence and various labels are understood as different categories, then basic NLP tasks can be divided into the following types according to the nature of the problem.

  • Generate sequences from categories: including tasks such as text generation and image description generation.
  • Generate categories from sequences: including tasks such as text classification, sentiment analysis, relation extraction, etc. (a minimal classification sketch follows this list).
  • Generate synchronous sequences from sequences: including tasks such as word segmentation, part-of-speech tagging, semantic role labeling, and entity recognition.
  • Generate asynchronous sequences from sequences: including tasks such as machine translation, automatic summarization, pinyin input, etc.
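
As a minimal sketch of the "efficient vector representation" idea, here is word2vec via gensim (assumed installed). The toy corpus is made up and far too small to produce meaningful vectors; it only shows the mechanics.

```python
from gensim.models import Word2Vec

# Tiny made-up corpus: each sentence is a list of tokens.
sentences = [
    ["machine", "translation", "converts", "source", "text"],
    ["text", "classification", "assigns", "a", "category"],
    ["speech", "recognition", "converts", "speech", "to", "text"],
]

# Train a small word2vec model; vector_size is the embedding dimension.
model = Word2Vec(sentences, vector_size=32, window=2, min_count=1, epochs=50)

vec = model.wv["text"]   # the learned 32-dimensional vector for "text"
print(vec.shape)         # (32,)
print(model.wv.most_similar("text", topn=2))
```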
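And to make the "categories from sequences" type concrete, here is a minimal text-classification sketch with scikit-learn (assumed installed); the six training sentences and their labels are made up, so treat this as the shape of the task only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up sentiment data: the input is a sequence (text), the output a category.
texts = [
    "this movie was great", "wonderful acting and story",
    "I loved every minute", "terrible plot and pacing",
    "a boring waste of time", "the worst film this year",
]
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

# Bag-of-words features + logistic regression: sequence in, category out.
clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["what a great story"]))  # likely ['pos']
```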

Source: blog.csdn.net/diaozhida/article/details/118611416