NLPIR's intelligent word segmentation is the key to semantic text mining

  Lexical analysis is the foundation and key of natural language processing. In Chinese natural language processing, words are the smallest meaningful language units that can function independently. Chinese writing uses characters as its basic unit and places no explicit delimiter between words. Chinese natural language processing therefore usually begins by segmenting the character strings of a text into a reasonable sequence of words, and performs all further analysis and processing on that basis. Chinese word segmentation is a fundamental step in Chinese information processing and is widely used in Chinese text processing, information extraction, text mining, and other applications.
  A natural language processing system must take into account a great deal of knowledge about the language itself and its structure: what words are, how words combine into sentences, what words mean, how word meanings contribute to sentence meaning, and so on. Even that is not enough. For example, a system that answers questions or takes part in a dialogue needs not only extensive knowledge of linguistic structure, but also general knowledge of the human world and human-like reasoning ability. Many linguists therefore divide the analysis and understanding of language into the following main levels: lexical analysis, syntactic analysis, semantic analysis, and discourse analysis.
  First, lexical analysis, which mainly covers word segmentation, part-of-speech tagging, word sense disambiguation, and new word recognition, obtains linguistic information by segmenting text and gathering statistics on word frequency and position.
  Second, syntactic analysis characterizes sentence components to analyze structural features. By analyzing sentence and phrase structure, it finds the interrelationships among words and phrases and their respective roles in the sentence, and represents relationships such as subordination and constituency in an explicit structure. Its purpose is to identify the various structural components of the sentence.
  Third, understanding a question generally requires additional semantic and pragmatic knowledge to grasp the meaning of a sentence. Semantic analysis uncovers word meanings, structural meanings, and their combinations in order to determine what the sentence really expresses. Annotating this information requires the support of a complete set of concepts and their relation graph, along with a detailed semantic classification of syntactic components. It generally covers the linguistic level (knowledge reflecting surface phenomena of the language, such as synonymy and hierarchical relations), the ontological level (describing complex semantic relationships between concepts), the commonsense level, and so on. Although the work is voluminous, some preliminary results have already been achieved.
  Finally, discourse analysis examines the structural and semantic relationships among multiple sentences and paragraphs.
  The NLPIR word segmentation system is the product of years of research. Its main functions include Chinese word segmentation, English word segmentation, part-of-speech tagging, named entity recognition, new word recognition, keyword extraction, user professional dictionaries, and microblog analysis. The NLPIR system supports multiple encodings (GBK, UTF-8, BIG5), multiple operating systems, and multiple development languages and platforms.
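  As a concrete starting point, here is a minimal sketch using pynlpir, a community Python wrapper around NLPIR/ICTCLAS. It assumes the package and its licence data are installed; the exact tags returned depend on the NLPIR version.

```python
import pynlpir  # pip install pynlpir; run `pynlpir update` once to refresh the licence

pynlpir.open()  # load the NLPIR data files (UTF-8 input by default)

# Segment a mixed Chinese-English sentence and attach part-of-speech tags.
for word, pos in pynlpir.segment('NLPIR是一个中文词法分析系统', pos_tagging=True):
    print(word, pos)

pynlpir.close()  # release the NLPIR handle
```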
  Main functions of the NLPIR/ICTCLAS2018 word segmentation system
  1) Chinese-English mixed word segmentation
  Automatically segments and part-of-speech tags mixed Chinese and English text, covering Chinese word segmentation, English word segmentation, part-of-speech tagging, unregistered (out-of-vocabulary) word recognition, and user dictionaries.
  2) Keyword extraction
  Uses a cross-information-entropy algorithm to automatically extract keywords, including both new words and known words (a usage sketch follows this list).
  3) New word recognition and adaptive word segmentation
  Automatically discovers new words and characteristic expressions in longer texts based on information cross entropy, and adaptively fits a language probability distribution model to the corpus to achieve adaptive segmentation (a toy sketch of the underlying idea follows this list).
  4) User professional dictionary
  User dictionaries can be imported individually or in batches. For example, "the sensitive point of a report letter" can be defined, where "report letter" is a user word and "sensitive point" is a user-defined part-of-speech tag (see the sketch after this list).
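  For keyword extraction, pynlpir exposes NLPIR's keyword interface as get_key_words. A small hedged example; the sample sentence and the cutoff of five words are illustrative choices:

```python
import pynlpir

pynlpir.open()
text = '自然语言处理是计算机科学与语言学的交叉学科。'
# weighted=True also returns each keyword's weight, so a threshold can be
# tuned instead of taking a fixed number of words.
for word, weight in pynlpir.get_key_words(text, max_words=5, weighted=True):
    print(word, weight)
pynlpir.close()
```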
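  NLPIR's own new word discovery is internal to the library, but the general idea behind entropy-based approaches can be sketched in plain Python: promote a frequent candidate string to a word when the characters adjacent to it are unpredictable on both sides. All thresholds below are arbitrary assumptions, not NLPIR's actual parameters.

```python
import math
from collections import Counter

def boundary_entropy(neighbors):
    """Shannon entropy of the characters adjacent to a candidate string:
    high entropy on both sides suggests a free-standing word."""
    total = sum(neighbors.values())
    return -sum(c / total * math.log(c / total) for c in neighbors.values())

def find_new_words(text, max_len=4, min_count=5, min_entropy=1.5):
    """Return frequent character n-grams whose left and right contexts
    are both sufficiently unpredictable."""
    found = {}
    for n in range(2, max_len + 1):
        counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        for cand, freq in counts.items():
            if freq < min_count:
                continue
            left = Counter(text[i - 1] for i in range(1, len(text) - n + 1)
                           if text[i:i + n] == cand)
            right = Counter(text[i + n] for i in range(len(text) - n)
                            if text[i:i + n] == cand)
            if (left and right and
                    min(boundary_entropy(left), boundary_entropy(right)) >= min_entropy):
                found[cand] = freq
    return found

# find_new_words(corpus_text) maps each discovered string to its frequency.
```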
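  User dictionaries can be driven from Python through the raw C bindings that pynlpir exposes under pynlpir.nlpir. Exact signatures vary across NLPIR versions, so treat the calls below as an assumption-laden sketch; the "word tag" entry format with a user-defined tag follows NLPIR's documented convention.

```python
import pynlpir
from pynlpir import nlpir

pynlpir.open()
# Add a single user word with a custom part-of-speech tag ("word tag" format).
# '举报信' (report letter) gets the hypothetical user tag 'sens' (sensitive
# point); the raw ctypes bindings expect byte strings.
nlpir.AddUserWord('举报信 sens'.encode('utf-8'))
# Batch import, one "word tag" pair per line (hypothetical file name):
# nlpir.ImportUserDict('user_dict.txt'.encode('utf-8'))
print(pynlpir.segment('这封举报信涉及几个敏感点'))
pynlpir.close()
```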
  The ICTCLAS segmentation method first uses dictionary matching to produce an initial segmentation word graph, and then uses word frequency information to find the N shortest paths through that graph (the N-shortest-path method). Other researchers use dictionaries to find all overlapping ambiguities and then apply a bigram language model, or one of its variants, to disambiguate.
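  The core of the word-graph idea can be shown with a 1-shortest-path (N = 1) toy in Python: each dictionary word is an edge weighted by the negative log of its relative frequency, and dynamic programming finds the cheapest path through the sentence. The mini-dictionary and its counts are invented for illustration:

```python
import math

# Toy unigram counts standing in for a real dictionary (invented numbers).
FREQ = {'结婚': 100, '的': 5000, '和': 3000, '和尚': 150, '尚未': 200,
        '未': 300, '尚': 20}
TOTAL = sum(FREQ.values())

def segment(sentence):
    """1-shortest-path segmentation over the dictionary word graph:
    dp[i] holds the cheapest cost of segmenting sentence[:i]."""
    n = len(sentence)
    dp = [(0.0, 0)] + [(math.inf, 0)] * n   # (cost, backpointer)
    for i in range(n):
        if dp[i][0] == math.inf:
            continue
        for j in range(i + 1, n + 1):
            w = sentence[i:j]
            if w in FREQ:
                cost = -math.log(FREQ[w] / TOTAL)
            elif j == i + 1:
                # Smooth unseen single characters so a path always exists.
                cost = -math.log(0.5 / TOTAL)
            else:
                continue
            if dp[i][0] + cost < dp[j][0]:
                dp[j] = (dp[i][0] + cost, i)
    words, k = [], n               # recover the best path via backpointers
    while k > 0:
        words.append(sentence[dp[k][1]:k])
        k = dp[k][1]
    return words[::-1]

# Classic overlap ambiguity: '和/尚未' vs '和尚/未'.
print(segment('结婚的和尚未结婚'))  # -> ['结婚', '的', '和', '尚未', '结婚']
```

  The full N-shortest-path method keeps the N cheapest partial paths at each node rather than one, and the bigram variant replaces the unigram edge cost with -log P(w_i | w_(i-1)) so that disambiguation can use word-to-word context.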
