NLPIR technology supports intelligent Chinese data mining

  With the rapid development and wide application of data technology, many enterprises and departments have built their own data management systems and, after years of effort, have accumulated ever larger volumes of data. As a result, people are eager to extract information useful for decision-making by analyzing these huge datasets. Although current data systems can efficiently handle data entry, querying, statistics, and similar functions, the sheer volume of data and the serious lack of analysis methods in database systems mean that they cannot uncover the hidden relationships in the data, let alone use current and historical data to predict future trends. Hence the phenomenon of "more data, less knowledge," which results in a serious waste of resources.
  The emergence of computer decision support systems built on top of data systems provided a promising approach to high-level decision analysis over data. However, because decision support systems are limited in data collection and in the flexibility of their analysis methods, people have had to look for more effective ways to advance data-driven decision analysis. Artificial intelligence has contributed greatly here: it has passed through the stages of game playing, natural language understanding, and knowledge engineering, and has now entered the booming stage of machine learning.
  The NLPIR text search and mining system integrates natural language understanding, web search, and text mining technologies to meet the needs of Internet content processing. The development platform consists of multiple middleware components, and each middleware API can be embedded seamlessly into a customer's complex application systems, adapting to many application scenarios.
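  To show how one of these middleware APIs might be embedded in an application, here is a minimal sketch that assumes the open-source pynlpir Python wrapper around the NLPIR segmentation component; the wrapper and its call names (open, segment, get_key_words, close) are an assumption here and may differ across versions.

```python
# A minimal sketch, assuming the third-party pynlpir wrapper for the
# NLPIR/ICTCLAS library (pip install pynlpir); names may differ by version.
import pynlpir

pynlpir.open()  # load the NLPIR data files and licence

text = '大数据挖掘需要高效的中文分词和关键词提取。'

# Word segmentation with part-of-speech tags: yields (word, pos) pairs.
for word, pos in pynlpir.segment(text, pos_tagging=True):
    print(word, pos)

# Keyword extraction with weights (see function 9 in the list below).
print(pynlpir.get_key_words(text, weighted=True))

pynlpir.close()
```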
  Twelve functions of the NLPIR text search and mining development platform:
  1. Full-text exact retrieval: supports data types such as text, numbers, dates, and strings; efficient multi-field search; and a query grammar with AND/OR/NOT operators and NEAR proximity. It supports retrieval in minority languages such as Uyghur, Tibetan, Mongolian, Arabic, and Korean, and can be seamlessly integrated with existing text processing systems and database systems.
  2. New word discovery: mines a list of meaningful new words from a document collection, which can be used to compile a user's domain dictionary; the list can also be further edited, annotated, and imported into the segmentation dictionary, improving the accuracy of the word segmentation system and adapting it to changes in the language.
  3. Word segmentation and tagging: segments the raw corpus, automatically recognizes out-of-vocabulary words such as person, place, and organization names, marks new words, and tags parts of speech; user-defined dictionaries can be imported during analysis.
  4. Statistical analysis and term translation: based on the segmentation and tagging results, the system automatically computes unigram word frequencies and bigram transition probabilities (the frequency, and hence probability, with which one word is followed or preceded by another; a sketch of this statistic follows the list). For common terms, corresponding English glosses are given automatically.
  5. Text clustering and hot-spot analysis: automatically detects hot events in large-scale data and provides key feature descriptions of each event topic. It is suitable for hot-spot analysis of both long texts and short texts such as SMS messages and Weibo posts.
  6. Classification and filtering: given pre-specified rules and example documents, the system automatically filters the documents that meet the requirements out of a massive collection.
  7. Sentiment (positive/negative) analysis: given pre-specified analysis targets and example documents, the system automatically extracts positive and negative scores and example sentences from a large number of documents.
  8. Automatic summarization: automatically extracts the gist of a single article or a set of articles, allowing users to browse the content quickly.
  9. Keyword extraction: extracts words or phrases that represent the central idea of a single article or a collection of articles; the results can be used for condensed reading, semantic queries, and fast matching.
  10. Document deduplication: quickly and accurately determines whether a document collection or database contains records with identical or similar content, and finds all duplicate records (a fingerprint-based sketch follows the list).
  11. HTML text extraction: automatically discards navigation-only pages, strips HTML tags, navigation links, advertisements, and other noise from web pages, and returns the valuable body text. It is suited to preprocessing and analyzing large-scale Internet data.
  12. Automatic encoding detection and conversion: automatically identifies the encoding of the content and converts it uniformly to GBK (a detect-and-transcode sketch follows the list).
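  The bigram transition statistic mentioned in function 4 can be estimated directly from segmented text: for each adjacent word pair (w1, w2), P(w2 | w1) is the count of the pair divided by the count of w1. The sketch below illustrates this in plain Python; it is not NLPIR's implementation.

```python
from collections import Counter

def bigram_transition_probs(segmented_sentences):
    """Estimate P(next | current) from already-segmented sentences
    (each sentence is a list of words)."""
    unigrams = Counter()
    bigrams = Counter()
    for words in segmented_sentences:
        unigrams.update(words)              # unigram word frequencies
        bigrams.update(zip(words, words[1:]))  # adjacent word pairs
    return {
        (w1, w2): count / unigrams[w1]
        for (w1, w2), count in bigrams.items()
    }

corpus = [['数据', '挖掘', '技术'], ['数据', '分析', '技术']]
print(bigram_transition_probs(corpus))
# {('数据', '挖掘'): 0.5, ('挖掘', '技术'): 1.0, ('数据', '分析'): 0.5, ('分析', '技术'): 1.0}
```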
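  The duplicate detection in function 10 is commonly built on content fingerprints. The sketch below is illustrative only and not NLPIR's internal method: it groups documents by an MD5 hash of their whitespace-normalized text, which catches exact and trivially reformatted copies; production systems typically add similarity fingerprints (for example SimHash) to catch near-duplicates.

```python
import hashlib
from collections import defaultdict

def find_duplicates(docs):
    """Group document ids that share identical normalized content."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        # Normalize whitespace so trivially reformatted copies still match.
        normalized = ' '.join(text.split())
        fingerprint = hashlib.md5(normalized.encode('utf-8')).hexdigest()
        groups[fingerprint].append(doc_id)
    return [ids for ids in groups.values() if len(ids) > 1]

docs = {'a': '数据 挖掘', 'b': '数据   挖掘', 'c': '文本 聚类'}
print(find_duplicates(docs))  # [['a', 'b']]
```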
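  Function 12's encoding normalization boils down to a detect-then-transcode step. The sketch below uses the widely available chardet package for detection, which is an assumption here rather than NLPIR's own detector; the point is only to show the flow of identifying the source encoding and re-encoding the text as GBK.

```python
import chardet  # third-party detector; an assumption, not NLPIR's own detector

def to_gbk(raw_bytes):
    """Detect the encoding of raw bytes and re-encode the text as GBK."""
    detected = chardet.detect(raw_bytes)
    encoding = detected['encoding'] or 'utf-8'
    text = raw_bytes.decode(encoding, errors='replace')
    return text.encode('gbk', errors='replace')

utf8_bytes = '中文编码自动识别'.encode('utf-8')
gbk_bytes = to_gbk(utf8_bytes)
print(gbk_bytes.decode('gbk'))  # 中文编码自动识别
```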
  Data mining is an interdisciplinary field that brings together databases, artificial intelligence, statistics, visualization, parallel computing, and other areas, and it has attracted broad attention from many industries in recent years.
