IKAnalyzer details

IK Analyzer is an open source, lightweight Chinese word segmentation toolkit written in Java. Since the release of version 1.0 in December 2006, IKAnalyzer has gone through four major versions. Initially it was a Chinese word segmentation component built on the open source Lucene project, combining dictionary-based segmentation with grammar analysis algorithms. Starting with version 3.0, IK evolved into a general-purpose word segmentation component for Java, independent of the Lucene project while still providing a default implementation optimized for Lucene. The 2012 version added a simple algorithm for resolving segmentation ambiguities, marking the evolution of the IK tokenizer from plain dictionary-based segmentation toward simulated semantic segmentation.

IK Analyzer 2012 Features:

  1. It adopts a unique "forward iterative fine-grained segmentation algorithm" and supports two segmentation modes: fine-grained segmentation and smart (intelligent) segmentation (a usage sketch follows this list);

  2. Tested on an ordinary PC (Core2 i7 3.4 GHz dual-core, 4 GB RAM, Windows 7 64-bit, Sun JDK 1.6_29 64-bit), IK2012 achieves a high-speed processing throughput of about 1.6 million characters per second (3000 KB/s).

  3. The smart segmentation mode of the 2012 version supports simple disambiguation during segmentation and merged output of numeral-and-measure-word (quantifier) combinations.

  4. It adopts a multi-subprocessor analysis mode that supports segmentation of English letters, numbers, Chinese words, and so on, and is compatible with Korean and Japanese characters.

  5. Optimized dictionary storage with a smaller memory footprint. User-defined dictionary extensions are supported; notably, the dictionary in the 2012 version supports mixed Chinese, English, and numeric words (a configuration sketch follows below).
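
As a hedged illustration of the two segmentation modes in item 1, the sketch below uses the standalone IKSegmenter API as shipped with IK Analyzer 2012; the class and method names (org.wltea.analyzer.core.IKSegmenter, Lexeme.getLexemeText()) are taken from that release and should be verified against the version actually in use.

    import java.io.IOException;
    import java.io.StringReader;
    import org.wltea.analyzer.core.IKSegmenter;
    import org.wltea.analyzer.core.Lexeme;

    public class IKSegmentDemo {
        public static void main(String[] args) throws IOException {
            String text = "IK Analyzer是一个开源的中文分词工具包";

            // useSmart = false: fine-grained mode, emits all candidate words
            printTokens(text, false);
            // useSmart = true: smart mode, applies simple disambiguation
            printTokens(text, true);
        }

        private static void printTokens(String text, boolean useSmart) throws IOException {
            IKSegmenter segmenter = new IKSegmenter(new StringReader(text), useSmart);
            Lexeme lexeme;
            // next() returns null when the input has been fully consumed
            while ((lexeme = segmenter.next()) != null) {
                System.out.print(lexeme.getLexemeText() + " | ");
            }
            System.out.println();
        }
    }

When IK is embedded in Lucene, the org.wltea.analyzer.lucene.IKAnalyzer wrapper (whose boolean constructor argument selects smart mode) can be passed wherever a Lucene Analyzer is expected.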
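
For the user dictionary extension in item 5, IK Analyzer is conventionally configured through an IKAnalyzer.cfg.xml file at the root of the classpath. The snippet below is a minimal sketch; the dictionary file names ext.dic and stopword.dic are placeholders for the user's own files.

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
    <properties>
        <comment>IK Analyzer extension configuration</comment>
        <!-- extension dictionaries, semicolon-separated, resolved from the classpath -->
        <entry key="ext_dict">ext.dic;</entry>
        <!-- extension stop-word dictionaries -->
        <entry key="ext_stopwords">stopword.dic;</entry>
    </properties>

Each entry takes a semicolon-separated list of dictionary files, typically with one word per line in UTF-8 encoding.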

IKAnalyzer also has an unofficial .NET port, IKAnalyzer.NET.
