IKAnalyzer Chinese word segmentation 1

The version used here is IKAnalyzer2012FF_u1, which is compatible with Lucene 4.x.

import java.io.IOException;
import java.io.StringReader;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

// The class name is illustrative; the original snippet showed only the two methods.
public class IKAnalyzerTest {

	public static void test1() throws IOException {
		// Sample text to segment (inner quotation marks are escaped so the literal compiles)
		String keyWord = "What is the effect of IKAnalyzer's word segmentation? Let's take a look. If you can't or don't want to link pages for your website"
				+ "If the content is guaranteed (for example, untrustworthy user comments or message board entries), you should use nofollow for these links. This will prevent spam"
				+ "Content publishers target your site and help prevent your site from inadvertently transmitting PageRank to \"neighbors\" on the web. Especially if spam"
				+ "Comment publishers who find that untrusted links are not being tracked in the service may decide not to target the corresponding content management system or blogging service. If you wish"
				+ "Recognize and reward reliable content providers, then you can decide to automatically remove links posted by members or users who consistently provide high-quality content"
				+ "Or manually remove the nofollow attribute from it. Paid Links: A site's ranking in Google search results depends in part on the importance of other sites linking to it"
				+ "Analysis. To prevent paid links from affecting search results and adversely affecting users, we recommend that webmasters use nofollow for such links."
				+ "Search engine guidelines require paid links to be published in a machine-readable form (e.g., full-page"
				+ "Newspaper ads may use the headline \"Advertising\")";
		// Alternative (left commented out): create an IKAnalyzer object
		// IKAnalyzer analyzer = new IKAnalyzer();
		// and switch it to smart word segmentation
		// analyzer.setUseSmart(true);
		// Segment with IKSegmenter directly; the second argument (true) enables smart mode
		IKSegmenter ikseg = new IKSegmenter(new StringReader(keyWord), true);
		Lexeme lex = null;
		// Print the word segmentation result, one lexeme at a time, separated by "|"
		while (null != (lex = ikseg.next())) {
			System.out.print(lex.getLexemeText() + "|");
		}
		System.out.println();
	}

	public static void main(String[] args) throws IOException {
		test1();
	}
}

 

IKAnalyzer.cfg.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer Extended Configuration</comment>
	<!--Users can configure their own extension dictionary here-->
	<entry key="ext_dict">ext.dic;</entry>
	<!--Users can configure their own extended stop word dictionary here-->
	<entry key="ext_stopwords">stopword.dic;</entry>
</properties>
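
The configuration file above uses the standard Java XML properties format, so its entries can be inspected with java.util.Properties.loadFromXML. The following is a minimal sketch for checking what the file actually contains. The class name IkConfigSketch and the helper dictPaths are made up for illustration, and treating ";" as a separator between several dictionary paths is an assumption based on the trailing semicolon in the entries. This is not IKAnalyzer's own loading code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class IkConfigSketch {

    // Inline copy of the configuration shown above, so the sketch runs on its own.
    static final String SAMPLE =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
          + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
          + "<properties>\n"
          + "  <comment>IK Analyzer Extended Configuration</comment>\n"
          + "  <entry key=\"ext_dict\">ext.dic;</entry>\n"
          + "  <entry key=\"ext_stopwords\">stopword.dic;</entry>\n"
          + "</properties>\n";

    // Hypothetical helper: read one entry and split it on ';' (assumed separator).
    static List<String> dictPaths(String xml, String key) {
        Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> paths = new ArrayList<>();
        String value = props.getProperty(key);
        if (value == null) {
            return paths; // key not configured
        }
        for (String part : value.split(";")) {
            String path = part.trim();
            if (!path.isEmpty()) {
                paths.add(path);
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        System.out.println("ext_dict      = " + dictPaths(SAMPLE, "ext_dict"));
        System.out.println("ext_stopwords = " + dictPaths(SAMPLE, "ext_stopwords"));
    }
}
```

Printing the parsed paths is a quick way to confirm that a typo or a stray character in IKAnalyzer.cfg.xml is not the reason a dictionary is being ignored.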

 

ext.dic (UTF-8 encoding, without BOM)

content provider
high quality content
Google search
Content Management System
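
Since ext.dic has to be saved as UTF-8 without a BOM, it can be worth checking the file's first bytes. A UTF-8 BOM is the byte sequence EF BB BF, and Java's UTF-8 decoder does not strip it, so a BOM would likely end up glued to the first dictionary word. The sketch below is only an illustration. The class name BomCheck and its two helpers are hypothetical and not part of IKAnalyzer.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class BomCheck {

    // True if the data starts with the UTF-8 byte order mark EF BB BF.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    // Returns the data without a leading BOM; data without a BOM is returned unchanged.
    static byte[] stripUtf8Bom(byte[] data) {
        return hasUtf8Bom(data) ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    public static void main(String[] args) throws IOException {
        // With a path argument, check that file; otherwise use a tiny in-memory demo.
        byte[] data = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        System.out.println("BOM present: " + hasUtf8Bom(data));
        System.out.println("Bytes after stripping: " + stripUtf8Bom(data).length);
    }
}
```

Running it with the path to ext.dic as the only argument reports whether the editor saved the file with a BOM.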

stopword.dic

Not used yet.

 

One problem I ran into: after modifying ext.dic, every run still seemed to produce the original result. To make the change take effect, I had to change ext.dic in the IKAnalyzer.cfg.xml configuration file to /ext.dic, run the program once, and then change it back to ext.dic. A very strange problem...
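
One way to investigate a problem like this is to check which copy of ext.dic the JVM actually resolves. The sketch below assumes the dictionary is looked up as a classpath resource. That is an assumption, not something confirmed above. If it holds, a stale copy in the build output directory would be one possible explanation for edits appearing to be ignored. The class name ResourceProbe and the method describe are hypothetical.

```java
import java.io.File;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.Date;

public class ResourceProbe {

    // Reports where a resource name resolves on the classpath, if anywhere.
    static String describe(String name) {
        URL url = ResourceProbe.class.getClassLoader().getResource(name);
        if (url == null) {
            return name + " -> not found on the classpath";
        }
        StringBuilder sb = new StringBuilder(name + " -> " + url);
        if ("file".equals(url.getProtocol())) {
            try {
                // For plain files, the timestamp shows whether this copy is the edited one.
                File file = new File(url.toURI());
                sb.append(" (last modified ").append(new Date(file.lastModified())).append(")");
            } catch (URISyntaxException | IllegalArgumentException e) {
                // Leave the URL without a timestamp if it cannot be mapped to a file.
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two spellings tried in the configuration, plus the configuration file itself.
        System.out.println(describe("ext.dic"));
        System.out.println(describe("/ext.dic"));
        System.out.println(describe("IKAnalyzer.cfg.xml"));
    }
}
```

If the reported path points into a build output folder and its timestamp is older than the last edit, cleaning and rebuilding the project would be the first thing to try.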

 

 

Results without ext.dic

ikanalyzer|Participle|Effect|In the end|How|What|We|Look at|Let's|If you|Can't|Or|Don't|Want to do it for|self|Website|All|Link|Webpage|Content|Provide|Guarantee| For example | not | trustworthy | users | comments | or | message boards | entries | then | should | right | these | links | use | nofollow | and |helps|prevents|you|sites|inadvertently|inadvertently transmits|pagerank|to|on the|evil|neighbor|neighbor|especially if|spam|spam|comments|publishers|discovers|untrusted|trusted |Link|LinkedIn|Services|In|Not|Being|Tracked|May|Decided|Indeterminate|Location|Corresponding|Content Management|System|Or |Blog|Services|If You|Want|Recognize|And |Rewards|Reliable |content|provider|then|for |consistently|providing|providing|high quality|content|members|or |users|published|posting|links|you|may|decided|automatically|delete|or |manually|delete| where |nofollow|attributes|paid|links|sites|stands|google|search|results|in|ranking|depends|depends|rights|links|links to|this|sites|other|sites|analytics|for|prevents |paid|links|affects|search|results|and |to|users|causing|adverse|affecting|us|suggestions|sites|admins|rights|such|links|uses|nofollow|search engines|guidelines|requirements| Follow|Online|And|Offline|Customer|Wish|Wish|Paid|Relationship|Announce|Way|In |Machine|Can|Read|Way|Announce|Paid|Link|Example|Full Page|Newspaper Advertisement|May|Adopt |Advertising|Header|

 

Results with ext.dic

ikanalyzer|Participle|Effect|In the end|How|What|We|Look at|Let's|If you|Can't|Or|Don't|Want to do it for|self|Website|All|Link|Webpage|Content|Provide|Guarantee| For example | not | trustworthy | users | comments | or | message boards | entries | then | should | right | these | links | use | nofollow | and |helps|prevents|you|sites|inadvertently|inadvertently transmits|pagerank|to|on the|evil|neighbor|neighbor|especially if|spam|spam|comments|publishers|discovers|untrusted|trusted |Link|LinkedIn|Services|In|Not|Being|Tracked|Possible|Decided|Indeterminate|Located|Corresponding| Content Providers | Then | For | Consistently | Provide | High Quality Content | Members | or | Users | By | Posts | Links | You | May | Decide | Automatic | Delete | |attributes|paid|links|sites|stands|google search|results|ranks|parts|depends|on |links|links|links|this|sites|other|sites|analytics|for|prevent|paid|links |Affects|Search|Results|And|Affects|Affects|Affects|We|Recommends|Sites|Administrators|Suchs|Links|Uses|nofollow|Search Engines|Guidelines|Requirements|Follow|Online| and|offline|client|desire|want|paid|relationship|published|method|with|machine|may|read|mandable|published|paid|link|example|full page|newspaper ad|may|adopted|advertising|headline |
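
The difference between the two runs, for example "google|search" becoming the single token "google search", comes from the extra dictionary entries. The toy sketch below shows that effect with plain forward maximum matching. This is not IKAnalyzer's actual algorithm; it is a swapped-in textbook technique, and the class MaxMatchSketch, the word lists, and the space-free ASCII input standing in for Chinese text are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MaxMatchSketch {

    // Forward maximum matching: at each position take the longest dictionary word,
    // falling back to a single character when nothing matches.
    static List<String> segment(String text, Set<String> dict) {
        int maxLen = 1;
        for (String word : dict) {
            maxLen = Math.max(maxLen, word.length());
        }
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxLen);
            // Shrink the candidate until it is a dictionary word or one character long.
            while (end > pos + 1 && !dict.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical base dictionary and one hypothetical extension entry.
        Set<String> base = new HashSet<>(Arrays.asList("google", "search", "results"));
        Set<String> extended = new HashSet<>(base);
        extended.add("googlesearch");

        String text = "googlesearchresults";
        System.out.println("without extension: " + String.join("|", segment(text, base)));
        System.out.println("with extension:    " + String.join("|", segment(text, extended)));
    }
}
```

With the base dictionary the sample splits into three words. Once the longer entry is added, the first two are merged into one token, which mirrors what ext.dic does in the results above.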
