Study Notes CB011: Lucene search engine library, IKAnalyzer Chinese word segmentation tool, retrieval service, query index, diversion, word2vec

Characteristics of the movie and TV subtitle chat corpus: it contains more than 30 million Chinese subtitle lines, one per line, separated by carriage return/line feed. The sentence immediately following another is usually the best answer to it. A question can have many kinds of answers; ranking all the candidate answers found in the historical chat records by relevance and picking the best one is a search-and-ranking problem.
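The adjacent-sentence pairing can be sketched in a few lines of Python (a toy illustration with made-up lines, not the course code):

```python
# Pair each subtitle line with the line that follows it:
# the earlier line becomes a "question", the next line its "answer".
corpus = """how are you
fine, thanks
where are we going
to the station"""

pairs = []
last = ""
for line in corpus.splitlines():
    line = line.strip()
    if not line:
        continue                        # skip blank lines, as the indexer does
    if last:
        pairs.append((last, line))      # (question, answer) candidate
    last = line

print(pairs[0])  # ('how are you', 'fine, thanks')
```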

lucene+ik. Lucene is an open-source, free search engine library developed in Java; ik is IKAnalyzer, an open-source Chinese word segmentation tool. The plan: segment the corpus and build an index over it, run a text-relevance search against the index, take the sentence following each hit as the answer candidate set, rank the answers, and return the best one for the analyzed question.
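A minimal sketch of the retrieve-and-rank idea in plain Python (term-overlap scoring stands in for Lucene's much richer relevance model; the data is made up):

```python
from collections import defaultdict

# (question, answer) pairs, as produced from adjacent subtitle lines
docs = [
    ("how are you today", "fine thanks"),
    ("where are you going", "to the station"),
    ("what is your name", "call me robot"),
]

# Build an inverted index over the question side: term -> set of doc ids.
index = defaultdict(set)
for doc_id, (question, _) in enumerate(docs):
    for term in question.split():
        index[term].add(doc_id)

def search(query):
    # Score each candidate by the number of shared terms, then rank.
    scores = defaultdict(int)
    for term in query.split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [docs[doc_id][1] for doc_id, _ in ranked]

print(search("where are you"))  # ['to the station', 'fine thanks']
```

Lucene replaces the term-overlap score with TF-IDF-style relevance scoring, but the index-then-rank shape is the same.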

Build an index. Create a maven project in Eclipse; maven automatically generates a pom.xml file that configures the package dependency information. Add the dependencies inside the dependencies tag:

<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-core</artifactId>
<version>4.10.4</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queryparser</artifactId>
<version>4.10.4</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-analyzers-common</artifactId>
<version>4.10.4</version>
</dependency>
<dependency>
<groupId>io.netty</groupId>
<artifactId>netty-all</artifactId>
<version>5.0.0.Alpha2</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.1.41</version>
</dependency>

Add configuration under the project tag so that the dependency jar packages are automatically copied to the lib directory:

<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-dependency-plugin</artifactId>
<executions>
<execution>
<id>copy-dependencies</id>
<phase>prepare-package</phase>
<goals>
<goal>copy-dependencies</goal>
</goals>
<configuration>
<outputDirectory>${project.build.directory}/lib</outputDirectory>
<overWriteReleases>false</overWriteReleases>
<overWriteSnapshots>false</overWriteSnapshots>
<overWriteIfNewer>true</overWriteIfNewer>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<configuration>
<archive>
<manifest>
<addClasspath>true</addClasspath>
<classpathPrefix>lib/</classpathPrefix>
<mainClass>theMainClass</mainClass>
</manifest>
</archive>
</configuration>
</plugin>
</plugins>
</build>

Download the ik source code from https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/ik-analyzer/IK%20Analyzer%202012FF_hf1_source.rar, copy its src/org directory into the chatbotv1 project under src/main/java, and refresh the maven project.

Under the com.shareditor.chatbotv1 package, maven automatically generates App.java; rename it to Indexer.java:

Analyzer analyzer = new IKAnalyzer(true);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_9, analyzer);
iwc.setOpenMode(OpenMode.CREATE);
iwc.setUseCompoundFile(true);
IndexWriter indexWriter = new IndexWriter(FSDirectory.open(new File(indexPath)), iwc);

BufferedReader br = new BufferedReader(new InputStreamReader(
        new FileInputStream(corpusPath), "UTF-8"));
String line = "";
String last = "";
long lineNum = 0;
while ((line = br.readLine()) != null) {
    line = line.trim();

    if (0 == line.length()) {
        continue;
    }

    if (!last.equals("")) {
        Document doc = new Document();
        doc.add(new TextField("question", last, Store.YES));
        doc.add(new StoredField("answer", line));
        indexWriter.addDocument(doc);
    }
    last = line;
    lineNum++;
    if (lineNum % 100000 == 0) {
        System.out.println("add doc " + lineNum);
    }
}
br.close();

indexWriter.forceMerge(1);
indexWriter.close();


After compiling, copy all files under src/main/resources to the target directory, then run from the target directory:

java -cp $CLASSPATH:./lib/:./chatbotv1-0.0.1-SNAPSHOT.jar com.shareditor.chatbotv1.Indexer ../../subtitle/raw_subtitles/subtitle.corpus ./index


The generated index directory, index, can be inspected with lukeall-4.9.0.jar.

Retrieval service. Netty is used to create an HTTP server; the code is in the chatbotv1 directory of https://github.com/warmheartli/ChatBotCourse:

Analyzer analyzer = new IKAnalyzer(true);
QueryParser qp = new QueryParser(Version.LUCENE_4_9, "question", analyzer);

// First search with the AND operator: every query term must match
qp.setDefaultOperator(Operator.AND);
query = qp.parse(q);
System.out.println(query.toString());
indexSearcher.search(query, collector);
topDocs = collector.topDocs();

// If nothing matched, relax to OR: any query term may match
if (topDocs.totalHits == 0) {
    qp.setDefaultOperator(Operator.OR);
    query = qp.parse(q);
    System.out.println(query.toString());
    indexSearcher.search(query, collector);
    topDocs = collector.topDocs();
}

ret.put("total", topDocs.totalHits);
ret.put("q", q);
JSONArray result = new JSONArray();
for (ScoreDoc d : topDocs.scoreDocs) {
    Document doc = indexSearcher.doc(d.doc);
    String question = doc.get("question");
    String answer = doc.get("answer");
    JSONObject item = new JSONObject();
    item.put("question", question);
    item.put("answer", answer);
    item.put("score", d.score);
    item.put("doc", d.doc);
    result.add(item);
}
ret.put("result", result);


Query the index. The query words are assembled into a lucene query against the question field of the index; the answer field values of the matching documents come back as the candidate set, from which one is picked as the answer. The server is accessed over http, e.g. http://127.0.0.1:8765/?q=hello. Chinese must be sent URL-encoded, and the java side reads and decodes it accordingly. Start the server with:


java -cp $CLASSPATH:./lib/:./chatbotv1-0.0.1-SNAPSHOT.jar com.shareditor.chatbotv1.Searcher
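As a quick check of the encoding step, Python's standard library produces the percent-encoding the java side has to decode (the host and port simply mirror the example above):

```python
from urllib.parse import quote

# A Chinese query must be percent-encoded before it goes on the URL.
q = "你好"
url = "http://127.0.0.1:8765/?q=" + quote(q)
print(url)  # http://127.0.0.1:8765/?q=%E4%BD%A0%E5%A5%BD
```

The java side would typically reverse this with URLDecoder.decode using UTF-8.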


Chat interface. For the box that displays the chat content, ckeditor is chosen because it supports displaying HTML-formatted content; add an input box and a send button. HTML code:

<div class="col-sm-4 col-xs-10">
<div class="row">
<textarea id="chatarea">
<div style='color: blue; text-align: left; padding: 5px;'>Robot: Hey, hello brother, you are finally willing to chat with me, come and talk, I will not refuse!</div>
<div style='color: blue; text-align: left; padding: 5px;'>Robot: What? You ask me how smart I can chat? Because I just ate a bunch of movie and TV subtitles!</div>
</textarea>
</div>
<br />

<div class="row">
<div class="input-group">
<input type="text" id="input" class="form-control" autofocus="autofocus" onkeydown="submitByEnter()" />
<span class="input-group-btn">
<button class="btn btn-default" type="button" onclick="submit()">Send</button>
</span>
</div>
</div>
</div>

<script type="text/javascript">

CKEDITOR.replace('chatarea',
{
readOnly: true,
toolbar: ['Source'],
height: 500,
removePlugins: 'elementspath',
resize_enabled: false,
allowedContent: true
});

</script>

Calling the chat server: the controller that sends the request and returns the result:

public function queryAction(Request $request)
{
    $q = $request->get('input');
    $opts = array(
        'http' => array(
            'method' => "GET",
            'timeout' => 60,
        )
    );
    $context = stream_context_create($opts);
    $clientIp = $request->getClientIp();
    $response = file_get_contents('http://127.0.0.1:8765/?q=' . urlencode($q) . '&clientIp=' . $clientIp, false, $context);
    $res = json_decode($response, true);
    $total = $res['total'];
    $result = '';
    if ($total > 0) {
        $result = $res['result'][0]['answer'];
    }
    return new Response($result);
}


Controller routing configuration:

chatbot_query:
    path: /chatbot/query
    defaults: { _controller: AppBundle:ChatBot:query }


The chat server's response time can be relatively long; so that this does not freeze the web interface, submit sends the request and receives the result asynchronously:

var xmlHttp;
function submit() {
    if (window.ActiveXObject) {
        xmlHttp = new ActiveXObject("Microsoft.XMLHTTP");
    } else if (window.XMLHttpRequest) {
        xmlHttp = new XMLHttpRequest();
    }
    var input = $("#input").val().trim();
    if (input == '') {
        jQuery('#input').val('');
        return;
    }
    addText(input, false);
    jQuery('#input').val('');
    // encodeURIComponent also escapes '&' and '+', which encodeURI would leave intact
    var datastr = "input=" + encodeURIComponent(input);
    var url = "/chatbot/query";
    xmlHttp.open("POST", url, true);
    xmlHttp.onreadystatechange = callback;
    xmlHttp.setRequestHeader("Content-type", "application/x-www-form-urlencoded");
    xmlHttp.send(datastr);
}

function callback() {
    if (xmlHttp.readyState == 4 && xmlHttp.status == 200) {
        var responseText = xmlHttp.responseText;
        addText(responseText, true);
    }
}


addText adds a piece of text to ckeditor:

function addText(text, is_response) {
    var oldText = CKEDITOR.instances.chatarea.getData();
    var prefix = '';
    if (is_response) {
        prefix = "<div style='color: blue; text-align: left; padding: 5px;'>Robot: ";
    } else {
        prefix = "<div style='color: darkgreen; text-align: right; padding: 5px;'>Me: ";
    }
    CKEDITOR.instances.chatarea.setData(oldText + "" + prefix + text + "</div>");
}


Code:
https://github.com/warmheartli/ChatBotCourse
https://github.com/warmheartli/shareditor.com

Effect demo: http://www.shareditor.com/chatbot/

Diversion. Website traffic statistics: the cnzz statistics for the last half month show that user visits concentrate on a few pages. Add an animated button: to attract users to click, place a small animated icon in the lower-right corner of every page; it stays fixed while the page scrolls, and clicking it jumps straight to the page we want to divert traffic to. Search for "customer service floating code" for reference implementations.
Create a js file, lrtk.js:

$(function() {
    var tophtml = "<a href=\"http://www.shareditor.com/chatbot/\" target=\"_blank\"><div id=\"izl_rmenu\" class=\"izl-rmenu\"><div class=\"btn btn-phone\"></div><div class=\"btn btn-top\"></div></div></a>";
    $("#top").html(tophtml);
    $("#izl_rmenu").each(function() {
        $(this).find(".btn-phone").mouseenter(function() {
            $(this).find(".phone").fadeIn("fast");
        });
        $(this).find(".btn-phone").mouseleave(function() {
            $(this).find(".phone").fadeOut("fast");
        });
        $(this).find(".btn-top").click(function() {
            // jQuery animate expects the camelCase property name
            $("html, body").animate({ scrollTop: 0 }, "fast");
        });
    });
    var lastRmenuStatus = false;

    $(window).scroll(function() {
        var _top = $(window).scrollTop();
        if (_top >= 0) {
            $("#izl_rmenu").data("expanded", true);
        } else {
            $("#izl_rmenu").data("expanded", false);
        }
        if ($("#izl_rmenu").data("expanded") != lastRmenuStatus) {
            lastRmenuStatus = $("#izl_rmenu").data("expanded");
            if (lastRmenuStatus) {
                $("#izl_rmenu .btn-top").slideDown();
            } else {
                $("#izl_rmenu .btn-top").slideUp();
            }
        }
    });
});


The first part fills the div with id=top: it inserts a div whose id is izl_rmenu, with its css defined in a separate file, lrtk.css:

.izl-rmenu{position:fixed;left:85%;bottom:10px;padding-bottom:73px;z-index:999;}
.izl-rmenu .btn{width:72px;height:73px;margin-bottom:1px;cursor:pointer;position:relative;}
.izl-rmenu .btn-top{background:url(http://www.shareditor.com/uploads/media/default/0001/01/thumb_416_default_big.png) 0px 0px no-repeat;background-size: 70px 70px;display:none;}


The second part shows or hides the button as the page scrolls.

Add this to the common code section of all pages:

<div id="top"></div>


To make use of the huge corpus for LSTM-RNN training, the Chinese corpus must first be converted into a vector form the algorithm can work with; word2vec is the most popular word embedding tool.

word2vec takes a segmented text file as input. In the movie and TV subtitle corpus, complete sentences are separated by carriage return/line feed, so we first segment them; word_segment.py:

# coding:utf-8

import sys

import jieba

def segment(input, output):
    input_file = open(input, "r")
    output_file = open(output, "w")
    while True:
        line = input_file.readline()
        if line:
            line = line.strip()
            seg_list = jieba.cut(line)
            segments = ""
            for str in seg_list:
                segments = segments + " " + str
            segments = segments + "\n"
            output_file.write(segments)
        else:
            break
    input_file.close()
    output_file.close()

if __name__ == '__main__':
    if 3 != len(sys.argv):
        print("Usage: ", sys.argv[0], "input output")
        sys.exit(-1)
    segment(sys.argv[1], sys.argv[2])


Usage:

python word_segment.py subtitle/raw_subtitles/subtitle.corpus segment_result


word2vec generates the word vectors. The word2vec source can be obtained from https://github.com/warmheartli/ChatBotCourse/tree/master/word2vec; running make compiles the binaries. Execute:

./word2vec -train ../segment_result -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15

This generates vectors.bin, the word vectors in binary format; the distance tool bundled with word2vec can verify them:

./distance vectors.bin
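The distance tool ranks neighbors by cosine similarity between word vectors; the measure itself is simple (a pure-Python sketch with toy vectors, not part of the word2vec code):

```python
import math

def cosine(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Words used in similar contexts get vectors pointing in similar directions.
print(cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # ≈ 1.0 (same direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))            # 0.0 (orthogonal)
```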


Loading the word-vector binary file. word2vec's binary format begins with a header line: the number of words, a space, and the vector dimension. Each word then follows as its UTF-8 text, a space, and the raw 4-byte floats of its vector.
A python script that loads the word-vector binary file:

# coding:utf-8

import sys
import struct
import numpy as np

max_w = 50
float_size = 4

def load_vectors(input):
    print("begin load vectors")

    input_file = open(input, "rb")

    # Get the vocabulary size and the vector dimension from the header line
    words_and_size = input_file.readline().decode('utf-8').strip()
    words = int(words_and_size.split(' ')[0])
    size = int(words_and_size.split(' ')[1])
    print("words =", words)
    print("size =", size)

    word_vector = {}

    for b in range(0, words):
        # Read one word byte by byte, up to the space that separates it from its vector
        a = 0
        word = b''
        while True:
            c = input_file.read(1)
            if c == b'' or c == b' ':  # EOF or end of word
                break
            if a < max_w and c != b'\n':
                word = word + c
                a = a + 1
        word = word.strip()

        # Read the word vector: size 4-byte floats
        vector = np.empty([size])
        for index in range(0, size):
            m = input_file.read(float_size)
            (weight,) = struct.unpack('f', m)
            vector[index] = weight

        # Store the word and its vector in the dict
        word_vector[word.decode('utf-8')] = vector

    input_file.close()

    print("load vectors finish")
    return word_vector

if __name__ == '__main__':
    if 2 != len(sys.argv):
        print("Usage: ", sys.argv[0], "vectors.bin")
        sys.exit(-1)
    d = load_vectors(sys.argv[1])
    print(d[u'true'])


It works like this:

python word_vectors_loader.py vectors.bin
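To see the layout concretely, here is a sketch that writes a tiny two-word vector file into a memory buffer and reads it back the same way the loader does (the words and values are made up, and the layout is slightly simplified: real word2vec also writes a newline after each vector):

```python
import io
import struct

# Header: "<word count> <vector size>\n"; then per word: the UTF-8 word,
# a space, and <size> 4-byte floats.
words = {"hello": [0.1, 0.2, 0.3], "world": [0.4, 0.5, 0.6]}
size = 3

buf = io.BytesIO()  # stands in for the vectors.bin file on disk
buf.write(("%d %d\n" % (len(words), size)).encode("utf-8"))
for word, vec in words.items():
    buf.write(word.encode("utf-8") + b" ")
    buf.write(struct.pack("%df" % size, *vec))

# Read it back: header first, then word bytes up to the space, then the floats.
buf.seek(0)
count, dim = map(int, buf.readline().split())
loaded = {}
for _ in range(count):
    word = b""
    while True:
        c = buf.read(1)
        if c == b" " or c == b"":
            break
        word += c
    vec = struct.unpack("%df" % dim, buf.read(4 * dim))
    loaded[word.decode("utf-8")] = list(vec)

print(sorted(loaded))  # ['hello', 'world']
```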



Welcome to recommend machine learning job opportunities in Shanghai, my WeChat: qingxingfengzi
