Study Notes CB010: Recurrent Neural Networks, LSTM, and Automatic Subtitle Crawling

A recurrent neural network is a neural network that can store memory. LSTM is one such network, and it works well in the field of NLP.

Recurrent Neural Network (RNN) covers two families: the time-recurrent neural network and the structurally recursive neural network. In the time-recurrent network the connections between neurons form a directed graph over time, while the structurally recursive network applies similar network structures recursively to build a more complex deep network. Both are trained with variants of the same algorithm.

Time-recurrent neural networks. A traditional feed-forward neural network (FNN) passes information only forward. An RNN introduces directed cycles: the connections among neurons form loops, which lets the network express the relationship between earlier and later elements of a sequence. The hidden-layer nodes are fully connected, and the output of a hidden node can serve as the input of another hidden node or of itself. U, V, and W are the transformation (weight) matrices, x is the input, and o is the output. The key to the RNN is the hidden layer, which captures sequence information; this is its memory ability. The parameters U, V, and W are shared across time steps: every step performs the same computation on a different input, which reduces both the number of parameters and the amount of computation. RNNs are widely used in NLP. A language model predicts the probability of the next word given the words that have already appeared, which is a time-series model. The next word depends on the preceding words, and this corresponds to the connections between the RNN's hidden states across time.

Training an RNN. Parameters are updated with the back-propagation (BP) algorithm. The number of steps from input to output is not fixed, so the forward pass is computed step by step along the sequence. Let x be the input, s the value obtained from x after the U transformation (plus the recurrent term), h the hidden-layer activation, o the output-layer value, f the hidden-layer activation function, and g the output-layer activation function. At t = 0 the input is x0 and the hidden state is h0. At t = 1: s1 = U·x1 + W·h0, h1 = f(s1), o1 = g(V·h1). At t = 2: s2 = U·x2 + W·h1, h2 = f(s2), o2 = g(V·h2). In general: s_t = U·x_t + W·h_{t-1}, h_t = f(s_t), o_t = g(V·h_t). In other words, h = f(current input + summary of past memory), which is exactly the RNN's memory ability.
To summarize the notation: U, V, W are the transformation (weight) matrices, x the input, s the value U·x_t + W·h_{t-1}, f the hidden-layer activation function, h the hidden activation, g the output-layer activation function, and o the output. Each time step computes transform(input, previous hidden), then hidden(transform), then output(hidden), i.e. o_t = g(V·f(U·x_t + W·h_{t-1})). For the backward pass, the error between the predicted output o and the actual value at each step is propagated backwards; chain-rule differentiation gives the gradient of each layer, and the parameters are updated.
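
To make the recursion concrete, here is a minimal NumPy sketch of the forward pass, assuming f = tanh and g = softmax; the layer sizes and random weights are illustrative placeholders rather than values from these notes.

import numpy as np

# Minimal sketch of the RNN forward recursion described above.
# Assumptions: f = tanh, g = softmax; dimensions are illustrative only.
input_dim, hidden_dim, output_dim = 8, 16, 8
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input -> hidden
W = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden -> hidden (shared memory)
V = rng.normal(scale=0.1, size=(output_dim, hidden_dim))  # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs):
    # xs: list of input vectors x_1..x_T; returns hidden states and outputs
    h = np.zeros(hidden_dim)          # h_0
    hs, os = [], []
    for x in xs:
        s = U @ x + W @ h             # s_t = U x_t + W h_{t-1}
        h = np.tanh(s)                # h_t = f(s_t)
        o = softmax(V @ h)            # o_t = g(V h_t)
        hs.append(h)
        os.append(o)
    return hs, os

hs, os = rnn_forward([rng.normal(size=input_dim) for _ in range(5)])
print(os[-1].shape)  # (8,)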

LSTM (Long Short-Term Memory networks). RNNs suffer from the long-term dependencies problem: the probability of the next word may depend on a word that appeared very long ago, but for computational reasons the dependency length has to be limited in practice. See http://colah.github.io/posts/2015-08-Understanding-LSTMs. The schematic of a traditional RNN contains only a single hidden layer with tanh as the activation function; its "memory" lives in the sliding window over t, so there are as many memories as there are time steps.

LSTM design. The building blocks are neural network layers (weights plus an activation function; σ denotes the sigmoid activation and tanh the tanh activation) and element-wise matrix operations (multiplication or addition). Historical information is transmitted and remembered through gates that scale values by a coefficient between 0 and 1. The first sigmoid layer computes such a coefficient and applies it at the × gate; this expresses how much of the memory passed on from the previous step is kept and how much is forgotten. How much is forgotten depends on the previous hidden output h_{t-1} and the current input x_t. The same h_{t-1} and x_t also produce the new information to be stored in memory: a tanh neuron computes the candidate value for C_t (tanh's range [-1, 1] serves as the value), and a sigmoid neuron computes the scale coefficient (sigmoid's range [0, 1] serves as the proportion). Finally the hidden output h is computed from all of the current information: the previous hidden output, the current input x, and the overall memory. The unit state C is passed through tanh and then filtered by a sigmoid coefficient computed from the previous output and the current input. A word is the input x at different time steps, and the probability of word A appearing at time t can be computed by the LSTM. That probability depends on the preceding words, and how many of them matter is not fixed; the LSTM keeps this in the memory C, which yields a probability closer to the truth.
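
As a concrete illustration, below is a minimal sketch of a single LSTM step using the standard gate equations; the function name lstm_step, the weight dictionaries W and b, and all shapes are assumptions made for demonstration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    # W: dict of weight matrices applied to [h_{t-1}, x_t]; b: dict of biases
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W['f'] @ z + b['f'])          # forget gate: how much old memory to keep
    i_t = sigmoid(W['i'] @ z + b['i'])          # input gate: how much new info to store
    C_tilde = np.tanh(W['C'] @ z + b['C'])      # candidate memory from h_{t-1} and x_t
    C_t = f_t * C_prev + i_t * C_tilde          # updated memory cell C_t
    o_t = sigmoid(W['o'] @ z + b['o'])          # output gate: filter the memory
    h_t = o_t * np.tanh(C_t)                    # hidden output h_t
    return h_t, C_t

hidden, inputs = 16, 8
rng = np.random.default_rng(1)
W = {k: rng.normal(scale=0.1, size=(hidden, hidden + inputs)) for k in 'fiCo'}
b = {k: np.zeros(hidden) for k in 'fiCo'}
h, C = np.zeros(hidden), np.zeros(hidden)
h, C = lstm_step(rng.normal(size=inputs), h, C, W, b)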

Chatbots are, in essence, generalized question answering systems.

Corpus acquisition. A generalized question answering system usually collects corpus material from the Internet, for example via Baidu or Google, and builds question and answer pairs to form a corpus. The corpus is split into training, development, and test sets. The question answering system then trains a model to find the correct answer among a pile of candidates. Training does not place all answers into one vector space; instead the samples are grouped: for each question a candidate set of 500 answers is collected, containing the positive samples plus some randomly selected negative samples, so that the positive samples stand out.
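
A rough sketch of how such a candidate pool could be assembled is given below; build_candidate_pool, qa_pairs, and all_answers are hypothetical names, and the pool size of 500 follows the description above.

import random

def build_candidate_pool(question, qa_pairs, all_answers, pool_size=500):
    # qa_pairs: question -> list of correct answers; all_answers: every answer in the corpus
    positives = list(qa_pairs[question])                        # known correct answers
    negatives = [a for a in all_answers if a not in positives]  # everything else
    sampled = random.sample(negatives, pool_size - len(positives))
    pool = [(answer, 1) for answer in positives] + [(answer, 0) for answer in sampled]
    random.shuffle(pool)                                        # mix positives among the negatives
    return pool                                                 # (answer, label) pairs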

System design based on CNN. CNNs provide sparse interactions, parameter sharing, and equivariant representations, which makes them suitable for training the answer-selection model of an automatic question answering system.

General training method. During training, obtain the question word vector Vq (word vectors can be trained with Google's word2vec), a positive-answer word vector Va+, and a negative-answer word vector Va- taken from the candidate pool. Compare the similarity between the question and each of the two answers. If the positive answer's similarity does not exceed the negative answer's by at least a margin m, the model parameters are updated (gradient descent, chain-rule derivatives) and another negative answer is selected from the candidate pool; if the difference already exceeds m, the model is not updated. On test data, compute the cosine distance between the question and each candidate answer; the candidate with the maximum similarity is predicted as the correct answer.
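
A minimal sketch of this objective is shown below, assuming the hinge loss over cosine similarities used in the answer-selection paper cited in the next paragraph; the vectors vq, va_pos, and va_neg would come from word2vec plus the encoder, and the function names are illustrative.

import numpy as np

def cos_sim(a, b):
    # Cosine similarity between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def hinge_loss(vq, va_pos, va_neg, m=0.1):
    # Positive loss (parameters get updated) only when the positive answer
    # does not beat the negative answer by at least the margin m
    return max(0.0, m - cos_sim(vq, va_pos) + cos_sim(vq, va_neg))

def predict(vq, candidate_vectors):
    # At test time the candidate with maximum cosine similarity is the prediction
    return int(np.argmax([cos_sim(vq, va) for va in candidate_vectors]))

rng = np.random.default_rng(0)
vq, va_pos, va_neg = rng.normal(size=(3, 100))
print(hinge_loss(vq, va_pos, va_neg))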

Neural network structure design. HL denotes a hidden layer with activation z = tanh(Wx + B); CNN is the convolutional layer; P is the pooling layer (pooling step size 1); T is a tanh layer. The P + T output is a vector representation, and the final output is the cosine similarity of the two vectors. HL or CNN blocks chained together in the diagram share the same weights. The CNN output dimension depends on how many convolutional feature maps are used. See the paper "Applying Deep Learning to Answer Selection: A Study and an Open Task".
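
One plausible reading of this shared-weight structure is sketched below in NumPy: question and answer are encoded with the same HL and convolution weights, max-pooled over time, passed through tanh, and compared by cosine similarity. Biases are omitted and all dimensions are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(2)
emb_dim, hid_dim, n_filters, width = 50, 64, 32, 3
W_h = rng.normal(scale=0.1, size=(hid_dim, emb_dim))             # HL weights, shared by Q and A
W_c = rng.normal(scale=0.1, size=(n_filters, width * hid_dim))   # convolution filters, shared

def encode(token_vectors):
    # token_vectors: (seq_len, emb_dim) word vectors of one sentence
    H = np.tanh(token_vectors @ W_h.T)            # HL: z = tanh(Wx + B), bias omitted
    windows = np.stack([H[i:i + width].reshape(-1) for i in range(len(H) - width + 1)])
    C = windows @ W_c.T                           # CNN: convolution over sliding word windows
    pooled = C.max(axis=0)                        # P: max pooling over time
    return np.tanh(pooled)                        # T: tanh layer -> final vector representation

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

q_vec = encode(rng.normal(size=(12, emb_dim)))    # encoded question
a_vec = encode(rng.normal(size=(20, emb_dim)))    # encoded candidate answer
print(cosine(q_vec, a_vec))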

Applying deep learning to chatbots involves: 1. choosing, combining, and optimizing the neural network structure; 2. natural language processing, i.e. turning words into vectors the machine can work with; 3. computing similarity or the matching relationship, typically with cosine distance; 4. using a CNN or LSTM to capture global information about the text sequence; 5. adding layers if accuracy is not high enough; 6. controlling excessive computation with parameter sharing and pooling.

Chatbot learning requires a massive chat corpus. Subtitle files of foreign-language movies and TV series are a natural chat corpus, and American TV series are particularly rich in dialogue. A subtitle library website: www.zimuku.net.

Automatically grab the subtitles. The crawler code is at https://github.com/warmheartli/ChatBotCourse. Create a directory named result under subtitle, and pass the parameter dont_filter=True when calling scrapy.Request:

# coding:utf-8

import scrapy
from subtitle_crawler.items import SubtitleCrawlerItem


class SubTitleSpider(scrapy.Spider):
    name = "subtitle"
    allowed_domains = ["zimuku.net"]
    start_urls = [
        "http://www.zimuku.net/search?q=&t=onlyst&ad=1&p=20",
        "http://www.zimuku.net/search?q=&t=onlyst&ad=1&p=21",
        "http://www.zimuku.net/search?q=&t=onlyst&ad=1&p=22",
    ]

    def parse(self, response):
        # List page: collect links to the individual subtitle detail pages
        hrefs = response.selector.xpath('//div[contains(@class, "persub")]/h1/a/@href').extract()
        for href in hrefs:
            url = response.urljoin(href)
            yield scrapy.Request(url, callback=self.parse_detail, dont_filter=True)

    def parse_detail(self, response):
        # Detail page: extract the download link of the subtitle archive
        url = response.selector.xpath('//li[contains(@class, "dlsub")]/div/a/@href').extract()[0]
        print("processing: ", url)
        yield scrapy.Request(url, callback=self.parse_file, dont_filter=True)

    def parse_file(self, response):
        # File response: hand the raw archive bytes to the pipeline
        item = SubtitleCrawlerItem()
        item['url'] = response.url
        item['body'] = response.body
        return item

# -*- coding: utf-8 -*-

class SubtitleCrawlerPipeline(object):
    def process_item(self, item, spider):
        # Save each downloaded archive under result/, using a sanitized URL as the file name
        url = item['url']
        file_name = url.replace('/', '_').replace(':', '_') + '.rar'
        with open('result/' + file_name, 'wb') as fp:
            fp.write(item['body'])
        return item

Check the results with: ls result/ | head -1 , ls result/ | wc -l , du -hs result/ .

Unzip the subtitle files. On Linux, zip files can be extracted directly with unzip file.zip. To extract rar files on Linux, download from http://www.rarlab.com/download.htm: wget http://www.rarlab.com/rar/rarlinux-x64-5.4.0.tar.gz , then tar zxvf rarlinux-x64-5.4.0.tar.gz ; the binary is ./rar/unrar and the extraction command is unrar x file.rar. To extract 7z files on Linux, download the source from http://downloads.sourceforge.net/project/p7zip, unpack it, and run make to build bin/7za; then use bin/7za x file.7z.

Programs and scripts are at https://github.com/warmheartli/ChatBotCourse. Step 1: crawl the subtitles of films and TV series. Step 2: classify by compression format. Many files cannot be listed with ls: file names contain special characters, some file names were accidentally overwritten, and some extensions are odd. The Python script mv_zip.py:

import glob
import os
import fnmatch
import shutil
import sys

def iterfindfiles(path, fnexp):
    # Recursively yield every file under path whose name matches the pattern fnexp
    for root, dirs, files in os.walk(path):
        for filename in fnmatch.filter(files, fnexp):
            yield os.path.join(root, filename)

i = 0
for filename in iterfindfiles(r"./input/", "*.ZIP"):
    i = i + 1
    # Move matched archives into zip/ with a numeric prefix to avoid name clashes
    newfilename = "zip/" + str(i) + "_" + os.path.basename(filename)
    print(filename + " <===> " + newfilename)
    shutil.move(filename, newfilename)
    #sys.exit(-1)  # uncomment to stop after the first file while testing

Change the pattern to *.rar, *.RAR, *.zip, or *.ZIP according to the compressed-file extension. Step 3: decompress. Download the appropriate decompression tools for your operating system (unrar and unzip are recommended); the following scripts perform batch decompression:

i=0; for file in `ls`; do mkdir output/${i}; echo "unzip $file -d output/${i}";unzip -P abc $file -d output/${i} > /dev/null; ((i++)); done
i=0; for file in `ls`; do mkdir output/${i}; echo "${i} unrar x $file output/${i}";unrar x $file output/${i} > /dev/null; ((i++)); done

Step 4: sort out the srt, ass, and ssa subtitle files. Subtitle file types include srt, lrc, ass, ssa, sup, idx, str, and vtt. Step 5: clean up the directory tree. The script clear_empty_dir.py automatically removes empty directories:

import os

def clear_empty_dirs(path):
    # Remove directories that contain neither files nor subdirectories;
    # run the script several times to collapse nested empty directories.
    for root, dirs, files in os.walk(path):
        if len(files) == 0 and len(dirs) == 0:
            print(root)
            os.rmdir(root)

clear_empty_dirs(r"./input/")

Step 6: Clean up non-subtitle files. Batch delete script del_file.py:

import fnmatch
import os

def iterfindfiles(path, fnexp):
    # Recursively yield every file under path whose name matches the pattern fnexp
    for root, dirs, files in os.walk(path):
        for filename in fnmatch.filter(files, fnexp):
            yield os.path.join(root, filename)

# Delete files that are clearly not subtitles
for suffix in ("*.mp4", "*.txt", "*.JPG", "*.htm", "*.doc", "*.docx", "*.nfo", "*.sub", "*.idx"):
    for filename in iterfindfiles(r"./input/", suffix):
        print(filename)
        os.remove(filename)

Step 7: multi-layer decompression (archives inside archives). Step 8: discard the few remaining files: files without extensions, files with unusual extensions, and a small number of compressed files, totalling no more than 50 MB. Step 9: encoding detection and transcoding. Encodings include utf-8, utf-16, gbk, unicode, and iso8859; unify everything to utf-8 with get_charset_and_conv.py:

import chardet
import os
import sys

if __name__ == '__main__':
    if len(sys.argv) == 2:
        for root, dirs, files in os.walk(sys.argv[1]):
            for file in files:
                file_path = root + "/" + file
                # Read raw bytes so chardet can detect the encoding
                with open(file_path, 'rb') as f:
                    data = f.read()
                encoding = chardet.detect(data)["encoding"]
                if encoding not in ("UTF-8-SIG", "UTF-16LE", "utf-8", "ascii"):
                    try:
                        # Assume gb18030 for everything else and rewrite the file as utf-8
                        gb_content = data.decode("gb18030")
                        with open(file_path, 'w', encoding='utf-8') as f:
                            f.write(gb_content)
                    except:
                        print("except:", file_path)

Step 10: filter for Chinese sentences. The script extract_sentence_srt.py:

# coding:utf-8
import chardet
import os
import re

# Unicode ranges: CJK ideographs for Chinese, hiragana and katakana for Japanese
pattern_cn = re.compile(u"([\u4e00-\u9fa5]+)")
pattern_jp1 = re.compile(u"([\u3040-\u309F]+)")
pattern_jp2 = re.compile(u"([\u30A0-\u30FF]+)")

for root, dirs, files in os.walk("./srt"):
    for file in files:
        # Read raw bytes and detect the encoding before decoding
        with open(root + "/" + file, "rb") as f:
            content = f.read()
        encoding = chardet.detect(content)["encoding"]
        try:
            for sentence in content.decode(encoding).split('\n'):
                sentence = sentence.strip()
                if len(sentence) == 0:
                    continue
                match_cn = pattern_cn.findall(sentence)
                match_jp1 = pattern_jp1.findall(sentence)
                match_jp2 = pattern_jp2.findall(sentence)
                # Keep lines that contain Chinese, contain no Japanese kana,
                # and are short enough to be a single dialogue sentence
                if len(match_cn) > 0 and len(match_jp1) == 0 and len(match_jp2) == 0 and len(sentence) > 1 and len(sentence.split(' ')) < 10:
                    print(sentence)
        except:
            continue

Step 11: sentence extraction from ssa/ass subtitles.

# coding:utf-8
import chardet
import os
import re

# Same character-range filters as in the srt script
pattern_cn = re.compile(u"([\u4e00-\u9fa5]+)")
pattern_jp1 = re.compile(u"([\u3040-\u309F]+)")
pattern_jp2 = re.compile(u"([\u30A0-\u30FF]+)")

for root, dirs, files in os.walk("./ssa"):
    for file in files:
        with open(root + "/" + file, "rb") as f:
            content = f.read()
        encoding = chardet.detect(content)["encoding"]
        try:
            for line in content.decode(encoding).split('\n'):
                # ssa/ass dialogue lines start with "Dialogue"; the text is the last comma field
                if line.find('Dialogue') == 0 and len(line) < 500:
                    fields = line.split(',')
                    sentence = fields[len(fields) - 1]
                    # Drop style override blocks such as {\pos(...)}
                    tag_fields = sentence.split('}')
                    if len(tag_fields) > 1:
                        sentence = tag_fields[len(tag_fields) - 1]
                    match_cn = pattern_cn.findall(sentence)
                    match_jp1 = pattern_jp1.findall(sentence)
                    match_jp2 = pattern_jp2.findall(sentence)
                    sentence = sentence.strip()
                    if len(match_cn) > 0 and len(match_jp1) == 0 and len(match_jp2) == 0 and len(sentence) > 1 and len(sentence.split(' ')) < 10:
                        # Remove the ass/ssa line-break marker \N
                        sentence = sentence.replace('\\N', '')
                        print(sentence)
        except:
            continue

Step 12: content filtering. Filter out special unicode characters and subtitle-credit keywords, and remove subtitle style tags, HTML tags, runs of repeated special characters, escape sequences, and episode information:

# coding:utf-8
import re
import sys

if __name__ == '__main__':
    # Character ranges that mark a line as unwanted (typesetting symbols, control codes)
    pattern_illegals = [re.compile(u"([\u2000-\u2010]+)"), re.compile(u"([\u0090-\u0099]+)")]
    # Subtitle-credit keywords; lines containing them are dropped
    filters = ["Subtitles", "Timeline:", "Proofreading:", "Translation:",
               "Post-production:", "Producer:",
               "It is forbidden to use it for any commercial profit", "http"]
    htmltagregex = re.compile(r'<[^>]+>', re.S)      # HTML tags
    brace_regex = re.compile(r'{.*}', re.S)          # subtitle style blocks {...}
    slash_regex = re.compile(r'\\\w', re.S)          # escape sequences such as \N, \h
    repeat_regex = re.compile(r'[-=]{10}', re.S)     # long runs of - or =
    f = open("./corpus/all.out", "rb")
    count = 0
    while True:
        raw = f.readline()
        if not raw:
            break
        raw = raw.strip()

        # Encoding check: drop lines that are not valid utf-8
        try:
            line = raw.decode("utf-8")
        except Exception:
            sys.stderr.write("decode error: %s\n" % raw)
            continue

        # Character check: drop lines containing the illegal unicode ranges
        need_continue = False
        for pattern_illegal in pattern_illegals:
            if len(pattern_illegal.findall(line)) > 0:
                sys.stderr.write("match_illegal error: %s\n" % line)
                need_continue = True
                break
        if need_continue:
            continue

        # Keyword filtering: drop subtitle credits and advertisements
        need_continue = False
        for keyword in filters:
            if keyword in line:
                sys.stderr.write("filter keyword of %s %s\n" % (keyword, line))
                need_continue = True
                break
        if need_continue:
            continue

        # Remove episode information ("Season ...", "Episode ...", "Frame ...")
        if re.match(u'.*第.*季.*', line) or re.match(u'.*第.*集.*', line) or re.match(u'.*第.*帧.*', line):
            sys.stderr.write("filter corpora %s\n" % line)
            continue

        # Remove HTML tags
        line = htmltagregex.sub('', line)

        # Remove style blocks in braces
        line = brace_regex.sub('', line)

        # Remove escape sequences
        line = slash_regex.sub('', line)

        # Drop lines containing long runs of repeated - or = characters
        new_line = repeat_regex.sub('', line)
        if len(new_line) != len(line):
            continue

        # Remove remaining special characters
        line = line.replace('-', '').strip()

        if len(line) > 0:
            sys.stdout.write("%s\n" % line)
            count += 1
    f.close()

References:

"Python Natural Language Processing"

http://www.shareditor.com/blogshow?blogId=103

http://www.shareditor.com/blogshow?blogId=104

http://www.shareditor.com/blogshow?blogId=105

http://www.shareditor.com/blogshow?blogId=112

Welcome to recommend machine learning job opportunities in Shanghai, my WeChat: qingxingfengzi
