Training set preprocessing and word vector generation

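The original training corpus stores one JSON object per line (the code below parses it line by line). A single record, pretty-printed here for readability, looks like this: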

{
	"sentText": "But that spasm of irritation by a master intimidator was minor compared with what Bobby Fischer , the erratic former world chess champion , dished out in March at a news conference in Reykjavik , Iceland .",
	"articleId": "/m/vinci8/data1/riedel/projects/relation/kb/nyt1/docstore/nyt-2005-2006.backup/1677367.xml.pb",
	"relationMentions": [{
		"em1Text": "Bobby Fischer",
		"em2Text": "Iceland",
		"label": "/people/person/nationality"
	}, {
		"em1Text": "Iceland",
		"em2Text": "Reykjavik",
		"label": "/location/country/capital"
	}, {
		"em1Text": "Iceland",
		"em2Text": "Reykjavik",
		"label": "/location/location/contains"
	}, {
		"em1Text": "Bobby Fischer",
		"em2Text": "Reykjavik",
		"label": "/people/deceased_person/place_of_death"
	}],
	"entityMentions": [{
		"start": 0,
		"label": "PERSON",
		"text": "Bobby Fischer"
	}, {
		"start": 1,
		"label": "LOCATION",
		"text": "Reykjavik"
	}, {
		"start": 2,
		"label": "LOCATION",
		"text": "Iceland"
	}],
	"sentId": "1"
}

Each record needs to be reduced to its sentence text only, one sentence per line:

But that spasm of irritation by a master intimidator was minor compared with what Bobby Fischer , the erratic former world chess champion , dished out in March at a news conference in Reykjavik , Iceland .

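The code is as follows:

import io
import json

train = './train.json'
result = './trainResult.txt'

# train.json holds one JSON object per line; write each sentText on its own line.
with io.open(train, 'r', encoding='utf-8') as f, \
        io.open(result, 'w', encoding='utf-8') as fw:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        data = json.loads(line)
        fw.write(data['sentText'] + u'\n')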

With the input file for word vector training generated, the next step is to train a vector for each word using the word2vec tool, which is available at:

https://github.com/dav/word2vec

The core command, based on the create-text8-vector-data.sh file, is:

time $BIN_DIR/word2vec -train $TEXT_DATA -output $VECTOR_DATA -cbow 1 -size 300 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 0 -iter 15
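
To point this command at the file generated earlier, the shell variables have to be filled in. The following is only a sketch of one way to assemble the same command from Python: the flags are copied unchanged from the command above, while the helper name build_word2vec_command and the paths '../bin' and './trainVectors.txt' are assumptions made for illustration and should be adjusted to your own setup.

```python
import subprocess  # only needed if the command is actually run


def build_word2vec_command(bin_dir, text_data, vector_data):
    """Return the argument list for the command above (flags copied unchanged)."""
    return [
        bin_dir + '/word2vec',
        '-train', text_data,
        '-output', vector_data,
        '-cbow', '1', '-size', '300', '-window', '8',
        '-negative', '25', '-hs', '0', '-sample', '1e-4',
        '-threads', '20', '-binary', '0', '-iter', '15',
    ]


# Hypothetical paths: where word2vec was built and where the vectors should go.
cmd = build_word2vec_command('../bin', './trainResult.txt', './trainVectors.txt')
print(' '.join(cmd))
# To actually run it: subprocess.check_call(cmd)
```

Passing the arguments as a list avoids shell quoting problems if a path contains spaces.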

-binary 0 writes the word vectors in text format; -binary 1 writes them in binary format.
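
With -binary 0, the output file starts with a header line holding the vocabulary size and the vector dimension, followed by one line per word: the word itself and then its vector components, separated by spaces. A minimal sketch of reading that layout back into a dictionary is shown below; the function name load_text_vectors and the tiny demo file are made up for illustration, and the demo values are not real training output.

```python
import io
import os
import tempfile


def load_text_vectors(path):
    """Parse word2vec text output: a 'vocab_size dim' header, then one word per line."""
    vectors = {}
    with io.open(path, 'r', encoding='utf-8') as f:
        vocab_size, dim = (int(x) for x in f.readline().split())
        for line in f:
            parts = line.rstrip().split(' ')
            if len(parts) != dim + 1:
                continue  # skip malformed or empty lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vocab_size, dim, vectors


# Tiny hand-made file in the same layout (values are made up for illustration).
demo_path = os.path.join(tempfile.mkdtemp(), 'demo-vectors.txt')
with io.open(demo_path, 'w', encoding='utf-8') as f:
    f.write(u'2 3\nIceland 0.1 0.2 0.3\nReykjavik 0.4 0.5 0.6\n')

vocab_size, dim, vectors = load_text_vectors(demo_path)
print(vocab_size, dim, sorted(vectors))
```

For the command above, dim would be 300, matching the -size option.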

Origin blog.csdn.net/u011939633/article/details/93871728