Writing distributed applications with Python + Hadoop Streaming

My task these days was to take a model I developed myself and get it running on the Hadoop platform. Until now I had only been responsible for developing the model, not for this kind of deployment work, so since this was my first time doing it, I'm writing the process down.

step1: write Mapper.py

All of the prediction code is wrapped in Mapper.py, which has two parts: the first loads the model, and the second runs the prediction on each input line. It looks something like this:

# encoding=utf-8
import sys
import terminal_predict as pred
# load the model
pred.model_init()
# run the prediction on each line read from stdin
print('input the test sentence:')
for line in sys.stdin:
    pred.predict_online(line)
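Before submitting anything to the cluster, it helps to sanity-check the mapper locally by piping a test file through it on stdin, which is exactly how Hadoop Streaming will feed it data. A minimal local check, assuming a Python 3 interpreter on the PATH and the 123.txt test file created in the next step:

cat 123.txt | python Mapper.py > local_result.txt

Whatever the mapper prints to stdout is what will eventually land in the job's part-* output files.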

step2: Hadoop server settings

1. Put our code and model files into a local path on the Hadoop server, for example /data/person_ner_online. Every file the project needs has to sit directly in that one directory, at the same level: there must be no sub-directories, and the program must not import its own .py files from a sub-folder (a sketch of flattening a project this way follows this list).

2. Log in to the server and list your data directory on HDFS:

hadoop fs -ls /data/work

3. Create an input folder:

 hadoop fs -mkdir /data/work/input

4. Create your input file locally in the terminal:

vim 123.txt

5. Upload the input file to the input directory on HDFS:

hadoop fs -put 123.txt /data/work/input

6. Create an output folder to hold the job results:

hadoop fs -mkdir /data/work/output
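As a concrete illustration of the flattening required in point 1, suppose the project originally kept its sources and checkpoints in sub-folders. The sub-folder names below are made up for the example; only /data/person_ner_online comes from above:

# everything that -file will later ship has to sit at the top level of the project directory
cd /data/person_ner_online
mv model_code/*.py .       # hypothetical code sub-folder; imports then become plain "import utils" style
mv checkpoint_dir/* .      # hypothetical folder holding the model checkpoint files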

step3: write the shell script

A script template is shown below; in most cases you only need to change the input and output paths.

#!/bin/bash

# just change the input/output paths to your own
INPUT_DIR=/data/input/testCorpus.txt
OUTPUT_DIR=/data/output

hadoop fs -test -e ${OUTPUT_DIR}
if [ $? -eq 0 ] ;then
    hadoop fs -rmr ${OUTPUT_DIR}
else
    echo "${OUTPUT_DIR} not found!"
fi

hadoop jar /usr/local/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
	-archives hdfs://ns3-backup/dw_ext/up/ext/content/Python.zip#Python \
	-input hdfs://ns3-backup${INPUT_DIR} \
	-output hdfs://ns3-backup${OUTPUT_DIR} \
	-mapper "Python/bin/python Mapper.py" \
	-jobconf mapred.map.tasks=15 \
	-jobconf mapred.job.name="Mapper" \
	-file bert_config.json \
	-file bert_model.ckpt.data-00000-of-00001 \
	-file bert_model.ckpt.index \
	-file bert_model.ckpt.meta \
	-file checkpoint \
	-file bert_lstm_ner.py \
	-file conlleval.py \
	-file data_process.py \
	-file lstm_crf_layer.py \
	-file label2id.pkl \
	-file label_list.pkl \
	-file terminal_predict.py \
	-file tf_metrics.py \
	-file utils.py \
	-file model.ckpt-3173.data-00000-of-00001 \
	-file model.ckpt-3173.index \
	-file model.ckpt-3173.meta \
	-file Mapper.py \
	-file tokenization.py \
	-file modeling.py \
	-file vocab.txt

Points to pay attention to:

1. -archives points to a Python environment packaged up on HDFS. The Hadoop platform usually ships with a default Python environment of its own; if the code we run locally is written for Python 3 while the platform default is Python 2, compatibility problems appear. So we can package our own development environment, put it under a specified HDFS path, and point this parameter at it, which resolves the environment compatibility issue (a sketch of packaging and uploading such an environment follows these notes).

2. -file lists every file the code needs at execution time. If the code was organised as a package (a folder), or the model files live in a folder of their own, we have to pull all of those code and model files out to the outermost directory so that they can be shipped up to Hadoop.
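A rough sketch of how such a Python environment can be built, zipped, and uploaded, assuming a plain virtualenv and example package names; the HDFS destination is the path the -archives line above already uses. A plain venv hard-codes local paths, so in practice a fully self-contained build (for example a conda environment packed with conda-pack) is often safer; the sketch only shows the archive layout and the upload location:

# build an isolated Python 3 environment (name and packages are only examples)
python3 -m venv py3env
py3env/bin/pip install tensorflow numpy

# zip the *contents* of the environment so that bin/python sits at the top level
# of the archive; with "-archives ...Python.zip#Python" the job then sees it as
# Python/bin/python, which is what the -mapper line in the script expects
cd py3env && zip -r ../Python.zip . && cd ..

# upload the archive to the HDFS path referenced by -archives
hadoop fs -put Python.zip hdfs://ns3-backup/dw_ext/up/ext/content/Python.zip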

step4: put the .sh script on the Hadoop server and run it

Place the script in the project directory on the Hadoop server and run it with sh; with that, our model's prediction runs on Hadoop. After the run finishes, list the output directory:

hadoop fs -ls /data/work/output

and view the output file to see the prediction results:

hadoop fs -cat /data/output/part-00000
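If it is easier to look at all of the predictions as a single local file, hadoop fs -getmerge concatenates the part files down to the local machine (the local filename is just an example):

hadoop fs -getmerge /data/output predictions.txt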


For more on Hadoop and writing distributed programs, you can refer to this blog post:

https://www.cnblogs.com/joyeecheung/p/3760386.html
