Hadoop Streaming sample program (WordCount)
run_hadoop_word_counter.sh
$HADOOP_BIN streaming \
-input "${INPUT}" \
-output "${OUT_DIR}" \
-cacheArchive "${TOOL_DIR}/python2.7.2.tgz#." \
-file "mapper_word_counter.py" \
-file "reducer_word_counter.py" \
-file "filter_word_counter.py" \
-mapper "./python2.7.2/bin/python mapper_word_counter.py" \
-combiner "./python2.7.2/bin/python reducer_word_counter.py" \
-reducer "./python2.7.2/bin/python reducer_word_counter.py" \
-jobconf abaci.job.base.environment="centos6u3_hadoop" \
-jobconf mapred.job.priority="NORMAL" \
-jobconf mapred.job.name="${TASK_NAME}" \
-jobconf mapred.map.tasks="${MAP_NUM}" \
-jobconf mapred.reduce.tasks="${REDUCE_NUM}" \
-jobconf mapred.map.memory.limit="1000" \
-jobconf mapred.reduce.memory.limit="1000" \
-jobconf mapred.job.map.capacity="3000" \
-jobconf mapred.job.reduce.capacity="2500" \
-jobconf mapred.job.keep.files.hours=12 \
-jobconf mapred.max.map.failures.percent=1 \
-jobconf mapred.reduce.tasks.speculative.execution="false"
mapper_word_counter.py
import sys

for line in sys.stdin:
    fields = line.strip().split('\t')
    try:
        cnt = 1
        dateval = fields[1]
        sys.stdout.write('%s\t%d\n' % (dateval, cnt))
    except Exception as exp:
        sys.stderr.write('exp:%s, %s' % (str(exp), line))
reducer_word_counter.py
import sys

word_pre = None
counter_pre = 0
for line in sys.stdin:
    try:
        word, cnt = line.strip().split('\t')
        cnt = int(cnt)
    except Exception as exp:
        sys.stderr.write('Exp:%s,line:%s' % (str(exp), line.strip()))
        continue
    if word == word_pre:
        counter_pre += cnt
    else:
        if word_pre is not None:
            print('%s\t%d' % (word_pre, counter_pre))
        word_pre = word
        counter_pre = cnt
if word_pre is not None:
    print('%s\t%d' % (word_pre, counter_pre))
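Before submitting a job, the mapper/reducer logic above can be sanity-checked in-process. The functions below are simplified re-implementations written for local testing only; the sample data is illustrative:

```python
# In-process simulation of the streaming flow: map, sort by key, reduce.

def map_lines(lines):
    """Emit (date, 1) for the second tab-separated field of each line,
    mirroring mapper_word_counter.py."""
    for line in lines:
        fields = line.strip().split('\t')
        if len(fields) > 1:
            yield (fields[1], 1)

def reduce_pairs(pairs):
    """Sum counts over runs of identical keys, mirroring
    reducer_word_counter.py; input must already be sorted by key."""
    word_pre, counter_pre = None, 0
    for word, cnt in pairs:
        if word == word_pre:
            counter_pre += cnt
        else:
            if word_pre is not None:
                yield (word_pre, counter_pre)
            word_pre, counter_pre = word, cnt
    if word_pre is not None:
        yield (word_pre, counter_pre)

sample = ['id1\t2023-01-01', 'id2\t2023-01-02', 'id3\t2023-01-01']
result = dict(reduce_pairs(sorted(map_lines(sample))))
print(result)  # {'2023-01-01': 2, '2023-01-02': 1}
```

The `sorted()` call plays the role of the shuffle phase, which guarantees that equal keys arrive at the reducer consecutively.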
Plain text input formats
- TextInputFormat: each mapper receives a batch of input lines
  -inputformat "org.apache.hadoop.mapred.TextInputFormat"
- NLineInputFormat: specifies the number of lines each mapper receives
  -inputformat "org.apache.hadoop.mapred.lib.NLineInputFormat" -jobconf mapred.line.input.format.linespermap="5"
File distribution methods:
- -file: uploads a local client file (packed into the job jar) to HDFS, then distributes it to the compute nodes;
- -cacheFile: distributes a file already on HDFS to the compute nodes;
- -cacheArchive: distributes a compressed archive on HDFS to the compute nodes and extracts it there.
Partitioning & sorting
By default, Hadoop Streaming treats the part of each map output line before the first separator (Tab by default) as the key and the rest as the value; if a line contains no separator, the whole line becomes the key and the value is an empty string. Keys emitted by the mapper are distributed to the different reducers by the partitioner.
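This default key/value splitting can be sketched in Python (a local illustration, not Hadoop's actual code):

```python
def split_streaming_line(line, sep='\t'):
    """Mimic Hadoop Streaming's default key/value split: everything before
    the first separator is the key; if no separator is present, the whole
    line is the key and the value is an empty string."""
    if sep in line:
        key, value = line.split(sep, 1)
    else:
        key, value = line, ''
    return key, value

print(split_streaming_line('user1\t3\t5'))  # ('user1', '3\t5')
print(split_streaming_line('loneline'))     # ('loneline', '')
```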
- Application Examples
${HADOOP_BIN} streaming \
-input "${INPUT}" \
-output "${OUT_DIR}" \
-mapper cat \
-reducer cat \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-jobconf mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-jobconf stream.num.map.output.key.fields=4 \
-jobconf stream.map.output.field.separator=. \
-jobconf map.output.key.field.separator=. \
-jobconf mapred.text.key.partitioner.options=-k1,2 \
-jobconf mapred.text.key.comparator.options="-k3,3 -k4nr" \
-jobconf stream.reduce.output.field.separator=. \
-jobconf stream.num.reduce.output.key.fields=4 \
-jobconf mapred.reduce.tasks=5
Description:
- Setting the mapper output key:
  stream.map.output.field.separator sets the field separator of the map output
  stream.num.map.output.key.fields sets how many leading fields of the map output form the key
- Setting the key partitioning (bucketing) rules:
  org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner is the partitioner class
  map.output.key.field.separator sets the field separator within the key (specific to KeyFieldBasedPartitioner and KeyFieldBasedComparator)
  num.key.fields.for.partition sets how many leading fields within the key are used for partitioning
  mapred.text.key.partitioner.options specifies which key fields to partition on; when both this and num.key.fields.for.partition are set, num.key.fields.for.partition takes precedence
- Setting the key sorting rules:
  KeyFieldBasedComparator is a flexibly configurable comparator; by default it compares Text lexicographically, or numerically with the -n option
  mapred.text.key.comparator.options specifies the key fields or character ranges to compare
- Setting the reducer output key:
  stream.reduce.output.field.separator sets the field separator of the reduce output
  stream.num.reduce.output.key.fields sets how many leading fields of the reduce output form the key
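The effect of the example job's options on a 4-field key split on '.' can be emulated locally. This is a sketch only: Python's hash() stands in for Hadoop's key hashing, and the sort key mirrors "-k3,3 -k4nr" (field 3 ascending, field 4 numeric descending):

```python
def partition(key, num_reducers):
    """Bucket on the first two key fields, mirroring -k1,2."""
    fields = key.split('.')
    return hash('.'.join(fields[:2])) % num_reducers

def sort_key(key):
    """Order by field 3 ascending, then field 4 numeric descending,
    mirroring -k3,3 -k4nr."""
    fields = key.split('.')
    return (fields[2], -int(fields[3]))

keys = ['a.b.x.2', 'a.b.x.10', 'a.b.w.5']
print(sorted(keys, key=sort_key))  # ['a.b.w.5', 'a.b.x.10', 'a.b.x.2']
```

Note that without the numeric option, '10' would sort before '2' lexicographically; -k4nr sorts it numerically and in reverse.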
Multiple Output
Hadoop Streaming supports splitting MapReduce output across multiple files named part-xxxxx-X, where X is one of the 26 letters A-Z. To use this, the mapper (for map-only jobs) or the reducer changes its output records from <key, value> to <key, value#X>, and each record is then written to the output file with suffix X. The #X suffix only selects the output file; it does not appear in the output content.
The startup script must also specify
-outputformat org.apache.hadoop.mapred.lib.SuffixMultipleTextOutputFormat
or
-outputformat org.apache.hadoop.mapred.lib.SuffixMultipleSequenceFileOutputFormat
- Application examples
run_hadoop.sh
${HADOOP_BIN} streaming \
-input "${INPUT}" \
-output "${OUT_DIR}" \
-cacheArchive "${TOOL_DIR}/python2.7.2.tgz#." \
-file "mapper_worker.sh" \
-file "reducer_worker.py" \
-mapper "sh mapper_worker.sh" \
-reducer "python2.7.2/bin/python reducer_worker.py" \
-inputformat "org.apache.hadoop.mapred.TextInputFormat" \
-outputformat "org.apache.hadoop.mapred.lib.SuffixMultipleTextOutputFormat" \
-jobconf mapred.job.priority="NORMAL" \
-jobconf mapred.job.name="${TASK_NAME}" \
-jobconf mapred.map.tasks="${MAP_NUM}" \
-jobconf mapred.reduce.tasks="${REDUCE_NUM}" \
-jobconf mapred.max.split.size=134217728 \
-jobconf mapred.map.memory.limit="800" \
-jobconf mapred.reduce.memory.limit="500" \
-jobconf mapred.job.map.capacity="3500" \
-jobconf mapred.job.reduce.capacity="2000" \
-jobconf mapred.job.keep.files.hours=12 \
-jobconf mapred.max.map.failures.percent=1 \
-jobconf mapred.reduce.tasks.speculative.execution="false"
reducer_worker.py
import sys

for line in sys.stdin:
    record = line.strip()
    fields = record.split('\t')
    if len(fields) != 7:
        continue
    vcpurl, playurl, title, poster, duration, pubtime, accept = fields
    duration = int(duration)
    pubtime = int(pubtime)
    accept = int(accept)
    if duration < 60:
        sys.stdout.write('%s#A\n' % record)
    elif duration < 300:
        sys.stdout.write('%s#B\n' % record)
    else:
        sys.stdout.write('%s#C\n' % record)
Local Debugging
To catch program bugs before launching the MR job, it is best to simulate the MapReduce flow locally first and verify that the results match expectations:
cat inputfile | ./mapper_task.sh | sort -t$'\t' -k1,1 | ./reducer.sh
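The same local pipeline can be driven from Python when iterating on mapper/reducer code. This is a sketch: the mapper and reducer commands are whatever your job actually uses, and cat stands in for both stages in the demo below.

```python
import os
import subprocess
import tempfile

def run_local_mr(input_path, mapper_cmd, reducer_cmd):
    """Simulate 'cat input | mapper | sort | reducer' locally.
    mapper_cmd / reducer_cmd are argv lists, e.g.
    ['python', 'mapper_word_counter.py']."""
    with open(input_path, 'rb') as f:
        mapped = subprocess.run(mapper_cmd, stdin=f,
                                stdout=subprocess.PIPE, check=True).stdout
    # Emulate the shuffle: sort by the first tab-separated field.
    shuffled = subprocess.run(['sort', '-t', '\t', '-k1,1'], input=mapped,
                              stdout=subprocess.PIPE, check=True).stdout
    return subprocess.run(reducer_cmd, input=shuffled,
                          stdout=subprocess.PIPE, check=True).stdout

# Demo with 'cat' standing in for both mapper and reducer:
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('b\t1\na\t2\n')
    demo_input = f.name
out = run_local_mr(demo_input, ['cat'], ['cat'])
os.unlink(demo_input)
print(out)  # b'a\t2\nb\t1\n'
```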
Compression output
Hadoop supports gzip compression out of the box; a streaming job can write gzip-compressed output by specifying the following parameters:
-D mapreduce.output.fileoutputformat.compress=true
-D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec
Hadoop can read gzip-compressed data by itself, so no special input format needs to be specified for gzip input. Gzip's strengths are a relatively high compression ratio and native Hadoop support; its drawback is modest compression speed. Since ratio and speed cannot both be maximized, consider other codecs when the trade-off matters.
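For spot-checking compressed output, a downloaded part file can be read locally with Python's standard gzip module. A sketch; the file here is a stand-in created for the demonstration rather than real job output:

```python
import gzip
import os
import tempfile

# Stand-in for a downloaded gzip-compressed part file (e.g. part-00000.gz).
tmp = os.path.join(tempfile.mkdtemp(), 'part-00000.gz')
with gzip.open(tmp, 'wt') as f:
    f.write('2023-01-01\t2\n')

# Read it back line by line, as you would inspect real job output.
with gzip.open(tmp, 'rt') as f:
    content = f.read()
print(content, end='')
```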
Hadoop Common Configuration Item
Configuration name | Explanation |
---|---|
abaci.job.base.environment | If the job needs an upgraded system environment, specify centos6u3_hadoop to get support for later glibc versions |
stream.memory.limit | Maximum memory for a single map/reduce task, default 800 MB |
mapred.map.memory.limit | Maximum memory for a single map task; takes precedence over stream.memory.limit |
mapred.reduce.memory.limit | Maximum memory for a single reduce task; takes precedence over stream.memory.limit |
mapred.map.capacity.per.tasktracker | Maximum number of this job's map tasks run concurrently on each machine (adjust together with the memory parameters; default 8) |
mapred.reduce.capacity.per.tasktracker | Maximum number of this job's reduce tasks run concurrently on each machine (adjust together with the memory parameters; default 8) |
mapred.job.map.capacity | Number of concurrent map tasks |
mapred.job.reduce.capacity | Number of concurrent reduce tasks |
abaci.job.map.max.capacity | Upper limit on map concurrency, default 10000 |
abaci.job.reduce.max.capacity | Upper limit on reduce concurrency, default 3000 |
mapred.map.tasks | Number of map tasks |
mapred.reduce.tasks | Number of reduce tasks |
mapred.job.reuse.jvm.num.tasks | 1 means no reuse, -1 means unlimited reuse; any other value is the number of tasks each JVM runs. With reuse, memory is not freed when a map task ends |
mapred.compress.map.output | Whether to compress map output. Reduces data volume and I/O pressure, but compression and decompression cost CPU; choose the codec carefully |
mapred.map.output.compression.codec | Compression codec for map output |
mapred.output.compress | Whether reduce output is compressed |
mapred.output.compression.codec | Compression codec for the job output |
io.compression.codecs | Available compression codecs |
mapred.max.map.failures.percent | Tolerated percentage of failed map tasks, default 0 |
mapred.max.reduce.failures.percent | Tolerated percentage of failed reduce tasks, default 0 |
stream.map.output.field.separator | Map output separator, default Tab |
stream.reduce.output.field.separator | Reduce output separator, default Tab |
mapred.textoutputformat.separator | Key/value separator used by TextOutputFormat, default Tab |
mapred.textoutputformat.ignoreseparator | When true, the trailing Tab automatically appended to records with no value is removed |
mapred.min.split.size | Minimum amount of data processed per map, in bytes |
mapred.max.split.size | Maximum amount of data processed per map, in bytes; set together with inputformat=org.apache.hadoop.mapred.CombineTextInputFormat |
mapred.combine.input.format.local.only | Combine splits only within the current node, default true; set to false to combine data across nodes |
abaci.job.map.cpu.percent | CPU share consumed per map, default 40 (40% of one CPU, i.e. 0.4 CPU) |
abaci.job.reduce.cpu.percent | CPU share consumed per reduce, default 40 (40% of one CPU, i.e. 0.4 CPU) |
mapred.map.tasks.speculative.execution | Enable speculative execution for map tasks, default true |
mapred.reduce.tasks.speculative.execution | Enable speculative execution for reduce tasks, default true |
System variables in the Hadoop environment
- Variable list
Variable | Description |
---|---|
HADOOP_HOME | Hadoop installation path configured on the compute node |
LD_LIBRARY_PATH | List of library-loading paths on the compute node |
PWD | Current working directory |
dfs_block_size | Currently configured HDFS block size |
map_input_file | Path of the input file the mapper is processing |
mapred_job_id | Job ID |
mapred_job_name | Job name |
mapred_tip_id | Retry number of the current task attempt |
mapred_task_id | Task ID |
mapred_task_is_map | Whether the current task is a map task |
mapred_output_dir | Output path of the job |
mapred_map_tasks | Number of map tasks of the job |
mapred_reduce_tasks | Number of reduce tasks of the job |
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Configured+Parameters
- Application examples:
Shell version
#!/bin/bash
set -o pipefail
HOST="localhost"
PORT=$((1000 + ${mapred_task_partition}))
awk '{print $2}' \
| ./access_remote_data ${HOST} ${PORT} outdata.gz
hdfs_outfile=${mapred_work_output_dir}/${mapred_task_partition}.pack
cat outdata.gz \
| gzip -d \
| python ../postprocess.py \
| ${HADOOP_HOME}/bin/hadoop fs -D hadoop.job.ugi="username,pwd" -copyFromLocal - ${hdfs_outfile}
Python version
import os
input_file = os.environ['mapreduce_map_input_file']
#do something else
References
Hadoop Streaming official documentation: https://hadoop.apache.org/docs/r3.1.2/hadoop-streaming/HadoopStreaming.html
Introduction to Hadoop Streaming: http://icejoywoo.github.io/2015/09/28/introduction-to-hadoop-streaming.html
Notes on Hadoop sorting tool usage: http://www.dreamingfish123.info/?p=1102
Hadoop compression trade-offs: https://www.slideshare.net/Hadoop_Summit/singh-kamat-june27425pmroom210c