Hadoop Common Operations Summary

Hadoop Streaming sample program (WordCount)

run_hadoop_word_counter.sh

$HADOOP_BIN streaming \
    -input "${INPUT}" \
    -output "${OUT_DIR}" \
    -cacheArchive "${TOOL_DIR}/python2.7.2.tgz""#." \
    -file "mapper_word_counter.py" \                                                                                                                                         
    -file "reducer_word_counter.py" \
    -file "filter_word_counter.py" \
    -mapper "./python2.7.2/bin/python mapper_word_counter.py" \
    -combiner "./python2.7.2/bin/python reducer_word_counter.py" \
    -reducer "./python2.7.2/bin/python reducer_word_counter.py" \
    -jobconf abaci.job.base.environment="centos6u3_hadoop" \
    -jobconf mapred.job.priority="NORMAL" \
    -jobconf mapred.job.name="${TASK_NAME}" \
    -jobconf mapred.map.tasks="${MAP_NUM}" \
    -jobconf mapred.reduce.tasks="${REDUCE_NUM}" \
    -jobconf mapred.map.memory.limit="1000" \
    -jobconf mapred.reduce.memory.limit="1000" \
    -jobconf mapred.job.map.capacity="3000" \
    -jobconf mapred.job.reduce.capacity="2500" \
    -jobconf mapred.job.keep.files.hours=12 \
    -jobconf mapred.max.map.failures.percent=1 \
    -jobconf mapred.reduce.tasks.speculative.execution="false"

mapper_word_counter.py

import sys

# Emit "<date>\t1" for every input record; the actual counting happens in the reducer.
for line in sys.stdin:
    fields = line.strip().split('\t')
    try:
        cnt = 1
        dateval = fields[1]
        sys.stdout.write('%s\t%d\n' % (dateval, cnt))
    except Exception as exp:
        sys.stderr.write("exp:%s, %s" % (str(exp), line))

reducer_word_counter.py

import sys

word_pre = None
counter_pre = 0

# Input arrives grouped and sorted by key, so a key change means the previous key is finished.
for line in sys.stdin:
    try:
        word, cnt = line.strip().split('\t')
        cnt = int(cnt)
    except Exception as exp:
        sys.stderr.write('Exp:%s,line:%s' % (str(exp), line.strip()))
        continue

    if word == word_pre:
        counter_pre += cnt
    else:
        if word_pre:
            print('%s\t%d' % (word_pre, counter_pre))
        word_pre = word
        counter_pre = cnt

# Flush the last key.
if word_pre:
    print('%s\t%d' % (word_pre, counter_pre))

Plain text input format

  • Each mapper consumes a number of input lines
    -inputformat "org.apache.hadoop.mapred.TextInputFormat"
  • Specify the number of lines fed to each mapper (see the sketch below)
    -inputformat "org.apache.hadoop.mapred.lib.NLineInputFormat" -jobconf mapred.line.input.format.linespermap=5

File distribution methods:

-file uploads a local file from the client, packs it into the job jar on HDFS, and distributes it to the compute nodes;
-cacheFile distributes a file already on HDFS to the compute nodes;
-cacheArchive distributes an archive already on HDFS to the compute nodes and extracts it there;
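
Whichever method is used, the distributed file ends up in the task's current working directory on every compute node, so mapper/reducer code can open it by relative path. A minimal mapper sketch, assuming a hypothetical dictionary file stopwords.txt shipped via -file stopwords.txt:

import sys

# stopwords.txt was shipped with "-file stopwords.txt" (hypothetical), so it
# sits next to the mapper in the task's working directory; the same holds for
# directories extracted from a -cacheArchive, e.g. ./python2.7.2/ above.
with open('stopwords.txt') as f:
    stopwords = set(w.strip() for w in f)

for line in sys.stdin:
    for word in line.strip().split():
        if word not in stopwords:
            sys.stdout.write('%s\t1\n' % word)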

Min sorting bucket &

By default, Hadoop Streaming treats everything before the first separator (Tab, i.e. \t, by default) in a map output line as the key and everything after it as the value; if a line contains no separator, the whole line becomes the key and the value is an empty string. The keys of the mapper output are then distributed to the different reducers by the partitioner.
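
The splitting behaviour can be pictured with a few lines of plain Python (a sketch, not Hadoop code; the sample lines are made up):

SEP = '\t'  # streaming's default key/value separator

def split_key_value(line):
    # Everything before the first separator is the key, the rest is the value;
    # with no separator the whole line is the key and the value is empty.
    line = line.rstrip('\n')
    if SEP in line:
        return tuple(line.split(SEP, 1))
    return line, ''

print(split_key_value('user1\t2019-09-05\t3'))  # ('user1', '2019-09-05\t3')
print(split_key_value('no-separator-line'))     # ('no-separator-line', '')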

  • Application Examples
${HADOOP_BIN} streaming \
    -input "${INPUT}" \
    -output "${OUT_DIR}" \
    -mapper cat \
    -reducer cat \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -jobconf mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -jobconf stream.num.map.output.key.fields=4 \
    -jobconf stream.map.output.field.separator=. \
    -jobconf map.output.key.field.separator=. \
    -jobconf mapred.text.key.partitioner.options=-k1,2 \
    -jobconf mapred.text.key.comparator.options="-k3,3 -k4nr" \
    -jobconf stream.reduce.output.field.separator=. \
    -jobconf stream.num.reduce.output.key.fields=4 \
    -jobconf mapred.reduce.tasks=5

Description:

  • Setting the mapper output key
    stream.map.output.field.separator sets the map output field separator
    stream.num.map.output.key.fields sets how many leading fields of the map output form the key
  • Setting the key partitioning (bucketing) rules
    org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner is the partitioner class
    map.output.key.field.separator sets the field separator inside the key (specific to KeyFieldBasedPartitioner and KeyFieldBasedComparator)
    num.key.fields.for.partition sets how many leading fields inside the key are used for partitioning
    mapred.text.key.partitioner.options specifies which key fields to partition on; when used together with num.key.fields.for.partition, num.key.fields.for.partition takes precedence
  • Setting the key sorting rules
    KeyFieldBasedComparator is an advanced comparator that can be configured flexibly; by default it compares keys lexicographically as text, or numerically with the -n option
    mapred.text.key.comparator.options sets which key fields or character ranges inside the key are compared
  • Setting the reducer output key
    stream.reduce.output.field.separator sets the reduce output field separator
    stream.num.reduce.output.key.fields sets how many leading fields of the reduce output form the key
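
For intuition, here is a plain-Python sketch (not Hadoop code) of what the example configuration above does to made-up dotted keys such as 11.12.1.2: the first two key fields choose the reduce bucket, and within a bucket records arrive ordered by field 3 as text and by field 4 as a reversed number.

SEP = '.'  # map.output.key.field.separator in the example above

def partition_fields(key):
    # mapred.text.key.partitioner.options=-k1,2: the first two key fields
    # decide which reducer the record is sent to
    return tuple(key.split(SEP)[:2])

def sort_key(key):
    # mapred.text.key.comparator.options="-k3,3 -k4nr": order by field 3
    # as text ascending, then by field 4 as a number, reversed
    fields = key.split(SEP)
    return (fields[2], -int(fields[3]))

keys = ['11.12.1.2', '11.14.2.3', '11.11.4.1', '11.12.1.1']
# Keys sharing partition_fields() go to the same reducer; within a reducer
# they are delivered in sort_key() order.
for key in sorted(keys, key=sort_key):
    print(partition_fields(key), key)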

Multiple Output

Hadoop supports multiple outputs: the data produced by a MapReduce job can be written to several part-xxxxx-X files, where X is one of the 26 letters A-Z. The program's mapper (for map-only jobs) or reducer (for jobs with a reduce phase) must change its output from <key, value> to <key, value#X>, so the record is written to the output file with suffix X. The #X is only used to select the output file suffix and does not appear in the output content.
The launch script must specify
-outputformat org.apache.hadoop.mapred.lib.SuffixMultipleTextOutputFormat
or
-outputformat org.apache.hadoop.mapred.lib.SuffixMultipleSequenceFileOutputFormat

  • Application examples
    run_hadoop.sh
${HADOOP_BIN} streaming \
    -input "${INPUT}" \
    -output "${OUT_DIR}" \
    -cacheArchive "${TOOL_DIR}/python2.7.2.tgz""#." \
    -file "mapper_worker.sh" \
    -file "reducer_worker.py" \
    -mapper "sh mapper_worker.sh" \
    -reducer "python2.7.2/bin/python reducer_worker.py" \
    -inputformat "org.apache.hadoop.mapred.TextInputFormat" \
    -outputformat "org.apache.hadoop.mapred.lib.SuffixMultipleTextOutputFormat" \
    -jobconf mapred.job.priority="NORMAL" \
    -jobconf mapred.job.name="${TASK_NAME}" \
    -jobconf mapred.map.tasks="${MAP_NUM}" \
    -jobconf mapred.reduce.tasks="${REDUCE_NUM}" \
    -jobconf mapred.max.split.size=134217728 \
    -jobconf mapred.map.memory.limit="800" \
    -jobconf mapred.reduce.memory.limit="500" \
    -jobconf mapred.job.map.capacity="3500" \
    -jobconf mapred.job.reduce.capacity="2000" \
    -jobconf mapred.job.keep.files.hours=12 \
    -jobconf mapred.max.map.failures.percent=1 \
    -jobconf mapred.reduce.tasks.speculative.execution="false"

reducer_worker.py

import sys

# Route each record to output suffix A/B/C according to its duration.
for line in sys.stdin:
    record = line.strip()
    fields = record.split('\t')
    if len(fields) != 7:
        continue
    vcpurl, playurl, title, poster, duration, pubtime, accept = fields
    duration = int(duration)
    pubtime = int(pubtime)
    accept = int(accept)
    if duration < 60:
        sys.stdout.write('%s#A\n' %(record))
    elif duration < 300:
        sys.stdout.write('%s#B\n' %(record))
    else:
        sys.stdout.write('%s#C\n' %(record))

Local Debugging

To avoid discovering program bugs only after the MR job has started, it is best to simulate the MR flow locally in advance and verify that the results meet expectations:

cat inputfile | ./mapper_task.sh | sort -t$'\t' -k1,1 | ./reducer.sh

Compression output

Hadoop supports gzip compression by default; a streaming job can write its output in gzip form by specifying the following parameters.

-D mapreduce.output.fileoutputformat.compress=true
-D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec

Hadoop can read gzip-compressed data by itself, so no special input option is needed for gzip-compressed input. Gzip offers a relatively high compression ratio and native Hadoop support; its drawback is that compression is not particularly fast. Since compression ratio and speed cannot both be maximized, consider other compression codecs when the trade-off matters.
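
To spot-check compressed output locally, the part files can be read with Python's gzip module. A minimal sketch, assuming one part file has already been fetched to the local disk as part-00000.gz:

import gzip

# part-00000.gz is assumed to have been fetched locally first, for example
# with: hadoop fs -get ${OUT_DIR}/part-00000.gz .
f = gzip.open('part-00000.gz', 'rb')
for i, line in enumerate(f):
    print(line.rstrip())   # records exactly as the reducer wrote them
    if i >= 9:             # peek at the first ten records only
        break
f.close()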

Hadoop Common Configuration Items

Configuration name    Explanation
abaci.job.base.environment    centos6u3_hadoop; specify this if the runtime environment needs upgrading, as centos6u3_hadoop supports newer glibc versions
stream.memory.limit    maximum memory for a single map/reduce task, default 800 MB
mapred.map.memory.limit    maximum memory for a single map task; takes precedence over stream.memory.limit
mapred.reduce.memory.limit    maximum memory for a single reduce task; takes precedence over stream.memory.limit
mapred.map.capacity.per.tasktracker    maximum number of map tasks started simultaneously on each machine
mapred.reduce.capacity.per.tasktracker    maximum number of reduce tasks started simultaneously on each machine
mapred.job.map.capacity    number of concurrent map tasks
mapred.job.reduce.capacity    number of concurrent reduce tasks
abaci.job.map.max.capacity    map concurrency limit, default 10000
abaci.job.reduce.max.capacity    reduce concurrency limit, default 3000
mapred.map.tasks    number of map tasks
mapred.reduce.tasks    number of reduce tasks
mapred.job.reuse.jvm.num.tasks    1 means no reuse, -1 means unlimited reuse, any other value is the number of tasks each JVM is reused for; with reuse, memory is not released when a map finishes
mapred.compress.map.output    whether to compress map output; reduces data volume and I/O pressure, but compression and decompression cost CPU, so choose the codec carefully
mapred.map.output.compression.codec    compression codec for map output
mapred.output.compress    whether the reduce output is compressed
mapred.output.compression.codec    compression codec for the job output
io.compression.codecs    available compression codecs
mapred.max.map.failures.percent    tolerated percentage of failed map tasks, default 0
mapred.max.reduce.failures.percent    tolerated percentage of failed reduce tasks, default 0
stream.map.output.field.separator    map output separator, default Tab
stream.reduce.output.field.separator    reduce output separator, default Tab
mapred.textoutputformat.separator    separator between key and value in TextOutputFormat output, default Tab
mapred.textoutputformat.ignoreseparator    when set to true, removes the Tab that is otherwise appended automatically when a record has only a key and no value
mapred.min.split.size    minimum amount of data processed by a map, in bytes
mapred.max.split.size    maximum amount of data processed by a map, in bytes; set together with inputformat=org.apache.hadoop.mapred.CombineTextInputFormat
mapred.combine.input.format.local.only    whether to combine splits only within the current node, default true; set to false to combine data across nodes
abaci.job.map.cpu.percent    CPU share consumed by a map, default 40 (40% of one CPU, i.e. 0.4 CPU)
abaci.job.reduce.cpu.percent    CPU share consumed by a reduce, default 40 (40% of one CPU, i.e. 0.4 CPU)
mapred.map.capacity.per.tasktracker    maximum number of this job's map tasks that run in parallel on each node (raise or lower according to memory; default 8)
mapred.reduce.capacity.per.tasktracker    maximum number of this job's reduce tasks that run in parallel on each node (raise or lower according to memory; default 8)
mapred.map.tasks.speculative.execution    enable map speculative execution, default true
mapred.reduce.tasks.speculative.execution    enable reduce speculative execution, default true

System variables in the Hadoop environment

  • Variable list
Variable name    Description
HADOOP_HOME    Hadoop installation path configured on the compute node
LD_LIBRARY_PATH    list of library load paths on the compute node
PWD    current working directory
dfs_block_size    currently configured HDFS block size
map_input_file    path of the input file the mapper is currently processing
mapred_job_id    job ID
mapred_job_name    job name
mapred_tip_id    retry attempt number of the current task
mapred_task_id    task ID
mapred_task_is_map    whether the current task is a map task
mapred_output_dir    output path of the job
mapred_map_tasks    number of map tasks of the job
mapred_reduce_tasks    number of reduce tasks of the job

https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Configured+Parameters

  • Application example:
    Shell version
#!/bin/bash

set -o pipefail
HOST="localhost"
PORT=$((1000 + ${mapred_task_partition}))

awk '{print $2}' \
    | ./access_remote_data ${HOST} ${PORT} outdata.gz
                                                                                                                                                     
hdfs_outfile=${mapred_work_output_dir}/${mapred_task_partition}.pack
cat outdata.gz \
    | gzip -d \
    | python ../postprocess.py \
    | ${HADOOP_HOME}/bin/hadoop fs -D hadoop.job.ugi="username,pwd" -copyFromLocal - ${hdfs_outfile}

Python version

import os

input_file = os.environ['mapreduce_map_input_file']
#do something else

References

Hadoop Streaming official documentation: https://hadoop.apache.org/docs/r3.1.2/hadoop-streaming/HadoopStreaming.html
Introduction to Hadoop Streaming: http://icejoywoo.github.io/2015/09/28/introduction-to-hadoop-streaming.html
Notes on Hadoop sorting tool usage: http://www.dreamingfish123.info/?p=1102
Trade-offs of Hadoop compression options: https://www.slideshare.net/Hadoop_Summit/singh-kamat-june27425pmroom210c
