Hadoop Streaming parameter reference

1. Introduction to Streaming

Hadoop Streaming is a programming tool provided by Hadoop. The Streaming framework allows any executable or script file to act as the Mapper and Reducer of a Hadoop MapReduce job, which makes it easy to migrate existing programs to the Hadoop platform. It is therefore an important contributor to Hadoop's extensibility.

The principle of Streaming: the mapper and reducer read data from standard input, process it line by line, and write the results to standard output.

If a file (an executable or a script) is used as the mapper, each mapper task launches that file as a separate process when the task is initialized. As the mapper task runs, it splits its input into lines and feeds each line to the standard input of the executable's process. At the same time, the mapper collects the content of the process's standard output and turns each output line into a key/value pair, which becomes the mapper's output. By default, the part of a line before the first tab character is the key and the part after it is the value; if a line contains no tab, the entire line is the key and the value is null (empty).
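As an illustration of this contract, here is a minimal word-count mapper and reducer written as shell scripts (the names mapper.sh and reducer.sh are placeholders, not part of the original text). Both simply read lines from standard input and write tab-separated key/value lines to standard output, which is all the Streaming framework requires; the same pair is reused in the example command in the next section.

    #!/usr/bin/env bash
    # mapper.sh: read text lines from stdin, emit "<word>\t1" for every word
    while read -r line; do
      for word in $line; do
        printf '%s\t1\n' "$word"
      done
    done

And the matching reducer:

    #!/usr/bin/env bash
    # reducer.sh: stdin arrives sorted by key; sum the counts for each key
    current=""
    count=0
    while IFS=$'\t' read -r key value; do
      if [[ "$key" == "$current" ]]; then
        count=$(( count + value ))
      else
        if [[ -n "$current" ]]; then
          printf '%s\t%d\n' "$current" "$count"
        fi
        current="$key"
        count="$value"
      fi
    done
    if [[ -n "$current" ]]; then
      printf '%s\t%d\n' "$current" "$count"
    fi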

2. Introduction to basic configuration parameters

Among the parameters below, -files, -archives, and -libjars are generic options of the hadoop command, while -file, -cacheFile, and -cacheArchive are the older Streaming-specific equivalents. A sketch of a complete command using several of these options follows the list.

  • -file packages the specified local file on the client into the job submission, uploads it to HDFS, and distributes it to the compute nodes
  • -cacheFile distributes a file already stored on HDFS to the compute nodes
  • -cacheArchive distributes a compressed archive stored on HDFS to the compute nodes and unpacks it there; a symbolic link to the unpacked directory can also be specified
  • -files: distributes the specified local/HDFS files to the working directory of each task without any further processing, for example: -files hdfs://host:fs_port/user/testfile.txt#testlink
  • -archives: copies the specified archive (for example a jar) to the working directory of the current task and unpacks it automatically, for example: -archives hdfs://host:fs_port/user/testfile.jar#testlink3. In this example the symbolic link testlink3 is created in the task's working directory, pointing to the directory that holds the unpacked jar
  • -libjars: specifies jar packages to distribute. After Hadoop ships these jars to each node, they are automatically added to the task's CLASSPATH environment variable
  • -input: specifies the job input, which can be a file or a directory; the * wildcard is allowed, and multiple files or directories may be given as input
  • -output: specifies the job output directory, which must not already exist and which the user must have permission to create. -output may only be used once
  • -mapper: specifies the mapper executable or Java class; it must be specified, and only one mapper may be given
  • -reducer: specifies the reducer executable or Java class
  • -numReduceTasks: specifies the number of reducers. If it is set to 0, or -reducer NONE is given, no reducer runs and the mapper output becomes the output of the whole job
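As a sketch of how these options fit together (the paths and script names are placeholders, not taken from the original text), a typical Streaming submission using the mapper.sh/reducer.sh pair from the previous section might look like this:

    /bin/hadoop jar hadoop-0.19.2-streaming.jar \
        -input /user/demo/input \
        -output /user/demo/wordcount_output \
        -mapper "bash mapper.sh" \
        -reducer "bash reducer.sh" \
        -file mapper.sh \
        -file reducer.sh \
        -numReduceTasks 2

Here -file ships the two local scripts to every compute node, and -output names a directory that must not already exist.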

3. Other parameter configuration

Job parameters and mapper/reducer task parameters are generally specified with -jobconf or -D NAME=VALUE. Older documentation uses -jobconf, but that option is deprecated; -D is the officially recommended form. Note that -D is a generic option and must appear at the front of the argument list, before the Streaming-specific options, because these values have to be fixed before the mapper and reducer run: /bin/hadoop jar hadoop-0.19.2-streaming.jar -D ……… -input …….. and so on. A sketch of such a command appears after the parameter list below.

  • mapred.job.name="jobname" sets the job name; strongly recommended
  • mapred.job.priority=VERY_HIGH|HIGH|NORMAL|LOW|VERY_LOW sets the job priority
  • mapred.job.map.capacity=M allows at most M map tasks to run simultaneously
  • mapred.job.reduce.capacity=N allows at most N reduce tasks to run simultaneously
  • mapred.map.tasks sets the number of map tasks
  • mapred.reduce.tasks sets the number of reduce tasks
  • mapred.job.groups specifies the groups of compute nodes the job may run on
  • mapred.task.timeout is the maximum time a task may go without responding (no input/output)
  • mapred.compress.map.output sets whether the map output is compressed
  • mapred.map.output.compression.codec sets the compression codec for the map output
  • mapred.output.compress sets whether the job (reduce) output is compressed
  • mapred.output.compression.codec sets the compression codec for the reduce output
  • stream.map.output.field.separator specifies the separator in the map output. By default the Streaming framework takes the part of each map output line before the first '\t' as the key and the rest as the value; these key/value pairs then become the reduce input
  • stream.num.map.output.key.fields sets which separator splits key from value. If set to 2, the line is split at the second separator, i.e. everything before the second separator is the key and everything after it is the value; if a line contains fewer than 2 separators, the whole line becomes the key and the value is empty
  • stream.reduce.output.field.separator specifies the separator in the reduce output
  • stream.num.reduce.output.key.fields specifies the separator position (number of key fields) in the reduce output
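As a minimal sketch of where the -D settings go (the job name, numbers, and paths are placeholders, not from the original text), note that every -D option comes before -input and the other Streaming options:

    /bin/hadoop jar hadoop-0.19.2-streaming.jar \
        -D mapred.job.name="streaming_demo" \
        -D mapred.job.priority=HIGH \
        -D mapred.map.tasks=10 \
        -D mapred.reduce.tasks=4 \
        -D mapred.task.timeout=600000 \
        -input /user/demo/input \
        -output /user/demo/output \
        -mapper /bin/cat \
        -reducer /usr/bin/wc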

In addition, there are compression parameters; an example command follows the list below.

Whether the job output is compressed:

  • mapred.output.compress: whether to compress; default false
  • mapred.output.compression.type: compression type, one of NONE, RECORD, or BLOCK; default RECORD
  • mapred.output.compression.codec: compression codec; default org.apache.hadoop.io.compress.DefaultCodec

Whether the map task output is compressed:

  • mapred.compress.map.output: whether to compress; default false
  • mapred.map.output.compression.codec: compression codec; default org.apache.hadoop.io.compress.DefaultCodec
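A sketch of turning on compression from the command line (the codec choice and paths are illustrative, not from the original text); here both the map output and the job output are compressed with gzip:

    /bin/hadoop jar hadoop-0.19.2-streaming.jar \
        -D mapred.compress.map.output=true \
        -D mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
        -D mapred.output.compress=true \
        -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
        -input /user/demo/input \
        -output /user/demo/compressed_output \
        -mapper /bin/cat \
        -reducer /usr/bin/wc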

In addition, Hadoop itself ships with some useful Mappers and Reducers:

1. Hadoop aggregation: Aggregate provides a special reducer class and a special combiner class, along with a series of "aggregators" (for example sum, max, min) for aggregating a sequence of values. With Aggregate you can define a mapper plugin class that produces an "aggregatable item" for every key/value pair of the mapper input; the combiner/reducer then aggregates these items with the appropriate aggregator. To use Aggregate, simply specify "-reducer aggregate" (see the sketch after this list).
2. Field selection (similar to 'cut' in Unix): the Hadoop utility class org.apache.hadoop.mapred.lib.FieldSelectionMapReduce helps users process text data efficiently, much like the Unix "cut" tool. The map function in this class treats each input key/value pair as a list of fields. The user can specify the field separator (tab by default) and select any range of fields (one or more consecutive fields from the list) as the key or value of the map output. Likewise, the reduce function treats its input key/value pairs as field lists, and the user can select any range as the key or value of the reduce output.
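A minimal sketch of point 1 (the script name wc_aggregate_mapper.sh and the paths are placeholders, not from the original text): the mapper prefixes each output key with the name of an aggregator, here LongValueSum, and the built-in aggregate reducer then sums the values for each key.

    #!/usr/bin/env bash
    # wc_aggregate_mapper.sh: emit "LongValueSum:<word>\t1" for every word on stdin
    while read -r line; do
      for word in $line; do
        printf 'LongValueSum:%s\t1\n' "$word"
      done
    done

Submitting the job with the built-in aggregate reducer:

    /bin/hadoop jar hadoop-0.19.2-streaming.jar \
        -input /user/demo/input \
        -output /user/demo/aggregate_wordcount \
        -mapper "bash wc_aggregate_mapper.sh" \
        -reducer aggregate \
        -file wc_aggregate_mapper.sh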


4. Default behavior

By default, Hadoop Streaming uses "\t" as the separator. For the standard input, the part of each line before the first "\t" is the key and the rest is the corresponding value. If a line contains no "\t" character, the whole line is treated as the key.

5. Sort and partition in the map phase

Sort and partition are important stages of the map phase. Sorting is performed on the key, and as described above the default key is whatever precedes the first "\t". Can we control sort and partition ourselves? The answer is yes.

First look at the following parameters (a combined example command follows them):
map.output.key.field.separator: the separator used inside the key of the map output
num.key.fields.for.partition: when bucketing, after the key has been split on the separator above, the number of leading key columns used for partitioning. In plain terms, partitioning is done on the first few columns of the key, and records with the same value in those columns are sent to the same reducer.
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner: the two parameters above must be used together with this partitioner option!

stream.map.output.field.separator: the separator between key and value in the map output
stream.num.map.output.key.fields: the position of that separator in the map output (how many fields form the key)
stream.reduce.output.field.separator: the separator between key and value in the reduce output
stream.num.reduce.output.key.fields: the position of that separator in the reduce output
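As a sketch of how these fit together (the separator, field counts, and paths are illustrative, not from the original text): the key of each map output line consists of its first 4 dot-separated fields, but only the first 2 of those fields are used for partitioning, so records sharing the same first two fields go to the same reducer.

    /bin/hadoop jar hadoop-0.19.2-streaming.jar \
        -D stream.map.output.field.separator=. \
        -D stream.num.map.output.key.fields=4 \
        -D map.output.key.field.separator=. \
        -D num.key.fields.for.partition=2 \
        -input /user/demo/input \
        -output /user/demo/partition_output \
        -mapper /bin/cat \
        -reducer /bin/cat \
        -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

For example, the output lines 11.12.1.2 and 11.12.1.3 share 11.12 as their first two key fields, so they land in the same reducer, where they are sorted by the full key.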


