(Repost) How to remove the separator automatically added by MapReduce

    When developing programs with MapReduce in streaming mode, we often find that a separator has been added to the output for no apparent reason, such as a Tab character in the middle or at the end of a line. In particular, when the output has only one field, a Tab is always appended to the end of the line. This looks ugly and can affect the correctness of downstream programs, so it needs to be removed.

 
    First, let's see where it comes from. Streaming MapReduce organizes a program's output as key/value pairs, and it needs a separator between the key and the value to tell them apart. The default separator is Tab. It can be changed with -jobconf stream.map.output.field.separator= (for the map output) and -jobconf mapred.textoutputformat.separator= (for the final text output).
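    The effect can be reproduced outside Hadoop with a small simulation. The sketch below is not Hadoop code; it only assumes the behavior described above: each output line is split into key and value at the first separator, and the pair is written back out joined by the separator. The function names are made up for illustration.

```python
def split_key_value(line, sep="\t"):
    """Split a line at the first separator.

    A line that contains no separator is treated as all key, with an empty value.
    """
    key, _, value = line.partition(sep)
    return key, value


def format_record(key, value, sep="\t"):
    """Join key and value with the separator, as a text output writer would."""
    return key + sep + value


def round_trip(line, in_sep="\t", out_sep="\t"):
    """Simulate one pass of a line through the key/value machinery."""
    key, value = split_key_value(line, in_sep)
    return format_record(key, value, out_sep)


# A two-field line survives unchanged; a one-field line gains a trailing separator.
print(repr(round_trip("apple\t3")))
print(repr(round_trip("apple")))
# Changing the separator only changes which character gets appended.
print(repr(round_trip("apple", in_sep=",", out_sep=",")))
```

    In this model, a single-field line always comes back with a trailing separator, whichever separator is configured.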
 
    As mentioned above, if the mapper or reducer outputs only a key and no value (Tab is the default key/value delimiter, so a line with no Tab in it is treated as a key with no value), the MapReduce framework automatically appends a Tab to the record. Changing the separator does not help: the framework simply appends the new separator instead. For this situation, Hadoop provides a parameter: adding -jobconf mapred.textoutputformat.ignoreseparator=true removes the automatically added Tab.
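    As a rough illustration of where the parameter goes, the sketch below assembles a streaming command line in Python. The jar name, HDFS paths and script names are hypothetical placeholders, and the property name mapred.textoutputformat.ignoreseparator is taken from the text above as-is; it may not be available in every Hadoop distribution, so check your own version before relying on it.

```python
def build_streaming_command(streaming_jar, input_path, output_path,
                            mapper, reducer, jobconf):
    """Return the argument list for a Hadoop streaming job.

    jobconf is a dict of property name -> value; each entry becomes
    one "-jobconf name=value" pair on the command line.
    """
    cmd = [
        "hadoop", "jar", streaming_jar,
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ]
    for name, value in jobconf.items():
        cmd += ["-jobconf", "{}={}".format(name, value)]
    return cmd


cmd = build_streaming_command(
    streaming_jar="hadoop-streaming.jar",   # hypothetical jar location
    input_path="/user/demo/input",          # hypothetical HDFS paths
    output_path="/user/demo/output",
    mapper="mapper.py",                     # hypothetical scripts
    reducer="reducer.py",
    jobconf={"mapred.textoutputformat.ignoreseparator": "true"},
)
print(" ".join(cmd))
```

    The resulting list can be passed to subprocess.run, or the printed line can be pasted into a shell.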
 
    One thing to note: the automatic Tab is added in both the map stage and the reduce stage, but -jobconf mapred.textoutputformat.ignoreseparator=true only removes the Tab added in the reduce stage. If a Tab was added in the map stage, you have to strip it yourself in the reduce program. For map-only jobs, you can get rid of the Tab by adding a reduce round and setting the parameter for that round.
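    Stripping the map-stage Tab by hand in the reducer could look something like the sketch below. It assumes single-field records and the default Tab separator, and the function names are made up for illustration. Only one trailing separator is removed, so lines that carry a real value are left alone.

```python
import io
import sys


def strip_trailing_separator(line, sep="\t"):
    """Drop the newline and at most one trailing separator from a line."""
    line = line.rstrip("\n")
    if line.endswith(sep):
        line = line[:-len(sep)]
    return line


def strip_stream(lines, out, sep="\t"):
    """Copy lines to out, removing the separator appended by the map stage."""
    for raw in lines:
        out.write(strip_trailing_separator(raw, sep) + "\n")


# Small in-memory demonstration of what the reducer would receive and emit.
demo_input = io.StringIO("apple\t\nbanana\t\ncherry\t7\n")
demo_output = io.StringIO()
strip_stream(demo_input, demo_output)
print(repr(demo_output.getvalue()))

# In a real reducer script, the call would be:
#     strip_stream(sys.stdin, sys.stdout)
```

    For a map-only job, the same script can serve as the extra reduce round mentioned above, since it passes every record through unchanged apart from the trailing Tab.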


Origin http://43.154.161.224:23101/article/api/json?id=326116795&siteId=291194637