Spark: multi-directory, multi-file output by key

1. Introduction

Sometimes you need to split output by a field in the data: all records with the same key should be written to a single file, inside a folder named after that key.

2. Implementation

2.1 How to implement

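Use saveAsHadoopFile to write the data. Two things are needed:

1. Ensure that all records with the same key are in the same partition. The output file name below depends only on the key, so two partitions holding the same key would both try to write the same file.

2. Subclass MultipleTextOutputFormat and override generateFileNameForKeyValue to build the output path from the key.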
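To make step 1 concrete, here is a minimal plain-Java sketch of the idea behind hash partitioning: the partition index is the key's hashCode modulo the number of partitions, adjusted to be non-negative, so equal keys always map to the same index. The class and method names (HashPartitionSketch, partitionFor, assign) are hypothetical and only illustrate the idea; this is not Spark's own code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; it only mimics the idea of hash partitioning.
class HashPartitionSketch {

    // Partition index = hashCode modulo partition count, shifted to be non-negative.
    static int partitionFor(Object key, int numPartitions) {
        if (key == null) {
            return 0;
        }
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    // Groups the given keys by the partition index they are assigned to.
    static Map<Integer, List<String>> assign(List<String> keys, int numPartitions) {
        Map<Integer, List<String>> result = new TreeMap<>();
        for (String key : keys) {
            int partition = partitionFor(key, numPartitions);
            result.computeIfAbsent(partition, p -> new ArrayList<>()).add(key);
        }
        return result;
    }

    public static void main(String[] args) {
        // Every occurrence of the same key lands in the same partition.
        System.out.println(assign(Arrays.asList("a", "b", "a", "c", "b"), 2));
        // prints {0=[b, b], 1=[a, a, c]}
    }
}
```

Different keys can share a partition ("a" and "c" above). That is fine here: the requirement is only that a single key is never spread across partitions.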

2.2 Code

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
import org.apache.spark.HashPartitioner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

/**
 * Created on 16/6/3 13:22
 *
 * @author Daniel
 */
public class SparkMultipleTextOutput {

    public static class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat<String, String> {

        @Override
        public String generateFileNameForKeyValue(String key, String value,
                                                     String name) {
            // Output layout: <output dir>/key/key.csv
            return key + "/" + key + ".csv";
        }

    }

    public static void main(String[] args) {

        SparkConf conf = new SparkConf().setAppName("Test").setMaster("local[2]");
        conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");

        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load the input file
        JavaRDD<String> rdd1 = sc.textFile("/Users/Daniel/test/test_file.csv");

        // Convert each line into a (key, value) pair
        JavaPairRDD<String,String> rdd2 = rdd1.mapToPair(new PairFunction<String, String, String>() {
            public Tuple2<String, String> call(String s) throws Exception {

                String[] data = s.split(",");
                String key = data[0];
                String value = data[1];
                return new Tuple2<String, String>(key, value);
            }
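            // This is the key point: data with the same key must end up in the same partition.
            // There are generally two ways to achieve that:
            // 1. .repartition(1); fine for testing, but use your own judgment in production.
            // 2. .partitionBy(); guarantees that records with the same key go to one partition.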
        }).partitionBy(new HashPartitioner(2));

        // Write the JavaPairRDD out, one file per key.
        rdd2.saveAsHadoopFile("/Users/Daniel/test2", String.class, String.class, RDDMultipleTextOutputFormat.class);
    }

}

3. Test

The test data is as follows:

The test results are as follows:

The downside is that each output line contains both the key and the value. In many cases only the value is needed. I plan to work on that later and write another post about it.
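One possible way to write only the value, offered as a sketch rather than the author's eventual solution, is to also override generateActualKey and return null. The text record writer skips a null key together with the separator, while generateFileNameForKeyValue still receives the original key for building the path. The class name ValueOnlyTextOutputFormat is hypothetical.

```java
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Hypothetical variant of RDDMultipleTextOutputFormat that writes only the value.
class ValueOnlyTextOutputFormat extends MultipleTextOutputFormat<String, String> {

    // Same layout as before: <output dir>/key/key.csv
    @Override
    protected String generateFileNameForKeyValue(String key, String value, String name) {
        return key + "/" + key + ".csv";
    }

    // A null key makes the text writer omit the key and the separator,
    // so each output line contains just the value.
    @Override
    protected String generateActualKey(String key, String value) {
        return null;
    }
}
```

It would be passed to saveAsHadoopFile in place of RDDMultipleTextOutputFormat.class. The same-partition requirement still applies, since the file name is unchanged.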
