Spark Learning (4): Key-Value Pair RDD Data Partitioning, Accumulators, Broadcast Variables, and SparkCore in Practice (Top 10 Popular Categories)



1. Key-value pair RDD data partitioning

Spark currently supports Hash partitioning, Range partitioning, and user-defined partitioning; Hash partitioning is the default. The partitioner determines the number of partitions in the RDD, which partition each record goes to after a shuffle, and the number of reduce tasks.

1. Notes:
(1) Only Key-Value type RDDs have a partitioner; for non-Key-Value RDDs the partitioner is None.
(2) The partition ID range of each RDD is 0 to (numPartitions - 1), and it determines which partition each record belongs to.
2. Obtaining an RDD's partitioner
(1) Create the package com.zhm.spark.operator.partitioner
(2) Code implementation

package com.zhm.spark.operator.partitioner;

import org.apache.spark.Partitioner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function2;
import scala.Tuple2;

import java.util.Arrays;

/**
 * @ClassName Test01_partitioner
 * @Description TODO
 * @Author Zouhuiming
 * @Date 2023/6/29 15:36
 * @Version 1.0
 */
public class Test01_partitioner {

    /**
     * A key-value RDD that has just been created by a transformation has an empty partitioner
     */
    public static void main(String[] args) {

        // 1. Create the configuration object
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("Test01");

        // 2. Create the SparkContext
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        // 3. Create the RDD dataset
        JavaPairRDD<String, Integer> javaRDD = sparkContext.parallelizePairs(Arrays.asList(new Tuple2<>("a", 1),
                new Tuple2<>("b", 2), new Tuple2<>("a", 3)), 2);

        // 4. Get the partitioner of javaRDD
        Optional<Partitioner> partitioner = javaRDD.partitioner();

        // 5. Print the partitioner
        System.out.println("Partitioner of javaRDD: " + partitioner);

        // 6. Aggregate javaRDD by key
        JavaPairRDD<String, Integer> reduceByKey = javaRDD.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });

        // 7. Get the partitioner of reduceByKey
        Optional<Partitioner> partitioner1 = reduceByKey.partitioner();

        System.out.println("Partitioner of reduceByKey: " + partitioner1);

        // x. Stop the SparkContext
        sparkContext.stop();
    }
}


Running result: (screenshot omitted)

1.1 Hash partition

How HashPartitioner works: for a given key, compute its hashCode and take the remainder of dividing by the number of partitions. If the remainder is less than 0, add the number of partitions to it (otherwise add 0); the resulting value is the ID of the partition the key belongs to. A minimal sketch of this computation follows.
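
The sketch below simply calls HashPartitioner.getPartition directly to see which partition a few sample keys land in; the class name HashPartitionDemo and the sample keys are made up for illustration and are not part of the original post.

import org.apache.spark.HashPartitioner;

public class HashPartitionDemo {
    public static void main(String[] args) {
        // 3 partitions, matching the skew example below
        HashPartitioner partitioner = new HashPartitioner(3);

        for (String key : new String[]{"A", "B", "C", "D", "E", "F", "G"}) {
            // getPartition applies key.hashCode() % numPartitions and
            // corrects a negative remainder by adding numPartitions
            System.out.println(key + " -> partition " + partitioner.getPartition(key));
        }
    }
}
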

Example: partition ID = hashCode of the key's initial letter % number of partitions. Assume three partitions (0, 1, 2) and the following keys and data volumes:

A: 1 million records
B: 10,000 records
C: 10,000 records
D: 1 million records
E: 10,000 records
F: 10,000 records
G: 1 million records

A possible partitioning result:

Partition 0
A: 1 million records
D: 1 million records
G: 1 million records

Partition 1
B: 10,000 records
E: 10,000 records

Partition 2
C: 10,000 records
F: 10,000 records

This illustrates the drawback of HashPartitioner: the amount of data in each partition may be uneven, and in extreme cases some partitions can end up holding all of the RDD's data.

Code:

package com.zhm.spark.operator.partitioner;

import org.apache.spark.HashPartitioner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

/**
 * @ClassName Test02_hashPartitioner
 * @Description TODO
 * @Author Zouhuiming
 * @Date 2023/6/29 15:47
 * @Version 1.0
 */
public class Test02_hashPartitioner {

    public static void main(String[] args) {

        // 1. Create the configuration object
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("Test01");

        // 2. Create the SparkContext
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        // 3. Create the RDD dataset
        JavaPairRDD<String, String> pairRDD = sparkContext.parallelizePairs(Arrays.asList(new Tuple2<>("1", "1"),
                new Tuple2<>("2", "1"), new Tuple2<>("3", "1"), new Tuple2<>("4", "1"),
                new Tuple2<>("5", "1"), new Tuple2<>("6", "1"), new Tuple2<>("7", "1"),
                new Tuple2<>("8", "1")), 2);

        // 4. Inspect the original partitioning
        pairRDD.saveAsTextFile("output/partition_hash01");

        // 5. Repartition with a HashPartitioner
        JavaPairRDD<String, String> partitionBy = pairRDD.partitionBy(new HashPartitioner(4));

        // 6. Inspect the new partitioning
        partitionBy.saveAsTextFile("output/partition_hash02");

        // x. Stop the SparkContext
        sparkContext.stop();
    }
}


1.2 Range partition

RangePartitioner maps keys within a certain range to a given partition. It tries to keep the amount of data in each partition roughly even, and the partitions themselves are ordered: every element in one partition is smaller (or larger) than every element in the next partition. The order of elements within a partition, however, is not guaranteed. Simply put, it maps a certain range of keys to a certain partition.

The implementation works as follows:
Step 1: Use reservoir sampling to draw sample data from the whole RDD, sort the samples, compute the maximum key for each partition, and store them in an Array[KEY] variable named rangeBounds.
Step 2: Determine which range in rangeBounds a key falls into; that range's index is the partition ID of the key in the resulting RDD. This partitioner requires the RDD's key type to be sortable.

(1) Suppose there are 1 million records to be split into 4 partitions.
(2) Sample, say, 100 numbers (1, 2, 3, ..., 100) from the 1 million records.
(3) Sort the 100 numbers and split them evenly into four segments.
(4) Go through the 1 million records, compare each key with the boundaries of the 4 partitions, and put it into the appropriate partition (see the sketch below).
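
There is no RangePartitioner example in the original post, but sortByKey builds one internally, so a quick way to see it is to check the partitioner of a sorted pair RDD. The sketch below is an illustration only; the class name Test03_rangePartitioner and the sample data are assumptions.

package com.zhm.spark.operator.partitioner;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class Test03_rangePartitioner {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("Test03");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        JavaPairRDD<Integer, String> pairRDD = sparkContext.parallelizePairs(Arrays.asList(
                new Tuple2<>(5, "e"), new Tuple2<>(1, "a"), new Tuple2<>(3, "c"),
                new Tuple2<>(2, "b"), new Tuple2<>(4, "d")), 2);

        // sortByKey samples the keys and shuffles with a RangePartitioner (4 partitions here)
        JavaPairRDD<Integer, String> sorted = pairRDD.sortByKey(true, 4);

        // Expected to show an Optional containing an org.apache.spark.RangePartitioner
        System.out.println("Partitioner after sortByKey: " + sorted.partitioner());

        sparkContext.stop();
    }
}
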

2. Accumulator

1. Accumulator: a distributed, shared, write-only variable.
2. Why it exists: Executors cannot read data from one another.
3. How it works: an accumulator aggregates information from the Executor side back to the Driver side. For a variable defined on the Driver, each task on the Executor side gets a fresh copy of it; each task updates its copy, and the updates are sent back to the Driver, where they are merged.
4. Using an accumulator
(1) Define the accumulator

LongAccumulator longAccumulator = JavaSparkContext.toSparkContext(sparkContext).longAccumulator();

(2) Add data to the accumulator (the accumulator.add method)
(3) Read the accumulator's value (accumulator.value)
5. Create package name: com.zhm.spark.operator.accumulator
6. Code implementation

package com.zhm.spark.operator.accumulator;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.util.LongAccumulator;
import scala.Tuple2;

import java.util.Arrays;

/**
 * @ClassName Test01_ACC
 * @Description TODO
 * @Author Zouhuiming
 * @Date 2023/6/30 9:09
 * @Version 1.0
 */
public class Test01_ACC {

    /**
     * An accumulator is a distributed, shared, write-only variable.
     * Accumulators should be used inside action operators: the number of times a transformation
     * operator runs depends on the number of jobs, and a Spark application may have multiple
     * action operators, so an accumulator inside a transformation may be updated more than once,
     * producing a wrong result.
     */
    public static void main(String[] args) {

        // 1. Create the configuration object
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("Test01");

        // 2. Create the SparkContext
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        // 3. Create the RDD dataset
        JavaPairRDD<String, Integer> javaPairRDD = sparkContext.parallelizePairs(Arrays.asList(new Tuple2<>("a", 1), new Tuple2<>("a", 2), new Tuple2<>("a", 3), new Tuple2<>("a", 4)));

        System.out.println("-------------- Counting with reduceByKey, which shuffles (less efficient) --------------------");

        JavaPairRDD<String, Integer> reduceByKey = javaPairRDD.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });
        // Collect and print
        reduceByKey.foreach(new VoidFunction<Tuple2<String, Integer>>() {
            @Override
            public void call(Tuple2<String, Integer> stringIntegerTuple2) throws Exception {
                System.out.println(stringIntegerTuple2);
            }
        });

        System.out.println("----------- Counting with an accumulator ---------------------");
        LongAccumulator longAccumulator = JavaSparkContext.toSparkContext(sparkContext).longAccumulator();

        // Use the accumulator inside foreach to sum the values of key "a"
        javaPairRDD.foreach(new VoidFunction<Tuple2<String, Integer>>() {
            @Override
            public void call(Tuple2<String, Integer> stringIntegerTuple2) throws Exception {
                // Do not read the accumulator's value on the Executor side: it is not accurate there,
                // which is why the accumulator is called a distributed shared write-only variable
                longAccumulator.add(stringIntegerTuple2._2);

                // Printing the accumulator here shows a value that is not the final one
                System.out.println("Accumulator value: " + longAccumulator.value());
            }
        });

        System.out.println("Sum of a's values computed with the accumulator: " + longAccumulator.value());

        // x. Stop the SparkContext
        sparkContext.stop();
    }
}


Note: tasks on the Executor side should not read the accumulator's value (for example, calling sum.value on the Executor side does not return the accumulator's final value), because an accumulator is a distributed, shared, write-only variable.
7. Use accumulators inside action operators
The number of times a transformation operator runs depends on the number of jobs. If a Spark application has multiple action operators, an accumulator inside a transformation may be updated more than once, producing a wrong result. So, if we want an accumulator that is absolutely reliable whether tasks fail or are recomputed, we have to put it in an action operator such as foreach(). A sketch of this pitfall follows.
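
The following minimal sketch, with made-up data and a made-up class name (Test02_ACC_Pitfall), updates the accumulator inside a map (a transformation). Because the RDD is not cached, each action recomputes the map, so the accumulator is updated twice and ends up double-counted.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.util.LongAccumulator;

import java.util.Arrays;

public class Test02_ACC_Pitfall {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("AccPitfall");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        JavaRDD<Integer> numbers = sparkContext.parallelize(Arrays.asList(1, 2, 3, 4));
        LongAccumulator sum = JavaSparkContext.toSparkContext(sparkContext).longAccumulator();

        // Updating the accumulator inside a transformation
        JavaRDD<Integer> mapped = numbers.map(v -> {
            sum.add(v);
            return v;
        });

        // Two actions: the map (and its accumulator updates) runs twice, since mapped is not cached
        mapped.collect();
        mapped.count();

        // Typically prints 20 (2 x 10) instead of the intended 10
        System.out.println("Accumulator value: " + sum.value());

        sparkContext.stop();
    }
}
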
8. Running result: (screenshot omitted)

3. Broadcast variables

1. Broadcast variables: distributed, shared, read-only variables.
2. Why they exist: broadcast variables efficiently distribute a large read-only object to all worker nodes for use by one or more Spark tasks.
3. For example, if your application needs to send a large read-only lookup table to all nodes, a broadcast variable is very convenient. Without one, when the same variable is used in multiple parallel tasks, Spark sends a separate copy of it for every task.
4. Steps for using a broadcast variable:
(1) Call SparkContext.broadcast(variable) to create a Broadcast object; this works for any serializable type.
(2) Access the object's value through the broadcast variable's value method.
(3) The broadcast variable is sent to each node only once and should be treated as read-only (modifying the value does not propagate to other nodes).
5. Principle description (diagram omitted)
6. Code implementation

package com.zhm.spark.operator.broadcast;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.broadcast.Broadcast;

import java.util.Arrays;
import java.util.List;

/**
 * @ClassName Test01_BroadCast
 * @Description TODO
 * @Author Zouhuiming
 * @Date 2023/6/30 9:24
 * @Version 1.0
 */
public class Test01_BroadCast {

    public static void main(String[] args) {

        // 1. Create the configuration object
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("Test01");

        // 2. Create the SparkContext
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        // 3. Create the RDD dataset
        JavaRDD<Integer> javaRDD = sparkContext.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10));

        // Lucky-number list
        List<Integer> LuckList = Arrays.asList(1, 3, 6);

        // 4. Using the List directly: a copy is created for every Task, wasting memory
        JavaRDD<Integer> filter = javaRDD.filter(new Function<Integer, Boolean>() {
            @Override
            public Boolean call(Integer v1) throws Exception {
                return LuckList.contains(v1);
            }
        });

        // 5. Collect and print
        System.out.println("---------- Using the List ----------");
        filter.collect().forEach(System.out::println);

        // 6. Using a broadcast variable: only one copy is sent to each Executor
        Broadcast<List<Integer>> broadcast = sparkContext.broadcast(LuckList);
        JavaRDD<Integer> filter1 = javaRDD.filter(new Function<Integer, Boolean>() {
            @Override
            public Boolean call(Integer v1) throws Exception {
                return broadcast.value().contains(v1);
            }
        });

        System.out.println("------------ Using the broadcast variable -------------");
        filter1.collect().forEach(System.out::println);

        // x. Stop the SparkContext
        sparkContext.stop();
    }
}


Running result: (screenshot omitted)

4. SparkCore in practice

4.1 Data preparation

1. Data format (screenshot omitted)
2. Detailed field descriptions (screenshot omitted; the fields are documented in the UserVisitAction class below)

4.2 Requirement: Top 10 popular categories

1. Requirement description:
A category is a classification of products. Large e-commerce sites have multiple levels of categories; our project has only one level. Different companies may define "popular" differently; here we rank categories by the amount (count) of click, order, and payment activity for each category.
Shoes ========> click count, order count, payment count
Clothes ======> click count, order count, payment count
Computers ====> click count, order count, payment count
For example, a composite score could be: click count * 20% + order count * 30% + payment count * 50%.
For better generality, this case instead sorts by click count; if the click counts are equal, it compares order counts, and if those are also equal, it compares payment counts. A small comparator sketch of this ordering follows.
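
As a quick illustration of that ordering (not part of the original code; the class name and sample counts are made up), here is a Comparator over simple {categoryId, clicks, orders, pays} arrays:

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class CategoryOrderingDemo {
    public static void main(String[] args) {
        // Each entry: {categoryId, clickCount, orderCount, payCount}
        List<long[]> categories = Arrays.asList(
                new long[]{1, 100, 20, 10},
                new long[]{2, 100, 30, 5},
                new long[]{3, 200, 10, 1});

        // Sort descending: by clicks, then orders, then payments
        Comparator<long[]> byPopularity = Comparator
                .comparingLong((long[] c) -> c[1])
                .thenComparingLong(c -> c[2])
                .thenComparingLong(c -> c[3])
                .reversed();

        categories.stream()
                .sorted(byPopularity)
                .forEach(c -> System.out.println("category " + c[0]
                        + " clicks=" + c[1] + " orders=" + c[2] + " pays=" + c[3]));
        // Expected order: category 3, category 2, category 1
    }
}
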

4.2.1 Requirement analysis (Scheme 1): objects

The requirement is implemented using data classes (objects).

4.2.2 Implementing the requirement

1. Add the Lombok plugin.
2. Add the Lombok dependency so that getter/setter boilerplate can be omitted:

 <dependency>
     <groupId>org.projectlombok</groupId>
     <artifactId>lombok</artifactId>
     <version>1.18.22</version>
 </dependency>

3. Create two data storage classes UserVisitAction and CategoryCountInfo under the bean package



package com.zhm.spark.Top10Demo.bean;

import lombok.Data;

/**
 * @ClassName UserVisitAction
 * @Description TODO
 * @Author Zouhuiming
 * @Date 2023/6/30 9:35
 * @Version 1.0
 */
@Data
public class UserVisitAction {

    private String date;                 // date of the user's click action
    private Long user_id;                // user ID
    private String session_id;           // session ID
    private Long page_id;                // ID of the page
    private String action_time;          // timestamp of the action
    private String search_keyword;       // keyword the user searched for
    private Long click_category_id;      // ID of the clicked product category
    private Long click_product_id;       // ID of the clicked product
    private String order_category_ids;   // IDs of all categories in one order
    private String order_product_ids;    // IDs of all products in one order
    private String pay_category_ids;     // IDs of all categories in one payment
    private String pay_product_ids;      // IDs of all products in one payment
    private Long city_id;                // city ID

    public UserVisitAction(String date, Long user_id, String session_id, Long page_id,
                           String action_time, String search_keyword,
                           Long click_category_id, Long click_product_id,
                           String order_category_ids, String order_product_ids,
                           String pay_category_ids, String pay_product_ids,
                           Long city_id) {
        this.date = date;
        this.user_id = user_id;
        this.session_id = session_id;
        this.page_id = page_id;
        this.action_time = action_time;
        this.search_keyword = search_keyword;
        this.click_category_id = click_category_id;
        this.click_product_id = click_product_id;
        this.order_category_ids = order_category_ids;
        this.order_product_ids = order_product_ids;
        this.pay_category_ids = pay_category_ids;
        this.pay_product_ids = pay_product_ids;
        this.city_id = city_id;
    }

}




package com.zhm.spark.Top10Demo.bean;

import lombok.Data;

/**
 * @ClassName CategoryCountInfo
 * @Description TODO
 * @Author Zouhuiming
 * @Date 2023/6/30 9:43
 * @Version 1.0
 */
@Data
public class CategoryCountInfo implements Comparable<CategoryCountInfo> {

    private Long categoryId;
    private Long clickCount;  // click count
    private Long orderCount;  // order count
    private Long payCount;    // payment count

    public CategoryCountInfo() {
    }

    public CategoryCountInfo(Long categoryId, Long clickCount, Long orderCount, Long payCount) {
        this.categoryId = categoryId;
        this.clickCount = clickCount;
        this.orderCount = orderCount;
        this.payCount = payCount;
    }

    @Override
    public int compareTo(CategoryCountInfo categoryCountInfo) {
        // Compare by click count first, then order count, then payment count.
        // Long.compare avoids comparing boxed Long references with == and avoids int overflow.
        if (Long.compare(this.getClickCount(), categoryCountInfo.getClickCount()) == 0) {
            if (Long.compare(this.getOrderCount(), categoryCountInfo.getOrderCount()) == 0) {
                return Long.compare(this.getPayCount(), categoryCountInfo.getPayCount());
            }
            return Long.compare(this.getOrderCount(), categoryCountInfo.getOrderCount());
        }
        return Long.compare(this.getClickCount(), categoryCountInfo.getClickCount());
    }
}


4. Create the Top10 class and write the core business code



import com.zhm.spark.Top10Demo.bean.CategoryCountInfo;
import com.zhm.spark.Top10Demo.bean.UserVisitAction;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.serializer.KryoSerializer;
import scala.Tuple2;

import java.util.ArrayList;
import java.util.Iterator;

/**
 * @ClassName Top10
 * @Description TODO
 * @Author Zouhuiming
 * @Date 2023/6/30 9:46
 * @Version 1.0
 */
public class Top10 {

    public static void main(String[] args) throws ClassNotFoundException {

        // 1. Create the configuration object and register the bean classes with Kryo
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("Test01")
                .set("spark.serializer", KryoSerializer.class.getName())
                .registerKryoClasses(new Class[]{
                        Class.forName(UserVisitAction.class.getName()),
                        Class.forName(CategoryCountInfo.class.getName())});

        // 2. Create the SparkContext
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        // Sample line: 2019-07-17_95_26070e87-1ad7-49a3-8fb3-cc741facaddf_37_2019-07-17 00:00:02_手机_-1_-1_null_null_null_null_3
        // 3. Create the RDD dataset (read the file line by line)
        JavaRDD<String> javaRDD = sparkContext.textFile("inputTop10");

        // 4. Convert each line into a UserVisitAction object (map)
        JavaRDD<UserVisitAction> map = javaRDD.map(new Function<String, UserVisitAction>() {
            @Override
            public UserVisitAction call(String v1) throws Exception {
                String[] split = v1.split("_");
                UserVisitAction userVisitAction = new UserVisitAction(split[0], Long.parseLong(split[1]),
                        split[2], Long.parseLong(split[3]), split[4], split[5], Long.parseLong(split[6]), Long.parseLong(split[7]), split[8],
                        split[9], split[10], split[11], Long.parseLong(split[12])
                );
                return userVisitAction;
            }
        });

        // 5. Explode each UserVisitAction into CategoryCountInfo objects (flatMap)
        JavaRDD<CategoryCountInfo> categoryCountInfoJavaRDD = map.flatMap(new FlatMapFunction<UserVisitAction, CategoryCountInfo>() {
            @Override
            public Iterator<CategoryCountInfo> call(UserVisitAction userVisitAction) throws Exception {
                ArrayList<CategoryCountInfo> list = new ArrayList<>();
                // A click action
                if (userVisitAction.getClick_category_id() != -1) {
                    list.add(new CategoryCountInfo(userVisitAction.getClick_category_id(), 1L, 0L, 0L));
                }
                // An order action
                if (!userVisitAction.getOrder_category_ids().equals("null")) {
                    String[] split = userVisitAction.getOrder_category_ids().split(",");
                    for (String s : split) {
                        list.add(new CategoryCountInfo(Long.parseLong(s), 0L, 1L, 0L));
                    }
                }
                // A payment action
                if (!userVisitAction.getPay_category_ids().equals("null")) {
                    String[] split = userVisitAction.getPay_category_ids().split(",");
                    for (String s : split) {
                        list.add(new CategoryCountInfo(Long.parseLong(s), 0L, 0L, 1L));
                    }
                }
                return list.iterator();
            }
        });

        // 6. Convert categoryCountInfoJavaRDD into a key-value RDD keyed by category ID
        JavaPairRDD<Long, CategoryCountInfo> longCategoryCountInfoJavaPairRDD = categoryCountInfoJavaRDD.mapToPair(new PairFunction<CategoryCountInfo, Long, CategoryCountInfo>() {
            @Override
            public Tuple2<Long, CategoryCountInfo> call(CategoryCountInfo categoryCountInfo) throws Exception {
                return new Tuple2<>(categoryCountInfo.getCategoryId(), categoryCountInfo);
            }
        });

        // 7. Aggregate the values of longCategoryCountInfoJavaPairRDD by key
        JavaPairRDD<Long, CategoryCountInfo> longCategoryCountInfoJavaPairRDD1 = longCategoryCountInfoJavaPairRDD.reduceByKey(new Function2<CategoryCountInfo, CategoryCountInfo, CategoryCountInfo>() {
            @Override
            public CategoryCountInfo call(CategoryCountInfo v1, CategoryCountInfo v2) throws Exception {
                // Sum the click counts
                v1.setClickCount(v1.getClickCount() + v2.getClickCount());

                // Sum the order counts
                v1.setOrderCount(v1.getOrderCount() + v2.getOrderCount());

                // Sum the payment counts
                v1.setPayCount(v1.getPayCount() + v2.getPayCount());
                return v1;
            }
        });

        // 8. Drop the key, keeping only the CategoryCountInfo values
        JavaRDD<CategoryCountInfo> map1 = longCategoryCountInfoJavaPairRDD1.map(new Function<Tuple2<Long, CategoryCountInfo>, CategoryCountInfo>() {
            @Override
            public CategoryCountInfo call(Tuple2<Long, CategoryCountInfo> v1) throws Exception {
                return v1._2;
            }
        });

        // 9. Sort in descending order (CategoryCountInfo implements Comparable)
        JavaRDD<CategoryCountInfo> sortBy = map1.sortBy(new Function<CategoryCountInfo, CategoryCountInfo>() {
            @Override
            public CategoryCountInfo call(CategoryCountInfo v1) throws Exception {
                return v1;
            }
        }, false, 2);

        // 10. Take the first 10 elements and print them
        sortBy.take(10).forEach(System.out::println);

        // x. Stop the SparkContext
        sparkContext.stop();
    }
}


Running result: (screenshot omitted)


Source: blog.csdn.net/qq_44804713/article/details/131658154