How Spark's distinct operator deduplicates

How the distinct operator works:

Here is the relevant Spark source code:

  /**
   * Return a new RDD containing the distinct elements in this RDD.
   */
  def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
  }
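
Reading the implementation from left to right: map wraps every element x into a key-value pair (x, null); reduceByKey then merges all pairs that share the same key into a single pair (the reduce function (x, y) => x just keeps the first null value, so the values never matter); the final map drops the null and returns only the keys, which are now unique. A minimal step-by-step sketch of these stages (the object and variable names are illustrative, not from the original post):

import org.apache.spark.sql.SparkSession

object DistinctStages {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("DistinctStages").getOrCreate()
    val rdd = spark.sparkContext.makeRDD(Array(3, 3, 4, 5, 5))

    val pairs   = rdd.map(x => (x, null))        // (3,null),(3,null),(4,null),(5,null),(5,null)
    val reduced = pairs.reduceByKey((x, y) => x) // one pair per key, e.g. (4,null),(3,null),(5,null)
    val keys    = reduced.map(_._1)              // 4, 3, 5 -- order is not guaranteed after the shuffle

    println(keys.collect().mkString(","))
    spark.close()
  }
}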

Example code:

package com.wedoctor.utils.test

import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object Test {
  Logger.getLogger("org").setLevel(Level.ERROR)
  def main(args: Array[String]): Unit = {
    //needed when running in a local environment
    System.setProperty("HADOOP_USER_NAME", "root")
    val session: SparkSession = SparkSession.builder()
       .master("local[*]")
       .appName(this.getClass.getSimpleName)
       .getOrCreate()

    val value: RDD[Int] = session.sparkContext.makeRDD(Array(3,3,4,5,5))
    value.distinct().foreach(println)
    //equivalent to the distinct() call above
    value.map(x=>(x,null)).reduceByKey((x,y) => x).map(_._1).foreach(println)
    session.close()
  }
}
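
A usage note: because the deduplication goes through reduceByKey, distinct always triggers a shuffle. The numPartitions parameter in the source shown above decides how many partitions the result has; in the Spark versions I have checked, the no-argument distinct() used in the example simply calls distinct(partitions.length), reusing the RDD's current partition count. A minimal sketch of the overload with an explicit partition count (built on the same data as the example; the object name is illustrative):

import org.apache.spark.sql.SparkSession

object DistinctPartitions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("DistinctPartitions").getOrCreate()

    //start with 4 partitions, then ask distinct for 2 output partitions
    val value = spark.sparkContext.makeRDD(Array(3, 3, 4, 5, 5), 4)
    val deduped = value.distinct(2)

    println(deduped.getNumPartitions)        // 2
    println(deduped.collect().mkString(",")) // e.g. 4,3,5
    spark.close()
  }
}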
Reprinted from blog.csdn.net/zuochang_liu/article/details/105387704