Big data: RDD action operators and partition operators: foreach, saveAsTextFile, mapPartitions, foreachPartition, partitionBy, repartition

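In 2022, landing a job takes a mix of education, ability, and luck. In a hiring winter the big companies stop recruiting, and many students who trained in algorithms may have to look for development or test-development jobs instead.
For test development you have to learn databases: SQL and Oracle. SQL above all, though many financial companies and security agencies insist on Oracle databases.
Oracle is more secure and much more powerful than SQL, so you need to learn it. Most importantly, if you plan to take the civil service exam for the cyber police, don't bother registering if you can't use it; you would be wasting your time!
Meanwhile, since the target is the cyber police data analysis post, the exam will certainly cover data mining fundamentals, so starting today let's go through data mining properly. The most important part by far is big data: the aptitude test and the interview are minor problems, while the written test on big data technology is the hardest and most important.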


Big data: RDD action operator

foreach
foreach returns nothing, so do not assign its result to a variable.
Before this, you had to call collect first to bring the data back to the driver and then print it.
With foreach, the function runs and prints directly inside each partition on the executors.
saveAsTextFile
The data is written out directly by the executors.
When there are several partitions, each partition writes its own file.
It is not like collect: nothing passes through the driver. The executors write directly, so the performance is good.
This is one of the advantages of Spark's performance.
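The "one file per partition" behaviour can be pictured with a small plain-Python model. This is only a sketch and not Spark itself: the partitions list, the save_partitions_as_text helper, and the output directory are made up for illustration, and the loop here runs sequentially whereas Spark's executors write in parallel.

```python
import os
import tempfile


def save_partitions_as_text(partitions, out_dir):
    """Plain-Python model: write one part file per partition."""
    os.makedirs(out_dir, exist_ok=True)
    for index, partition in enumerate(partitions):
        # Spark names its output files part-00000, part-00001, ...
        path = os.path.join(out_dir, f"part-{index:05d}")
        with open(path, "w", encoding="utf-8") as f:
            for element in partition:
                f.write(f"{element}\n")
    return sorted(os.listdir(out_dir))


# Three hypothetical partitions of one dataset.
partitions = [[1, 2], [3, 4], [5, 6]]
out_dir = os.path.join(tempfile.mkdtemp(), "out")
written_files = save_partitions_as_text(partitions, out_dir)
print(written_files)

# With a real RDD the call would be: rdd.saveAsTextFile(output_path)
```

No step in the model gathers all the data in one place before writing, which is the point of the comparison with collect.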
The other actions must send their results through the driver.

mapPartitions
The result is the same as with map, but the mapPartitions process is different: map is called once per element, whereas mapPartitions receives a whole partition in one transmission.

The partition arrives as an iterator object.

Inside the function you loop over it with for and process each element, use append to collect the results in a list, and return the entire result at the end.

Because the data is handed over once per partition instead of once per element, the transfer overhead is greatly reduced and the speed is fast. This is another of Spark's strengths.
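The iterate, append, return pattern described above might look like the following. This is a plain-Python sketch under assumptions: process_partition and the sample partitions are hypothetical, and the partitions are modeled as ordinary lists so the block runs without Spark.

```python
def process_partition(iterator):
    """Receive one whole partition as an iterator, return all results at once."""
    results = []
    for element in iterator:          # iterate over the partition
        results.append(element * 10)  # append collects each result
    return results                    # hand back the entire result


# Hypothetical data split into three partitions, modeled as plain lists.
partitions = [[1, 2], [3, 4], [5, 6]]

# One call per partition (3 calls), not one call per element (6 calls).
output = [value for part in partitions for value in process_partition(iter(part))]
print(output)

# With a real RDD the call would look like: rdd.mapPartitions(process_partition)
```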
Does that make sense?

foreachPartition
Spark's foreachPartition is similar.
foreachPartition has no return value: process or print the data directly inside the function, and call it directly without assigning the result.
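A function meant for foreachPartition could be sketched like this. The name print_partition and the sample partitions are assumptions for illustration, and the partitions are again plain lists rather than a real RDD.

```python
def print_partition(iterator):
    """Consume one partition; nothing is returned."""
    batch = list(iterator)
    print("partition:", batch)


# Hypothetical partitions, modeled as plain lists.
partitions = [[1, 2], [3, 4], [5, 6]]

# Each call handles one whole partition and gives back None.
return_values = [print_partition(iter(part)) for part in partitions]

# With a real RDD the call would be: rdd.foreachPartition(print_partition)
```

Since every call returns None, there is nothing useful to assign, which is why the call is written on its own.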

partitionBy
partitionBy takes the number of partitions and a partition function that returns an int, the partition number.
The default is hash partitioning. If you write a custom function, you can route the keys however you want.

First decide how many partitions to divide the data into.

The function's whole job is to return a partition number for each key. Just don't return a number outside the range of partitions you declared (0 up to the partition count minus 1).
That gives you fully custom partitioning.
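A custom partition function could be sketched as follows. The city keys, NUM_PARTITIONS, and city_partitioner are hypothetical, and the bucket loop only imitates in plain Python how keys end up in partitions.

```python
NUM_PARTITIONS = 3


def city_partitioner(key):
    """Return the partition number (an int) for a key."""
    if key == "beijing":
        return 0
    if key == "shanghai":
        return 1
    return 2  # everything else; stays within 0..NUM_PARTITIONS-1


# Hypothetical key-value pairs.
pairs = [("beijing", 1), ("shanghai", 2), ("chengdu", 3), ("beijing", 4)]

# Plain-Python imitation of routing each pair to its partition.
buckets = [[] for _ in range(NUM_PARTITIONS)]
for key, value in pairs:
    buckets[city_partitioner(key)].append((key, value))
print(buckets)

# With a real pair RDD: rdd.partitionBy(NUM_PARTITIONS, city_partitioner)
```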

repartition
It is best not to change the number of partitions. Doing so affects the parallel in-memory computation and can trigger a shuffle, which seriously hurts speed.

It is not recommended to use this function!

coalesce
coalesce can also change the number of partitions. By default it does not shuffle, so in that mode it can only reduce the partition count.

Interviewers often ask about the difference between groupByKey and reduceByKey.

The difference is very big.
One only groups; the other groups and aggregates.
The performance of groupByKey is weak, while the performance of reduceByKey is very strong.

groupByKey shuffles first, sending every record across the network, and only then groups, so it is slow.

reduceByKey carries its own aggregation logic: it first aggregates inside each partition, and then aggregates across partitions.

That makes it a good interview question: be ready to explain the difference.

The two differ greatly in both function and performance.

In any case, the goal is to find a way to reduce the I/O overhead of network transmission.
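The difference in shuffle volume can be illustrated with a plain-Python model. This is a sketch under assumptions, not Spark's implementation: the two sample partitions and the helper names group_then_sum and reduce_by_key are invented, and "shuffled" simply counts the records that would have to leave their partition.

```python
from collections import defaultdict

# Two hypothetical partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("a", 1), ("b", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]


def group_then_sum(partitions):
    """groupByKey style: shuffle every record, then aggregate."""
    shuffled = [pair for part in partitions for pair in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


def reduce_by_key(partitions):
    """reduceByKey style: aggregate inside each partition, then shuffle."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value          # aggregation within the partition
        shuffled.extend(local.items())   # only one record per key leaves
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value             # aggregation across partitions
    return dict(totals), len(shuffled)


grouped_totals, grouped_shuffled = group_then_sum(partitions)
reduced_totals, reduced_shuffled = reduce_by_key(partitions)
print(grouped_totals, grouped_shuffled)
print(reduced_totals, reduced_shuffled)
```

Both paths reach the same totals, but in this toy data the pre-aggregated path moves 4 records instead of 6. The gap widens as keys repeat more often within a partition.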

Summary of RDDs

foreach and saveAsTextFile: both are executed directly by the executors, without passing through the driver.


Summary

Tip: key takeaways:

1) Learn Oracle well. Even when the economy is cold, landing a test-development offer should not be a problem! It is also the only route if you want to take the civil service exam for the cyber police.
2) When chasing AC in a written test you may ignore space complexity, but in an interview you must consider both the optimal time complexity and the optimal space complexity.

Origin blog.csdn.net/weixin_46838716/article/details/131033168