Big Data: Spark Shared Variables (Broadcast Variables and Accumulators)

Big Data: Shared Variables

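Job hunting in 2022 takes a strong combination of education, ability, and luck. In a downturn the big companies stop hiring, and many students who trained in algorithms may have to look for development or test-development jobs instead.
For test development you need to learn databases: SQL and Oracle. SQL is a must, and many financial firms and security agencies require Oracle.
Oracle is far more secure and powerful than SQL, so you need to learn it. Most importantly, if you plan to take the civil service exam for the cyber police and you don't know it, don't bother signing up; you would be wasting your time!
Since the target is the cyber police data analysis and application post, the exam is bound to cover data mining fundamentals, so starting today we will go through data mining properly. The most important part by far is big data: the aptitude test and the interview are minor issues, while the written test on big data technology is the hardest and matters most.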


Big Data: Broadcast Variables

The task: replace the id in each RDD record with the matching name from a local list.

The RDD is a distributed Spark object; the list that holds the names is a local Python list on the driver.

The driver coordinates the job and the executors run it. A function looks up the name and returns it together with the rest of the record. That function is passed to the map operator, which belongs to the distributed RDD and therefore runs on the executors.


The map task for each partition needs the local stu_info_list, so the list has to be sent to it over the network.
An executor is a process, and the threads inside a process share its memory.
So if one executor runs two partitions, does it need two copies of the list?
No. One copy per executor is enough.

Broadcast variables solve this problem.

Once stu_info_list is marked as a broadcast variable, Spark handles it like this.
When partition 1 asks for the data, Spark sends it.
When partition 2, on the same executor, asks for it, Spark sends nothing and tells it to share the copy that partition 1 already received.

Likewise, partition 3 receives a copy,
and partition 4, on the same executor, shares that copy instead of getting its own.

The point is to send less: each executor is treated as a whole, the data is sent to it once, and its tasks share that copy.
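The difference in the number of copies can be sketched without Spark at all. The snippet below is only a plain-Python simulation: the student names, the executor labels, and the four-partition layout are assumptions made for illustration, and `pickle.dumps` stands in for sending the list over the network.

```python
import pickle

# Hypothetical data and layout, chosen only for illustration.
stu_info_list = [(1, "Alice"), (2, "Bob"), (3, "Carol")]
partition_to_executor = {
    1: "executor-A",
    2: "executor-A",
    3: "executor-B",
    4: "executor-B",
}


def ship_per_task(layout, obj):
    """No broadcast: every task (one per partition) receives its own copy."""
    return {partition: pickle.dumps(obj) for partition in layout}


def ship_per_executor(layout, obj):
    """Broadcast: each executor process receives one copy that its tasks share."""
    return {executor: pickle.dumps(obj) for executor in set(layout.values())}


per_task = ship_per_task(partition_to_executor, stu_info_list)
per_executor = ship_per_executor(partition_to_executor, stu_info_list)

print(len(per_task))      # 4 copies sent
print(len(per_executor))  # 2 copies sent
```

With more partitions per executor, or a larger list, the gap between the two approaches grows accordingly.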
The object you get back is of the broadcast variable type.

Usage takes two steps: mark the local object as a broadcast variable,
then read it through its value attribute inside the function that runs on the executors.

This way each executor holds a single copy, and memory is not wasted.
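A minimal PySpark sketch of those two steps might look like the following. The sample records, the helper name `replace_id_with_name`, and the local master setting are assumptions for illustration; only `sc.broadcast(...)` and the `.value` attribute are the Spark API being demonstrated. The Spark import sits inside `main()` so the lookup helper can be read and tried without a Spark installation.

```python
def replace_id_with_name(record, stu_info):
    """Swap the id in (id, subject, score) for the name found in stu_info."""
    stu_id, subject, score = record
    for info_id, name in stu_info:
        if info_id == stu_id:
            return (name, subject, score)
    return record  # unknown id: leave the record unchanged


def main():
    # Requires a local PySpark installation.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("broadcast-demo").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    # Hypothetical sample data: a local list and a distributed RDD.
    stu_info_list = [(1, "Alice"), (2, "Bob"), (3, "Carol")]
    score_rdd = sc.parallelize(
        [(1, "math", 90), (2, "english", 85), (3, "math", 70), (1, "english", 88)],
        2,
    )

    # Step 1: mark the local list as a broadcast variable.
    broadcast = sc.broadcast(stu_info_list)

    # Step 2: read it through .value inside the function run by the executors.
    result = score_rdd.map(
        lambda record: replace_id_with_name(record, broadcast.value)
    ).collect()
    print(result)

    sc.stop()
```

Call `main()` on a machine with PySpark installed to run the job. The lambda captures the broadcast handle rather than the list itself, which is what lets the tasks on one executor share a single copy.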


When to use it:
When a local collection has to be associated with an RDD, wrap the local collection in a broadcast variable. Each executor then receives it only once, which cuts network transfers and saves executor memory.

Then why not turn the local collection into an RDD as well?

Because we want to avoid a shuffle, which costs too much performance.
Shuffle is the most performance-intensive operation.
If the local collection is converted to an RDD, associating the two RDDs moves data between executors all over the cluster.

If it is wrapped as a broadcast variable instead, each executor only needs to receive it once.

Big Data: Accumulators

Each partition processes 5 elements, and the function run on the RDD increments a counter, count, for every element.

Why is the final total 0?

Code outside RDD operators is executed by the driver.
Code inside RDD operators is executed by the executors.

When an executor needs count, the driver sends it a copy. The executors increment their own copies,
and the count you print at the end is the driver's original object, which nothing has changed.
An ordinary variable written this way does not work in a distributed computation.
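The effect can be reproduced in plain Python, without Spark. This is only a simulation under stated assumptions: two partitions of five elements each, with `pickle` standing in for the copy the driver sends to each executor task, and `run_task` as a hypothetical stand-in for an executor.

```python
import pickle

# Assumed layout: two partitions of five elements each.
partitions = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]

driver_state = {"count": 0}


def run_task(serialized_state, partition):
    """Stand-in for an executor task: it only ever sees a deserialized copy."""
    local_state = pickle.loads(serialized_state)
    for _ in partition:
        local_state["count"] += 1
    return local_state["count"]


executor_counts = [run_task(pickle.dumps(driver_state), p) for p in partitions]

print(executor_counts)        # [5, 5]
print(driver_state["count"])  # 0
```

Each task counts its own 5 elements correctly, but the increments land on copies, so the driver's value never moves.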

So we use an accumulator, a variable that accumulates globally.

An accumulator is created with an initial value that fixes its type and is wrapped in an accumulator object. Updates made on the executors are then merged back into the value held by the driver.
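A minimal PySpark sketch of the accumulator version could look like this. The helper name `make_counting_func`, the 10-element, 2-partition RDD, and the local master setting are assumptions for illustration; `sc.accumulator(0)`, `add`, and `value` are the Spark calls being demonstrated. The Spark import sits inside `main()` so the helper can be inspected without a Spark installation.

```python
def make_counting_func(acc):
    """Return a map function that adds 1 to `acc` for every element it sees."""
    def count_and_pass(x):
        acc.add(1)
        return x
    return count_and_pass


def main():
    # Requires a local PySpark installation.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("accumulator-demo").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    rdd = sc.parallelize(range(1, 11), 2)  # assumed: 10 elements, 2 partitions
    acc = sc.accumulator(0)                # starts at 0 on the driver

    # The action triggers the map, and with it the accumulator updates.
    rdd.map(make_counting_func(acc)).collect()

    # The merged total can only be read on the driver.
    print(acc.value)

    sc.stop()
```

Call `main()` on a machine with PySpark installed. Unlike the ordinary counter, the executors' updates are sent back and merged, so the driver sees the total across all partitions.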


As we said before, an RDD is process data: it is not kept once it has been used.
When rdd2 is used again later, it has to be rebuilt, so the function that updates the accumulator runs a second time.
This follows directly from the process-data nature of RDDs.

That is why the accumulator ends up at 20: every element has been counted twice.

Be careful when using accumulators.
If an RDD that updates an accumulator is used more than once, cache it first so that the elements are not counted again.
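The double counting and the effect of caching can be illustrated with a toy model in plain Python. `LazyDataset` below is a hypothetical stand-in for an RDD, not Spark code, and the 10-element range is an assumption; the dictionary `counter` plays the role of the accumulator.

```python
class LazyDataset:
    """Toy stand-in for an RDD: it stores a recipe and recomputes on every use."""

    def __init__(self, compute):
        self._compute = compute
        self._cache_enabled = False
        self._cached = None

    def cache(self):
        self._cache_enabled = True
        return self

    def collect(self):
        if self._cache_enabled and self._cached is not None:
            return self._cached
        result = self._compute()
        if self._cache_enabled:
            self._cached = result
        return result


def run_twice(use_cache):
    counter = {"value": 0}  # plays the role of the accumulator

    def compute():
        out = []
        for x in range(1, 11):
            counter["value"] += 1  # side effect repeated on every recomputation
            out.append(x)
        return out

    rdd2 = LazyDataset(compute)
    if use_cache:
        rdd2.cache()
    rdd2.collect()  # first use
    rdd2.collect()  # second use
    return counter["value"]


print(run_twice(use_cache=False))  # 20
print(run_twice(use_cache=True))   # 10
```

In PySpark the corresponding step is calling `rdd2.cache()` before the first action that uses it. Caching reduces recomputation but is not a strict guarantee: Spark may still re-run a task, for example after a failure, so accumulator updates made inside transformations are best treated as approximate counters.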



Summary

Tip: key lessons:

1)
2) Learn Oracle well. Even in a weak job market, landing a test-development offer should not be a problem. It is also a must if you plan to take the civil service exam for the cyber police.
3) In a written test, where the goal is just to get the solution accepted (AC), space complexity can be ignored, but in an interview you must aim for both optimal time complexity and optimal space complexity.


Origin blog.csdn.net/weixin_46838716/article/details/131043682