Big data: the pyspark module, RDDs in Spark Core, RDD as the resilient distributed dataset abstraction, the five characteristics of RDDs, and a wordcount case study of RDDs

Big data: pyspark module

Job hunting in 2022 is a potent mix of credentials, ability, and luck. In a hiring winter, when big companies stop recruiting, many algorithm students may have to switch to development or test-development roles.
For test development you have to learn databases: SQL above all, and Oracle too, since many financial companies and security agencies are required to use Oracle.
Oracle is considered more secure and more powerful than a plain SQL setup, so it is worth learning. Most importantly, if you plan to take the civil-service exam for cyber police, don't even sign up without it; you would just be wasting your time!
Likewise, since the data-analysis posts of the cyber-police exam test data-mining fundamentals, starting today we will go through data mining properly. The most important subject by far is big data: the aptitude test and the interview are minor issues, and the hardest, most important part is the written test on big-data technology.



pyspark is a framework API: not a third-party library of standalone code, but a client for Spark. It is an interactive client and can also be used to write independent programs.

RDD in Spark Core

An RDD is an abstract data object whose purpose is to let the distributed computing framework schedule massive data uniformly, keeping it evenly distributed across Spark. The RDD is the core abstraction in Spark, and it is very, very important.

RDD stands for Resilient Distributed Dataset: immutable, stored in a distributed fashion, and computed in parallel.
Ordinary dictionaries, lists, and arrays are collections that live inside a single process, whereas an RDD is stored across processes and machines. RDDs are resilient: their data can live in memory or on disk, and the number of partitions can be increased or decreased dynamically.


Five characteristics of RDD

1. An RDD has partitions.
2. A compute function acts on every partition.
3. An RDD has dependencies on other RDDs.
4. Key-value RDDs can have a partitioner.
5. Partition data is read as close to the data's location as possible, moving as little data as possible.
sc is the SparkContext, the entry point to Spark Core;
glom() is the API that exposes partitions:
the data in an RDD stays grouped by partition.
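To see what "the data stays grouped by partition" means without a Spark cluster, here is a pure-Python sketch of the behavior of parallelize() and glom(). The helper functions are illustrative stand-ins, not Spark's implementation.

```python
# Illustrative model of RDD partitioning (plain Python, not Spark).
# parallelize() splits a dataset into roughly equal chunks;
# glom() then reveals those chunks as lists, one per partition.

def parallelize(data, num_partitions):
    """Split data into num_partitions roughly equal chunks."""
    size, rem = divmod(len(data), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < rem else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

def glom(partitions):
    """In Spark, rdd.glom() returns one list per partition; here the
    partitions are already lists, so this just returns them."""
    return partitions

rdd = parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9], 3)
print(glom(rdd))  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

In real PySpark the equivalent is `sc.parallelize(data, 3).glom().collect()`.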

Because the data is stored partition by partition, a function you call is naturally applied to each partition.

Logically, you write a single piece of code;
physically, it acts on every partition.
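"One piece of code, one run per partition" can be made concrete with a minimal pure-Python sketch (again a model of the idea, not Spark internals): the same function is applied to every partition independently.

```python
# Each partition is processed independently with the same function,
# mimicking how Spark ships one closure to every partition.

def map_per_partition(partitions, fn):
    """Apply fn to every element, one partition at a time."""
    return [[fn(x) for x in part] for part in partitions]

partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
squared = map_per_partition(partitions, lambda x: x * x)
print(squared)  # [[1, 4, 9], [16, 25, 36], [49, 64, 81]]
```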

The transformations a program applies form a dependency chain, much like an assembly line: every station works in parallel, but each step depends on the output of the previous one, and the finished vehicle is produced step by step.
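The assembly-line idea is what Spark calls lineage: each RDD records which parent it came from and which transformation produces it, and nothing runs until a result is demanded. This is a deliberately simplified pure-Python model of that dependency chain, not Spark's actual machinery.

```python
# Simplified lineage model: each node remembers its parent and a
# transformation; compute() walks the dependency chain on demand.

class LineageNode:
    def __init__(self, data=None, parent=None, transform=None):
        self.data = data            # only the root holds raw data
        self.parent = parent        # dependency on the previous node
        self.transform = transform  # function applied to parent's output

    def map(self, fn):
        return LineageNode(parent=self,
                           transform=lambda xs: [fn(x) for x in xs])

    def filter(self, pred):
        return LineageNode(parent=self,
                           transform=lambda xs: [x for x in xs if pred(x)])

    def compute(self):
        # Lazy: the chain only executes when a result is requested.
        if self.parent is None:
            return self.data
        return self.transform(self.parent.compute())

root = LineageNode(data=[1, 2, 3, 4, 5])
result = root.map(lambda x: x * 10).filter(lambda x: x > 20)
print(result.compute())  # [30, 40, 50]
```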

insert image description here
The key-value tuple
is a dictionary. As
insert image description here
insert image description here
mentioned before, data balance

RDD may not necessarily be of key-value type.
We can use key to partition, but non-kv type cannot be partitioned.
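Partitioning by key can be sketched in plain Python: route each (key, value) pair to a partition by hashing the key, which is the essence of what Spark's default hash partitioner does. This is an illustrative sketch, not Spark code.

```python
# Hash partitioning for key-value pairs: the partition index is
# hash(key) modulo the number of partitions, so equal keys always
# land in the same partition.

def hash_partition(pairs, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

pairs = [("a", 1), ("b", 1), ("a", 2), ("c", 1), ("b", 3)]
parts = hash_partition(pairs, 2)
# Every copy of a given key ends up in exactly one partition.
```

This also shows why a non key-value RDD cannot have a partitioner: without a key, there is nothing stable to hash on.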

Reading data locally is fast; going over the network is slow and troublesome. Data locality, together with parallel computing, is the core.

wordcount case study

Let's see how the count is computed. The input lines are read into three partitions. flatMap is applied to each partition, flattening the lines into individual words. map then pairs every word with an initial count of 1. reduceByKey aggregates: pairs with the same word are brought together and their counts are summed. Finally, collect() gathers the results back to the driver.
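The whole pipeline can be emulated stage by stage in a few lines of plain Python. This is a sketch of the logic only; in real PySpark you would chain `sc.textFile(...).flatMap(...).map(...).reduceByKey(...).collect()`.

```python
# Word count, emulating the Spark pipeline stage by stage.
lines = ["hello spark", "hello world", "spark is fast"]

# flatMap: split every line and flatten into one stream of words
words = [w for line in lines for w in line.split()]

# map: pair each word with an initial count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: bring equal words together and sum their counts
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(counts)  # {'hello': 2, 'spark': 2, 'world': 1, 'is': 1, 'fast': 1}
```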
The default partitioner groups by the hash of the key, and during computation data is read as locally as possible.
Together, these are the five characteristics of RDD!

RDD: Resilient Distributed Dataset (a data abstraction): partitioned, parallel.



Summary

Tip: important lessons:

1)
2) Learn Oracle well: even in an economic winter, landing a test-development offer will be no problem! It is also a prerequisite for the cyber-police civil-service exam.
3) When you only need an AC in a written test, space complexity can sometimes be ignored, but in an interview you must consider both the optimal time complexity and the optimal space complexity.

Origin blog.csdn.net/weixin_46838716/article/details/131023555