Big data: table operations, partition tables, bucket tables, modifying tables, array, map, struct


Finding a job in 2022 is a combination of credentials, ability, and luck. In a hiring winter, when the big companies stop recruiting, many algorithm students may have to pivot to development or test-development roles.
For test development you have to learn databases: SQL, and Oracle. SQL above all. Many financial firms, security agencies, and the like are required to use Oracle databases.
Oracle is considered more secure and more powerful than plain SQL offerings, so you need to learn it. Most importantly, if you plan to take the civil-service exam for network police, don't even sign up without it; you would be wasting your time.
Likewise, since the data-analysis track of the network-police exam necessarily tests data-mining fundamentals, starting today we will work through the data-mining material properly. By far the most important part is big data: the aptitude test and the interview are minor problems, while the hardest and most important part is the written test on big-data technology.


Big Data: Partition Tables

Physically, each partition is stored as a separate folder in HDFS.

The syntax is:

partitioned by (column_name column_type)
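As a minimal sketch (the table and column names here are made up for illustration), a partitioned table might be declared like this:

```sql
-- Hypothetical table: the partition column `month` is declared in
-- PARTITIONED BY, not in the regular column list.
CREATE TABLE orders (
    order_id INT,
    amount   DOUBLE
)
PARTITIONED BY (month STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```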

Data loaded this way goes into a specific partition (here, the May partition). The partition column then behaves like an ordinary field of the table.
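For example (hypothetical file path and table name), loading a file into a May partition could look like this:

```sql
-- Each LOAD targets one partition; the file lands under month=202205/.
LOAD DATA LOCAL INPATH '/tmp/orders_may.txt'
INTO TABLE orders PARTITION (month = '202205');

-- The partition column can be queried like any other column.
SELECT * FROM orders WHERE month = '202205';
```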


Each new partition creates another subfolder under the table's directory.


Multi-level partitioning

A table can be partitioned on several columns at once, which maps to nested directories (for example, three levels of folders). When loading data you specify a value for every partition level. Queries can then filter on the partition columns to narrow the scan; the filtering conditions look just like ordinary SQL WHERE clauses.
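A sketch of a two-level partition (table and column names are hypothetical):

```sql
-- Two partition columns produce nested directories:
-- .../year=2022/month=05/
CREATE TABLE logs (
    line STRING
)
PARTITIONED BY (year STRING, month STRING);

LOAD DATA LOCAL INPATH '/tmp/logs.txt'
INTO TABLE logs PARTITION (year = '2022', month = '05');

-- Filtering on partition columns prunes whole directories.
SELECT * FROM logs WHERE year = '2022' AND month = '05';
```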

Bucket tables

Bucketing splits a table's data across a fixed number of files. Its purpose is load balancing: each bucket ends up with a comparable share of the rows.

The number of reduce tasks equals the number of buckets, presumably so that each reducer can write exactly one bucket file.

The syntax uses the keywords:

clustered by (column_name) into k buckets

You choose the column to bucket on; each row's bucket is chosen by hashing that column's value, the same hashing idea you learn in algorithms class.
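A minimal sketch of a bucketed table (the table and column names are hypothetical, though `cid` echoes the column used later in this post):

```sql
-- Rows are assigned to one of 3 files by hash(cid) % 3.
CREATE TABLE course (
    cid    STRING,
    c_name STRING,
    t_id   STRING
)
CLUSTERED BY (cid) INTO 3 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
```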

Loading data

You cannot load a file directly into a bucketed table. Instead, create an ordinary staging table, load the raw data there, and then insert from it into the table created with clustered by, so the data gets clustered as it is written.
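A sketch of that two-step load (staging table name and file path are hypothetical):

```sql
-- Stage the raw file in an ordinary table first...
CREATE TABLE course_tmp (
    cid    STRING,
    c_name STRING,
    t_id   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH '/tmp/course.txt' INTO TABLE course_tmp;

-- ...then INSERT ... SELECT, which runs MapReduce and
-- clusters the rows into the bucket files.
INSERT OVERWRITE TABLE course SELECT * FROM course_tmp;
```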

In HDFS you can see that the number of bucket files matches the number specified, here 3. The buckets are divided on the cid column, and the principle is hash-table mapping: hash(cid) % 3 picks the bucket.
The data has to be split into three parts, and that cannot be done directly: each row's destination must be computed. Any such computation goes through MapReduce, and LOAD DATA does not trigger a MapReduce job, which is why it cannot populate a bucketed table. Note also that the buckets will not necessarily end up the same size.
The point of bucketing is that any given key is guaranteed to land in one known bucket, so a lookup never has to search the other buckets.

This pays off in joins: when two tables are bucketed on the join key, matching buckets can simply be merged, because the data is already naturally grouped.
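A sketch of a join that can exploit bucketing (table and column names are hypothetical; both tables are assumed to be bucketed on the join key `cid`):

```sql
-- When both tables are bucketed on cid, Hive can join
-- bucket 0 with bucket 0, bucket 1 with bucket 1, and so on.
SET hive.optimize.bucketmapjoin = true;

SELECT c.c_name, s.score
FROM course c
JOIN score  s ON c.cid = s.cid;
```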

Modifying tables

Renaming a table.
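For example (hypothetical table names):

```sql
-- Rename a table; the underlying folder is renamed as well
-- for a managed table.
ALTER TABLE orders RENAME TO orders_2022;
```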
Changing the table's properties, for example switching between an internal (managed) table and an external table.
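A sketch of that property change (table name is hypothetical):

```sql
-- Convert a managed (internal) table to an external one;
-- use 'FALSE' to convert back.
ALTER TABLE orders SET TBLPROPERTIES ('EXTERNAL' = 'TRUE');
```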
Adding a partition, renaming a partition, and dropping a partition; remember that a partition is just a folder-level classification of the data.
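The three partition operations might look like this (table and partition values are hypothetical):

```sql
-- Add a partition (creates the folder).
ALTER TABLE logs ADD PARTITION (year = '2022', month = '06');

-- Rename a partition (renames the folder).
ALTER TABLE logs PARTITION (year = '2022', month = '06')
    RENAME TO PARTITION (year = '2022', month = '07');

-- Drop a partition (removes the folder for a managed table).
ALTER TABLE logs DROP PARTITION (year = '2022', month = '07');
```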

If the table is not partitioned, none of this applies; skip the partition operations.

Adding columns.
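For example (hypothetical table and column name):

```sql
-- Append a new column to the end of the schema.
ALTER TABLE orders ADD COLUMNS (remark STRING COMMENT 'free-form note');
```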


Complex types: the array type

The array's elements are separated by commas in the data file.

You can count the elements of an array with size(). Whether it is Python, Java, C++, SQL, or Hive, these operations are all similar; the core ideas carry over.
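A sketch of an array-typed table (the table name, columns, and file layout are hypothetical):

```sql
-- Hypothetical file format: name \t city1,city2,city3
CREATE TABLE person (
    name   STRING,
    cities ARRAY<STRING>
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    COLLECTION ITEMS TERMINATED BY ',';

-- Index into the array and count its elements with size().
SELECT name, cities[0], size(cities) FROM person;
```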

map data type

The map entries (the collection items) are separated by #, and within each entry the key is separated from its value by :. The map type goes beyond what plain SQL offers.

It is just like a dictionary in Python: key-value pairs. Simple enough.
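A sketch of a map-typed table matching those delimiters (table name, columns, and file layout are hypothetical):

```sql
-- Hypothetical file format: name \t key1:value1#key2:value2
CREATE TABLE member (
    name  STRING,
    attrs MAP<STRING, STRING>
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    COLLECTION ITEMS TERMINATED BY '#'
    MAP KEYS TERMINATED BY ':';

-- Look up by key, like a Python dict; map_keys lists the keys.
SELECT name, attrs['age'], map_keys(attrs) FROM member;
```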


struct data type

It is a structure, like a struct in C: a fixed set of named fields.
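A sketch of a struct-typed table (table name, fields, and file layout are hypothetical):

```sql
-- Hypothetical file format: id \t name:age
CREATE TABLE employee (
    id   INT,
    info STRUCT<name: STRING, age: INT>
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    COLLECTION ITEMS TERMINATED BY ':';

-- Access fields with dot notation, like a C struct.
SELECT id, info.name, info.age FROM employee;
```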

In any case, Hive is a SQL framework built on MapReduce: you write SQL and it performs distributed computation. Reviewing this material will be a big help for the future network-police exams.


Summary

Tip: key takeaways:

1)
2) Learn Oracle well; even in an economic winter, getting offers from the written tests definitely won't be a problem! It is also a required step if you want to take the public network-police exam.
3) When going for AC in a written test, space complexity may not matter, but in an interview you must aim for both optimal time complexity and optimal space complexity.


Origin blog.csdn.net/weixin_46838716/article/details/131017333