Alibaba Cloud Big Data ACP Certification Study Notes

1. Big Data Foundation

2. Big data computing service MaxCompute

2.1 Basic knowledge

2.1.1 Purchase MaxCompute, create a project, and add sub-users

1. First purchase MaxCompute in your own region.
2. Create a project.
3. Add sub-users and save their AccessKey.
4. Grant the sub-users permissions on the project.

2.1.2 Create ODPS

1. Create ODPS.
2. Create a table: create table A (id bigint,name string);
3. View the table: desc A;

2.1.3 Installing the MaxCompute command-line client odpscmd and basic commands

1. Download the installation package from the official Alibaba Cloud website and decompress it.
2. After decompression, open the only file in the conf directory and fill in the relevant information for your project.
3. After configuring the file, open cmd in the bin directory and enter odpscmd.bat to start the client.
4. Enter quit; to exit odpscmd.
5. The -f parameter executes the commands in a file: odpscmd -f create.txt
6. The -e parameter executes a SQL statement: odpscmd -e "select * from test_table;"
7. Run use <project_name>; to switch to another of the user's projects, provided that the user has more than one project.

2.2 Data upload and download

2.2.1 Tunnel batch offline processing

2.2.1.1 Tunnel upload

1. Append upload: tunnel upload C:\Users\dz\Downloads\up.csv A;

drop table if exists A; -- drop table A if it exists
create table A(id int,name string); -- create table A with columns id and name
desc A; -- view the structure of table A
-- show help for the tunnel command
tunnel help;
-- append the contents of the local file up.csv to table A
tunnel upload C:\Users\dz\Downloads\up.csv A;
select * from A; -- query table A
truncate table A; -- clear all data from table A

2. Partition table upload
First create a partitioned table:

create table A(id int,name string) partitioned by (gender string); -- create a table partitioned by gender

Then upload a file into a partition:

-- upload the local file into this partition; -acp=true creates the partition if it does not exist
tunnel upload C:\Users\dz\Downloads\up_p\up_1.csv A/gender='male' -acp=true;
select * from A where gender='male'; -- query the rows in partition gender='male'

-- view the data of partitioned table A across all of its partitions
read A;

3. File directory upload

-- upload every file under C:\Users\dz\Downloads\DIR to table B
tunnel upload C:\Users\dz\Downloads\DIR B;

When some files in the folder contain records that do not match the table format, -dbr=true (discard bad records) loads only the correctly formatted records and discards the bad ones:

-- discard records with format errors instead of failing the upload
tunnel upload C:\Users\dz\Downloads\DIR B -dbr=true;

4. The scan parameter

When scan=true, the data is scanned first and imported only if the format is correct;
when scan=false, the data is imported directly without scanning;
when scan=only, the local data is only scanned and is not imported after scanning.
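
The three scan modes can be pictured with a small Python sketch. This is only an illustration of the behavior described above, not how Tunnel is implemented: the function names upload and row_is_valid and the way the schema is represented (a list of Python type converters) are hypothetical.

```python
def row_is_valid(row, schema):
    """Return True if the row has one value per column and every value converts."""
    if len(row) != len(schema):
        return False
    for value, caster in zip(row, schema):
        try:
            caster(value)
        except ValueError:
            return False
    return True


def upload(rows, schema, scan="true"):
    """Hypothetical stand-in for an upload; returns the rows that would be imported."""
    if scan not in ("true", "false", "only"):
        raise ValueError("scan must be 'true', 'false' or 'only'")
    if scan in ("true", "only"):
        # Scan step: check every row against the schema before importing anything.
        bad_rows = [row for row in rows if not row_is_valid(row, schema)]
        if bad_rows:
            raise ValueError(f"{len(bad_rows)} row(s) do not match the schema")
        if scan == "only":
            return []  # scanned successfully, nothing is imported
    # scan == "false" skips the check, so in this sketch bad rows pass through as-is.
    return list(rows)
```

For example, upload([["1", "dz"]], [int, str], scan="only") checks the row and imports nothing, while the default scan="true" checks the row and then returns it for import.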

5. Separators

Row separator -rd (default \r\n) and column separator -fd (default ,)

6. The first-line header

Skip the header in the first line of the csv file: -h=true
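
To make the separator and header options concrete, here is a minimal Python sketch of how a file could be split into records and fields. The function parse_upload_file and its parameter names are hypothetical and only mirror the -rd, -fd and -h options; the real client does this work internally.

```python
def parse_upload_file(text, rd="\r\n", fd=",", header=False):
    """Split text into rows on the row separator, then into fields on the column separator."""
    records = [line for line in text.split(rd) if line]  # drop empty trailing records
    if header:
        records = records[1:]  # like -h=true: the first line is a header, not data
    return [record.split(fd) for record in records]


# A csv file with a header line and the default separators.
rows = parse_upload_file("id,name\r\n1,dz\r\n2,ki\r\n", header=True)

# The same data with custom separators, as with -rd and -fd.
piped = parse_upload_file("1|dz\n2|ki", rd="\n", fd="|")
```

Both calls produce the same two records, which shows that the separators only describe the file layout and do not change the data that ends up in the table.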

2.2.1.2 Tunnel download

1. Download a partitioned table

-- download all partitions of the partitioned table
tunnel download A C:\Users\dz\Downloads\download\A_d.csv;

-- download a specified partition of the partitioned table
tunnel download A/gender="male" C:\Users\dz\Downloads\download\A_d_male.csv;

2. Download specified columns:
-ci=column index (numbering starts from 0)
-cn=column name

tunnel download B C:\Users\dz\Downloads\download\B_d_ci0.csv -ci=0;

tunnel download B C:\Users\dz\Downloads\download\B_d_cnname.csv -cn="name";

3. Download with a header row: -h=true

tunnel download B C:\Users\dz\Downloads\download\B_d_h.csv -h=true;

4. Download only the first num records: -limit=num

tunnel download B C:\Users\dz\Downloads\download\B_d_1.csv -limit=1;

2.2.2 Developing upload and download with the Java SDK

1. First download the Java SDK from the Alibaba Cloud official website and install Eclipse.

2.2.3 DataHub real-time processing channel

1. Create a DataHub project.
2. Create a topic in the project.
3. Create a connector synchronization task in the topic.

2.3 MaxCompute SQL Development Basics

2.3.1 DDL

create table t_table01(id bigint,name string); -- 1. create a table
desc t_table01; -- 2. view the table structure
show create table t_table01; -- 3. view the statement that created the table
select * from t_table01; -- query the table (before it is dropped)
drop table t_table01; -- 4. drop the table
create table t_table01_p(id bigint,name string) partitioned by(class string); -- 1. create a partitioned table
desc t_table01_p; -- 2. view the partitioned table
create table AA as select * from A where gender="male"; -- "as" copies the data but not the partitioning
create table AB like A; -- "like" copies the table structure, including the partitioning, but not the data
alter table A set lifecycle 30; -- 1. set the lifecycle of the partitioned table to 30 days
alter table A disable lifecycle; -- 2. disable the lifecycle of the partitioned table
select * from A where gender="male"; -- 1. query a partitioned table; the partition must be specified in where
alter table A add if not exists partition(gender="unknown"); -- 2. add the partition gender="unknown"
insert into A partition(gender="unknown") select 7,"someone"; -- 3. insert the row (7, someone) into partition unknown
alter table A partition(gender="unknown") rename to partition(gender="trans"); -- 4. rename partition unknown to trans
alter table A merge partition(gender="male"),partition(gender="trans") overwrite partition(gender="unknown") purge; -- 5. merge partitions male and trans into partition unknown
alter table A rename to a_new; -- 6. rename table A to a_new
alter table a_new add columns(desc string); -- 7. add a column to the table
create view v as select * from a_new; -- 1. create a view

2.3.2 DML

1. Query

-- 1. list all tables in the project
list tables;
select name,gender from aa; -- 2. query the name and gender columns of table aa
select name,gender from aa group by name,gender; -- 3. deduplicate the two columns with group by
select distinct name,gender from aa; -- 4. deduplicate the two columns with distinct
select * from aa limit 2; -- 5. view the first two rows of table aa
select * from (select * from aa where gender = 'female') a join (select * from aa where id = '21' and name = 'ki') b on a.id = b.id; -- 6. subqueries

2. Insert

insert into aa values(10,'dz','female'); -- 1. insert one row into table aa
create table aa2 like aa; -- 2. create a backup table of aa: copies the structure but not the data
insert into aa2 select * from aa; -- 3. append all data from aa to aa2
insert overwrite table aa2 select * from aa; -- 4. overwrite aa2 with all data from aa; the original data in aa2 is deleted

3. Partitioned table

create table t_class_p (id int,name string) partitioned by(gender string); -- 1. create a table partitioned by gender
from aa insert into t_class_p partition(gender = '1') select id,name where id = 10 insert into t_class_p partition(gender = '2') select id,name where id = 11 insert into t_class_p partition(gender = '3') select id,name where id = 12; -- 2. multi-way output: insert three rows from aa into three partitions of the partitioned table
set odps.sql.allow.fullscan=true; -- 3. allow full scans of partitioned tables
select * from t_class_p; -- 4. query all data in the partitioned table

4. Intersection, union, and complement

create table a1 as select * from aa where gender="female"; -- 1. create table a1 from the rows of aa whose gender is female
create table a2 as select * from aa where gender="male"; -- 2. create table a2 from the rows of aa whose gender is male
select id from a1 union all select id from a2; -- 3. union of the ids of a1 and a2 with union all (duplicates are kept)
select id from a1 union select id from a2; -- 4. union with duplicates removed
select id from a1 intersect all select id from a2; -- 5. intersection of the ids of a1 and a2 with intersect all
select id from a1 except all select id from a2; -- 6. complement with except all: ids that exist in a1 but not in a2
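
The difference between the four set operators is easiest to see on columns that contain duplicates. The following Python sketch models each id column as a list and uses collections.Counter for the multiset ("all") variants. The sample values are made up for illustration and are not taken from tables a1 and a2.

```python
from collections import Counter

# Hypothetical id columns; duplicates are deliberate.
a1_ids = [1, 2, 2, 3]
a2_ids = [2, 2, 2, 4]

# union all: simply concatenates both inputs, duplicates are kept.
union_all = a1_ids + a2_ids

# union: like union all, but duplicates are removed.
union = sorted(set(a1_ids) | set(a2_ids))

# intersect all: a value appears min(count in a1, count in a2) times.
intersect_all = sorted((Counter(a1_ids) & Counter(a2_ids)).elements())

# except all: a value appears (count in a1 - count in a2) times, if positive.
except_all = sorted((Counter(a1_ids) - Counter(a2_ids)).elements())

print(union_all, union, intersect_all, except_all)
```

Here id 2 occurs twice in the first list and three times in the second, so it survives twice in the intersection and not at all in the complement.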

2.3.3 Built-in functions

1. Mathematical operations and character processing

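select 0.5*10*20*sin(60/180*3.1415926); -- 1. sin trigonometric function
select ceil(3.1415926),floor(3.1415926),round(3.1415926),trunc(3.1415926),conv('3.1415926',10,2); -- 2. ceil rounds up, floor rounds down, round rounds to the nearest value, trunc truncates, conv converts from base 10 to base 2
select rand(); -- 3. random value; a seed can be given
select abs(-2); -- 4. absolute value
select power(-2,5); -- 5. -2 to the power of 5
select sqrt(16); -- 6. square root of 16
select length("dacadc中文"); -- 7. string length in characters; each Chinese character counts as 1
select lengthb("dacadc中文"); -- 8. string length in bytes; each Chinese character counts as 3 bytes in UTF-8
select char_matchcount('asdf','asbrgdgf'); -- 9. number of characters of string 1 that appear in string 2
select is_encoding("测试","utf-8","gbk"); -- 10. test whether the string can be converted from utf-8 to gbk
select instr("sdsdvfg","s"); -- 11. position at which the second string first appears in the first string, counting from 1
select substr("dasdf",2,3); -- 12. take a substring of length 3 starting at the 2nd character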
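
The 1-based positions and the difference between counting characters and counting bytes are easy to get wrong. The Python sketch below mimics the behavior described in the comments above for a few of the string functions. The helper functions are written only for this illustration, assume UTF-8 encoding, and handle only positive start positions.

```python
def length(s):
    """Number of characters, like length: a Chinese character counts once."""
    return len(s)


def lengthb(s):
    """Number of bytes, assuming UTF-8: a Chinese character takes 3 bytes."""
    return len(s.encode("utf-8"))


def instr(s, sub):
    """1-based position of the first occurrence of sub in s, 0 if it is absent."""
    return s.find(sub) + 1


def substr(s, start, n):
    """n characters of s starting at the 1-based position start."""
    return s[start - 1:start - 1 + n]


print(length("dacadc中文"), lengthb("dacadc中文"))
print(instr("sdsdvfg", "s"), substr("dasdf", 2, 3))
```

The string "dacadc中文" has 8 characters but 12 bytes, which is why the character count and the byte count differ for the same input.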

2. Date processing and window functions

select getdate(); -- 1. query the current system time
select datediff(datetime '2022-06-18 20:00:00',datetime '2022-06-15 19:00:00','dd'); -- 2. number of days between the two times
select unix_timestamp(datetime '2022-06-13 20:00:00'); -- 3. convert a time to a Unix timestamp
select from_unixtime(1655121600); -- 4. convert a Unix timestamp to a time
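
The date examples above can be checked by hand with a short Python sketch. It rests on two assumptions that are not stated in the notes: that the project time zone is UTC+8, and that datediff with 'dd' compares only the date parts of the two values. The function names are just local helpers that mirror the SQL functions.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the project runs in UTC+8.
PROJECT_TZ = timezone(timedelta(hours=8))


def datediff_days(end, start):
    """Day difference computed on the date parts only (assumed 'dd' behavior)."""
    return (end.date() - start.date()).days


def unix_timestamp(dt):
    """Seconds since 1970-01-01 00:00:00 UTC for a wall-clock time in the project zone."""
    return int(dt.replace(tzinfo=PROJECT_TZ).timestamp())


def from_unixtime(ts):
    """Wall-clock time in the project zone for a Unix timestamp."""
    return datetime.fromtimestamp(ts, tz=PROJECT_TZ).replace(tzinfo=None)


days = datediff_days(datetime(2022, 6, 18, 20), datetime(2022, 6, 15, 19))
stamp = unix_timestamp(datetime(2022, 6, 13, 20))
moment = from_unixtime(1655121600)
```

Under the UTC+8 assumption, 2022-06-13 20:00:00 and the timestamp 1655121600 convert into each other, so the two conversion examples in the notes are inverses.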

3. Aggregation and other functions


2.4 UDF Development Basics

UDF stands for user-defined function.

2.5 MR Development Basics

MapReduce is a programming model for parallel operations on large-scale data sets (greater than 1 TB). Its main ideas, "Map" and "Reduce", are borrowed from functional programming languages, along with features borrowed from vector programming languages. It makes it much easier for programmers who are not familiar with distributed parallel programming to run their programs on distributed systems. In current implementations, you specify a Map (mapping) function that maps a set of key-value pairs into a new set of key-value pairs, and a concurrent Reduce (reduction) function that combines all of the mapped key-value pairs that share the same key.
Create three Java files: the map class, the reduce class, and the driver.
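
The model can be tried out on a single machine with a word count, the usual first MapReduce example. The Python sketch below is only a local illustration of the map, shuffle and reduce steps. It does not use the MaxCompute MapReduce API, and all function names are made up for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one input line into (word, 1) key-value pairs."""
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group the values of all pairs that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Driver: wire the three steps together, as the driver class does in a real job."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = run_job(["a b a", "b c"])
```

In a real job the map and reduce functions run in parallel on many workers and the framework performs the shuffle, which is why only the mapper, the reducer and the driver have to be written by hand.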

2.6 Graph Development Basics

2.7 Permissions and Security

-- 1. view the permissions of the current user in this project
show grants;

3. Big data development and governance platform DataWorks

3.1 Data Integration

1. First create a new data source; here a MySQL data source is created.

3.2 Data Development

3.3 Task operation and maintenance

3.4 Data Management

4. Data visualization analysis platform Quick BI

5. Machine learning platform PAI

1. Open Alibaba Cloud PAI.
2. Create a workflow and open it.
3. Download the data set, which is a red wine classification data set, and then import the data into the workspace.

For the detailed operations, refer to: https://blog.csdn.net/wyn_365/article/details/107284561

Origin blog.csdn.net/weixin_38226321/article/details/125187415