Learning Alibaba Cloud's big data tool MaxCompute: for users who have used Hive

If you are a big data engineer and have used Hadoop's Hive framework, congratulations: you have already mastered 90% of Alibaba Cloud's big data computing service, MaxCompute. This article briefly compares the similarities and differences between MaxCompute and Hive, to help users who are just starting with MaxCompute migrate from Hive in seconds.
First, a review of Hive concepts:
1. Hive is built on Hadoop and stores data in the form of tables; the data actually lives on HDFS. Databases and tables are two levels of directories on HDFS, with the data placed under the table-name directory, and computation is translated into MapReduce jobs.
2. Hive can operate on data through a command-line client and a Java API.
3. Hive tables are operated on with HQL, whose syntax is roughly the same as general SQL plus extra functions suited to Hive's own computation model. HQL is parsed into MapReduce for the underlying logic.
4. Hive has the concepts of partitioning and bucketing.
5. Task progress, logs, and so on can be viewed through the web UI provided by Hadoop.
6. Hive supports custom functions: UDF, UDAF, and UDTF.
7. Hive can be operated through the Hue interface.
8. Hive can exchange data with other data sources through tools such as Sqoop.
9. Resource scheduling depends on the Hadoop YARN platform.
If you are even a little familiar with these Hive features, I can now tell you that the features and usage of MaxCompute are basically the same as all of the above.
Let's take a look at what MaxCompute provides:
MaxCompute mainly serves the storage and computation of batch structured data. It provides a solution for massive data warehousing and analysis and modeling services for big data. It supports SQL query computation, custom functions (UDFs) to implement complex logic, and MapReduce programs to implement more specific business computation; it also supports Graph, an iterative graph-computing framework, and provides a Java API for connecting to and running SQLTask.
Doesn't it seem that MaxCompute is just like Hive? You can use SQL, UDFs, and MR.
① File system comparison

Before comparing the differences, allow me to briefly introduce the cornerstone of Alibaba Cloud, the Apsara ("Feitian") system; you can search online for details. Apsara is a distributed file storage and computing system. Sounds familiar, doesn't it, with the same flavor as Hadoop. For now you can think of it as a Hadoop-like framework: MaxCompute is built on Apsara, similar to the way Hive is built on Hadoop.

Hive's data actually lives on HDFS, with metadata generally kept in MySQL, and is presented in the form of tables; you can inspect the specific files directly on HDFS. MaxCompute's data lives in the Apsara file system, which is not exposed to the outside world, and the lower layers are optimized automatically.
② The Hive and MaxCompute clients
Compare them directly. First, the Hive client:
[screenshot: Hive command-line client]

The MaxCompute client (formerly ODPS):
[screenshot: MaxCompute command-line client]

They look much the same, don't they?
In fact, the project space (Project) is the basic organizational unit of MaxCompute. It is similar to the concept of a Database or Schema in traditional databases and is the main boundary for multi-user isolation and access control. A user can have permissions on multiple project spaces at the same time.

With the configuration file set up as shown below, the client can execute SQL and other commands.
[screenshot: client configuration file]

In addition to the command-line client, MaxCompute also provides Python and Java SDKs for access. Enough talk, let's go straight to the code.

    import java.util.List;
    import com.aliyun.odps.Instance;
    import com.aliyun.odps.Odps;
    import com.aliyun.odps.OdpsException;
    import com.aliyun.odps.account.Account;
    import com.aliyun.odps.account.AliyunAccount;
    import com.aliyun.odps.data.Record;
    import com.aliyun.odps.task.SQLTask;

    public class testSql {
      // accessId and accessKey are Alibaba Cloud account credentials,
      // similar to a password; they are used beyond MaxCompute as well
      private static final String accessId = "";
      private static final String accessKey = "";
      // the service endpoint
      private static final String endPoint = "http://service.odps.aliyun.com/api";
      // the MaxCompute project name, similar to a Hive database
      private static final String project = "";
      private static final String sql = "select category from iris;";

      public static void main(String[] args) {
        Account account = new AliyunAccount(accessId, accessKey);
        Odps odps = new Odps(account);
        odps.setEndpoint(endPoint);
        odps.setDefaultProject(project);
        Instance i;
        try {
          i = SQLTask.run(odps, sql);
          i.waitForSuccess();
          List<Record> records = SQLTask.getResult(i);
          for (Record r : records) {
            System.out.println(r.get(0).toString());
          }
        } catch (OdpsException e) {
          e.printStackTrace();
        }
      }
    }

Doesn't it feel familiar? It is the same as the way most databases are accessed.
③ odpscmd and HiveQL
First, look at the table-creation statements.
Hive's standard CREATE TABLE statement:

hive> create table page_view
    > (
    > page_id bigint comment 'page ID',
    > page_name string comment 'page name',
    > page_url string comment 'page URL'
    > )
    > comment 'page view'
    > partitioned by (ds string comment 'current time, used as the partition field')
    > row format delimited
    > stored as rcfile
    > location '/user/hive/test';

The MaxCompute table-creation statement:

create table page_view
(
page_id bigint comment 'page ID',
page_name string comment 'page name',
page_url string comment 'page URL'
)
partitioned by (ds string comment 'current time, used as the partition field');

From the CREATE TABLE statements you can clearly see that MaxCompute specifies no delimiter, no file storage path, and no file storage format. Are there just defaults, then? No. Because MaxCompute is built on Alibaba Cloud's Apsara file system, users do not need to care about file storage formats, compression formats, storage paths, and so on; these are handled by Alibaba Cloud. Users are also freed from tuning storage cost, compression trade-offs, read/write speed, and similar concerns, and can concentrate on business development.
Next, uploading and downloading data in the two systems.
Hive can do this with commands, for example an upload:
[screenshot: Hive data upload command]

MaxCompute uploads and downloads through the Tunnel command-line tool, which can also validate file formats and detect dirty data during upload.
[screenshot: Tunnel upload command]
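As a rough sketch of the two workflows (the file path and partition value here are made up for illustration), Hive loads a local file with `LOAD DATA`, while the MaxCompute client uses the `tunnel upload` command against the `page_view` table created above:

```
-- Hive: load a local file into a partition
LOAD DATA LOCAL INPATH '/tmp/page_view.csv'
INTO TABLE page_view PARTITION (ds='20180101');

-- MaxCompute (odpscmd): upload the same file through Tunnel
tunnel upload /tmp/page_view.csv page_view/ds="20180101";
```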

④ Partitions and buckets
The concept of partitioning should be familiar to Hive users: it adds another level of directory under the table directory to separate the data, with the goal of improving query efficiency. From the CREATE TABLE statements above you can see that both MaxCompute and Hive support partitioning, with the same concept and usage.
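For example, with the `page_view` table created earlier, adding and querying a partition looks the same in both systems (the partition value is illustrative):

```
ALTER TABLE page_view ADD IF NOT EXISTS PARTITION (ds='20180101');
SELECT page_id, page_name FROM page_view WHERE ds = '20180101';
```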
As for bucketing: Hive's DDL supports a bucketing clause, while MaxCompute has no bucketing operation. Bucketing hashes a large file into multiple smaller files by some field, and appropriate bucketing improves query efficiency; in MaxCompute these optimizations are already done at the lower layers.
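For reference, a bucketed table in Hive is declared with a `CLUSTERED BY` clause, which has no counterpart in MaxCompute DDL (the column choice and bucket count here are illustrative):

```
CREATE TABLE page_view_bucketed
(
page_id bigint,
page_name string
)
CLUSTERED BY (page_id) INTO 16 BUCKETS
STORED AS ORC;
```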
⑤ External tables
Hive can operate on data in systems such as HBase and Elasticsearch through external tables. The external-table feature applies equally in MaxCompute (supported since version 2.0): MaxCompute maps Alibaba Cloud's OTS and OSS storage products through external tables to process unstructured data such as audio and video. Look at the CREATE TABLE statement:

CREATE EXTERNAL TABLE IF NOT EXISTS ambulance_data_csv_external
(
vehicleId int,
recordId int,
patientId int,
calls int,
locationLatitute double,
locationLongtitue double,
recordTime string,
direction string
)
STORED BY 'com.aliyun.odps.CsvStorageHandler'
LOCATION 'oss://oss-cn-hangzhou-zmf.aliyuncs.com/oss-odps-test/Demo/SampleData/CSV/AmbulanceData/';
Now compare with Hive's statement mapping a table to HBase:

CREATE EXTERNAL TABLE cofeed_info
(
rowkey string,
id string,
source string,
insert_time timestamp,
dt string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key,cf:id,cf:source,cf:insert_time,cf:dt")
TBLPROPERTIES ("hbase.table.name" = "cofeed_info");

The syntax is basically the same. MaxCompute also lets you write a custom extractor to process unstructured data; see https://yq.aliyun.com/articles/61567 to learn more.
⑥ Web UI
Hive tasks rely on the web UIs provided by Hadoop's HDFS and YARN. Compare:
Hadoop web UI:
[screenshot: Hadoop web UI]

Here you can check the execution of Hive tasks through the job history. Personally, I don't find the pages very friendly.
In MaxCompute you can of course also check task status, progress, parameters, and task logs through a UI.
When a task runs, the client prints an HTTP address, which we call a Logview; copy it and open it in a browser.
As shown:
[screenshot: Logview URL printed by the client]

Opened in a browser:

At a glance the overview is very clear: task start and end times, task status, a green progress bar. It is easy to get the overall picture of the task.
[screenshot: Logview overview]

Click the Detail button to see more specific scheduling information, logs, and so on.
[screenshot: Logview detail view]

Click JSONSummary to see the execution process in great detail.
[screenshot: Logview JSON summary]

So MaxCompute's web UI is quite friendly and helps users locate problems quickly. As for scheduling, note that it is handled uniformly by Alibaba Cloud; users do not need to worry about tuning it.
⑦ Support for custom functions
Both Hive and MaxCompute support custom functions, of the same three kinds: UDF, UDTF, and UDAF.
The code is written the same way. The biggest difference is in the supported data types.
The data types MaxCompute SQL currently supports in UDFs are Bigint, String, Double, and Boolean. The mapping between MaxCompute types and Java types is as follows:

[table: MaxCompute to Java type mapping]

Note:

The corresponding Java parameter and return types are objects; make sure the first letter is capitalized.

The datetime type is not supported yet; it is recommended to convert it to String before passing it in.
SQL NULL values are represented by Java null references, so Java primitive types are not allowed, because they cannot represent SQL NULL.
This is unlike Hive, which supports a wide range of types.
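The NULL rule above can be illustrated with plain Java, independent of the MaxCompute SDK (the class and method names here are invented for the demo): a boxed `Long` can carry SQL NULL, while a primitive `long` cannot.

```java
// Demonstrates why MaxCompute UDF signatures use boxed types:
// only an object reference can represent SQL NULL.
public class NullDemo {
    // mirrors a UDF evaluate(Bigint): takes Long, not long
    static Long increment(Long x) {
        if (x == null) {
            return null; // SQL NULL in, SQL NULL out
        }
        return x + 1;
    }

    public static void main(String[] args) {
        System.out.println(increment(41L));  // 42
        System.out.println(increment(null)); // null
    }
}
```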

Here is a MaxCompute UDF code example:

    import com.aliyun.odps.udf.UDF;

    public final class Lower extends UDF {
      public String evaluate(String s) {
        if (s == null) { return null; }
        return s.toLowerCase();
      }
    }

The usage is the same, so Hive users can basically migrate directly.
One thing to emphasize: for security reasons, MaxCompute places a Java sandbox restriction on UDFs and MR; for example, UDF code cannot start extra threads. For details see
https://help.aliyun.com/document_detail/27967.html
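To actually use such a UDF, both systems follow a similar register-then-call flow (the jar path, function name, and table references below are assumptions for illustration):

```
-- Hive: register the jar and the function, then call it
ADD JAR /tmp/my_lower.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'Lower';
SELECT my_lower(page_name) FROM page_view;

-- MaxCompute (odpscmd): the jar is first uploaded as a resource
add jar my_lower.jar;
create function my_lower as 'Lower' using 'my_lower.jar';
select my_lower(category) from iris;
```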

⑧ GUI operation
Speaking of GUI operation: Alibaba Cloud's products are basically all operated through graphical interfaces, with drag-and-drop and so on. The development threshold is very low, which also makes them well suited to big data beginners and to companies without dedicated development staff.
Hive can use the Hue tool to operate on and query data, but in practice the interactivity is not very strong.
Here I will briefly introduce MaxCompute's GUI operation along with data synchronization, permission control, data management, interaction with other data sources, scheduled execution, and so on: this is the Alibaba Cloud product Big Data Development Suite, which is currently free to use. You need an activated MaxCompute project to enter. Can't wait? Straight to the pictures.
1. MaxCompute SQL query interface
[screenshot: SQL query interface]

2. MaxCompute MapReduce interface configuration
[screenshot: MapReduce configuration interface]

3. MaxCompute data synchronization interface
Hive can synchronize data with various data sources through the Sqoop tool. In the Big Data Development Suite, it is also very convenient for MaxCompute to synchronize with other data sources.
[screenshot: data synchronization interface]

You can also configure process control and scheduling.
[screenshot: workflow scheduling configuration]

Isn't it amazing? The specifics are best experienced for yourself, so I will not introduce them one by one here.

Finally, let's compare Hadoop MapReduce and MaxCompute MapReduce, using everyone's favorite WordCount as the example.
Before the introduction, let me emphasize two points: 1. The input and output of MaxCompute MapReduce are tables (or partitions); if you need to reference other files, you must upload them as resources first. 2. MaxCompute MapReduce also has the sandbox restriction: code is not allowed to start threads of other frameworks, and so on.
I will not post the Hadoop MR code; here is the MaxCompute MapReduce mapper:

      public static class TokenizerMapper extends MapperBase {
        private Record word;
        private Record one;

        @Override
        public void setup(TaskContext context) throws IOException {
          word = context.createMapOutputKeyRecord();
          one = context.createMapOutputValueRecord();
          one.set(new Object[] { 1L });
          System.out.println("TaskID:" + context.getTaskID().toString());
        }

        @Override
        public void map(long recordNum, Record record, TaskContext context)
            throws IOException {
          // MaxCompute MR processes one table row at a time, as a Record
          for (int i = 0; i < record.getColumnCount(); i++) {
            word.set(new Object[] { record.get(i).toString() });
            context.write(word, one);
          }
        }
      }

Now look at the job configuration in the main function; the code logic is much like Hadoop's:

      public static void main(String[] args) throws Exception {
        if (args.length != 2) {
          System.err.println("Usage: WordCount <in_table> <out_table>");
          System.exit(2);
        }

        JobConf job = new JobConf();

        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumCombiner.class);
        job.setReducerClass(SumReducer.class);
        // the map-side output schema can be declared directly
        job.setMapOutputKeySchema(SchemaUtils.fromString("word:string"));
        job.setMapOutputValueSchema(SchemaUtils.fromString("count:bigint"));
        // the input and output must be tables or partitions
        InputUtils.addTable(TableInfo.builder().tableName(args[0]).build(), job);
        OutputUtils.addTable(TableInfo.builder().tableName(args[1]).build(), job);

        JobClient.runJob(job);
      }

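For completeness, the two jobs are also launched differently: a Hadoop MR job runs via `hadoop jar`, while a MaxCompute MR job is typically launched from odpscmd with the `jar` command (the jar name, class name, and table names below are assumptions):

```
-- Hadoop
hadoop jar wordcount.jar WordCount /user/in /user/out

-- MaxCompute (odpscmd): the jar is registered as a resource first
add jar wordcount.jar;
jar -resources wordcount.jar -classpath wordcount.jar WordCount wc_in wc_out;
```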
So the main functions are almost the same. You can see that a developer who has used Hive can migrate to MaxCompute in seconds, making development more convenient and concise and freeing developers from heavy overtime. In turn, the company saves a lot of operations and development labor costs and can focus on business development. If you must ask me how Hive and MaxCompute compare on performance, I can only tell you that MaxCompute has been tested on Double Eleven.

Summary: if the industrial revolution liberated people from manual labor, then today's Internet revolution, especially the rapid development of cloud computing and big data, is liberating people from mental labor.
