Introduction to Kudu

While browsing Cloudera's official blog today, I came across the article "Kudu: New Apache Hadoop Storage for Fast Analytics on Fast Data". My notes follow:

1. What is Kudu

This new open source complement to HDFS and Apache HBase is designed to fill gaps in Hadoop’s storage layer that have given rise to stitched-together, hybrid architectures.

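Two characteristics of the new component stand out: 1) it is open source (free, Apache License 2.0); 2) it is a new storage component that complements HDFS and HBase, combining capabilities of both and sitting between the two.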

2. Use cases for Kudu

Strong performance for both scan and random access to help customers simplify complex hybrid architectures (suits mixed workloads that need both random access and bulk scans)
High CPU efficiency in order to maximize the return on investment that our customers are making in modern processors (compute-intensive workloads)
High IO efficiency in order to leverage modern persistent storage (takes advantage of high-performance storage devices, including using more memory)
The ability to update data in place, to avoid extraneous processing and data movement (supports updates, so data does not have to be moved back and forth)
The ability to support active-active replicated clusters that span multiple data centers in geographically distant locations (supports cross-region, real-time replication and querying)
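
To make the first and fourth points a bit more concrete (scan plus random access, and in-place updates), here is a toy Python sketch of that access pattern. It is not Kudu's API and says nothing about how Kudu stores data; `ToyTable` and its methods are hypothetical names made up for this note, and everything lives in memory.

```python
import bisect


class ToyTable:
    """Hypothetical in-memory table with a sorted primary key (NOT Kudu's API)."""

    def __init__(self):
        self._keys = []  # primary keys, kept sorted so range scans are cheap
        self._rows = {}  # key -> row dict, for random access by key

    def upsert(self, key, row):
        # Update in place if the key exists, otherwise insert in sorted position.
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = row

    def get(self, key):
        # Random access by primary key; returns None if the key is absent.
        return self._rows.get(key)

    def scan(self, start, stop):
        # Range scan over keys in [start, stop).
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


table = ToyTable()
for i in range(5):
    table.upsert(i, {"clicks": i * 10})

table.upsert(2, {"clicks": 99})    # in-place update; other rows are untouched
single_row = table.get(2)          # random access
key_range = table.scan(1, 4)       # scan over keys 1, 2, 3
```

The point is only that one store serves both the point lookup and the range scan, and that an update touches a single row instead of forcing a rewrite or a copy into another system.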

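Summing up, two highlights stand out: 1) it bridges the heterogeneous ecosystem of stitched-together components, so that data and operations stay within a single system; 2) it treats CPU and disk I/O as one problem, which makes it easier to allocate resources optimally, especially once CPU becomes the bottleneck in the future.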

3. Summary

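As it stands, Kudu brings the analytics and online use cases together, with the aim of consolidating the scattered components of the big data ecosystem. This is probably a problem the ecosystem urgently needs to solve, and it looks like a trend as well.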

4. Points to watch

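How it integrates with Impala; how it differs from HDFS and HBase (including integration and data migration); how to query it through a SQL engine; whether there is room for client interfaces beyond Java and C++; and how its performance compares.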
