kudu 介绍

kudu的好处：

快速的olap

列式存储，Hadoop parquet 的一种替代方案

对数据的顺序处理和随机处理都很高效

* High availability. Tablet Servers and Masters use the Raft Consensus Algorithm, which ensures that as long as more than half the total number of replicas is available, the tablet is available for reads and writes. For instance, if 2 out of 3 replicas or 3 out of 5 replicas are available, the tablet is available.
* Reads can be serviced by read-only follower tablets, even in the event of a leader tablet failure.

kudu使用的场景：

Reporting applications where newly-arrived data needs to be immediately available for end users 刚到的数据需要需要马上呈现给用户
Time-series applications that must simultaneously support:
- queries across large amounts of historic data 历史数据的快速查询
- granular queries about an individual entity that must return very quickly 快速返回单个实体粒度查询
Applications that use predictive models to make real-time decisions with periodic refreshes of the predictive model based on all historic data使用预测模型进行实时决策的应用程序，基于所有历史数据定期刷新预测

kudu-impala 有哪些特性：

CREATE/ALTER/DROP TABLE

: Impala supports creating, altering, and dropping tables using Kudu as the persistence layer. The tables follow the same internal / external approach as other tables in Impala, allowing for flexible data ingestion and querying.

INSERT

: Data can be inserted into Kudu tables in Impala using the same syntax as any other Impala table like those using HDFS or HBase for persistence.

UPDATE / DELETE

: Impala supports the UPDATE and DELETE SQL commands to modify existing data in a Kudu table row-by-row or as a batch. The syntax of the SQL commands is chosen to be as compatible as possible with existing standards. In addition to simple DELETE or UPDATE commands, you can specify complex joins with a FROM clause in a subquery.

Flexible Partitioning 采用hash 或 range 的分区

: Similar to partitioning of tables in Hive, Kudu allows you to dynamically pre-split tables by hash or range into a predefined number of tablets, in order to distribute writes and queries evenly across your cluster. You can partition by any number of primary key columns, by any number of hashes, and an optional list of split rows. See Schema Design.

Parallel Scan 并发对多个tablet 进行扫描

: To achieve the highest possible performance on modern hardware, the Kudu client used by Impala parallelizes scans across multiple tablets.

High-efficiency queries

: Where possible, Impala pushes down predicate evaluation to Kudu, so that predicates are evaluated as close as possible to the data. Query performance is comparable to Parquet in many workloads.

概念和术语:

列式数据存储
读取效率快，只读一列或部分列
数据压缩：一列的数据类型统一，便于亚索
Tablet：类似其他数据库的分区，一个给定的tablet 会被备份在多台tablet server中，有一个备份会被选为leader tablet，
tablet server:一个tablet server 可以服务多个tablet，一个tablet 也可以被多个server服务，服务于一个tablet的多个server 只有一个时leader，其他为foller，leader 负责读和写，foller 只负责读
master：master 服务器跟踪所有tablet，tablet server ，目录表以及与群集相关的其他元数据。master 只有一个，master disappears 之后再选举一个
catlog table：存储了两类信息，1）table 的shcemas ，location，states 。2）现存talbets 的列表，每个tablet 被哪几个server 服务，tablet 的状态，开始key，结束key

查看kudu ui：

access the Master or Tablet Server web UI by opening http://<_host_name_>:8051/ for masters or http://<_host_name_>:8050/ for tablet servers.

impala shell 连接 kudu：

Start Impala Shell using the impala-shell command. By default, impala-shell attempts to connect to the Impala daemon on localhost on port 21000. To connect to a different host,, use the -i <host:port> option. To automatically connect to a specific Impala database, use the -d <database> option. For instance, if all your Kudu tables are in Impala in the database impala_kudu, use -d impala_kudu to use this database.

sadfasdf

猜你喜欢