Getting Started with Big Data - What is Kudu

Contents

1. Concept

2. Architecture

1. Master Server

2. Tablet Server

3. Table

4. Tablet

3. Characteristics

1. Importance

2. Ease of use

3. Advantages

4. Comparison with traditional relational databases

5. Comparison with other big data components

4. Commonly used statements

1. Create a table

1. Create an ordinary table

2. Create a partitioned table

2. Drop a table

3. Query data

4. Insert data

5. Update data

6. Delete data

7. Add a single partition

8. Drop a single partition

9. Add multiple partitions

10. Drop multiple partitions

11. Add a column

12. Drop a column

13. Rename a column

14. Partitioning by multiple columns

1. Create a new table

2. Add a new partition

3. Drop the partition

5. In plain language

6. Others

A series of articles on getting started with big data


This post briefly introduces some of Kudu's common terms, its basic architecture, and some commonly used statements. A more detailed, dedicated introduction to this component will follow later; you can follow the blog for further reading.

1. Concept

Kudu is a new columnar storage system open-sourced by Cloudera and a member of the Apache Hadoop ecosystem. It is designed to quickly analyze rapidly changing data and to fill a gap in the existing Hadoop storage layer.

Kudu provides functionality and a data model closer to an RDBMS: it offers a storage structure similar to relational database tables and lets users insert, update, and delete data in the same way they would in a relational database.

Kudu is only a storage layer; it does not process data itself but relies on external processing engines from the Hadoop ecosystem (MapReduce, Spark, Impala). Kudu stores data in the underlying Linux file system in its own columnar storage format.

At its core, Kudu is a table-based storage engine. It keeps both its own metadata (information about tables) and user data, and both are stored in tablets.

Kudu provides an Upsert operation for writing data, similar to Oracle's MERGE: if a row with the given primary key does not exist it is inserted, and if it does exist it is updated.
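
A minimal sketch of that Upsert behavior, using the test.test1 table that is created in the statements section below:

upsert into test.test1 (date_timekey, username) values ('20200330', 'shuijianshiqing');
-- running the statement again with the same primary key does not fail;
-- the existing row is updated in place with the new username
upsert into test.test1 (date_timekey, username) values ('20200330', 'another_name');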

2. Architecture

Similar to HDFS and HBase, Kudu uses a single Master node to manage the cluster's metadata and any number of Tablet Server nodes (comparable to the RegionServer role in HBase) to store the actual data. Multiple Master nodes can be deployed to improve fault tolerance. The data of a table is divided into one or more tablets, which are deployed on Tablet Servers to provide read and write services.

1. Master Server

The Master Server is the leader of the Kudu cluster. There can be multiple Master Servers to improve the fault tolerance of the cluster, but only one of them serves requests at a time; it is responsible for managing the cluster and its metadata.
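
For example, when mapping an existing Kudu table into Impala as an external table, the Master addresses can be supplied explicitly. This is only a sketch: the hostnames, the port, and the Kudu-side table name below are illustrative, and in many deployments Impala already knows the Master addresses from its configuration.

create external table test.test1_mapping
stored as kudu
tblproperties (
  'kudu.table_name' = 'impala::test.test1',
  'kudu.master_addresses' = 'kudu-master-1:7051,kudu-master-2:7051,kudu-master-3:7051'
);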

2. Tablet Server

A Kudu cluster can have any number of Tablet Servers, the worker nodes responsible for storing data and serving reads and writes. Tablets are stored on Tablet Servers. For a given tablet, only one Tablet Server acts as the Leader and provides both read and write service, while the other Tablet Servers holding replicas of that tablet are Followers and provide read service only.

3. Table

A table in Kudu has a schema and a primary key. A Kudu table is divided horizontally into multiple tablets, which are stored on Tablet Servers.

4. Tablet

A tablet is a contiguous segment of a table, i.e. a horizontal partition of the table. The primary key ranges of different tablets do not overlap, and together all the tablets of a table cover its entire primary key range. Each tablet is stored redundantly on multiple Tablet Servers as replicas; at any point in time, one replica is the Leader and the others are Followers.
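
As a small sketch of how a table is split into tablets: besides the range partitioning used later in this article, Impala can also hash-partition a Kudu table, and each resulting tablet is then replicated across Tablet Servers. The table name and columns below are made up purely for illustration.

create table test.metrics_demo (
  host string not null,
  ts string not null,
  metric_value string,
  primary key (host, ts)
)
-- the table is split horizontally into 4 tablets; each tablet is replicated
-- across Tablet Servers (3 copies by default), with one replica acting as Leader
partition by hash (host) partitions 4
stored as kudu;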

3. Characteristics

1. Importance

1. The complexity of big data analytics often comes from the limitations of the storage system. Kudu has far fewer such limitations, which to some extent makes big data analytics simpler.
2. New application scenarios call for Kudu, for example the growing number of applications focused on machine-generated data and real-time analytics.
3. It adapts to new hardware environments, bringing higher performance and greater application flexibility.

2. Ease of use

1. Provides functionality and a data model closer to an RDBMS;
2. Provides an RDBMS-like table storage structure;
3. Allows users to insert, update, and delete data in the same way as in an RDBMS.

3. Advantages

Kudu combines row-by-row inserts, low-latency random access, updates, and fast analytical scans, so it supports both OLAP and OLTP workloads well. Complex architectures that previously required several storage systems running side by side can be replaced with this single storage system holding all the data, which greatly simplifies a big data architecture.

4. Comparison with traditional relational databases

1. Like relational databases, Kudu tables have a unique primary key.
2. Features common in relational databases, such as transactions, foreign keys, and non-primary-key indexes, are currently not supported in Kudu.
3. Kudu has some OLAP and OLTP characteristics, but it lacks support for cross-row atomic, consistent, isolated, and durable (ACID) transactions.
4. Kudu can be classified as a Hybrid Transaction/Analytic Processing (HTAP) database.
5. Kudu supports fast primary key lookups and can analyze data while it is continuously being ingested, a scenario in which OLAP databases usually do not perform well.
6. Kudu's durability guarantees are closer to those of an OLTP database.
7. Kudu's quorum capability can implement a mechanism called Fractured Mirrors: one or two nodes use row storage while the other nodes use column storage. OLTP-style queries can then be executed on the row-store nodes and OLAP queries on the column-store nodes, mixing the two workloads.

5. Comparison with other big data components

1. HDFS is good at large-scale scanning, but not good at random reading. Strictly speaking, it does not support random writing. It can simulate random writing by merging, but the cost is very high.
2. HBase and Cassandra are good at random access, reading and modifying data randomly, but poor performance in large-scale scanning.
3. Kudu's goal is scan performance comparable to HDFS (within roughly a factor of two), with random read performance approaching that of HBase and Cassandra. The concrete target is random read/write latency within 1 ms on SSDs.

4. Commonly used statements

1. Create a table

Creating a Kudu table requires a primary key, and the primary key columns cannot be null.

1. Create an ordinary table

create table test.test1 (
  date_timekey string not null,
  username string null,
  product_qty string null,
  primary key (date_timekey)
)
stored as kudu;

2. Create a partitioned table

create table test.test1 (
  date_timekey string not null,
  username string null,
  product_qty string null,
  primary key (date_timekey)
)
partition by range (date_timekey) (partition value='20220417')
stored as kudu;

2. Drop a table

drop table if exists test.test1;

3. Query data

Note: When querying data, it is best to list only the columns you actually need. Selecting fewer columns reduces the amount of data the query has to load; SQL written with explicit columns puts less pressure on the big data cluster and makes the system more robust.

select date_timekey,username  from test.test1

4. Insert data

Note: Before inserting data into a partitioned table, the target partition must already exist (see the sketch after the statement below).

insert into test.test1 (date_timekey,username)values('20200330','shuijianshiqing');
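
A minimal sketch of that order of operations for the range-partitioned version of test.test1 created above; the partition value is only illustrative:

-- create the target range partition first; without it the insert is rejected
alter table test.test1 add range partition value='20200330';
-- rows whose date_timekey falls in that partition can now be inserted
insert into test.test1 (date_timekey, username) values ('20200330', 'shuijianshiqing');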

Note: The primary key of an inserted row cannot be null, otherwise the row will not be written. For example, the following statement fails:

insert into test.test1 (date_timekey,username)values(null,'shuijianshiqing');

5. Update data

upsert into test.test1 (date_timekey,username)values('20200330','shuijianshiqing');

6. Delete data

Note: When deleting data, you cannot delete through a table alias (for example aliasing the table as test.test1 t and writing the condition as t.date_timekey); such a statement will not delete the data. See the sketch after the statement below.

delete from test.test1 where date_timekey='20200328';
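
To make the note above concrete, a small sketch contrasting the two forms (the aliased form is shown, commented out, only as what to avoid):

-- not supported: deleting through a table alias
-- delete from test.test1 t where t.date_timekey = '20200328';
-- supported: reference the table and column directly
delete from test.test1 where date_timekey = '20200328';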

7. Add a single partition

alter table test.test1 add range partition value='20200325';

8. Drop a single partition

alter table test.test1 drop range partition value='20200325';

9. Add multiple partitions

alter table test.test1 add range partition '20200327' <= values < '20200331';

10. Drop multiple partitions

alter table test.test1 drop range partition '20200327' <= values < '20200331';

11. Add a column

alter table test.test1 add columns(column_new string);

12. Drop a column

alter table test.test1 drop column column_new;

13. Rename a column

Here username is the old column name and username_new is the new column name:

alter table test.test1 change column username username_new string;

14. Partitioning by multiple columns

1. Create a new table

drop table if exists test.test2;
create table test.test2 (
  id string not null,
  date_timekey string not null,
  hour_timekey string not null,
  username string,
  password string,
  interface_time string,
  primary key (id,date_timekey,hour_timekey)
)
partition by range (date_timekey,hour_timekey) (partition value=('20200601','20200601 0730'))
stored as kudu;

2. Add a new partition

alter table test.test2 add range partition value=('20200601','20200601 0830');

3. Drop the partition

alter table test.test2 drop range partition value=('20200601','20200601 0830');

5. In plain language

Kudu is a storage engine that, much like an RDBMS, supports insert, delete, update, and query, which makes big data analysis more convenient. Its storage is not built on Hadoop's HDFS; it maintains its own independent storage on the Linux file system. The finer details of how Kudu reads and writes data will be covered in a later article.

6. Others

Chicken soup: What the world fears most is the word "earnest." That word is worth a thousand pieces of gold, and a thousand pieces of gold cannot buy it!

A series of articles on getting started with big data

1. Introduction to Big Data - What is Big Data

2. Introduction to Big Data - Overview of Big Data Technology (1)

3. Introduction to Big Data - Overview of Big Data Technology (2)

4. Introduction to Big Data - Understand Hadoop in Three Minutes

5. Introduction to Big Data - Understand HDFS in Five Minutes

6. Introduction to Big Data - Understand Hive in Five Minutes

Handsome guys and lovely ladies passing by, don't miss this: follow and like your way to the pinnacle of life!!!

Source: blog.csdn.net/helongqiang/article/details/123704306