A series of articles on getting started with big data
This article briefly introduces Kudu's common terms, its basic architecture, and some frequently used statements. A more detailed introduction to this component will be covered separately; follow the blog for the later articles.
1. Concept
Kudu is a columnar storage system open sourced by Cloudera and a member of the Apache Hadoop ecosystem. It is designed for fast analytics on rapidly changing data and fills a gap left by the earlier Hadoop storage layer.
Kudu provides functions and data models that are closer to RDBMS, providing a storage structure similar to relational databases to store data, allowing users to insert, update, and delete data in the same way as relational databases.
Kudu is only a storage layer. It does not process data itself, but relies on external Hadoop processing engines (MapReduce, Spark, Impala). Kudu stores data on the underlying Linux file system in its own columnar storage format.
The core of Kudu is a table-based storage engine. Kudu stores both its own metadata (information about tables) and user data in Tablets.
Kudu provides an Upsert operation for updating data, similar to Oracle's Merge.
2. Architecture
Similar to HDFS and HBase, Kudu uses a single Master node to manage the metadata of the cluster, and any number of Tablet Server nodes (comparable to the RegionServer role in HBase) to store the actual data. Multiple Master nodes can be deployed to improve fault tolerance. The data of a table is divided into one or more Tablets, which are deployed on Tablet Servers to provide data read and write services.
1. Master Server
The Master Server is the leader of the Kudu cluster. Multiple Master Servers can be deployed to improve the fault tolerance of the cluster, but only one of them serves requests at any time and is responsible for managing the cluster and its metadata.
2. Tablet Server
Tablet Servers are the worker nodes of the Kudu cluster, and there can be any number of them. They are responsible for storing data and for reading and writing it. Tablets are stored on Tablet Servers. For a given Tablet, only one Tablet Server acts as the Leader and provides read and write services, while the other Tablet Servers holding that Tablet are Followers and only provide read services.
3. Table
A table in Kudu has a Schema and a Primary Key. It is split horizontally into multiple Tablets, which are stored on Tablet Servers.
4. Tablet
A tablet is a contiguous segment of a table, that is, a horizontal partition of the table. The primary key ranges of different tablets do not overlap, and together the tablets of a table cover the table's entire primary key range. Each tablet is stored redundantly as replicas on multiple Tablet Servers. At any time, only one Tablet Server is the Leader for a given tablet, and the others are Followers.
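Because the key ranges of tablets do not overlap, finding the tablet that owns a row is a simple range lookup. The following is a minimal Python sketch of that idea, not Kudu's actual implementation; the split points and the `find_tablet` helper are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

# Hypothetical split points (assumption, not taken from Kudu itself).
# Each tablet owns a half-open primary key range:
#   tablet 0: key < '20200401'
#   tablet 1: '20200401' <= key < '20200701'
#   tablet 2: key >= '20200701'
SPLIT_POINTS = ["20200401", "20200701"]


def find_tablet(primary_key: str) -> int:
    """Return the index of the tablet whose key range contains primary_key."""
    # bisect_right counts how many split points are <= primary_key,
    # which is exactly the index of the owning range.
    return bisect_right(SPLIT_POINTS, primary_key)


if __name__ == "__main__":
    for key in ("20200330", "20200401", "20200815"):
        print(key, "-> tablet", find_tablet(key))
```

Every key maps to exactly one tablet, which is what non-overlapping ranges that cover the whole key space guarantee.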
3. Characteristics
1. Importance
1. Much of the complexity of big data analysis comes from the limitations of the storage system. Kudu has far fewer such limitations, which makes big data analysis simpler to a certain extent.
2. New application scenarios call for a system like Kudu, as more and more applications focus on machine-generated data and real-time analytics.
3. Kudu is adapted to newer hardware environments, which brings higher performance and more flexibility for applications.
2. Ease of use
1. Provide functions and data models that are closer to RDBMS;
2. Provide RDBMS-like database table storage structure;
3. Allow users to insert, update and delete data in the same way as RDBMS.
3. Advantages
Kudu combines row-by-row inserts, low-latency random access, updates, and fast analytical scans, so it can serve both OLAP-style and OLTP-style workloads. Complex architectures that originally needed several storage systems working together can be replaced by a single storage system that holds all the data, which greatly simplifies the big data architecture.
4. Comparison with traditional relational databases
1. Like relational databases, Kudu tables have a unique primary key.
2. Common features in relational databases, such as transactions, foreign keys and non-primary key indexes, are currently not supported in Kudu.
3. Kudu has some OLAP and OLTP features, but lacks support for cross-row ACID (atomicity, consistency, isolation, durability) transactions.
4. Kudu can be classified as a Hybrid Transaction/Analytic Processing (HTAP) type database.
5. Kudu supports fast lookups by primary key and can analyze data while it is continuously being ingested, a scenario in which OLAP databases usually do not perform well.
6. Kudu's durability guarantee is closer to that of an OLTP database.
7. Kudu's Quorum capability could be used to implement a mechanism called Fractured Mirrors, in which one or two replicas use row storage and the other replicas use column storage. OLTP-type queries can then be executed on the row-store nodes and OLAP queries on the column-store nodes, mixing the two workloads.
5. Comparison with other big data components
1. HDFS is good at large-scale scanning, but not good at random reading. Strictly speaking, it does not support random writing. It can simulate random writing by merging, but the cost is very high.
2. HBase and Cassandra are good at random access, that is, reading and modifying data randomly, but perform poorly in large-scale scans.
3. Kudu's goal is to keep scan performance within a factor of two of HDFS, while bringing random read performance close to that of HBase and Cassandra. The concrete target is a random read/write latency within 1 ms on SSD.
4. Commonly used statements
1. Create a table
Kudu requires a primary key when creating a table, and the primary key cannot be null.
1. Create an ordinary table
create table test.test1 (
date_timekey string not null,
username string null,
product_qty string null,
primary key (date_timekey)
)
stored as kudu;
2. Create a partitioned table
create table test.test1 (
date_timekey string not null,
username string null,
product_qty string null,
primary key (date_timekey)
)
partition by range (date_timekey) (partition value = '20220417')
stored as kudu;
2. Delete the table
drop table if exists test.test1;
3. Query data
Note: When querying data, list only the columns you need instead of selecting every column. Because Kudu stores data by column, fewer columns means less data to read, which puts less pressure on the big data cluster and makes the system more robust.
select date_timekey,username from test.test1;
4. Add data
Note: Before inserting data into a partitioned table, the partition that covers the row must already exist.
insert into test.test1 (date_timekey,username)values('20200330','shuijianshiqing');
Note: The primary key of an inserted row cannot be null; otherwise the row is rejected and no data is written. For example, the following statement does not insert anything:
insert into test.test1 (date_timekey,username)values(null,'shuijianshiqing');
5. Update data
upsert into test.test1 (date_timekey,username)values('20200330','shuijianshiqing');
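The difference between insert and upsert can be modelled in a few lines. This is a toy Python sketch of the semantics only, assuming a single primary key column named `date_timekey` as in the table above; the in-memory `table` dictionary and both helper functions are hypothetical and have nothing to do with how Kudu stores rows.

```python
# A toy in-memory table keyed by primary key; it only models the semantics.
table = {}


def insert(row: dict) -> bool:
    """Add the row only if its primary key is non-null and not yet present."""
    key = row.get("date_timekey")
    if key is None or key in table:
        return False  # rejected: null or duplicate primary key
    table[key] = dict(row)
    return True


def upsert(row: dict) -> bool:
    """Insert the row, or update the existing row with the same primary key."""
    key = row.get("date_timekey")
    if key is None:
        return False  # a null primary key is rejected here as well
    # Only the columns present in `row` are overwritten on an existing row.
    table.setdefault(key, {}).update(row)
    return True
```

In this model, a second insert with the same key leaves the stored row untouched, while an upsert with the same key overwrites the supplied columns.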
6. Delete data
Note: Do not use a table alias when deleting data. For example, if the table is aliased as test.test1 t and the condition is written as t.date_timekey, the data is not deleted.
delete from test.test1 where date_timekey='20200328';
7. Add a single partition
alter table test.test1 add range partition value='20200325';
8. Delete a single partition
alter table test.test1 drop range partition value='20200325';
9. Add multiple partitions
alter table test.test1 add range partition '20200327'<=values<'20200331';
10. Delete multiple partitions
alter table test.test1 drop range partition '20200327'<=values<'20200331';
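The two partition forms used above, a single value and a range written as lower <= values < upper, can both be thought of as half-open intervals. The Python sketch below illustrates which keys would be covered after adding the partitions from sections 7 and 9. It is a simplified model under that assumption, and the names `single_value` and `covering_partition` are hypothetical. The comparison works because fixed-width date strings such as '20200327' sort in chronological order.

```python
from typing import List, Optional, Tuple

# Each partition is modelled as a half-open interval [lower, upper).
Partition = Tuple[str, str]


def single_value(value: str) -> Partition:
    """Model `value = x` as the smallest interval that contains only x."""
    return (value, value + "\0")


def covering_partition(partitions: List[Partition], key: str) -> Optional[Partition]:
    """Return the partition containing key, or None if no partition covers it."""
    for lower, upper in partitions:
        if lower <= key < upper:
            return (lower, upper)
    return None


# The partitions added in sections 7 and 9 above.
partitions = [single_value("20200325"), ("20200327", "20200331")]

if __name__ == "__main__":
    for key in ("20200325", "20200326", "20200330", "20200331"):
        print(key, "covered" if covering_partition(partitions, key) else "not covered")
```

Note that the upper bound is exclusive, so in this model a row with date_timekey '20200331' is not covered by the range partition and would need a partition of its own.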
11. Add a column
alter table test.test1 add columns(column_new string);
12. Delete columns
alter table test.test1 drop column column_new;
13. Modify the column name
Here username is the old column name and username_new is the new column name:
alter table test.test1 change column username username_new string;
14. Partitioning on multiple columns
1. Create a new table
drop table if exists test.test2;
create table test.test2 (
id string not null,
date_timekey string not null,
hour_timekey string not null,
username string,
password string,
interface_time string,
primary key (id,date_timekey,hour_timekey)
)
partition by range (date_timekey,hour_timekey) (partition value=('20200601','20200601 0730'))
stored as kudu;
2. Add a new partition
alter table test.test2 add range partition value=('20200601','20200601 0830');
3. Delete the partition
alter table test.test2 drop range partition value=('20200601','20200601 0830');
5. In plain language
Kudu is a storage engine that behaves much like an RDBMS: you can insert, delete, update, and query data, which makes big data analysis more convenient. Its storage is not built on HDFS; it manages its own storage directly on the Linux file system. The details of how Kudu reads and writes data will be introduced in a later article.
6. Others
Chicken soup for the soul: What the world fears most is the word "conscientiousness". It is worth a fortune, and no fortune can buy it!
A series of articles on getting started with big data
1. Introduction to Big Data - What is Big Data
2. Introduction to Big Data - Overview of Big Data Technology (1)
3. Introduction to Big Data - Overview of Big Data Technology (2)
4. Introduction to big data - understand Hadoop in three minutes
5. Introduction to big data - five minutes to understand HDFS
6. Introduction to big data - five minutes to understand Hive