Java interview must-test points - Lecture 09 (Part 1): Message Queue Kafka Architecture and Principles

This lesson mainly covers knowledge related to message queues and databases, focusing on three areas:

  1. Kafka’s architecture and message interaction process;

  2. the four ACID properties of database transactions and the classification of transactions;

  3. MySQL-related content, such as indexes and MySQL tuning.

Message queue and database knowledge points

Let’s first take a look at the summary of relevant knowledge points, as shown below. To avoid ambiguity, the "queue" mentioned in this lesson refers to the "message queue".

Message queue

Let’s look at the application scenarios of message queues, that is, what problems the queue can solve.

  • Queues can decouple applications and eliminate the need for direct calls between applications.

  • Messages can be delivered through queues to complete communication.

  • Queues can also be used to perform asynchronous tasks, and the task submitter does not need to wait for the results.

  • Another function of the queue is peak shaving and valley filling. When there is a sudden burst of traffic, the queue can act as a buffer so that the back-end service is not put under great pressure; when the peak passes, the accumulated data can be consumed gradually, filling the traffic valley.

  • Message queues generally also provide the ability to write once and read multiple times, which can be used for multicast and broadcast of messages.

There are two main messaging protocols to know about.

  • JMS (Java Message Service) specifies the API for using message services in Java. As mentioned in the earlier Spring lesson, Spring provides components that support JMS.

  • AMQP (Advanced Message Queuing Protocol) is an open standard application-layer protocol. AMQP is not tied to an API layer; it directly defines the data format exchanged over the network, so it naturally supports cross-language use. RabbitMQ, for example, implements AMQP.
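As a concrete example of the JMS API, below is a minimal sketch of sending and receiving a message with Spring's JmsTemplate. It assumes a Spring context with a configured JMS ConnectionFactory; the queue name "order.queue" is illustrative.

```java
// A minimal sketch of sending/receiving via JMS with Spring's JmsTemplate.
// Assumes a Spring context with a configured JMS ConnectionFactory;
// the queue name "order.queue" is illustrative.
import org.springframework.jms.core.JmsTemplate;

public class OrderSender {
    private final JmsTemplate jmsTemplate;

    public OrderSender(JmsTemplate jmsTemplate) {
        this.jmsTemplate = jmsTemplate;
    }

    public void send(String orderId) {
        // convertAndSend serializes the payload and sends it to the destination
        jmsTemplate.convertAndSend("order.queue", orderId);
    }

    public String receive() {
        // receiveAndConvert waits for a message (subject to the template's
        // receive timeout) and converts the body back to a String
        return (String) jmsTemplate.receiveAndConvert("order.queue");
    }
}
```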

Let’s compare several commonly used message queues.

  • RabbitMQ

An open source message queue developed in Erlang; stable and reliable data transmission is achieved through Erlang's Actor model. It supports multiple protocols such as AMQP, XMPP, and STOMP, so it is relatively heavyweight. Because of its Broker proxy architecture, data is queued in a central queue before being sent to clients. RabbitMQ's single-machine throughput is on the order of 10,000 messages per second, which is not particularly high.

  • ActiveMQ

It can be deployed in broker (proxy) mode or P2P mode, supports multiple protocols, and has single-machine throughput on the order of 10,000 messages per second. However, ActiveMQ is not lightweight enough, does not handle scenarios with a large number of queues well, and has a small probability of losing messages.

  • RocketMQ

Alibaba's open source message middleware, developed in Java, can support single-machine throughput on the order of 100,000 messages per second. It is characterized by high throughput and high availability and is suitable for large-scale distributed systems.

  • Kafka

A high-performance, cross-language distributed message queue developed in Scala; single-machine throughput can reach the order of 100,000 messages per second, with millisecond-level message latency. Kafka is a completely distributed system: Broker, Producer, and Consumer all natively and automatically support distribution, relying on ZooKeeper for coordination. Kafka supports one write and multiple reads, so messages can be consumed by multiple clients; messages may be duplicated but will not be lost. Kafka's architecture is introduced in detail later in this lesson.

Database middleware

Database middleware generally provides read-write separation and horizontal scaling of databases. Two middlewares are introduced below.

The first is Sharding-Sphere, an open source distributed database middleware solution. It consists of three independent products: Sharding-JDBC, Sharding-Proxy, and Sharding-Sidecar, suited to different usage scenarios. All of them provide standardized data sharding, read-write separation, flexible transactions, and data governance, and can be applied to diverse scenarios such as Java-homogeneous applications, heterogeneous languages, containers, and cloud native. Sharding-Sphere has entered the Apache incubator and is developing very quickly, so it is worth following.

The second is Mycat, which also provides capabilities such as database and table sharding. Mycat uses the Proxy mode, and its backend can support different databases such as MySQL, Oracle, and DB2. However, the proxy mode has some impact on performance.

There are other database middlewares, such as Vitess, that are less widely used; a general understanding of them is sufficient.

Database

For database-related knowledge points, you first need to know different types of databases.

  • Relational Database

Commonly used relational databases are Oracle and MySQL. Oracle is powerful; its main disadvantage is that it is expensive. MySQL is the most popular database in the Internet industry, and not only because it is free: MySQL can satisfy essentially everything you need in a relational database scenario. Some MySQL knowledge points will be introduced in detail later.

MariaDB is a branch of MySQL maintained by the open source community. MariaDB is regarded as a replacement for MySQL, and compared with MySQL it offers notable improvements in extended functionality and storage engines; it is worth keeping an eye on.

PostgreSQL, also called PGSQL, uses a multi-process architecture similar to Oracle's and can support highly concurrent application scenarios. PG supports almost all SQL standards and a wide range of data types. PG is better suited to strict enterprise application scenarios, while MySQL is better suited to Internet scenarios with relatively simple business logic and lower data reliability requirements.

  • NoSQL

NoSQL, which means Not only SQL, generally refers to non-relational databases.

Redis introduced in the previous lesson is a non-relational database, which provides persistence capabilities and supports multiple data types. Redis is suitable for scenarios where data changes quickly and the data size is predictable.

MongoDB is a database based on distributed file storage; it stores data as documents whose structure consists of key-value pairs. MongoDB is well suited to scenarios where the table structure is unclear and the data structure may keep changing; it is not suitable for scenarios requiring transactions or complex queries.

HBase is a distributed, column-oriented database built on HDFS, the Hadoop file system. Similar in design to Google's Bigtable, HBase provides fast random access to massive structured data. Data in a table is sorted by row key; a table has multiple column families, and each column family can have any number of columns. HBase relies on HDFS for reliable storage of massive data and suits scenarios with a large data volume, many writes and few reads, and no need for complex queries.

Cassandra is a highly reliable, large-scale distributed storage system. It supports distributed structured key-value storage and takes high availability as its primary goal. It suits write-heavy workloads and simple queries, but not data analysis or statistics.

Pika is a persistent, large-capacity, Redis-like storage service compatible with most Redis commands for the five major data structures. Pika uses disk storage mainly to solve the cost problem of large-capacity Redis storage.

  • NewSQL

NewSQL databases are also attracting more and more attention. NewSQL refers to a new generation of relational databases; a typical example is TiDB.

TiDB is an open source distributed relational database that is almost fully compatible with MySQL. It can support horizontal elastic expansion, ACID transactions, standard SQL, MySQL syntax and MySQL protocol, and has strong data consistency and high availability features. It is suitable for both online transaction processing and online analytical processing.

Another well-known NewSQL database is Ant Financial's OceanBase. OB is a database system that meets financial-grade reliability and data-consistency requirements; it is a good fit when transactions are needed and the data volume is relatively large. However, OB has been commercialized and is no longer open source.

Finally, let’s look at database normal forms. Relational databases have six normal forms: first normal form, second normal form, third normal form, Boyce-Codd normal form (BCNF), fourth normal form, and fifth normal form. The higher the normal form, the more stringent the requirements on data tables.

  • The least demanding, first normal form, only requires that every field in a table is atomic and cannot be split further.

  • Building on first normal form, second normal form requires that each record is uniquely identified by a primary key and that all attributes in the record fully depend on the primary key (no dependence on only part of a composite key).

  • Building on second normal form, third normal form requires that all attributes depend directly on the primary key; transitive (indirect) dependence is not allowed.

Generally speaking, the database only needs to meet the third normal form.

Detailed explanation of Kafka
Architecture

Let’s now study Kafka’s architecture, starting with several key concepts based on the architecture diagram below.

The Kafka messaging system is composed of three roles. On the left is the message producer (Producer); in the middle is the Kafka cluster, which consists of multiple Kafka servers, each called a Broker, i.e., a message broker; on the right is the message consumer (Consumer).

Messages in Kafka are divided by Topic; a Topic can be thought of as a queue. In practice, different business data can be assigned to different Topics. A Topic can have multiple consumers: when a producer sends a message to a Topic, all consumers subscribed to that Topic can receive it.

To improve parallelism, Kafka maintains multiple Partitions for each Topic, and each Partition can be regarded as an append-only log. Messages within a Partition have unique, ordered offsets, and new messages are continuously appended to the end. When a Partition stores data on disk, the log is split into segments by size, so writes always go to a relatively small file, which improves performance and eases management.

As shown in the middle part of the figure, Partitions are distributed on multiple Brokers. The green module in the figure indicates that Topic1 is divided into 3 Partitions. Each Partition will be replicated multiple times and exist on different Brokers, as shown in the red module in the figure. This can ensure disaster recovery when problems occur in the primary partition. Each Broker can save multiple Partitions of multiple Topics.

Kafka only guarantees message ordering within a single Partition; it cannot guarantee ordering across different Partitions of a Topic. To keep processing efficient, all reads and writes go to the primary (leader) Partition, while the other replica Partitions only copy data from it. Kafka maintains an ISR (in-sync replicas) set for each Partition in ZooKeeper; if a leader Partition becomes unavailable, Kafka selects a replica from the ISR set as the new leader.
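To make the Partition and replica concepts concrete, here is a minimal sketch that creates a Topic with 3 Partitions and a replication factor of 2 using Kafka's AdminClient. The broker address and topic name are illustrative.

```java
// A minimal sketch of creating a topic with multiple partitions and replicas
// via Kafka's AdminClient. Broker address and topic name are illustrative.
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

public class CreateTopicDemo {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions for parallelism, replication factor 2 for fault tolerance
            NewTopic topic = new NewTopic("topic1", 3, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```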

Message publishing/consuming process

Kafka supports one write and multiple reads of messages by grouping consumers. The process is shown in the figure below.

Let's look at the example in the figure. The Topic is divided into 4 Partitions, the green P1 to P4 in the figure. The producer in the upper part selects a Partition to write to according to a rule; the default rule is round-robin. A producer can also specify a Partition directly, or supply a key so that the Partition is chosen by the key's hash value.

There are three ways to send messages: synchronous, asynchronous, and oneway; a producer sketch follows the list.

  • In synchronous mode, the send result is obtained synchronously after the message is sent by the background thread. This is also the default mode.

  • The asynchronous mode allows producers to send data in batches, which can greatly improve performance, but will increase the risk of data loss.

  • The oneway mode only sends messages without returning the sending results. The message reliability is the lowest, but it has low latency and high throughput. It is suitable for scenarios that do not require high reliability.
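Here is a minimal producer sketch showing a key-based send both synchronously (blocking on the returned Future) and asynchronously (via a callback). The broker address, topic, key, and value are illustrative.

```java
// A minimal producer sketch showing synchronous and asynchronous sends.
// Broker address, topic, and key/value are illustrative.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

import java.util.Properties;

public class ProducerDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // With a key, the partition is chosen by the key's hash;
            // without a key, partitions are assigned round-robin style.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("topic1", "order-42", "payload");

            // Synchronous send: block on the returned Future to get the result
            RecordMetadata meta = producer.send(record).get();
            System.out.printf("sync -> partition=%d offset=%d%n",
                    meta.partition(), meta.offset());

            // Asynchronous send: register a callback instead of blocking
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace(); // handle/retry in real code
                }
            });
        }
    }
}
```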

Turning to message consumption: Consumers consume messages in Groups. Each message in a Topic can be consumed by multiple Consumer Groups, such as GroupA and GroupB in the figure above. Kafka guarantees that each Partition is consumed by only one Consumer within a Group. Kafka uses a Group Coordinator to manage which Partitions each Consumer is responsible for; Range and round-robin assignment strategies are supported by default.

For each Group, Kafka records the consumption offset of every Partition of every Topic (stored in ZooKeeper in older versions, and in an internal topic in newer versions), and tracks consumption progress by advancing the offset.
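The sketch below shows group-based consumption with manual offset commits; consumers started with the same group.id split the Topic's Partitions among themselves. All names are illustrative.

```java
// A minimal consumer sketch: consumers sharing a group.id divide the
// partitions of a topic among themselves. Names are illustrative.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ConsumerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "groupA"); // consumers sharing this id form one group
        props.put("enable.auto.commit", "false"); // commit offsets manually below
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("topic1"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value());
                }
                consumer.commitSync(); // advance the group's offsets after processing
            }
        }
    }
}
```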

Note: When using multiple threads to read messages, one thread is equivalent to a Consumer instance. When the number of Consumers is greater than the number of partitions, some Consumer threads will not be able to read data.

Detailed explanation of database transactions
Characteristics

The ACID properties of database transactions come up very frequently in interviews. Let’s look at the four properties, as shown in the figure below.

The first, atomicity, means that a transaction consists of an atomic sequence of operations: either all operations succeed, or they all fail and are rolled back.

The second, consistency, means that executing a transaction cannot break the integrity and consistency of the database: the database must be in a consistent state both before and after the transaction. For example, in a multi-table operation, after the transaction all tables hold either the new values or the old values, never a mixture.

The third, isolation, means that when multiple users access the database concurrently, each user's transactions must not be interfered with by the operations of other transactions; concurrent transactions must be isolated from one another. Transaction isolation levels are introduced below.

The fourth, durability, means that once a transaction is committed successfully, its changes to the data are permanent; the committed transaction will not be lost even if the database system fails.
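To make atomicity concrete, here is a minimal JDBC sketch of a transfer: both updates commit together or roll back together. The connection URL and the account table are illustrative.

```java
// A minimal JDBC sketch of atomicity: both updates commit together or
// roll back together. The connection URL and table/columns are illustrative.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TransferDemo {
    public static void transfer(String url, String user, String pass) throws SQLException {
        try (Connection conn = DriverManager.getConnection(url, user, pass)) {
            conn.setAutoCommit(false); // start a transaction
            try (PreparedStatement debit = conn.prepareStatement(
                         "UPDATE account SET balance = balance - 500 WHERE id = 'A'");
                 PreparedStatement credit = conn.prepareStatement(
                         "UPDATE account SET balance = balance + 500 WHERE id = 'B'")) {
                debit.executeUpdate();
                credit.executeUpdate();
                conn.commit(); // both updates become visible atomically
            } catch (SQLException e) {
                conn.rollback(); // on any failure, neither update takes effect
                throw e;
            }
        }
    }
}
```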

Concurrency issues

Before introducing transaction isolation levels, let's look at the concurrency problems that occur in a database without isolation, as shown in the left part of the figure below.

A dirty read means reading data written by another, not-yet-committed transaction. For example, account A transfers 500 yuan to account B; after B's balance has been increased but before the transaction commits, another request reading B's balance sees the increased value. That is a dirty read: if the transaction later fails and rolls back, B's balance should never have increased.

A non-repeatable read means that multiple queries of the same data within one transaction return different values, because other transactions modified the data and committed between the queries.

A phantom read means that running the same query twice in a transaction returns different result sets. The difference from a non-repeatable read is that a non-repeatable read sees different values for the same record, while a phantom read sees a different number of records under the same condition because records were inserted or deleted.

Isolation levels

The four isolation levels of transactions can solve the above concurrency problems. As shown on the right side of the figure above, from top to bottom, the four isolation levels are from low to high.

The first isolation level is read uncommitted, which means you can read the uncommitted content of other transactions. This is the lowest isolation level. Under this isolation level, the three concurrency problems mentioned above may occur.

The second isolation level is read committed: only data committed by other transactions can be read. This level solves the dirty read problem.

The third isolation level is repeatable read, which can ensure that the results of multiple reads of the same data during the entire transaction are the same. This level can solve the problems of dirty reads and non-repeatable reads. MySQL's default isolation level is repeatable read.

The last isolation level is serializable, the highest level: all transactions execute sequentially. This reduces concurrency and gives the worst performance, but it solves all of the concurrency problems above.
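In JDBC, the isolation level can be set per connection. Below is a minimal sketch, assuming the connection is obtained elsewhere; whether a given level is honored depends on the underlying database.

```java
// A minimal sketch of setting the isolation level on a JDBC connection.
// The connection is assumed to be obtained elsewhere.
import java.sql.Connection;
import java.sql.SQLException;

public class IsolationDemo {
    public static void useRepeatableRead(Connection conn) throws SQLException {
        // The four levels map to the JDBC constants:
        // TRANSACTION_READ_UNCOMMITTED, TRANSACTION_READ_COMMITTED,
        // TRANSACTION_REPEATABLE_READ, TRANSACTION_SERIALIZABLE
        conn.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ);
        System.out.println("isolation=" + conn.getTransactionIsolation());
    }
}
```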

Transaction classification

Next, look at the classification of transactions, as shown below.

The first is the flat transaction, in which all operations are at the same level; it is also the most commonly used kind. Its main limitation is that part of the transaction cannot be committed or rolled back independently: the whole transaction either commits or rolls back.

To address that limitation, there is the flat transaction with savepoints. It allows a transaction to roll back to an earlier state during execution rather than rolling back entirely: by inserting savepoints into the transaction, you can roll back to the most recent savepoint when an operation fails. A sketch follows.
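Here is a minimal JDBC sketch of a flat transaction with a savepoint: if a later step fails, the transaction rolls back only to the savepoint rather than entirely. The statements and table are illustrative.

```java
// A minimal sketch of a flat transaction with savepoints in JDBC.
// Statements and table names are illustrative.
import java.sql.Connection;
import java.sql.Savepoint;
import java.sql.SQLException;
import java.sql.Statement;

public class SavepointDemo {
    public static void run(Connection conn) throws SQLException {
        conn.setAutoCommit(false);
        try (Statement st = conn.createStatement()) {
            st.executeUpdate("INSERT INTO audit_log(msg) VALUES ('step 1')");
            Savepoint sp = conn.setSavepoint("afterStep1"); // mark progress
            try {
                st.executeUpdate("INSERT INTO audit_log(msg) VALUES ('step 2')");
            } catch (SQLException e) {
                conn.rollback(sp); // undo step 2 only; step 1 is preserved
            }
            conn.commit();
        } catch (SQLException e) {
            conn.rollback(); // roll back the whole transaction
            throw e;
        }
    }
}
```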

The third type is the chain transaction, which can be seen as a variant of the second. When a transaction commits, it implicitly passes the necessary context to the next transaction, so that on failure it can roll back to the most recent commit point. However, a chain transaction can only roll back to the latest savepoint, whereas a flat transaction with savepoints can roll back to any savepoint.

The fourth type is the nested transaction, composed of a top-level transaction and sub-transactions in a tree-like structure. Typically the top-level transaction handles logical management while sub-transactions do the concrete work. A sub-transaction can commit, but its effects only become permanent once the parent transaction commits; if an upper-level transaction rolls back, all of its sub-transactions roll back too.

The final type is the distributed transaction, which refers to a flat transaction running in a distributed environment.

Commonly used distributed transaction solutions are shown on the right side of the figure above, and are briefly introduced below.

The first distributed transaction solution is the XA protocol, a rigid-transaction approach that guarantees strong consistency; its implementations are two-phase commit (2PC) and three-phase commit (3PC). Two-phase commit requires a transaction coordinator to ensure that all participants complete the first, prepare phase; if the coordinator receives ready messages from all participants, it notifies them to perform the second, commit phase. In ordinary scenarios, two-phase commit handles distributed transactions well, but if even one process fails, two-phase commit can block the entire system for a long time. Three-phase commit reduces this blocking time by adding a pre-commit phase. Three-phase commit is rarely used in practice; a basic understanding is enough.

The second solution is TCC, a flexible-transaction approach that achieves eventual consistency. TCC uses a compensation mechanism: the core idea is that every operation must register corresponding confirm and compensate operations. It has three phases: the Try phase checks the business system and reserves resources; the Confirm phase confirms and commits against the business system; the Cancel phase rolls back and releases reserved resources when business execution fails. An illustrative sketch follows.
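The sketch below illustrates the Try-Confirm-Cancel idea as a plain Java interface; the names are hypothetical and not taken from any specific TCC framework.

```java
// An illustrative sketch of the TCC pattern: each business action registers
// a try/confirm/cancel triple. The interface and class names are hypothetical,
// not from any specific TCC framework.
public interface TccAction {
    boolean tryReserve(String txId);  // Try: check and reserve resources
    void confirm(String txId);        // Confirm: make the reservation final
    void cancel(String txId);         // Cancel: release reserved resources
}

class InventoryAction implements TccAction {
    @Override
    public boolean tryReserve(String txId) {
        // e.g. move stock from an "available" bucket into a "frozen" bucket
        return true;
    }

    @Override
    public void confirm(String txId) {
        // e.g. remove the frozen stock permanently
    }

    @Override
    public void cancel(String txId) {
        // e.g. move the frozen stock back to "available"
    }
}
```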

The third solution is message consistency. The basic idea is to put the local operation and the recording of the message to be sent into one local transaction, guaranteeing that the local operation and message sending either both succeed or both fail. Downstream applications subscribe to the message and perform the corresponding operation after receiving it. A sketch of this pattern follows.
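The sketch below illustrates the idea with a hypothetical "outbox" message table: the business row and the message row are written in one local transaction, and a separate relay (not shown) later publishes the message rows to the queue.

```java
// An illustrative sketch of the local-message (outbox) approach: the business
// update and the message record are written in one local transaction; a
// separate relay later publishes rows from the message table to the queue.
// Table names and columns are hypothetical.
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class OutboxDemo {
    public static void placeOrder(Connection conn) throws SQLException {
        conn.setAutoCommit(false);
        try (Statement st = conn.createStatement()) {
            // 1. the local business operation
            st.executeUpdate("INSERT INTO orders(id, status) VALUES ('o1', 'CREATED')");
            // 2. the message, persisted atomically with the business data
            st.executeUpdate("INSERT INTO outbox(topic, payload) VALUES ('order.created', 'o1')");
            conn.commit(); // either both rows exist or neither does
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}
```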

The fourth option is Alibaba Cloud's Global Transaction Service (GTS); its open source counterpart is Fescar (since renamed Seata). Fescar improves on two-phase commit by removing the requirement that the database support a distributed-transaction protocol. The prerequisite for using Fescar is that the resources involved in branch transactions must be relational databases supporting ACID transactions; branch commit and rollback rely on local transactions. Fescar's implementation currently has some limitations; for example, its transaction isolation level supports at most read committed.
