Are you really familiar with distributed systems and transactions?

This article focuses on practice and implementation; it does not cover CAP, but it does touch on ACID.

This article is aimed at programmers with a basic grounding in distributed systems:

  1. It covers failover and recovery of nodes in a cluster.

  2. It deals with transactions and opaque transactions.

  3. It mentions Weibo and Twitter, and raises a big data problem.

Since distributed systems are too big a topic, and transactions are too big a topic, let's start from a small piece: a single node in a cluster.

Live nodes and synchronization in the cluster

In a distributed system, how do you judge whether a node is alive?
Kafka thinks about it like this:

  1. The node can talk to zookeeper. (It keeps its session with zookeeper alive through heartbeats.)

  2. If the node is a slave node, it must reflect the master node's data changes as faithfully as possible.
    In other words, after the master node writes new data, the slave must copy the changes in time; "in time" means it must not fall too far behind.

Nodes that meet these two conditions are considered alive, or in-sync.

Regarding the first point, everyone is familiar with heartbeats, so we can sketch how a node that can no longer talk to zookeeper gets detected, like this:

zookeeper-node:
var timer = 
new timer()
.setInterval(10sec)
.onTime(slave-nodes,function(slave-nodes){
    slave-nodes.forEach( node -> {
        boolean isAlive = node.heartbeatACK(15sec);
        if(!isAlive) {
            node.numNotAlive += 1;
            if(node.numNotAlive >= 3) {
                node.declareDeadOrFailed();
                slave-nodes.remove(node);

                // A callback also works: leader-node-app.notifyNodeDeadOrFailed(node)

            }
        }else 
        node.numNotAlive = 0;
    });
});

timer.run();

// You can use a callback, or simply check on a timer as below
leader-node-app:
var timer = 
new timer()
.setInterval(10sec)
.onTime(slave-nodes,function(slave-nodes){
    slave-nodes.forEach(node -> {
        if(node.isDeadOrFailed) {

        // The node can no longer talk to zookeeper

        }
    });
});

timer.run();

 

Regarding the second point, it is a little more complicated.
Let’s analyze it like this:

  • Data: the messages themselves.

  • Operations: the op-log.

  • Position: the offset.

// 1. First consider the messages
// 2. Then consider the log position or offset
// 3. Assume msg and offset are recorded on the same source database or storage device.
var timer = 
new timer()
.setInterval(10sec)
.onTime(slave-nodes,function(nodes){
    var core-of-cpu = 8;
    // Too slow? Go concurrent: mod-hash and go!
    nodes.groupParallel(core-of-cpu)
    .forEach(node -> {
        boolean nodeSucked = false;

        if(node.ackTimeDiff > 30sec) {
            // No reply within 30 seconds: the node is stuck
            nodeSucked = true;
        }
        if(node.logOffsetDiff > 100) {
            // The node's replication can't keep up: it is more than 100 records behind
            nodeSucked = true;
        }

        if(nodeSucked) {
            // In short, the node is "dead". Whether it is really dead, who knows? Network errors and node failures are normal in a distributed system.
            node.declareDeadOrFailed();
            // We're not playing with you anymore: the cluster kicks you out
            nodes.remove(node);
            // What to do about it? Fire an event.
            fire-event-NodeDeadOrFailed(node);
        }
    });
});

timer.run();

The state management of the nodes above is generally handled by zookeeper; the leader or master node also maintains that state.

The leader or master node in the application then only needs to pull the status from zookeeper. Is the implementation above necessarily the best? No, and most of the operations can be combined; but to describe whether a node is alive or not, writing it this way is enough.

What should I do if the node is dead, failed, or out of sync?

Well, we finally get to failover and recovery. Failover is relatively simple: there are other slave nodes, so data reads are not affected.

  1. What if multiple slave nodes fail at the same time?
    There is no 100% availability. The data center or machine room goes down, the network cable gets cut, a hacker breaks in and deletes your roots; in short, bad luck strikes.

  2. What if the master node fails? Won't master-master work?
    Keepalived, or LVS, or write your own failover.
    High-availability (HA) architecture is another big topic, and this article will not expand on it.

Let's focus on the recovery side, and broaden the view a little: not just a slave node tailing the log to resynchronize its data after a restart. In a real application, what do we do when a data request (read, write, or update) fails?

Everyone may say: retry, replay, or just leave it alone!
Fine, fine, these are all strategies, but do you really know how to do them?
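For concreteness, here is a minimal sketch of one such strategy, a bounded retry with exponential backoff; the limits, names, and the Callable-based shape are illustrative assumptions, not something taken from a real system in this article:

import java.util.concurrent.Callable;

// A minimal retry-with-backoff sketch; assumes maxAttempts >= 1.
public final class Retry {

    public static <T> T withBackoff(Callable<T> op, int maxAttempts, long initialDelayMs) throws Exception {
        Exception last = null;
        long delay = initialDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();                   // the read / write / update request
            } catch (Exception e) {
                last = e;                           // remember the failure
                if (attempt == maxAttempts) break;  // give up ("just leave it alone")
                Thread.sleep(delay);                // back off before replaying
                delay *= 2;                         // exponential backoff
            }
        }
        throw last;                                 // surface the last failure to the caller
    }
}

Replay is the same loop applied to a recorded request instead of a live one; "just leave it alone" is simply maxAttempts = 1.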

A big data problem

Let's set up the background for discussion:

Problem: a stream of messages, such as Weibo posts (or tweets), continuously flows into our application. Processing these messages comes with a requirement like this:

Reach is the number of unique people exposed to a URL on Twitter.

Count the total reach of a given post (URL) within 3 hours.

How to solve it?

Pull out the people who reposted the post (URL) in that period, pull out those people's followers, deduplicate them, and count the total. That count is the required reach.

For simplicity, let’s ignore the date and see if this method works:

/** ---------------------------------
* 1. Find the influencers (大V) who reposted the post (url).
* __________________________________*/

Method : getUrlToTweetersMap(String url_id)

SQL : /* Database A: table url_user stores the users who reposted a given url */
SELECT url_user.user_id as tweeter_id
FROM url_user
WHERE url_user.url_id = ${url_id}

Returns : [user_1,...,user_m]

 

/** ---------------------------------
* 2. Find the followers of those influencers
* __________________________________*/

Method : getFollowers(String tweeter_id);

SQL :   /* Database B */
SELECT users.id as user_id
FROM users
WHERE users.followee_id = ${tweeter_id}

Returns : the tweeter's followers

 

/** ---------------------------------
* 3. Compute Reach
* __________________________________*/

var url = queryArgs.getUrl();
var tweeters = getUrlToTweetersMap(url);
var result = new HashMap<String,Integer>();
tweeters.forEach(t -> {
    // You can batch with IN clauses + concurrent reads to optimize this call
    var followers = getFollowers(t.tweeter_id);

    followers.forEach(f -> {
        // dedupe via hash
        result.put(f.user_id,1);
    });
});

// Reach
return result.size();

 

Great! One way or another, Reach was computed!
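As the comment in step 3 hints, the follower lookups can be batched (an IN clause) and the batches read concurrently. Here is a minimal sketch, where getFollowersBatch is a hypothetical DAO method standing in for the batched SQL, and the batch size and pool size are made up:

import java.util.*;
import java.util.concurrent.*;

public class ReachBatched {

    // Hypothetical DAO: SELECT users.id FROM users WHERE users.followee_id IN (...)
    static List<String> getFollowersBatch(List<String> tweeterIds) { return Collections.emptyList(); }

    static int reach(List<String> tweeters) throws Exception {
        int batchSize = 100;                                           // illustrative batch size
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<List<String>>> futures = new ArrayList<>();
        for (int i = 0; i < tweeters.size(); i += batchSize) {
            List<String> batch = tweeters.subList(i, Math.min(i + batchSize, tweeters.size()));
            futures.add(pool.submit(() -> getFollowersBatch(batch))); // concurrent reads
        }
        Set<String> unique = new HashSet<>();                          // dedupe followers
        for (Future<List<String>> f : futures) {
            unique.addAll(f.get());
        }
        pool.shutdown();
        return unique.size();                                          // reach
    }
}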

In fact, this leads to a very important question, one that is often overlooked in discussions of frameworks, designs, and patterns: the relationship between performance and database modeling.

  1. How big is the amount of data?
    Do you have a feel for the database I/O this problem involves, or does it come as a shock?
    Computing reach is too intense for a single machine – it can require thousands of database calls and tens of millions of tuples.
    The database design above avoids JOINs. To speed up fetching the influencers' followers, the influencers can be grouped into batches (batch/bulk) and multiple batches read concurrently, hammering the database for all it's worth.
    Here the repost table for posts already lives in its own database, separate from the follower database. What if the data grows larger still?
    Shard the databases and tables...
    OK, assume you are already familiar with traditional relational-database sharding and data routing (aggregating on the read path, distributing on the write path), or with sharding middleware in general, or that you can combine the horizontal scalability of HBase with a consistent strategy for its secondary-index problem (a minimal routing sketch follows after this list).
    In short, assume storage and reads are solved. What about distributed computation?

  2. In an application like Weibo, the relationships between people form a graph (a web). How do you model and store it? And not just to answer this one query; for example: how close is someone's friend's friend to that person?
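As promised above, a minimal sketch of mod-hash data routing after sharding; the shard count and the choice of user_id as the routing key are illustrative assumptions:

public class ShardRouter {
    private final int shards;

    public ShardRouter(int shards) { this.shards = shards; }

    // Write path: one key maps to exactly one shard (e.g. table users_0 .. users_N-1).
    public int shardFor(long userId) {
        return (int) Math.floorMod(userId, (long) shards);
    }

    // Read path: a query without the routing key must fan out to every shard and aggregate.
    public int[] allShards() {
        int[] all = new int[shards];
        for (int i = 0; i < shards; i++) all[i] = i;
        return all;
    }
}

Storing this mod-hash value alongside the global ID (as mentioned near the end of this article) saves recomputing it on the read and write paths.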

Let's see how Storm (Trident) handles this as a distributed, streaming computation:

// url -> tweeters (influencers), backed by database 1
TridentState urlToTweeters =
    topology.newStaticState(getUrlToTweetersState());
// tweeter -> followers, backed by database 2
TridentState tweetersToFollowers =
    topology.newStaticState(getTweeterToFollowersState());

topology.newDRPCStream("reach")
    .stateQuery(urlToTweeters, new Fields("args"), new MapGet(), new Fields("tweeters"))
    .each(new Fields("tweeters"), new ExpandList(), new Fields("tweeter"))
    .shuffle() /* influencers have many followers, so distribute the work */
    .stateQuery(tweetersToFollowers, new Fields("tweeter"), new MapGet(), new Fields("followers"))
    .parallelismHint(200) /* many followers, so high parallelism */
    .each(new Fields("followers"), new ExpandList(), new Fields("follower"))
    .groupBy(new Fields("follower"))
    .aggregate(new One(), new Fields("one")) /* dedupe */
    .parallelismHint(20)
    .aggregate(new Count(), new Fields("reach")); /* compute the reach count */

At most once

Back to the topic. The example above was introduced for two reasons: one, to raise the question of distributed storage + computation; the other, to make a point:
programmers should pay attention to design and implementation, for example how Jay Kreps built Kafka on exactly this kind of wheel :]

Staying at the practical, programmer level: we mentioned recovery earlier, the problem of restoring a node. So how many things need to be recovered?

The basics:

  • Node status

  • Node data

This article looks at the problem from the data side. To keep it simple, consider the write path: if we use a write-ahead log to drive data replication and consistency, how do we deal with the consistency problem?

  1. New data is written to the master node.

  2. The slave node tails the log, preparing to copy this batch of new data. It does two things:
    (1). Write the data's id/offset to the log;
    (2). Just as it is about to process the data itself, the slave node goes down.

Then, by the liveness conditions above, the slave node is detected as down. It is restored, either by the maintenance staff or by itself. Before it rejoins the cluster to play with its friends, it must synchronize its own status and data.
Here comes the problem:

If the data is synchronized based on the offset recorded in the log, then because the node wrote the offset before processing the data, the batch of data (call it lost-datas) was never processed; resynchronizing from that offset onward means lost-datas is lost.

In this case, the data is said to be processed at most once, which means data can be lost.
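A minimal sketch of this ordering; every type here is a hypothetical placeholder, not a real library API:

interface Batch { long endOffset(); }
interface Log { Batch readFrom(long offset); }
interface OffsetStore { long load(); void save(long offset); }
interface Processor { void process(Batch batch); }

class AtMostOnceReplica {
    void replicateOnce(Log log, OffsetStore offsets, Processor processor) {
        Batch batch = log.readFrom(offsets.load()); // tail the master's log from the last offset
        offsets.save(batch.endOffset());            // step 1: persist the offset first
        processor.process(batch);                   // step 2: process the data;
                                                    // a crash between step 1 and step 2 loses this batch
    }
}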

At least once

Well, loss of data cannot be tolerated, so let's deal with it in another way:

  1. New data is written to the master node.

  2. The slave node tails the log, preparing to copy this batch of new data. It does two things:
    (1). Process the data first;
    (2). Just as it is about to write the data's id/offset to the log, the slave node goes down.

Here comes the problem:

If the slave node synchronizes by chasing the log, then because that batch of data (call it duplicated-datas) has already been processed but its offset is not yet reflected in the log, chasing the log this way processes that batch again: duplicated-datas gets duplicated.

In this scenario, semantically speaking, the data is processed at least once, which means data processing can be repeated.
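The same sketch with the two steps swapped, reusing the placeholder types from the previous sketch, gives at-least-once:

class AtLeastOnceReplica {
    void replicateOnce(Log log, OffsetStore offsets, Processor processor) {
        Batch batch = log.readFrom(offsets.load()); // tail the master's log from the last offset
        processor.process(batch);                   // step 1: process the data first
        offsets.save(batch.endOffset());            // step 2: persist the offset;
                                                    // a crash between step 1 and step 2 replays this batch
    }
}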


Exactly once

Transaction

So duplicated data can't be tolerated either? The requirements are getting high.
How do we achieve the strong consistency guarantee everyone is after (here, strictly speaking, eventual consistency)?
In other words, when updating data, how do we guarantee transactional behavior?
Suppose a batch of data arrives as follows:

// newly arrived data
{
    transactionId:4
    urlId:99
    reach:5
}

 

Now we want to apply this batch to the store or log; the existing state is:

// existing data
{
    transactionId:3
    urlId:99
    reach:3
}

 

If we can guarantee the following three points:

  1. Transaction ID generation is strongly ordered. (isolation, serial)

  2. The batch of data corresponding to a given transaction ID is always the same. (idempotence, one result no matter how many times it is applied)

  3. A single piece of data appears in exactly one batch. (consistency, no omissions and no duplication)

So, rest assured and boldly update:

// data after the update
{
    transactionId:4
    urlId:99
    //3 + 5 = 8
    reach:8
}

 

 

Note that this update writes the ID/offset and the data together, so what this operation requires is: atomicity.
Does your database provide atomicity? We'll touch on that a little later.

If the update succeeds, great. If the node goes down during the update, the id/offset is not written to the store or log and the data is not processed. When the node recovers, it can safely resynchronize and then rejoin the cluster.
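A minimal sketch of that read-compare-write step, assuming the store can persist the value and its transaction ID in a single atomic write; the class and field names are illustrative:

class TxState { long txId; long reach; }

class TransactionalUpdate {
    // Returns the new state to persist in one atomic write, or null if the batch was already applied.
    TxState apply(TxState stored, long batchTxId, long batchReach) {
        if (batchTxId <= stored.txId) {
            return null;                               // this transaction ID was already applied: skip
        }
        TxState updated = new TxState();
        updated.txId = batchTxId;                      // the ID and the value go to storage together
        updated.reach = stored.reach + batchReach;     // e.g. 3 + 5 = 8
        return updated;
    }
}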

So, is it still difficult to ensure that the data is processed exactly once?

So what's wrong with this implementation of "processed exactly once" semantics?

Performance issues.

 

We already use a batch strategy to cut the round-trip time to the store or disk, so where is the performance problem?

Consider: a master-master architecture is used to keep the master role available, but if one master node fails, it takes time to hand its work over to the other master.
Suppose a slave node is in the middle of synchronizing and, pop! the master node goes down. Because the exactly-once semantics must hold, atomicity kicks in: fail, roll back, and then re-pull the failed batch from the master node (you can't just re-apply a local copy, because that batch may have changed, or you never cached it at all). What's the result?

The old master node is down and the new master node has not taken over yet, so this transaction is stuck until the source of the synchronized data, the master node, can respond to requests again.

If you don't think about performance, just let it go, it's not a big deal.

Still not satisfied? Come on then, let's see what the "silver bullet" is.

 

Opaque-Transaction

Now, let's pursue such an effect:

A piece of data in a batch (each batch corresponds to one transaction) may fail to be processed in that batch, yet succeed as part of a later batch.
In other words, all that is still required is that the data within one batch shares the same transaction ID.

Take a look at an example: the same old data as before, but with one more field, prevReach.

// existing data
{
    transactionId:3
    urlId:99
    // note the extra field: it records the previous value of reach
    prevReach:2
    reach:3
}


// newly arrived data
{
    transactionId:4
    urlId:99
    reach:5
}

 

 

In this case, the new transaction's ID (4) is larger than the transaction ID in storage (3), so the new transaction can be applied. What are you waiting for? Update directly. The updated data is as follows:

// data after the update
{
    transactionId:4
    urlId:99
    // note: prevReach is updated to the previous value of reach
    prevReach:3
    //3 + 5 = 8
    reach:8
}

 

Now let’s look at another situation:

// existing data
{
    transactionId:3
    urlId:99
    prevReach:2
    reach:3
}

// newly arrived data
{
    // note: the transaction ID is 3, the same as in the existing data
    transactionId:3
    urlId:99
    reach:5
}

 

How do we handle this situation? Skip it? The new data's transaction ID is the same as the one in the store or log, so by the transaction rules this data should already have been processed. Skip?

No, this kind of thing cannot be guessed. Think about the properties we established. The key point is:

Given a batch of data, all of it belongs to the same transaction ID.

Carefully note the difference between that sentence and this one:
Given a transaction ID, at any time, the batch of data associated with it is the same.

So we should do this: since the newly arrived data carries the same transaction ID as the one in storage, this batch may already have been applied, perhaps partially or asynchronously. But every part of it carries that same transaction ID, and the stored prevReach is the value from before that transaction ID was applied, so even if part A of the batch was applied first, the previous value recorded for it is reliable.

Therefore, we update based on the value of prevReach rather than reach:

// data after the update
{
    transactionId:3
    urlId:99
    // this value is unchanged
    prevReach:2
    //2 + 5 = 7
    reach:7
}

 

What do you notice?
Different transaction IDs lead to different update results:

  1. When the incoming transaction ID is 4, greater than the stored transaction ID 3, reach is updated to 3 + 5 = 8.

  2. When the incoming transaction ID is 3, equal to the stored transaction ID 3, reach is updated to 2 + 5 = 7.

This is the Opaque Transaction.
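A minimal sketch of the opaque update rule, assuming transaction IDs are strongly ordered so the incoming ID is never smaller than the stored one; the names are illustrative:

class OpaqueState { long txId; long prevReach; long reach; }

class OpaqueUpdate {
    OpaqueState apply(OpaqueState stored, long batchTxId, long batchReach) {
        OpaqueState updated = new OpaqueState();
        updated.txId = batchTxId;
        if (batchTxId > stored.txId) {
            updated.prevReach = stored.reach;              // remember the value we start from
            updated.reach = stored.reach + batchReach;     // e.g. 3 + 5 = 8
        } else {
            // same transaction ID: the stored reach may reflect a partial earlier attempt,
            // so re-apply on top of the reliable prevReach instead
            updated.prevReach = stored.prevReach;          // unchanged
            updated.reach = stored.prevReach + batchReach; // e.g. 2 + 5 = 7
        }
        return updated;                                    // persist in one atomic write
    }
}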

This transaction capability is the strongest of the three: it allows transactions to be committed asynchronously, so there is no worry about getting stuck. To summarize, in a cluster:

Transaction:

  • Data is processed in batches, and each transaction ID always corresponds to the same, fixed batch of data.

  • Ensure that the generation of transaction IDs is strongly ordered.

  • Ensure that batches of data are not duplicated or omitted.

  • If a transaction fails and its data source is lost, subsequent transactions are stuck until that data source is restored.

Opaque-Transaction:

  • The data is processed in batches, and each batch of data has a definite and unique transaction ID.

  • Ensure that the generation of transaction IDs is strongly ordered.

  • Ensure that batches of data are not duplicated or omitted.

  • If a transaction fails and its data source is lost, subsequent transactions are not affected, unless their data sources are also lost.

In fact, the design of this global ID is an art in itself:

  • Redundantly embed the associated table's ID to reduce JOINs, so that lookup by ID is O(1).

  • Redundantly embed a date (long) field to avoid ORDER BY.

  • Redundantly embed filter fields to avoid the embarrassment of having no secondary index (HBase).

  • Store the mod-hash value so that application-level data routing is easy after the databases and tables are sharded.

This content is too much and the topic is too big, so I won't expand it here.

Now you can see why Twitter's Snowflake matters: it generates globally unique, ordered IDs.
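For illustration, a minimal sketch of a snowflake-style generator using the commonly cited layout (41-bit timestamp, 10-bit worker id, 12-bit sequence); this is an assumption-laden sketch, not Twitter's actual code:

class SnowflakeIds {
    private static final long EPOCH = 1288834974657L; // the commonly cited Twitter epoch, in ms
    private final long workerId;                       // 0..1023
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    SnowflakeIds(long workerId) { this.workerId = workerId & 0x3FF; }

    synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & 0xFFF;          // up to 4096 ids per millisecond per worker
            if (sequence == 0) {
                while ((now = System.currentTimeMillis()) <= lastTimestamp) { } // spin to the next ms
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = now;
        // globally unique and (roughly) time-ordered
        return ((now - EPOCH) << 22) | (workerId << 12) | sequence;
    }
}

Because the high bits are the timestamp, these IDs sort roughly by time, which is exactly the strong ordering the transaction IDs above rely on.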


Two-phase commit

Nowadays, using zookeeper to do two-phase commit is entry-level technology, so we won't expand on it here.

If your database does not support atomic operations, then consider two-phase commit.



Conclusion

To be continued.

 


Origin: blog.csdn.net/yunzhaji3762/article/details/84310996