Architecture Design of Payment Platform

 

2018-04-29 · Li Yanpeng · Programmer Xiaohui

 

This article is reproduced from the public account Fastpay.

 

Author: Li Yanpeng, an Alibaba P8 technical expert. Xiaohui was fortunate to meet him at the QCon conference; he is highly skilled and humble.

 

 

 

Internet platform architecture has increasingly become the cornerstone of Internet development. For Java developers and architects, only by understanding the principles behind an architecture can they write higher-quality code, design better solutions, and deliver value on complex platforms; only then can they quickly find, locate, and solve problems in all kinds of scenarios.

This chat starts from payment platform architecture design reviews, explains the core points of a design review, and brings real cases to readers, helping them understand the importance, key points, and best practices of design review. In this chat you will learn the following:

  1. Demystifying the application practice of database locks in payment systems.

  2. How to scientifically set the thread pool.

  3. Best practices for cache usage.

  4. Database design essentials.

  5. A "blood case" caused by a line of code.

  6. Idempotency and anti-duplication.

  7. Multiple ways to implement distributed task scheduling.

 

 

Demystifying the Application of Database Locks in Payment Systems

 

Locks are usually used as a synchronization facility when multiple threads operate on a shared resource at the same time, to ensure the ordering and correctness of operations. In the author's opinion, the essence of a lock is queuing; different locks differ only in where and when the queuing happens. For example, Java's synchronized lock queues on the object header while the application processes business logic; a database lock queues inside the database while database operations are performed; and a distributed lock queues on a shared storage service while business logic is processed.

Optimistic Locking

 

Optimistic locking is based on an "optimistic" assumption: concurrent database operations are rare, in most cases updates proceed sequentially without conflict, and in the rare conflicting cases version control prevents dirty data. Concretely, no explicit lock is placed on the data. Instead, the order and correctness of operations are ensured by comparing a version number or timestamp. Before updating, the record's version or timestamp is read; at update time it is compared again. If it is unchanged, the update proceeds. If it differs, the data has been updated by another thread or client in the meantime, so the update stops; the latest data must then be re-read, the business logic re-applied, and the update attempted again.

Its pseudocode is as follows.

int version = executeSql("select version from ... where id = $id");
// process business logic
boolean succ = executeSql("update ... where id = $id and version = $version");
if (!succ) {
    // try again
}

At any moment, only one of the concurrent update requests succeeds and the others fail, so this approach suits low-concurrency scenarios. It is typically applied in traditional-industry ERP systems to prevent multiple operators from concurrently modifying the same piece of data. In some Internet companies, using optimistic locking and retrying on failure is an anti-pattern: the achievable concurrency never increases. Moreover, since this scheme is implemented at the application layer, it cannot prevent other programs from updating the database data directly.
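To make the retry loop concrete, here is a minimal JDBC sketch of the same compare-and-set pattern. The table name t_account, its columns, and the bounded retry count are illustrative assumptions; note that the update must also bump the version, which the pseudocode above leaves implicit.

import java.sql.*;

public class OptimisticUpdate {
    // Compare-and-set loop with a bounded retry; only one concurrent
    // updater wins per version, the others re-read and retry.
    public boolean updateWithRetry(Connection conn, long id, int maxRetries) throws SQLException {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            long version;
            try (PreparedStatement read = conn.prepareStatement(
                    "select version from t_account where id = ?")) {
                read.setLong(1, id);
                try (ResultSet rs = read.executeQuery()) {
                    if (!rs.next()) {
                        return false; // record not found
                    }
                    version = rs.getLong(1);
                }
            }
            // ... process business logic using the data read at this version ...
            try (PreparedStatement write = conn.prepareStatement(
                    "update t_account set version = version + 1 where id = ? and version = ?")) {
                write.setLong(1, id);
                write.setLong(2, version);
                if (write.executeUpdate() == 1) {
                    return true;  // our version matched: the update won
                }
                // lost the race: another client updated first; loop and retry
            }
        }
        return false;
    }
}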

Pessimistic Locking

 

Pessimistic locking is based on a "pessimistic" assumption: concurrent database operations are frequent, and conflict is the common case. The data is locked before it is updated, any other request to update it is blocked while the update is in progress, and the lock is released after the update completes. The lock here is a database-level lock.

This is usually implemented with the database's SELECT ... FOR UPDATE statement, as shown below.

executeSql("select ... where id = $id for update"); try {    // process business logic    commit(); } catch (Exception e) {    rollback(); }

Pessimistic locking is implemented at the database-engine level and blocks all conflicting database operations. However, to update one piece of data, the record must be locked in advance and held until processing completes and the transaction commits; only then can other requests update it. The performance of pessimistic locking is therefore relatively low, but because it guarantees strong consistency of the updated data, it is the safest way to work with the database. Some account and funds processing systems still use it, sacrificing performance in exchange for safety and the avoidance of capital risk.
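For concreteness, here is a minimal JDBC sketch of the same SELECT ... FOR UPDATE pattern with the transaction boundaries made explicit; the table and column names are hypothetical.

import java.sql.*;

public class PessimisticDebit {
    // Lock the row first, then update, then commit; the lock is held
    // for the whole transaction, which is why throughput is low.
    public void debit(Connection conn, long id, long delta) throws SQLException {
        conn.setAutoCommit(false);
        try {
            try (PreparedStatement lock = conn.prepareStatement(
                    "select amount from t_account where id = ? for update")) {
                lock.setLong(1, id);
                lock.executeQuery(); // the row is now locked until commit/rollback
            }
            try (PreparedStatement update = conn.prepareStatement(
                    "update t_account set amount = amount - ? where id = ?")) {
                update.setLong(1, delta);
                update.setLong(2, id);
                update.executeUpdate();
            }
            conn.commit();           // releases the lock
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}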

Row-Level Locks

 

Not all update operations require explicit locks. The database engine itself provides row-level locks and performs synchronized, mutually exclusive updates on row data. We can use this row-level lock to keep the locking time window to a minimum while still guaranteeing the validity of data updates in high-concurrency scenarios.

A row-level lock is taken by the engine itself when a record is updated; it is part of the database engine. When the engine updates a piece of data, it locks that record, so even concurrent update requests cannot produce dirty data. The granularity of a row-level lock is very fine and the locking window minimal: the record is locked only while it is actually being updated. Conflicts between database operations are therefore greatly reduced, the probability of lock contention is lowest, and the achievable concurrency is highest.

Row-level locks are typically used when deducting inventory: the database engine itself serializes updates to the record to keep them safe, while the condition in the WHERE clause ensures that inventory never drops below 0. This effectively prevents overselling, as shown in the following code.

boolean result = executeSql("update ... set amount = amount - 1 where id = $id and amount > 1");if (result) {    // process sucessful logic} else {    // process failure logic}

Another scenario is using row-level locks for state transitions. For example, in a transaction engine the state may flow only from init to doing; any repeated init-to-doing transition, or a transition to another state such as init to finished, will fail. The code is as follows.

boolean result = executeSql("update ... set status = 'doing' where id = $id and status = 'init'"); if (result) {    // process sucessful logic } else {    // process failure logic }

Row-level locks allow the highest concurrency and the best performance, and they suit high-concurrency scenarios such as deducting inventory and controlling the direction of state flow.

Some will object that this method cannot guarantee idempotency: in a balance-deduction scenario, for instance, repeated submissions may deduct more than once. That risk does exist, but there is a remedy: record the deduction history, and when a non-idempotent case occurs, use the recorded history to check and correct it. The same approach applies to scenarios such as accounting history. The code is as follows.

boolean result = executeSql("update ... set amount = amount - 1 where id = $id and amount > 1"); if (result) {    int amount = executeSql("select amount ... where id = $id");    executeSql("insert into hist (pre_amount, post_amount) values ($amount + 1, $amount)");    // process successful logic } else {    // process failure logic }

In payment platform architecture design reviews, row-level locks are usually recommended for controlling the status flow of the transaction and payment systems' journal tables, for status control in the account system, and for updating settlement-split and refund balances; optimistic locks and pessimistic locks are not recommended.

 

How to scientifically set the thread pool


An online high-concurrency service is like a levee standing silently beside a great river, ready at any moment to withstand the impact of a flood. The thread pools of such services also cause many problems: pools filling up, high CPU utilization, hung service threads, and so on. These all stem from improper use of the thread pool, or from failing to put protection and degradation in place.

Of course, some readers do think about protecting the thread pool, but have you ever had the experience that setting too many threads in the pool actually lowers performance? How, then, should thread pools be set?

Over years of reviewing designs, the author has learned that most developers set the number of threads by experience and intuition, then adjust it according to the online situation, and eventually land on the most suitable value. This trial-and-error approach sometimes works and sometimes does not, and even when it works, finding the best setting can come at great cost.

In fact, thread pool settings are well-founded and can be derived from theoretical calculation.

First consider the ideal case, in which all tasks are pure computation. The number of threads should then equal the number of CPU cores, so that each core runs one thread with no thread switching, and efficiency is highest. This, of course, is the ideal situation.

In this case, if we want to reach a certain QPS, we use the following formula.

Number of threads = target QPS / (1 / actual task processing time) = target QPS × actual task processing time

For example, with a target QPS of 100 and an actual task processing time of 0.2 s, we need 100 × 0.2 = 20 threads, and these 20 threads must be backed by 20 physical CPU cores, otherwise the estimated QPS target will not be reached.

In reality, however, besides in-memory computation, online services mostly access databases, caches, and external services, spending most of their time waiting for IO.

If there are many IO tasks, we calculate with the following Amdahl's-law-style formula.

Number of threads = number of CPU cores × (1 + IO time / computation time)

For example, with a 4-core CPU where IO accounts for 80% of each task (an IO-to-computation ratio of 4), we get 4 × (1 + 4) = 20 threads, and these 20 threads correspond to 4 CPU cores.

Besides the number of threads, the size of the thread queue is also very important, and it too can be calculated theoretically: derive the queue size from the target response time.

Queue size = number of threads × (target response time / actual task processing time)

For example, with a target response time of 0.4 s, the bounded queue length works out to 20 × (0.4 / 0.2) = 40.
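Putting the formulas together, the following sketch wires the numbers from these examples (20 threads, a bounded queue of 40) into a Java ThreadPoolExecutor; the rejection policy chosen here is an illustrative assumption, not a recommendation from the original text.

import java.util.concurrent.*;

public class PoolSizing {
    public static void main(String[] args) {
        // From the formulas above: target QPS = 100, task time = 0.2 s -> 20 threads;
        // target response time = 0.4 s -> queue size = 20 * (0.4 / 0.2) = 40.
        int threads = (int) (100 * 0.2);               // threads = target QPS * task time
        int queueSize = (int) (threads * (0.4 / 0.2)); // bounded queue per the formula

        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads,                        // fixed size: max == min for a single-purpose service
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),     // always a bounded queue
                new ThreadPoolExecutor.CallerRunsPolicy()); // illustrative rejection strategy

        pool.submit(() -> System.out.println("task executed"));
        pool.shutdown();
    }
}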

In addition, we observe the following best practices when configuring thread pools.

  1. Thread pool usage must take into account both the maximum and the minimum number of threads.

  2. For a single-purpose service, the maximum number of threads should equal the minimum; for mixed services, the gap between maximum and minimum can be widened appropriately to balance overall CPU core utilization.

  3. The thread queue must be a bounded queue, otherwise load pressure will drag down the entire service.

  4. Use a thread pool only when necessary, and perform design-time performance evaluation and stress testing.

  5. The thread pool's rejection strategy, and compensation after failure, must be considered.

  6. Background batch services must be separated from online user-facing services.

 

 

Best Practices for Cache Usage

 

During design reviews, the author has summarized the following best practices from developers designing cache systems.

Best Practice 1

A cache system mainly consumes server memory. Therefore, before using a cache, evaluate the data size the application needs to cache, including the cache data structures, cache size, number of cached items, and cache expiration times; then project capacity usage for some period ahead based on business conditions, and request and allocate cache resources according to that capacity evaluation. Otherwise resources will be wasted, or cache space will run short.

Best Practice 2

Separate the services that use the cache: use different cache instances for core and non-core services to isolate them physically. If possible, use a separate instance or cluster for each service to reduce the chance of applications affecting each other. The author often hears of companies using a shared cache, causing cached data to be overwritten and cache contents to become disordered, leading to online incidents.

Best Practice 3

Derive the number of cache instances an application needs from the memory size each instance provides. A company generally has a cache management and operations team that virtualizes cache resources into multiple instances of identical memory size; for example, each instance has 4 GB of memory, and an application can request as many instances as it needs, in which case the application must shard its data. Note that if the RDB backup mechanism is used and each instance uses 4 GB of memory, the system needs more than 8 GB of memory per instance, because RDB backup relies on copy-on-write: it forks a child process and may copy memory pages, so up to double the memory size must be reserved.

Best Practice 4

A cache is generally used to speed up database reads: the cache is accessed first, then the database, so the cache access timeout setting is very important. At one Internet company, the author saw an operations mistake set the cache timeout too long, which dragged down the service's thread pools and ultimately caused a service avalanche.

Best Practice 5

Every cache instance needs monitoring. This is very important: we need reliable monitoring of slow queries, large objects, and memory usage.

Best Practice 6

If multiple businesses share one cache instance (not a situation we recommend, but one that often arises for cost-control reasons), require through standards that every key carry a unique application prefix, and design for isolation, to avoid caches overwriting each other.

Best Practice 7

Every cached key must have an expiration time, and expiration times must not be concentrated at one point; otherwise the cache will fill up memory, or mass simultaneous expiry will send a flood of requests through to the database.
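A minimal sketch of this practice, assuming a Redis cache accessed through a recent Jedis client; the key, base TTL, and jitter range are illustrative values.

import java.util.concurrent.ThreadLocalRandom;
import redis.clients.jedis.Jedis;

public class CacheTtlJitter {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            int baseTtlSeconds = 3600;                                    // 1 hour base expiry
            int jitterSeconds = ThreadLocalRandom.current().nextInt(300); // up to 5 minutes of jitter
            // Spreading expiry times avoids a thundering herd when many keys
            // would otherwise expire at the same instant.
            jedis.setex("order:123", baseTtlSeconds + jitterSeconds, "cached-value");
        }
    }
}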

Best Practice 8

Data that is accessed infrequently should not be put in the cache; as we said before, the main purpose of the cache is to improve read performance. A developer once designed a scheduled batch-processing system that used a large data model for its computation, stored that model in each node's local cache, and kept it current by consuming update messages from a message queue; but the model was used only once a month, so using the cache this way was very wasteful. Since it is a batch task, the right approach is to partition the task and process it in batches, using divide-and-conquer, stepwise computation to obtain the final result.

Best Practice 9

A single cached value should not be too large, especially in Redis: because Redis uses a single-threaded model, processing an oversized value for one cache key blocks other requests.

Best Practice 10

For keys holding many elements, try not to use full-collection operations such as HGETALL, which block the instance and affect other applications' access.

Best Practice 11

The cache is generally used to speed up queries in transactional systems. When there is a large volume of data to update, especially in batch processing, use batch mode, although this scenario is rare.

Best Practice 12

If performance requirements are not very high, prefer a distributed cache over a local cache, because a local cache is replicated across the service's nodes, and at any given moment the replicas may be inconsistent. If the cached value represents a switch, and requests in a distributed system may be retried, a repeated request can land on two nodes where one node's switch is on while the other's is off; if request processing is not idempotent, this causes duplicate processing and, in severe cases, financial loss.

Best Practice 13

When writing to the cache, write only fully correct data. If part of the data is valid and part invalid, rather give up caching than write partial data into the cache; otherwise null pointers, program exceptions, and the like will follow.

Best Practice 14

Under normal circumstances, the read order is cache first, then database; the write order is database first, then cache.
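A minimal cache-aside sketch of these two orderings, again assuming a Jedis client; loadFromDb and writeToDb stand in for hypothetical database accessors.

import redis.clients.jedis.Jedis;

public class CacheAside {
    private final Jedis jedis = new Jedis("localhost", 6379);

    // Read order: cache first, then database; repopulate the cache on a miss.
    public String read(String key) {
        String value = jedis.get(key);
        if (value == null) {
            value = loadFromDb(key);           // hypothetical database accessor
            if (value != null) {
                jedis.setex(key, 3600, value); // always re-cache with an expiry
            }
        }
        return value;
    }

    // Write order: database first, then cache. (Some teams instead delete the
    // key here and let the next read repopulate it.)
    public void write(String key, String value) {
        writeToDb(key, value);                 // hypothetical database accessor
        jedis.setex(key, 3600, value);
    }

    private String loadFromDb(String key) { return null; }  // placeholder
    private void writeToDb(String key, String value) { }    // placeholder
}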

Best Practice 15

When using a local cache (such as Ehcache), strictly control the number and lifetime of cached objects. Given the characteristics of the JVM, too many cached objects significantly hurt JVM performance and can even cause problems such as out-of-memory errors.

Best Practice 16

When using a cache, there must be degradation handling, especially on critical business links: when the cache has problems or misses, the request must fall back to the database for processing.

For best practices and online examples of cache usage, please refer to Chapter 4 of the book "Scalable Service Architecture: Frameworks and Middleware", which is expected to be available in March 2018.

 

Database Design Essentials

 

Indexes

 

The first design point to discuss is the use of database indexes. In an online service, every database query must hit an index. This is the bottom line: do not skip indexes because the data volume is temporarily small, since growth in data volume will then cause performance problems. Generally, every developer is aware of creating and using indexes; the problems arise in how they are used. To ensure an index is effective, we must ensure online queries actually reach it. A low-level error of this kind has happened: a scenario required a combined query on the three fields A, B, and C, and the developer created three separate indexes on A, B, and C. This looked compliant, but in fact only the index on A was used; B and C were not. The mistake was discovered only later, when performance problems forced a code review.

We recommend that every developer check the execution plan of the SQL they write. In addition, SQL and indexes must be reviewed by a DBA before going online.

Also, in typical databases, >=, BETWEEN, IN, and LIKE can use indexes, but NOT IN cannot, and a LIKE pattern that starts with % cannot either; these rules must be remembered.
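To illustrate the combined-query mistake described above, here is a short SQL sketch assuming a hypothetical table t_order with columns a, b, and c.

-- Wrong: three single-column indexes; for WHERE a = ? AND b = ? AND c = ?
-- the optimizer will typically use only one of them.
CREATE INDEX idx_a ON t_order (a);
CREATE INDEX idx_b ON t_order (b);
CREATE INDEX idx_c ON t_order (c);

-- Right: one composite index covering the combined query; by the
-- leftmost-prefix rule it also serves queries on (a) and (a, b).
CREATE INDEX idx_a_b_c ON t_order (a, b, c);

-- Verify with the execution plan before going online.
EXPLAIN SELECT id FROM t_order WHERE a = 1 AND b = 2 AND c = 3;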

Range Queries

 

Any range query against the database must cap the maximum size of the result set and then paginate. Do not ship development-phase SQL just because the data volume is temporarily small; otherwise, once data grows, the result set becomes too large and the application runs out of memory.

Here's how mainstream databases limit the size of a result set.

DB2

FETCH FIRST 100 ROWS ONLY

SELECT id FROM (SELECT ROW_NUMBER() OVER() AS num, id FROM TABLE) A WHERE A.num >= 1 AND A.num <= 100

MySQL

LIMIT 1, 100

Oracle

ROWNUM

Schema Changes

For database schema changes, we recommend only adding fields, never modifying or deleting them. The risk of modifying or deleting fields is too high, especially when the application is complex and the database and application are designed additively. If you do not know every application using the database, do not lightly change the existing data structure: modifying fields can make code and database incompatible.

Even though only adding fields is allowed, we impose the following rules.

New code should be compatible with old data, and old code should be compatible with new data.

Try to keep old and new code fully compatible with both old and new database schemas, so that no problems arise before, during, or after a database upgrade.

When adding a field enumeration value, or changing the meaning, format, or constraints of a database field, consider the behavioral differences between staging and production, and between old and new versions during rollout. It has happened that an enumeration value was added during a version update: the Boss back office went online first and produced the new value, while the not-yet-updated trading program did not recognize it and threw a processing exception. New enumeration values must be used with caution.

Transactions

 

It often happens that a remote service is called inside a database transaction, and a remote-service timeout prolongs the transaction, which can paralyze the database. Therefore, during transaction processing, calls that may block the thread, such as lock waits and remote calls, are forbidden.

In addition, keep transactions as short as possible. A single transaction should not contain too many operations or do too much work: a long-running transaction affects or blocks other requests, and the accumulation can bring the database down; its data operations widen the scope and impact of locks, potentially blocking other database operations and causing temporary unavailability.

Therefore, where the business allows, replace long transactions with short ones, reduce transaction execution time and lock duration, and rely on eventual consistency to keep data consistent.

We recommend the structure in which the remote call is made only after the transaction has committed; the structure that performs a remote call inside an open transaction must not be used. A sketch follows.
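A minimal JDBC sketch of the recommended structure; t_order, the status update, and notifyPayment are illustrative assumptions. The forbidden structure differs only in making the remote call before commit, while row locks are still held.

import java.sql.*;

public class ShortTransaction {
    // Recommended: commit first, then make the remote call, and use eventual
    // consistency (e.g., a compensating task) if the remote call fails.
    public void payAndNotify(Connection conn, RemoteService remote, long orderId) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(
                "update t_order set status = 'paid' where id = ?")) {
            ps.setLong(1, orderId);
            ps.executeUpdate();
            conn.commit();               // row locks are released here
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
        // Forbidden variant: calling this BEFORE commit(), inside the open
        // transaction, lets a slow remote service hold database locks.
        remote.notifyPayment(orderId);
    }

    interface RemoteService { void notifyPayment(long orderId); }
}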

SQL Security

 

All SQL must be parameterized to prevent SQL injection. This is a bottom-line principle that cannot be compromised.
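A minimal JDBC illustration of the parameterized-SQL rule; the table and column names are hypothetical.

import java.sql.*;

public class SafeQuery {
    // Parameterized SQL: user input is bound, never concatenated into the statement.
    public String findDisplayName(Connection conn, String name) throws SQLException {
        // Injectable and forbidden:
        // "select display_name from t_user where name = '" + name + "'"
        try (PreparedStatement ps = conn.prepareStatement(
                "select display_name from t_user where name = ?")) {
            ps.setString(1, name);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }
}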

 

"Blood Case" Caused by a Line of Code

 

Design reviews of a payment platform must be done very carefully, because the slightest oversight can cause problems and even lead to loss of funds.

While troubleshooting an issue, a log entry was found to be missing, so one line of logging was added.

log.info(... + obj); 

Unfortunately, after this went online the application failed across the board and transactions failed. Inspecting the code revealed intermittent NullPointerExceptions, and analysis showed they occurred inside the obj.toString() method.

The obj.toString() method is shown below.

private Object fld1;

......

public String toString() {
    return ... + this.fld1;
}

We can see that obj.toString() uses the field fld1 directly. Tracing the cause showed that the object had been deserialized from the cache, and during deserialization this field was left null; using the null field inside toString() then triggered the NullPointerException. (Note that string concatenation by itself renders null as the text "null", so the exception indicates that the elided code actually dereferenced the field.)

So we see that online code and environments are very complex. When doing a design review, consider every situation, think through the impact as comprehensively as possible, and minimize the availability risk introduced by code changes.

 

Idempotency and Anti-duplication


Although idempotency and anti-duplication sound complicated, they are very simple to implement, which matches a saying of the author's: a method that effectively solves the problem is a good method, no matter how clumsy it looks.

Idempotency is a property: an operation is idempotent when performing it multiple times yields the same result as performing it once. Expressed as a mathematical formula:

f(f(x)) = f(x)

For some services, the operation itself is idempotent by nature, for example deleting a resource, querying a resource, or creating a resource with a caller-supplied unique identifier.

Anti-duplication is a way to achieve idempotency, and there are many approaches.

  1. Use a database table's unique key to filter out duplicate requests. This is typically used when adding records: as long as the record has a unique key, the approach works (see the sketch after this list).

  2. Use the directionality of state flow to filter duplicates, usually with the row-level lock shown above. Typically, when a callback message is received and the record's state must be updated, perform a row-level-locked database update and let its success or failure decide whether the business logic continues. For example, on receiving a payment-success message, first update the payment record from init to pay_finished; a duplicate request's second update will then fail.

  3. Use distributed storage to filter requests; this is more expensive to implement.
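Here is the sketch of approach 1: a minimal JDBC version of unique-key de-duplication, assuming a hypothetical table t_payment_request with a unique key on request_no.

import java.sql.*;

public class DedupInsert {
    // The unique constraint on request_no makes the insert itself the filter:
    // only the first arrival succeeds, duplicates are rejected by the database.
    public boolean acceptOnce(Connection conn, String requestNo) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "insert into t_payment_request (request_no, status) values (?, 'init')")) {
            ps.setString(1, requestNo);
            ps.executeUpdate();
            return true;    // first arrival: go on to process the request
        } catch (SQLIntegrityConstraintViolationException e) {
            return false;   // duplicate request: return the previous result instead
        }
    }
}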

 

 

Various methods for implementing distributed task scheduling

 

Use a mature framework

 

Mature open-source distributed task scheduling systems, such as TBSchedule and ElasticJob, can be used.

For details, please refer to Chapter 6 of "Scalable Service Architecture: Frameworks and Middleware".

Implement it yourself

 

If you don't like using a mature framework and prefer to reinvent the wheel, or your platform forbids introducing external open-source projects, then it is time to show some skill and develop a distributed task scheduling system yourself.

In fact, the core of a distributed task scheduling system is task preemption. This resembles operating-system task scheduling, though the scenarios differ: the operating system schedules tasks submitted by application processes, while our distributed task scheduler handles background timed tasks in a service-based system.

Suppose we have 4 background timed-service nodes and 4 tasks stored in the database's task table. Initially all tasks are idle, their owner field is empty, and the 4 servers have no work to do.

At some point the service nodes' timers fire and the nodes begin to preempt tasks. Preempting a task means updating the record's status field and owner in the database, generally using the database's row-level lock, as in the following code.

boolean result = executeSql("update ... set status = 'occupied' and owner = $node_no where id = $id and status = 'FREE' limit 1");if (result) {
    Task t = executeSql("select ... where status = 'occupied' and owner = $node_no");    // process task t    executeSql("update ... set status = 'finished' and owner = null where id = $t.id and status = 'occupied'); } 

Suppose service node 1 preempts task 1, node 2 preempts task 2, node 3 preempts task 3, and node 4 preempts task 4; each then processes its own task. After processing, each sets its task's status to finished, so no other node will preempt that task.

Of course, this only describes the core idea. A concrete implementation needs detailed design, covering how tasks are scheduled, how task timeouts are handled, and so on.

Use Dubbo or another load-balanced service platform

 

If the platform forbids third-party open-source components, and developing your own is too time- and labor-intensive, there is another way. It may not look like the best method, but it gets task sharding working quickly.

We can implement this with Dubbo servitization, or with any service platform that has load balancing. We develop two services on each service node: a master control service, which receives the distributed timer's trigger events, fetches tasks from the database, and then distributes them by calling the task-processing service through the service framework, relying on its load balancing.

For example, suppose the distributed timer invokes the master control service on service node 2. The master control service fetches the tasks from the database, divides them into 4 shards, and invokes the task-processing interface through the service framework. Because the framework load-balances these calls, the 4 shards are spread evenly across service nodes 1, 2, 3, and 4, as sketched below.
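A minimal sketch of this dispatch pattern; TaskProcessor stands in for a hypothetical Dubbo-referenced service interface whose invocations the framework load-balances across nodes, and the helper methods are placeholders.

import java.util.ArrayList;
import java.util.List;

public class MasterControlService {
    // Hypothetical Dubbo service reference: each invocation goes through the
    // framework's load balancing, which is what spreads shards across nodes.
    private TaskProcessor taskProcessor;

    // Invoked by the distributed timer on whichever node it fires.
    public void onTimerFired() {
        List<Task> tasks = fetchTasksFromDb();               // placeholder accessor
        List<List<Task>> shards = splitIntoShards(tasks, 4); // 4 shards for 4 nodes
        for (List<Task> shard : shards) {
            taskProcessor.process(shard); // lands on a node chosen by the LB policy
        }
    }

    interface TaskProcessor { void process(List<Task> shard); }
    static class Task { }

    private List<Task> fetchTasksFromDb() { return new ArrayList<>(); }

    private List<List<Task>> splitIntoShards(List<Task> tasks, int n) {
        List<List<Task>> shards = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            shards.add(new ArrayList<>());
        }
        for (int i = 0; i < tasks.size(); i++) {
            shards.get(i % n).add(tasks.get(i)); // round-robin assignment
        }
        return shards;
    }
}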

Of course, this method requires isolating background timed tasks from foreground services; not affecting normal online services is the bottom line.

 

 

—————END—————

 



The official account Fastpay is a boutique official account for the third-party payment industry, providing business knowledge, architecture planning and implementation, core technical points, and best practices for third-party payment.

 
