How can you shorten the execution time of a scheduled task that processes a large amount of data in one centralized batch?

Author: Shen Jian 58

The problem, in abstract form:
(1) users belong to a membership system;
(2) users generate score flow records (a ledger of score changes); once a month the scores are totaled, and different business processing is applied to members in different score tiers;
Data assumptions:
(1) assume about 1 million users;
(2) assume users generate score flow records every day, adding about 1 million records per day, which is about 30 million new records per month and close to 100 million records over three months;
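The volumes above follow from simple arithmetic. A quick check, assuming a 30-day month (the constant names are illustrative only):

```python
# Back-of-the-envelope data volumes, assuming a 30-day month.
USERS = 1_000_000             # about 1 million users
RECORDS_PER_DAY = 1_000_000   # about 1 million new score flow records per day
DAYS_PER_MONTH = 30           # assumption for illustration

records_per_month = RECORDS_PER_DAY * DAYS_PER_MONTH
records_three_months = records_per_month * 3

print(records_per_month)      # 30000000
print(records_three_months)   # 90000000
```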
Common solution:
Use a scheduled task that runs once on the first day of each month.

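// (1) Query all users
uids[] = select uid from t_user;
// (2) Iterate over every user
foreach $uid in uids[] {
        // (3) Query the user's score flow records from the last 3 months
        scores[] = select score from t_flow
                   where uid=$uid and time=[last 3 months];
        // (4) Iterate over the score flow records
        sum = 0;
        foreach $score in scores[] {
                // (5) Accumulate the total score
                sum += $score;
        }
        // (6) Apply business processing based on the total
        switch(sum)
        upgrade or downgrade, issue coupons, issue rewards;
}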

What problems does a scheduled task that runs once a month have?
It is computationally expensive, processes a large amount of data, and takes a long time; according to readers, it needs 1 to 2 days.
Voice-over: the outer loop covers about 1 million users; the inner loops cover about 90 million score flow records; the business processing needs more than ten database interactions.
Can it be processed in parallel with multiple threads?
Yes, because the processing of each user's score flow is independent of every other user's.
If it is changed to multi-threaded parallel processing, for example split by user, what problems exist?
Every thread has to access the database for its business processing, and the database may not be able to carry the load.
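Splitting the work by user could be sketched as follows. This is a minimal illustration, not the author's implementation: `process_user` is a hypothetical placeholder for the per-user query and business processing, and `MAX_WORKERS` is an assumed cap that limits how many threads hit the database at once.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4  # hypothetical cap, chosen to limit concurrent database access


def process_user(uid):
    # Placeholder for: query three months of records, sum them,
    # then apply the business processing for this user.
    return uid, f"processed-{uid}"


def process_all(uids):
    # Users are independent of each other, so any worker can take any user.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return dict(pool.map(process_user, uids))


results = process_all(range(10))
```

Capping the pool size bounds the database load, but it also bounds the speed-up, which is why the directions that follow focus on reducing the work itself.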
The optimization directions for this kind of problem are:
(1) reduce repeated calculation on the same data;
(2) spread out the CPU time, processing in a dispersed way rather than all at once;
(3) reduce the amount of data processed in a single calculation;
How can repeated calculation on the same data be reduced?
In the original figure, each square represents one month of score flow data (about 30 million records).

When calculating at the end of March, the data for January, February, and March (90 million records) is queried and calculated;
when calculating at the end of April, the data for February, March, and April (90 million records) is queried and calculated;
...
Notice that the February and March data (the pink section in the figure) is queried and calculated repeatedly.
Voice-over: in this business, each month's data is calculated three times.
Add a monthly score flow summary table, and each month calculate only the increment:
flow_month_sum(month, uid, flow_sum)
(1) each month, only that month's scores are calculated, so the amount of data drops to 1/3 and the time taken also drops to 1/3;
(2) at the same time, adding the summaries of the previous two months gives the total score for the last three months (this step takes almost no time);
Voice-over: the amount of data in this table is on the same order of magnitude as the user table, about 1 million rows.
As a result, each score flow record is counted only once.
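The monthly summary idea can be sketched with an in-memory SQLite database. This is an illustration under assumptions: `t_flow` here carries a simplified `month` column instead of a timestamp, the sample rows are made up, and the helper names `summarize_month` and `three_month_total` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t_flow (uid INTEGER, score INTEGER, month TEXT);
    CREATE TABLE flow_month_sum (
        month TEXT, uid INTEGER, flow_sum INTEGER,
        PRIMARY KEY (month, uid)
    );
""")
# Made-up sample records for illustration.
conn.executemany(
    "INSERT INTO t_flow VALUES (?, ?, ?)",
    [(1, 10, "2019-01"), (1, 20, "2019-02"), (1, 5, "2019-03"),
     (1, 7, "2019-03"), (2, 3, "2019-03")],
)


def summarize_month(month):
    # Aggregate only the given month's records, so each record is read once.
    conn.execute(
        "INSERT OR REPLACE INTO flow_month_sum (month, uid, flow_sum) "
        "SELECT month, uid, SUM(score) FROM t_flow WHERE month = ? "
        "GROUP BY month, uid",
        (month,),
    )


def three_month_total(uid, months):
    # Add up three small summary rows instead of rescanning the raw records.
    placeholders = ",".join("?" * len(months))
    row = conn.execute(
        "SELECT COALESCE(SUM(flow_sum), 0) FROM flow_month_sum "
        f"WHERE uid = ? AND month IN ({placeholders})",
        (uid, *months),
    ).fetchone()
    return row[0]


for m in ("2019-01", "2019-02", "2019-03"):
    summarize_month(m)

total = three_month_total(1, ["2019-01", "2019-02", "2019-03"])
print(total)  # 42
```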
How can the CPU time be spread out and the amount of data in a single calculation reduced?
The business requirement is to recalculate scores once a month, but one centralized monthly calculation handles too much data and takes too long, so the calculation can be spread across every day.
As the original figure shows, the monthly score flow summary is upgraded to a daily score flow summary table.
The once-a-month centralized calculation is spread into 30 dispersed calculations; each one handles 1/30 of the data and takes only a few minutes.
It can even run once an hour, which cuts the data per calculation to 1/24 again, so each run takes only a few minutes.
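A daily summary could be sketched like this, using in-memory structures as hypothetical stand-ins for the score flow table and the daily summary table. The names `flow_day_sum`, `summarize_day`, and `total_between`, as well as the sample records, are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical stand-in for the score flow table: (day, uid, score).
flow_records = [
    (date(2019, 3, 1), 1, 10),
    (date(2019, 3, 1), 1, 5),
    (date(2019, 3, 2), 1, 7),
    (date(2019, 3, 2), 2, 3),
]
flow_day_sum = {}  # (day, uid) -> summed score for that day


def summarize_day(day):
    # The daily job touches only that day's records.
    totals = defaultdict(int)
    for rec_day, uid, score in flow_records:
        if rec_day == day:
            totals[uid] += score
    # Overwrite rather than add, so rerunning the job for a day is harmless.
    for uid, day_total in totals.items():
        flow_day_sum[(day, uid)] = day_total


def total_between(uid, start, end):
    # A period total is the sum of small daily summary rows.
    return sum(s for (d, u), s in flow_day_sum.items()
               if u == uid and start <= d <= end)


summarize_day(date(2019, 3, 1))
summarize_day(date(2019, 3, 2))
march_total = total_between(1, date(2019, 3, 1), date(2019, 3, 31))
print(march_total)  # 22
```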

The time is shorter, but this is still a scheduled task. Can the score flow be calculated in real time?
With about 1 million new score flow records a day, the "daily score flow summary" can be accumulated in real time.
Use DTS (or Canal) to add a listener on the score flow table; when a user's score changes, the daily score flow summary is accumulated in real time. This spreads the hourly scheduled calculation evenly across "every moment". With about 1 million new records a day, the write pressure on the database is only a little over 10 writes per second, which it can easily carry.
Voice-over: if DTS or Canal cannot be used, MQ can be used instead.
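The real-time accumulation can be sketched with a plain event handler. This does not use the actual DTS, Canal, or MQ client APIs: `on_score_change` and the event shape (`uid`, `score`, `day`) are hypothetical, standing in for whatever the change listener or message consumer delivers.

```python
from collections import defaultdict
from datetime import date

# Write rate implied by about 1 million new records per day.
SECONDS_PER_DAY = 24 * 60 * 60
writes_per_second = 1_000_000 / SECONDS_PER_DAY  # about 11.6

flow_day_sum = defaultdict(int)  # (day, uid) -> running daily total


def on_score_change(event):
    # Hypothetical callback, invoked once per new score flow record.
    flow_day_sum[(event["day"], event["uid"])] += event["score"]


# Made-up change events for illustration.
events = [
    {"uid": 1, "score": 10, "day": date(2019, 3, 1)},
    {"uid": 1, "score": 5, "day": date(2019, 3, 1)},
    {"uid": 2, "score": 3, "day": date(2019, 3, 1)},
]
for event in events:
    on_score_change(event)

print(flow_day_sum[(date(2019, 3, 1), 1)])  # 15
```

In a real deployment the handler would also need to cope with duplicate or replayed events, which this sketch ignores.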

To summarize, for scheduled tasks that process a large amount of data in one centralized batch, the optimization ideas are:
(1) reduce repeated calculation on the same data;
(2) spread out the CPU time, processing in a dispersed way (even in real time) rather than all at once;
(3) reduce the amount of data processed in a single calculation;
I hope this gives you some inspiration; the way of thinking matters more than the conclusion.

Finally,
feel free to discuss, and if you liked the article, remember to give it a like. Thanks for your support!

Origin blog.51cto.com/14442094/2430251