Distributed Computing Standard Deviation, Reliability

Distributed Computing Standard Deviation, Reliability


When a set of data cannot be fully loaded into memory for computing, then we need to perform distributed computing, where each machine computes part of the data and then synthesizes the final result. For example, the typical case of word frequency statistics, but when the final result cannot be obtained from the results of each machine, then the algorithm must be split.

==Standard of splitting algorithm: The granularity of the algorithm formula must be obtained according to the processing of each distributed task==

Split standard deviation:

For a set of data (for example: 1, 2, 3, 4, 5, 6, 7), we split it into two machines to calculate the
two sets of data

A机器计算 (1、2、3、4)

B机器计算 (5、6、7)

First, a single set of data needs to calculate three indicators

For the group (1, 2, 3, 4):

成员个数: 4
成员之和: 1+2+3+4=10
成员的平方和:1²+2²+3²+4²=30

For the group (5, 6, 7):

成员个数: 3
成员之和: 5+6+7=18
成员的平方和:5²+6²+7²=110

After getting these three indicators, taking mr as an example, we can calculate these three indicators in each map, and finally
execute the algorithm in reduce

((110+30)/(4+3))-((10+18)/(4+3))²

In the prescribing, it is exactly the same as the std calculation result of mysql

Let's see if the result of mysql is the same

The standard deviation is obtained, and the reliability is again based on the calculation of the subset!

Guess you like

Origin http://43.154.161.224:23101/article/api/json?id=324949561&siteId=291194637