[Federated Learning + Blockchain] TORR: A Lightweight Blockchain for Decentralized Federated Learning


Paper: https://ieeexplore.ieee.org/abstract/document/10159020


I. CONTRIBUTIONS

  1. Proposed TORR, a lightweight blockchain for decentralized federated learning.
  2. Proposed a new consensus protocol, Proof of Reliability (PoR), to filter out unreliable devices and thereby reduce system latency, together with a fast aggregation algorithm that performs fast and correct aggregation, further reducing latency.
  3. Used erasure coding to store models efficiently and reliably, while the blockchain ledger records only the hash of the model; a periodic storage refresh strategy further reduces storage overhead.
  4. Deployed the system on five servers simulating 100 nodes and tested it on three different models. Experiments show that TORR reduces system latency, overall storage overhead, and peak storage overhead by up to 62%, 75.44%, and 51.77%, respectively.

II. ASSUMPTIONS AND THREAT MODEL

A. Assumptions

Federated Learning. Following the general assumptions of typical federated learning, TORR assumes that many distributed devices are willing to jointly train a global model. In each training round, a subset of devices is selected to train local models. However, there is no central parameter server. Lightweight devices join the blockchain network as nodes with limited storage capacity and heterogeneous connections. Some nodes respond slowly because of poor network conditions or maliciously slow down the training process; these nodes are considered unreliable.

Proof of Reliability. In TORR, nodes use PoR to reach consensus on the state of the blockchain ledger. PoR is based on the design of Proof of Stake (PoS): PoS elects a group of nodes as a committee according to their stake, and committee members decide the content of the next block. TORR defines a node's stake as a measure of its reliability rather than of contributed funds, allocating stake to nodes that stay online and respond correctly and quickly. Any unreliable behavior, such as network slowness or disconnection, reduces the node's stake. Like PoS, PoR assumes that honest participants own at least 70% of the stake. The stake distribution is recorded in each block, so nodes can obtain it from the blockchain ledger.

B. Threat Model

Malicious nodes can mount Sybil attacks by generating fake nodes to increase their proportion. With PoS-like protocols, malicious nodes cannot increase their stake simply by increasing their number of nodes. TORR works normally as long as no more than 30% of the stake is controlled by malicious nodes.

During the storage process, a malicious node can claim to store a chunk but always return a wrong chunk when other nodes request it. TORR records the hash of each chunk in the block to solve this problem: when requesting a chunk from another node, each node can verify the chunk against the hash stored in the blockchain.

Malicious nodes can attempt to slow down the entire training process by responding slowly or deliberately not responding to requests. In the PoR protocol, any such behavior is recorded as evidence for reducing the node's stake. As training progresses, nodes with lower stake find it harder to be elected as committee members.

During aggregation, multiple nodes are selected as aggregators. If malicious nodes own more than half of the aggregators' stake, they may dominate aggregation and produce an incorrect global model. The article uses a scoring mechanism to evaluate the reliability of each node and to determine increases or decreases of stake. In each round, the selected clients and aggregators score the nodes storing chunks. A malicious client or aggregator may try to manipulate this process by giving extremely high or extremely low scores. To address this, the committee must be large enough to contain enough honest nodes, and the aggregation algorithm and scoring rules are designed to prevent malicious behavior.

III. SYSTEM DESIGN

TORR's design goals are as follows: (1) system latency should be comparable to centralized federated learning and lower than other blockchain-based federated learning frameworks; (2) the storage requirement of a single node should be much less than in other blockchain-based frameworks; (3) malicious nodes cannot control the system without obtaining enough stake.

A. Design Overview

There are three roles in TORR: client, aggregator, and keeper. In each round, K clients and M aggregators are selected for training and aggregation, respectively. The parameter K is specified by the model owner who initiates federated learning; M is predefined by TORR. Keepers store models. Using erasure coding, both local and global models are encoded into n chunks, which are stored on n selected keepers. Blocks are generated by an aggregator or the model owner; each contains the hash of the global model and marks the start of a round. The first block is generated by the model owner to start federated learning. The TORR framework is shown in Fig. 1.

(step 1) The model owner encodes the global model into n chunks and stores them on n keepers. Keepers are selected through a VRF and consistent hashing, with the hash of the global model as the VRF input. Clients and aggregators are selected using the hash of the previous block, i.e., the hash of the genesis block, as the VRF input.

(step 2) After the global model is stored, the model owner creates a block containing the model hash and broadcasts it.

(step 3) After receiving the block, each node checks its own role. If the node is a client this round, it first requests chunks from the keepers to restore the global model. The client records each keeper's response delay and uses it as a scoring criterion.

(step 4) The client trains the model on its own local data.

(step 5) After training, the client uses the hash of its local model as the VRF input to select n keepers. The local model is encoded into n chunks and stored on the corresponding keepers.

(step 6) Any client that completes the storage operation transmits the hash of its local model and the recorded delays to the aggregators. When an aggregator has received K local model hashes, it begins the aggregation operation.

(step 7) The aggregator first requests chunks from keepers to restore the K local models, recording each keeper's response delay as it does so. It then aggregates the local models into a new global model.

(step 8) Note that a malicious aggregator may produce an incorrect global model. To ensure correct aggregation, the article designs a most-stake aggregation algorithm modified from the Bully election algorithm. The Bully algorithm is a well-known election algorithm that efficiently selects a leader among multiple nodes but cannot tolerate malicious nodes: if a malicious node wins an election, it can perform incorrect aggregation or stop running to halt the process. The article adds a verification step to ensure the correctness of aggregation and a timing mechanism to ensure the efficiency of the algorithm. Where the Bully algorithm selects the node with the largest ID as leader, the article selects the node with the most stake, so that blocks are more likely to be created by honest and fast nodes.

(step 9) Once an aggregator (usually the node with the most stake) wins the election and becomes the leader, it selects the keepers for the new global model and stores the model on them.

(step 10) The next block, containing the hash of the new global model, is then created and propagated to the other nodes to start the next round of federated learning.

B. Block Design

​Nodes in TORR propagate blocks through the gossip protocol to ensure that every node in the network receives new blocks. The receiver appends the received block to its local ledger. The structure of the block is shown in Fig.2.


Each block contains the hash of the previous block, thus forming a chain. A round parameter is recorded in each block; when it reaches the value specified by the model owner in the genesis block, the federated learning process stops. The block also contains a model object, which holds the hash of the global model, the hashes of its chunks, and the signatures of the corresponding keepers. Through the information recorded in the model object, any node knows where to get the global model. The keepers' signatures ensure that the model has been stored correctly, and the model hash ensures the correctness of the global model. The stake distribution indicates how much stake each node holds.

The VRF proof π is shared through the block. With the proof, the block miner's public key, and the public seed used to generate the proof (the previous block hash or the model hash), anyone can deterministically obtain the same VRF hash output and verify that it is valid. The VRF hash output is used together with consistent hashing for client and aggregator selection. In addition, a block should contain at most K + M scores provided by clients and aggregators, and j signatures, where j ≥ ⌊M/2⌋ + 1. The scores are used to update each node's stake. The signatures are provided during the verification step of the aggregation algorithm; a block is considered valid if it contains at least ⌊M/2⌋ + 1 signatures.
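A minimal sketch of this block layout (field names are illustrative, not taken from the paper's implementation) might look like:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ModelObject:
    model_hash: str               # hash of the full global model
    chunk_hashes: List[str]       # hash of each erasure-coded chunk
    keeper_signatures: List[str]  # keepers attest the chunks are stored

@dataclass
class Block:
    prev_hash: str                      # links blocks into a chain
    round: int                          # training stops at the owner-set limit
    model: ModelObject
    stake_distribution: Dict[str, int]  # stake held by each node
    vrf_proof: str                      # pi, checkable with the miner's public key
    scores: Dict[str, float]            # at most K + M scores for stake updates
    signatures: List[str]               # aggregator signatures on this block

    def is_valid(self, m: int) -> bool:
        # a block needs at least floor(M/2) + 1 aggregator signatures
        return len(self.signatures) >= m // 2 + 1
```

With M = 26 aggregators, 14 signatures make a block valid and 13 do not.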

C. Initialization

The federated learning model owner creates a genesis block to initialize the entire process. The article assumes the model owner is trusted, but only in this step, so the first block contains no aggregator signatures. The first block also contains the expected number of rounds so that each node knows when to stop training.

D. Role Selection

To prevent possible attacks, the roles in TORR change every round. When an aggregator creates a new block, it uses its private key, the hash of the previous block, and the model hash to select new clients, aggregators, and keepers. The role selection process is shown in Fig. 3.

Under consistent hashing, all possible hashes form a ring from 0 to 2^32. Each node's stake is hashed into virtual nodes on the ring: the more stake a node has, the more virtual nodes it gets, and the greater its chance of being selected. In federated learning there are usually many devices, so the resulting virtual nodes are approximately evenly distributed on the ring. TORR thus follows the PoS property that the probability of a node being selected equals the proportion of stake it owns.

First, the hash of the previous block or the model hash is fed into the VRF to generate a VRF hash. This hash lands at a point on the ring, and the first node clockwise from that point is selected. The hash of this hash is then computed to select the second node, and the process continues until the required number of nodes is selected. The VRF is shown in formula (1).


$$(\beta, \pi) = \mathrm{VRF}_{sk}(\alpha) \quad (1)$$

The VRF takes a random seed α and a private key sk as input and outputs a hash β and a proof π. Given the same private key and seed, the output is the same. To anyone who does not know the private key, the hash output β is indistinguishable from a random value, so it can be used as the starting point for consistent hashing to select nodes fairly.

Other nodes can verify with the public key pk_i that the hash output was generated by node i, so an attacker cannot forge the output. TORR uses the previous block hash or the model hash as the random seed α, and node i's private key sk_i as the input private key. Since the output is determined by sk_i, the hash output β is unique to each node. Block hashes and model hashes cannot be known before the block or model is created, so malicious nodes cannot predict the VRF output and mount targeted attacks on clients or aggregators.
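The selection procedure can be sketched as follows. The real system uses an actual VRF whose output comes with a publicly checkable proof; here a plain keyed hash stands in for it, which is deterministic but not verifiable, purely for illustration:

```python
import hashlib

RING_SIZE = 2 ** 32

def vrf_stand_in(sk: bytes, seed: bytes) -> int:
    # stand-in for a real VRF (e.g. ECVRF): deterministic, but a real VRF
    # also emits a proof pi that others can check with the public key
    return int.from_bytes(hashlib.sha256(sk + seed).digest()[:4], "big")

def build_ring(stakes: dict) -> list:
    # one virtual node per unit of stake, placed by hashing "name:index"
    ring = []
    for node, stake in stakes.items():
        for i in range(stake):
            pos = int.from_bytes(
                hashlib.sha256(f"{node}:{i}".encode()).digest()[:4], "big")
            ring.append((pos, node))
    return sorted(ring)

def select(ring: list, start_hash: int, count: int) -> list:
    # walk clockwise from the hash point; rehash to pick each next node
    chosen, h = [], start_hash
    while len(chosen) < count:
        for pos, node in ring:
            if pos >= h and node not in chosen:
                chosen.append(node)
                break
        else:  # wrapped past the top of the ring
            for pos, node in ring:
                if node not in chosen:
                    chosen.append(node)
                    break
        h = int.from_bytes(
            hashlib.sha256(h.to_bytes(4, "big")).digest()[:4], "big")
    return chosen
```

Because every input is a hash from the ledger, every honest node computes the same committee independently.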

E. Storage Protocol

Storage Requirement Reduction. Many devices sit idle in federated learning because only a small subset is selected each round to train local models, so TORR uses the limited storage space of the many idle devices to store models. Considering malicious nodes and the limited storage of a single device, erasure coding divides the model into multiple chunks. On the one hand, erasure coding ensures model availability. The scheme uses Reed-Solomon (RS) codes: an (n, k) RS code encodes the model into n chunks in total, and any k of the n chunks suffice to restore the entire model. Properly setting k and n guarantees that even if some nodes are malicious or fail, any node can recover the model. On the other hand, compared with storing large models whole, storing small chunks spreads the storage pressure across all nodes.

In addition, a periodic storage refresh strategy is designed to avoid continuous growth of overall storage. In federated learning the global model is updated every round, so in some scenarios a model only needs to be stored for a period of time before being deleted. Specifically, when a node requests to store a chunk in TORR, the chunk is sent to the keeper along with a block-to-live (BTL) parameter, an integer giving the number of rounds (blocks) the chunk should survive. When a new chunk arrives, the node decrements the BTL of all stored chunks by one; any chunk whose BTL reaches zero may be deleted. The BTL definition thus enables flexible storage refresh.
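A toy sketch of a keeper's BTL bookkeeping (a hypothetical helper, not the paper's code):

```python
class KeeperStore:
    """Chunk store with block-to-live (BTL) refresh: every newly arriving
    chunk ages all stored chunks by one, and chunks at zero are dropped."""

    def __init__(self):
        self.chunks = {}  # chunk_hash -> (data, remaining btl)

    def put(self, chunk_hash: str, data: bytes, btl: int) -> None:
        # a new chunk arriving triggers the refresh of existing chunks
        self.tick()
        self.chunks[chunk_hash] = (data, btl)

    def tick(self) -> None:
        # decrement every stored chunk's BTL; delete chunks that reach zero
        expired = []
        for h, (data, btl) in self.chunks.items():
            if btl - 1 <= 0:
                expired.append(h)
            else:
                self.chunks[h] = (data, btl - 1)
        for h in expired:
            del self.chunks[h]
```

A chunk stored with `btl=2` therefore survives exactly two later arrivals before it becomes deletable.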

Mitigation of the Impact of Unreliable Devices. In federated learning, lightweight devices access public networks via different communication media such as WiFi or cellular networks, so there is a high probability that some devices have poor network conditions. Worse, an attacker can deliberately respond slowly or not at all, threatening the availability of the model. Erasure coding avoids possible bad connections between clients and the aggregator: when multiple clients need to transfer local models to the aggregator, one of them may have a poor connection, but the aggregator no longer receives an entire model from a single client; instead it requests multiple chunks from different keepers simultaneously. Erasure coding lets the aggregator skip chunks from stragglers and recover the entire model from only a subset of the chunks.

Selection of k and n for the RS Code. TORR assumes a rather hostile environment in which up to 30% of the stake may be held by malicious nodes, so n and k are chosen to tolerate 30% of the nodes failing.

Since TORR uses consistent hashing based on node stake, the probability that a node is selected equals its stake proportion. Assume the proportion of stake held by malicious nodes is p, and that at least n keepers must be selected to store model chunks. At most n − k keepers may fail, because any k chunks suffice to restore the model. From this, the probability that the keepers are dominated by malicious nodes can be calculated:
$$P = \sum_{i=n-k+1}^{n} \binom{n}{i} p^{i} (1-p)^{n-i}$$
Considering that malicious nodes hold at most 30% of the stake, the upper bound P_upper of this probability is obtained by setting p to 30%. The experiments select 10 clients per round and run federated learning for 100 rounds, so keepers are selected 1100 times to store 100 global models and 1000 local models. The upper bound P_upper should therefore be less than 0.0009 to ensure safe operation of the system. Choosing n and k is a balance: for a fixed k, a larger n lowers P_upper but causes higher redundancy (the ratio n/k).
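This binomial tail is easy to evaluate numerically; a small sketch (the exact n and k of the paper's experiments are not restated here):

```python
from math import comb

def p_upper(n: int, k: int, p: float = 0.3) -> float:
    # probability that more than n - k of the n keepers are malicious,
    # i.e. fewer than k chunks survive and the model cannot be restored
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(n - k + 1, n + 1))
```

For a fixed k, growing n drives this probability down at the cost of higher redundancy n/k; for instance `p_upper(20, 5)` is far smaller than `p_upper(10, 5)`.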

F. Aggregation Protocol

To ensure security during aggregation and prevent single points of failure, TORR requires multiple aggregators to perform aggregation together. When an aggregator has received the local model objects and scores from all clients, it first checks the signatures. A valid local model object contains the hash of the local model and the hashes of its n chunks, with the signatures of the n keepers proving that the model has been stored correctly. The aggregators then request chunks from the corresponding keepers in preparation for restoring all local models.

The scheme uses FedAvg for aggregation. Let w_i denote client i's local model and n_i the number of its data points. With K clients in total, the aggregated global model is given by formula (4).

$$w = \sum_{i=1}^{K} \frac{n_i}{\sum_{j=1}^{K} n_j} w_i \quad (4)$$

Since formula (4) consists of multiple addition operations, the aggregator need not wait for all local models to be restored before aggregating: as soon as a local model is restored, the aggregator immediately folds it into the intermediate model. To reach consensus among multiple aggregators, the scheme designs a most-stake aggregation protocol modified from the Bully election algorithm, as shown in Algorithm 1.
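A sketch of this incremental aggregation over flat parameter lists (real models would use tensors):

```python
def fedavg_incremental(local_models, data_counts):
    # formula (4) as a running sum: each restored model w_i is folded in
    # immediately with weight n_i / sum_j n_j, so the aggregator never
    # waits for all K local models before starting
    total = sum(data_counts)
    acc = None
    for w_i, n_i in zip(local_models, data_counts):
        scaled = [p * n_i / total for p in w_i]
        acc = scaled if acc is None else [a + s for a, s in zip(acc, scaled)]
    return acc
```

Two models [1, 1] and [3, 3] with data counts 1 and 3 aggregate to [2.5, 2.5].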

Timed Election. When an aggregator has generated the global model, it sends an election message to the aggregators with more stake. The election message detects whether another aggregator is present in the network: if any response is received, the aggregator cannot become leader, because someone with more stake is online. However, if the attacker with the most stake becomes leader, it can stop running immediately, and the other aggregators would wait forever for a new block. Therefore, each aggregator sets a timer T after receiving responses to its election message. If T expires without a new block being received, the leader may be an attacker or a straggler, and the aggregators start a new election, sending election messages to the aggregators with more stake but excluding the previously elected leader.
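The re-election logic can be sketched like this (a hypothetical helper; the paper's Algorithm 1 also covers the message exchange and timers):

```python
def elect_leader(stakes: dict, online: set, excluded: frozenset = frozenset()):
    # most-stake variant of Bully: the responsive aggregator holding the
    # most stake wins; a leader whose timer T expired is excluded next time
    candidates = [a for a in stakes if a in online and a not in excluded]
    if not candidates:
        return None
    return max(candidates, key=lambda a: (stakes[a], a))
```

If the first winner stalls until T expires, the next election simply passes it over and picks the next-richest responsive aggregator.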

Verification. The timed election ensures that malicious nodes cannot deliberately stall the system, but it cannot guarantee the correctness of the global model: the malicious node with the most stake can still win the election and include an incorrect global model in the block. The scheme therefore introduces a verification step. After an aggregator becomes leader, it sends a block containing its global model hash to all other aggregators for verification; they sign the block if the hash matches that of their own global model. A block is considered valid if it contains signatures from more than half of the aggregators, so as long as no more than half of the aggregators are malicious, the global model is correct.

Minimum Size of the Aggregator Committee. The committee size M should be chosen to prevent the committee from being controlled by malicious nodes. As above, the proportion p of stake held by malicious nodes does not exceed 30%. The probability P_agr that the aggregator committee is controlled by malicious nodes is given by formula (5):

$$P_{agr} = \sum_{i=\lfloor M/2 \rfloor + 1}^{M} \binom{M}{i} p^{i} (1-p)^{M-i} \quad (5)$$

Considering federated learning running for 100 rounds, the aggregator committee is selected 100 times, so the safety threshold is 0.01. The minimum committee size satisfying this is 26.
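These numbers can be reproduced from formula (5) with a short computation:

```python
from math import comb

def p_agr(m: int, p: float = 0.3) -> float:
    # probability that more than half of the M aggregators are malicious
    return sum(comb(m, i) * p ** i * (1 - p) ** (m - i)
               for i in range(m // 2 + 1, m + 1))

def min_committee(threshold: float = 0.01, p: float = 0.3) -> int:
    # smallest M whose takeover probability falls below the threshold
    m = 1
    while p_agr(m, p) >= threshold:
        m += 1
    return m
```

With p = 0.3, `p_agr(25)` is still above 0.01 while `p_agr(26)` drops just below it, which matches the committee size of 26 stated above.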

G. Proof of Reliability

The article designs the Proof of Reliability consensus protocol to adapt to heterogeneous network conditions and unreliable devices. The core is a trustworthy measure of each node's reliability: the scheme uses the chunk-retrieval latency reported by clients and aggregators as the basis for evaluation.

Assume that at the beginning of a round the global model is stored on n keepers. After receiving the new block, the K clients first request chunks from these keepers. In this process each client scores the n keepers: the score equals the delay between initiating the request and receiving the reply, so each keeper receives K scores. Let S^{C_j}_i denote the score client j gives keeper i. Each client sends its scores to the aggregators along with the hash of its local model.

To perform aggregation, each aggregator requests chunks from the keepers of the local models. In this process each aggregator scores at most nK keepers; let S^{A_j}_i denote the score aggregator j gives keeper i. Each aggregator sends its scores to all other aggregators, so every aggregator obtains all the scores generated in the round: nK scores from clients and nKM scores from aggregators.

If a keeper is selected to store all models in a round, it obtains K + KM scores; if it is selected only once, to store the global model, it gets K scores. Finally, the median of all of a keeper's scores is used as the indicator of its reliability, as shown in formula (6):

$$R_i = \operatorname{median}\left(\{S^{C_j}_i\} \cup \{S^{A_j}_i\}\right) \quad (6)$$


As mentioned above, assuming the stake held by malicious nodes does not exceed 30%, the minimum committee size ensuring an honest majority is 26. Therefore, TORR adjusts a node's stake only when it receives more than 26 nodes' scores. The stake update formula is shown in (7).
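A sketch of this score filtering (the exact stake adjustment of formula (7) is not reproduced here):

```python
from statistics import median

MIN_SCORERS = 26  # committee size that guarantees an honest majority

def reliability_indicator(scores):
    # stake is adjusted only when more than 26 nodes scored the keeper;
    # the median resists a minority of maliciously extreme scores
    if len(scores) <= MIN_SCORERS:
        return None
    return median(scores)
```

With 27 scores where two are maliciously extreme (0 s and 100 s), the median still reflects the keeper's typical latency.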

Therefore, reliable nodes that can respond quickly will have more stakes. The updated stake is included in the new block, which will be verified in the most-stake aggregation verification step. If the leader fabricates the score, it will be immediately detected by other aggregators and the block cannot be verified.

H. Blockchain Consensus

There are no forks in TORR. Since clients and aggregators are selected via VRF and consistent hashing using the previous block hash as input, all nodes should observe the same client and aggregator in a round. Only one aggregator will be elected as leader to create the next block, and the block should be verified by a majority of aggregators. Therefore, the block is not in dispute.

IV. SECURITY ANALYSIS

A malicious client may send fake local models to poison the global model, or deliberately give keepers high or low scores. The article does not discuss the first issue. For the second: if fewer than 26 nodes score the same keeper, the clients' scores are not trusted or used, because malicious nodes may form a majority in that case; with enough scorers, the median of the scores blocks the influence of malicious nodes.

A malicious keeper may claim to store a chunk but delete it, or modify a chunk. The first behavior harms the availability of the model; however, with a proper choice of RS code, k of the n chunks can always be obtained from honest keepers to restore the model, even if no malicious keeper stores its chunks. The second behavior threatens the validity of the model; recording the hash of each chunk in the block lets any node verify a chunk by comparing hashes, so a malicious keeper returning a wrong chunk is detected immediately.

A malicious aggregator may forge an incorrect global model, or deliberately give keepers high or low scores. The first problem is solved by the verification step in the most-stake aggregation; the second by using the median score instead of the average to evaluate keepers.

Origin: blog.csdn.net/WuwuwuH_/article/details/134415491