Distributed file system IPFS and FileCoin

In this article, I want to talk about the recently popular IPFS (InterPlanetary File System), a peer-to-peer distributed file system. Roughly three decades have passed since the HTTP protocol emerged, and few designs have managed to enhance the entire HTTP network or bring new features to it.

Using HTTP to transfer relatively small files is cheap and convenient, but as computing resources and storage space grow exponentially, we increasingly need to fetch large amounts of data at any time, and IPFS was designed to solve this problem.

Architecture design

As a distributed file system, IPFS provides a platform that supports deployment and writes, as well as the distribution and versioning of large files. To achieve this, the IPFS protocol is divided into seven sub-protocols: identities, network, routing, exchange, objects, files, and naming.

Each of these sub-protocols is responsible for a different function in IPFS. The following sections describe what each one does and how IPFS implements it.

Identity

In the IPFS network, every node is identified by a unique NodeId, which is similar to a Bitcoin address: it is the hash of a public key. To raise the cost for attackers, IPFS uses the static crypto puzzle described in S/Kademlia to make creating a new identity more expensive:

difficulty = <integer parameter>
n = Node{}
do {
  n.PubKey, n.PrivKey = PKI.genKeyPair()
  n.NodeId = hash(n.PubKey)
  p = count_preceding_zero_bits(hash(n.NodeId))
} while (p < difficulty)

Each node is represented in the IPFS code by the Node structure, which contains only the NodeId and a public/private key pair:

type NodeId Multihash
type Multihash []byte
type PublicKey []byte
type PrivateKey []byte

type Node struct {
  NodeId NodeId
  PubKey PublicKey
  PriKey PrivateKey
}

In short, the main role of the identity system is to identify each node in the IPFS network, and with it each "user" of IPFS.
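
The identity puzzle can be sketched in Go as follows. This is a minimal illustration and not the go-ipfs implementation: the choice of Ed25519 keys and SHA-256 is an assumption made here, and leadingZeroBits and NewNode are hypothetical helper names.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
	"math/bits"
)

// Node mirrors the structure in the text: an ID plus a key pair.
type Node struct {
	NodeId  []byte
	PubKey  ed25519.PublicKey
	PrivKey ed25519.PrivateKey
}

// leadingZeroBits counts the zero bits at the start of b.
func leadingZeroBits(b []byte) int {
	n := 0
	for _, x := range b {
		if x == 0 {
			n += 8
			continue
		}
		n += bits.LeadingZeros8(x)
		break
	}
	return n
}

// NewNode keeps generating key pairs until hash(NodeId) starts with at
// least `difficulty` zero bits. Assumption: Ed25519 and SHA-256 stand in
// for PKI.genKeyPair() and hash() from the pseudocode.
func NewNode(difficulty int) (Node, error) {
	for {
		pub, priv, err := ed25519.GenerateKey(rand.Reader)
		if err != nil {
			return Node{}, err
		}
		id := sha256.Sum256(pub)
		check := sha256.Sum256(id[:])
		if leadingZeroBits(check[:]) >= difficulty {
			return Node{NodeId: id[:], PubKey: pub, PrivKey: priv}, nil
		}
	}
}

func main() {
	// A difficulty of 8 needs about 2^8 = 256 attempts on average.
	n, err := NewNode(8)
	if err != nil {
		panic(err)
	}
	fmt.Printf("NodeId: %x\n", n.NodeId)
}
```

Each extra bit of difficulty doubles the expected number of key pairs an attacker must generate per identity, while verifying an existing identity still costs only two hash computations.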

Network

As a distributed storage system, IPFS needs its nodes to communicate and transfer information over the network, and it can use multiple transport protocols while ensuring reliability, connectivity, integrity, and authenticity.

IPFS can communicate over any network. It does not assume that it runs on IP; instead it uses the multiaddr format to express the target address and the protocols used, which keeps it compatible with, and extensible to, network protocols that may appear in the future:

/ip4/10.20.30.40/sctp/1234/
/ip4/5.6.7.8/tcp/5678/ip4/1.2.3.4/sctp/1234/
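
To make the structure of these addresses concrete, here is a small hand-rolled parser in Go. It is only a sketch: ParseAddr and Component are hypothetical names, not the API of the real multiaddr libraries, and the parser assumes that every protocol is followed by exactly one value, which is true for the two examples above but not for multiaddr in general.

```go
package main

import (
	"fmt"
	"strings"
)

// Component is one protocol/value pair of a multiaddr-style address.
type Component struct {
	Protocol string
	Value    string
}

// ParseAddr splits an address such as /ip4/1.2.3.4/tcp/80 into pairs.
// Assumption: each protocol takes exactly one value.
func ParseAddr(addr string) ([]Component, error) {
	parts := strings.Split(strings.Trim(addr, "/"), "/")
	if len(parts)%2 != 0 {
		return nil, fmt.Errorf("malformed address %q", addr)
	}
	out := make([]Component, 0, len(parts)/2)
	for i := 0; i < len(parts); i += 2 {
		out = append(out, Component{Protocol: parts[i], Value: parts[i+1]})
	}
	return out, nil
}

func main() {
	// An SCTP/IPv4 connection proxied over TCP/IPv4, read left to right.
	cs, err := ParseAddr("/ip4/5.6.7.8/tcp/5678/ip4/1.2.3.4/sctp/1234/")
	if err != nil {
		panic(err)
	}
	for _, c := range cs {
		fmt.Printf("%s -> %s\n", c.Protocol, c.Value)
	}
}
```

Because the address names its own protocols, a new transport only needs a new protocol name; the address format itself does not change.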

Routing

In a distributed system, a routing system is needed to locate and access resources stored on other nodes. IPFS implements routing with a DSHT (distributed sloppy hash table) based on S/Kademlia and Coral. The interface of the IPFS routing system can be found in libp2p/go-libp2p-routing/routing.go:

type IpfsRouting interface {
	ContentRouting
	PeerRouting
	ValueStore

	Bootstrap(context.Context) error
}

type ContentRouting interface {
	Provide(context.Context, *cid.Cid, bool) error
	FindProvidersAsync(context.Context, *cid.Cid, int) <-chan pstore.PeerInfo
}

type PeerRouting interface {
	FindPeer(context.Context, peer.ID) (pstore.PeerInfo, error)
}

type ValueStore interface {
	PutValue(context.Context, string, []byte) error
	GetValue(context.Context, string) ([]byte, error)
	GetValues(c context.Context, k string, count int) ([]RecvdVal, error)
}

From this we can see that routing in IPFS must provide three basic functions: content routing, peer routing, and value storage. A "router" that implements these interfaces can be swapped out underneath without affecting the rest of the system. Currently, IPFS uses a global DHT and DNS to resolve routing records, and the Kademlia DHT has the following advantages:

  1. It finds the target address quickly in massive networks: the lookup cost is $O(\log_2 n)$, so a network of 10,000,000 nodes needs only about 20 queries ($\log_2 10^7 \approx 23$);
  2. It keeps control messages between nodes small, reducing coordination overhead;
  3. It resists various network attacks by preferring long-lived nodes;
  4. It is widely used in peer-to-peer applications such as BitTorrent and Gnutella, so the technology is relatively mature.
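
The logarithmic lookup cost comes from Kademlia's XOR distance metric: each query moves to a peer whose ID shares a longer bit prefix with the target, so the remaining search space roughly halves at every step. The following Go sketch illustrates only the metric, with one-byte IDs for readability; xorDistance and commonPrefixLen are hypothetical helper names, not functions from the IPFS code base.

```go
package main

import (
	"fmt"
	"math/bits"
)

// xorDistance returns the Kademlia distance between two equal-length IDs.
func xorDistance(a, b []byte) []byte {
	d := make([]byte, len(a))
	for i := range a {
		d[i] = a[i] ^ b[i]
	}
	return d
}

// commonPrefixLen counts the leading bits that a and b share. A longer
// shared prefix means a smaller XOR distance.
func commonPrefixLen(a, b []byte) int {
	n := 0
	for _, x := range xorDistance(a, b) {
		if x == 0 {
			n += 8
			continue
		}
		return n + bits.LeadingZeros8(x)
	}
	return n
}

func main() {
	target := []byte{0xB0} // 10110000
	peers := [][]byte{{0xBF}, {0x30}, {0xA0}}
	for _, p := range peers {
		fmt.Printf("%08b shares %d prefix bits with the target\n",
			p[0], commonPrefixLen(target, p))
	}
}
```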

Data exchange

In IPFS, the distribution and exchange of data uses the BitSwap protocol. BitSwap is responsible for two things: requesting the required blocks from other nodes and providing blocks for other nodes.

When a node requests blocks from other nodes or provides blocks to them, it sends a BitSwap message, which mainly contains two parts: the sender's wantlist and the data blocks. The whole message is encoded with Protobuf:

message Message {
  message Wantlist {
    message Entry {
      optional string block = 1; // the block key
      optional int32 priority = 2; // the priority (normalized). default to 1
      optional bool cancel = 3;  // whether this revokes an entry
    }

    repeated Entry entries = 1; // a list of wantlist entries
    optional bool full = 2;     // whether this is the full wantlist. default to false
  }

  optional Wantlist wantlist = 1;
  repeated bytes blocks = 2;
}

BitSwap has two very important modules: the want manager (Want-Manager) and the decision engine (Decision-Engine). When the node requests a block, the former either returns it from local storage or issues a suitable request to the network; the latter decides how to allocate resources to other nodes. When a node receives a message containing a wantlist, the message is forwarded to the decision engine, which decides how to handle the request based on that peer's Ledger.

Readers who want to know more about the implementation details of BitSwap can read the rest of the BitSwap Spec.
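
How a receiver might keep track of a peer's wantlist can be sketched in Go. This is an illustration of the message semantics (priority, cancel, full) and not the go-bitswap implementation; the Wantlist map type and its Apply method are hypothetical.

```go
package main

import "fmt"

// Entry mirrors Wantlist.Entry from the Protobuf message.
type Entry struct {
	Block    string
	Priority int32
	Cancel   bool
}

// Wantlist maps the block keys one peer wants to their priorities.
type Wantlist map[string]int32

// Apply merges a received message into the list and returns the result.
// If full is true the message replaces the list; otherwise it is an
// incremental update in which Cancel entries revoke earlier wants.
func (w Wantlist) Apply(entries []Entry, full bool) Wantlist {
	if full || w == nil {
		w = Wantlist{}
	}
	for _, e := range entries {
		if e.Cancel {
			delete(w, e.Block)
			continue
		}
		w[e.Block] = e.Priority
	}
	return w
}

func main() {
	w := Wantlist{}.Apply([]Entry{
		{Block: "QmA", Priority: 1},
		{Block: "QmB", Priority: 5},
	}, true)
	// The peer received QmA from someone else and revokes the want.
	w = w.Apply([]Entry{{Block: "QmA", Cancel: true}}, false)
	fmt.Println(len(w), "block(s) still wanted")
}
```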

In addition to defining the messages sent between nodes, IPFS also introduces incentives and penalties to discourage "malicious" nodes in the network. A Ledger records the data exchanged between two nodes:

type Ledger struct {
    owner      NodeId
    partner    NodeId
    bytes_sent int
    bytes_recv int
    timestamp  Timestamp
}

The decision engine uses the Ledger between two nodes to compute a debt ratio, which the IPFS white paper defines as:

$$r = \frac{\text{bytes\_sent}}{\text{bytes\_recv} + 1}$$

The debt ratio measures the trust between nodes. It not only deters attackers from creating large numbers of nodes, but also protects established exchange relationships when a node is briefly unavailable, and it cuts off exchanges before a relationship deteriorates further.

IPFS uses Ledger to create a network with incentives and penalties to ensure that most nodes in the network can exchange data and operate normally.
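
As a worked example, the IPFS white paper defines the debt ratio as r = bytes_sent / (bytes_recv + 1) and suggests a sigmoid, P(send | r) = 1 - 1 / (1 + exp(6 - 3r)), as the probability of sending to a peer. The Go sketch below evaluates both; the exported field names and the SendProbability function name are choices made for this example rather than names from the IPFS code.

```go
package main

import (
	"fmt"
	"math"
)

// Ledger keeps only the two counters needed for the debt ratio.
type Ledger struct {
	BytesSent int
	BytesRecv int
}

// DebtRatio computes r = bytes_sent / (bytes_recv + 1). The +1 avoids
// division by zero for a partner that has not sent anything yet.
func (l Ledger) DebtRatio() float64 {
	return float64(l.BytesSent) / float64(l.BytesRecv+1)
}

// SendProbability computes P(send | r) = 1 - 1/(1 + exp(6 - 3r)).
func SendProbability(r float64) float64 {
	return 1 - 1/(1+math.Exp(6-3*r))
}

func main() {
	for _, l := range []Ledger{
		{BytesSent: 0, BytesRecv: 0},
		{BytesSent: 10, BytesRecv: 4},
		{BytesSent: 400, BytesRecv: 99},
	} {
		r := l.DebtRatio()
		fmt.Printf("r = %.2f, P(send) = %.4f\n", r, SendProbability(r))
	}
}
```

The probability stays close to 1 while the ratio is small, equals exactly 0.5 at r = 2, and drops quickly beyond that, which is how a node keeps serving new and reciprocating peers while choking those that only take.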

File system

The DHT and BitSwap let IPFS form a large peer-to-peer system for storing and distributing data blocks. On top of this, IPFS builds a Merkle DAG in which each IPFS object may contain a set of links and the data of the current node:

type IPFSLink struct {
    Name string
    Hash Multihash
    Size int
}
type IPFSObject struct {
    links []IPFSLink
    data []byte
}

We can use the following command to list all the links under the object:

$ ipfs ls QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG
QmZTR5bcpQD7cFgTorqxZDYaew1Wqgfbd2ud9QqGPAkK2V 1688 about
QmYCvbfNbCwFR45HiNP45rwJgvatpiW38D961L5qAhUM5Y 200  contact
QmY5heUM5qgRubMDD1og9fhCPA6QdkMp3QCwd4s7gJsyE7 322  help
QmdncfsVm2h5Kqq9hPmU7oAVX2zTSVP3L869tgTbPYnsha 1728 quick-start
QmPZ9gcCEpqKTo6aq61g2nXGUhM4iCL3ewB6LDXZCtioEB 1102 readme
QmTumTjvcYCAvRRwQ8sDRxh8ezmrcr88YFU7iYNroGGTBZ 1027 security-notes

Raw data plus generic links are the basis for building arbitrary data structures on IPFS. Key-value stores, traditional relational databases, and blockchains can all be stored and distributed on the IPFS Merkle DAG.
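
The key property of a Merkle DAG is that an object's address is derived from its content, including the hashes of the objects it links to. The Go sketch below demonstrates this with a simplified model: the text serialization and the plain hex SHA-256 digest are assumptions made for this example, whereas IPFS itself encodes objects with Protobuf and uses the multihash format.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Link points at another object by the hash of its content.
type Link struct {
	Name string
	Hash string // hex digest of the target object
	Size int
}

// Object holds links to children plus its own data.
type Object struct {
	Links []Link
	Data  []byte
}

// Hash derives the object's address from its links and data.
// Assumption: an ad hoc serialization, used for illustration only.
func (o Object) Hash() string {
	h := sha256.New()
	for _, l := range o.Links {
		fmt.Fprintf(h, "%q %s %d\n", l.Name, l.Hash, l.Size)
	}
	h.Write([]byte("|"))
	h.Write(o.Data)
	return hex.EncodeToString(h.Sum(nil))
}

// LinkTo builds a named link to child.
func LinkTo(name string, child Object) Link {
	return Link{Name: name, Hash: child.Hash(), Size: len(child.Data)}
}

func main() {
	readme := Object{Data: []byte("hello")}
	root := Object{Links: []Link{LinkTo("readme", readme)}}
	fmt.Println("root:", root.Hash())

	// Changing a child changes its hash, and therefore the parent's hash.
	readme.Data = []byte("hello!")
	root = Object{Links: []Link{LinkTo("readme", readme)}}
	fmt.Println("root after edit:", root.Hash())
}
```

Because every change propagates up to the root hash, identical content is deduplicated automatically and any tampering with a linked object is detectable from the root alone.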

On top of this, IPFS defines a set of objects for building a file system with version control. The model is very similar to Git's object model, and all file objects are binary-encoded with Protobuf:

IPFS files are represented by four kinds of objects:

  • A blob contains no links, only data;
  • A list contains an ordered sequence of blobs and other lists;
  • A tree is very similar to a tree in Git: it represents a directory that maps names to hashes;
  • A commit represents a snapshot of any object.

In this object graph, the top-level commit represents a historical snapshot. Comparing the trees and child nodes reachable from two commits yields the difference between the two snapshots. Together, the Merkle DAG and the file objects make up the entire file system in IPFS.
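
Comparing two snapshots is cheap because entries with equal hashes are guaranteed to have equal content and can be skipped without reading them. A minimal Go sketch of a single-level comparison follows; the Tree map type and the Diff function are hypothetical, and a real implementation would recurse into subtrees whose hashes differ.

```go
package main

import (
	"fmt"
	"sort"
)

// Tree maps entry names to the hashes of the objects they point to.
type Tree map[string]string

// Diff returns the names added, removed, and changed between two trees.
// Entries whose hashes match are skipped without being inspected.
func Diff(before, after Tree) (added, removed, changed []string) {
	for name, h := range after {
		prev, ok := before[name]
		switch {
		case !ok:
			added = append(added, name)
		case prev != h:
			changed = append(changed, name)
		}
	}
	for name := range before {
		if _, ok := after[name]; !ok {
			removed = append(removed, name)
		}
	}
	sort.Strings(added)
	sort.Strings(removed)
	sort.Strings(changed)
	return added, removed, changed
}

func main() {
	v1 := Tree{"about": "h1", "help": "h2", "readme": "h3"}
	v2 := Tree{"about": "h1", "help": "h2b", "contact": "h4"}
	added, removed, changed := Diff(v1, v2)
	fmt.Println("added:", added, "removed:", removed, "changed:", changed)
}
```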

Naming system

So far, the IPFS stack provides a peer-to-peer data exchange system that can send DAG objects between nodes and publish and retrieve immutable objects. A mutable naming system is an equally indispensable part of the network: we need a single address that can return different states over time, because a website's domain name cannot change every time the site is updated. IPFS therefore needs to provide a "domain name service" to solve this problem.

IPFS addresses this with mutable namespaces. A user can publish an object, and other nodes can reach the objects that user has published through /ipns/ followed by the user's node ID:

/ipns/XLF2ipQ4jD3UdeX5xp1KBgeHRhemUtaA8Vm/

Of course, we can also add TXT records to the existing DNS system, so that we can access the file objects published in the IPFS network through the domain name:

ipfs.benet.ai. TXT "ipfs=XLF2ipQ4jD3UdeX5xp1KBgeHRhemUtaA8Vm"

/ipns/XLF2ipQ4jD3UdeX5xp1KBgeHRhemUtaA8Vm
/ipns/ipfs.benet.ai

In IPFS, mutable objects can be reached not only through hashes; the naming system also fits into the existing DNS service, which allows the underlying service to be switched seamlessly.
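
The reason a name under /ipns/ can be mutable yet trustworthy is that it is derived from a public key, so only the holder of the matching private key can publish a valid update. The Go sketch below shows that idea under stated assumptions: the Record layout, the sequence number, the payload format, and the use of Ed25519 with hex-encoded SHA-256 names are all invented for this example and do not reproduce the actual IPNS record format.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Record is a hypothetical signed pointer from a name to a value.
type Record struct {
	Value     string // for example an immutable /ipfs/<hash> path
	Seq       uint64 // assumption: a higher number supersedes older records
	Signature []byte
}

// NewKey generates a key pair and panics if the system RNG fails.
func NewKey() (ed25519.PublicKey, ed25519.PrivateKey) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	return pub, priv
}

// Name derives the mutable name from the public key, mirroring
// NodeId = hash(PubKey) from the identity section.
func Name(pub ed25519.PublicKey) string {
	sum := sha256.Sum256(pub)
	return "/ipns/" + hex.EncodeToString(sum[:])
}

func payload(value string, seq uint64) []byte {
	return []byte(fmt.Sprintf("%d|%s", seq, value))
}

// Publish signs a new value for the name owned by priv.
func Publish(priv ed25519.PrivateKey, value string, seq uint64) Record {
	return Record{Value: value, Seq: seq,
		Signature: ed25519.Sign(priv, payload(value, seq))}
}

// Verify checks that the record was signed by the owner of pub.
func Verify(pub ed25519.PublicKey, r Record) bool {
	return ed25519.Verify(pub, payload(r.Value, r.Seq), r.Signature)
}

func main() {
	pub, priv := NewKey()
	rec := Publish(priv, "/ipfs/<hash of version 1>", 1)
	fmt.Println(Name(pub), "valid:", Verify(pub, rec))
}
```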

Incentives

When discussing blockchain technology in connection with IPFS, we have to mention FileCoin, which is built on IPFS. It provides a market in which storage hosts and uploaders trade, so the cost of storage is set by the market, and uploaders can choose a trade-off between speed, redundancy, and cost based on price.

Nodes

Most blockchain networks have only a single type of standard node, but there are two different nodes in FileCoin: storage nodes and retrieval nodes.

Anyone can become a storage node by renting out spare space on their own disks, which FileCoin uses to store small encrypted pieces of files. Retrieval nodes need to be close to as many storage nodes as possible and need high bandwidth and low latency, because users pay the retrieval node that returns the file fastest.

When we upload a file, we pay a storage fee. Storage nodes bid for the right to store the file, and FileCoin selects the storage node with the lowest price. The stored file is encrypted, split into multiple parts, and sent to multiple nodes; the locations of the parts are recorded in a global table, after which only a node holding the private key can look up, reassemble, and decrypt the uploaded file.
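
The selection rule described above, where the lowest bid wins, can be written down in a few lines of Go. This is only a toy model of the matching step: the Bid type, the integer price unit, and the CheapestBid function are hypothetical and say nothing about how the FileCoin market is actually implemented.

```go
package main

import (
	"errors"
	"fmt"
)

// Bid is a storage node's quote for storing one file.
type Bid struct {
	NodeID string
	Price  int // hypothetical price unit
}

// CheapestBid returns the bid with the lowest price. On a tie the
// earliest bid in the slice wins.
func CheapestBid(bids []Bid) (Bid, error) {
	if len(bids) == 0 {
		return Bid{}, errors.New("no bids received")
	}
	best := bids[0]
	for _, b := range bids[1:] {
		if b.Price < best.Price {
			best = b
		}
	}
	return best, nil
}

func main() {
	winner, err := CheapestBid([]Bid{
		{NodeID: "node-a", Price: 12},
		{NodeID: "node-b", Price: 7},
		{NodeID: "node-c", Price: 9},
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s stores the file for %d\n", winner.NodeID, winner.Price)
}
```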

Consensus algorithm

All blockchain applications need a consensus algorithm so that multiple nodes can agree on a result and resolve conflicts. Bitcoin uses Proof-of-Work as its consensus algorithm (as did Ethereum when this article was written), while FileCoin uses Proof-of-Replication (PoRep) to solve the problems specific to its network.

The Proof of Replication paper introduces the scheme as follows: "We introduce Proof-of-Replication (PoRep) schemes, which allow a prover P to (a) commit to store n distinct replicas (physically independent copies) of D, and then (b) convince a verifier V that P is indeed storing each of the replicas."

In other words, PoRep lets the prover P commit to storing n distinct replicas of D and then convince the verifier V that P really is storing each of them.

The Proof of Replication paper covers PoRep, PoS (Proof-of-Storage), and other schemes for verifying that a disk space provider really does store the data; I will not go into them here.
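
To give a feel for the prover and verifier roles, here is a deliberately simple challenge-response storage check in Go. It is not PoRep and not the FileCoin protocol: it only shows that a provider who has discarded the data cannot answer a fresh challenge, and it cannot prove that n physically independent replicas exist. All names (Challenge, Prepare, Respond, Check) are hypothetical.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
)

// Challenge is a random nonce plus the answer the verifier expects.
type Challenge struct {
	Nonce    []byte
	Expected [32]byte
}

// Respond is what the storage provider computes: H(nonce || data).
// It requires the full data, so it cannot be precomputed without
// knowing the nonce.
func Respond(nonce, data []byte) [32]byte {
	h := sha256.New()
	h.Write(nonce)
	h.Write(data)
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// Prepare precomputes n single-use challenges while the verifier still
// holds the data. Afterwards the verifier can delete its copy.
func Prepare(data []byte, n int) ([]Challenge, error) {
	cs := make([]Challenge, 0, n)
	for i := 0; i < n; i++ {
		nonce := make([]byte, 16)
		if _, err := rand.Read(nonce); err != nil {
			return nil, err
		}
		cs = append(cs, Challenge{Nonce: nonce, Expected: Respond(nonce, data)})
	}
	return cs, nil
}

// Check compares the provider's answer with the expected one.
func Check(c Challenge, answer [32]byte) bool {
	return subtle.ConstantTimeCompare(c.Expected[:], answer[:]) == 1
}

func main() {
	data := []byte("contents of the stored file")
	challenges, err := Prepare(data, 3)
	if err != nil {
		panic(err)
	}
	answer := Respond(challenges[0].Nonce, data) // computed by the provider
	fmt.Println("provider still has the data:", Check(challenges[0], answer))
}
```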

Summary

IPFS is a very interesting underlying technology for blockchains. It implements a peer-to-peer file storage system that stays compatible with existing Internet protocols and proposes a solution for storing large data. I tried the official IPFS client, go-ipfs, and it is indeed fairly easy to use, but the project is still at an early stage and many modules and features have not been finalized. FileCoin, which is built on IPFS, has no definite release schedule yet, and the Proof of Replication white paper is still marked WIP (Work In Progress) with some parts unfinished, so it will take a long time for this technology to mature.

Origin blog.csdn.net/uucckk/article/details/103816567