[Analysis of go-libp2p source code] Swarm dial

1. Introduction

The libp2p swarm is the "low-level" interface to a libp2p network, giving fine-grained control over all aspects of the system. A swarm can listen for inbound connections, or dial other hosts to establish new connections (for example, a TCP connection to another host). The "dialing" discussed here is precisely this process of establishing an outbound connection. Its implementation logic is fairly involved, so this article walks through it.

2. Code structure

Repository: https://github.com/libp2p/go-libp2p-swarm.git

The dial-related code is mainly distributed across three files: swarm_dial.go, limiter.go, and dial_sync.go. They contain the following structures:
swarm_dial.go: DialBackoff, backoffAddr
DialBackoff throttles redialing an address after a dial to it has failed.
dial_sync.go: DialSync, activeDial
DialSync is a synchronized-dialing helper: only one dial to a given peer is active at a time.
limiter.go: dialLimiter, dialJob, dialResult
dialLimiter caps the number of concurrent dials.

3. Sequence diagram

(Sequence diagram omitted: DialPeer -> DialSync -> dialLimiter/DialBackoff -> Transport.)

As the figure shows, dialing is essentially a series of checks around concurrency, synchronization, and retries, ending with a call into the Transport to actually dial. Suppose there are 1000 peers and each peer has 5 different addresses. Dialing them synchronously, one by one, would be far too slow, so multiple goroutines dial concurrently. But this cannot be completely unlimited: dialLimiter implements the cap on concurrent dials. Likewise, when an address fails to dial, retrying it immediately would most likely fail again and waste resources; you need to wait a while before redialing, so there must be an algorithm for how long to wait. DialBackoff implements that. Then why is DialSync needed? When an external program calls DialPeer, it may do so from multiple goroutines at once for the same peer. Since the swarm cannot restrict how its callers invoke it, it deduplicates at the source of dialing instead (concurrency across addresses is already handled internally).
Here is a diagram from swarm_dial.go showing how DialSync works:

 Diagram of dial sync:

   many callers of Dial()   synched w.  dials many addrs       results to callers
  ----------------------\    dialsync    use earliest          /--------------
  -----------------------\             |----------\           /----------------
  ------------------------>------------<------------>---------<-----------------
  -----------------------|              \----x                 \----------------
  ----------------------|                \-----x                \---------------
                                         any may fail          if no addr at end retry dialAttempt x
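
To make DialSync's job concrete, here is a minimal sketch (not from the source) of several goroutines dialing the same peer at once. It assumes an already-constructed *swarm.Swarm and a known peer.ID, with context, log, sync, go-libp2p-swarm (as swarm), and go-libp2p-core/peer imported:

// Sketch: five concurrent DialPeer calls to one peer trigger at most
// one underlying dial; every caller gets the same Conn or the same error.
func dialConcurrently(ctx context.Context, swrm *swarm.Swarm, p peer.ID) {
    var wg sync.WaitGroup
    for i := 0; i < 5; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            conn, err := swrm.DialPeer(ctx, p)
            if err != nil {
                log.Printf("caller %d: dial failed: %v", i, err)
                return
            }
            log.Printf("caller %d: connected to %s", i, conn.RemotePeer())
        }(i)
    }
    wg.Wait()
}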

4. Call entry points

Swarm exposes a DialPeer method through which applications can dial a peer directly. It is called from two places.

// DialPeer connects to a peer.
func (s *Swarm) DialPeer(ctx context.Context, p peer.ID) (network.Conn, error) {
    if s.gater != nil && !s.gater.InterceptPeerDial(p) {
        log.Debugf("gater disallowed outbound connection to peer %s", p.Pretty())
        return nil, &DialError{Peer: p, Cause: ErrGaterDisallowedConnection}
    }

    return s.dialPeer(ctx, p)
}

1. BasicHost's Connect method calls DialPeer

func (h *BasicHost) Connect(ctx context.Context, pi peer.AddrInfo) error {
    // absorb addresses into peerstore
    h.Peerstore().AddAddrs(pi.ID, pi.Addrs, peerstore.TempAddrTTL)

    if h.Network().Connectedness(pi.ID) == network.Connected {
        return nil
    }

    resolved, err := h.resolveAddrs(ctx, h.Peerstore().PeerInfo(pi.ID))
    if err != nil {
        return err
    }
    h.Peerstore().AddAddrs(pi.ID, resolved, peerstore.TempAddrTTL)

    return h.dialPeer(ctx, pi.ID)
}

func (h *BasicHost) dialPeer(ctx context.Context, p peer.ID) error {
    log.Debugf("host %s dialing %s", h.ID(), p)
    c, err := h.Network().DialPeer(ctx, p)
    if err != nil {
        return err
    }
    select {
    case <-h.ids.IdentifyWait(c):
    case <-ctx.Done():
        return ctx.Err()
    }

    log.Debugf("host %s finished dialing %s", h.ID(), p)
    return nil
}

Further up the stack, IpfsDHT's dialPeer calls BasicHost's Connect:

func (dht *IpfsDHT) dialPeer(ctx context.Context, p peer.ID) error {
    // short-circuit if we're already connected.
    if dht.host.Network().Connectedness(p) == network.Connected {
        return nil
    }

    logger.Debug("not connected. dialing.")
    routing.PublishQueryEvent(ctx, &routing.QueryEvent{
        Type: routing.DialingPeer,
        ID:   p,
    })

    pi := peer.AddrInfo{ID: p}
    if err := dht.host.Connect(ctx, pi); err != nil {
        logger.Debugf("error connecting: %s", err)
        routing.PublishQueryEvent(ctx, &routing.QueryEvent{
            Type:  routing.QueryError,
            Extra: err.Error(),
            ID:    p,
        })

        return err
    }
    logger.Debugf("connected. dial success.")
    return nil
}

2. In addition, Swarm's NewStream also calls dialPeer: if no connection has been established yet, it dials the peer first

func (s *Swarm) NewStream(ctx context.Context, p peer.ID) (network.Stream, error) {
    log.Debugf("[%s] opening stream to peer [%s]", s.local, p)
    dials := 0
    for {
        c := s.bestConnToPeer(p)
        if c == nil {
            if nodial, _ := network.GetNoDial(ctx); nodial {
                return nil, network.ErrNoConn
            }

            if dials >= DialAttempts {
                return nil, errors.New("max dial attempts exceeded")
            }
            dials++

            var err error
            c, err = s.dialPeer(ctx, p)
            if err != nil {
                return nil, err
            }
        }
        s, err := c.NewStream()
        if err != nil {
            if c.conn.IsClosed() {
                continue
            }
            return nil, err
        }
        return s, nil
    }
}

5. Dialer initialization

Swarm{
    ....
    // dialing helpers
    dsync   *DialSync
    backf   DialBackoff
    limiter *dialLimiter
}

func NewSwarm(ctx context.Context, local peer.ID, peers peerstore.Peerstore, bwc metrics.Reporter, extra ...interface{}) *Swarm {
    s := &Swarm{
        local: local,
        peers: peers,
        bwc:   bwc,
    }
    .....
    s.dsync = NewDialSync(s.doDial)
    s.limiter = newDialLimiter(s.dialAddr, s.IsFdConsumingAddr)
    s.proc = goprocessctx.WithContext(ctx)
    s.ctx = goprocessctx.OnClosingContext(s.proc)
    s.backf.init(s.ctx)

    return s
}

type DialFunc func(context.Context, peer.ID) (*Conn, error)

// NewDialSync constructs a new DialSync
func NewDialSync(dfn DialFunc) *DialSync {
    return &DialSync{
        dials:    make(map[peer.ID]*activeDial),
        dialFunc: dfn,
    }
}

type dialfunc func(context.Context, peer.ID, ma.Multiaddr) (transport.CapableConn, error)
type isFdConsumingFnc func(ma.Multiaddr) bool

func newDialLimiter(df dialfunc, fdFnc isFdConsumingFnc) *dialLimiter {
    fd := ConcurrentFdDials
    if env := os.Getenv("LIBP2P_SWARM_FD_LIMIT"); env != "" {
        if n, err := strconv.ParseInt(env, 10, 32); err == nil {
            fd = int(n)
        }
    }
    return newDialLimiterWithParams(fdFnc, df, fd, DefaultPerPeerRateLimit)
}

func newDialLimiterWithParams(isFdConsumingFnc isFdConsumingFnc, df dialfunc, fdLimit, perPeerLimit int) *dialLimiter {
    return &dialLimiter{
        isFdConsumingFnc:   isFdConsumingFnc,
        fdLimit:            fdLimit,
        perPeerLimit:       perPeerLimit,
        waitingOnPeerLimit: make(map[peer.ID][]*dialJob),
        activePerPeer:      make(map[peer.ID]int),
        dialFunc:           df,
    }
}

func (db *DialBackoff) init(ctx context.Context) {
    if db.entries == nil {
        db.entries = make(map[peer.ID]map[string]*backoffAddr)
    }
    go db.background(ctx)
}

In NewSwarm, the three dialing helpers DialSync, dialLimiter, and DialBackoff are initialized.
NewDialSync takes a dial function as a parameter (in practice, Swarm's doDial).
newDialLimiter takes two functions: the dial function (in practice, Swarm's dialAddr) and a predicate that reports whether an address's transport consumes a file descriptor (Unix socket / TCP).
DialBackoff's init starts a background goroutine that periodically cleans up expired backoff entries.

6. Goroutines involved

1. For each peer, DialSync starts one goroutine to perform the dial

func (ad *activeDial) start(ctx context.Context) {
    ad.conn, ad.err = ad.ds.dialFunc(ctx, ad.id)

    // This isn't the user's context so we should fix the error.
    switch ad.err {
    case context.Canceled:
        // The dial was canceled with `CancelDial`.
        ad.err = errDialCanceled
    case context.DeadlineExceeded:
        // We hit an internal timeout, not a context timeout.
        ad.err = ErrDialTimeout
    }
    close(ad.waitch)
    ad.cancel()
}

func (ds *DialSync) getActiveDial(p peer.ID) *activeDial {
    ds.dialsLk.Lock()
    defer ds.dialsLk.Unlock()

    actd, ok := ds.dials[p]
    if !ok {
        adctx, cancel := context.WithCancel(context.Background())
        actd = &activeDial{
            id:     p,
            cancel: cancel,
            waitch: make(chan struct{}),
            ds:     ds,
        }
        ds.dials[p] = actd

        go actd.start(adctx)
    }

    // increase ref count before dropping dialsLk
    actd.incref()

    return actd
}
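
The caller side is not shown above: DialPeer ultimately waits on the shared activeDial. Roughly, from the same dial_sync.go (simplified here, so names may differ slightly):

func (ad *activeDial) wait(ctx context.Context) (*Conn, error) {
    // each caller that obtained this activeDial holds a reference;
    // release it once we stop waiting
    defer ad.decref()
    select {
    case <-ad.waitch:
        // start() closed waitch: the dial finished (conn or err is set)
        return ad.conn, ad.err
    case <-ctx.Done():
        return nil, ctx.Err()
    }
}

func (ds *DialSync) DialLock(ctx context.Context, p peer.ID) (*Conn, error) {
    return ds.getActiveDial(p).wait(ctx)
}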

2. For each address of each peer, dialLimiter starts a goroutine to dial (subject to the limits below)

func (dl *dialLimiter) addCheckFdLimit(dj *dialJob) {
    if dl.shouldConsumeFd(dj.addr) {
        if dl.fdConsuming >= dl.fdLimit {
            log.Debugf("[limiter] blocked dial waiting on FD token; peer: %s; addr: %s; consuming: %d; "+
                "limit: %d; waiting: %d", dj.peer, dj.addr, dl.fdConsuming, dl.fdLimit, len(dl.waitingOnFd))
            dl.waitingOnFd = append(dl.waitingOnFd, dj)
            return
        }

        log.Debugf("[limiter] taking FD token: peer: %s; addr: %s; prev consuming: %d",
            dj.peer, dj.addr, dl.fdConsuming)
        // take token
        dl.fdConsuming++
    }

    log.Debugf("[limiter] executing dial; peer: %s; addr: %s; FD consuming: %d; waiting: %d",
        dj.peer, dj.addr, dl.fdConsuming, len(dl.waitingOnFd))
    go dl.executeDial(dj)
}

func (dl *dialLimiter) addCheckPeerLimit(dj *dialJob) {
    if dl.activePerPeer[dj.peer] >= dl.perPeerLimit {
        log.Debugf("[limiter] blocked dial waiting on peer limit; peer: %s; addr: %s; active: %d; "+
            "peer limit: %d; waiting: %d", dj.peer, dj.addr, dl.activePerPeer[dj.peer], dl.perPeerLimit,
            len(dl.waitingOnPeerLimit[dj.peer]))
        wlist := dl.waitingOnPeerLimit[dj.peer]
        dl.waitingOnPeerLimit[dj.peer] = append(wlist, dj)
        return
    }
    dl.activePerPeer[dj.peer]++

    dl.addCheckFdLimit(dj)
}

// executeDial calls the dialFunc, and reports the result through the
// response channel when finished. Once the response is sent it also
// releases all tokens it held during the dial.
func (dl *dialLimiter) executeDial(j *dialJob) {
    defer dl.finishedDial(j)
    if j.cancelled() {
        return
    }

    dctx, cancel := context.WithTimeout(j.ctx, j.dialTimeout())
    defer cancel()

    con, err := dl.dialFunc(dctx, j.peer, j.addr)
    select {
    case j.resp <- dialResult{Conn: con, Addr: j.addr, Err: err}:
    case <-j.ctx.Done():
        if err == nil {
            con.Close()
        }
    }
}
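
When a job completes, the deferred finishedDial returns the tokens it held and promotes waiting jobs. A simplified sketch of that logic (the actual limiter.go splits it into finishedDial, freeFDToken, and freePeerToken):

func (dl *dialLimiter) finishedDial(dj *dialJob) {
    dl.lk.Lock()
    defer dl.lk.Unlock()

    if dl.shouldConsumeFd(dj.addr) {
        // return the FD token, then hand it to the next FD waiter if any
        dl.fdConsuming--
        if len(dl.waitingOnFd) > 0 {
            next := dl.waitingOnFd[0]
            dl.waitingOnFd = dl.waitingOnFd[1:]
            dl.fdConsuming++
            go dl.executeDial(next)
        }
    }

    // return the per-peer token and promote a job waiting on this peer
    dl.activePerPeer[dj.peer]--
    if dl.activePerPeer[dj.peer] == 0 {
        delete(dl.activePerPeer, dj.peer)
    }
    if waitlist := dl.waitingOnPeerLimit[dj.peer]; len(waitlist) > 0 {
        next := waitlist[0]
        if len(waitlist) == 1 {
            delete(dl.waitingOnPeerLimit, dj.peer)
        } else {
            dl.waitingOnPeerLimit[dj.peer] = waitlist[1:]
        }
        dl.activePerPeer[dj.peer]++
        dl.addCheckFdLimit(next)
    }
}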

3. Backoff cleanup

func (db *DialBackoff) init(ctx context.Context) {
    if db.entries == nil {
        db.entries = make(map[peer.ID]map[string]*backoffAddr)
    }
    go db.background(ctx)
}

func (db *DialBackoff) background(ctx context.Context) {
    ticker := time.NewTicker(BackoffMax)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            db.cleanup()
        }
    }
}

func (db *DialBackoff) cleanup() {
    db.lock.Lock()
    defer db.lock.Unlock()
    now := time.Now()
    for p, e := range db.entries {
        good := false
        for _, backoff := range e {
            backoffTime := BackoffBase + BackoffCoef*time.Duration(backoff.tries*backoff.tries)
            if backoffTime > BackoffMax {
                backoffTime = BackoffMax
            }
            if now.Before(backoff.until.Add(backoffTime)) {
                good = true
                break
            }
        }
        if !good {
            delete(db.entries, p)
        }
    }
}

7. Some important rules and algorithms

1. Filtering dialable addresses

// filterKnownUndialables takes a list of multiaddrs, and removes those
// that we definitely don't want to dial: addresses configured to be
// blocked, IPv6 link-local addresses, addresses without a dial-capable
// transport, and addresses that we know to be our own. This is an
// optimization to avoid wasting time on dials that we know are going
// to fail.
func (s *Swarm) filterKnownUndialables(p peer.ID, addrs []ma.Multiaddr) []ma.Multiaddr {
    lisAddrs, _ := s.InterfaceListenAddresses()
    var ourAddrs []ma.Multiaddr
    for _, addr := range lisAddrs {
        protos := addr.Protocols()
        // we're only sure about filtering out /ip4 and /ip6 addresses, so far
        if len(protos) == 2 && (protos[0].Code == ma.P_IP4 || protos[0].Code == ma.P_IP6) {
            ourAddrs = append(ourAddrs, addr)
        }
    }

    return addrutil.FilterAddrs(addrs,
        addrutil.SubtractFilter(ourAddrs...),
        s.canDial,
        // TODO: Consider allowing link-local addresses
        addrutil.AddrOverNonLocalIP,
        func(addr ma.Multiaddr) bool {
            return s.gater == nil || s.gater.InterceptAddrDial(p, addr)
        },
    )
}

// FilterAddrs is a filter that removes certain addresses, according to the given filters.
// If all filters return true, the address is kept.
func FilterAddrs(a []ma.Multiaddr, filters ...func(ma.Multiaddr) bool) []ma.Multiaddr {
    b := make([]ma.Multiaddr, 0, len(a))
    for _, addr := range a {
        good := true
        for _, filter := range filters {
            good = good && filter(addr)
        }
        if good {
            b = append(b, addr)
        }
    }
    return b
}

// AddrOverNonLocalIP returns whether the addr uses a non-local ip link
func AddrOverNonLocalIP(a ma.Multiaddr) bool {
    split := ma.Split(a)
    if len(split) < 1 {
        return false
    }
    if manet.IsIP6LinkLocal(split[0]) {
        return false
    }
    return true
}
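
As a quick standalone illustration of FilterAddrs and AddrOverNonLocalIP (a sketch assuming the go-addr-util and go-multiaddr packages quoted above):

package main

import (
    "fmt"

    addrutil "github.com/libp2p/go-addr-util"
    ma "github.com/multiformats/go-multiaddr"
)

func main() {
    addrs := []ma.Multiaddr{
        ma.StringCast("/ip4/192.168.1.10/tcp/4001"),
        ma.StringCast("/ip6/fe80::1/tcp/4001"), // IPv6 link-local
        ma.StringCast("/ip4/1.2.3.4/udp/4001/quic"),
    }

    // FilterAddrs keeps an address only if every filter returns true;
    // here the link-local address is dropped.
    kept := addrutil.FilterAddrs(addrs, addrutil.AddrOverNonLocalIP)
    fmt.Println(kept)
}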

2. Ranking dial addresses

// ranks addresses in descending order of preference for dialing:
//   Private UDP > Public UDP > Private TCP > Public TCP >
//   UDP Relay server > TCP Relay server
    rankAddrsFnc := func(addrs []ma.Multiaddr) []ma.Multiaddr {
        var localUdpAddrs []ma.Multiaddr // private udp
        var relayUdpAddrs []ma.Multiaddr // relay udp
        var othersUdp []ma.Multiaddr     // public udp

        var localFdAddrs []ma.Multiaddr // private fd consuming
        var relayFdAddrs []ma.Multiaddr //  relay fd consuming
        var othersFd []ma.Multiaddr     // public fd consuming

        for _, a := range addrs {
            if _, err := a.ValueForProtocol(ma.P_CIRCUIT); err == nil {
                if s.IsFdConsumingAddr(a) {
                    relayFdAddrs = append(relayFdAddrs, a)
                    continue
                }
                relayUdpAddrs = append(relayUdpAddrs, a)
            } else if manet.IsPrivateAddr(a) {
                if s.IsFdConsumingAddr(a) {
                    localFdAddrs = append(localFdAddrs, a)
                    continue
                }
                localUdpAddrs = append(localUdpAddrs, a)
            } else {
                if s.IsFdConsumingAddr(a) {
                    othersFd = append(othersFd, a)
                    continue
                }
                othersUdp = append(othersUdp, a)
            }
        }

        relays := append(relayUdpAddrs, relayFdAddrs...)
        fds := append(localFdAddrs, othersFd...)

        return append(append(append(localUdpAddrs, othersUdp...), fds...), relays...)
    }
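
For example, given the following made-up address set, the function would return it in this order:

// input, in arbitrary order:
//   /ip4/192.168.1.10/tcp/4001          private, FD-consuming
//   /ip4/5.6.7.8/udp/4001/quic          public, not FD-consuming
//   /ip4/192.168.1.10/udp/4001/quic     private, not FD-consuming
//   /ip4/1.2.3.4/tcp/4001               public, FD-consuming
//   /ip4/9.9.9.9/tcp/4001/p2p-circuit   relay
//
// ranked output:
//   /ip4/192.168.1.10/udp/4001/quic     private UDP first
//   /ip4/5.6.7.8/udp/4001/quic          then public UDP
//   /ip4/192.168.1.10/tcp/4001          then private TCP
//   /ip4/1.2.3.4/tcp/4001               then public TCP
//   /ip4/9.9.9.9/tcp/4001/p2p-circuit   relays last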

3. Backoff time calculation

// BackoffBase is the base amount of time to backoff (default: 5s).
var BackoffBase = time.Second * 5

// BackoffCoef is the backoff coefficient (default: 1s).
var BackoffCoef = time.Second

// BackoffMax is the maximum backoff time (default: 5m).
var BackoffMax = time.Minute * 5

// AddBackoff lets other nodes know that we've entered backoff with
// peer p, so dialers should not wait unnecessarily. We still will
// attempt to dial with one goroutine, in case we get through.
//
// Backoff is not exponential, it's quadratic and computed according to the following formula:
//
//     BackoffBase + BackoffCoef * PriorBackoffs^2
//
// Where PriorBackoffs is the number of previous backoffs.
func (db *DialBackoff) AddBackoff(p peer.ID, addr ma.Multiaddr) {
    saddr := string(addr.Bytes())
    db.lock.Lock()
    defer db.lock.Unlock()
    bp, ok := db.entries[p]
    if !ok {
        bp = make(map[string]*backoffAddr, 1)
        db.entries[p] = bp
    }
    ba, ok := bp[saddr]
    if !ok {
        bp[saddr] = &backoffAddr{
            tries: 1,
            until: time.Now().Add(BackoffBase),
        }
        return
    }

    backoffTime := BackoffBase + BackoffCoef*time.Duration(ba.tries*ba.tries)
    if backoffTime > BackoffMax {
        backoffTime = BackoffMax
    }
    ba.until = time.Now().Add(backoffTime)
    ba.tries++
}
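
Plugging the defaults into this formula gives the following schedule (a worked illustration):

// first failure: entry created with tries=1, until = now + BackoffBase (5s)
// each later failure computes the delay from the previous tries count:
//   tries=1  -> 5s + 1s*1^2  = 6s
//   tries=2  -> 5s + 1s*2^2  = 9s
//   tries=3  -> 5s + 1s*3^2  = 14s
//   tries=5  -> 5s + 1s*5^2  = 30s
//   tries=10 -> 5s + 1s*10^2 = 105s
//   tries>=18 -> capped at BackoffMax (5m)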

8. Core dialing logic

func (s *Swarm) dialAddrs(ctx context.Context, p peer.ID, remoteAddrs []ma.Multiaddr) (transport.CapableConn, *DialError) {
    /*
        This slice-to-chan code is temporary, the peerstore can currently provide
        a channel as an interface for receiving addresses, but more thought
        needs to be put into the execution. For now, this allows us to use
        the improved rate limiter, while maintaining the outward behaviour
        that we previously had (halting a dial when we run out of addrs)
    */
    var remoteAddrChan chan ma.Multiaddr
    if len(remoteAddrs) > 0 {
        remoteAddrChan = make(chan ma.Multiaddr, len(remoteAddrs))
        for i := range remoteAddrs {
            remoteAddrChan <- remoteAddrs[i]
        }
        close(remoteAddrChan)
    }

    log.Debugf("%s swarm dialing %s", s.local, p)

    ctx, cancel := context.WithCancel(ctx)
    defer cancel() // cancel work when we exit func

    // use a single response type instead of errs and conns, reduces complexity *a ton*
    respch := make(chan dialResult)
    err := &DialError{Peer: p}

    defer s.limiter.clearAllPeerDials(p)

    var active int
dialLoop:
    for remoteAddrChan != nil || active > 0 {
        // Check for context cancellations and/or responses first.
        select {
        case <-ctx.Done():
            break dialLoop
        case resp := <-respch:
            active--
            if resp.Err != nil {
                // Errors are normal, lots of dials will fail
                if resp.Err != context.Canceled {
                    s.backf.AddBackoff(p, resp.Addr)
                }

                log.Infof("got error on dial: %s", resp.Err)
                err.recordErr(resp.Addr, resp.Err)
            } else if resp.Conn != nil {
                return resp.Conn, nil
            }

            // We got a result, try again from the top.
            continue
        default:
        }

        // Now, attempt to dial.
        select {
        case addr, ok := <-remoteAddrChan:
            if !ok {
                remoteAddrChan = nil
                continue
            }

            s.limitedDial(ctx, p, addr, respch)
            active++
        case <-ctx.Done():
            break dialLoop
        case resp := <-respch:
            active--
            if resp.Err != nil {
                // Errors are normal, lots of dials will fail
                if resp.Err != context.Canceled {
                    s.backf.AddBackoff(p, resp.Addr)
                }

                log.Infof("got error on dial: %s", resp.Err)
                err.recordErr(resp.Addr, resp.Err)
            } else if resp.Conn != nil {
                return resp.Conn, nil
            }
        }
    }

    if ctxErr := ctx.Err(); ctxErr != nil {
        err.Cause = ctxErr
    } else if len(err.DialErrors) == 0 {
        err.Cause = network.ErrNoRemoteAddrs
    } else {
        err.Cause = ErrAllDialsFailed
    }
    return nil, err
}

A peer may have multiple addresses. After filtering out the undialable ones and ranking the rest, the addresses arrive here. They are first pushed into a channel, and the loop (dialLoop) then consumes them:

STEP 1. Check the context and any pending responses.
1.1. If the context has been canceled, break out of the loop.
1.2. If a response is available, decrement the active count. If the previous dial failed, add a backoff entry and record the error; if it succeeded, return the Conn immediately; otherwise continue with the next iteration.

STEP 2. Attempt a dial.
2.1. Take an address from the channel and call limitedDial, which hands the job to the limiter (internally a goroutine performs the dial; see the sketch after this list), and increment the active count.
2.2. If the context is canceled, break out of the loop.
2.3. If a response arrives instead, decrement the active count. On error, add a backoff entry and record the error; on success, return the Conn; otherwise continue with the next iteration (which starts again at step 1).

STEP 3. Return an error.
If dialLoop ends without returning a Conn, then either the context was canceled, there were no addresses to dial, or every dial failed.
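
The limitedDial called in step 2.1 is a thin wrapper that packages the address into a dialJob and hands it to the limiter; a sketch based on the same swarm_dial.go (simplified):

func (s *Swarm) limitedDial(ctx context.Context, p peer.ID, a ma.Multiaddr, resp chan dialResult) {
    // the limiter decides when this job actually runs, enforcing
    // the FD and per-peer limits shown earlier
    s.limiter.AddDialJob(&dialJob{
        addr: a,
        peer: p,
        resp: resp,
        ctx:  ctx,
    })
}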

Suppose there are three addresses. Because each dial runs in its own goroutine, the first may fail while the second succeeds. While waiting for the second (successful) response, the third dial job may already be executing. Before the Conn is returned, defer cancel() runs, so the third dial job receives the cancel signal; if it happens to have succeeded anyway, its Conn is closed (see dialLimiter.executeDial). defer s.limiter.clearAllPeerDials(p) also runs, cleaning up this peer's entries in waitingOnPeerLimit. Whether the dial succeeded or failed, dialing for this peer is now finished.

Netwarps is a senior cloud-computing and distributed-systems development team based in China, with rich experience in the finance, power, telecommunications, and Internet industries. Netwarps currently has R&D centers in Shenzhen and Beijing, with a team of 30+, most of whom are engineers with more than ten years of development experience, drawn from the Internet, finance, cloud computing, blockchain, and research institutions.
Netwarps focuses on the development and application of secure storage technology products, mainly a decentralized file system (DFS) and a decentralized computing platform (DCP). It is committed to providing distributed storage and distributed computing platforms built on decentralized network technology, characterized by high availability, low power consumption, and low network overhead, suitable for scenarios such as the Internet of Things and the Industrial Internet.
Official account: Netwarps

Source: blog.51cto.com/14915984/2549259