SONiC Warm Reboot

Original: https://github.com/jipanyang/SONiC/blob/69d76c5fd2d870e2c53cbe367fd09927bb4836ba/doc/warm-reboot/SONiC_Warmboot.md

Overview

The goal of SONiC warm restart is to be able to restart and upgrade the SONiC software without impacting the data plane. Warm restart of each individual process/docker is also part of the goal. Except for syncd and the database docker, all other network applications and dockers need to support unplanned warm restart.
For restart processing, SONiC can be roughly divided into three layers:

Network applications and Orchagent: each application goes through a similar flow. The application and the corresponding orchagent sub-module work together to restore the original data and then apply the warm-restart delta. Take routing as an example: after the restart, the BGP application restarts gracefully and re-synchronizes to the latest route state through sessions with its peers. fpmsyncd takes BGP's output to program APP_DB again; stale routes are removed and new/changed routes are programmed, while unchanged routes are processed without modification. RouteOrch consumes the requests from fpmsyncd and propagates any changes down to syncd.

Syncd: Syncd should dump ASIC_DB before restarting and restore it to the same state as before the restart. The restore of syncd itself should not disturb the state of the ASIC. Syncd receives changes from Orchagent and passes them to LibSAI/ASIC after the necessary translation.
LibSAI/ASIC: ASIC vendors need to ensure that the state of ASIC and LibSAI is restored to the same state as before the restart.

 

Requirements

In-Service restart

A mechanism for restarting a component without impacting service. This assumes the component's software version is unchanged across the restart. During the restart window there may be state changes, such as new/stale routes, port state changes, and FDB changes.
The component here can be the whole SONiC system, or one or more dockers running in SONiC.

Un-Planned restart

All network applications and orchagent are expected to handle unplanned restart and recover gracefully. Due to the dependency on ASIC processing, this is not a mandatory requirement for syncd and ASIC/LibSAI.

BGP docker restart

After the BGP docker restarts, it can learn the latest routes from its BGP peers. Some routes previously pushed down to APP_DB and the ASIC may no longer exist. The system should be able to clear such stale routes from APP_DB down to the ASIC and program the new routes.

SWSS docker restart

After the swss docker restarts, all port/LAG, VLAN, interface, ARP and route data should be restored from configDB, APP_DB, the Linux kernel and other reliable sources. Port state, ARP and FDB may change during the restart window, so proper sync-up should be performed.

Syncd docker restart

Restarting the syncd docker should keep the data plane intact. After restart, syncd resumes control of ASIC/LibSAI and its communication with the swss docker. All other functions running in the syncd docker, such as flex-counter processing, should be restored as well.

Teamd docker restart

Restarting the teamd docker should not cause link flapping or any traffic loss. All LAGs on the data plane should stay intact.

In-Service upgrade

A mechanism for upgrading components to a newer version without affecting the service.
The components here can be the entire SONiC system, or one or more dockers running SONiC.

Case 1: without SAI api call change

There are software changes in network applications such as BGP, neighsyncd, portsyncd or even orchagent, but the changes do not affect the interface to syncd or the layout of existing data (metadata and dependency graph). During the restart window there may be state changes, such as new/stale routes, port state changes, and FDB changes.
All the processing for in-service restart applies here as well.

Case 2: with SAI api call change

Case 2.1 attribute change with SET

The new version of orchagent may call the SET API with attribute values different from the previous version, or call SET with a new set of attributes.

Case 2.2 Object change with REMOVE

Objects that existed in the previous version may be removed by default in the new software version.

Case 2.3 Object change with CREATE

Two situations:

case 2.3.1 New SAI object

This is a new object type defined at the SAI layer, which triggers a CREATE call from the new version of orchagent.

case 2.3.2 Old object in previous version to be replaced with new object in new software version

For example, an object may be created with more or fewer attributes, with different attribute values, or multiple object instances may be replaced by one aggregate object. This is the most complicated scenario: if the old object is not a leaf object, all other objects that depend on it must be cleaned up correctly first.

Cold restart fallback

An option to choose cold or warm restart should be provided through configuration for the swss, syncd and teamd dockers. If warm restart fails, a fallback mechanism to cold restart should be provided.

Proposal 1: Reconciliation at Orchagent

Key steps

Restore to original state

a. LibSAI/ASIC can restore to the pre-restart state without disturbing the upper layers.

b. Syncd can restore to the pre-restart state without disturbing the ASIC or the upper layers.

c. The syncd state is driven by Orchagent (except for FDB). Once restored, syncd does not need to perform any reconciliation on its own.

Remove stale data and apply new updates

a. Depending on its specific behavior, each network application either reads data from configDB or obtains it from other sources such as the Linux kernel (e.g., for ports and ARP) or the BGP protocol, and then programs APP_DB again. It tracks any stale data for removal. Orchagent consumes the requests from APP_DB.
b. Orchagent restores the data of applications running in other dockers (such as BGP and teamd) from APP_DB, so that it can handle the case where only swss restarts; ACL data is restored from configDB. Relying on idempotency of the LibSaiRedis interface, Orchagent ensures that create/remove/set operations already performed on an object before the restart are not passed down again.
Note that loose ordering control helps reduce dependency wait time in orchagent: route restore can start once the port, LAG, interface and ARP data have (mostly) been processed.
Each application is responsible for collecting any delta across the restart and performing create (new objects), set, or remove (stale objects) operations on the delta data.
c. Syncd processes requests from Orchagent just as in a normal boot.

The states of ASIC/LibSAI, syncd, orchagent and the applications then become consistent and up to date.

Questions

How syncd restores to the state of pre-shutdown

In this approach, syncd only needs to save and restore the mapping between object RIDs and VIDs.

How Orchagent manages data dependencies during state restore

The constructor of each orchagent sub-module can run just as in a normal start.

Each application restores its data by reading configDB or the Linux kernel, or by re-learning it through network protocols after the restart, and programs APP_DB accordingly. Each network application and orchagent sub-module handles its dependencies, which means certain operations may be postponed until all required objects are ready. Dependency checking is already part of the existing orchagent implementation, though this new scenario may expose new issues.
To handle the case where only swss restarts, besides subscribing to the APP_DB consumer channels, orchagent also restores route data (for the BGP docker) and port-channel data (for the teamd docker) directly from APP_DB. Loose ordering control of the data restore helps speed up processing.

What is missing in Orchagent for it to restore to the state of pre-shutdown

Orchagent and the applications can get data from configDB and APP_DB as during a normal start, but in order to stay in sync and communicate with syncd, each object whose key type is sai_object_id_t must also be given the same OID as before.

typedef struct _sai_object_key_t
{
    union _object_key {
        sai_object_id_t           object_id;
        sai_fdb_entry_t           fdb_entry;
        sai_neighbor_entry_t      neighbor_entry;
        sai_route_entry_t         route_entry;
        sai_mcast_fdb_entry_t     mcast_fdb_entry;
        sai_l2mc_entry_t          l2mc_entry;
        sai_ipmc_entry_t          ipmc_entry;
        sai_inseg_entry_t         inseg_entry;
    } key;
} sai_object_key_t;

How Orchagent gets the OID information

For a SAI redis create operation on an object whose key type is sai_object_id_t, Orchagent must be able to use exactly the same OID as before shutdown, otherwise it goes out of sync with syncd. However, the current Orchagent implementation keeps OIDs only in runtime data structures.
The same approach also works for object IDs obtained through SAI redis get operations.
One possible solution is to save the mapping between OID and attr_list in redis_generic_create(). This assumes that during restore, object creation uses exactly the same attr_list, so the same OID can be looked up and returned.

When an attribute is changed for the first time, the original default mapping can be saved in the DEFAULT_ATTR2OID_ and DEFAULT_OID2ATTR_ tables, because during the restore an object create may use the default attributes rather than the current ones.
All new changes are applied to the regular ATTR2OID_ and OID2ATTR_ mapping tables.
For the case where multiple objects are created with the same set of attributes, an extra owner identifier can be attached to the attribute-to-OID mapping, so each object is uniquely identifiable by its owner context. A prominent example is using lag_alias as the LAG owner, so each LAG retrieves the same OID across restart even though a NULL attribute list is provided to create_lag().

+    SET_OBJ_OWNER(lag_alias);
     sai_status_t status = sai_lag_api->create_lag(&lag_id, gSwitchId, 0, NULL);
+    UNSET_OBJ_OWNER();

No virtual OID layer is required in this solution, but keeping the virtual OID layer does no harm.
The LibSaiRedis interface requires idempotency.

How to handle the cases of SAI api call change during the restore phase

Case 2.1 Attribute change with SET: at the sai_redis_generic_set layer, based on the object key, compare the attribute values and apply the changes directly down to syncd/libsai/ASIC.

Case 2.2 Object change with REMOVE: at the sai_redis_generic_remove layer, if the object key is found in restoreDB, directly apply the REMOVE SAI API call down to syncd/libsai/ASIC. Dependencies are already guaranteed by orchagent.
Case 2.3 Object change with CREATE:
Case 2.3.1 New SAI object: simply apply the SAI API create operation down to syncd/libsai/ASIC; dependencies are already guaranteed by orchagent. If it is not a leaf object, however, its creation has a cascading effect on other objects that depend on it, which is handled in the next case. If the new SAI object is only used as an attribute in SET calls on other objects, it can be handled as an attribute change with SET as in Case 2.1.
Case 2.3.2 Old object in the previous version replaced by a new object in the new software version: if this is a leaf object, such as a route entry, neighbor entry or FDB entry, it can be handled simply by adding version-specific logic to remove the old object and create the new one. Otherwise, if other objects must reference this object as one of their attributes at create time, those dependent objects should be removed first, followed by the old object itself. Version-specific logic is required here.

How to deal with notifications missed during the reboot/restart window

Port/FDB state notifications may arrive during the restart window. Perhaps the corresponding orchagent sub-module should perform a get operation on the object to sync up.

Requirements on LibSAI and ASIC

LibSAI and the ASIC should be able to save all necessary state upon a shutdown request with the warm-restart option. Upon the create_switch() request, LibSAI/ASIC should restore to the exact pre-shutdown state. The data plane should not be affected during the whole restore process. Once the restore is done, LibSAI/ASIC works in the normal operational state, unaware of any warm-restart processing in the upper layers. Idempotency of create/remove/set in LibSAI is desirable, but may not be absolutely necessary for the warm-restart solution.

Requirements on syncd

Syncd should be able to save all necessary state upon a shutdown request with the warm-restart option. Upon restart, syncd should restore to the exact pre-shutdown state. Once the restore is done, syncd works in the normal operational state, unaware of any warm-restart processing taking place in the upper layers.

Requirement on network applications and orch data

General requirements

Each application should be able to restore to its pre-shutdown state.
Orchagent must be able to save and restore the OIDs of the objects it created whose key type is sai_object_id_t. For objects not created by Orchagent, the OID can be restored through the get operation of the libsairedis interface.
The orchagent sub-module of each application can use the existing normal constructor and ProducerStateTable/ConsumerStateTable processing flow to enforce dependencies and populate internal data structures.
If only the swss docker restarts, it should be able to restore route and LAG data directly from APP_DB, because the BGP docker and the teamd docker will not push the whole data set to APP_DB again in that scenario.
After state restore, each application should be able to remove any stale objects/state, perform any necessary create/set operations, and then process requests in the normal way.

Port

Team

Interface

Fdb

Arp

Route

Acl

Buffer

Qos

Summary

Layer                 | Restore | Reconciliation | Idempotency                 | Dependency management
Application/Orchagent | Y       | Y              | Y for LibSaiRedis interface | Y
Syncd                 | Y       | N              | Good to have                | Good to have
LibSAI/ASIC           | Y       | N              | Good to have                | Good to have

Approach evaluation

Advantages

  • Simple and clear logic, easy to implement for most upgrade/restart cases.
  • Layer/application decoupling makes it easy to divide and conquer.
  • Each docker is self-contained, which prepares for unplanned warm restart of swss processes and other network applications.

Concerns/Issues with this approach

  • Orchagent software upgrade may be inconvenient, especially for the SAI object replacement case, which requires Orchagent to carry one-off version-specific code to handle the in-service upgrade.

Proposal 2: Reconciliation at syncd

The existing syncd INIT/APPLY view framework

Basically, two views are created for warm restart: the current view reflects the ASIC state before shutdown, and the temporary view reflects the new intended ASIC state after the restart. Based on the SAI object data model, each view is a directed acyclic graph in which all (?) objects are linked together.

Invariants for view comparison

Switch internal objects discovered via the SAI get operation.

They include SAI_OBJECT_TYPE_PORT, SAI_OBJECT_TYPE_QUEUE, SAI_OBJECT_TYPE_SCHEDULER_GROUP, etc. The RID/VID of these objects is assumed to stay unchanged. Question 1: what happens if the set of discovered objects changes after a version change?

Question 2: what if some of the discovered objects got changed? Like dynamic port breakout case.

Configured attribute values, such as VLAN id, interface IP, etc.

The configured values may change; those that are unchanged may serve as invariants. Question 3: could some virtual OIDs for objects created in the temp view coincidentally match objects in the current view even though the objects are different? See matchOids().

View comparison logic

The view comparison logic uses the metadata of the objects and uses these invariants as anchor points. For each object in the temp view, it searches from the root of the tree down through all levels of children to the leaves for the best match. If no match is found, the object in the temp view needs to be created: a CREATE operation. If a best match is found but the attributes of the temp-view object and the current-view object differ, a SET operation should be performed. An exact match yields a translation from temp VID to current VID, which also paves the way for the comparison one level up. All objects in the current view whose final reference count is 0 should be deleted: a REMOVE operation.

Question 4: how to handle two objects with exactly same attributes? Example: overlay loopback RIF and underlay loopback RIF. VRF and possibly some other object in same situation?

Question 5: New version of software call create() API with one extra attribute, how will that be handled? Old way of create() plus set() for the extra attribute, or delete the existing object then create a brand new one?

Question 6: findCurrentBestMatchForGenericObject(), the method looks dynamic. What we need is deterministic processing which matches exactly what orchagent will do (if same operation is to be done there instead), no new unnecessary REMOVE/SET/CREATE, how to guarantee that?


Orchagent and network application layer processing

Except for the idempotency support for create/set/remove operations on the libsairedis interface, which is not needed here, this proposal requires the same processing as Proposal 1, such as the original data restore and each application removing stale APP_DB data as needed.
One possible but extreme approach is to always flush all related APP_DB tables, or even the whole APP_DB, upon application restart, and let each application repopulate the data from scratch. The full new data set is then pushed down to syncd, and syncd performs the comparison logic between the old data and the new data.

Approach evaluation

Advantages

  • Generic processing based on SAI object model.
  • There is no need to change the libsairedis library implementation, and no need to restore OIDs at the orchagent layer.

Concerns/Issues with this approach

  • Highly complex logic in syncd.
  • Warm restart of the upper-layer applications is tightly coupled with syncd.
  • Various corner cases arising from the SAI object model, and changes in the SAI object model itself, must be handled.

Open issues

How to perform version control of software upgrades at the docker level?

The show version command can retrieve the version data of each docker. Further extension may be built on top of this.

[email protected]:/home/admin# show version
SONiC Software Version: SONiC.130-14f14a1
Distribution: Debian 8.1
Kernel: 3.16.0-4-amd64
Build commit: 14f14a1
Build date: Wed May 23 09:12:22 UTC 2018
Built by: jipan@ubuntu01

Docker images:
REPOSITORY                 TAG                 IMAGE ID            SIZE
docker-fpm-quagga          latest              0f631e0fb8d0        390.4 MB
docker-syncd-brcm          130-14f14a1         4941b40cc8e7        444.4 MB
docker-syncd-brcm          latest              4941b40cc8e7        444.4 MB
docker-orchagent-brcm      130-14f14a1         40d4a1c08480        386.6 MB
docker-orchagent-brcm      latest              40d4a1c08480        386.6 MB
docker-lldp-sv2            130-14f14a1         f32d15dd4b77        382.7 MB
docker-lldp-sv2            latest              f32d15dd4b77        382.7 MB
docker-dhcp-relay          130-14f14a1         df7afef22fa0        378.2 MB
docker-dhcp-relay          latest              df7afef22fa0        378.2 MB
docker-database            130-14f14a1         a4a6ba6874c7        377.7 MB
docker-database            latest              a4a6ba6874c7        377.7 MB
docker-snmp-sv2            130-14f14a1         89d249faf6c4        444 MB
docker-snmp-sv2            latest              89d249faf6c4        444 MB
docker-teamd               130-14f14a1         b127b2dd582d        382.8 MB
docker-teamd               latest              b127b2dd582d        382.8 MB
docker-sonic-telemetry     130-14f14a1         89f4e1bb1ede        396.1 MB
docker-sonic-telemetry     latest              89f4e1bb1ede        396.1 MB
docker-router-advertiser   130-14f14a1         6c90b2951c2c        375.4 MB
docker-router-advertiser   latest              6c90b2951c2c        375.4 MB
docker-platform-monitor    130-14f14a1         29ef746feb5a        397 MB
docker-platform-monitor    latest              29ef746feb5a        397 MB
docker-fpm-quagga          130-14f14a1         5e87d0ae9190        389.4 MB

Rollback support in SONiC

This is a general requirement, not limited to warm restart. Perhaps a separate design document should be prepared for this topic.

What are the requirements for control plane downtime?

Currently there is no hard requirement on control plane downtime during warm restart. A proper number should be agreed upon.

Supported warm restart upgrade path

No clear requirement is available yet. The general idea is to support warm reboot between consecutive SONiC releases.

LibSAI/SDK warm restart delay requirements

There is no strict requirement for this layer; on the order of a few seconds, say 10 seconds?

SAI/LibSAI/SDK backward compatibility requirements?

Yes, backward compatibility is required for warm restart support.

What are the requirements of LibSAI/SDK for data plane traffic during a warm restart? Can the FDB be flushed?

Existing data flows should see no packet loss on the data plane. Usually, an FDB flush should be triggered by the NOS instead of LibSAI/SDK.

What is the principle of SONiC warm restart support?

One of the principles is that warm restart is supported in each layer/module/docker, and each layer/module/docker stays independent.

Origin blog.csdn.net/weixin_39094034/article/details/115317300