IS-IS Distributed Flooding Reduction

Internet-Draft	IS-IS Distributed Flooding Reduction	October 2024
White, et al.	Expires 21 April 2025	[Page]

Abstract

In dense topologies (such as data center fabrics based on the Clos and butterfly though not limited to those; in fact any large topology or one with relatively high degree of connectivity qualifies here) IGP flooding mechanisms designed originally for rather sparse topologies can "overflood", or in other words generate too many identical copies of same information arriving at a given node from other devices. This normally results in longer convergence times and higher resource utilization to process and discard the superfluous copies. Flooding algorithm extensions that restrict the amount of flooding performed can be constructed and can reduce resource utilization significantly, while improving convergence performance.¶

One such flooding modification (based on previous art) optimized for operational considerations, described further in Section 2, is described in this document.¶

1. MANET-Based, Load-Balancing Flooding Extension

The following section describes a distributed algorithm similar to and based on those implemented in OSPF to support mobile ad-hoc networks, as described in [RFC5449],[RFC5614]. These solutions have been widely implemented and deployed.¶

1.1. Empirical Evidence of Correctness and Efficiency Improvement

Laboratory tests based on a well known open source codebase show that modifications similar to the algorithm presented here reduce flooding in a large scale emulated butterfly network topology significantly. Under unmodified flooding procedures intermediate systems receive, on average, 40 copies of any changed LSP fragment in a 2'500 nodes butterfly network. With the changes described in this document said systems received, on average, two copies of any changed LSP fragment. In many cases, only a single copy of each changed LSP was received and processed per node. In terms of performance, overall convergence times were cut in roughly half. Other topologies under experimentation in CLOS networks using another implementation show similar performance and simulations of the extension indicate significant reductions in flooding volumes.¶

An early version of mechanisms described here has been implemented in the FR Routing open source routing stack as part of `fabricd` daemon and the described modification has been implement by commercial vendors.¶

1.2. Flooding Modifications

This section describes detailed modifications to the IS-IS flooding process to reduce the full topology to a dominating connected set of links used for flooding. It does at the same time balance the remaining flooding across all links in the topology to prevent hot-spots.¶

1.2.1. Example Network

Following spine and leaf fabric will be used in further description of the introduced modifications.¶

+====+ +====+ +====+ +====+ +====+ +====+
| 1A | | 1B | | 1C | | 1D | | 1E | | 1F | (T0)
+====+ +====+ +====+ +====+ +====+ +====+

+====+ +====+ +====+ +====+ +====+ +====+
| 2A | | 2B | | 2C | | 2D | | 2E | | 2F | (T1)
+====+ +====+ +====+ +====+ +====+ +====+

+====+ +====+ +====+ +====+ +====+ +====+
| 3A | | 3B | | 3C | | 3D | | 3E | | 3F | (T2)
+====+ +====+ +====+ +====+ +====+ +====+

+====+ +====+ +====+ +====+ +====+ +====+
| 4A | | 4B | | 4C | | 4D | | 4E | | 4F | (T1)
+====+ +====+ +====+ +====+ +====+ +====+

+====+ +====+ +====+ +====+ +====+ +====+
| 5A | | 5B | | 5C | | 5D | | 5E | | 5F | (T0)
+====+ +====+ +====+ +====+ +====+ +====+

Figure 1

The above picture does not contain the connections between devices for readability purposes. The reader should assume that each device in a given layer is connected to every device in the layer above it in a butterfly network fashion. For instance:¶

5A is connected to 4A, 4B, 4C, 4D, 4E, and 4F¶
5B is connected to 4A, 4B, 4C, 4D, 4E, and 4F¶
4A is connected to 3A, 3B, 3C, 3D, 3E, 3F, 5A, 5B, 5C, 5D, 5E, and 5F¶
4B is connected to 3A, 3B, 3C, 3D, 3E, 3F, 5A, 5B, 5C, 5D, 5E, and 5F¶
etc.¶

The tiers or stages of the fabric are marked for easier reference. Alternate representation of this topology is a "folded Clos" with T2 being the "top of the fabric" and T0 representing the leaves.¶

1.2.2. Optimizing Flooding

The simplest way to conceive of the solution presented here is in two stages:¶

Stage 1: Forward Optimization¶
- Find the group of intermediate systems that will all flood to the same set of neighbors as the local IS¶
- Decide (deterministically) which subset of the intermediate systems within this group should re-flood any received LSPs¶
Stage 2: Reverse Optimization¶
- Find neighbors on the shortest path towards the origin of the change¶
- Do not flood towards these neighbors¶

The first stage is best explained through an illustration. In the network above, if 5A transmits a modified Link State Protocol Data Unit (LSP) to 4A-4F, each of 4A-4F nodes will, in turn, flood this modified LSP to 3A (for instance). With this, 3A will receive 6 copies of the modified LSP, while only one copy is necessary for the intermediate systems shown to converge on the same view of the topology. If 4A-4F could determine that all of them will all flood identical copies of the modified LSP to 3A, it would be possible for all of them except one to decide not to flood the changed LSP to 3A.¶

The technique used in this draft to determine such flooding group is for each intermediate system to calculate a special SPT (shortest-path spanning tree) from the point of view of the transmitting neighbor. As next step, by setting the metric of all links to 1 and truncating the SPT to two hops, the local IS can find the group of neighbors it will flood any changed LSP towards and the set of intermediate systems (not necessarily neighbors) which will also flood to this same set of neighbors. If every intermediate system in the flooding set performs this same calculation, they will all obtain the same flooding group.¶

Once such a flooding group is determined, the members of the flooding group will each (independently) choose which of the members should re-flood the received information. A common hash function is used across a set of shared variables so each member of the group comes to the same conclusion as to the designated flooding nodes. The group member which is in such a way `selected` to flood the changed LSP does so normally; the remaining group members suppress the flooding of the LSP initially.¶

Each IS calculates the special, truncated SPT separately, and determines which IS should flood any changed LSPs independently based on a common hash function. Because these calculations are performed using a shared view of the network, however (based on the common link state database) and such a shared hash function, each member of the flooding group will make the same decision under converged conditions. In the transitory state of nodes having potentially different view of topologies the flooding may either overflood or in worse case not flood enough for which we introduce a 'quick-patching' mechanism later but ultimately will converge due to periodic CSNP origination per normal protocol operation.¶

The second stage is simpler, consisting of a single rule: do not flood modified LSPs along the shortest path towards the origin of the modified LSP (with one notable exception). This rule relies on the observation that any IS between the origin of the modified LSP and the local IS should receive the modified LSP from some other IS closer to the source of the modified LSP. It is worth to observe that if all the nodes that should be designated to flood within a peer group are pruned by the second stage the receiving node is at the `tail-end` of the flooding chain and no further flooding will be necessary. Also, per normal protocol procedures flooding to the node from which the LSP has been received will not be performed.¶

1.2.3. Optimization Process Details

This section provides normative description of the specification. Any node implementing this solution MUST exhibit external behavior that conforms to the algorithms provided.¶

Each intermediate system will determine whether it should re-flood LSPs as described below. When a modified LSP arrives from a Transmitting Neighbor (TN), the result of the following algorithm obtains the necessary flooding decision unless TN is originator of the LSP and the LSP is older than an existing copy in the LSDB (this exception allows for fast flushing of LSPs retained in the network on TN's restart):¶

Step 1: Build the Two-Hop List (THL) and Remote Neighbor's List (RNL) of nodes:¶

A)

Set all link metrics to 1¶

B)

Calculate an SPT truncated to 2 hops from the perspective of TN¶

C)

For each IS that is two hops away (has a metric of two in the truncated SPT) from TN:¶

i.: If the IS is the LSP originator, skip¶
ii.: If the IS is a neighbor of the LSP originator, skip¶
iii.: If the IS is on the shortest path from the TN towards towards the originator of the modified LSP, skip¶
iv.: If the IS is *not* on the shortest path from the TN towards the originator of the modified LSP, add it to THL¶

D)

Add each IS that is one hop away from TN to the RNL¶

Step 2: Sort nodes in RNL by system IDs, from the least value to the greatest.¶

Step 3: Take the initial value of the fragment ID and shift it by 4 bits to the right. Calculate a number H by adding shifted fragment ID to the each byte of system ID, starting from the highest network order byte, for all bytes XOR each with the current value and rotate after each byte the current value to the left by 4 bits. Section 6 provides reference implementation.¶

The shifting of the fragment ID will put 16 sequent fragments onto the same flooding tree to minimize re-ordering and the subsequent XOR'ing and rotating of the system ID will maximize the amount of entropy obtained for a good hash value. RNum is the number of nodes in the RNL. Consequently, set N to the H MOD of RNum (N=H MOD RNum). With that N will be less than the number of members of RNL. (footnote 1: this allows for some balancing of LSPs coming from same system ID).¶

Step 4: Starting with the Nth member of RNL: where N is the index into the members in RNL, with index starting from zero (Index zero assigned to the IS with lowest system-id):¶

A): If THL is empty or the walk wrapped around, move to Step 5¶
B): If this member of RNL is the local calculating IS, it MUST re-flood the modified LSP to at least the remaining members in the THL it is adjacent to; move to Step 5¶
C): If this member of the RNL signals capability to or running another flood reduction extension or signals that it runs a different version of this extension move to the next member of RNL¶
D): Remove all members of THL connected to (adjacent to) this member of RNL¶
E): Move to the next member of RNL, wrapping to the beginning of RNL if necessary¶

Note 1: This description is leaning towards clarity rather than optimal performance when implemented.¶

Note 2: An implementation in a node MAY choose independently of others to provide a configurable parameter to allow for more than one node in RNL to re-flood, e.g. it may re-flood even if it's only the member that would be chosen from the RNL if a double coverage of THL is required. The modifications to the algorithm are simple enough to not require further text.¶

1.2.4. Flooding Failures

It is possible that during initial convergence or in some failure modes the flooding will be incomplete due to the optimizations outlined. Specifically, if a re-flooder fails, or is somehow disconnected from all the links across which it should be re-flooding, an LSP could be only partially distributed through the topology. To speed up convergence under such partition failures (observe that periodic CSNPs will under any circumstances converge the topology though at a slower pace), an intermediate system which does not re-flood a specific LSP (or fragment) SHOULD:¶

A): Set a short, configurable timer which should be significantly shorter than CSNP interval used.¶
B): When the timer expires, send Partial Sequence Number Packet (PSNP) of all LSPs that have *not* been re-flooded during the timer runtime to all neighbors unless an up-to-date PSNP or CSNP has been already received from the neighbor.¶
C): Per normal protocol procedures process any Partial Sequence Number Packets (PSNPs) received that indicate that neighbors still have older versions of the LSP will lead to the usual synchronization of the databases that are out of sync due to optimized flooding.¶
D): If such resynchronizations above a configurable threshold are required (i.e. PSNPs are sent to the neighbors and are answered with requests), an implementation SHOULD notify the network operator via the according mechanism about the condition.¶

1.2.5. Flooding Example

Assume, in the network specified, that 5A floods some modified LSP towards 4A-4F and we only use a single node to re-flood. To determine whether 4A should flood this LSP to 3A-3F:¶

5A is TN; 4A calculates a truncated SPT from 5A's perspective with all link metrics set to 1¶
4A builds THL, which contains 3A, 3B, 3C, 3D, 3E, 3F, 5B, 5C, 5D, 5E and 5F¶
4A builds RNL, which contains 4A,4B,4C,4D,4E and 4F, sorting it by the system ID¶
4A computes hash on the received LSP-ID to get N; assume N is 1 in this case¶
Since 4A is the 1st member of RNL and there are members in THL, 4A must re-flood; the loop exits¶

1.2.6. A Note on Performance

The calculations described here seem complex, which might lead the reader to conclude that the cost of calculation is so much higher than the cost of flooding that this optimization is counter-productive. First, The description provided here is designed for clarity rather than optimal calculation. Second, many of the involved calculations can be easily performed in advance and stored, rather than being performed for each LSP occurrence and each neighbor. Optimized versions of the process described here have been implemented, and do result in strong convergence speed gains.¶

2. Operational Considerations

The extension introduced to flooding in this document exhibits per se many desirable properties important for large production IS-IS networks. The critical function of the IGP makes their flag day migration close to impossible and any significant outage, i.e. non-negligible blast radius, introduces a significant operational risk to any service relying on the relevant IGP backbone. Such outages are caused often by configuration changes or issuance of unintended commands, especially since those networks tend to be deployed for tens of years under changing personnel and varying degree of competence. Moreover, due to geographic and co-location realities in many cases the network topology changes its properties and solution that is resilient to such properties presents a lower risk than one that can misfunction on node failures in certain topological configurations. Obviously it is desirable to detect easily which nodes are using the feature and allow for simple fixing of defects without disturbing the topology significantly.¶

The solution proposed in this draft pays particular attention to those realities and operational requirements. It is designed to be deployable without initial configuration, allows node by node introduction and removal of the feature into the network, balances the reduced flooding not only on a single or few trees but uses the information about the origin of the fragment to spread the load across whole topology. All of those properties are guaranteed to work while asserting minimal blast radius on introduction of the feature and also changes of any node or link. Deployment of this extension presents the same risk no matter the specific properties of the underlying topology.¶

To simplify deployment and make the feature detectable on nodes deploying it a node running the algorithm SHOULD advertise a Sub-TLV of the IS-IS Router Capability TLV-242 carrying the currently active version of the algorithm.¶

   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |     Type      |     Length    |    Version      |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 2

Type: TBD1¶
Length: 1¶
Version: version of the algorithm active on the node.¶

A change to the extension presented in this document that does not guarantee backwards compatibility and advertises this sub-TLV MUST reserve and use a new version number.¶

Additionally, a node deploying the presented algorithm on point-to-point links MUST send CSNPs on such links. This does not represent a dramatic change given most deployed implementations today already exhibit this behavior to prevent possible slow synchronization of IS-IS database across such links and to provide additional periodic consistency guarantees.¶

IS-IS Distributed Flooding Reduction

Abstract

Status of This Memo

Copyright Notice

Table of Contents

1. MANET-Based, Load-Balancing Flooding Extension

1.1. Empirical Evidence of Correctness and Efficiency Improvement

1.2. Flooding Modifications

1.2.1. Example Network

1.2.2. Optimizing Flooding

1.2.3. Optimization Process Details

1.2.4. Flooding Failures

1.2.5. Flooding Example

1.2.6. A Note on Performance

2. Operational Considerations

3. Security Considerations

4. IANA Section

5. Contributors

6. Reference Hash Implementation

7. Normative References

8. Informative References

Authors' Addresses