1. Symptom

A specific tenant's SD-WAN branch experiences repeated IKE, BGP, and SLA flaps toward the Controller, while the underlay circuits and Provider-VR paths to the same Controller remain fully stable. Other tenants on the same Controller are unaffected.

  • IKE sessions toward the Controller terminate and re-establish cyclically
  • BGP adjacency with the Controller drops and recovers alongside IKE
  • SLA probe loss is observed only on the affected tenant's path — not in the provider transport VR
  • Rebooting the device or restarting VSH does not resolve the issue
  • Underlying circuit quality is clean with no packet loss

2. Identification

2.1 Check IKE History for Unexpected Peers

On the affected branch and Controller, look for IKE sessions from unexpected remote gateways sharing the same SPI range:

show orgs org-services <tenant> ipsec vpn-profile <profile> ike history

Look for rapid IKE Done → IKE Deleted → Peer Down cycles that do not correspond to the active branch.

2.2 Check p2mp Neighbors on the Controller

On the Controller, check the p2mp neighbor table for stale entries:

vsh connect infmgr
infmgr> show p2mp-nbrs brief <tenant-Control-VR>

A decommissioned site will appear with a valid branch-id and active tunnel endpoints but should not exist in the current Director topology. Compare the output against the Director's device list.
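
The comparison against the Director's device list can be scripted once both lists are at hand. The following is a minimal Python sketch, assuming the neighbor names and the Director device names have already been copied by hand into two lists. The function name and the sample site names are hypothetical, and the sketch does not parse the actual show p2mp-nbrs output.

```python
def find_stale_neighbors(p2mp_neighbors, director_devices):
    """Return neighbor names seen on the Controller but absent from Director.

    Names are compared case-insensitively with surrounding whitespace
    ignored, since they are assumed to be pasted in by hand.
    """
    known = {name.strip().lower() for name in director_devices}
    return sorted(
        name for name in set(p2mp_neighbors)
        if name.strip().lower() not in known
    )


# Hypothetical site names, for illustration only.
controller_neighbors = ["Chicago-Branch-A", "Chicago-Branch-B"]
director_devices = ["Chicago-Branch-A"]

print(find_stale_neighbors(controller_neighbors, director_devices))
# prints ['Chicago-Branch-B']
```

Any name this returns is a candidate stale site and should still be confirmed with the customer before it is removed.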

2.3 Confirm ESP Decap Failure in VXLAN Trace

Enable a VXLAN packet trace on the Controller for the affected tenant's path. A drop event looks like:

ipsec_esp_decap_handler:663  SPI from ESP header: 2001a6b
m_freem_internal:172  caller [ipsec_esp_decap_handler:1203]

The packet is freed immediately after SPI lookup — no GRE or MPLS decap follows. A successful decap continues through GRE, MPLS label lookup, and SLAM PDU processing.

2.4 Verify Decap Drop Counters

On the Controller in vsmd, check whether ESP decap drop counters are incrementing during the SLA loss window:

vsh connect vsmd
vsm-vcsn0> show vsf tunnel stats | grep decap drop

Incrementing counters correlated with the SLA flap timestamps confirm that the decap failure is the trigger.
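
The correlation can be checked mechanically if the counter is sampled at intervals. The sketch below is a rough illustration only: it assumes the counter values and flap timestamps have been collected by hand into Python lists, and the function name, the 30-second window, and the sample numbers are all hypothetical.

```python
from datetime import datetime, timedelta


def increments_near_flaps(samples, flap_times, window_seconds=30):
    """Return the flap times that have a counter increase within the window.

    samples: list of (datetime, int) counter readings, in time order.
    flap_times: list of datetime values at which a flap was logged.
    """
    window = timedelta(seconds=window_seconds)
    # Sample times at which the counter was higher than the previous reading.
    increase_times = [
        cur_t
        for (_, prev_v), (cur_t, cur_v) in zip(samples, samples[1:])
        if cur_v > prev_v
    ]
    return [
        flap for flap in flap_times
        if any(abs(flap - t) <= window for t in increase_times)
    ]


# Hypothetical readings taken once a minute.
t0 = datetime(2024, 1, 1, 12, 0, 0)
samples = [
    (t0, 100),
    (t0 + timedelta(seconds=60), 100),
    (t0 + timedelta(seconds=120), 137),
    (t0 + timedelta(seconds=180), 137),
]
flaps = [t0 + timedelta(seconds=110), t0 + timedelta(seconds=400)]

matched = increments_near_flaps(samples, flaps)
print(len(matched), "of", len(flaps), "flaps coincide with a counter increase")
# prints: 1 of 2 flaps coincide with a counter increase
```

A high match ratio supports the decap-failure explanation. A low ratio suggests the flaps have another cause or that the sampling interval is too coarse.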

2.5 Check versa-ipsec.log on the Controller

The IPsec log will show IKE negotiations from two different remote peer IDs using the same tunnel IP — one is the legitimate branch, the other is the stale site:

grep -i <tunnel-IP> /var/log/versa/versa-ipsec.log

A collision looks like this — two IKE events, same local gateway, same remote IP, but different remote peer ID strings:

18003 IKE-Event: Local IKE peer 10.4.0.5:500   ID Chicago-Controller-A@Chicago.com
18004 IKE-Event: Remote IKE peer 10.0.2.65:500   ID Chicago-Branch-B@Chicago.com  ← stale site

16878 IKE-Event: Local IKE peer 10.4.0.5:500   ID Chicago-Controller-A@Chicago.com
16879 IKE-Event: Remote IKE peer 10.0.2.65:500   ID Chicago-Branch-A@Chicago.com  ← active site

Both events share the same remote IP but carry different peer ID strings. Cross-reference the peer ID against the Director's device list to confirm which site is no longer provisioned.
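
On a busy Controller the log can be scanned programmatically. The sketch below is a minimal Python example, assuming the "Remote IKE peer <ip>:<port> ID <id-string>" line layout shown in the sample above. The regular expression and function name are illustrative, and other log versions may format these lines differently.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   18004 IKE-Event: Remote IKE peer 10.0.2.65:500   ID Chicago-Branch-B@Chicago.com
REMOTE_PEER = re.compile(
    r"Remote IKE peer\s+(?P<ip>\d{1,3}(?:\.\d{1,3}){3}):\d+\s+ID\s+(?P<peer_id>\S+)"
)


def find_peer_id_collisions(lines):
    """Map each remote IP to its peer IDs, keeping only IPs with more than one."""
    ids_by_ip = defaultdict(set)
    for line in lines:
        match = REMOTE_PEER.search(line)
        if match:
            ids_by_ip[match["ip"]].add(match["peer_id"])
    return {ip: sorted(ids) for ip, ids in ids_by_ip.items() if len(ids) > 1}


sample_lines = [
    "18003 IKE-Event: Local IKE peer 10.4.0.5:500   ID Chicago-Controller-A@Chicago.com",
    "18004 IKE-Event: Remote IKE peer 10.0.2.65:500   ID Chicago-Branch-B@Chicago.com",
    "16878 IKE-Event: Local IKE peer 10.4.0.5:500   ID Chicago-Controller-A@Chicago.com",
    "16879 IKE-Event: Remote IKE peer 10.0.2.65:500   ID Chicago-Branch-A@Chicago.com",
]

print(find_peer_id_collisions(sample_lines))
# prints {'10.0.2.65': ['Chicago-Branch-A@Chicago.com', 'Chicago-Branch-B@Chicago.com']}
```

To run it against the real log, pass an open file handle for /var/log/versa/versa-ipsec.log instead of sample_lines. The function only reads lines and does not modify the file.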

2.6 Confirm Multiple Sites on Same Tunnel IP — vsmd

Confirm that two different branch IDs are mapped to the same tunnel IP, which directly confirms the collision:

vsh connect vsmd
vsm-vcsn0> show vsf tunnel branch-table | grep <tunnel-IP>      (run this while the flap is occurring)

If two different branch entries appear for the same IP address, the stale site is confirmed. Under normal conditions, only one branch-id should be present per tunnel endpoint IP.

3. Root Cause

A decommissioned site that was never fully removed from the network is still initiating IKE connections to the Controller. This creates a stale IPsec SA on the Controller that shares the same SPI space as the active branch's SA.

When the Controller receives an ESP-encapsulated SLA PDU from the legitimate branch, it attempts decapsulation using the SPI from the packet header. If that SPI resolves to the stale decommissioned SA rather than the active branch's SA, the decap fails and the packet is freed (dropped). This manifests as periodic SLA probe loss, which triggers BFD/SLA threshold violations, IKE rekeying, and ultimately BGP flaps.

Key indicator: The drop occurs after VXLAN outer decap succeeds but inner ESP decap fails — the stale SA is found but cannot decrypt the payload, so the packet is freed at ipsec_esp_decap_handler.
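
The failure mode can be illustrated with a toy model. The Python sketch below is not the Controller's implementation: it assumes a simple SPI-keyed SA table and uses an HMAC tag as a stand-in for ESP integrity checking and decryption. The keys and function names are hypothetical, and the SPI value is the one shown in the trace in section 2.3.

```python
import hashlib
import hmac

TAG_LEN = 32  # length of a SHA-256 digest in bytes


def protect(key, payload):
    """Sender side: append an integrity tag computed with the SA key."""
    return payload + hmac.new(key, payload, hashlib.sha256).digest()


def decap(sa_table, spi, packet):
    """Receiver side: look up the SA by SPI and verify the tag.

    Returns the payload on success, or None when the packet is dropped.
    """
    key = sa_table.get(spi)
    if key is None:
        return None  # no SA for this SPI
    payload, tag = packet[:-TAG_LEN], packet[-TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    return payload if hmac.compare_digest(tag, expected) else None


spi = 0x2001A6B
active_key = b"active-branch-key"  # hypothetical key material
stale_key = b"stale-site-key"      # hypothetical key material

packet = protect(active_key, b"SLA probe")

print(decap({spi: active_key}, spi, packet))  # prints b'SLA probe'
print(decap({spi: stale_key}, spi, packet))   # prints None
```

In the second call the lookup succeeds, because an SA exists for the SPI, but verification fails because the SA belongs to the stale site. This mirrors the found-but-cannot-decrypt behavior described above.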

4. Resolution

4.1 Decommission the Stale Site

The fix is to fully decommission the stale site. This removes its IKE/IPsec context from the Controller, eliminates the SPI collision, and leaves the active branch's SA as the sole decap path.

  1. Identify the decommissioned site name from the p2mp-nbrs output on the Controller.
  2. Confirm with the customer that this device is no longer in service and should not be connecting.
  3. On Versa Director, delete the decommissioned device from the org — this pushes a config update to the Controller removing the ptvi and IKE profile.
  4. Verify on the Controller that the stale branch-id no longer appears in show p2mp-nbrs.
  5. Monitor IKE history and SLA probe loss on the affected branch — flaps should stop immediately.
Note: If the customer cannot immediately decommission the device, configure a QoS rule to block the decommissioned site's public IP.

4.2 Alternative — Re-onboard the Active Branch with a New Site ID

If the decommissioned site cannot be located or taken offline in time, the active branch can be re-onboarded under a new Site ID. This assigns it a fresh IKE identity and SPI space, eliminating the collision without requiring the stale device to be reachable or powered down.

  1. On Versa Director > Workflow > Device, delete the active branch device entry and recreate it with a new Site ID. Make sure the new Site ID differs from the old one: if Director reuses the same Site ID after the deletion, create a dummy device and keep it in the saved state.
  2. Re-stage the branch — this generates a new chassis-id, IKE id-string, and PSK.
  3. The Controller will now track the active branch under a new branch-id, completely separate from the stale site's SPI range.
  4. Verify with show vsf tunnel branch-table that only one branch-id maps to the active site's tunnel IP.
  5. Monitor IKE history and SLA — flaps should stop immediately after the new SA is established.
Important: Re-onboarding causes a brief service outage on the active branch while the new staging and IKE negotiation complete. Schedule it during a maintenance window and coordinate with the customer to ensure LAN-side routing re-converges after the branch comes back up under the new Site ID.

5. Quick Reference — Commands

Purpose                          Command
IKE session history              show orgs org-services <tenant> ipsec vpn-profile <profile> ike history
Controller p2mp neighbors        vsh connect infmgr → show p2mp-nbrs detail vrf <Control-VR>
VXLAN packet trace               Use internal wiki procedure for vxlan trace on Controller
ESP decap drop counters          vsh connect vsmd → show vsf ipsec stats | grep decap-drop
Identify stale peer in IKE log   grep -i <tunnel-IP> /var/log/versa/versa-ipsec.log
Confirm dual branch on same IP   vsh connect vsmd → show vsf tunnel branch-table | grep <tunnel-IP>
Clear stale IKE SA               request clear ipsec ike <vpn-profile> (use only after Director decom)