TABLE OF CONTENTS
- 1. Symptom
- 2. Identification
- 2.1 Check IKE History for Unexpected Peers
- 2.2 Check p2mp Neighbors on the Controller
- 2.3 Confirm ESP Decap Failure in VXLAN Trace
- 2.4 Verify Decap Drop Counters
- 2.5 Check versa-ipsec.log on the Controller
- 2.6 Confirm Multiple Sites on Same Tunnel IP — vsmd
- 3. Root Cause
- 4. Resolution
- 4.2 Alternative — Re-onboard the Active Branch with a New Site ID
- 5. Quick Reference — Commands
1. Symptom
A specific tenant's SD-WAN branch experiences repeated IKE, BGP, and SLA flaps toward the Controller, while the underlay circuits and Provider-VR paths to the same Controller remain fully stable. Other tenants on the same Controller are unaffected.
- IKE sessions toward the Controller terminate and re-establish cyclically
- BGP adjacency with the Controller drops and recovers alongside IKE
- SLA probe loss is observed only on the affected tenant's path — not in the provider transport VR
- Rebooting the device or restarting VSH does not resolve the issue
- Underlying circuit quality is clean with no packet loss
2. Identification
2.1 Check IKE History for Unexpected Peers
On the affected branch and Controller, look for IKE sessions from unexpected remote gateways sharing the same SPI range:
show orgs org-services <tenant> ipsec vpn-profile <profile> ike history
Look for rapid IKE Done → IKE Deleted → Peer Down cycles that do not correspond to the active branch.
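The cycle check above can be sketched in Python. This is only an illustration: the exact layout of the `ike history` output is not shown in this article, so the sketch assumes the events have already been extracted into hypothetical `(peer_id, state)` tuples in chronological order. It counts completed IKE Done → IKE Deleted → Peer Down sequences per peer and does not look at timing.

```python
from collections import Counter

# State names taken from the text above; the tuple input format is an assumption.
CYCLE = ("IKE Done", "IKE Deleted", "Peer Down")

def count_cycles(events):
    """Count complete Done -> Deleted -> Down sequences per peer.

    events: iterable of (peer_id, state) tuples in chronological order.
    """
    progress = {}       # peer_id -> index of the next expected state in CYCLE
    cycles = Counter()  # peer_id -> number of completed cycles
    for peer, state in events:
        step = progress.get(peer, 0)
        if state == CYCLE[step]:
            step += 1
            if step == len(CYCLE):
                cycles[peer] += 1
                step = 0
        elif state == CYCLE[0]:
            step = 1    # a fresh "IKE Done" restarts the sequence
        else:
            step = 0    # out-of-order state, start over
        progress[peer] = step
    return cycles

# Hypothetical, pre-parsed history (peer IDs reused from the log sample in 2.5).
history = [
    ("Chicago-Branch-B@Chicago.com", "IKE Done"),
    ("Chicago-Branch-A@Chicago.com", "IKE Done"),
    ("Chicago-Branch-B@Chicago.com", "IKE Deleted"),
    ("Chicago-Branch-B@Chicago.com", "Peer Down"),
    ("Chicago-Branch-B@Chicago.com", "IKE Done"),
    ("Chicago-Branch-B@Chicago.com", "IKE Deleted"),
    ("Chicago-Branch-B@Chicago.com", "Peer Down"),
]
cycles = count_cycles(history)
print(dict(cycles))
```

A peer with a high cycle count that is not the active branch is a candidate for the stale site.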
2.2 Check p2mp Neighbors on the Controller
On the Controller, verify whether any stale entries are present:
vsh connect infmgr
infmgr> show p2mp-nbrs brief <tenant-Control-VR>
A decommissioned site will appear with a valid branch-id and active tunnel endpoints but should not exist in the current Director topology. Compare the output against the Director's device list.
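The comparison against the Director's device list is a set difference. A minimal Python sketch follows, assuming the neighbor names and the Director device names have already been copied into two lists by hand. The function name and the sample site names are hypothetical.

```python
def find_stale_neighbors(p2mp_neighbors, director_devices):
    """Return neighbor names seen on the Controller but absent from Director.

    Names are compared case-insensitively with surrounding whitespace removed.
    """
    known = {name.strip().lower() for name in director_devices}
    return sorted(n.strip() for n in p2mp_neighbors if n.strip().lower() not in known)

# Hypothetical inputs: names copied from the p2mp-nbrs output and from Director.
controller_neighbors = ["Chicago-Branch-A", "Chicago-Branch-B", "Chicago-Branch-C"]
director_devices = ["chicago-branch-a", "Chicago-Branch-C "]

stale = find_stale_neighbors(controller_neighbors, director_devices)
print(stale)
```

Any name in the result is a neighbor the Controller still tracks but Director no longer provisions.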
2.3 Confirm ESP Decap Failure in VXLAN Trace
Enable a VXLAN packet trace on the Controller for the affected tenant's path. A drop event looks like:
ipsec_esp_decap_handler:663 SPI from ESP header: 2001a6b
m_freem_internal:172 caller [ipsec_esp_decap_handler:1203]
The packet is freed immediately after SPI lookup — no GRE or MPLS decap follows. A successful decap continues through GRE, MPLS label lookup, and SLAM PDU processing.
2.4 Verify Decap Drop Counters
On the Controller in vsmd, check whether ESP decap drop counters are incrementing during the SLA loss window:
vsh connect vsmd
vsm-vcsn0> show vsf tunnel stats | grep decap drop
Incrementing counters correlated with the SLA flap timestamps confirm that the decap failure is the trigger.
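One way to make the correlation explicit is to poll the counter at a fixed interval, note each sample where it rose, and check which SLA flap timestamps fall close to a rise. The Python sketch below assumes the counter values and flap times were collected by hand. The sample data, function names, and the 60-second window are all hypothetical.

```python
from datetime import datetime, timedelta

def counter_rises(samples):
    """Return (timestamp, delta) for every sample where the counter increased.

    samples: list of (datetime, counter_value) in chronological order.
    """
    rises = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        if cur > prev:
            rises.append((ts, cur - prev))
    return rises

def flaps_near_rises(samples, flap_times, window=timedelta(seconds=60)):
    """Return the flap timestamps that lie within `window` of a counter rise."""
    rises = counter_rises(samples)
    return [f for f in flap_times if any(abs(ts - f) <= window for ts, _ in rises)]

# Hypothetical data: decap drop counter polled once per minute.
day = datetime(2024, 1, 1)
samples = [
    (day.replace(hour=10, minute=0), 100),
    (day.replace(hour=10, minute=1), 100),
    (day.replace(hour=10, minute=2), 112),
    (day.replace(hour=10, minute=3), 112),
]
flaps = [
    day.replace(hour=10, minute=1, second=50),
    day.replace(hour=10, minute=5),
]
matched = flaps_near_rises(samples, flaps)
print(matched)
```

Because a rise is only noticed at the next poll, the window should be at least as long as the polling interval.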
2.5 Check versa-ipsec.log on the Controller
The IPsec log will show IKE negotiations from two different remote peer IDs using the same tunnel IP — one is the legitimate branch, the other is the stale site:
grep -i <tunnel-IP> /var/log/versa/versa-ipsec.log
A collision looks like this — two IKE events, same local gateway, same remote IP, but different remote peer ID strings:
18003 IKE-Event: Local IKE peer 10.4.0.5:500 ID Chicago-Controller-A@Chicago.com
18004 IKE-Event: Remote IKE peer 10.0.2.65:500 ID Chicago-Branch-B@Chicago.com ← stale site
16878 IKE-Event: Local IKE peer 10.4.0.5:500 ID Chicago-Controller-A@Chicago.com
16879 IKE-Event: Remote IKE peer 10.0.2.65:500 ID Chicago-Branch-A@Chicago.com ← active site
Both events share the same remote IP but carry different peer ID strings. Cross-reference the peer ID against the Director's device list to confirm which site is no longer provisioned.
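For a large log, grouping peer IDs by remote IP can be scripted. The Python sketch below assumes the log lines follow the shape of the sample above (`IKE-Event: Remote IKE peer <ip>:<port> ID <peer-id>`); real log lines may carry extra fields, so the regular expression is an assumption to adjust as needed.

```python
import re
from collections import defaultdict

# Pattern derived from the sample lines above; adjust if the real log differs.
REMOTE_PEER = re.compile(
    r"IKE-Event: Remote IKE peer (\d+\.\d+\.\d+\.\d+):\d+ ID (\S+)"
)

def find_peer_id_collisions(lines):
    """Return {remote_ip: sorted peer IDs} for IPs seen with more than one ID."""
    ids_by_ip = defaultdict(set)
    for line in lines:
        match = REMOTE_PEER.search(line)
        if match:
            ip, peer_id = match.groups()
            ids_by_ip[ip].add(peer_id)
    return {ip: sorted(ids) for ip, ids in ids_by_ip.items() if len(ids) > 1}

sample = [
    "18003 IKE-Event: Local IKE peer 10.4.0.5:500 ID Chicago-Controller-A@Chicago.com",
    "18004 IKE-Event: Remote IKE peer 10.0.2.65:500 ID Chicago-Branch-B@Chicago.com",
    "16878 IKE-Event: Local IKE peer 10.4.0.5:500 ID Chicago-Controller-A@Chicago.com",
    "16879 IKE-Event: Remote IKE peer 10.0.2.65:500 ID Chicago-Branch-A@Chicago.com",
]
collisions = find_peer_id_collisions(sample)
print(collisions)
```

To run it against the real file, pass an open file handle instead of `sample`, for example `find_peer_id_collisions(open("/var/log/versa/versa-ipsec.log"))`.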
2.6 Confirm Multiple Sites on Same Tunnel IP — vsmd
Confirm that two different branch IDs are mapped to the same tunnel IP, which directly confirms the collision:
vsh connect vsmd
vsm-vcsn0> show vsf tunnel branch-table | grep <tunnel-IP>
Run this check while the flap is occurring.
If two different branch entries appear for the same IP address, the stale site is confirmed. Under normal conditions, only one branch-id should be present per tunnel endpoint IP.
3. Root Cause
A decommissioned site that was never fully removed from the network is still initiating IKE connections to the Controller. This creates a stale IPsec SA on the Controller that shares the same SPI space as the active branch's SA.
When the Controller receives an ESP-encapsulated SLA PDU from the legitimate branch, it attempts decapsulation using the SPI from the packet header. If that SPI resolves to the stale decommissioned SA rather than the active branch's SA, the decap fails and the packet is freed (dropped) in ipsec_esp_decap_handler. This manifests as periodic SLA probe loss, which triggers BFD/SLA threshold violations, IKE rekeying, and ultimately BGP flaps.
4. Resolution
The fix is to decommission the stale site fully. This removes its IKE/IPsec context from the Controller, eliminates the SPI collision, and leaves the active branch's SA as the sole decap path.
- Identify the decommissioned site name from the p2mp-nbr output on the Controller.
- Confirm with the customer that this device is no longer in service and should not be connecting.
- On Versa Director, delete the decommissioned device from the org — this pushes a config update to the Controller removing the ptvi and IKE profile.
- Verify on the Controller that the stale branch-id no longer appears in show p2mp-nbrs.
- Monitor IKE history and SLA probe loss on the affected branch; flaps should stop immediately.
4.2 Alternative — Re-onboard the Active Branch with a New Site ID
If the decommissioned site cannot be located or taken offline in time, the active branch can be re-onboarded under a new Site ID. This assigns it a fresh IKE identity and SPI space, eliminating the collision without requiring the stale device to be reachable or powered down.
- On Versa Director > Workflow > Device, delete the active branch device entry and recreate it with a new Site ID. Make sure the new Site ID differs from the old one: if Director re-uses the same Site ID after the site is deleted, create a dummy device and keep it in the saved state.
- Re-stage the branch — this generates a new chassis-id, IKE id-string, and PSK.
- The Controller will now track the active branch under a new branch-id, completely separate from the stale site's SPI range.
- Verify with show vsf tunnel branch-table that only one branch-id maps to the active site's tunnel IP.
- Monitor IKE history and SLA; flaps should stop immediately after the new SA is established.
5. Quick Reference — Commands
| Purpose | Command |
|---|---|
| IKE session history | show orgs org-services <tenant> ipsec vpn-profile <profile> ike history |
| Controller p2mp neighbors | vsh connect infmgr → show p2mp-nbrs detail vrf <Control-VR> |
| VXLAN packet trace | Use internal wiki procedure for vxlan trace on Controller |
| ESP decap drop counters | vsh connect vsmd → show vsf ipsec stats \| grep decap-drop |
| Identify stale peer in IKE log | grep -i <tunnel-IP> /var/log/versa/versa-ipsec.log |
| Confirm dual branch on same IP | vsh connect vsmd → show vsf tunnel branch-table \| grep <tunnel-IP> |
| Clear stale IKE SA | request clear ipsec ike <vpn-profile> — use only after Director decom |