Table of Contents
Purpose
Step 1: Perform ping tests to ensure WAN link sanity
Step 2: Check SLA metrics
Step 3: Use tcpdump to determine direction of loss
Step 4: Determine underlay loss using ip-identifier
Purpose
The purpose of this documentation is to help you troubleshoot underlay issues when you see packet loss, application slowness/degradation, throughput issues or service impact. The "underlay" is simply a term used to refer to the "transport", or the WAN link, which is the medium used to transfer packets between branches.
Step 1: Perform ping tests to ensure WAN link sanity
The ping utility is useful to determine "reachability" as well as "packet loss" on a WAN link.
Please follow the steps below to perform a ping test:
Step 1: Log in to the problematic branch (the branch being investigated for the problem)
Step 2: Execute "show interfaces brief" to determine the WAN interfaces on this branch
Step 3: Execute "show interfaces brief" on one of the Controllers or on another branch (say, the branch towards which the packet loss or application degradation is observed), and note the relevant WAN interface IP address
Step 4: Execute a rapid ping from the problematic branch towards the remote-end WAN interface (ip-address obtained from Step 3)
For example, in my lab Flex1 is the problematic branch, which has two WAN interfaces, mpls and internet. Let's say I am interested in testing the "mpls" interface, so my local address is 43.1.1.21
Let's say 43.1.1.22 is the remote-end WAN interface address of the remote branch Flex2. You can get this address by executing "show interfaces brief" on Flex2 or by executing "show orgs org <ORG-NAME> sdwan detail" on Flex1 (look for Flex2's information there)
<snip>
Execute "rapid ping" test as below
Make sure you use the correct "routing-instance" while executing the ping, it should be the transport-vr to which the interface belongs
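A hedged sketch of such a test, reusing the lab addresses from the example above (the Transport VR name is a placeholder, and the exact keyword names and order can vary by release, so confirm them with the CLI's inline help):
ping 43.1.1.22 routing-instance <MPLS-Transport-VR> count 100 rapid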
Ideally you should see 0% packet loss, which confirms that the underlay path is clean; any packet loss in the above ping test indicates a problematic underlay
Run this test against multiple remote branches, or against the Controllers, to confirm whether different paths in the underlay are problematic
Run the same test for the other WAN interfaces; use the same method as above to determine the remote-end WAN address and use the correct "routing-instance" while executing the ping test
Note: if you observe SLA flaps with the SLA context set for FC EF, it is possible that there is congestion/loss in the underlay only for packets tagged with DSCP EF
In this case, you should try pinging with the DSCP set to EF to detect loss in the underlay for EF-tagged packets. To do so, you will need to access the "namespace" prompt as below for the concerned "Transport" - in the below example we are testing the "INET" transport.
The -Q option allows you to tag the DSCP on the ping packets being sent; in the above example, 184 is the ToS value for EF, and you can change it according to the DSCP you are testing
You can also add the "-f" option at the end to execute a "rapid ping". For example, the below command executes a rapid ping with a count of 100 and DSCP set to EF:
ping -Q 184 1.1.1.1 -c 100 -f
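For reference, the ToS byte carries the 6-bit DSCP in its upper bits, so ToS = DSCP x 4; DSCP EF is 46, which gives the 184 used above. A quick check from the same namespace shell:
echo $((46 << 2))    # DSCP EF (46) shifted left by 2 bits = 184 (ToS value for -Q)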
You can also send UDP probes using the command below (it sends 100 iterations):
for i in {1..100}; do echo "packet $i" | nc -zvw3 -u -T 184 -p 4790 1.1.1.1 4790;done
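If the plain Linux tcpdump binary is available from the same namespace shell (an assumption; the interface name below is a placeholder), you can verify that these probes leave with the expected marking - in the verbose IP header you should see "tos 0xb8" (0xb8 is 184 in decimal):
tcpdump -ni <wan-interface> -v 'udp and dst host 1.1.1.1 and dst port 4790'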
Step 2: Check the SLA metrics
The SLA metrics indicate whether there is a problem with a specific path. SLA probes are sent on all paths, both between branches and between branches and Controllers.
Execute the command below on the problematic branch to view the SLA metrics:
> show orgs org <ORG-NAME> sd-wan sla-metrics last-1m | tab
The fields of importance are "pdu loss", "fwd loss" and "rev loss". Ideally these should all be 0; non-zero values in these fields very likely indicate a problematic underlay with loss along the path.
pdu loss - the number of SLA probes that were lost in transit (SLA probes which were not acknowledged by the remote end)
fwd loss - the "data loss" (actual traffic loss) in the forward direction, from this branch to the remote end
rev loss - the "data loss" (actual traffic loss) in the reverse direction, from the remote end to this branch
The "last-1m" argument in the above command indicates that the metrics are cumulative over the last one minute
You can also check the SLA metrics historically on your Analytics, as shown below. Analytics lets you view the fwd loss, rev loss and pdu loss for the "last day", "last 12 hours", "last 7 days" or "last 30 days", check specific circuits, and look at specific "remote end" branches.
You can scroll down further on this page to get the below tabular stats
Side note: pdu loss, fwd loss and rev loss are all indicative of an underlay issue - the SLA probes are small packets (~300 bytes) sent over the SD-WAN tunnels using the same vxlan encapsulation as the data packets, and hence they measure the sanity of the same paths
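If you want to see this tunnel traffic on the wire, a hedged capture sketch using the tcpdump wrapper shown in Step 4 (the vni interface follows the lab example and will differ in your setup; small data and control packets use the same encapsulation, so not every matched packet is necessarily an SLA probe):
tcpdump vni-0/0 filter "'udp port 4790 and less 400'"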
Step 3: Use tcpdump to determine the direction of the loss
You can use tcpdump to determine the direction of the loss using the method below
Step 1: Open two SSH (PuTTY) terminals to the problematic branch
Step 2: On one terminal, execute a rapid ping (set the count as 100 and the size as 800) to the remote-end WAN address
Step 3: On the other terminal, execute tcpdump using the WAN address and packet size as a filter (for example, below I've set the size filter as "greater 700" to capture the above packets, which are ~800 bytes). Please provide the correct "vni" interface in the below command.
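A hedged sketch of the two terminals, reusing the Flex1/Flex2 lab addresses from Step 1 (the Transport VR name is a placeholder, the ping keyword names can vary by release, and the vni interface must be adjusted to your WAN interface):
Terminal 1: ping 43.1.1.22 routing-instance <MPLS-Transport-VR> count 100 size 800 rapid
Terminal 2: tcpdump vni-0/0 filter "'host 43.1.1.22 and greater 700 -vv'"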
The idea is to check for packet loss in the "ping test" above and then check the tcpdump to confirm whether the "echo reply" is missing, which would mean that the "echo-reply" was not received on the WAN interface and was dropped in the underlay
You should also execute the same "tcpdump" on the remote-end branch to capture these packets and confirm whether the "echo-request" is being received on the other end and the "echo-reply" is being sent out.
<remote end tcpdump>
If your "tcpdump" captures show that the remote end is receiving the "echo-request" and sending the "echo-reply", but the "tcpdump" on the local branch shows that the "echo-reply" is not received, this indicates a "rev loss" condition, where packets are being dropped in the underlay in the direction from the remote branch to the local branch
Using just the above ping test, tcpdump and sla-metrics check, one can determine whether the "underlay" is problematic. If you find that the underlay has loss/drops, please follow up with your underlay provider. Sometimes the underlay may have drops only for a specific path, say from branch1 to branch2, because of an issue with some router along the path (a hardware, QoS or link issue on that router)
Step 4: Determine underlay loss using ip-identifier
The procedure to validate packet transfer between two VOS devices on a WAN link, and thereby determine underlay loss using the IP identifier, is as below
In our example we will consider two sites:
BranchBM-01
Hub-01-R2
First, we need to identify the WAN interface which needs to be investigated for loss
On BranchBM-01 the interface under question is vni-0/0 and the address is 10.210.210.42
On Hub-01-R2 the interface is vni-0/2 and address is 10.210.210.44 (ignore the second ip, it’s not relevant)
Now we will enable tcpdump on both these nodes, using filters to select the relevant packets, as shown below
The idea is to monitor the "ip identifier" field (highlighted below) in the direction of the loss. For example, if we want to monitor packet transfer from BranchBM-01 towards Hub-01-R2, we should focus on packets sent with the "source" IP address 10.210.210.42 and the destination 10.210.210.44 (the hub WAN address), and track the ip-identifier field
You should see all the ip-identifiers sent from BranchBM-01 arriving on Hub-01-R2 - if some packets (ip-identifiers) are missing, it means they are being dropped in the underlay.
In the screenshots below, you can see that the highlighted packets are sent from BranchBM-01 and received on Hub-01-R2 without any loss.
On BranchBM-01
tcpdump vni-0/0 filter "'host 10.210.210.44 -vv'"
On Hub-01-R2
tcpdump vni-0/2 filter "'host 10.210.210.42 -vv'"
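If you save each capture output to a text file, a small hedged post-processing sketch can list the identifiers seen leaving BranchBM-01 but never arriving on Hub-01-R2 (the file names are hypothetical; restrict the captures to the BranchBM-01-to-Hub direction, for example with a "'src host 10.210.210.42 and dst host 10.210.210.44 -vv'" filter, and remember that the IP identifier wraps and repeats, so compare captures taken over the same short window):
grep -oE 'id [0-9]+' branch_capture.txt | awk '{print $2}' > branch_ids.txt
grep -oE 'id [0-9]+' hub_capture.txt | awk '{print $2}' > hub_ids.txt
awk 'NR==FNR {seen[$0]=1; next} !($0 in seen)' hub_ids.txt branch_ids.txt    # identifiers sent by the branch but never seen at the hub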