Table of Contents

Purpose    

Step 1: Perform ping tests to ensure WAN link sanity

Step 2: Check the SLA metrics

Step 3: Use tcpdump to determine direction of loss

Step 4: Determine underlay loss using ip-identifier


Purpose

The purpose of this documentation is to help you troubleshoot underlay issues in the face of packet-loss, application slowness/degradation, throughput issues and service impact. The "underlay" is simply a term used to reference the "transport", or the WAN link, which is used as the medium to transfer packets between branches.


Step 1: Perform ping tests to ensure WAN link sanity

The ping utility is useful for determining "reachability" as well as "packet loss" on a WAN link.

Please follow the steps below to perform a ping test:

Step 1: Log in to the problematic branch (the branch being investigated for problems)

Step 2: Execute "show interfaces brief" to determine the WAN interfaces on this branch

Step 3: Execute "show interfaces brief" on one of the Controllers or on another Branch (say the branch towards which the packet-loss or application degradation is observed), and note the relevant WAN interface IP address

Step 4: Execute a rapid ping from the problematic branch towards the remote-end WAN interface (ip-address obtained from Step 3)

For example, in my lab Flex1 is the problematic branch, which has two interfaces, mpls and internet. Let's say I am interested in testing the "mpls" interface, so my local address is 43.1.1.21


Let's say 43.1.1.22 is the remote-end WAN interface address of a remote branch, Flex2. You can get this address by executing "show interfaces brief" on Flex2, or by executing "show orgs org <ORG-NAME> sdwan detail" on Flex1 (look for Flex2's information there)


<snip>



Execute "rapid ping" test as below



Make sure you use the correct "routing-instance" while executing the ping; it should be the transport-vr to which the interface belongs


Ideally you should see 0% packet loss, confirming that the underlay path is clean; any packet loss in the above ping test indicates a problematic underlay


Run this test against multiple remote branches, or Controllers, to confirm whether different paths in the underlay are problematic


Run the same test for the other WAN interfaces; use the same method as above to determine the remote-end WAN address, and use the correct "routing-instance" while executing the ping test


Note: if you observe SLA flaps with the SLA context set for FC EF, it is possible that there is congestion/loss in the underlay purely for packets tagged with DSCP EF

 

In this case, you should try pinging with DSCP set to EF to detect loss in the underlay for packets tagged with the EF DSCP. To do so, you will need to access the "namespace" prompt, as below, for the concerned "Transport" - in the below example we are testing the "INET" transport.
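A rough sketch of this from the Linux shell of the device, assuming the transports are realized as network namespaces on your VOS version (the namespace name below is a placeholder - run the first command to list the namespaces and pick the one corresponding to the INET transport VR; 1.1.1.1 stands for the remote-end WAN address):

ip netns list
sudo ip netns exec <INET-namespace-name> bash
ping -Q 184 1.1.1.1 -c 5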



The -Q option allows you to set the ToS byte on the ping packets being sent; in the above example, 184 is the ToS value corresponding to DSCP EF. You can change it according to the DSCP you are testing


You can also add the "-f" option towards the end to execute a "rapid ping"; for example, the below would execute a rapid ping with a count of 100 and DSCP set to EF


ping -Q 184 1.1.1.1 -c 100 -f   


You can also send UDP probes using the command below (it sends 100 iterations):


for i in {1..100}; do echo "packet $i" | nc -zvw3 -u -T 184 -p 4790 1.1.1.1 4790;done
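For reference (based on standard netcat options - verify against the netcat variant installed on your device): -u sends UDP, -z avoids sending payload data, -v prints verbose output, -w3 sets a 3-second timeout, -T 184 sets the ToS byte to the EF value (matching -Q 184 in the ping above), and -p 4790 / the trailing 4790 set the source and destination UDP ports for the probe.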


Step 2: Check the SLA metrics

The SLA metrics provide an indication of whether there is a problem on a specific path. SLA probes are sent on all the paths, both between branches and between branch and Controller.

You can execute the below command to determine the SLA metrics (execute it on the problematic branch):

>  show orgs org <ORG-NAME> sd-wan sla-metrics last-1m | tab


The fields of importance are "pdu loss", "fwd loss" and "rev loss" - ideally these should all be 0; non-zero values under these fields very likely indicate a problematic underlay with loss along the path


pdu loss - this indicates the number of SLA probes that were lost in transit (SLA probes which were not acknowledged by the remote end)


fwd loss - this indicates the "data loss" (actual traffic loss) in the forward direction, from this branch to the remote end


rev loss - this indicates the "data loss" (actual traffic loss) in the reverse direction, from the remote end to this branch


The "last-1m" argument in the above command indicates that the metrics are cumulative of the last minute stats


You can also check the SLA metrics historically (last hour, last day, last 7 days, etc.) on Analytics, as below.


You can check the fwd loss, rev loss and pdu loss for the "last day", "last 12 hours", "last 7 days" or "last 30 days" via Analytics as shown below; you can filter for specific circuits and look at specific "remote end" branches



You can scroll down further on that page to see the tabular stats below



Side note: pdu loss, fwd loss and rev loss are all indicative of an underlay issue - the SLA probes are small packets (~300 bytes) sent over the SD-WAN tunnels using the same VXLAN encapsulation as the data packets, and hence they measure the sanity of the same paths


Step 3: Use tcpdump to determine the direction of the loss

You can use tcpdump to determine the direction of the loss using the method below

Step 1: Open two SSH (PuTTY) terminals to the problematic branch

Step 2: On one of the terminals, execute a rapid ping towards the remote-end WAN address (set the count to 100 and the size to 800; an example is shown after step 3 below)


Step 3: On the other terminal, execute tcpdump using the WAN address and packet size as filters (for example, below I've set the size filter to "greater 700" to capture the above packets, which are ~800 bytes). Please provide the correct "vni" interface in the below command
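A sketch of the two commands using the Flex1/Flex2 lab addresses from Step 1 - the interface name (vni-0/0) and the routing-instance name are placeholders, so substitute your own WAN interface and transport-vr:

Terminal 1 (rapid ping, count 100, size 800):

ping 43.1.1.22 routing-instance MPLS-Transport-VR count 100 size 800 rapid

Terminal 2 (tcpdump with host and size filters):

tcpdump vni-0/0 filter "'host 43.1.1.22 and greater 700 -vv'"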


The idea is to check for packet loss in the "ping test" above and then check the tcpdump to confirm whether the "echo reply" was missing, which would mean that the "echo-reply" was not received on the WAN interface and was dropped in the underlay


You should also execute the same "tcpdump" on the remote-end branch to capture these packets and confirm whether the "echo-request" is being received on the other end and the "echo-reply" is being sent out.
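For example, on Flex2 this could look like the below (again, the vni interface is a placeholder - use Flex2's WAN interface, and filter on Flex1's WAN address):

tcpdump vni-0/0 filter "'host 43.1.1.21 and greater 700 -vv'"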


<remote end tcpdump>



If your "tcpdump" captures show that the remote-end is receiving the "echo-request" and it's sending "echo-reply" but the "tcpdump" on the local branch indicates that the "echo-reply" is not received, it indicates a "rev loss" condition where packets are being dropped in the underlay in the direction from remote to the local branch


Just by using the above ping test, tcpdump and sla-metrics check, one can determine whether the "underlay" is problematic. If you find that the underlay has loss/drops, please follow up with your underlay provider. Sometimes the underlay may have drops only on a specific path, say from branch1 to branch2, because of an issue with some router along the path (either a hardware, QoS or link issue on that router)



Step 4: Determine underlay loss using ip-identifier

The procedure to validate packet transfer between two VOS devices over a WAN link, and thereby determine underlay loss using the IP identifier field, is as below

 

In our example we will consider two sites

 

BranchBM-01

Hub-01-R2

 

First, we need to identify the WAN interface which needs to be investigated for loss

 

On BranchBM-01 the interface under question is vni-0/0 and the address is 10.210.210.42

 

 

On Hub-01-R2 the interface is vni-0/2 and address is 10.210.210.44 (ignore the second ip, it’s not relevant)

 

 

 

Now we will enable tcpdump on both these nodes, using filters to capture only the relevant packets, as shown below

 

The idea is to monitor the "ip identifier" field, as highlighted below, in the direction of the loss. For example, if we want to monitor packet transfer from BranchBM-01 towards Hub-01-R2, we should focus on packets sent with the "source" IP address 10.210.210.42 and destination 10.210.210.44 (the Hub's WAN address) and monitor the ip-identifier field

 

You should see all the ip-identifiers sent from BranchBM-01 arriving on Hub-01-R2 - if you see some packets (ip-identifiers) missing, it means they are being dropped in the underlay.
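For reference, in tcpdump "-vv" output the identifier appears in the "id" field of the IP header line; an illustrative line (values made up for illustration, <port> is a placeholder) would look like:

IP (tos 0x0, ttl 64, id 48213, offset 0, flags [DF], proto UDP (17), length 332) 10.210.210.42.<port> > 10.210.210.44.<port>: UDP, length 304

Compare the sequence of "id" values leaving BranchBM-01 with the sequence arriving on Hub-01-R2.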

 

In the screenshots below, you can see that the highlighted packets are sent from BranchBM-01 and received on Hub-01-R2 without any loss.

 

On BranchBM-01

 

tcpdump vni-0/0 filter “’host 10.210.210.44 -vv’”

 

 

On Hub-01-R2

 

tcpdump vni-0/2 filter “’host 10.210.210.42 -vv’”
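Optionally, if you are able to save the captures to pcap files on each node (the file names below are hypothetical, and how you write a capture to a file depends on your shell/tcpdump access), you can extract and compare the identifier sequences offline with standard tcpdump and grep:

tcpdump -nn -vv -r branch.pcap src 10.210.210.42 and dst 10.210.210.44 | grep -o 'id [0-9]*' > branch_ids.txt

tcpdump -nn -vv -r hub.pcap src 10.210.210.42 and dst 10.210.210.44 | grep -o 'id [0-9]*' > hub_ids.txt

diff branch_ids.txt hub_ids.txt

Identifiers present in branch_ids.txt but missing from hub_ids.txt point to packets dropped in the underlay in the BranchBM-01 to Hub-01-R2 direction.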