Architecture, Packet flow and troubleshooting bandwidth/throughput issue : Versa Support

Architecture and Packet Flow
Verify half duplex issues and link speed
Verify asymmetrical SD-WAN path
Verify packets are not dropped by Versa CoS
Verify minimum number of traffic sessions to utilize all worker cores
Verify underlay throughput issues
Verify optimizations enabled (Recommendation)
- 7.1. Verify application offload is enabled if UTM features are not configured
- 7.2. Verify isolcpu is enabled
Verify sessions are load-balanced on all workers
Verify there are not too many fragmented packets
Verify no punt packets across workers
Verify poller count
Verify worker/poller CPU utilization and worker/poller drops
Verify link bandwidth for 16.1R2 and 20.2+ (automatic bandwidth test)
Additional Reading

1. Architecture and packet flow

Versa VOS runs on platforms based on multicore Intel and AMD CPUs, in both bare-metal and virtual environments. To achieve a flexible architecture and optimal performance at the same time, VOS uses poll mode drivers in user space and an asynchronous pipeline model for packet processing. In this model, some cores may be dedicated to retrieving packets from and transmitting packets to the physical or virtual Ethernet ports (poller threads), and other cores may be dedicated to processing the previously received packets (worker threads). Packets are exchanged between these cores using rings. The following diagram shows a very high-level packet flow through VOS:

A poller thread can serve many ports. A port is served by a single poller or by multiple pollers, depending on the interface speed: 1G and 10G interfaces are served by a single poller, and 100G interfaces are served by multiple pollers. Based on the number of ports, their speeds, and the available cores, VOS automatically assigns the number of poller threads, worker threads, and control threads.

2. Verify half duplex issues and link speed

Verify the link speed and check for half-duplex issues using the following CLI command:

CLI: show interfaces detail

Example: admin@Silver-Customer-CPE31-cli> show interfaces detail vni-0/0

If there are any half-duplex or link speed issues, fix them by correcting the configuration on the device to which the Versa device is connected. In some cases, auto/auto also works, depending on the transmission mode configured on the ISP side.

In the command output, check the duplex and speed fields: the link should not be running at half duplex or at 100 Mbps.

3. Verify asymmetrical SD-WAN path

Verify that there are no asymmetrical SD-WAN paths, for example traffic that goes out on one transport and comes back via another transport with a different bandwidth.

Run the following command to check whether traffic is taking an asymmetrical SD-WAN path:

CLI: show orgs org sessions sdwan brief

Example:

The RX and TX WAN CKT in the above output must be the same on the local and remote sites. If a packet goes over the Internet to the remote branch and comes back over MPLS, throughput may differ depending on the bandwidth available on the Internet and MPLS transports.

** For performance testing, use a traffic generator (Spirent and IXIA are the best tools and are recommended by Versa). Traffic sent over the tunnel has 82 bytes of headers added, so the relative tunnel overhead depends on the packet size. **

** Without FEC and replication enabled, the overhead is 82 bytes. Enabling FEC adds another 12 bytes and enabling replication adds another 12 bytes, for a total of 24 bytes in addition to the 82 bytes. **

** Any traffic going from LAN to LAN over SD-WAN carries this overhead, and its impact depends on the packet size and the traffic. **
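The effect of packet size on overhead can be made concrete with a little arithmetic. The sketch below uses only the 82-byte and 12-byte figures quoted in the notes above; the function names and the sample packet sizes are illustrative assumptions, not Versa test profiles.

```python
# Per-packet SD-WAN tunnel overhead figures quoted in the notes above.
BASE_OVERHEAD = 82   # bytes of headers added to every tunneled packet
FEC_OVERHEAD = 12    # extra bytes when FEC is enabled
REPL_OVERHEAD = 12   # extra bytes when replication is enabled


def tunnel_overhead(fec=False, replication=False):
    """Return the per-packet overhead in bytes for the chosen features."""
    extra = (FEC_OVERHEAD if fec else 0) + (REPL_OVERHEAD if replication else 0)
    return BASE_OVERHEAD + extra


def goodput_fraction(packet_size, fec=False, replication=False):
    """Fraction of the bytes on the WAN that belong to the original LAN packet."""
    return packet_size / (packet_size + tunnel_overhead(fec, replication))


if __name__ == "__main__":
    # Packet sizes are arbitrary illustrative values.
    for size in (64, 512, 1400):
        plain = goodput_fraction(size)
        full = goodput_fraction(size, fec=True, replication=True)
        print(f"{size:5d} bytes: {plain:.1%} plain, {full:.1%} with FEC and replication")
```

Small packets are affected the most: an 82-byte LAN packet doubles in size on the WAN, while for packets near 1400 bytes the same overhead is a small fraction of the total.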

4. Verify packets are not dropped by Versa CoS

If a shaper or rate limiter is configured on the Versa device, it may drop packets when traffic is sent at a rate higher than the configured shaping rate.

Verify that Versa CoS is not dropping packets using the CLIs below:

>show class-of-services interfaces brief

>show orgs org-services class-of-service qos-policies

>show orgs org-services class-of-service app-qos-policies

Correct the shaper and rate limiter configuration in CoS to avoid drops. If you are running throughput tests in a lab, remove CoS and then verify throughput.

5. Verify minimum number of traffic sessions to utilize all worker cores

To get the best throughput out of the box, all the cores must be utilized.

Versa allocates separate CPU cores for control daemons, worker threads, and poller threads. Control daemons such as BGP and DHCPD run on control cores. CPUs assigned to worker threads are responsible for the forwarding plane, encryption, decryption, and so on. CPUs assigned to poller threads are responsible for reading packets from the NIC and handing them to worker threads, and for writing packets to the NIC on the way out.

A session (source IP, destination IP, source port, destination port, and protocol) is processed by a single worker CPU. To use all the worker CPUs, send enough sessions that at least a few sessions are processed by each worker CPU. When running throughput tests on an 8-core CPU, send traffic for at least 100 sessions so that at least a few sessions are processed by each core.
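The following sketch illustrates why a small number of sessions cannot keep every worker busy. It is a simulation under stated assumptions: the CRC32-modulo mapping, the worker count of 4, and the generated 5-tuples are all hypothetical and do not represent the hash that VOS actually uses.

```python
import zlib
from collections import Counter


def worker_for(session, n_workers):
    """Map a 5-tuple to a worker index (illustrative CRC32 hash, not Versa's)."""
    key = "|".join(str(field) for field in session).encode()
    return zlib.crc32(key) % n_workers


def sessions_per_worker(sessions, n_workers):
    """Count how many sessions land on each worker index."""
    counts = Counter(worker_for(s, n_workers) for s in sessions)
    return [counts.get(w, 0) for w in range(n_workers)]


def make_sessions(n):
    """Build n hypothetical TCP sessions that differ only in source port."""
    return [("10.0.0.1", "10.0.1.1", 10000 + i, 443, 6) for i in range(n)]


if __name__ == "__main__":
    N_WORKERS = 4  # assumed worker count; the real split depends on the platform
    for n in (1, 4, 100):
        print(n, "sessions ->", sessions_per_worker(make_sessions(n), N_WORKERS))
```

With a single session only one worker ever receives traffic, regardless of how many cores are available, which is why a single-flow test tends to understate the throughput of the box.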

6. Verify underlay throughput issues

Make sure that the underlay is not dropping packets. For example, a customer may be trying to measure 10 Gbps throughput while the underlay switches are not capable of switching 10 Gbps.

Check that the SD-WAN SLA shows no loss:

show orgs org sd-wan sla-monitor metrics last-1m

Check the input rate (pps/bps) and the output rate (pps/bps). To confirm, verify that packets sent out of the transport reach the Versa transport on the other side.

Use the tcpdump utility on the WAN interface: tcpdump vni-x/x filter "host other side transport"

Use the rapid ping utility on VOS with parameters such as count 1000 to check for drops on the circuit side.

admin@Hub-DualCPE1-cli> show interfaces port statistics brief

Verify that the Tx PPS and BPS of the local site match the Rx PPS and BPS on the remote site.
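When comparing the two sides, it can help to turn the counters into a loss percentage. The helper below is a minimal sketch; the function name and the sample rates are hypothetical, and the values would have to be read manually from the port statistics on each site.

```python
def loss_percent(tx_pps, rx_pps):
    """Percentage of packets sent by the local site that the remote site did not receive."""
    if tx_pps <= 0:
        raise ValueError("tx_pps must be positive")
    # Clamp at zero: sampling jitter can make Rx briefly exceed Tx.
    return max(0.0, (tx_pps - rx_pps) / tx_pps * 100.0)


if __name__ == "__main__":
    # Hypothetical readings: local Tx of 100000 pps, remote Rx of 99950 pps.
    print(f"underlay loss: {loss_percent(100_000, 99_950):.3f}%")
```

Because the two counters are sampled at slightly different times, small differences are expected; a sustained gap points to drops in the underlay.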

7. Verify optimizations enabled

7.1 Verify application offload is enabled if UTM features are not configured

Enable application offload when UTM features are not configured.

CLI: show configuration orgs org-services application-identification application-generic-options

If application offload is not configured, configure it using the config CLI: set orgs org-services application-identification application-generic-options offload enabled

7.2 Verify isolcpu is enabled

During throughput testing, if the user wants to achieve close to zero packet loss (i.e., < 0.01% packet loss), Versa recommends enabling isolcpu.

Verify using CLI: request system isolate-cpu status

If isolcpu is not enabled, enable using CLI:>

8. Verify sessions are load-balanced on all workers

Verify traffic sessions are equally load balanced across all the worker cores. Please refer

Flex-VNF-Arch-Object:

If sessions are not load-balanced across worker threads, check the class of traffic being received using the command below. By default, Versa software maps traffic for a given class to a worker core. This mapping can be changed using config commands.

If sessions are not equally distributed across worker cores and throughput is lower than expected, contact a Versa engineer.

9. Verify there are not too many fragmented packets

Traffic going through an SD-WAN tunnel carries tunnel overhead. If large packets arrive on the Versa SD-WAN LAN side, they may be fragmented before being sent to the WAN over the SD-WAN tunnel. Fragmented packets are reassembled at the remote site before being sent to the customer LAN. Because fragmentation and reassembly are CPU-intensive tasks, throughput drops if there are too many fragments. The Director workflow pushes MSS readjustment for TCP packets going over the SD-WAN tunnel, and TCP packets are not fragmented if MSS readjustment is set for the tunnel. Only large UDP packets that do not fit into the SD-WAN tunnel are fragmented.
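A rough way to predict whether a given packet size will be fragmented is sketched below. It assumes that the full 82-byte base overhead quoted earlier in this article counts against the WAN MTU, uses a 1500-byte WAN MTU as an example, and assumes IPv4 and TCP headers without options. The MSS value that Director actually pushes is not stated in this article, so `adjusted_mss` is only an estimate.

```python
TUNNEL_OVERHEAD = 82   # bytes, base SD-WAN overhead quoted earlier in this article
IPV4_HEADER = 20       # bytes, IPv4 header without options
TCP_HEADER = 20        # bytes, TCP header without options


def max_inner_packet(wan_mtu, overhead=TUNNEL_OVERHEAD):
    """Largest LAN packet that fits in one tunneled packet on the WAN."""
    return wan_mtu - overhead


def needs_fragmentation(packet_size, wan_mtu, overhead=TUNNEL_OVERHEAD):
    """True if a LAN packet of this size cannot be tunneled in one piece."""
    return packet_size > max_inner_packet(wan_mtu, overhead)


def adjusted_mss(wan_mtu, overhead=TUNNEL_OVERHEAD):
    """Estimated TCP MSS that keeps full-size segments below the tunnel limit."""
    return max_inner_packet(wan_mtu, overhead) - IPV4_HEADER - TCP_HEADER


if __name__ == "__main__":
    WAN_MTU = 1500  # assumed example value
    print("largest unfragmented LAN packet:", max_inner_packet(WAN_MTU))
    print("1500-byte UDP packet fragmented:", needs_fragmentation(1500, WAN_MTU))
    print("estimated adjusted MSS:", adjusted_mss(WAN_MTU))
```

Under these assumptions a full 1500-byte UDP datagram from the LAN always needs fragmentation, so UDP throughput tests are better run with a payload size that stays under the tunnel limit.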

To check the number of packets fragmented and reassembled, run the commands below:

When customer traffic arrives on the LAN/WAN with the DF bit set but fragmentation is needed to send it over the SD-WAN tunnel, Versa FlexVNF sends an ICMP error message to the sender with the message “DF bit set but fragmentation needed”. Most network devices react to this message and send the packet with the DF bit unset. However, some SIP phones and old legacy devices, such as RADIUS servers, have been seen to send packets with the DF bit set and not respond to the ICMP “DF bit set but fragmentation needed” message. Versa introduced ‘override-df-bit tunnel’ to solve this issue: when customer traffic that requires fragmentation arrives on the LAN/WAN with the DF bit set and the sender does not respond to the ICMP error message, Versa clears the DF bit, fragments the packet, and sends it over the tunnel. If high fragmentation is seen, check that ‘override-df-bit tunnel’ is set.

Verify ‘override-df-bit tunnel’ using CLI: show configuration orgs org-services options override-df-bit

Verify the packets fragmented at the tunnel using the commands below:

10. Verify no punt packets across workers

Traffic for a session is processed by a single worker core. The 5-tuple is used to anchor a session to a worker core. All traffic between the local site and a remote site goes through a single SD-WAN tunnel, which has the same 5-tuple for all customer sessions carried in the tunnel. Anchoring a session on a core would therefore require a worker thread to decapsulate the tunnel first, which is itself a CPU-intensive operation. To achieve load balancing among worker threads at the remote end, the local site sends a CRC of the 5-tuple in the encapsulation headers to the remote site, and the remote site anchors the session based on that CRC. It is possible that some sessions are anchored on an incorrect core and later get punted to the right core. This may reduce throughput if packets are punted at a high rate. Check the count of packets punted to WT using the CLI below.

If there are many fragmented packets with NAT, firewall, and HA, it is expected that packets are punted between worker threads.

If packets are punted to WT at a high rate and throughput is lower than expected, contact a Versa engineer.
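The idea of carrying a CRC of the inner 5-tuple in the encapsulation header can be sketched as follows. Everything here is a toy model: the 4-byte header layout, the use of CRC32, and the modulo mapping to a worker are assumptions made for illustration and do not describe the actual Versa encapsulation format.

```python
import struct
import zlib


def session_crc(five_tuple):
    """CRC32 over a textual form of the 5-tuple (illustrative choice of hash)."""
    key = "|".join(str(field) for field in five_tuple).encode()
    return zlib.crc32(key)


def encapsulate(five_tuple, payload):
    """Local site: prepend a toy 4-byte header carrying the CRC of the inner 5-tuple."""
    return struct.pack("!I", session_crc(five_tuple)) + payload


def pick_worker(tunnel_packet, n_workers):
    """Remote site: choose a worker from the header hint without parsing the payload."""
    (crc,) = struct.unpack("!I", tunnel_packet[:4])
    return crc % n_workers


if __name__ == "__main__":
    session = ("192.168.1.10", "192.168.2.20", 51000, 443, 6)  # hypothetical flow
    first = encapsulate(session, b"first packet")
    second = encapsulate(session, b"second packet")
    # Both packets of the session map to the same worker index.
    print(pick_worker(first, 4), pick_worker(second, 4))
```

The remote side reads only the fixed-size hint, so it can spread tunneled sessions across workers without decapsulating first. A punt corresponds to the case where this early choice turns out not to be the core that owns the session.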

11. Verify poller count

Typically, FlexVNF allocates 1 poller CPU for every 10G of Tx/Rx link. For example, depending on the platform, if there are 6x1G and 2x10G ports, 3 poller CPUs may be assigned. Poller CPUs are assigned when Versa services come up during a reboot or restart. The assignment is based on the number of NICs present on the device, even if some of the NICs are not connected or used.
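The rule of thumb above can be written as a short calculation. This is one reading of the "1 poller per 10G" guideline (total port capacity divided by 10G, rounded up); the actual assignment is made by VOS and depends on the platform, so treat the result as an estimate only. The function name and input format are hypothetical.

```python
import math

GBPS_PER_POLLER = 10  # rule of thumb from the text: 1 poller CPU per 10G of link


def estimate_pollers(ports):
    """Estimate poller CPUs; ports maps port speed in Gbps to port count."""
    total_gbps = sum(speed * count for speed, count in ports.items())
    return max(1, math.ceil(total_gbps / GBPS_PER_POLLER))


if __name__ == "__main__":
    # 6x1G + 2x10G = 26G of link capacity.
    print(estimate_pollers({1: 6, 10: 2}))
    # The same box counted with only one 10G port: 16G of link capacity.
    print(estimate_pollers({1: 6, 10: 1}))
```

For the 6x1G plus 2x10G example this gives 3, matching the figure in the text, and counting only one of the 10G ports gives 2.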

If some of the NICs are not used, for example 1x10G, the number of poller CPUs assigned can be reduced to give more CPUs to worker cores. Run the command below to check the number of poller CPUs assigned.

12. Verify worker/poller CPU utilization and worker/poller drops

Verify worker and poller CPU usage. If the poller and worker CPUs are already running at 100% and all the optimizations are enabled, this could be the maximum throughput of the FlexVNF.

Note: When the run mode is configured as hyper, CPUs run at 100% even when there are no packets to process. By default, FlexVNF runs in Performance mode.

To verify worker and poller CPU utilization, run htop and top -H (press 1 to toggle the per-CPU view). To check for high memory usage, use top -o %MEM; to check for high CPU usage, use top -o %CPU.

If the worker and poller CPUs are running at 100%, there might be drops at the worker and poller. Run the command below to check for drops in the worker and poller. The flag definitions that follow help in reading the top -H output above.

us, user: time running un-niced user processes

sy, system: time running kernel processes

ni, nice : time running niced user processes

wa, IO-wait: time waiting for I/O completion

hi : time spent servicing hardware interrupts

si : time spent servicing software interrupts

st : time stolen from this VM by the hypervisor (if the KVM host or hypervisor is oversubscribed or under high CPU load, this number will be high)
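If top output is captured to a file during a test, the flags listed above can be pulled out of each CPU line with a small parser. The sketch below assumes the usual procps top layout with a dot as the decimal separator; the sample line is invented for illustration and is not output from a Versa device.

```python
import re

# Invented sample in the usual top per-CPU format (not from a real device).
SAMPLE = "%Cpu3  : 97.0 us,  2.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  1.0 si,  0.0 st"

# A number followed by a two-letter flag such as "us" or "st".
FLAG_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s+([a-z]{2})\b")


def parse_cpu_line(line):
    """Return {flag: percent} for one top CPU summary line."""
    return {flag: float(value) for value, flag in FLAG_PATTERN.findall(line)}


def busy_percent(flags):
    """CPU time not spent idle, derived from the 'id' (idle) field."""
    return 100.0 - flags["id"]


if __name__ == "__main__":
    flags = parse_cpu_line(SAMPLE)
    print(flags)
    print("busy:", busy_percent(flags), "steal:", flags["st"])
```

A consistently non-zero `st` value on a virtual deployment suggests that the hypervisor, rather than VOS, is limiting the CPU time available to the worker and poller threads.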

Also run ‘show vsm statistics thrm detail’, which gives more detail on where packets are dropped.

If the worker and poller CPUs are not running at 100% but packet drops are still seen in the poller and/or worker, all the above checks have been done, and throughput is lower than expected, involve a Versa engineer to debug further.