Purpose:
This article explains the troubleshooting steps to debug high CPU issues in the Versa operating system (VOS).
Table of Contents:
Case-1: Troubleshooting high CPU
1.1: System verification and CPU core allocation/mapping
Case-2: Workers/Pollers showing high CPU utilization
2.1: Setting the correct run-mode in virtual environments
2.2: High CPU utilization on Worker-0 and Core-0 (perf record and live core collection)
2.3: Lcore stuck
2.4: CPU isolation for control-heavy deployments
2.5: High poller CPU due to stale tcpdump/tap interfaces
2.6: Function profiler
Case-3: Versa-VSMD process showing high CPU utilization
Case-4: CPU spikes at random time intervals
Case-5: Data collection
Troubleshooting High CPU:
Investigating high CPU usage requires in-depth analysis to root-cause the problem on a device running the Versa operating system.
You can follow the detailed steps below to identify the root cause of high CPU utilization.
Case-1: System verification and CPU core allocation/mapping
Check the system details using the command “vsh details”. This command shows the hardware type (bare metal or hypervisor), the number of CPU cores, the total installed/available memory, the VOS software version, and the SPack and OSS Pack details of the system.
These details are important for confirming that the device is running on recommended hardware with a supported number of CPU cores and amount of memory.
The top output is very useful: it lists the CPU cores and sorts the processes by their utilization. After executing “top -H”, press 1 to display the utilization of each individual core. Check the values listed in each CPU line, such as wa (I/O wait), si (soft interrupts), and st (steal time). Because VOS is also offered in virtual environments, the st column is particularly important for understanding the state of the CPU cores. If this number is high, the hypervisor may be giving CPU cycles to other VMs on the host machine. This value typically increases in deployments where CPU cores are not dedicated to the virtual machine and are shared instead.
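For reference, a hypothetical per-core line from the top -H output (after pressing 1) looks like the following; the numbers are illustrative only:
%Cpu3  : 96.0 us,  1.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  2.0 si,  1.0 st
Here us = user, sy = system, id = idle, wa = I/O wait, si = soft interrupts, and st = steal time (cycles taken away by the hypervisor).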
For virtual machine deployments, you must allocate a dedicated CPU and memory (with 1:1 provisioning) to the Versa branch components. Depending on the number of sockets present in the host, you might need to use CPU pinning. To improve deterministic performance, it is recommended that you turn off hyperthreading at the host level, by disabling hyperthreading in the host's BIOS.
To verify the number of active cores and the status of hyperthreading, issue the lscpu command in the host's operating system. If Thread(s) per core is 2 or more, hyperthreading is enabled.
[admin@cpe2: ~] $ lscpu | grep Thread
Thread(s) per core: 1
You can also dump the output of /proc/cpuinfo, which shows detailed information for each CPU core.
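As a quick sketch using standard Linux commands (not Versa-specific), the following summarizes the CPU count, model, and topology:
grep -c ^processor /proc/cpuinfo                   # number of logical CPUs visible to the system
grep "model name" /proc/cpuinfo | sort | uniq -c   # CPU model presented to the VM/host
lscpu | grep -E "Socket|Core|Thread"               # sockets, cores per socket, threads per core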
Case-2: Workers/Pollers showing high CPU utilization
Check whether the worker or poller threads are running high and contributing to the overall high CPU utilization. The poller thread's job is to poll for packets and hand them to the worker threads, where all service-module processing is done.
You can use the following commands to check the worker/poller mappings and ensure the system has an adequate number of pollers and workers assigned. VOS devices with 10Gig interfaces should have a dedicated poller assigned to each 10Gig interface.
vsm-vcsn0> show vsm cpu info
VSM CPU info:
-----------------------------------------------------
# of CPUs            : 4
# of poller threads  : 1
# of worker threads  : 2
# of control threads : 1
Used CPUs    : [ 0 1 2 3 ]
Poller CPUs  : [ 3 ]
Worker CPUs  : [ 1 2 ]
Control CPUs : [ 0 ]
vsm-vcsn0> show vsm cpu port-to-poller-map
VNI Name   VNI Socket   Poller IDX   Poller CPU   Pid
--------   ----------   ----------   ----------   --------
vni-0/0    0            0            3            3987
vni-0/1    0            0            3            3987
vni-0/2    0            0            3            3987
vni-0/3    0            0            3            3987
vni-0/4    0            0            3            3987
spt-0/0    -1           0            3            3987
Verify the total traffic arriving on the ports and ensure no oversubscription is occurring, which in turn could cause high CPU. You can execute the CLI command “show interfaces port statistics brief” to check the traffic statistics.
Check the total number of sessions currently active on the box. The following commands show the total session limit of the box and how many sessions are currently active.
admin@cpe2-cli> show device clients
CLIENT  VSN  CPU   MEM   MAX       ACTIVE    FAILED
ID      ID   LOAD  LOAD  SESSIONS  SESSIONS  SESSIONS
-----------------------------------------------------
15      0    36    80    500000    25        0
admin@cpe2-cli> show orgs org Tenant-1 sessions summary
                                NAT      NAT      NAT               SESSION  TCP      UDP      ICMP     OTHER
VSN  SESSION  SESSION  SESSION  SESSION  SESSION  SESSION  SESSION  COUNT    SESSION  SESSION  SESSION  SESSION
ID   COUNT    CREATED  CLOSED   COUNT    CREATED  CLOSED   FAILED   MAX      COUNT    COUNT    COUNT    COUNT
---------------------------------------------------------------------------------------------------------------
0    29       628100   628071   0        20       20       0        100000   0        28       0        1
Check the configuration of the device showing high worker CPU to see whether SSL decryption, NGFW, AV, IDS, IPS, URL filtering, etc. are enabled in a policy that matches this traffic.
Throughput drops drastically when these features are enabled.
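As a quick check, you can filter the set-format configuration for such features from the VOS CLI; the match keyword below is a placeholder, so substitute the feature or policy name used in your configuration:
show configuration | display set | match <feature-keyword>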
Step-2.1: Setting the correct run-mode in virtual environments
If the system is running in a virtualized environment and you notice the poller/worker CPUs showing high utilization despite low traffic, consider setting the run-mode to manual with the following configuration parameters to achieve optimal performance.
These configuration settings are valid only for virtual environments, and they take effect without a restart. Once these settings are configured, ensure that they are reflected under the vsmd VTY:
set system service-options run-mode Manual
set system service-options min-idle-sleep-time 5
set system service-options max-idle-sleep-time 20
vsm-vcsn0> show vsm settings datapath
vsm datapath settings:
run-mode : manual
thrm idle sleep min poller : 5 usec
thrm idle sleep max poller : 20 usec
thrm idle sleep min worker : 5 usec
thrm idle sleep max worker : 20 usec
thrm rx bulking : 16
thrm tx bulking : 16
thrm pkt ring size : 2048
dpdk max frags cap : 48
Versa FlexVNF Run-modes
Versa FlexVNF uses poll-mode drivers in user space to achieve optimal performance. The polling frequency of the interfaces/rings is controlled by two parameters: min-idle-sleep-time and max-idle-sleep-time. The polling logic in Versa FlexVNF is to poll for packets/events in a tight loop and, if no activity is seen for min-idle-sleep-time continuously, to sleep for max-idle-sleep-time. Each run-mode controls the values of these two parameters.
These are the different run-modes:
Run-mode    | Description
------------+---------------------------------------------------------------------------
Performance | The poller/worker sleeps for a short duration once idle. This is the default mode.
Density     | The poller/worker sleeps for a longer duration once idle. This is a power-saving mode, but it may experience higher latency.
Hyper       | The poller/worker polls for packets in a continuous loop. CPU utilization is 100% even when there is no traffic. Better latency results are possible with this mode.
Manual      | The user controls the values of min-idle-sleep-time and max-idle-sleep-time directly.
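For example, to return to the default mode, the same configuration syntax should accept the other mode names listed above (this is an assumption; verify the accepted values on your release):
set system service-options run-mode Performance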
Step-2.2: High CPU utilization on Worker-0 and Core-0
The worker-0 and control cores are separate CPUs. In the top -H output, the first CPU, cpu0, is assigned to the control core and cpu1 is worker-0. In the htop output the CPU numbering starts at 1, so under that command CPU-1 belongs to the control core and CPU-2 is worker-0. The control core handles all host-bound traffic, while worker-0 also handles transit data traffic.
There are cases when the control core or worker-0 CPU runs at very high utilization (95%+) and causes service degradation or performance issues. In such conditions you may see SD-WAN control-plane protocol flaps, because SD-WAN control packets are sent to the worker-0 CPU and then to control core 0. However, high worker/control-core CPU can also be an after-effect of control-plane protocol flaps, so one should check the system and logs to identify the trigger.
You can run the following command to see the traffic statistics and session load on each worker thread. From the output below, you can verify whether sessions are being load-balanced properly.
vsm-vcsn0> show vsf per-thread nfp stats summary
Thr-Id   Sess-Active   Sess-Created   Sess-Closed
------   -----------   ------------   -----------
0        3             244355         244352
1        3             213876         213873
2        4             233494         233490
3        5             212999         212994
4        6             259292         259286
5        5             245905         245900
The following command shows the data classified into each queue from each worker thread's perspective. This can help determine whether too much control traffic is being processed. These counters are cumulative, so clear the counters first and take multiple iterations while debugging the issue in production networks.
vsm-vcsn0> show vsm cq stats
+-------+-------------+-------------+-------------+-------------+
| W TID |    CTRL     |    DATA     |     EF      |     AF      |
+-------+-------------+-------------+-------------+-------------+
|   0   |   181448678 |     4821848 |           5 |           0 |
|   1   |      267252 |    11041411 |           0 |           0 |
|   2   |       23458 |     3416402 |           0 |           0 |
|   3   |      348178 |     5951665 |           0 |           0 |
|   4   |       86661 |    34802973 |           0 |           0 |
|   5   |       66002 |     3727956 |           0 |           0 |
+-------+-------------+-------------+-------------+-------------+
You can check the thrm event statistics (show vsm statistics thrm evstats from the vsmd VTY) to see whether there is outstanding data queued that is not getting processed, or whether data processing is being slowed down.
---------------------------------------------------------------------------------------------------------------
S-tindex | D-tindex | EV-posted | EV-processed | EV-proc-fail | EV-po-invalid| EV-po-alloc | EV-po-write |
---------------------------------------------------------------------------------------------------------------
Thread group: 1
------------------------------------------------------------------------------------------------------------
3943(1)| 3948(0)| 0| 0| 0| 0| 0| 0|
------------------------------------------------------------------------------------------------------------
3943(1)| 3949(0)| 0| 0| 0| 0| 0| 0|
------------------------------------------------------------------------------------------------------------
3943(1)| 3907(2)| 2349577952| 2016057686| 0| 0| 0| 0|
------------------------------------------------------------------------------------------------------------
3943(1)| 4155(3)| 0| 0| 0| 0| 0| 0|
------------------------------------------------------------------------------------------------------------
3944(1)| 3948(0)| 0| 0| 0| 0| 0| 0|
------------------------------------------------------------------------------------------------------------
3944(1)| 3949(0)| 0| 0| 0| 0| 0| 0|
------------------------------------------------------------------------------------------------------------
3944(1)| 3907(2)| 76294088| 35576529| 0| 0| 0| 0|
------------------------------------------------------------------------------------------------------------
3944(1)| 4155(3)| 0| 0| 0| 0| 0| 0|
------------------------------------------------------------------------------------------------------------
3945(1)| 3948(0)| 0| 0| 0| 0| 0| 0|
------------------------------------------------------------------------------------------------------------
3945(1)| 3949(0)| 0| 0| 0| 0| 0| 0|
------------------------------------------------------------------------------------------------------------
3945(1)| 3907(2)| 74415257| 35138969| 0| 0| 0| 0|
------------------------------------------------------------------------------------------------------------
3945(1)| 4155(3)| 0| 0| 0| 0| 0| 0|
------------------------------------------------------------------------------------------------------------
3946(1)| 3948(0)| 0| 0| 0| 0| 0| 0|
------------------------------------------------------------------------------------------------------------
3946(1)| 3949(0)| 0| 0| 0| 0| 0| 0|
------------------------------------------------------------------------------------------------------------
3946(1)| 3907(2)| 77491717| 34925201| 0| 0| 0| 0|
------------------------------------------------------------------------------------------------------------
3946(1)| 4155(3)| 0| 0| 0| 0| 0| 0|
------------------------------------------------------------------------------------------------------------
3947(1)| 3948(0)| 0| 0| 0| 0| 0| 0|
You can also execute the following command to check the Rx/Tx counters from the poller and worker perspective. It also shows the classified counters for each queue, which gives an understanding of how much control and data traffic is being processed by each worker thread.
vsm-vcsn0> show vsm statistics thrm detail
Number of RX Poller Threads [2]
RX Poller PID [27306], CPU-ID [3]
RX Poller PID [27308], CPU-ID [4]
Network Packets RX : 232086663
Network Packets TX : 10094853
Worker Thrd - Packets RX : 10094858
Worker Thrd - Packets TX : 232086663
Drop Packets RX : 0
Drop Packets TX - Link Down : 5
Drop Packets TX - Driver : 0
Worker thread queue stats [Queued/Dropped] :
|-----------------------------------------------------------------------------------------------------------------
| W| CONTROL | DATA | EF | AF | Total
|-----------------------------------------------------------------------------------------------------------------
| | Queued | Dropped | Queued | Dropped | Queued | Dropped | Queued | Dropped | Queued | Dropped |
|-----------------------------------------------------------------------------------------------------------------
| 0| 193100893| 0| 38985752| 0| 18| 0| 0| 0| 232086663| 0
|-----------------------------------------------------------------------------------------------------------------
Port tx stats :
|-----------------------------------------------------------|
| P| Name | Queued |Link Down Dropped|Driver Dropped|
|-----------------------------------------------------------|
| 0| vni-0/0 | 5846280| 2| 0|
| 1| vni-0/2 | 137499| 1| 0|
| 2| spt-0/0 | 1| 1| 0|
| 3| vni-0/1 | 2213160| 0| 0|
| 4| vni-0/3 | 1897913| 1| 0|
|-----------------------------------------------------------|
Number of Worker Threads [1]
Worker PID [27301], CPU-ID [2]
Network Packets RX : 232086663
Network Packets TX : 9849798
Multiseg Coalescing TX : 0
Multiseg Coalescing fail TX : 0
Network Packets TX Fail : 0
Number of Control Threads [1]
Control PID [26896], CPU-ID [0]
Network Packets RX : 0
Network Packets TX : 0
Network Packets TX Fail : 0
Perf record:
You can run a performance recording (perf record) against the thread ID of the core that is running high and check where the time is being spent. You can also dump a live core for the thread running high and engage engineering assistance to root-cause it further.
Execute the following command from device shell to get perf report and live core:
sudo perf record -g -t <thread-id of the ipsec-control-thread/vsmd/dhcp/confd/worker-thread>
Stop the "perf record" by doing "Ctrl + C".
Collect the perf.data
You can read the perf report using the command "sudo perf report". This can give a clue about where most of the CPU time is being spent.
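A minimal sketch of the workflow (the thread ID 3987 is illustrative only; take the real one from the top -H output):
sudo perf record -g -t 3987 -- sleep 30    # sample the busy thread for 30 seconds with call graphs
sudo perf report --stdio | head -50        # print the hottest call paths from perf.data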
There can be instances of high control-core utilization, the control thread getting stuck, or a high volume of ITC messages being sent to the control thread, eventually building up the queue.
The queue build-up can be due to either:
- A fast producer: worker threads sending ITC floods to the control thread faster than the control thread can handle. In this case, collect the output of "show vsm thrm ev-time-log 0".
- A slow consumer: the control thread stuck in hot spots, for example many SLA/link flaps or PBF path rebuilds across branches causing more control-plane activity. In this case, it is good to collect the control-thread back-trace to get more insight into what the CPU is actually processing at that time. You can collect it using the command below.
sudo kill -SIGUSR1 `pgrep versa-vsmd`
This will dump the back-trace to the /var/log/versa/versa-service.log file.
Example:
Thread versa-vsmd[3761] received Signal User defined signal 1[10] lcore_id 0, Dumping callstack...
/opt/versa/bin/versa-vsmd() [0x79947b]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10330) [0x7fae6f047330]
/lib/x86_64-linux-gnu/libc.so.6(epoll_wait+0x33) [0x7fae6b76e6d3]
/opt/versa/bin/versa-vsmd() [0x85018c]
/opt/versa/bin/versa-vsmd(event_base_loop+0x150) [0x846260]
/opt/versa/bin/versa-vsmd(vsm_main+0x9ad) [0x79f29d]
/opt/versa/bin/versa-vsmd(main+0x20) [0x79f80d]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fae6b691f45]
/opt/versa/bin/versa-vsmd() [0x7418fb]
Live core:
You can collect a live core using the command below. Take the PID of the process and dump the live core.
sudo gcore 3606 <<<<<<<<<<<<<<<Get the PID from top -H output>>>>>>>>>>>>>
Note: Running gcore in a production network can cause a service interruption at times; the impact varies case to case. Be careful when executing it during production hours. Prefer running it in a maintenance window or during non-production hours, with an out-of-band (OOB) management connection available.
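An illustrative sequence, assuming versa-vsmd is the process of interest (the output path is an example):
PID=$(pgrep versa-vsmd)                        # or take the PID from the top -H output
sudo gcore -o /tmp/versa-vsmd-livecore $PID    # writes /tmp/versa-vsmd-livecore.<PID>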
Step-2.3: Lcore stuck
In some cases, an lcore stuck condition can be observed, which in turn shows up as high worker CPU utilization or a stuck thread. If you see a worker CPU stuck at very high utilization, check whether any lcore stuck events are reported. You can refer to the /var/log/versa/versa-service.log file (the same file used for the back-trace collection above), where the back-trace for the stuck lcore is dumped.
SIGUSR1 signal sent to lcore 12
Thread worker-0[5853] received Signal User defined signal 1[10] lcore_id 12, Dumping callstack...
lcore 13 is stuck, last seen 2013ms ago.
lcore 14 is stuck, last seen 2013ms ago.
lcore 15 is stuck, last seen 2013ms ago.
lcore 16 is stuck, last seen 2013ms ago.
lcore 17 is stuck, last seen 2013ms ago.
lcore 18 is stuck, last seen 2013ms ago.
lcore 19 is stuck, last seen 2013ms ago.
lcore 20 is stuck, last seen 2013ms ago.
lcore 21 is stuck, last seen 2014ms ago.
lcore 22 is stuck, last seen 2014ms ago.
lcore 23 is stuck, last seen 2014ms ago.
Once an lcore stuck event is seen, the call-stack trace is also dumped in the same file. You can also collect the stack-trace files from /var/log/versa/debug/lcore-stuck and share them with engineering for further analysis.
If the system is running an older build, you can enable the call-graph and collect it for further analysis:
test vsm cpu lcore-keepalive disable
test vsm cpu lcore-keepalive enable 2
test vsm cpu lcore-keepalive callgraph enable
Also check the memory utilization of the system and the overall session/traffic load on the box. If the system is running out of memory, lcore stuck can be observed as an after-effect, because the system becomes slow.
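As a quick check from the shell (both commands also appear in the data-collection list later in this article):
free -h        # overall memory utilization
top -H         # press Shift+M to sort threads by resident memory and spot heavy consumers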
Step-2.4: CPU isolation for control-heavy deployments
In deployments where more control-plane activity is expected, such as Controllers, hub-as-controller, or hub devices, it is recommended to isolate the CPU and, if required, add more CPUs to the control core. You can execute the following commands to enable CPU isolation:
request system isolate-cpu enable
request system isolate-cpu enable num-control-cpu 2
Additionally, you can refer to the article below for more information:
https://support.versa-networks.com/a/solutions/articles/23000024219
Step-2.5: High poller CPU due to stale tcpdump/tap interfaces
There are cases where a specific worker/poller thread's CPU goes high and causes performance issues. We have seen cases where tcpdump was enabled on a specific interface and was not terminated properly, leaving the tap interface up. If this happens on an interface carrying a large amount of traffic, such as on hub devices, it can lead to high CPU on the poller thread that is mapped to the interface.
Ensure that tcpdump is terminated properly and that the tap interface is removed. You can verify the tap interface from the appliance shell using the “ifconfig” command. When you enable tcpdump on an interface, it creates a tap interface, usually seen with a name such as _vni_0_0.
[admin@cpe2: ~] $ ifconfig
_vni_0_0 Link encap:Ethernet HWaddr e2:05:2d:cd:83:c6
inet6 addr: fe80::e005:2dff:fecd:83c6/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:156 errors:0 dropped:0 overruns:0 frame:0
TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:22526 (22.5 KB) TX bytes:648 (648.0 B)
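A minimal sketch for spotting leftover captures and tap interfaces using standard Linux commands (terminate a stale capture only after confirming it is no longer needed):
ps aux | grep -i tcpdump    # any capture processes still running?
ifconfig | grep _vni_       # any leftover tap interfaces created by tcpdump?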
Step-2.6: Function profiler
Enable the function (func) profiler and collect the corresponding log output.
vsm-vcsn0> show vsm func profiler callpts
+------------------------------------------+--------------+----------+
| CALLPT NAME | Callpt IDx | Enabled |
+------------------------------------------+--------------+----------+
| VS_CALLPT_NFP_ENTER | 1 | 0 | <<<0 disable/1 enable>>>
| VS_CALLPT_NFP_SESS_INTR | 2 | 0 |
| VS_CALLPT_NFP_SESS_CREATE | 3 | 0 |
| VS_CALLPT_NFP_SESS_DELETE | 4 | 0 |
| VS_CALLPT_NFP_SESS_REEVAL | 5 | 0 |
| VS_CALLPT_NFP_FLOW_EVENT | 6 | 0 |
vsm-vcsn0> set vsm func profiler 0 1 <Setting the func profiler manually>
Value 0 selects all callpts and 1 enables profiling. Once func profiler logging is enabled, collect the output below 3-4 times.
vsm-vcsn0> show vsm func profiler log
+-------+-------------------------------------------+------------+------------+------------+----------------+
| W TID | CALLPT NAME |Max Time(us)|Avg Time(us)| Count | Cookie |
+-------+-------------------------------------------+------------+------------+------------+----------------+
| 0 | VS_CALLPT_NFP_SESS_INTR | 2236 | 2178 | 2| 12 |
| 0 | VS_CALLPT_NFP_FLOW_EVENT | 579 | 246 | 5| 9 |
| 0 | VS_CALLPT_NFP_INPUT_CHAIN | 1377 | 232 | 24| 9 |
| 0 | VS_CALLPT_NFP_SESS_PURGE_LIST | 1491 | 223 | 24| 0 |
| 0 | VS_CALLPT_THRM_TMR_TICK | 5457 | 449 | 427| 0 |
| 0 | VS_CALLPT_THRM_VUNET_COUT | 8578 | 433 | 777| 0 |
| 0 | VS_CALLPT_THRM_VJOB_SCHED | 639 | 224 | 5| 0 |
| 0 | VS_CALLPT_THRM_EV_TMR | 3773 | 446 | 195| 0 |
| 0 | VS_CALLPT_THRM_IORX_WQS | 53585 | 845 | 1210| 0 |
| 0 | VS_CALLPT_THRM_APP_EXT | 30649 | 461 | 15217| 0 |
| 0 | VS_CALLPT_THRM_CT_NWQS | 2233 | 450 | 27| 0 |
| 0 | VS_CALLPT_THRM_DISP_APP | 1492 | 219 | 25| 0 |
Once done, disable the func profiler logging.
vsm-vcsn0> set vsm func profiler 0 0 <turn off logging for all callpts>
Case-3: Versa-VSMD process showing high CPU utilization
There could be cases where you notice versa-vsmd process showing high CPU utilization.
top - 05:39:13 up 46 days, 14:33, 5 users, load average: 1.11, 1.11, 1.14
Tasks: 310 total, 2 running, 308 sleeping, 0 stopped, 0 zombie
%Cpu(s): 6.2 us, 0.4 sy, 0.0 ni, 93.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 65622768 total, 29087104 used, 36535664 free, 212536 buffers
KiB Swap: 0 total, 0 used, 0 free. 4215504 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4083 root 20 0 24.808g 0.012t 211044 R 154.8 20.0 76237:21 versa-vsmd
2357 versa 20 0 1073528 296340 8340 S 1.3 0.5 1716:28 versa-dnsd
93312 versa 20 0 57468 21580 6020 S 1.0 0.0 0:06.21 versa-certd
3041 versa 20 0 40488 3872 3416 S 0.7 0.0 572:37.98 versa-devmgr
In such cases, you should also check the control-core utilization. You may notice in some cases that the control thread is busy processing messages and, in turn, versa-vsmd shows high utilization. There is also a control thread running inside vsmd that processes control traffic for protocols such as IPsec.
In such cases, check the IKE/IPsec processing on the box and verify the configuration: the IKE/IPsec tunnel scale, the policies configured, and associated objects such as IP- and FQDN-related match criteria.
You can also collect performance report (perf) data; it can help determine which process is consuming most of the CPU cycles.
You can perform a few additional checks:
1> Verify whether any script is running on the box. You can check this from the shell:
sudo ps -aux | grep -i .py
sudo ps -aux | grep -i .sh
2> Verify whether any cron jobs are configured:
sudo crontab -l
sudo ls -ltrha /etc/cron.d/
3> Verify the session load on the box and the rate at which sessions are being created. The overall session load may not be reaching the total session capacity of the box, while the sessions/second rate may be much higher than what the box can process without straining its CPU resources. Analytics graphs can help you understand this better.
4> Verify the traffic arriving at the box for protocols such as SNMP, DHCP, DNS, etc.
5> If you see a high session load on spoke devices, verify how NAT is being done, in case the box is configured for DIA. If SNAT is done on the WAN side, the box ends up creating two sessions for one flow. This usually goes unnoticed and may not directly cause CPU issues at first, but the box can reach its session limit.
Note: At times we execute the htop command to check the CPU usage of the box, as it shows a graphical view of each CPU core's usage separately. However, this can be misleading for per-process usage such as versa-vsmd: htop shows the aggregate CPU usage of the vsmd process, which will always look high because vsmd is a multi-threaded process. High versa-vsmd usage under this command is therefore normal and should not, by itself, be a cause for concern.
Always take the top -H output (and press 1) to see the utilization of each core along with the per-thread utilization, and make an informed decision based on that.
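To see the per-thread breakdown of versa-vsmd specifically, a quick sketch using standard top options (pgrep is used here in the same way as in the back-trace example earlier):
top -H -p $(pgrep -d, versa-vsmd)    # per-thread CPU usage of the vsmd process only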
Case-4: CPU spikes at random time intervals
There are cases where high-CPU triggers are seen at random time intervals, or sometimes in a specific time slot. Verify the following to isolate such cases:
- Note down the timestamps of the high CPU and see whether any traffic pattern correlates with them (a periodic capture sketch is shown after this checklist).
- There could be unwanted traffic coming to the box due to an attack such as DDoS, ARP flood, or TCP SYN flood. You can use Analytics to check the number of traffic sessions, NAT sessions, overall CPU usage, and access-circuit usage to understand whether there was an actual traffic surge or an increased amount of control traffic arriving at the box.
- You can also check the DDoS counters, ARP counters, and TCP counters from the box and the vsmd VTY to make the correlation.
- Check whether any scripts are running (sudo ls -ltrha /etc/cron.d/).
admin@cpe2-cli> show statistics internal arp
statistics internal arp arp_global 0
addr_match 547
bad_addr_len 0
bad_proto_len 0
bcast_addr_match 0
drop_mcast 0
drop_noproxy 0
dropped 8893
dupips 0
last_addr_match 68939275
no_addr_match 0
received 69274295
rlexceeded 334473 <<<<<<<<<<<these are rate limit counters for ARP packets.
rxlooped 0
rxreplies 425
rxrequests 68939397
timeouts 4045
txreplies 359
txrequests 7900
If you see the rlexceeded counter incrementing, a large number of ARP requests are coming in and being dropped after rate limiting. There is a default ARP rate limiter set to 300 ARP packets/second. Check that it has not been increased to a much bigger number, because that would cause the system to send more ARP packets to the CPU, which can increase the utilization.
- Check whether there are too many SLA/BGP/BFD/IPsec protocol flaps causing high CPU; however, keep in mind that flaps can also be an after-effect of high CPU. Check the CPU usage graphs/alarm threshold triggers and the corresponding protocol flaps to understand the actual trigger of the high CPU, take an informed decision, and set the action plan accordingly.
show orgs org <Org-name> sd-wan sla-monitor status
show orgs org <Org-name> sd-wan sla-monitor metrics last-1m
show orgs org <Org-name> sd-wan sla-monitor metrics last-10s
show bgp neighbor brief
show bfd session brief
- Ensure there are no debugs enabled on the device:
show configuration | display set | match debug | details
show configuration | display set | match trace | details
request debug session filter-display
show filter trace table <<<from vsmd vty>>>>
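A minimal capture sketch for correlating random spikes, assuming there is enough free space under /tmp (the file name and the 60-second interval are arbitrary); stop it with Ctrl+C once a spike has been captured:
while true; do
    date >> /tmp/cpu-snapshots.log
    top -H -b -n 1 | head -40 >> /tmp/cpu-snapshots.log    # per-thread CPU snapshot in batch mode
    sleep 60
done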
The following are a few additional points that can be verified while debugging high CPU issues:
- Hyperthreading should not be enabled.
- Ensure CPU pinning is done and CPU resources are not shared.
- Verify whether SSL inspection is enabled and matching all the LAN traffic; this can lead to high CPU.
- Verify whether a large number of SNMP managers are polling the device extensively and causing high CPU.
- Check the device configuration in detail and confirm that no scaling limit is being hit for any module/service on the box.
Case-5: Data collection
The following command outputs can be collected to debug high CPU utilization.
Note: On VOS 21.X releases, refer to /var/log/versa/debug/high-cpu/tshoot.log. On older VOS releases, collect /var/log/versa/tshoot.log.
From the VOS CLI:
show interfaces port statistics brief
show interfaces port statistics detail
show system details
show device clients
show system load status
show orgs org Tenant-1 sessions summary <Change the Tenant-name>
show configuration system service-options | details
show system load-stats
show configuration | display set | match debug | details
show configuration | display set | match trace | details
request debug session filter-display
request system isolate-cpu status
From the shell:
top -H   (press 1 for the per-core view, Shift+M to sort by memory)
free -h
lscpu
vsh details
cat /proc/cmdline
cat /proc/net/dev
cat /proc/cpuinfo
lscpu | grep Thread
sudo ps -aux | grep -i .py
sudo ps -aux | grep -i .sh
sudo crontab -l
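As a convenience, a minimal sketch that wraps the shell commands above into a single timestamped file (the output path is illustrative):
OUT=/tmp/high-cpu-$(date +%Y%m%d-%H%M%S).txt
{
  echo "===== vsh details ====="     ; vsh details
  echo "===== top -H ====="          ; top -H -b -n 1
  echo "===== free -h ====="         ; free -h
  echo "===== lscpu ====="           ; lscpu
  echo "===== /proc/cmdline ====="   ; cat /proc/cmdline
  echo "===== /proc/net/dev ====="   ; cat /proc/net/dev
  echo "===== /proc/cpuinfo ====="   ; cat /proc/cpuinfo
} > "$OUT"
echo "Collected to $OUT"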
Perf record:
sudo perf record -g -t <thread-id of the ipsec-control-thread/vsmd/dhcp/confd/worker-thread>
Stop the "perf record" by doing "Ctrl + C".
Collect the perf.data
Live core:
sudo gcore 3606 <<<<<<<<<<<<<<<Get the PID from top -H output>>>>>>>>>>>>>
Use the “vsh connect vsmd” shell command to enter the vsmd VTY prompt, then collect:
show vsm cpu info
show vsm cpu health
show vsm cpu port-to-poller-map
show vsf nfp module stats brief
show vsm statistics dropped
show vsm statistics datapath
show vsf session stats brief
show vsm load now
show vsm load details
show vsm anchor core map
show vsf nfp tcp stats
show vsm cq stats
show vsf per-thread nfp stats summary
show vsf per-thread nfp stats detail 0 -1
show vsf per-thread nfp stats detail 1 -1
show vsm statistics thrm evstats
show vsm statistics thrm detail
show vsm statistics thrm port
show vsm settings datapath
show vsf tunnel stats
show vsm statistics mbuf
show filter trace table
From Analytics:
Collect the Memory-Usage/CPU-Usage/Access-Circuits-usage/Top-Consumer graphs from analytics for correlation.