Purpose:

This article explains the troubleshooting steps used to debug high CPU issues on devices running the Versa operating system (VOS).



Table of Contents:

 Case 1: Troubleshooting high CPU

   1.1: System verification and CPU core allocation/mapping

 Case 2: Workers/Pollers showing high CPU utilization

   2.1: Setting the correct run-mode in virtualized environments

   2.2: High CPU utilization on Worker-0 and Core-0 (perf record and live core)

   2.3: Lcore stuck

   2.4: CPU isolation for control-heavy deployments

   2.5: Leftover tcpdump/tap interfaces

   2.6: Function profiler

 Case 3: Versa-VSMD process showing high CPU utilization

 Case 4: CPU spikes at random time intervals

 Case 5: Data collection



 

Troubleshooting High CPU:

Investigating high CPU usage requires in-depth analysis to find the root cause of the problem on a device running the Versa operating system.
 

 

You can follow the detailed steps below to identify the root cause of high CPU utilization.

 

 

 

Case-1: System verification and CPU core allocation/mapping 


Check the system details using the command “vsh details”. This command shows the hardware type (bare metal or hypervisor), the number of CPU cores, the total installed/available memory, the VOS software details, and the SPack and OSS Pack details on the system.

 

 

These details are important for verifying that the device is running on recommended hardware and has a supported number of CPU cores and amount of memory.

 

 

 

 


The top output is very useful: it lists the CPU cores and sorts processes by their utilization in the system. After executing “top -H”, press 1 to display the utilization of each individual core. Check the values listed on each CPU line, such as wa, si, and st. Because VOS is also offered in virtual environments, pay particular attention to the st (steal time) column to understand the state of the CPU cores. If this number is high, the hypervisor may be giving the CPU cycles to other VMs on the host machine. This value typically increases in deployments where CPU cores are not dedicated to the virtual machine and are shared instead.
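As a non-interactive alternative to the interactive top view, the sketch below samples per-core utilization from the shell. It assumes the sysstat package (which provides mpstat) is present on the appliance; if it is not, the raw counters in /proc/stat carry the same information.

# Three samples, two seconds apart; watch the %steal column. A consistently
# high %steal means the hypervisor is giving this vCPU's cycles to other VMs.
mpstat -P ALL 2 3

# Raw counters if mpstat is not installed: on each cpuN line, the 8th value
# after the label is the cumulative steal time in USER_HZ ticks.
grep '^cpu[0-9]' /proc/stat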

 

 

 

 

 

For virtual machine deployments, you must allocate dedicated CPUs and memory (with 1:1 provisioning) to the Versa branch components. Depending on the number of sockets present in the host, you might need to use CPU pinning. For more deterministic performance, it is recommended that you turn off hyperthreading at the host level by disabling it in the host's BIOS.

 

To verify the number of active cores and check the status of hyperthreading, issue the lscpu command in the host's operating system. If Thread(s) per core is 2 or more, hyperthreading is enabled.

 

 

[admin@cpe2: ~] $ lscpu | grep Thread

Thread(s) per core:    1

 

 

 

You can also dump the output of /proc/cpuinfo, which shows detailed information for each CPU core. 
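A minimal sketch using standard Linux tools (no Versa-specific commands assumed) to summarize what the guest sees:

# Number of logical CPUs visible to the system
grep -c ^processor /proc/cpuinfo

# Socket/core/thread topology in one view; "Thread(s) per core" greater than 1
# means hyperthreading (SMT) is exposed to this system
lscpu | egrep 'Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core'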

 

 

 

Case-2: Workers/Pollers showing high CPU utilization:

 

 

Check whether worker or poller threads are running high and contributing to the overall high CPU utilization. The poller thread's job is to poll packets and hand them to the worker threads, where the processing for all service modules is done.

 

You can use the following commands to check the worker/poller mappings and ensure the system has an adequate number of pollers and workers assigned. VOS devices with 10 Gig interfaces should have a dedicated poller assigned to each 10 Gig interface.

 

 

vsm-vcsn0> show vsm cpu info

VSM CPU info:
-----------------------------------------------------
# of CPUs                                  : 4
# of poller threads                        : 1
# of worker threads                        : 2
# of control threads                       : 1

  Used     CPUs      :  [ 0 1 2 3  ]
  Poller CPUs       :   [ 3  ]
  Worker CPUs       :   [ 1 2  ]
  Control CPUs      :   [ 0  ]

 

 

 

vsm-vcsn0> show vsm cpu port-to-poller-map

VNI Name   VNI Socket   Poller IDX  Poller CPU      Pid
--------   ----------   ----------   ----------   --------
vni-0/0          0          0            3           3987
vni-0/1          0          0            3           3987
vni-0/2          0          0            3           3987
vni-0/3          0          0            3           3987
vni-0/4          0          0            3           3987
spt-0/0         -1          0            3           3987

 

 

Verify the total traffic arriving on the ports and ensure that no oversubscription is occurring, which in turn can cause high CPU. You can execute the CLI command “show interfaces port statistics brief” to check the traffic statistics.

 

 

Check the total number of sessions currently active on the box. The following commands show the total session limit of the box and how many sessions are currently active.

 

 

 

admin@cpe2-cli> show device clients

 

CLIENT  VSN   CPU   MEM   MAX        ACTIVE    FAILED

ID      ID    LOAD  LOAD  SESSIONS   SESSIONS  SESSIONS

-----------------------------------------------------

15      0     36    80    500000     25        0

 

 

 

 

admin@cpe2-cli> show orgs org Tenant-1 sessions summary

                                NAT      NAT       NAT               SESSION  TCP       UDP      ICMP     OTHER

VSN  SESSION   SESSION  SESSION  SESSION   SESSION  SESSION  SESSION   COUNT    SESSION  SESSION   SESSION  SESSION

ID   COUNT     CREATED  CLOSED   COUNT     CREATED  CLOSED   FAILED    MAX      COUNT    COUNT     COUNT    COUNT

-----------------------------------------------------------------------------------------------------------------

0    29        628100   628071   0         20       20       0         100000   0        28        0        1



Check the configuration of the device showing high worker CPU to see whether SSL decryption, NGFW, AV, IDS, IPS, URLF, or similar features are enabled on a policy that matches this traffic.

Throughput drops significantly when these features are enabled.

 

 

Step-2.1: Setting the correct run-mode in virtualized environments


If the system is running in a virtualized environment and you notice that poller/worker CPUs show high utilization despite low traffic, consider setting the run-mode to manual with the following configuration parameters to achieve optimal performance.

 

These configuration settings are valid only for virtual environments, and they take effect without a restart. Once these settings are configured, ensure that they are reflected under the vsmd VTY:

 

 

 

set system service-options run-mode Manual 

set system service-options min-idle-sleep-time 5

set system service-options max-idle-sleep-time 20

 

 

 

vsm-vcsn0> show vsm settings datapath

vsm datapath settings:

 

    run-mode                   : manual

 

    thrm idle sleep min poller : 5 usec

 

    thrm idle sleep max poller : 20 usec

 

   thrm idle sleep min worker : 5 usec

 

    thrm idle sleep max worker : 20 usec

 

    thrm rx bulking            : 16

 

    thrm tx bulking            : 16

 

    thrm pkt ring size         : 2048

 

    dpdk max frags cap         : 48

 

 


Versa FlexVNF Run-modes

 

Versa FlexVNF uses poll-mode drivers in user space to achieve optimal performance. The polling frequency of the interfaces/rings is controlled by two parameters: min-idle-sleep-time and max-idle-sleep-time. The polling logic in Versa FlexVNF is to poll for packets/events in a tight loop and, if no activity is seen for a continuous min-idle-sleep-time, to sleep for max-idle-sleep-time. Each run-mode controls the value of these two variables.

 


These are the different run-modes:

Performance (default)
  The poller/worker sleeps for a short duration once idle.
  Values:
    • min-idle-sleep-time = 5 microseconds
    • max-idle-sleep-time = 20 microseconds
  Versa encourages users to stay in Performance mode unless the system needs
  tuning for a specific latency, throughput, or power requirement.

Density
  The poller/worker sleeps for a long duration once idle. This is a
  power-saving mode, but latency may be high.
  Values:
    • min-idle-sleep-time = 10 microseconds
    • max-idle-sleep-time = 10 milliseconds

Hyper
  The poller/worker polls for packets in a continuous loop. CPU utilization
  is 100% even when there is no traffic. This mode can give the best latency.
  Values:
    • min-idle-sleep-time = 0xfffffff0 microseconds
    • max-idle-sleep-time = 1 microsecond

Manual
  The user controls the value of these variables:
    • min-idle-sleep-time = user defined
    • max-idle-sleep-time = user defined


 


Step-2.2: High CPU utilization on Worker-0 and Core-0

 

Worker-0 and the control core are separate CPUs. In the “top -H” output, the first CPU (cpu0) is assigned to the control core and cpu1 is worker-0. In the htop output, CPU numbering starts at 1, so CPU-1 in that view belongs to the control core and CPU-2 is worker-0. The control core handles all host-bound traffic, whereas Worker-0 also handles transit data traffic.

 

 

 

 

 

There are cases when the control core or worker-0 CPU runs at very high utilization (95%+), causing service degradation or performance issues. In such conditions you may see SD-WAN control-plane protocol flaps, because SD-WAN control packets are first processed on the Worker-0 CPU and then on control core 0. However, high worker/control-core CPU can also be an after-effect of control-plane protocol flaps, so one should check the system and logs to identify the actual trigger.

 

 

You can run the following commands to see the traffic statistics and session load on each worker thread. From the output below you can verify whether sessions are being load balanced properly.

 

 

vsm-vcsn0> show vsf per-thread nfp stats summary

Thr-Id      Sess-Active     Sess-Created     Sess-Closed
--------    -------------   --------------   -------------
    0               3        244355         244352
    1               3        213876         213873
    2               4        233494         233490
    3                5        212999         212994
    4               6        259292         259286
    5               5        245905         245900

 

 

The following command shows the data classified into each queue from each worker thread's perspective. This can help determine whether too much control traffic is being processed. These counters are cumulative, so clear the counters first and take multiple iterations while debugging the issue in production networks.

 

 

vsm-vcsn0> show vsm cq stats
+-------+-------------+-------------+-------------+-------------+

| W TID | CTRL        | DATA        | EF          | AF          |

+-------+-------------+-------------+-------------+-------------+

| 0     | 181448678   | 4821848      | 5           | 0           |
| 1     | 267252      | 11041411    | 0           | 0           |
| 2     | 23458       | 3416402     | 0           | 0           |
| 3     | 348178      | 5951665     | 0           | 0           |
| 4     | 86661       | 34802973    | 0           | 0           |
| 5     | 66002       | 3727956     | 0           | 0           |

+-------+-------------+-------------+-------------+-------------+

 

You can check the thrm event statistics (for example, via “show vsm statistics thrm evstats” from the vsmd VTY, which is also listed in the data-collection section below) to see whether outstanding data is queued and not getting processed, or whether data processing is being slowed down.

                                                                                                        

---------------------------------------------------------------------------------------------------------------                                      

  S-tindex |  D-tindex | EV-posted  | EV-processed | EV-proc-fail | EV-po-invalid| EV-po-alloc  | EV-po-write  |                                     

---------------------------------------------------------------------------------------------------------------                                                                                                                    

Thread group: 1                                                                                                                                      

------------------------------------------------------------------------------------------------------------                                         

    3943(1)|     3948(0)|            0|            0|            0|            0|            0|            0|                                        

------------------------------------------------------------------------------------------------------------                                         

    3943(1)|     3949(0)|            0|            0|            0|            0|            0|            0|                                        

------------------------------------------------------------------------------------------------------------                                         

    3943(1)|     3907(2)|   2349577952|   2016057686|            0|            0|            0|            0|                                     

------------------------------------------------------------------------------------------------------------                                         

    3943(1)|     4155(3)|            0|            0|            0|            0|            0|            0|                                        

------------------------------------------------------------------------------------------------------------                                         

    3944(1)|     3948(0)|            0|            0|            0|            0|            0|            0|                                        

------------------------------------------------------------------------------------------------------------                                         

    3944(1)|     3949(0)|            0|            0|            0|            0|            0|            0|                                        

------------------------------------------------------------------------------------------------------------                                         

    3944(1)|     3907(2)|     76294088|     35576529|            0|            0|            0|            0|                                  

------------------------------------------------------------------------------------------------------------                                         

    3944(1)|     4155(3)|            0|            0|            0|            0|            0|            0|                                        

------------------------------------------------------------------------------------------------------------                                         

    3945(1)|     3948(0)|            0|            0|            0|            0|            0|            0|                                        

------------------------------------------------------------------------------------------------------------                                         

    3945(1)|     3949(0)|            0|            0|            0|            0|            0|            0|                                        

------------------------------------------------------------------------------------------------------------                                         

    3945(1)|     3907(2)|     74415257|     35138969|            0|            0|            0|            0|                                            

------------------------------------------------------------------------------------------------------------                                         

    3945(1)|     4155(3)|            0|            0|            0|            0|            0|            0|                                        

------------------------------------------------------------------------------------------------------------                                         

    3946(1)|     3948(0)|            0|            0|            0|            0|            0|            0|                                        

------------------------------------------------------------------------------------------------------------                                         

    3946(1)|     3949(0)|            0|            0|            0|            0|            0|            0|                                        

------------------------------------------------------------------------------------------------------------                                         

    3946(1)|     3907(2)|     77491717|     34925201|            0|            0|            0|            0|                              

------------------------------------------------------------------------------------------------------------                                         

    3946(1)|     4155(3)|            0|            0|            0|            0|            0|            0|                                        

------------------------------------------------------------------------------------------------------------                                         

    3947(1)|     3948(0)|            0|            0|            0|            0|            0|            0|                                        




You can also execute the following command to check the Rx/Tx counters from the pollers' and workers' perspective. It also shows the classified counters for each queue, which gives an understanding of how much control and data traffic is being processed by each worker thread.


vsm-vcsn0> show vsm statistics thrm detail

 

Number of RX Poller Threads [2]

RX Poller PID [27306], CPU-ID [3]

RX Poller PID [27308], CPU-ID [4]

    Network Packets RX           : 232086663

    Network Packets TX           : 10094853

    Worker Thrd - Packets RX     : 10094858

    Worker Thrd - Packets TX     : 232086663

    Drop Packets RX              : 0

    Drop Packets TX - Link Down  : 5

    Drop Packets TX - Driver     : 0

 Worker thread queue stats [Queued/Dropped] :

  |-----------------------------------------------------------------------------------------------------------------

  | W|       CONTROL       |          DATA       |         EF          |         AF          |      Total

  |-----------------------------------------------------------------------------------------------------------------

  |  |  Queued  |  Dropped |  Queued  |  Dropped |  Queued  |  Dropped |  Queued  |  Dropped |  Queued  |  Dropped |

  |-----------------------------------------------------------------------------------------------------------------

  | 0| 193100893|         0|  38985752|         0|        18|         0|         0|         0| 232086663|         0

  |-----------------------------------------------------------------------------------------------------------------

 Port tx stats :

  |-----------------------------------------------------------|

  | P|    Name    |  Queued  |Link Down Dropped|Driver Dropped|

  |-----------------------------------------------------------|

  | 0|    vni-0/0 |   5846280|                2|             0|

  | 1|    vni-0/2 |    137499|                1|             0|

  | 2|    spt-0/0 |         1|                1|             0|

  | 3|    vni-0/1 |   2213160|                0|             0|

  | 4|    vni-0/3 |   1897913|                1|             0|

  |-----------------------------------------------------------|


Number of Worker Threads [1]

Worker PID [27301], CPU-ID [2]

    Network Packets RX          : 232086663

    Network Packets TX          : 9849798

    Multiseg Coalescing TX      : 0

    Multiseg Coalescing fail TX : 0

    Network Packets TX Fail     : 0



Number of Control Threads [1]

Control PID [26896], CPU-ID [0]

    Network Packets RX        : 0

    Network Packets TX        : 0

    Network Packets TX Fail   : 0





Perf record:

You can run a performance report (perf report) against the process/thread ID of the core running high to see where the CPU time is being spent. You can also dump a live core for the thread running high and engage engineering assistance to root-cause it further.


Execute the following commands from the device shell to capture the perf report and live core:

 


 

    sudo perf record -g -t <thread-id of the ipsec-control-thread/vsmd/dhcp/confd/worker-thread>

 

 

    Stop the "perf record" by doing "Ctrl + C".

    Collect the perf.data

 

 

You can read the perf report using the command “sudo perf report”. This can give a clue as to where most of the CPU time is being spent.
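For reference, a typical end-to-end perf workflow looks like the sketch below. The thread ID 1234 is only a placeholder taken from the top -H output; adjust the sampling duration to the window in which the CPU is high.

# Record call-graph samples from the hot thread for about 30 seconds
sudo perf record -g -t 1234 -- sleep 30

# Summarize the hottest code paths from the generated perf.data
sudo perf report --stdio | head -40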





There can be instances where the control core runs high, the control thread gets stuck, or a high amount of ITC traffic is flooded to the control thread, eventually building up its queue.


The queue build-up can be due to either:


  • A fast producer - worker threads sending ITC floods to the control thread faster than it can handle them.
    • Collect the output of "show vsm thrm ev-time-log 0".


  • A slow consumer - the control thread stuck in hot spots, for example many SLA/link flaps or PBF path rebuilds across branches causing additional control-plane activity.
    • In this case, it is useful to collect the control-thread back-trace to get more insight into what the CPU is actually processing at that time. You can collect it using the command below.



sudo kill -SIGUSR1 `pgrep versa-vsmd`


This dumps the back-trace to the /var/log/versa/versa-service.log file.
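A quick way to view the most recent dump right after sending the signal (log path as noted above):

sudo tail -n 100 /var/log/versa/versa-service.log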


Example:


Thread versa-vsmd[3761] received Signal User defined signal 1[10] lcore_id 0, Dumping callstack...

/opt/versa/bin/versa-vsmd() [0x79947b]

/lib/x86_64-linux-gnu/libpthread.so.0(+0x10330) [0x7fae6f047330]

/lib/x86_64-linux-gnu/libc.so.6(epoll_wait+0x33) [0x7fae6b76e6d3]

/opt/versa/bin/versa-vsmd() [0x85018c]

/opt/versa/bin/versa-vsmd(event_base_loop+0x150) [0x846260]

/opt/versa/bin/versa-vsmd(vsm_main+0x9ad) [0x79f29d]

/opt/versa/bin/versa-vsmd(main+0x20) [0x79f80d]

/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fae6b691f45]

/opt/versa/bin/versa-vsmd() [0x7418fb]




Live core: 

You can collect a live core using the command below. Take the PID of the process from the top -H output and dump the live core.

 


sudo gcore 3606      <<<<<<<<<<<<<<<Get the PID from top -H output>>>>>>>>>>>>>

 

 

Note: Running gcore on a production network can sometimes cause a brief service interruption; the impact varies case by case. Be careful when executing it during production hours. Where possible, run it in a maintenance window or during non-production hours, with out-of-band (OOB) management access available.
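By default gcore writes the dump to core.<PID> in the current directory. The file can be large, so you may want to compress it before transferring it off the device (PID 3606 is just the example used above):

sudo gzip core.3606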

 


Step 2.3: Lcore stuck

In some cases an lcore stuck condition can be observed, which shows up as high worker CPU utilization or a stuck thread. If you see a worker CPU stuck at very high utilization, check whether any lcore stuck messages are present. From release 21.2 onwards, refer to the /var/log/versa/versa-service.log file, where you can see the back-trace dumped for the stuck lcore.

 

 

SIGUSR1 signal sent to lcore 12

Thread worker-0[5853] received Signal User defined signal 1[10] lcore_id 12, Dumping callstack...

 

lcore 13 is stuck, last seen 2013ms ago.

lcore 14 is stuck, last seen 2013ms ago.

lcore 15 is stuck, last seen 2013ms ago.

lcore 16 is stuck, last seen 2013ms ago.

lcore 17 is stuck, last seen 2013ms ago.

lcore 18 is stuck, last seen 2013ms ago.

lcore 19 is stuck, last seen 2013ms ago.

lcore 20 is stuck, last seen 2013ms ago.

lcore 21 is stuck, last seen 2014ms ago.

lcore 22 is stuck, last seen 2014ms ago.

lcore 23 is stuck, last seen 2014ms ago.

 

 

 

Once an lcore stuck condition is detected, the call-stack trace is also dumped in the same file. You can additionally collect the stack-trace files from /var/log/versa/debug/lcore-stuck and share them with engineering for further analysis.
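A simple way to bundle those trace files for a support case (the path is per this article; the /var/tmp output location is only an example):

sudo tar czf /var/tmp/lcore-stuck-$(date +%F).tgz /var/log/versa/debug/lcore-stuck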

 

If the system is running an older build, you can enable the call-graph and collect it for further analysis:


test vsm cpu lcore-keepalive disable

test vsm cpu lcore-keepalive enable 2

test vsm cpu lcore-keepalive callgraph enable



Also check the memory utilization of the system and the overall session/traffic load on the box. If the system is running out of memory, lcore stuck conditions can appear as an after-effect because the system becomes slow.

 

 

 

Step 2.4: CPU isolation for control-heavy deployments

In deployments where more control-plane activity is expected, such as Controllers, hub-as-controller, or hub devices, it is recommended to isolate the CPU and, if required, add more CPUs to the control core. You can execute the following commands to enable CPU isolation.

 

 

request system isolate-cpu enable 

request system isolate-cpu enable num-control-cpu 2

 

Additionally, you can refer to the article below for more information:

https://support.versa-networks.com/a/solutions/articles/23000024219


 

Step-2.5: Leftover tcpdump/tap interfaces

There can be cases where a specific worker/poller thread's CPU goes high and causes performance issues. We have seen cases where tcpdump was enabled on a specific interface and was not terminated properly, leaving the tap interface up. If this happens on an interface carrying a high amount of traffic, such as on hub devices, it can lead to high CPU on the poller thread mapped to that interface.

 

 

Ensure that tcpdump is terminated properly and that the tap interface is removed. You can verify the tap interface from the appliance shell using the “ifconfig” command; enabling tcpdump on any interface creates a tap interface, usually seen with a _vni_ prefix. A quick check for leftover captures is shown after the sample output below.

 

 

 

[admin@cpe2: ~] $ ifconfig

_vni_0_0  Link encap:Ethernet  HWaddr e2:05:2d:cd:83:c6

          inet6 addr: fe80::e005:2dff:fecd:83c6/64 Scope:Link

          UP BROADCAST RUNNING MULTICAST  MTU:1500   Metric:1

          RX packets:156 errors:0 dropped:0 overruns:0 frame:0

        TX packets:8 errors:0 dropped:0 overruns:0 carrier:0

          collisions:0 txqueuelen:1000

          RX bytes:22526 (22.5 KB)  TX bytes:648 (648.0 B)
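A quick, generic Linux check for leftover captures, assuming the _vni_ tap naming shown above:

# Any tcpdump still running?
ps aux | grep -i '[t]cpdump'

# Any leftover tap interfaces created for packet capture?
ip -br link | grep '_vni_'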

 

 

Step 2.6: Function profiler

Enable the func profiler and collect its log output, as shown below.


vsm-vcsn0> show vsm func profiler callpts


+------------------------------------------+--------------+----------+

|             CALLPT NAME                  |  Callpt IDx  | Enabled  |

+------------------------------------------+--------------+----------+


| VS_CALLPT_NFP_ENTER                       |            1 |        0 |   <<<0 disable/1 enable>>>

| VS_CALLPT_NFP_SESS_INTR                   |            2 |        0 |

| VS_CALLPT_NFP_SESS_CREATE                 |            3 |        0 |

| VS_CALLPT_NFP_SESS_DELETE                 |            4 |        0 |

| VS_CALLPT_NFP_SESS_REEVAL                 |            5 |        0 |

| VS_CALLPT_NFP_FLOW_EVENT                  |            6 |        0 |




vsm-vcsn0> set vsm func profiler 0 1     <Setting the func profiler manually>



Value 0 selects all callpts and 1 stands for enable. Once you enable the func profiler logs, collect the output below 3-4 times.



vsm-vcsn0> show vsm func profiler log


 +-------+-------------------------------------------+------------+------------+------------+----------------+

 | W TID |             CALLPT NAME                   |Max Time(us)|Avg Time(us)|    Count   |     Cookie     |

 +-------+-------------------------------------------+------------+------------+------------+----------------+


 |     0 | VS_CALLPT_NFP_SESS_INTR                   |      2236  |     2178  |           2|              12 |

 |     0 | VS_CALLPT_NFP_FLOW_EVENT                  |       579  |      246  |           5|               9 |

 |     0 | VS_CALLPT_NFP_INPUT_CHAIN                 |      1377  |      232  |          24|               9 |

 |     0 | VS_CALLPT_NFP_SESS_PURGE_LIST             |      1491  |      223  |          24|               0 |

 |     0 | VS_CALLPT_THRM_TMR_TICK                   |      5457  |      449  |         427|               0 |

 |     0 | VS_CALLPT_THRM_VUNET_COUT                 |      8578  |      433  |         777|               0 |

 |     0 | VS_CALLPT_THRM_VJOB_SCHED                 |       639  |      224  |           5|               0 |

 |     0 | VS_CALLPT_THRM_EV_TMR                     |      3773  |      446  |         195|               0 |

 |     0 | VS_CALLPT_THRM_IORX_WQS                   |     53585  |      845  |        1210|               0 |

 |     0 | VS_CALLPT_THRM_APP_EXT                    |     30649  |      461  |       15217|               0 |

 |     0 | VS_CALLPT_THRM_CT_NWQS                    |      2233  |      450  |          27|               0 |

 |     0 | VS_CALLPT_THRM_DISP_APP                   |      1492  |      219  |          25|               0 |

 

 

Once done, disable the func profiler logging.

 

vsm-vcsn0> set vsm func profiler 0 0    <turn off logging for all>




Case 3: Versa-VSMD process showing high CPU utilization


There can be cases where you notice the versa-vsmd process showing high CPU utilization, as in the top output below.



top - 05:39:13 up 46 days, 14:33, 5 users, load average: 1.11, 1.11, 1.14

Tasks: 310 total, 2 running, 308 sleeping, 0 stopped, 0 zombie

%Cpu(s): 6.2 us, 0.4 sy, 0.0 ni, 93.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st

KiB Mem: 65622768 total, 29087104 used, 36535664 free, 212536 buffers

KiB Swap: 0 total, 0 used, 0 free. 4215504 cached Mem


PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

4083 root 20 0 24.808g 0.012t 211044 R 154.8 20.0 76237:21 versa-vsmd

2357 versa 20 0 1073528 296340 8340 S 1.3 0.5 1716:28 versa-dnsd

93312 versa 20 0 57468 21580 6020 S 1.0 0.0 0:06.21 versa-certd

3041 versa 20 0 40488 3872 3416 S 0.7 0.0 572:37.98 versa-devmgr




In such cases you should also check the control-core utilization. You may notice that the control thread is busy processing messages, and as a result versa-vsmd shows high utilization. There is also a control thread running inside vsmd, which processes control traffic for protocols such as IPsec.


In such cases, check the IKE/IPsec processing on the box and also verify the configuration: the IKE/IPsec tunnel scale, the configured policies, and associated objects such as IP- and FQDN-based match criteria.


You can also capture the performance report data (perf), which helps determine which process is consuming most of the CPU cycles.



You can also perform a few additional checks:


1> Verify whether any script is running on the box. You can verify this from the shell of the box:

sudo ps -aux | grep -i .py    

sudo ps -aux | grep -i .sh 


2> Verify whether any cron job is running:

sudo crontab -e 
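Note that crontab -e opens an editor; if you only want to inspect the entries on a production device, a read-only check such as the following is safer:

sudo crontab -l
sudo ls -ltrha /etc/cron.d/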


3> Verify the session load on the box and the rate at which sessions are being created. The overall session count may be well below the total session capacity of the box, yet the sessions-per-second rate may be higher than the box can process without straining CPU resources. Analytics graphs can help you understand this better.


4> Verify the traffic for protocols such as SNMP, DHCP, and DNS that is destined to the box.


5> If you see a high session load on spoke devices, verify how NAT is being done in case the box is configured for DIA. If SNAT is done on the WAN side, the box ends up creating two sessions for one flow. This is usually not obvious and may not directly cause CPU issues at first, but the box can reach its session limit.


Note: At times we execute the htop command to check the CPU usage of the box, as it shows a graphical view of each CPU core's usage separately. However, this can be misleading for per-process usage such as versa-vsmd: htop shows the aggregate CPU usage of the vsmd process, which will always appear high because vsmd is a multi-threaded process. This is normal and should not by itself be a cause for concern.


Always take the top -H output (press 1) and look at the utilization of each core along with the per-thread utilization in order to make an informed decision.
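A non-interactive way to capture that per-thread breakdown for versa-vsmd is sketched below (pgrep -o picks the oldest matching PID):

top -H -b -n 1 -p "$(pgrep -o versa-vsmd)" | head -30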

 



Case 4: CPU spikes at random time intervals


There are cases where high CPU is seen at random time intervals, or sometimes in a specific time slot. Verify the following to isolate such cases:

 

  1. Note down the timestamps of the high CPU and check whether any traffic pattern emerges.
  2. There could be unwanted traffic coming to the box due to an attack such as DDoS, ARP flood, or TCP SYN flood. You can use Analytics to check the number of traffic sessions, NAT sessions, overall CPU usage, and access-circuit usage to understand whether there was an actual traffic surge or an increased amount of control traffic arriving on the box.
  3. You can also check the DDoS counters, ARP counters, and TCP counters from the box and the vsmd VTY to make the correlation.
  4. Check whether any scripts are running (sudo ls -ltrha /etc/cron.d/).

 

 

 

admin@cpe2-cli> show statistics internal arp

statistics internal arp arp_global 0

addr_match       547

bad_addr_len     0

bad_proto_len    0

bcast_addr_match 0

drop_mcast       0

drop_noproxy    0

dropped          8893

dupips           0

last_addr_match  68939275

no_addr_match    0

received         69274295

rlexceeded       334473                  <<<<<<<<<<<these are rate limit counters for ARP packets.

rxlooped         0

rxreplies        425

rxrequests       68939397

timeouts         4045

txreplies        359

txrequests       7900

 

 

If you see the rlexceeded counter incrementing, a large number of ARP requests are coming in and being dropped after the rate limit is hit. The default ARP rate limiter is set to 300 ARP packets/second; check that it has not been increased to a much larger value, as that would cause the system to send more ARP packets to the CPU and increase utilization.

 

 

 

4. Check whether there are too many SLA/BGP/BFD/IPsec protocol flaps causing high CPU; keep in mind that such flaps can also be an after-effect of high CPU. Correlate CPU-usage graphs and alarm-threshold triggers with the corresponding protocol flaps to understand the actual trigger of the high CPU, make an informed decision, and set the action plan accordingly.

 

 

show orgs org <Org-name> sd-wan sla-monitor status

show orgs org <Org-name> sd-wan sla-monitor metrics last-1m

show orgs org <Org-name> sd-wan sla-monitor metrics last-10s

show bgp neighbor brief

show bfd session brief

 

 

5. Please ensure there are no debugs enabled on the device.

 

show configuration | display set | match debug | details

show configuration | display set | match trace | details

request debug session filter-display 

 

show filter trace table     <<<from vsmd vty>>>>

 

 

 

The following are a few additional points to verify while debugging high CPU issues.

 

  1. Hyperthreading should not be enabled.
  2. Ensure CPU pinning is done and CPU resources are not being shared.
  3. Verify whether SSL inspection is enabled and matching all of the LAN traffic; this can lead to high CPU.
  4. Verify whether a large number of SNMP managers are polling the device extensively and causing high CPU.
  5. Review the device configuration in detail and check whether any scaling limit is being hit for any module/service on the box.

 

 

Case 5: Data Collection


The following outputs can be collected to debug high CPU utilization:

 

Note: For VOS 21.x releases, refer to /var/log/versa/debug/high-cpu/tshoot.log. For older VOS releases, collect /var/log/versa/tshoot.log.
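To bundle these logs for a support case, something like the following can be used (21.x path from the note above; the /var/tmp output location is only an example):

sudo tar czf /var/tmp/high-cpu-tshoot-$(date +%F).tgz /var/log/versa/debug/high-cpu/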

 

 

From the VOS CLI:

 

show interfaces port statistics brief

show interfaces port statistics detail

show system details

show device clients 

show system load status

show orgs org Tenant-1 sessions summary            <Change the Tenant-name>

show configuration system service-options | details

show system load-stats

show configuration | display set | match debug | details

show configuration | display set | match trace | details

request debug session filter-display

request system isolate-cpu status


 

 

From shell: 

 

top -H (Shift +M)

free -h

lscpu

vsh details

cat /proc/cmdline

cat /proc/net/dev

cat /proc/cpuinfo

lscpu | grep Thread

sudo ps -aux | grep -i .py    

sudo ps -aux | grep -i .sh

sudo crontab -l

 



Perf record:

 

    sudo perf record -g -t <thread-id of the ipsec-control-thread/vsmd/dhcp/confd/worker-thread>

 

 

    Stop the "perf record" by doing "Ctrl + C".

    Collect the perf.data

  

 


Live core: 

 

sudo gcore 3606      <<<<<<<<<<<<<<<Get the PID from top -H output>>>>>>>>>>>>>

 

 

Use the “vsh connect vsmd” shell command to enter the vsmd VTY prompt.

 

 

show vsm cpu info

show vsm cpu health

show vsm cpu port-to-poller-map

show vsf nfp module stats brief

show vsm statistics dropped

show vsm statistics datapath

show vsf session stats brief

show vsm load now

show vsm load details

show vsm anchor core map

show vsf nfp tcp stats

show vsm cq stats

show vsf per-thread nfp stats summary

show vsf per-thread nfp stats detail 0 -1

show vsf per-thread nfp stats detail 1 -1

show vsm statistics thrm evstats

show vsm statistics thrm detail

show vsm statistics thrm port

show vsm settings datapath

show vsf tunnel stats

show vsm statistics mbuf

show filter trace table

 


 

From Analytics:


 

Collect the Memory-Usage/CPU-Usage/Access-Circuits-usage/Top-Consumer graphs from analytics for correlation.