Troubleshooting Crash
=====================
A crash occurs when a process hits an unexpected CPU exception or fatal signal, such as the SIGABRT in the example below. To troubleshoot it, we need to identify the function that triggered the fault and the chain of calls that led to it. When a process crashes, the system saves a copy of that process's memory as it was at the moment of failure. This copy is the core dump.
Follow the steps below:
-----------------------
(1) Every crash should generate a core dump file. Ask the customer to collect the core dump and upload it using the link provided by Versa.
If the customer does not have an upload URL, create one or ask a colleague for help.
Core files are stored in "/var/tmp/versa-cores".
admin@CPE-1-cli> show coredumps
total 1.2G
-rw-rw-r-x 1 root root 635K Sep 25 2017 core.versa-vsmd.2579.versa-flexvnf.1506405622.gz
-rw-rw-rw- 1 root root 46M Aug 13 03:05 core.versa-certd.2400.CPE-1.1534154716.gz
-rw-rw-rw- 1 root root 3.4M Aug 16 00:17 core.versa-vmod.2391.CPE-1.1534403835.gz
-rw-rw-rw- 1 root root 587M Aug 16 00:22 core.versa-vsmd.2249.CPE-1.1534403863.gz
-rw-rw-rw- 1 root root 3.4M Aug 16 00:29 core.versa-vmod.2092.CPE-1.1534404581.gz
-rw-rw-rw- 1 root root 530M Aug 16 00:31 core.versa-vsmd.1895.CPE-1.1534404590.gz
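When several cores are present, it helps to know which process crashed and when. The following is a minimal Python sketch that splits a core file name into its parts. The layout "core.<process>.<pid>.<hostname>.<epoch-seconds>.gz" is an assumption inferred from the listing above, not a documented format, and parse_core_name is a hypothetical helper name.

```python
import re
from datetime import datetime, timezone

# Assumed layout, inferred from the listing above (not a documented format):
#   core.<process>.<pid>.<hostname>.<epoch-seconds>[.gz]
CORE_NAME = re.compile(
    r"^core\.(?P<process>[^.]+)\.(?P<pid>\d+)\.(?P<host>.+)\.(?P<epoch>\d+)(?:\.gz)?$"
)


def parse_core_name(name):
    """Split a core file name into its parts; return None if it does not match."""
    m = CORE_NAME.match(name)
    if m is None:
        return None
    epoch = int(m.group("epoch"))
    return {
        "process": m.group("process"),
        "pid": int(m.group("pid")),
        "host": m.group("host"),
        # Interpreting the last field as Unix time is part of the assumption.
        "crash_time_utc": datetime.fromtimestamp(epoch, tz=timezone.utc),
    }


info = parse_core_name("core.versa-vsmd.1895.CPE-1.1534404590.gz")
print(info["process"], info["pid"], info["host"], info["crash_time_utc"].isoformat())
# prints: versa-vsmd 1895 CPE-1 2018-08-16T07:29:50+00:00
```

The time is reported in UTC, while file listings and device logs normally use the device's local time zone, so allow for the offset when comparing them.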
(2) Take a backtrace of the core file. It shows the chain of function calls that led to the crash. Look for the
frame immediately after the "__assert_fail" line. In the example below, that frame is
"lef_process_collector_grp_update", the function whose assertion failed. The remaining frames are its callers.
Once you know which functions are involved in the crash, search Bugzilla or Freshdesk for a bug that matches
them. If you don't find a matching case or bug, create a PR.
admin@CPE-1-cli> show backtrace corefile core.versa-vsmd.1895.CPE-1.1534404590.gz
[New LWP 2943]
[New LWP 2941]
[New LWP 2946]
[New LWP 3128]
[New LWP 3142]
[New LWP 2939]
[New LWP 3145]
[New LWP 1895]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/opt/versa/bin/versa-vsmd -N -H 2 '.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f055225fc37 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0 0x00007f055225fc37 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f0552263028 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f0552258bf6 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007f0552258ca2 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00000000008e4d93 in lef_process_collector_grp_update (tenant=0x7f04d350af00, coll_grp_cfg=coll_grp_cfg@entry=0x7f04d35136c0) at ../usr/module/lef/lef_cfg_process.c:1582
#5 0x00000000008e17cd in lef_cfg_clnt_process_collector_grp_cfg (msg_len=12, msg=0x7f0495be174e "ROUP\020\001\030\001", msg_params=0x7f0495be1740) at ../usr/module/lef/lef_cfg_clnt.c:142
#6 lef_cfg_clnt_process_cfg (msg_params=msg_params@entry=0x7f0495be1740, msg=msg@entry=0x7f0495be174a "\n\006CGROUP\020\001\030\001", msg_len=<optimized out>) at ../usr/module/lef/lef_cfg_clnt.c:216
#7 0x00000000008f2ee7 in lef_itc_cfg_process (cfg_itc_msg_len=<optimized out>, cfg_itc_msg=<optimized out>) at ../usr/module/lef/lef_itc.c:40
#8 lef_itc_msg_process (data=0x7f0495be173c, len=<optimized out>) at ../usr/module/lef/lef_itc.c:497
#9 0x000000000146f7cf in vs_thrm_itc_process_evmsg (evmsg=evmsg@entry=0x7f04d401e6e0, ret_msg_type=ret_msg_type@entry=0x7f054f33e75c, tstamp_diff=tstamp_diff@entry=0x7f054f33e760, ret_opq_data=ret_opq_data@entry=0x7f054f33e780) at ../usr/lib/libvsthrm/vs_thrm_itc.c:79
#10 0x00000000006b667a in vsm_process_thrm_work (works=0x7f054f33e8e0, nworks=<optimized out>, work_type=<optimized out>) at ../usr/sbin/vsm/vsm_thrm.c:2928
#11 0x000000000147201b in process_ev_n_tmr_workqs (tmr_flag=<optimized out>, ev_flag=<optimized out>, tinfo=0x7f055084e200, ctx=<optimized out>) at ../usr/lib/libvsthrm/vs_worker_threads.c:265
#12 vs_worker_thread_routine (arg=0x7f055084e200) at ../usr/lib/libvsthrm/vs_worker_threads.c:604
#13 0x000000000146a873 in vs_generic_thread_routine (arg=0x7f055084e200) at ../usr/lib/libvsthrm/vs_thrm.c:198
#14 0x000000000146d0f3 in vs_thrm_start_routine (handle=<optimized out>, tgid=<optimized out>, tgid@entry=VS_THREAD_GROUP_MAX, shared_cpu=shared_cpu@entry=false) at ../usr/lib/libvsthrm/vs_thrm.c:1511
#15 0x00000000006bdfd7 in vsm_thrm_start (arg=<optimized out>) at ../usr/sbin/vsm/vsm_thrm.c:4504
#16 0x000000000167fe85 in eal_thread_loop (arg=<optimized out>) at ../usr/lib/DPDK/lib/librte_eal/linuxapp/eal/eal_thread.c:184
#17 0x00007f0555903184 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#18 0x00007f0552326ffd in clone () from /lib/x86_64-linux-gnu/libc.so.6
[ok][2018-08-21 02:36:25]
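Picking out the frame after "__assert_fail" can be done mechanically when a backtrace has been saved to a text file. The sketch below is one possible approach in Python. It assumes gdb-style frame lines like the ones above, and parse_frames and frame_after_assert are hypothetical helper names.

```python
import re

# Matches gdb frame lines with or without a leading address, for example:
#   "#3 0x00007f0552258ca2 in __assert_fail () from ..."
#   "#6 lef_cfg_clnt_process_cfg (msg_params=...) at file.c:216"
FRAME = re.compile(r"^#(?P<num>\d+)\s+(?:0x[0-9a-fA-F]+ in )?(?P<func>\S+) \(")
LOCATION = re.compile(r" at (?P<file>\S+):(?P<line>\d+)$")


def parse_frames(backtrace):
    """Return one dict per stack frame found in the backtrace text."""
    frames = []
    for raw in backtrace.splitlines():
        line = raw.strip()
        m = FRAME.match(line)
        if m is None:
            continue  # thread banners, "[New LWP ...]" lines, and so on
        loc = LOCATION.search(line)
        frames.append({
            "num": int(m.group("num")),
            "func": m.group("func"),
            "file": loc.group("file") if loc else None,
            "line": int(loc.group("line")) if loc else None,
        })
    return frames


def frame_after_assert(frames):
    """Return the frame that called __assert_fail, or None if there is none."""
    for i, frame in enumerate(frames[:-1]):
        if frame["func"] == "__assert_fail":
            return frames[i + 1]
    return None


SAMPLE = """\
#2 0x00007f0552258bf6 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007f0552258ca2 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00000000008e4d93 in lef_process_collector_grp_update (tenant=0x7f04d350af00, coll_grp_cfg=coll_grp_cfg@entry=0x7f04d35136c0) at ../usr/module/lef/lef_cfg_process.c:1582
"""

caller = frame_after_assert(parse_frames(SAMPLE))
print(caller["func"], caller["file"], caller["line"])
# prints: lef_process_collector_grp_update ../usr/module/lef/lef_cfg_process.c 1582
```

The function name and source location returned here are the terms to search for in the bug tracker.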
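(3) Alternatively, decompress the core file and run gdb on it together with the binary that crashed. At the gdb prompt, run "bt" to print the backtrace. The output is similar to the backtrace in step (2).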
admin@Branch1:.../tmp/versa-cores$ sudo gunzip core.versa-vsmd.25919.Branch1.1477388878.gz
admin@Branch1:.../tmp/versa-cores$ sudo gdb /opt/versa/bin/versa-vsmd -c core.versa-vsmd.25919.Branch1.1477388878
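If the backtrace needs to be captured to a file rather than read interactively, gdb can be run in batch mode. The following Python sketch builds such a command. It uses gdb's standard "--batch" and "-ex" options, while the helper names gdb_backtrace_cmd and run_gdb_backtrace are hypothetical. It assumes gdb is installed and that the core file has already been decompressed.

```python
import subprocess


def gdb_backtrace_cmd(binary, core, all_threads=False):
    """Build a gdb command line that prints a backtrace and then exits."""
    bt = "thread apply all bt" if all_threads else "bt"
    # --batch makes gdb exit after running the -ex command.
    return ["gdb", "--batch", "-ex", bt, binary, "-c", core]


def run_gdb_backtrace(binary, core, all_threads=False):
    """Run gdb and return its standard output as text."""
    result = subprocess.run(
        gdb_backtrace_cmd(binary, core, all_threads),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


# Only print the command here; running it needs gdb and a real core file.
print(" ".join(gdb_backtrace_cmd(
    "/opt/versa/bin/versa-vsmd",
    "core.versa-vsmd.25919.Branch1.1477388878",
)))
```

Passing all_threads=True prints the stack of every thread, which can be useful when the crashing thread alone does not explain the failure.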
(4) Get the release info.
admin@CPE-1-cli> show system package-info
Package Versa FlexVNF software
Release 16.1-R2
Build S3
Release date 20180808
Package id 6e92440
Package name versa-flexvnf-20180808-212106-6e92440-16.1R2S3
Branch 16.1R2
Creator
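(5) Check the "versa-service.log" file for anything suspicious around the time of the crash. In the example below, the failed assertion is logged with the same source file, line number and function as frame #4 of the backtrace.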
cat /var/log/versa/versa-service.log
134331 sdb memory limit is 2147483648 bytes
134332 sdb timeout is 1 days
134333 hscan_exceed_memory is FALSE
134334 ips memory limit is 1073741824 bytes
134335 2018-08-16 00:29:50.707 NOTIC [0x201] [VSN:0] rfm_cfg_tenant_add: Tenant 3 config add
134336 2018-08-16 00:29:50.707 ERROR [0x201] vfp_ev_handler:636 Unknown event recieved
134337 versa-vsmd: ../usr/module/lef/lef_cfg_process.c:1582: lef_process_collector_grp_update: Assertion `!collector->lef_coll_grp_ptr' failed.
134338 versa-vsmd: ../usr/module/lef/lef_cfg_process.c:1582: lef_process_collector_grp_update: Assertion `!collector->lef_coll_grp_ptr' failed.
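Assertion failures follow the glibc message format "program: file:line: function: Assertion `expr' failed.", so they can be pulled out of a large log automatically. This is a minimal Python sketch of that idea. find_assertions is a hypothetical helper name, and the sample lines are copied from the output above.

```python
import re
from collections import Counter

# glibc assert() message: "<prog>: <file>:<line>: <func>: Assertion `<expr>' failed."
ASSERT_LINE = re.compile(
    r"(?P<prog>\S+): (?P<file>\S+):(?P<line>\d+): (?P<func>\S+): "
    r"Assertion `(?P<expr>.+)' failed\."
)


def find_assertions(lines):
    """Count each distinct (program, file, line, function, expression) failure."""
    hits = Counter()
    for text in lines:
        m = ASSERT_LINE.search(text)
        if m:
            hits[(m.group("prog"), m.group("file"), int(m.group("line")),
                  m.group("func"), m.group("expr"))] += 1
    return hits


SAMPLE = [
    "134336 2018-08-16 00:29:50.707 ERROR [0x201] vfp_ev_handler:636 Unknown event recieved",
    "134337 versa-vsmd: ../usr/module/lef/lef_cfg_process.c:1582: lef_process_collector_grp_update: Assertion `!collector->lef_coll_grp_ptr' failed.",
    "134338 versa-vsmd: ../usr/module/lef/lef_cfg_process.c:1582: lef_process_collector_grp_update: Assertion `!collector->lef_coll_grp_ptr' failed.",
]

for key, count in find_assertions(SAMPLE).items():
    print(count, key)
```

To scan a real log, pass an open file object instead of SAMPLE, since iterating a file yields its lines.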
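(6) Analyse the system logs that cover the time of the incident and look for anything unusual. Rotated logs ending in ".gz" are compressed, so read them with zcat or zgrep instead of cat and grep.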
cd /var/log
[admin@CPE-1: log] # ls -ltrh | grep syslog
-rw-r----- 1 syslog adm 13K Jul 23 02:29 kern.log.4.gz
-rw-r----- 1 syslog adm 22K Aug 5 05:51 kern.log.3.gz
-rw-r--r-- 1 syslog adm 208K Aug 12 09:28 cloud-init.log
-rw-r----- 1 syslog adm 64K Aug 12 09:30 kern.log.2.gz
-rw-r----- 1 syslog adm 14K Aug 15 00:17 syslog.7.gz
-rw-r----- 1 syslog adm 195K Aug 16 00:17 syslog.6
-rw-r----- 1 syslog adm 562K Aug 17 00:17 syslog.5
-rw-r----- 1 syslog adm 23K Aug 18 00:17 syslog.4.gz
-rw-r----- 1 syslog adm 14K Aug 19 00:17 syslog.3.gz
-rw-r----- 1 syslog adm 33K Aug 19 00:51 kern.log.1
-rw-r----- 1 syslog adm 12K Aug 20 00:17 syslog.2.gz
-rw-r----- 1 syslog adm 222K Aug 21 00:17 syslog.1
-rw-r----- 1 syslog adm 416 Aug 21 04:02 kern.log
-rw-r----- 1 syslog adm 103K Aug 21 07:37 syslog
[admin@CPE-1: log] # sudo cat syslog.6 | grep 2018-08-16
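Because the rotated logs are a mix of plain and gzip-compressed files, searching them one at a time is tedious. The sketch below shows one way to search all of them for a date string in Python. grep_logs is a hypothetical helper name, and reading files under /var/log normally requires root privileges.

```python
import gzip
from pathlib import Path


def grep_logs(directory, prefix, needle):
    """Yield (file name, line) for each line containing needle in files named prefix*."""
    for path in sorted(Path(directory).glob(prefix + "*")):
        # Rotated logs ending in .gz are compressed; the rest are plain text.
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", errors="replace") as handle:
            for line in handle:
                if needle in line:
                    yield path.name, line.rstrip("\n")


if __name__ == "__main__":
    # Example: every syslog line from the day of the crash.
    for name, line in grep_logs("/var/log", "syslog", "2018-08-16"):
        print(name + ": " + line)
```

Searching every rotation avoids having to guess which file covers the crash time from the modification dates alone.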
(7) If relevant, check for configuration changes made around the time of the crash:
admin@CPE-1-cli> show commit list
2018-08-21 07:41:01
SNo. ID User Client Time Stamp Label Comment
~~~~ ~~ ~~~~ ~~~~~~ ~~~~~~~~~~ ~~~~~ ~~~~~~~
0 10228 1001 system 2018-08-21 04:02:42
1 10227 admin netconf 2018-08-17 00:31:32
2 10226 admin netconf 2018-08-17 00:22:50
3 10223 admin netconf 2018-08-13 03:13:12
4 10222 admin netconf 2018-08-13 03:09:39
5 10221 admin netconf 2018-08-13 03:05:16
6 10220 admin netconf 2018-08-13 02:55:56
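To see whether a configuration change preceded a crash, the commit list can be filtered by time. The following Python sketch parses rows in the format shown above and returns the commits that fall inside a window before a given crash time. parse_commits and commits_before are hypothetical helper names, and the crash time in the example is made up for illustration.

```python
import re
from datetime import datetime, timedelta

# One data row: SNo., ID, User, Client, then "YYYY-MM-DD HH:MM:SS".
ROW = re.compile(
    r"^\s*(?P<sno>\d+)\s+(?P<id>\d+)\s+(?P<user>\S+)\s+(?P<client>\S+)\s+"
    r"(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
)


def parse_commits(text):
    """Return (commit id, user, client, timestamp) tuples; non-row lines are skipped."""
    commits = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            stamp = datetime.strptime(m.group("stamp"), "%Y-%m-%d %H:%M:%S")
            commits.append((int(m.group("id")), m.group("user"), m.group("client"), stamp))
    return commits


def commits_before(commits, crash_time, window):
    """Return commits made within `window` before (or exactly at) crash_time."""
    return [c for c in commits if crash_time - window <= c[3] <= crash_time]


SAMPLE = """\
2018-08-21 07:41:01
SNo. ID User Client Time Stamp Label Comment
3 10223 admin netconf 2018-08-13 03:13:12
4 10222 admin netconf 2018-08-13 03:09:39
5 10221 admin netconf 2018-08-13 03:05:16
6 10220 admin netconf 2018-08-13 02:55:56
"""

# Hypothetical crash time, in the same (device local) time zone as the commit list.
crash = datetime(2018, 8, 13, 3, 10, 0)
recent = commits_before(parse_commits(SAMPLE), crash, timedelta(minutes=10))
print([c[0] for c in recent])
# prints: [10222, 10221]
```

The commit timestamps are compared as naive local times, so the crash time passed in must be expressed in the same time zone.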
(8) Collect tech-support.
request system tech-support
(9) Collect general system information.
show system package-info
show system uptime
show system status
show system details