Introduction
Before getting into the troubleshooting aspect, it's important to have a basic understanding of Analytics architecture - which is what this section aims to cover.
An Analytics node performs 4 functions
1. Log collection and storage
2. Log processing and ingestion in the Database
3. Database management (retention, deletion, syncing of records)
4. Web interface which retrieves data from the Database via API calls
Analytics nodes have 2 personalities
1. Analytic personality
2. Search personality
The personality is defined by the type of logs that are stored in the respective Database.
"Analytic personality" nodes store the "aggregate stats" which are generated every 5 mins by the branch - there are different types of aggregate stats, for ex
bwmonlog - aggregates sdwan, DIA and access-circuit usage
monstatlog - aggregates user and application stats
intfutil - aggregates wan utilizations
qoslog - aggregates qos statistics
slamlog - aggregates sla metrics
All of these logs are stored in the Database of the "Analytic personality" node, where they are subject to the retention limit and summarization (min/hourly/daily) as required. The default retention period is 3 months of "daily" logs and 1 month of "hourly" logs, so you can check hourly granularity for these aggregate stats for a period of 1 month.
"Search personality" nodes store the real-time logs that come in the form of alarmlog, accesslog (firewall rule hits), flowmonlog (traffic-monitoring rule hits), dhcplog, cgnatlog (cgnat rule hits), AV/IDP/threat logs etc
You can also set up a "Log forwarder" node which performs log-collection and processing but ingests these logs to a remote database (the analytics cluster is the remote database for the log-forwarder). So, the "log forwarder" does not have its own local Database.
For the sake of convenience, we will call a regular "Analytic/Search personality" node a "Cluster Node" to differentiate it from a "Log-forwarder node"
A cluster node's architecture is as below
- The LEF logs are received by the "log-collector" task (LCED) running on the cluster-node or log-forwarder
- The Logs are parsed, tenant/appliance-name is determined, and placed under /var/tmp/log/tenant-x/VSN0-branch-x directory
- The Analytics-driver module (versa-analytics-driver) processes these logs and ingests them into the Database
A Log-forwarder node setup is as below. As can be seen, a log-forwarder does not have a local DB; it has to use a "Cluster node" as the remote DB. Also, the "Cluster node" in this case does not have to perform the "log-collector" role
The branch sends these logs to Analytics either directly or via controller ADC load-balancing. The more common design is the latter, where the branch sends logs to the controller's ADC VIP and the controller further NATs this traffic, using the controller's egress interface as the source IP and the actual log-collector as the destination IP.
So essentially the data-path involved in transporting the logs, sent by the branch, to the Analytics cluster is as below.
In the rest of this document we will discuss the various problematic scenarios that are commonly encountered in the context of Analytics and pointers to troubleshoot the same
Unable to observe Analytic or Search logs
If, while checking the dashboard/GUI of Analytics, you are unable to see the logs for a certain branch while being able to see the logs/stats for other branches in the same tenant, please follow the below steps to troubleshoot the issue
1. Check the "date" on the branch and confirm if it's shows the current time, the timezone doesn't matter, what's important is that the branch should show the current time
If the date/time on the branch does not reflect the current time, you should either configure "NTP" sync or set the date/time manually as below (setting the time-zone is optional)
1) Set time-zone
sudo timedatectl set-timezone Africa/Cairo
2) Set time
sudo date --set "6 Apr 2021 14:14:00"
Wait for approx 20 mins and check the GUI again to confirm if the logs show up
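If you configured NTP, a quick way to verify that the clock has actually synchronized is the below (a minimal check, assuming systemd's timedatectl is available, as it is used in the commands above)
sudo timedatectl status    # look for the line indicating the system clock is synchronized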
2. Check the LEF connection from the branch to the log-collector. There can be more than one LEF connection on the branch; at least one of them should be in established state
If the status shows up as "re-connect", it indicates an issue with the LEF connection where TCP session is not established successfully
If you've enabled ADC load-balancing on the controller (which is the common design), you can check if the nat session is present in the parent org of the controller using the branch's source ip
Make a note of the "source port" that's assigned post the translation, the "nat-destination-ip" is the address of the log-collector towards which the tcp session is being sent - this is the log-collector which will receive this branch's logs
admin@Controller1-cli> show orgs org D1-DIOS sessions nat brief
Access the shell of the concerned log-collector (the nat-destination-ip) and check if the connection is received there, as follows
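A minimal check from the log-collector shell (the IP and port below are placeholders - use the controller egress IP noted from the NAT session above and the collector port configured in your setup)
sudo netstat -antp | grep <controller-egress-ip>
sudo netstat -ant | grep ":<collector-port>" | grep ESTABLISHED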
If you don't see a connection on the concerned log-collector, it would indicate a routing issue between the controller and the log-collector (check if the L2 or L3 devices in between are dropping the packets)
3. Check "netstat -rn" on the log-collector and confirm there is a return route for the controller's south-bound subnet (in the case of ADC) or the branch's subnet (in the case direct connection from branch to log-collector)
4. Check the /var/tmp/log/tenant-x/backup/VSN0-branch-x folder to confirm if logs are being received
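For example (using the same path placeholders as the step above), recent timestamps here indicate that logs are being received and processed
ls -lrt /var/tmp/log/tenant-x/backup/VSN0-branch-x | tail -5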
Execute the below and confirm if versa-van-driver is processing logs (ideally you should see the below output, with logsRate showing up per tenant; if you see it stuck at "validating cluster", it's in an error state)
tail -f /var/log/versa/versa-van-driver.log
Instead of the above logs, if you see errors or if you see it stuck at "validating cluster", please follow the below steps
check the contents of the below file
cat /opt/versa/scripts/van-scripts/vandriver.conf
You can also compare the vandriver.conf file for an existing/working node with the non-working node and make sure the contents are similar
Execute the below on the non-working node
ping <ip> (try pinging all the addresses mentioned under DB_ADDRESS and SEARCH_HOSTS in the vandriver.conf file)
You should be able to ping all the addresses; if not, please fix the reachability issue
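To check all of them in one go, a small loop sketch is below (it simply extracts every IPv4 address found on the DB_ADDRESS and SEARCH_HOSTS lines - adjust it if your conf uses hostnames)
for ip in $(grep -E "DB_ADDRESS|SEARCH_HOSTS" /opt/versa/scripts/van-scripts/vandriver.conf | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' | sort -u); do
    ping -c 2 -W 2 "$ip" > /dev/null && echo "$ip reachable" || echo "$ip NOT reachable"
done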
Execute the below
cqlsh -u cassandra -p cassandra <address> (replace <address> with each ip in the DB_ADDRESS list one by one)
Check if cqlsh succeeds towards each of the ip-addresses listed in DB_ADDRESS (you should get to the cqlsh prompt). If you get any errors, there may be a port block or the cassandra service may be down on that node - the cassandra failure on that node will need to be fixed (refer to the Fusion troubleshooting further below)
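To quickly separate a port block from a service failure, you can probe port 9042 (the Cassandra CQL port that cqlsh uses) towards each DB_ADDRESS entry; a successful probe with a failing cqlsh points at the cassandra service, while a timeout points at a port block
nc -zvw3 <address> 9042    # repeat for each ip in the DB_ADDRESS list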
Once the above checks are done, if you still see the logsRate as 0, please check if the local-collector configuration is proper - the collector address, port and directory should be properly configured
Execute the below to validate that you have "active connections" on this node
Finally, check the below and confirm if the logsRate is positive (you should see it processing x logs at a positive logsRate)
tail -f /var/log/versa/versa-van-driver.log
5. Check the configuration on the cluster-nodes, or log-forwarders, to confirm if the collector address is configured correctly (the ip-address here should be the address of the interface that's supposed to get the logs)
6. Confirm that the branch's VoS version matches the version on Analytics or is "lower" than it. If the branch version is higher than the version on Analytics, the logs may get dropped on Analytics (unable to parse the logs)
7. If specific tabs don't show up data (for ex, you don't see data under SLA-metrics, Application or User tabs), please ensure that the specific data is turned on in the "data configuration" section as seen below
8. Please ensure that the specific data is being sent by branch, as below
Login to the concerned branch cli, go to the shell mode and connect to vsmd prompt
shell
vsh connect vsmd
Check to confirm the "active collector" (this is the collector to which the logs are being sent actively)
Now check the statistics against this collector and confirm if the concerned stats are incrementing. Below are the category of logs which correspond to the tabs seen on the UI
bwmon = sdwan usage and access-circuit usage stats
mon_stats = application and user stats
b2b_slam = SLA metrics and violation stats
acc_ckt_cos = QOS stats
intf_util = system/wan interface logs
flow_mon = traffic-monitoring logs (search log)
access-policy = firewall logs (search log)
alarm_log = alarm logs (search logs)
Execute the command again after a gap of 10 mins and confirm if the counters of the concerned logs are incrementing
9. Perform a sanity check on the configuration
Make sure lef logging is enabled and set to the "default logging profile" under the concerned policy/rules
For Security src/dst stats below should be enabled
For monstats (application/user stats) and bwmon stats below should be enabled
Below should be in "checked" state
10. Ensure that the "settings" configuration on the UI has the entries for all the cluster nodes in the relevant boxes. "Driver hosts" should have entries for all the nodes in the cluster, "Search hosts" should have entries for the search nodes and "Analytics hosts" should have entries for the analytic nodes. Check the UI of all the nodes in your cluster for this configuration
Also, check "status" page to confirm if all the cluster nodes display status as "UP" as seen below
11. Validate if the search logs are enabled (in "on" state), click on "save" once to ensure that the configuration is pushed to the nodes
12. Check the "alarms" section to confirm if there are any alarms pertaining to "global daily limit" or "tenant limit" breaches, in which case you will not be receiving search logs till the 24 hr time-block is complete (or you can increase the threshold limits in the "data configuration" section)
13. Look at the section "performing a sanity check on the DB" further below in this KB and perform the checks to ensure that the DB status is fine
14. Perform a "vsh restart" on all the nodes in the cluster to clear up any transient state
15. You can generate a test alarm (it will not impact the node, it will just generate test alarm towards analytics) as shown below on one of the branches (please validate the check provided in step 2 above before proceeding)
shell
vsh connect vsmd
vsm-vcsn0> test vsm trap interface down vni-0/0
vsm-vcsn0> test vsm trap interface up vni-0/0
16. Wait for 5 mins after performing the above step and then login to all the cluster nodes (or log-forwarders) in your setup and check the below on their shell
sudo su
cd /var/tmp/log/tenant-xyz/backup/VSN0-<branchname> (replace xyz with the actual tenant name and "branchname" with the name of the branch on which the above test was performed)
grep -i "alarmlog" *
Confirm if the alarm is displayed here
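If you are not sure which tenant or collector received the alarm, this one-liner (run on each collector) searches all the backup folders at once
grep -ril "alarmlog" /var/tmp/log/tenant-*/backup/VSN0-*/ 2>/dev/null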
17. Check on the UI again and confirm if the alarm shows up (make sure you are checking in the correct tenant)
Issues with Scaling Analytics
There can be various issues, with respect to loss of logs and DB/service failures, that arise owing to scaling issues on Analytics. The basic scaling guideline is as below (refer to the link)
Hard-disk as a scaling limitation
For production cluster-nodes, the usual recommendation is to use 1TB or 2TB hard-disk to accommodate the Database (the DB can sometimes extend by 40-50% of its original size during compaction). You can check the hard-disk size as below
It's important to monitor the disk usage periodically to ensure that it doesn't cross 80% (you can monitor the resources on the Analytics gui)
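If you prefer a shell-level check in addition to the GUI, below is a minimal sketch that flags usage above 80% (it assumes the data sits on the root filesystem - adjust the mount point if your deployment uses a separate data partition)
df -kh /
usage=$(df --output=pcent / | tail -1 | tr -dc '0-9')
[ "$usage" -ge 80 ] && echo "WARNING: disk usage at ${usage}%"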
Also monitor the alarms section for any errors relating to memory breach
You can set the threshold at which alarms are generated as below
Please note that these analytic alarms are not sent to the director, you will have to monitor them locally - there may be a feature to export these alarms to a remote server in the later releases.
Check the size of the Database from the shell of the cluster-node as below
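For example (both outputs are referenced in the following paragraphs)
nodetool status    # shows the load per node under "Datacenter: Analytics"
sudo du -sh /var/lib/cassandra/data    # size of the DB files on disk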
Under "Datacenter: Analytics" you can see the analytic database load, you can check the same under /var/lib/cassandra/data where the DB files are stored
In "dse" based Database (you know if the database is on DSE if you see a valid output for "dse -v", it would show some version like 4.5 or 4.8) you can check the load of the search database in the same way as above, by checking the "Datacenter: Search" in the output of "nodetool status"
In "fusion" based on Database, you can check the load using "vsh dbstatus"
If the Database disk usage becomes a limitation, you can increase the disk size by adding more disk space.
Adding more nodes to the cluster
You can also add more nodes to the cluster; for example, if you have 1 analytic node you can add 2 nodes, and if you have 2 nodes you can add 4 nodes.
If you have log-forwarders in your setup, you can add more log-forwarders to handle the increased volume of logs or increased number of connections (discussed further in sections below)
If you used the installation script to bring up the cluster (in 20.x, 21.x or 22.x), you can use the below KB to add new nodes to the existing cluster, or to add new log-forwarders
You can consult with the Versa PS/SE to help you scale your cluster
Database retention
The default retention is set to 90 days for Analytic data (daily data); you can reduce the retention limit to 60 days if your hard-disk is a limitation - this way you limit the amount of data stored in the DB
Similarly, the default retention value for search data is 3/7 days; if the value is set higher it will take up more DB space, so you should ideally set the retention limit to 3 days
You should also set a "global daily limit" to limit the number of search logs that are ingested into the DB per day - you don't want millions of logs ingested into the DB per day, as it would overwhelm the DB. A safe limit is 10 million, or 30 million in the case of a heavy volume of logs
The optimal storage limit for Search nodes is 100 million logs (per node), so you want to set the global daily limit and the retention period towards ensuring that you don't overwhelm the search node beyond this limit
For ex, if the retention for "Access logs" (firewall logs) is 10 days and your global daily limit is 30 million, and you end up receiving 30 million Access logs per day, it would essentially lead into 10*30 = 300 million logs in the DB, which is not optimal (if the are 2 search nodes in your setup, it can optimally accommodate 200 million logs)
You can also set the limit specific to tenant if you are aware of a tenant that's sending in a larger volume of logs compared to other tenants
You can check the log volume sent in per tenant as below
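As a rough shell-level approximation, you can also compare the size of each tenant's hourly archives (a sketch - it only reflects logs that have already been archived on that collector)
du -sh /var/tmp/archive/tenant-*/ 2>/dev/null | sort -h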
Please follow the best practices listed in the documentation below to ensure the Database is not overwhelmed by logs
Hitting max-connections on the log-collector
By default the configuration allows for 512 incoming connections as seen below
You can check the existing number of connections by checking the below output on all the log-collectors
Please note that each branch has multiple connections (depending on the number of LEF connections configured); however, it sends logs actively on just one connection. The other connections are "passive" but they still take up connections on the log-collector
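As an OS-level cross-check of the connection count (the port below is a placeholder - use the port configured on your local collector)
sudo netstat -ant | grep ":<collector-port>" | grep -c ESTABLISHED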
You can increase the number of connections on the log-collector, by modifying the max-connections in the configuration
Log volume issues
Note that the scaling limit is not with respect to max-connections but the volume of incoming logs. In the output of "show lced stats" you can check the global stats; if the number of logs being parsed every second is > 4000 (max/avg), it indicates a heavy volume of logs incoming on the log-collector
You can check the output of "show lced stats" a few times and determine the type of logs that are contributing to the log volume
flow_mon_v4_base = traffic monitoring logs (search logs placed in search DB)
access-policy = firewall logs (search logs placed in search DB)
mon_stats = application/user stats (analytic stats, placed in analytic DB)
Usually it's the flow-mon, access-policy or mon-stat logs that take up the log volume
If the log volume is high, it would lead to the build up of backlog (which means the rate of incoming logs is much higher than the rate at which logs are being processed by the versa-analytics-driver and pushed into the DB).
LCED parses the incoming logs and places them in /var/tmp/log/tenant-x/VSN0-branchx (respective tenant and branch name). The versa-analytics-driver processes logs from these branch folders and ingests them into the DB; post processing, it moves the logs to the /var/tmp/log/tenant-x/backup/VSN0-branchx folder. The log-archive cron job then compresses the logs in the "backup" folder into a .tar.gz file and moves them to the /var/tmp/archive/tenant-x/VSN0-branchx folder
Check the below to get an understanding of whether there are backlogs building up - if you see the "branch" directories with utilization > 200 MB it's a clear indication of a backlog buildup.
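For example, the following lists the heaviest branch folders (anything in the hundreds of MB, excluding the backup folders, points at a backlog)
du -sh /var/tmp/log/tenant-*/VSN0-* 2>/dev/null | sort -h | tail -10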
If you see a backlog buildup, please verify if the versa-analytics-driver is running - the driver logs should show a valid rate of processing and the entries should be "recent" (if the last log is several hours/days old, it means the versa-analytics-driver has stalled for some reason - please restart the driver using "sudo service versa-analytics-driver restart")
If the versa-analytics-driver is running fine and the backlogs continue to build up, you can try balancing the connections between the log-collectors. Check the "active connections" on all the log-collectors (using "show lced stats") - if one of the log-collectors is more heavily loaded than the others, you can try restarting the versa-lced service on that log-collector so that the connections are dropped and re-distributed to the other log-collectors
sudo service versa-lced restart
Or you can set the max-connections to a lower value (say 100) to shed the connections until the backlogs clear up, post which you can set the max-connections back to 512
Disk utilization issues
It's important to monitor the disk utilization on the cluster nodes and log-forwarders on a periodic basis - you can check this on the "admin/resources" section of the GUI, or you can set up a Versa Analytics platform monitoring server using the information below (it involves setting up a server with the Prometheus/Grafana applications); it can monitor the Analytics nodes and has the provision to set up email notifications
If you see the disk utilization going high on the cluster-node or log-forwarder, it's first important to understand the real cause of the disk utilization - there are four factors usually at play
- Archives, where the /var/tmp/archive folder takes up the Disk space
- Backup, where the /var/tmp/log/tenant-x/backup takes up the Disk space
- Backlogs, where the /var/tmp/log/tenant-x folder takes up the Disk space (without the "backup" folder)
- Database inflation, where the /var/lib folder takes up the Disk space (where DB records are saved)
Please execute the below to find out which folder is taking up the disk space
Archives:
cd /var/tmp/archive
du -sh .
Backup:
cd /var/tmp/log/tenant-x/backup (check for all tenants)
du -sh .
Backlogs:
cd /var/tmp/log/tenant-x
du -ch . (check the utilization against all branch folders /VSN0-branchx)
you can also execute the below to check the utilization per tenant folder, and per node (useful when you have multiple tenants)
du -ch /var/tmp/log/tenant-*/VSN0-*
Database:
cd /var/lib/
du -ch .
There are different solutions to address each of the above cases
Clearing up Archives
The /var/tmp/archive folder holds the raw logs compressed into .tar.gz files. This has nothing to do with the Database - the versa-analytics-driver processes the logs and ingests them into the DB, and at the same time moves the logs to the /var/tmp/log/tenant-x/backup folder; a log-archive cron job running every hour archives all these logs (in the backup folder) into a .tar.gz file and places it in the /var/tmp/archive folder
There is no inherent mechanism to automatically delete the files in the archive folder. This is because different customers have different uses for the archives - some customers want to retain the archive logs for auditing purposes, since these archives contain the flow-mon (netflow/traffic-monitoring), access-logs (firewall logs), cgnat logs, dhcp logs, threat logs etc., which are useful for auditing (especially for service providers or financial institutions). If you want to check a user's record for a specific date/time, you can grep for logs from the archive using the below
cd /var/tmp/archive/tenant-x/VSN0-branchx
zgrep -a -i "access" 202109* (to dump all the access logs from this branch for the month of Sep 2021)
zgrep -a -i "flow" 202109* (for traffic monitoring logs)
zgrep -a -i "alarm" 202109* (for alarm logs)
As you can see, archive logs are useful if you want to look into the "search" logs from a past date/time - it's not useful for analytics data unless you want to restore data from the archives back into the DB.
The below link has the details on how to delete or transfer or restore archive files as the need may be
In 21.2.1, we have GUI options available to manage the archives, however the same can be accomplished from the cli as well as the shell.
You can refer to the KB below if you want to delete archive files for a specific period of time
You can refer to the KB below if you want to set up a cron job to delete archives periodically
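As an illustration only (the KB is the authoritative reference), a cron entry of the following shape deletes archives older than 30 days - the schedule and the 30-day retention are assumptions, adjust them to your audit policy
# /etc/cron.d/archive-cleanup (example entry - the user field is required in cron.d files)
0 2 * * * root find /var/tmp/archive -name "*.tar.gz" -mtime +30 -delete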
Or, if you just want to clear up the space held by the archive folder on an immediate basis, you can simply delete all the files stored under /var/tmp/archive using the below
cd /var/tmp/archive
rm -rf *
To re-iterate, archives are not related to "database" logs. The archives are raw logs which are stored in .tar.gz files - they do not have a default retention period, so the archives will stay around until they are deleted - whereas the Database has these logs stored in its tables for a specific duration of time, 90 days by default for Analytics and 3/7 days for search logs.
Deleting the archives does not affect the "Database" logs
Clearing up the backup folder
The backup folder should only retain the logs for an hour before they are moved to the /var/tmp/archive folder by the log-archive crond job.
If the log-archive cron job is not running or if there is an issue with the archiving, it can cause a build up of logs in the backup folder.
Please check the below
tail -f /var/log/versa/versa-log-archive.log
If the log-archive job is running properly you will see logs as above for the past hour.
If the log-archive job is not running you will not see any logs or you would see logs from an old date/time
If the log-archive job is running but unable to archive, you will see "starting archive" logs every hour but won't see the above logs showing branch folders being archived or "finished log-archiving" log
For the above two problematic scenarios, you can execute the below steps to recover from the condition
cd /var/tmp
ls -lrt | grep archive
<sample>
[versa@DEL-VAN01-SRV: tmp] # ls -lrt
total 44
-rwxrwxrwx 1 root root 5606 Oct 26 2017 postinst.sh
-rwxrwxrwx 1 root root 3600 Oct 26 2017 postinst-utils.sh
-rw------- 1 root root 0 Oct 30 2018 nohup.out
-rw-r----- 1 root root 0 Oct 23 14:00 logarchive.pid
If you see a logarchive.pid from an old date, delete the same as below
rm -rf logarchive.pid
Now go to the /etc/cron.d folder and delete the log-archive cron job (if it's present). If it's not present, it would mean that the log-archive cron did not get created and archiving has never been activated (while bringing up a log-forwarder you have to manually instantiate archives - details further below)
cd /etc/cron.d
ls -lrt | grep log
rm -rf log-archive
Instantiate a new log-archive cron job as below
sudo su
cd /opt/versa/scripts/van-scripts
./log-archive-start /var/tmp/log /var/tmp/archive hourly
Check if the log-archive cron has been created as below
cd /etc/cron.d
ls -lrt
Check the /var/log/versa/versa-log-archive.log after an hour to see if the archiving has started
Clearing the Backlogs
Backlog build-up refers to logs building up in the /var/tmp/log/tenant-x/VSN0-branchx folder. There are four common reasons for backlog build-up
1. The incoming log volume is much higher than the rate at which the versa-analytics-driver can process the logs
2. The versa-analytics-driver is stuck
3. Sufficient resources are not allocated to the analytics node (mostly in the case of VMs)
4. Too many active sessions pinned to one log-collector
Discussion on each of the above points is as below
Point 1:
For point 1, refer to the "Scaling" section discussed above in this document; the log volume can be reduced by following the best practices for scaling the logs and also by re-distributing the connections to other log-collectors
Point 2:
For point 2, check /var/log/versa/versa-van-driver.log to see if the logs show up against the current time or if they are stuck (at an old time) - if they are stuck, you can try restarting the versa-analytics-driver as below
sudo service versa-analytics-driver restart
Point 3:
For point 3, please check if enough cores and memory are assigned to the VM in line with the recommendation (refer to the section above on "Issues with scaling Analytics"). Also, in the case of a VM please ensure the below
- hyperthreading should be disabled on the host server on which the VM is activated
- The cores assigned to the VM should be dedicated (1:1) there should not be any over-subscription of cores on the server (for example, if the server is 32-core and there are 3 VMs each assigned 16-cores, there would be an over-subscription since cores would be shared between VMs instead of being dedicated)
- The disk should be SSD (HDD disks can result in slow/sluggish DB read/write issues)
- The disk controller setting on the host server should be SCSI and not IDE
Below are the most common symptoms seen when the VMs are not able to access sufficient cpu cycles or ram from the host server
- When you execute "vsh dbstatus" there is a huge lag in displaying the entire output
- Cli/Shell access is found to be slow
- You see a huge lag in the time taken to process logs while checking /var/log/versa/versa-van-driver.log. For example, let's say in the output below you see logs=3000 and time=10 secs - it would indicate a definite sluggishness; usually 3000 logs should be cleared out in < 1 sec
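A quick way to check for resource contention from inside the VM is the below (a minimal check; the "st" steal and "wa" iowait columns staying near 0 suggest the host is not starving the VM)
vmstat 5 5        # watch the "st" (steal) and "wa" (iowait) columns
top -bn1 | head -5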
You can get an idea about the amount of backlogs by checking the utilization of the /var/tmp/log/ folder
cd /var/tmp/log
du -sh .
If the value is in gigabytes (say >10 GB), it would indicate a significant build-up of backlogs (it could also be because the "backup" folder is building up owing to an error/failure in the archive cron - refer to the section above on clearing up the backup folder)
You can determine the tenant that has the most backlogs by checking the below
cd /var/tmp/log
du -ch --max-depth=1 .
You can then access the tenant directory and figure out which CPE(s) are contributing to the backlogs
cd /var/tmp/log/tenant-<tenant-name>
du -ch --max-depth=1 --exclude=backup .
For example, if you see >1G against a CPE it would indicate backlogs (built up for several hours)
Check the specific CPE's folder and determine if there are files that have >3M of data and if you consistently see such files being created
cd /var/tmp/log/tenant-xyz/VSN0-<branch-name> (tenant-xyz - replace xyz with the tenant-name)
ls -lrth | less (use this to check the oldest date of logs that are present in the folder)
Look for files that are > 3M in size and check their contents as below
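A rough way to see which log category dominates a large file is the sketch below (the file name is a placeholder, and the keywords are simply the log-type strings discussed next)
find . -name "*.txt" -size +3M | head -5
for key in monstatlog accesslog flowmonlog alarmlog; do
    echo -n "$key: "; grep -ci "$key" <large-file>.txt
done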
Check the log-type that's most prevalent in the above output, the likely logs are below
monstatlogs - These logs contain application/user stats in the form of "session" stats; they carry the src/dst IP and application information pertaining to all the sessions on the branch. Hubs, or branches, which are subject to a huge number of sessions (say >50K) will end up generating a huge volume of monstatlogs
The below setting (which is the default) generates the monstatlogs with the session aggregate stats. This configuration can be disabled to cut down the monstatlog volume, but note that doing so will cause application/user stats to no longer be visible on Analytics
In 20.x there is a feature to generate top-N session logs instead of generating logs for all the sessions (as was done in 16.1R2), which reduces the volume of monstatlogs - by default it's top 50 in 20.x as seen below
accesslog/flowlog - these are the firewall logs and traffic-monitoring logs respectively; these logs can be voluminous, again depending on the number of sessions and the number of rules that have logging enabled (please refer to the section on "Scaling Analytics" above - one has to be judicious in enabling logging on the firewall rules; if you enable logging on the default rule, all logs would be sent over to Analytics, causing a huge influx of accesslogs)
Sometimes the backlogs can build up to a level where it becomes difficult to clear them (let's say the versa-van-driver was stuck for a few days and the backlogs kept building), for example if the /var/tmp/log directory utilization (du -sh .) is >100 G.
If there are a lot of backlogs to be processed, it's a good idea to set the max-connections to 0 so that this node does not have to process any new logs
<sample>
versa@Analytics16% set log-collector-exporter local collectors collector1 max-connections 0
versa@Analytics16% commit
However, for backlogs as huge as 100G or above, it can take 48-72 hours to clear up even if there are no issues with the cpu resources available to the versa-van-driver
The other option is to delete all the backlogs and start afresh - you would indeed lose out on the logs/stats for the period of time for which the logs have been deleted, but the advantage is that you will start receiving the current logs
If you are looking to "delete" the backlogs, please follow the below procedure
- "vsh stop" to ensure there are new logs incoming and the driver is not processing the logs in the /var/tmp/log directory
- Execute the below on shell of the node. The below execution will delete all the backlogs present in /var/tmp/log, under all the CPEs, for this year (2021). It can take a few minutes for this execution to complete depending on the backlog volume
find /var/tmp/log/tenant-*/VSN0-* -name "2021*.txt" -delete
You can work with different variations of the above cmd, for example if you want to delete the backlogs from a certain branch - you can execute the below
find /var/tmp/log/tenant-ABC/VSN0-XYZ -name "2021*.txt" -delete (where ABC is the tenant and XYZ is the branch)
Point 4:
It's important to ensure that a single log-collector does not get hogged by active connections. Controllers have the ADC load-balancer configuration, where the log-collectors are added as seen below
Make sure all the log-collectors are "UP" (reachable) and are configured as "enabled" - so that the controller can load-balance the incoming connections towards these collectors
Log in to each of the log-collectors and check the active sessions as below
Check the "Clients active" counter on all the log-collectors and confirm if there are no log-collectors which are taking a relatively higher volume of clients.
For ex, if you see a log-collector taking 300 connections while other log-collectors are at 10 or 50, you can offload connections from the heavily loaded collector by executing the below on that log-collector
sudo service versa-lced restart
The above will restart the lced service (which caters to incoming connections); this will cause the connections on this log-collector to be cleared (and dispersed to the other log-collectors). The restart usually takes a second to complete, so new connections will be re-created on this log-collector too.
You can also use other methods, like disabling a highly loaded log-collector on the controller (hit the "disable" check against the log-collector under the ADC lb server configuration), or setting max-connections to 0 in the cli configuration of the log-collector for some time and re-enabling it once the backlogs have cleared up to a certain extent.
Make sure you set the max-connections back to 512 after the backlogs have cleared up and disk utilization has returned to normal
Note: the "active connections" that you see on the log-collectors include the "passive" connections along with the connections which are used to send logs actively. Each CPE sends logs actively on a single LEF connections, while the other LEF connections are passive (but they show up as "active connection" the log-collector and they take up a connection).
When you see "active clients" on a log-collector you can't make out if the connection is actively recieveing logs or if it's a passive connection.
An option is to enable "backup" LEF configuration on the CPEs, this way the passive connections are placed in"suspend" state and they don't take up a connection of the log-collectors and it will be easier to know if a log-collector is actually inundated with connections that actually take up connections
Clearing the Database overload
Database load is dictated by two factors
1. Incoming log volume
2. Retention period
Few best practices to manage the Database load
1. You should look to control the log volume using the best practices guidelines mentioned in the documentation link provided in the "Scaling" section above.
2. The hard-disk should be large enough to support the production load; we recommend a 1TB/2TB hard-disk for the same - please refer to the "Scaling" section
3. The retention period by default is 90 days for Analytics data and 3/7 days for Search data (you can check the "admin/settings/data-configuration") - be very cautious about increasing these values as it has a direct impact on the Database inflation, especially if you already have an insufficient hard-disk
4. On Analytics you can check the /var/lib/cassandra/data folder to check the table(s) that are most voluminous and look to "truncate" tables that are not really necessary - more details on this further below
5. Take a judgment call on increasing the number of nodes in the cluster to meet the requirements of your network. If your "search DB" load is not much (let's say you've not enabled a lot of firewall logging or traffic-monitoring), you can just look to increase the "Analytic personality" cluster-nodes - it's best to increase it by a factor of 2. Please consult Versa SE/PS for such projects as they can provide a clear evaluation based on your network size.
6. Make sure the vandb-repair cron job is present under /etc/cron.d directory - this cron job executes a repair function to ensure proper clearing of truncated records.
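A quick way to confirm the repair cron job mentioned in point 6 is in place:
ls -l /etc/cron.d/ | grep -i vandb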
You can check your DB load by the following method
1. Check "nodetool status", this will give you an idea of the DB load on "Analytics personality" nodes (as well as "Search personality" node in the case of DSE schema)
2. Check the size of the /var/lib folder (either /var/lib/solr/data or /var/lib/cassandra/data would hold the table data as the case may be)
sudo su
cd /var/lib
du -ch .
Clearing up Analytics DB
Whether it's DSE or Fusion schema, the "Analytics DB" is Cassandra, and you can check the DB tables under the below folder
sudo su
cd /var/lib/cassandra/data/van_analytics
ls -lrt
The tables that usually take up space are as below, also mentioned is the GUI context that are populated by these tables
tenantsrcfacts - Security/Firewall/Source
tenantdestfacts - Security/Firewall/Destination
sdwanappsubscriber - Sdwan/Sites/Application/Users
sdwansite2siteslapathstatus - Sdwan/Sites/Sla metrics
sdwansite2siteslam_1 - Sdwan/Sites/Sla metrics
sdwansite2siteslamrt2 - Sdwan/Sites/Sla metrics
sdwansite2siteslaviolation - Sdwan/Sites/Sla violations
You can turn off a table using the below method; once done, there will be no further DB table population for this type of data (please note that this means you can no longer check the GUI data for these tables)
You can also look to reduce the retention period if you don't really need 90 days of analytics data or 7 days of search data
Note: Changing the "retention period" will only impact the new data
The fastest way to clear up the DB load is to "drop" (truncate) a table - when you truncate a table you will lose all the data for that table from the current time back to the "retention time"; you will only be able to see new data that comes in.
Note: When you drop a table, it will just erase the data from the current time to the past retention-time (say 3 months worth of data), but it will continue to be populated with new data that's incoming
The procedure to drop a table is as below
1. Find out the tablename that you want to drop
sudo su
cd /var/lib/cassandra/data/van_analytics
ls -lrth
For example, let's say you want to drop tenantsrcfacts, the table name is tenantsrcfacts as can be seen below
[versa@Analytics16: van_analytics] $ ls -lrth | grep tenantsrc
drwxr-xr-x 3 cassandra cassandra 4.0K Sep 27 12:32 tenantsrcfacts-c6243c407b6c11ebac1977d4669a7d1e
2. Stop the services and disable compaction on the concerned table
vsh stop
sudo nodetool disableautocompaction van_analytics tenantsrcfacts
3. Login to the database
In dse:
cqlsh
In fusion:
cqlsh -u cassandra -p cassandra
4. Truncate the table - this will drop all the data from this table from current time to retention time
[versa@Analytics16: van_analytics] $ cqlsh -u cassandra -p cassandra
Connected to D5-VAN1 at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.0.9 | CQL spec 3.4.0 | Native protocol v4]
Use HELP for help.
cassandra@cqlsh>
cassandra@cqlsh>
cassandra@cqlsh> truncate van_analytics.tenantsrcfacts ;
5. Re-enable compaction on the table, clear the snapshots and start the services
sudo nodetool clearsnapshot
sudo nodetool enableautocompaction van_analytics tenantsrcfacts
vsh start
You can check the GUI to confirm if the data has been deleted (for ex, check Sdwan/Security/Firewall/Source tab to check if the data has been cleared up in the case of "tenantsrcfacts")
Check "nodetool status" to confirm if the memory has been cleared up, you can verify the same using "df -kh"
Clearing up Search DB
For all practical purposes, the easiest way to clear up the Search DB is to reduce the retention period, say from 3/7 days to 1 day, for all the relevant tables like Access logs, Cgnat, Flow logs etc.
The other option is to set a "global daily limit" - usually 10 million is optimal - so that the number of search records pushed into the Search DB does not exceed 10 million logs per day
You can also set limits against specific tenants
Note: Once you set the global daily limit, you will not be able to see any further search logs (such as firewall/flow-logs) once the global daily limit is breached for the day - the logs will only resume the next day.
Global daily limit follows the UTC time, so the limit gets applied at UTC 00:00 time each day
You can also consider 30M global daily limit if you have a large network with a huge volume of search logs, but make sure the Search DB is large enough to handle this influx (at-least 2TB in size)
You can also start afresh by deleting the entire "Search DB" to clear up all the past data (please check with TAC before undertaking the below step)
sudo su
cd /opt/versa/scripts/van-scripts/
cat vansetup.conf <<<<<<< verify if the vansetup.conf is setup properly
./vansetup.py <<< choose "y" when asked to confirm on "Delete DB"
Issues with DB - DSE/Fusion
The DSE based database was used previously; however, starting 20.x, Fusion based database support was incorporated, and over time the DSE based database will no longer be supported - the Fusion based database is open source and is used as the default DB schema during a fresh installation of the cluster in 20.x (or above)
Customers who are on DSE based database are urged to migrate to the Fusion based database - the basic steps involved in migrating from DSE to Fusion are as below
Note: To confirm if your Database is DSE based, just execute "dse -v" on the Analytics shell. If it returns a valid value (like dse-4.5 or dse-4.8.x), it means it's using the DSE schema; if it returns "no command dse found", it implies that the database is "Fusion" based
- If you are on dse-4.5 (as seen in "dse -v" output), you will first need to migrate to dse-4.8 before upgrading the image
- Please follow the below KB to migrate from dse-4.5 to dse-4.8
https://support.versa-networks.com/a/solutions/articles/23000019690
- Once you've migrated to dse-4.8 you can upgrade the image to 20.x or 21.x or 22.x
- Post image upgrade you can migrate from DSE to Fusion using the procedure below
https://support.versa-networks.com/a/solutions/articles/23000021015
Some of the common issues encountered with respect to the DB are as below
- Cassandra state of one, or many, nodes transitions from UN to DN (this can happen on DSE or Fusion)
- Solr failure (search node down) on Fusion
- Log-forwarder is unable to connect to database
Any issues with the DB can be determined by executing the below two commands
nodetool status << will work on DSE and Fusion on Analytic personality
vsh dbstatus <<< only works on Fusion
Performing a Sanity check on the DB
1. Check the status of the DB
DSE based Database:
For a DSE based database, you can execute nodetool status on all the nodes in the cluster and it should ideally show up as below; the status should be UN for the Analytic and Search personality nodes (in DSE both Analytic and Search nodes use Cassandra, hence their status can be checked via "nodetool status")
You should also be able to login successfully to the database from the shell of the cluster node as below
Fusion based Database:
For Fusion based database, you can execute "nodetool status" on the Analytics personality node, but not on the Search personality node (because Search nodes do-not use Cassandra database, they use a Solr database).
Instead, you can simply execute "vsh dbstatus" on all the cluster nodes (Analytic and Search nodes) and it should ideally return the status below
On Analytic personality you should see the below status - it will only list Analytic personality nodes (it won't show the Search personality nodes)
On Search personality node you should see the below status
Also check the below status on the search nodes; they should show up as "active" (all the nodes should ideally show up as "active" in the below listing)
You should be at the root prompt while executing the above (sudo su)
2. Check the memory usage (disk utilization) on the nodes
This is also covered in the disk utilization troubleshooting section above, basically the database files are stored in the "data" directory of the respective DB (cassandra or solr) as below
You can find the disk utilization of cassandra DB (Analytic personality) as below (same would also work for search personality in DSE based database)
cd /var/lib/cassandra/data
du -ch .
In Fusion based database you can find the utilization of Search DB as below, use the address present in "cat /etc/hosts" mapped against the search node's hostname in the below command.
[versa@Analytics17: ~] $ curl -u "cassandra:cassandra" 'http://localhost:8983/solr/admin/metrics?nodes=192.10.10.57:8983_solr&prefix=CONTAINER.fs,org.eclipse.jetty.server.handler.DefaultHandler.get-requests,INDEX.sizeInBytes,SEARCHER.searcher.numDocs,SEARCHER.searcher.deletedDocs,SEARCHER.searcher.warmupTime&wt=json'
If the DB usage is close to 70% of the disk usage, it's important to take actions towards reducing the DB size (discussed in the "disk utilization" section above.)
Troubleshooting DB issues
1. Analytic DB down in DSE/Fusion
2. Search DB down in DSE
3. Search DB down in Fusion
4. Zookeeper status down (in Fusion)
5. Adding a new node to existing cluster
6. Log-forwarder connectivity to the cluster
1. Analytic DB down in DSE/Fusion
The Analytic personality nodes use the "Cassandra" database, in DSE as well as Fusion setups. There are usually three issues seen on the Cassandra DB
- DB crashes owing to memory issues (not enough space left on the node)
- DB fails owing to a transient error
- Cluster status stuck in DN state owing to reachability issues
You can check "nodetool status" or "vsh dbstatus" (in the case of Fusion) on the Analytic personality nodes and if you see the status as DN
You can check the /var/log/cassandra/system.log file to check for any errors (attach the system.log file to the ticket if you open a case with Versa TAC)
If it's owing to a disk space issue (if the "df -kh" output indicates that the disk usage is at 70-80%, it's likely that the DB crashed because of lack of space while performing compaction/repair - a periodic task which usually causes the DB to swell up by 20-30% temporarily), you will first need to free up the disk space
Once you've cleared up the disk space issues, either by adding more disk or clearing up the disk space (as discussed in the "disk space utilization" section) you can re-start the DB as below
In fact, you can try the below steps if the DB state is "DN", irrespective of the reason, as a recovery step
For DSE:
(on the shell of the node)
sudo service monit stop
sudo service dse stop
sudo service dse start
sudo service monit start
For Fusion:
sudo service monit stop
sudo service cassandra stop
sudo service cassandra start
sudo service monit start
Wait for a few minutes (sometimes >20 mins) after executing the above and check the nodetool status or vsh dbstatus again
If the failure continues you can try the below
cd /var/lib/cassandra/commitlog
rm -rf *
cd /var/lib/cassandra/saved_caches
rm -rf *
<screenshot for reference>
The above steps will clear any pending commit on the DB (in case some corruption in the commits was causing the DB failure).
Once done you can execute the below steps again and check if the DB recovers
For DSE:
(on the shell of the node)
sudo service monit stop
sudo service dse stop
sudo service dse start
sudo service monit start
For Fusion:
sudo service monit stop
sudo service cassandra stop
sudo service cassandra start
sudo service monit start
Note: Avoid re-running vansetup.py (sudo /opt/versa/scripts/van-scripts/vansetup.py) on the Analytic personality nodes without consulting Versa TAC, and if you do end up running vansetup.py, make sure you don't type "y" when prompted for "delete DB"
To confirm if there are any reachability issues between the nodes in the cluster, try pinging the "listen" address of all the nodes and confirm that all the required ports are open between the nodes as mentioned in the below documentation
Especially confirm if 9042 port is reachable between the cassandra nodes using the below
nc -zvw3 <listen-address-of-peer-node> 9042
If you are unable to recover the DB failure with the above steps, you can raise a TAC case and attach the /var/log/cassandra/system.log
2. Search DB down in DSE
In DSE, the search DB uses Cassandra DB and the troubleshooting steps are similar to the above
3. Search DB down in Fusion
You can check the status of the search db in fusion by executing "vsh dbstatus" as mentioned in the above section for "checking DB sanity"
If status is not normal you will essentially see the output throwing up an error.
You can also check the cluster_status as below and check if nodes show up as "down" instead of "active" (or if any of the nodes are stuck in "recovering")
sudo su
cd /opt/versa/scripts/van-install
./cluster_install.sh solr cluster_status | python -m json.tool
The most common methods of recovery are as below
Step 1. Make sure you are able to ping the listen address of all the nodes and that all the required ports are open between the nodes in the cluster. Refer to the below doc (firewall requirements for Analytics) to know the ports which need to be open between the cluster nodes
You can use the below cmd on shell to check if a port is reachable
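nc -zvw3 <listen-address-of-peer-node> <port> (same syntax as used in the Analytic DB section above)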
Also, if any of the nodes are stuck in "recovering", please check if there is a time lag of more than 10 secs between the search nodes - you can open parallel putty sessions towards each of the search nodes and execute "date" on the shell of all the nodes at the same time, then verify if there is a time lag of more than 10 secs between any of the nodes
Please refer to the below KB before moving on to further steps
Step 2. Rolling restart of zookeeper followed by solr restart
Rolling restart of zookeeper means that you restart zookeeper on all the nodes in your cluster one-by-one (including the analytic personality nodes) as below
sudo service monit stop
sudo service zookeeper stop
sudo service zookeeper start
sudo service monit start
On the search nodes please execute the below to restart solr along with zookeeper
sudo service monit stop
sudo service solr stop
sudo service zookeeper restart
sudo service solr start
sudo service monit start
Check "vsh dbstatus" on the search nodes post the above activity and check if the state normalizes
If the failure continues please go to step 3
Step 3. Check the mapping in /etc/hosts and confirm if the search node hostname is mapped to the "listen" address and not the rpc address
If you want to know the "listen" address on the node you can check the below
cat /opt/versa/scripts/van-scripts/vansetup.conf | grep -i listen
ifconfig << the interface holding the listen address should be up
cat /etc/hosts
Make sure the local hostname is mapped to the "listen" address in this listing
If not, kindly modify the /etc/hosts file and ensure that the local hostname is mapped to the listen address
Move to step 4
Step 4. Try re-executing vansetup.py on the "search" nodes (please DON'T execute vansetup.py on any of the analytic personality nodes)
Open a putty/terminal for all the search nodes in your cluster
Go to the below location
cd /opt/versa/scripts/van-scripts
sudo ./vansetup.py
Note: execute this on all the search nodes in parallel
select "N" at the "delete DB" prompt (if you select "y" the search database will be deleted)
Check the "vsh dbstatus" post executing the above, if the failure continues move to step 5
Step 5. You will have to try deleting the search DB (existing collection) and re-execute vansetup.py in a bid to re-instantiate the solr DB
Note: The search DB usually holds just 3/7 days of logs (alarm/firewall/traffic-monitoring logs), so it should ideally not be an issue to delete the search DB. In fact, you can check all the search logs easily in the archives (which are stored on the disk - archives are not auto-deleted, so the archive logs stay around forever unless they are deleted manually or moved to a remote server)
You can access any search log from the archive folder as below; please check this on all the log-collectors in your setup (either all the log-forwarders or all the cluster nodes, because the logs from the branch in question could be on any one of them)
sudo su
cd /var/tmp/archive/tenant-xyz/VSN0-abc (where xyz is the tenant name and abc is the branch name in question)
zgrep -a -i "alarm" 202108* (dumps all alarms for Aug 2021)
zgrep -a -i "alarm" 20210810* (dumps all alarms for 10 Aug 2021)
zgrep -a -i "alarm" 2021* (dumps all alarms of 2021 year)
In a similar fashion you can dump firewall logs (accesslog) or traffic monitoring logs (flowlog) as below
zgrep -a -i "accesslog" 202108*
zgrep -a -i "flowlog" 202108*
As you can see the logs are user readable and in the same format as you see in the UI, so you can always dump the search logs that you want to check (alarms/firewall-logs) directly from the archive. So there is no harm in deleting the search DB as a part of recovery
To delete the search DB and re-instantiate solr please follow the below steps
1. Delete collection (you need to execute the below on just one search node - not all)
sudo su
cd /opt/versa/scripts/van-install
./cluster_install.sh solr delete_collection
./cluster_install.sh solr refresh_config
2. Re-execute vansetup.py on all the search nodes in parallel, as described in Step 4 above
3. Select "y" when asked at "delete DB" and "delete search DB" prompts
Post the activity, check "vsh dbstatus" and confirm if the status is normal, if failure continues move to the next step
Step 6. Clean up solr installation and re-install
Execute the below to delete the current solr installation (you can run this on just the search node which has the failure; it's not needed on all the search nodes)
sudo service monit stop
sudo service solr stop
sudo kill -9 $(ps -ef | grep solr | grep -v grep | awk '{print $2}')
sudo update-rc.d solr disable
sudo rm -rf /etc/solr*
sudo rm -rf /var/lib/solr
sudo rm /etc/init.d/solr
sudo rm /etc/default/solr.in.sh
Execute the below (select "y" when prompt for delete DB)
sudo /opt/versa/scripts/van-scripts/vansetup.py --force
sudo service monit start
Check the "vsh dbstatus" and confirm if the status is fine
If the failure continues please open a Versa TAC case and attach the below log file
root@Search1:/# locate solr.log | grep solr | grep log
/var/lib/solr/data/logs/solr.log
4. Zookeeper status down in fusion
Zookeeper is a service that runs on the Analytic and Search nodes; the zookeeper agents on these nodes act as the channel through which the solr/search nodes and the other cluster nodes identify and access each other.
Check the output of "vsh dbstatus" on all the nodes and confirm if there are any failures seen under zookeeper status
Make sure port 2181 is reachable between the listen addresses of all the nodes in the cluster
sudo su
nc -zvw3 <listen-address> 2181 (use the listen address of the peer node)
Note: the listen address is nothing but the address that you see in the output of "nodetool status" or "vsh dbstatus" or the output of "cat /opt/versa/scripts/van-scripts/vansetup.conf | grep -i listen"
Usually zookeeper is down owing to a reachability issue or because the interface holding the "listen address" went down
You can also try a restart of zookeeper on all the nodes and check if it rectifies the zookeeper status on the problematic node
sudo service monit stop
sudo service zookeeper stop
sudo service zookeeper start
sudo service monit start
Check the below output on all the nodes in your cluster
cat /opt/versa/scripts/van-scripts/vansetup.conf
Confirm that the zookeeper id is unique on each node and that zookeeper_node is set to true
If you see any discrepancy in the zookeeper id (if two nodes have the same id configured) or if the zookeeper_node is set to false - please capture these outputs and open a ticket with Versa TAC
You can also check the zookeeper logs as below
/opt/versa_van/apps/zk/logs/zookeeper--server-<hostname>.out
for the following line:
2021-11-09 00:23:38,394 [myid:2] - WARN [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):Follower@170] - Got zxid 0xf00000001 expected 0x1
This indicates that the zookeeper data is not consistent across all the nodes. To resolve this issue please follow the steps below:
- Stop zookeeper on all the nodes
sudo service zookeeper stop
- Clear the zookeeper data directory on all the nodes
sudo rm /var/tmp/zookeeper/data/version-2/*
- Start zookeeper on all the nodes
sudo service zookeeper start
Following the above steps, you can perform the regular Solr reinitialization steps:
sudo service monit stop
sudo service solr stop
sudo kill -9 $(ps -ef | grep solr | grep -v grep | awk '{print $2}')
sudo update-rc.d solr disable
sudo rm -rf /etc/solr*
sudo rm -rf /var/lib/solr
sudo rm /etc/init.d/solr
sudo rm /etc/default/solr.in.sh
sudo /opt/versa/scripts/van-scripts/vansetup.py --force
Check the "vsh dbstatus" post this step to verify the status
Important note: While upgrading to 21.2.x, the upgrade script modifies the vansetup.conf file, on each node, to change the number of zookeeper servers to 3, to create an odd number of zookeeper servers (this is in-line with the recommendation provided in the Solr community to avoid a state where 50% zookeeper nodes are available, refer to the link/screenshot below) – in this case, .12 was removed as a zookeeper server in the vansetup.conf as a part of this change during an upgrade.
https://solr.apache.org/guide/6_6/setting-up-an-external-zookeeper-ensemble.html
However, please note that this change will only come into effect when vansetup.py is "re-executed" on a node; the changes in vansetup.conf are brought into effect only after an execution of vansetup.py (sudo /opt/versa/scripts/van-scripts/vansetup.py)
If you are planning on re-executing vansetup.py, to affect the changes made in vansetup.conf, please ensure that you follow the below steps
1. Please note that executing vansetup.py will restart the DB services on that node
2. Please make sure that you type "N" whenever prompted for "Delete DB" - please don't type "y" as it would end up deleting the DB
3. Execute vansetup.py on one node at a time, and move on to the next node only after a successful completion
4. You can execute vansetup.py as below
cd /opt/versa/scripts/van-scripts
sudo ./vansetup.py
Adding a new node to the existing cluster
There are times when an existing node goes down permanently owing to server/bare-metal issues or a VM disk issue. In such cases, you have to bring up a new VM or bare-metal server and add it to the cluster to replace the failed node. You will want to retain the same ip-addresses as the failed node on the new VM/bare-metal - make sure that eth0, eth1, eth2, etc. are configured with the same ip-addresses as the ones present on the failed node.
Once you have the ip-addresses configured on the new VM/baremetal, make sure you add all the required routes to ensure reachability with the rest of the cluster.
You can check the existing routes on one of the existing nodes as below (for example, using "route -n")
Configure similar routes on the new VM/baremetal (you can add routes as below)
sudo ip route add 10.10.0.0/16 via 10.192.21.1 dev eth0
Note: To make the routes permanent, you will have to add an entry in /etc/network/interfaces; otherwise, the routes will be removed after a reboot
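As an illustration (using the same example route as above), a "post-up" line under the relevant interface stanza in /etc/network/interfaces re-adds the route at boot:
post-up ip route add 10.10.0.0/16 via 10.192.21.1 dev eth0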
If you had previously installed the cluster using the installation script on the Director, you can skip to the "Add node using installation script" section further below
Add node manually
Also, check the below on an existing node
cat /etc/hosts
Copy all the entries to the new node's /etc/hosts file and also ensure that you add an entry for the local hostname mapped to the local "listen" address
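For example (the hostname and address below are placeholders), the additional entry on the new node would look like:
10.40.41.15   analytics-new-node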
With the ip-addresses and routes in place, you can move on to preparing the vansetup.conf as below
1. Copy the vansetup.conf from another node (of similar personality) to the /home/versa directory of the new node
2. Execute the below on the new node
sudo su
cp /home/versa/vansetup.conf /opt/versa/scripts/van-scripts/vansetup.conf
check if the permissions are as below post copying
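For example, you can list the file and compare the ownership/permissions against the same file on an existing node:
ls -l /opt/versa/scripts/van-scripts/vansetup.conf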
Now modify vansetup.conf (using the vi editor) and ensure that you replace the ip-addresses with the local ones
For Fusion Analytic node:
On the Fusion Analytic personality, change the below to match the local addresses and also ensure that the zookeeper id reflects the correct mapping. Configure "seeds" as the listen address of one of the existing "analytic" personality nodes - when you execute vansetup.py, it will push the DB table information from that node.
For DSE Analytic node:
Change the below addresses to the local ones and ensure the seeds is set to an existing analytic personality node's listen address (the address you see in the "nodetool status" output)
For DSE Search node:
The procedure is the same as above, except that the personality would be search in this case
For Fusion Search node:
Update the below aspect to reflect the local addresses and correct "id" - seeds will be 127.0.0.1
3. Before executing vansetup on the new node, you will have to ensure the below step is executed in case you are adding an Analytic personality node (you can skip this step for search node)
Delete the old node's reference from the cluster; you can do that using the below steps on any one of the existing analytic personality nodes
For ex, if .44 was the failed node, you will have to remove its reference first using "removenode" as shown below (it can take a few mins for the removal to complete; if it takes more than 30 mins, you can retry with the force option at the end)
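As an illustration (the host ID below is a placeholder), the removal is typically done with nodetool from one of the existing analytic nodes:
nodetool status                              << note the Host ID shown against the failed .44 node
nodetool removenode <host-id-of-failed-node>
nodetool removenode status                   << check the progress of the removal
nodetool removenode force                    << only if the removal is stuck for more than ~30 mins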
4. Post deleting the old node as above, you can execute vansetup.py on the new node
sudo su
cd /opt/versa/scripts/van-scripts/
./vansetup.py
You would also have to execute the below on the director to sync the certificates - fill in the correct cluster name (you can get the cluster name from the Director UI under Administration/Analytics-Cluster)
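This is typically the same certificate-sync script covered in the "Unable to access Analytics via the Director UI" section later in this document, run as the versa user on the Active Director, for example:
sudo su versa
/opt/versa/vnms/scripts/vnms-cert-sync.sh --sync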
Make sure you enter "versa123" as the password at the first prompt; at the second prompt, enter the password you have set for the versa user login on the analytics
In 21.1.x (8443) you will also need to pull the cert from analytics/cluster to the director as below
Add node using installation script
If you had installed your cluster using the installation script, you can simply follow the below KB to add the new node to the existing cluster (you would just need to re-execute the cluster install script with the --add-node option)
Troubleshooting log-forwarder connectivity
You can also bring up a log-forwarder manually by following the below steps; if a log-forwarder has already been set up, you can verify its configuration using the same steps
- Open the vandriver.conf file below, update/check the DB_ADDRESS (analytic nodes) and SEARCH_HOSTS (search nodes) addresses, and set LOG_COLLECTOR_ONLY to True
- Enter config mode and update the below configuration (configure the collector address, which has to be the address of the local interface connecting to the controller, along with the port, storage, and format). Make sure this address is reachable from the controller's south-bound interface (the interface connecting to the log-forwarder/cluster) on the configured port (1234 in this case).
- Execute a “vsh restart” (this will cause the vandriver.conf to take effect), please note that you should not execute vansetup.py on a log-forwarder
- You should also update the UI of the analytic and search nodes and add the log-forwarder's address (of the interface that connects to the DB_ADDRESS and SEARCH_HOSTS) to the "van-driver hosts" listing under admin/settings
- Check /var/log/versa/versa-van-driver.log and check if it’s stuck at “validating cluster”, if so there is an issue with reachability from the log-forwarder to the addresses mentioned in DB_ADDRESS and SEARCH_HOSTS – you can troubleshoot as below
- Ping all the addresses mentioned in DB_ADDRESS and SEARCH_HOSTS from the log-forwarder
- If Ping works, check “nc -zvw3 <address> 9042” (replace address with each address present in DB_ADDRESS)
- Also check, “nc -zvw3 <address> 8983” (replace address with each address present in SEARCH_HOSTS)
- If ping doesn't work or if the connection fails (while checking the previous two steps), please troubleshoot the reachability or the port access
Note that all the ip-addresses mentioned in vandriver.conf are validated by the analytics-driver during a restart/reboot, so if any of the nodes is down/unreachable, the analytics-driver will be stuck at "Validating Cluster"
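To run these checks against every address in one pass, a small loop like the below can help (the addresses are placeholders; substitute the actual DB_ADDRESS and SEARCH_HOSTS entries):
for ip in 10.40.41.11 10.40.41.12; do nc -zvw3 $ip 9042; done    << DB_ADDRESS entries
for ip in 10.40.41.13 10.40.41.14; do nc -zvw3 $ip 8983; done    << SEARCH_HOSTS entries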
Side note: You would also need to add the Primary/Secondary Director as “remote-collectors” to ensure that the alarms are exported to the Directors (follow the KB below for the configuration)
Unable to access Analytics via the Director UI
In 16.1R2 and 20.2.x versions, the Director accesses the Analytics UI on port 8080 by default, whereas in 21.1.x (and above) the Director accesses the Analytics UI via port 8443 (ssl connection).
There are two requirements for the Director to access the Analytics UI (i.e., to be able to open the Analytics UI on the Director's Dashboard)
1. The director(s) need to be registered on the Analytics
2. The director needs to access analytics over an https (8443) connection
Each of the above points needs to be validated in order to ensure successful access
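A quick first check for the second point is to confirm that the port is reachable from the Director shell (the hostname below is a placeholder):
nc -zvw3 <analytics-node> 8443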
The "cluster installation script" available starting 20.2.x/21.1.x and above, takes care of both the above steps, and it's recommended that cluster installation be performed using the cluster installation script.
Please refer to the documentation below
Refer to section below
Side note: Please be careful about using the --secure option; it is only needed if you want to enforce analytics hardening/security when your cluster is exposed to the public domain
Ideally, the Director UI should be able to access Analytics after the above installation. If you see any errors during the installation, or if the Analytics UI is not accessible via the Director UI (typically an error "there is a problem logging into analytics" pops up while trying to access the UI), follow the troubleshooting steps below
Step 1: Please validate that the certificate of the Director is installed on the Analytics using the script below; this script is present in the /opt/versa/vnms/scripts directory on the Director (run it from the Active VD)
Note: You will need to run the below script using 'versa' user as below
sudo su versa
You will need to enter the Analytics cluster name (get it from the Director UI Administration/Analytics-Cluster tab)
In the above output, a successful match returns "MD5 hash matches" as the result
If the above script reports that the MD5 sum does not match, please go to step 2
Step 2: Sync the certificate from the Director towards the Analytics as below
Note: Below script needs to be run as "versa" user
versa@Director:~$ /opt/versa/vnms/scripts/vnms-cert-sync.sh --sync
You would need to provide the cluster name as the input (the cluster name can be located on the Director UI under Administration/Analytics-Cluster)
Step 3: Print the certificate on the Director and confirm that the CN and SAN values are used in the /etc/hosts mapping on the Analytics.
Print the certificate on the Director as below
Check the CN and SAN on the Director certificate
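If you prefer to inspect the certificate file directly with openssl, the below generic commands print the CN and SAN (the path is a placeholder; use the actual Director certificate file):
openssl x509 -in /path/to/director-cert.crt -noout -subject
openssl x509 -in /path/to/director-cert.crt -noout -text | grep -A1 "Subject Alternative Name"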
Check the /etc/hosts files on the Analytics nodes (all the analytic and search nodes in your cluster). Make sure that the "names" mapped in /etc/hosts are exactly the same as the CN and SAN names present in the certificate and that the "ip address" mapped is accurate
Step 4: Verify that ports 9182 and 9183 on the Directors are accessible from the Analytics
Login to the shell of the Analytics node and execute "nc -zvw3 <director> 9182" (also check 9183)
Note: Replace <director> with the name present in the /etc/hosts file for the directors
If you get a "connection refused" (or any other error) while executing the above, then please check and make sure there are no firewalls blocking 9182/9183 access and that routes are present to access the director (check "route -n" to confirm the routes on Analytics and Director) - try to ping the Director from Analytics to ensure routing is fine.
Important side note: If you are using a wildcard certificate, for example with CN as *.utt.com, make sure that you use the full domain name while creating the mapping in the /etc/hosts file (for ex, director1.utt.com and director2.utt.com)
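For example (the addresses below are placeholders), the /etc/hosts entries on each Analytics node would then look like:
10.192.21.10   director1.utt.com
10.192.21.11   director2.utt.com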
Step 5: Execute the below steps to ensure that Analytic certificates are installed on the Director
[Content taken from https://docs.versa-networks.com/Getting_Started/Release_Notes_for_Secure_SD-WAN/Release_Notes_for_Secure_SD-WAN_Release_21.1/02_Versa_Analytics_Release_Notes_for_Release_21.1 ]
In an HA Director setup, you should select "y" to postpone the restart as mentioned above. You can then schedule a time to restart the services on your Directors; follow the below steps to do so
You will need to login to the Standby/Secondary VD and perform "vsh stop" to stop the services on the Standby VD
Post that you will need to login to the Active/Primary VD and perform "vsh restart" to restart the services on the Active VD
Post that, login to the Standby Director again and perform a "vsh start" to bring up the services on the Standby VD
You can execute the below on the Active and Standby Directors to ensure that the HA is in sync
<on the cli>
request vnmsha actions check-sync-status
Step 6: After the above steps are in place, try "revoking" and "re-registering" the directors as below by directly accessing the Analytics UI (https://<ip-address> or http://<ip-address>:8080) of any one of the Analytic nodes
Ideally, the registration should succeed
If the registration fails, please capture the below from the Analytic node on which you attempted the registration
sudo su
cat /var/log/versa/tomcat/catalina.log
Or if you continue to face issues connecting to Analytics from the Director please follow the below steps
Execute the below on the shell of the analytic node whose UI you are trying to connect to and on the Director node (enable logging on both putty terminals), then perform multiple attempts to connect to the analytics UI from the Director (as the affected user). As soon as you hit a failure, press Ctrl+C on both terminals and attach the logs to the TAC ticket.
Shell of analytics node:
sudo su
tail -f /var/log/versa/tomcat/*.log
Shell of Director node:
sudo su
cd /var/log/vnms/spring-boot/
tail -f vnms-spring-boot.log /var/log/vnms/web/*.log
Please attach this output while opening the TAC case.
If you need to open a TAC case, please attach all the outputs collected from Step 1 to Step 6. If you have executed the "cluster installation script", please also attach the entire outputs of the script along with the outputs from Step 1 to Step 6
Procedure to update a WAR file
Sometimes a WAR file is provided by engineering as a patch fix; we can update the WAR file using the below procedure
Take a backup of the existing WAR file:
sudo su
mkdir /tmp/backup-war
mv /opt/versa/bin/versa-1.0.war /tmp/backup-war/
Procedure to update the new WAR file:
cd /opt/versa_van/apps/apache-tomcat/webapps/
pwd
The below removals SHOULD BE performed ONLY inside the following directory:
/opt/versa_van/apps/apache-tomcat/webapps/
rm -rf ./*.war
(NOTE: Only at: /opt/versa_van/apps/apache-tomcat/webapps/)
rm -rf ROOT
(NOTE: Only at: /opt/versa_van/apps/apache-tomcat/webapps/)
Copy the new versa-1.0.war file (from the download location) to each of the cluster nodes and place it under the /opt/versa/bin location
cd /opt/versa/bin
chmod 664 versa-1.0.war
chown root:versa_priv versa-1.0.war
ls -lrth << you should now see the new versa-1.0.war file here
Perform: vsh restart << one node at a time
Verify operations on the UI of this node
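Optionally, assuming tomcat re-expands the application on restart, you can confirm that the webapps directory was repopulated after the "vsh restart":
ls -lrth /opt/versa_van/apps/apache-tomcat/webapps/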
Rollback Procedure:
If you face any issues, you can rollback to the old WAR file as below
cp /tmp/backup-war/versa-1.0.war /opt/versa/bin/
vsh restart
Procedure to Install Self-Signed Cert on Analytic/Search nodes
You can install a self-signed certificate on the analytic/search nodes by executing the van-cert-install.sh script as shown below
[root@analytics3: certificates]# cd /opt/versa/var/van-app/certificates
[root@analytics3: certificates]# mkdir tempcerts
[root@analytics3: certificates]# mv versa_analytics* tempcerts/
[root@analytics3: certificates]#
[root@analytics3: certificates]# cd /opt/versa/scripts/van-scripts/
[root@analytics3: van-scripts]# sudo ./van-cert-install.sh
[root@analytics3: certificates]# cd /opt/versa/var/van-app/certificates
[root@analytics3: certificates]# ls -lrth | grep -i analytics <<<< you will see new cert created