Introduction
Before getting into the troubleshooting aspect, it's important to have a basic understanding of Analytics architecture - which is what this section aims to cover.
An Analytics node performs 4 functions
1. Log collection and storage
2. Log processing and ingestion in the Database
3. Database management (retention, deletion, syncing of records)
4. Web interface which retrieves data from the Database via API calls
Analytics nodes have 2 personalities
1. Analytic personality
2. Search personality
The personality is defined by the type of logs that are stored in the respective Database.
"Analytic personality" nodes store the "aggregate stats" which are generated every 5 mins by the branch - there are different types of aggregate stats, for ex
bwmonlog - aggregates sdwan, DIA and access-circuit usage
monstatlog - aggregates user and application stats
intfutil - aggregates wan utilizations
qoslog - aggregates qos statistics
slamlog - aggregates sla metrics
All of these logs are stored in the Database of the "Analytic personality" node, where they are subject to the retention limit and summarization (min/hourly/daily) as required. The default retention period is 3 months of "daily" logs and 1 month of "hourly" logs, so you can check hourly granularity for these aggregate stats for a period of 1 month.
"Search personality" nodes store the real-time logs that come in the form of alarmlog, accesslog (firewall rule hits), flowmonlog (traffic-monitoring rule hits), dhcplog, cgnatlog (cgnat rule hits), AV/IDP/threat logs etc
You can also set up a "Log forwarder" node which performs log-collection and processing but ingests these logs to a remote database (the analytics cluster is the remote database for the log-forwarder). So, the "log forwarder" does not have its own local Database.
For the sake of convenience, we will call a regular "Analytic/Search personality" node a "Cluster Node" to differentiate it from a "Log-forwarder node"
A cluster node's architecture is as below
- The LEF logs are received by the "log-collector" task (LCED) running on the cluster-node or log-forwarder
- The Logs are parsed, tenant/appliance-name is determined, and placed under /var/tmp/log/tenant-x/VSN0-branch-x directory
- The Analytics-driver module (versa-analytics-driver) processes these logs and ingests them into the Database
A Log-forwarder node setup is as below. As can be seen, a log-forwarder does not have a local DB; it has to use a "Cluster node" as the remote DB. Also, the "Cluster node" in this case does not have to perform the "log-collector" role
The branch sends these logs to Analytics either directly or via controller ADC load-balancing. The more common design is the latter, where the branch sends logs to the controller's ADC VIP and the controller further NATs this traffic, using the controller's egress interface as the source IP and the actual log-collector as the destination IP.
So essentially the data-path involved in transporting the logs, sent by the branch, to the Analytics cluster is as below.
In the rest of this document we will discuss the various problematic scenarios that are commonly encountered in the context of Analytics and pointers to troubleshoot the same
Unable to observe Analytic or Search logs
If, while checking the dashboard/GUI of Analytics, you are unable to see the logs for a certain branch while being able to see the logs/stats for other branches in the same tenant, please follow the below steps to troubleshoot the issue
1. Check the "date" on the branch and confirm if it's shows the current time, the timezone doesn't matter, what's important is that the branch should show the current time
If the date/time on the branch does not reflect the current time, you should either configure "NTP" sync or set the date/time manually as below (setting the time-zone is optional)
1) Set time-zone
sudo timedatectl set-timezone Africa/Cairo
2) Set time
sudo date --set "6 Apr 2021 14:14:00"
Wait for approx 20 mins and check the GUI again to confirm if the logs show up
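If you configured NTP, a quick way to verify that the clock has actually synchronized is the below (a minimal check, assuming systemd's timedatectl is available, as it is used in the commands above)
sudo timedatectl status    # look for the line indicating the system clock is synchronized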
2. Check the LEF connection from the branch to the log-collector. There can be more than one LEF connection on the branch; at least one of them should be in established state
If the status shows up as "re-connect", it indicates an issue with the LEF connection where TCP session is not established successfully
If you've enabled ADC load-balancing on the controller (which is the common design), you can check if the nat session is present in the parent org of the controller using the branch's source ip
Make a note of the "source port" that's assigned post the translation, the "nat-destination-ip" is the address of the log-collector towards which the tcp session is being sent - this is the log-collector which will receive this branch's logs
admin@Controller1-cli> show orgs org D1-DIOS sessions nat brief
Access the shell of the concerned log-collector (the nat-destination-ip) and check if the connection is received there, as follows
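A minimal check from the log-collector shell (the IP and port below are placeholders - use the controller egress IP noted from the NAT session above and the collector port configured in your setup)
sudo netstat -antp | grep <controller-egress-ip>
sudo netstat -ant | grep ":<collector-port>" | grep ESTABLISHED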
If you don't see a connection on the concerned log-collector, it would indicate a routing issue between the controller and the log-collector (check if the L2 or L3 devices in between are dropping the packets)
3. Check "netstat -rn" on the log-collector and confirm there is a return route for the controller's south-bound subnet (in the case of ADC) or the branch's subnet (in the case direct connection from branch to log-collector)
4. Check the /var/tmp/log/tenant-x/backup/VSN0-branch-x folder to confirm if logs are being received
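For example (using the same path placeholders as the step above), recent timestamps here indicate that logs are being received and processed
ls -lrt /var/tmp/log/tenant-x/backup/VSN0-branch-x | tail -5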
Execute the below and confirm if versa-van-driver is processing logs (ideally you should see the below output, with logsRate showing up per tenant; if you see it stuck at "validating cluster", it's in an error state)
tail -f /var/log/versa/versa-van-driver.log
Instead of the above logs, if you see errors or if you see it stuck at "validating cluster", please follow the below steps
check the contents of the below file
cat /opt/versa/scripts/van-scripts/vandriver.conf
You can also compare the vandriver.conf file for an existing/working node with the non-working node and make sure the contents are similar
Execute the below on the non-working node
ping <ip> (try pinging all the addresses mentioned under DB_ADDRESS and SEARCH_HOSTS in the vandriver.conf file)
You should be able to ping all the addresses; if not, please fix the reachability issue
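To check all of them in one go, a small loop sketch is below (it simply extracts every IPv4 address found on the DB_ADDRESS and SEARCH_HOSTS lines - adjust it if your conf uses hostnames)
for ip in $(grep -E "DB_ADDRESS|SEARCH_HOSTS" /opt/versa/scripts/van-scripts/vandriver.conf | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' | sort -u); do
    ping -c 2 -W 2 "$ip" > /dev/null && echo "$ip reachable" || echo "$ip NOT reachable"
done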
Execute the below
cqlsh -u cassandra -p cassandra <address> (replace <address> with each ip in the DB_ADDRESS list one by one)
Check if cqlsh succeeds towards each of the ip-addresses listed in DB_ADDRESS (you should get to the cqlsh prompt). If you get any errors, there may be a port block or the cassandra service may be down on that node - the cassandra failure on that node will need to be fixed (refer to the Fusion troubleshooting further below)
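To quickly separate a port block from a service failure, you can probe port 9042 (the Cassandra CQL port that cqlsh uses) towards each DB_ADDRESS entry; a successful probe with a failing cqlsh points at the cassandra service, while a timeout points at a port block
nc -zvw3 <address> 9042    # repeat for each ip in the DB_ADDRESS list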
Once the above checks are done, if you still see the logsRate as 0, please check if the local-collector configuration is proper - the collector address, port and directory should be properly configured
Execute the below to validate that you have "active connections" on this node
Finally, check the below and confirm if the logsRate is positive (you should see it processing x logs at a positive logsRate)
tail -f /var/log/versa/versa-van-driver.log
5. Check the configuration on the cluster-nodes, or log-forwarders, to confirm if the collector address is configured correctly (the ip-address here should be the address of the interface that's supposed to get the logs)
6. Confirm that the branch's VoS version matches the version on Analytics or is "lower" than it. If the branch version is higher than the version on Analytics, the logs may get dropped on Analytics (unable to parse the logs)
7. If specific tabs don't show up data (for ex, you don't see data under SLA-metrics, Application or User tabs), please ensure that the specific data is turned on in the "data configuration" section as seen below
8. Please ensure that the specific data is being sent by branch, as below
Login to the concerned branch cli, go to the shell mode and connect to vsmd prompt
shell
vsh connect vsmd
Check to confirm the "active collector" (this is the collector to which the logs are being sent actively)
Now check the statistics against this collector and confirm if the concerned stats are incrementing. Below are the category of logs which correspond to the tabs seen on the UI
bwmon = sdwan usage and access-circuit usage stats
mon_stats = application and user stats
b2b_slam = SLA metrics and violation stats
acc_ckt_cos = QOS stats
intf_util = system/wan interface logs
flow_mon = traffic-monitoring logs (search log)
access-policy = firewall logs (search log)
alarm_log = alarm logs (search logs)
Execute the command again after a gap of 10 mins and confirm if the counters of the concerned logs are incrementing
9. Perform a sanity check on the configuration
Make sure lef logging is enabled and set to the "default logging profile" under the concerned policy/rules
For Security src/dst stats below should be enabled
For monstats (application/user stats) and bwmon stats below should be enabled
Below should be in "checked" state
10. Ensure that the "settings" configuration on the UI has the entries for all the cluster nodes in the relevant boxes. "Driver hosts" should have entries for all the nodes in the cluster, "Search hosts" should have entries for the search nodes and "Analytics hosts" should have entries for the analytic nodes. Check the UI of all the nodes in your cluster for this configuration
Also, check "status" page to confirm if all the cluster nodes display status as "UP" as seen below
11. Validate if the search logs are enabled (in "on" state), click on "save" once to ensure that the configuration is pushed to the nodes
12. Check the "alarms" section to confirm if there are any alarms pertaining to "global daily limit" or "tenant limit" breaches, in which case you will not be receiving search logs till the 24 hr time-block is complete (or you can increase the threshold limits in the "data configuration" section)
13. Look at the section "performing a sanity check on the DB" further below in this KB and perform the checks to ensure that the DB status is fine
14. Perform a "vsh restart" on all the nodes in the cluster to clear up any transient state
15. You can generate a test alarm (it will not impact the node, it will just generate test alarm towards analytics) as shown below on one of the branches (please validate the check provided in step 2 above before proceeding)
shell
vsh connect vsmd
vsm-vcsn0> test vsm trap interface down vni-0/0
vsm-vcsn0> test vsm trap interface up vni-0/0
16. Wait for 5 mins after performing the above step and then login to all the cluster nodes (or log-forwarders) in your setup and check the below on their shell
sudo su
cd /var/tmp/log/tenant-xyz/backup/VSN0-<branchname> (replace xyz with the actual tenant name and "branchname" with the name of the branch on which the above test was performed)
grep -i "alarmlog" *
Confirm if the alarm is displayed here
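If you are not sure which tenant or collector received the alarm, this one-liner (run on each collector) searches all the backup folders at once
grep -ril "alarmlog" /var/tmp/log/tenant-*/backup/VSN0-*/ 2>/dev/null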
17. Check on the UI again and confirm if the alarm shows up (make sure you are checking in the correct tenant)
Issues with Scaling Analytics
There can be various issues, with respect to loss of logs and DB/service failures, that arise owing to scaling issues on Analytics. The basic scaling guideline is as below (refer to the link)
Hard-disk as a scaling limitation
For production cluster-nodes, the usual recommendation is to use 1TB or 2TB hard-disk to accommodate the Database (the DB can sometimes extend by 40-50% of its original size during compaction). You can check the hard-disk size as below
It's important to monitor the disk usage periodically to ensure that it doesn't cross 80% (you can monitor the resources on the Analytics gui)
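If you prefer a shell-level check in addition to the GUI, below is a minimal sketch that flags usage above 80% (it assumes the data sits on the root filesystem - adjust the mount point if your deployment uses a separate data partition)
df -kh /
usage=$(df --output=pcent / | tail -1 | tr -dc '0-9')
[ "$usage" -ge 80 ] && echo "WARNING: disk usage at ${usage}%"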
Also monitor the alarms section for any errors relating to memory breach
You can set the threshold at which alarms are generated as below
Please note that these analytic alarms are not sent to the director, you will have to monitor them locally - there may be a feature to export these alarms to a remote server in the later releases.
Check the size of the Database from the shell of the cluster-node as below
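For example (both outputs are referenced in the following paragraphs)
nodetool status    # shows the load per node under "Datacenter: Analytics"
sudo du -sh /var/lib/cassandra/data    # size of the DB files on disk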
Under "Datacenter: Analytics" you can see the analytic database load, you can check the same under /var/lib/cassandra/data where the DB files are stored
In "dse" based Database (you know if the database is on DSE if you see a valid output for "dse -v", it would show some version like 4.5 or 4.8) you can check the load of the search database in the same way as above, by checking the "Datacenter: Search" in the output of "nodetool status"
In "fusion" based on Database, you can check the load using "vsh dbstatus"
If the Database disk usage becomes a limitation, you can increase the disk size by adding more disk space.
Adding more nodes to the cluster
You can also add more nodes to the cluster; for example, if you have 1 analytic node you can add 2 nodes, and if you have 2 nodes you can add 4 nodes.
If you have log-forwarders in your setup, you can add more log-forwarders to handle the increased volume of logs or increased number of connections (discussed further in sections below)
If you used the installation script to bring up the cluster (in 20.x, 21.x or 22.x), you can use the below KB to add new nodes to the existing cluster, or to add new log-forwarders
You can consult with the Versa PS/SE to help you scale your cluster
Database retention
The default retention is set to 90 days for Analytic data (daily data); you can reduce the retention limit to 60 days if your hard-disk is a limitation - this way you limit the amount of data stored in the DB
Similarly, the default retention value for search data is 3/7 days; if the value is set higher it will take up more DB space, so you should ideally set the retention limit to 3 days
You should also set a "global daily limit" to limit the number of search logs that are ingested into the DB per day - you don't want millions of logs ingested into the DB per day, as it would overwhelm the DB. A safe limit is 10 million, or 30 million in the case of a heavy volume of logs
The optimal storage limit for Search nodes is 100 million logs (per node), so you want to set the global daily limit and the retention period towards ensuring that you don't overwhelm the search node beyond this limit
For ex, if the retention for "Access logs" (firewall logs) is 10 days and your global daily limit is 30 million, and you end up receiving 30 million Access logs per day, it would essentially lead into 10*30 = 300 million logs in the DB, which is not optimal (if the are 2 search nodes in your setup, it can optimally accommodate 200 million logs)
You can also set the limit specific to tenant if you are aware of a tenant that's sending in a larger volume of logs compared to other tenants
You can check the log volume sent in per tenant as below
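As a rough shell-level approximation, you can also compare the size of each tenant's hourly archives (a sketch - it only reflects logs that have already been archived on that collector)
du -sh /var/tmp/archive/tenant-*/ 2>/dev/null | sort -h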
Please follow the best practices listed in the documentation below to ensure the Database is not overwhelmed by logs
Hitting max-connections on the log-collector
By default the configuration allows for 512 incoming connections as seen below
You can check the existing number of connections by checking the below output on all the log-collectors
Please note that each branch has multiple connections (depending on the number of LEF connections configured); however, it sends logs actively on just one connection. The other connections are "passive" but they still take up connections on the log-collector
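As an OS-level cross-check of the connection count (the port below is a placeholder - use the port configured on your local collector)
sudo netstat -ant | grep ":<collector-port>" | grep -c ESTABLISHED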
You can increase the number of connections on the log-collector, by modifying the max-connections in the configuration
Log volume issues
Note that the scaling limit is not with respect to max-connections but the volume of incoming logs. In the output of "show lced stats" you can check the global stats; if the number of logs being parsed every second is > 4000 (max/avg), it indicates a heavy volume of logs incoming on the log-collector
You can check the output of "show lced stats" a few times and determine the type of logs that are contributing to the log volume
flow_mon_v4_base = traffic monitoring logs (search logs placed in search DB)
access-policy = firewall logs (search logs placed in search DB)
mon_stats = application/user stats (analytic stats, placed in analytic DB)
Usually it's the flow-mon, access-policy or mon-stat logs that take up the log volume
If the log volume is high, it would lead to the build up of backlog (which means the rate of incoming logs is much higher than the rate at which logs are being processed by the versa-analytics-driver and pushed into the DB).
LCED parses the incoming logs and places them in /var/tmp/log/tenant-x/VSN0-branchx (respective tenant and branch name). The versa-analytics-driver processes logs from these branch folders and ingests them into the DB; post processing, it moves the logs to the /var/tmp/log/tenant-x/backup/VSN0-branchx folder. The log-archive cron job then compresses the logs in the "backup" folder into a .tar.gz file and moves them to the /var/tmp/archive/tenant-x/VSN0-branchx folder
Check the below to get an understanding of whether there are backlogs building up - if you see the "branch" directories with utilization > 200 MB it's a clear indication of a backlog buildup.
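For example, the following lists the heaviest branch folders (anything in the hundreds of MB, excluding the backup folders, points at a backlog)
du -sh /var/tmp/log/tenant-*/VSN0-* 2>/dev/null | sort -h | tail -10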
If you see a backlog buildup, please verify if the versa-analytics-driver is running - the driver logs should show a valid rate of processing and the entries should be "recent" (if the last log is several hours/days old, it means the versa-analytics-driver has stalled for some reason - please restart the driver using "sudo service versa-analytics-driver restart")
If the versa-analytics-driver is running fine and the backlogs continue to build up, you can try balancing the connections between the log-collectors. Check the "active connections" on all the log-collectors (using "show lced stats") - if one of the log-collectors is more heavily loaded than the others, you can try restarting the versa-lced service on that log-collector so that the connections are dropped and re-distributed to the other log-collectors
sudo service versa-lced restart
Or you can set the max-connections to a lower value (say 100) to shed the connections until the backlogs clear up, post which you can set the max-connections back to 512
Disk utilization issues
It's important to monitor the disk utilization on the cluster nodes and log-forwarders on a periodic basis - you can check this on the "admin/resources" section of the GUI, or you can set up a Versa Analytics platform monitoring server using the information below (it involves setting up a server with the Prometheus/Grafana applications); it can monitor the Analytics nodes and has the provision to set up email notifications
If you see the disk utilization going high on the cluster-node or log-forwarder, it's first important to understand the real cause of the disk utilization - there are four factors usually at play
- Archives, where the /var/tmp/archive folder takes up the Disk space
- Backup, where the /var/tmp/log/tenant-x/backup takes up the Disk space
- Backlogs, where the /var/tmp/log/tenant-x folder takes up the Disk space (without the "backup" folder)
- Database inflation, where the /var/lib folder takes up the Disk space (where DB records are saved)
Please execute the below to find out which folder is taking up the disk space
Archives:
cd /var/tmp/archive
du -sh .
Backup:
cd /var/tmp/log/tenant-x/backup (check for all tenants)
du -sh .
Backlogs:
cd /var/tmp/log/tenant-x
du -ch . (check the utilization against all branch folders /VSN0-branchx)
you can also execute the below to check the utilization per tenant folder, and per node (useful when you have multiple tenants)
du -ch /var/tmp/log/tenant-*/VSN0-*
Database:
cd /var/lib/
du -ch .
There are different solutions to address each of the above cases
Clearing up Archives
The /var/tmp/archive folder holds the raw logs compressed into .tar.gz files. This has nothing to do with the Database - the versa-analytics-driver processes the logs and ingests them into the DB, and at the same time moves the logs to the /var/tmp/log/tenant-x/backup folder; a log-archive cron job running every hour archives all these logs (in the backup folder) into a .tar.gz file and places it in the /var/tmp/archive folder
There is no inherent mechanism to automatically delete the files in the archive folder. This is because different customers have different uses for the archives - some customers want to retain the archive logs for auditing purposes, since these archives contain the flow-mon (netflow/traffic-monitoring), access-logs (firewall logs), cgnat logs, dhcp logs, threat logs etc., which are useful for auditing (especially for service providers or financial institutions). If you want to check a user's record for a specific date/time, you can grep for logs from the archive using the below
cd /var/tmp/archive/tenant-x/VSN0-branchx
zgrep -a -i "access" 202109* (to dump all the access logs from this branch for the month of Sep 2021)
zgrep -a -i "flow" 202109* (for traffic monitoring logs)
zgrep -a -i "alarm" 202109* (for alarm logs)
As you can see, archive logs are useful if you want to look into the "search" logs from a past date/time - it's not useful for analytics data unless you want to restore data from the archives back into the DB.
The below link has the details on how to delete or transfer or restore archive files as the need may be
In 21.2.1, we have GUI options available to manage the archives, however the same can be accomplished from the cli as well as the shell.
You can refer to the KB below if you want to delete archive files for a specific period of time
You can refer to the KB below if you want to set up a cron job to delete archives periodically
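As an illustration only (the KB is the authoritative reference), a cron entry of the following shape deletes archives older than 30 days - the schedule and the 30-day retention are assumptions, adjust them to your audit policy
# /etc/cron.d/archive-cleanup (example entry - the user field is required in cron.d files)
0 2 * * * root find /var/tmp/archive -name "*.tar.gz" -mtime +30 -delete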
Or, if you just want to clear up the space held by the archive folder on an immediate basis, you can simply delete all the files stored under /var/tmp/archive using the below
cd /var/tmp/archive
rm -rf *
To re-iterate, archives are not related to "database" logs. The archives are raw logs which are stored in .tar.gz files - they do not have a default retention period, so the archives will stay around until they are deleted - whereas the Database has these logs stored in its tables for a specific duration of time, 90 days by default for Analytics and 3/7 days for search logs.
Deleting the archives does not affect the "Database" logs
Clearing up the backup folder
The backup folder should only retain the logs for an hour before they are moved to the /var/tmp/archive folder by the log-archive crond job.
If the log-archive cron job is not running or if there is an issue with the archiving, it can cause a build up of logs in the backup folder.
Please check the below
tail -f /var/log/versa/versa-log-archive.log
If the log-archive job is running properly you will see logs as above for the past hour.
If the log-archive job is not running you will not see any logs or you would see logs from an old date/time
If the log-archive job is running but unable to archive, you will see "starting archive" logs every hour but won't see the above logs showing branch folders being archived or "finished log-archiving" log
For the above two problematic scenarios, you can execute the below steps to recover from the condition
cd /var/tmp
ls -lrt | grep archive
<sample>
[versa@DEL-VAN01-SRV: tmp] # ls -lrt
total 44
-rwxrwxrwx 1 root root 5606 Oct 26 2017 postinst.sh
-rwxrwxrwx 1 root root 3600 Oct 26 2017 postinst-utils.sh
-rw------- 1 root root 0 Oct 30 2018 nohup.out
-rw-r----- 1 root root 0 Oct 23 14:00 logarchive.pid
If you see a logarchive.pid from an old date, delete the same as below
rm -rf logarchive.pid
Now go to the /etc/cron.d folder and delete the log-archive cron job (if it's present). If it's not present, it would mean that the log-archive cron did not get created and archiving has never been activated (while bringing up a log-forwarder you have to manually instantiate archives - details further below)
cd /etc/cron.d
ls -lrt | grep log
rm -rf log-archive
Instantiate a new log-archive cron job as below
sudo su
cd /opt/versa/scripts/van-scripts
./log-archive-start /var/tmp/log /var/tmp/archive hourly
Check if the log-archive cron has been created as below
cd /etc/cron.d
ls -lrt
Check the /var/log/versa/versa-log-archive.log after an hour to see if the archiving has started
Clearing the Backlogs
Backlog build-up refers to logs building up in the /var/tmp/log/tenant-x/VSN0-branchx folder. There are four common reasons for backlog build-up
1. The incoming log volume is much higher than the rate at which the versa-analytics-driver can process the logs
2. The versa-analytics-driver is stuck
3. Sufficient resources are not allocated to the analytics node (mostly in the case of VMs)
4. Too many active sessions pinned to one log-collector
Discussion on each of the above points is as below
Point 1:
For point 1, refer to the "Scaling" section discussed above in this document; the log volume can be reduced by following the best practices for scaling the logs and also by re-distributing the connections to other log-collectors
Point 2:
For point 2, check /var/log/versa/versa-van-driver.log to see if the logs show up against the current time or if they are stuck (at an old time) - if they are stuck, you can try restarting the versa-analytics-driver as below
sudo service versa-analytics-driver restart
Point 3:
For point 3, please check if enough cores and memory are assigned to the VM in line with the recommendation (refer to the section above on "Issues with scaling Analytics"). Also, in the case of a VM please ensure the below
- hyperthreading should be disabled on the host server on which the VM is activated
- The cores assigned to the VM should be dedicated (1:1) there should not be any over-subscription of cores on the server (for example, if the server is 32-core and there are 3 VMs each assigned 16-cores, there would be an over-subscription since cores would be shared between VMs instead of being dedicated)
- The disk should be SSD (HDD disks can result in slow/sluggish DB read/write issues)
- The disk controller setting on the host server should be SCSI and not IDE
Below are the most common symptoms seen when the VMs are not able to access sufficient cpu cycles or ram from the host server
- When you execute "vsh dbstatus" there is a huge lag in displaying the entire output
- Cli/Shell access is found to be slow
- You see a huge lag in the time taken to process logs while checking /var/log/versa/versa-van-driver.log. For example, let's say in the output below you see logs=3000 and time=10 secs - it would indicate a definite sluggishness; usually 3000 logs should be cleared out in < 1 sec
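A quick way to check for resource contention from inside the VM is the below (a minimal check; the "st" steal and "wa" iowait columns staying near 0 suggest the host is not starving the VM)
vmstat 5 5        # watch the "st" (steal) and "wa" (iowait) columns
top -bn1 | head -5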
You can get an idea about the amount of backlogs by checking the utilization of the /var/tmp/log/ folder
cd /var/tmp/log
du -sh .
If the value is in gigabytes (say >10 GB), it would indicate a significant build-up of backlogs (it could also be because the "backup" folder is building up owing to an error/failure in the archive cron - refer to the section above on clearing up the backup folder)
You can determine the tenant that has the most backlogs by checking the below
cd /var/tmp/log
du -ch --max-depth=1 .
You can then access the tenant directory and figure out which CPE(s) are contributing to the backlogs
cd /var/tmp/log/tenant-<tenant-name>
du -ch --max-depth=1 --exclude=backup .
For example, if you see >1G against a CPE it would indicate backlogs (built up for several hours)
Check the specific CPE's folder and determine if there are files that have >3M of data and if you consistently see such files being created
cd /var/tmp/log/tenant-xyz/VSN0-<branch-name> (tenant-xyz - replace xyz with the tenant-name)
ls -lrth | less (use this to check the oldest date of logs that are present in the folder)
Look for files that are > 3M in size and check their contents as below
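A rough way to see which log category dominates a large file is the sketch below (the file name is a placeholder, and the keywords are simply the log-type strings discussed next)
find . -name "*.txt" -size +3M | head -5
for key in monstatlog accesslog flowmonlog alarmlog; do
    echo -n "$key: "; grep -ci "$key" <large-file>.txt
done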
Check the log-type that's most prevalent in the above output, the likely logs are below
monstatlogs - These logs contain application/user stats in the form of "session" stats; they carry the src/dst IP and application information pertaining to all the sessions on the branch. Hubs, or branches, which are subject to a huge number of sessions (say >50K) will end up generating a huge volume of monstatlogs
The below setting (which is the default) generates the monstatlogs with the session aggregate stats. This configuration can be disabled to cut down the monstatlog volume, but note that doing so will cause application/user stats to no longer be visible on Analytics
In 20.x there is a feature to generate top-N session logs instead of generating logs for all the sessions (as was done in 16.1R2), which reduces the volume of monstatlogs - by default it's top 50 in 20.x as seen below
accesslog/flowlog - these are the firewall logs and traffic-monitoring logs respectively; these logs can be voluminous, again depending on the number of sessions and the number of rules that have logging enabled (please refer to the section on "Scaling Analytics" above - one has to be judicious in enabling logging on the firewall rules; if you enable logging on the default rule, all logs would be sent over to Analytics, causing a huge influx of accesslogs)
Sometimes the backlogs can build up to a level where it becomes difficult to clear them (let's say the versa-van-driver was stuck for a few days and the backlogs kept building), for example if the /var/tmp/log directory utilization (du -sh .) is >100 G.
If there are a lot of backlogs to be processed, it's a good idea to set the max-connections to 0 so that this node does not have to process any new logs
<sample>
versa@Analytics16% set log-collector-exporter local collectors collector1 max-connections 0
versa@Analytics16% commit
However, for backlogs as huge as 100G or above, it can take 48-72 hours to clear up even if there are no issues with the cpu resources available to the versa-van-driver
The other option is to delete all the backlogs and start afresh - you would indeed lose out on the logs/stats for the period of time for which the logs have been deleted, but the advantage is that you will start receiving the current logs
If you are looking to "delete" the backlogs, please follow the below procedure
- "vsh stop" to ensure there are new logs incoming and the driver is not processing the logs in the /var/tmp/log directory
- Execute the below on shell of the node. The below execution will delete all the backlogs present in /var/tmp/log, under all the CPEs, for this year (2021). It can take a few minutes for this execution to complete depending on the backlog volume
find /var/tmp/log/tenant-*/VSN0-* -name "2021*.txt" -delete
You can work with different variations of the above cmd, for example if you want to delete the backlogs from a certain branch - you can execute the below
find /var/tmp/log/tenant-ABC/VSN0-XYZ -name "2021*.txt" -delete (where ABC is the tenant and XYZ is the branch)
Point 4:
It's important to ensure that a single log-collector does not get hogged by active connections. Controllers have the ADC load-balancer configuration, where the log-collectors are added as seen below
Make sure all the log-collectors are "UP" (reachable) and are configured as "enabled" - so that the controller can load-balance the incoming connections towards these collectors
Log in to each of the log-collectors and check the active sessions as below
Check the "Clients active" counter on all the log-collectors and confirm if there are no log-collectors which are taking a relatively higher volume of clients.
For ex, if you see a log-collector taking 300 connections while other log-collectors are at 10 or 50, you can offload connections from the heavily loaded collector by executing the below on that log-collector
sudo service versa-lced restart
The above will restart the lced service (which caters to incoming connections); this will cause the connections on this log-collector to be cleared (and dispersed to the other log-collectors). The restart usually takes a second to complete, so new connections will be re-created on this log-collector too.
You can also use other methods, like disabling a highly loaded log-collector on the controller (hit the "disable" check against the log-collector under the ADC lb server configuration), or setting max-connections to 0 in the cli configuration of the log-collector for some time and re-enabling it once the backlogs have cleared up to a certain extent.
Make sure you set the max-connections back to 512 after the backlogs have cleared up and disk utilization has returned to normal
Note: the "active connections" that you see on the log-collectors include the "passive" connections along with the connections which are used to send logs actively. Each CPE sends logs actively on a single LEF connections, while the other LEF connections are passive (but they show up as "active connection" the log-collector and they take up a connection).
When you see "active clients" on a log-collector you can't make out if the connection is actively recieveing logs or if it's a passive connection.
An option is to enable "backup" LEF configuration on the CPEs, this way the passive connections are placed in"suspend" state and they don't take up a connection of the log-collectors and it will be easier to know if a log-collector is actually inundated with connections that actually take up connections
Clearing the Database overload
Database load is dictated by two factors
1. Incoming log volume
2. Retention period
Few best practices to manage the Database load
1. You should look to control the log volume using the best practices guidelines mentioned in the documentation link provided in the "Scaling" section above.
2. The hard-disk should be large enough to support the production load; we recommend a 1TB/2TB hard-disk for the same - please refer to the "Scaling" section
3. The retention period by default is 90 days for Analytics data and 3/7 days for Search data (you can check the "admin/settings/data-configuration") - be very cautious about increasing these values as it has a direct impact on the Database inflation, especially if you already have an insufficient hard-disk
4. On Analytics you can check the /var/lib/cassandra/data folder to check the table(s) that are most voluminous and look to "truncate" tables that are not really necessary - more details on this further below
5. Take a judgment call on increasing the number of nodes in the cluster to meet the requirements of your network. If your "search DB" load is not much (let's say you've not enabled a lot of firewall logging or traffic-monitoring), you can just look to increase the "Analytic personality" cluster-nodes - it's best to increase it by a factor of 2. Please consult Versa SE/PS for such projects as they can provide a clear evaluation based on your network size.
6. Make sure the vandb-repair cron job is present under /etc/cron.d directory - this cron job executes a repair function to ensure proper clearing of truncated records.
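A quick way to confirm the repair cron job mentioned in point 6 is in place:
ls -l /etc/cron.d/ | grep -i vandb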
You can check your DB load by the following method
1. Check "nodetool status", this will give you an idea of the DB load on "Analytics personality" nodes (as well as "Search personality" node in the case of DSE schema)
2. Check the size of the /var/lib folder (either /var/lib/solr/data or /var/lib/cassandra/data would hold the table data as the case may be)
sudo su
cd /var/lib
du -ch .
Clearing up Analytics DB
Whether it's DSE or Fusion schema, the "Analytics DB" is Cassandra, and you can check the DB tables under the below folder
sudo su
cd /var/lib/cassandra/data/van_analytics
ls -lrt
The tables that usually take up space are as below, also mentioned is the GUI context that are populated by these tables
tenantsrcfacts - Security/Firewall/Source
tenantdestfacts - Security/Firewall/Destination
sdwanappsubscriber - Sdwan/Sites/Application/Users
sdwansite2siteslapathstatus - Sdwan/Sites/Sla metrics
sdwansite2siteslam_1 - Sdwan/Sites/Sla metrics
sdwansite2siteslamrt2 - Sdwan/Sites/Sla metrics
sdwansite2siteslaviolation - Sdwan/Sites/Sla violations
You can turn off a table using the below method; once done, there will be no further DB table population for this type of data (please note that this means you can no longer check the GUI data for these tables)
You can also look to reduce the retention period if you don't really need 90 days of analytics data or 7 days of search data
Note: Changing the "retention period" will only impact the new data
The fastest way to clear up the DB load is to "drop" (truncate) a table - when you truncate a table you will lose all the data for that table from the current time back to the "retention time"; you will only be able to see new data that comes in.
Note: When you drop a table, it will just erase the data from the current time to the past retention-time (say 3 months worth of data), but it will continue to be populated with new data that's incoming
The procedure to drop a table is as below
1. Find out the tablename that you want to drop
sudo su
cd /var/lib/cassandra/data/van_analytics
ls -lrth
For example, let's say you want to drop tenantsrcfacts, the table name is tenantsrcfacts as can be seen below
[versa@Analytics16: van_analytics] $ ls -lrth | grep tenantsrc
drwxr-xr-x 3 cassandra cassandra 4.0K Sep 27 12:32 tenantsrcfacts-c6243c407b6c11ebac1977d4669a7d1e
2. Stop the services and disable compaction on the concerned table
vsh stop
sudo nodetool disableautocompaction van_analytics tenantsrcfacts
3. Login to the database
In dse:
cqlsh
In fusion:
cqlsh -u cassandra -p cassandra
4. Truncate the table - this will drop all the data from this table from current time to retention time
[versa@Analytics16: van_analytics] $ cqlsh -u cassandra -p cassandra
Connected to D5-VAN1 at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.0.9 | CQL spec 3.4.0 | Native protocol v4]
Use HELP for help.
cassandra@cqlsh>
cassandra@cqlsh>
cassandra@cqlsh> truncate van_analytics.tenantsrcfacts ;
5. Re-enable compaction on the table, clear the snapshots and start the services
sudo nodetool clearsnapshot
sudo nodetool enableautocompaction van_analytics tenantsrcfacts
vsh start
You can check the GUI to confirm if the data has been deleted (for ex, check Sdwan/Security/Firewall/Source tab to check if the data has been cleared up in the case of "tenantsrcfacts")
Check "nodetool status" to confirm if the memory has been cleared up, you can verify the same using "df -kh"
Clearing up Search DB
For all practical purposes, the easiest way to clear up the Search DB is to reduce the retention period, say from 3/7 days to 1 day, for all the relevant tables like Access logs, Cgnat, Flow logs etc.
The other option is to set a "global daily limit" - usually 10 million is optimal - so that the number of search records pushed into the Search DB does not exceed 10 million logs per day
You can also set limits against specific tenants
Note: Once you set the global daily limit, you will not be able to see any further search logs (such as firewall/flow-logs) once the global daily limit is breached for the day - the logs will only resume the next day.
Global daily limit follows the UTC time, so the limit gets applied at UTC 00:00 time each day
You can also consider 30M global daily limit if you have a large network with a huge volume of search logs, but make sure the Search DB is large enough to handle this influx (at-least 2TB in size)
You can also start afresh by deleting the entire "Search DB" to clear up all the past data (please check with TAC before undertaking the below step)
sudo su
cd /opt/versa/scripts/van-scripts/
cat vansetup.conf <<<<<<< verify if the vansetup.conf is setup properly
./vansetup.py <<< choose "y" when asked to confirm on "Delete DB"
Issues with DB - DSE/Fusion
The DSE based database was used previously; however, starting 20.x, Fusion based database support was incorporated, and over time the DSE based database will no longer be supported - the Fusion based database is open source and is used as the default DB schema during a fresh installation of the cluster in 20.x (or above)
Customers who are on DSE based database are urged to migrate to the Fusion based database - the basic steps involved in migrating from DSE to Fusion are as below
Note: To confirm if your Database is DSE based, just execute "dse -v" on the Analytics shell. If it returns a valid value (like dse-4.5 or dse-4.8.x), it means it's using the DSE schema; if it returns "no command dse found", it implies that the database is "Fusion" based
- If you are on dse-4.5 (as seen in "dse -v" output), you will first need to migrate to dse-4.8 before upgrading the image
- Please follow the below KB to migrate from dse-4.5 to dse-4.8
https://support.versa-networks.com/a/solutions/articles/23000019690
- Once you've migrated to dse-4.8 you can upgrade the image to 20.x or 21.x or 22.x
- Post image upgrade you can migrate from DSE to Fusion using the procedure below
https://support.versa-networks.com/a/solutions/articles/23000021015
Some of the common issues encountered with respect to the DB are as below
- Cassandra state of one, or many, nodes transitions from UN to DN (this can happen on DSE or Fusion)
- Solr failure (search node down) on Fusion
- Log-forwarder is unable to connect to database
Any issues with the DB can be determined by executing the below two commands
nodetool status << will work on DSE and Fusion on Analytic personality
vsh dbstatus <<< only works on Fusion
Performing a Sanity check on the DB
1. Check the status of the DB
DSE based Database:
For a DSE based database, you can execute nodetool status on all the nodes in the cluster and it should ideally show up as below; the status should be UN for the Analytic and Search personality nodes (in DSE both Analytic and Search nodes use Cassandra, hence their status can be checked via "nodetool status")
You should also be able to login successfully to the database from the shell of the cluster node as below
Fusion based Database:
For Fusion based database, you can execute "nodetool status" on the Analytics personality node, but not on the Search personality node (because Search nodes do-not use Cassandra database, they use a Solr database).
Instead, you can simply execute "vsh dbstatus" on all the cluster nodes (Analytic and Search nodes) and it should ideally return the status below
On Analytic personality you should see the below status - it will only list Analytic personality nodes (it won't show the Search personality nodes)
On Search personality node you should see the below status
Also check the below status on the search nodes; they should show up as "active" (all the nodes should ideally show up as "active" in the below listing)
You should be at the root prompt while executing the above (sudo su)
2. Check the memory usage (disk utilization) on the nodes
This is also covered in the disk utilization troubleshooting section above, basically the database files are stored in the "data" directory of the respective DB (cassandra or solr) as below
You can find the disk utilization of cassandra DB (Analytic personality) as below (same would also work for search personality in DSE based database)
cd /var/lib/cassandra/data
du -ch .
In Fusion based database you can find the utilization of Search DB as below, use the address present in "cat /etc/hosts" mapped against the search node's hostname in the below command.
[versa@Analytics17: ~] $ curl -u "cassandra:cassandra" 'http://localhost:8983/solr/admin/metrics?nodes=192.10.10.57:8983_solr&prefix=CONTAINER.fs,org.eclipse.jetty.server.handler.DefaultHandler.get-requests,INDEX.sizeInBytes,SEARCHER.searcher.numDocs,SEARCHER.searcher.deletedDocs,SEARCHER.searcher.warmupTime&wt=json'
If the DB usage is close to 70% of the disk usage, it's important to take actions towards reducing the DB size (discussed in the "disk utilization" section above.)
Troubleshooting DB issues
1. Analytic DB down in DSE/Fusion
2. Search DB down in DSE
3. Search DB down in Fusion
4. Zookeeper status down (in Fusion)
5. Adding a new node to existing cluster
6. Log-forwarder connectivity to the cluster
1. Analytic DB down in DSE/Fusion
The Analytic personality nodes use the "Cassandra" database, in DSE as well as Fusion setups. There are usually three issues seen on the Cassandra DB
- DB crashes owing to memory issues (not enough space left on the node)
- DB fails owing to a transient error
- Cluster status stuck in DN state owing to reachability issues
You can check "nodetool status" or "vsh dbstatus" (in the case of Fusion) on the Analytic personality nodes and if you see the status as DN
You can check the /var/log/cassandra/system.log file to check for any errors (attach the system.log file to the ticket if you open a case with Versa TAC)
If it's owing to a disk space issue (if the "df -kh" output indicates that the disk usage is at 70-80%, it's likely that the DB crashed because of lack of space while performing compaction/repair - a periodic task which usually causes the DB to swell up by 20-30% temporarily), you will first need to free up the disk space
Once you've cleared up the disk space issues, either by adding more disk or clearing up the disk space (as discussed in the "disk space utilization" section) you can re-start the DB as below
In fact, you can try the below steps if the DB state is "DN", irrespective of the reason, as a recovery step
For DSE:
(on the shell of the node)
sudo service monit stop
sudo service dse stop
sudo service dse start
sudo service monit start
For Fusion:
sudo service monit stop
sudo service cassandra stop
sudo service cassandra start
sudo service monit start
Wait for a few minutes (sometimes >20 mins) after executing the above and check the nodetool status or vsh dbstatus again
If the failure continues you can try the below
cd /var/lib/cassandra/commitlog
rm -rf *
cd /var/lib/cassandra/saved_caches
rm -rf *
<screenshot for reference>
The above steps will clear any pending commit on the DB (in case some corruption in the commits was causing the DB failure).
Once done you can execute the below steps again and check if the DB recovers
For DSE:
(on the shell of the node)
sudo service monit stop
sudo service dse stop
sudo service dse start
sudo service monit start
For Fusion:
sudo service monit stop
sudo service cassandra stop
sudo service cassandra start
sudo service monit start
Note: Avoid re-running vansetup.py (sudo /opt/versa/scripts/van-scripts/vansetup.py) on the Analytic personality nodes without consulting Versa TAC, and if you do end up running vansetup.py, make sure you don't type "y" when prompted for "delete DB"
To confirm if there are any reachability issues between the nodes in the cluster, try pinging the "listen" address of all the nodes and confirm that all the required ports are open between the nodes as mentioned in the below documentation
Especially confirm if 9042 port is reachable between the cassandra nodes using the below
nc -zvw3 <listen-address-of-peer-node> 9042
If you are unable to recover the DB failure with the above steps, you can raise a TAC case and attach the /var/log/cassandra/system.log
2. Search DB down in DSE
In DSE, the search DB uses Cassandra DB and the troubleshooting steps are similar to the above
3. Search DB down in Fusion
You can check the status of the search db in fusion by executing "vsh dbstatus" as mentioned in the above section for "checking DB sanity"
If status is not normal you will essentially see the output throwing up an error.
You can also check the cluster_status as below and check if nodes show up as "down" instead of "active" (or if any of the nodes are stuck in "recovering")
sudo su
cd /opt/versa/scripts/van-install
./cluster_install.sh solr cluster_status | python -m json.tool
The most common methods of recovery are as below
Step 1. Make sure you are able to ping the listen address of all the nodes and that all the required ports are open between the nodes in the cluster. Refer to the below doc (firewall requirements for Analytics) to know the ports which need to be open between the cluster nodes
You can use the below cmd on shell to check if a port is reachable
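nc -zvw3 <listen-address-of-peer-node> <port> (same syntax as used in the Analytic DB section above)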
Also, if any of the nodes are stuck in "recovering", please check if there is a time lag of more than 10 secs between the search nodes - you can open parallel putty sessions towards each of the search nodes and execute "date" on the shell of all the nodes at the same time, then verify if there is a time lag of more than 10 secs between any of the nodes
Please refer to the below KB before moving on to further steps
Step 2. Rolling restart of zookeeper followed by solr restart
Rolling restart of zookeeper means that you restart zookeeper on all the nodes in your cluster one-by-one (including the analytic personality nodes) as below
sudo service monit stop
sudo service zookeeper stop
sudo service zookeeper start
sudo service monit start
On the search nodes please execute the below to restart solr along with zookeeper
sudo service monit stop
sudo service solr stop
sudo service zookeeper restart
sudo service solr start
sudo service monit start
Check "vsh dbstatus" on the search nodes post the above activity and check if the state normalizes
If the failure continues please go to step 3
Step 3. Check the mapping in /etc/hosts and confirm if the search node hostname is mapped to the "listen" address and not the rpc address
If you want to know the "listen" address on the node you can check the below
cat /opt/versa/scripts/van-scripts/vansetup.conf | grep -i listen
ifconfig << the interface holding the listen address should be up
cat /etc/hosts
Make sure the local hostname is mapped to the "listen" address in this listing
If not, kindly modify the /etc/hosts file and ensure that the local hostname is mapped to the listen address
Move to step 4
Step 4. Try re-executing vansetup.py on the "search" nodes (please DON'T execute vansetup.py on any of the analytic personality nodes)
Open a putty/terminal for all the search nodes in your cluster
Go to the below location
cd /opt/versa/scripts/van-scripts
sudo ./vansetup.py
Note: execute this on all the search nodes in parallel
select "N" at the "delete DB" prompt (if you select "y" the search database will be deleted)
Check the "vsh dbstatus" post executing the above, if the failure continues move to step 5
Step 5. You will have to try deleting the search DB (existing collection) and re-execute vansetup.py in a bid to re-instantiate the solr DB
Note: The search DB usually holds just 3/7 days of logs (alarm/firewall/traffic-monitoring logs), so it should ideally not be an issue to delete the search DB. In fact, you can check all the search logs easily in the archives (which are stored on the disk - archives are not auto-deleted, so the archive logs stay around forever unless they are deleted manually or moved to a remote server)
You can access any search log from the archive folder as below; please check this on all the log-collectors in your setup (either all the log-forwarders or all the cluster nodes, because the logs from the branch in question could be on any one of them)
sudo su
cd /var/tmp/archive/tenant-xyz/VSN0-abc (where xyz is the tenant name and abc is the branch name in question)
zgrep -a -i "alarm" 202108* (dumps all alarms for Aug 2021)
zgrep -a -i "alarm" 20210810* (dumps all alarms for 10 Aug 2021)
zgrep -a -i "alarm" 2021* (dumps all alarms of 2021 year)
In a similar fashion you can dump firewall logs (accesslog) or traffic monitoring logs (flowlog) as below
zgrep -a -i "accesslog" 202108*
zgrep -a -i "flowlog" 202108*
As you can see the logs are user readable and in the same format as you see in the UI, so you can always dump the search logs that you want to check (alarms/firewall-logs) directly from the archive. So there is no harm in deleting the search DB as a part of recovery
To delete the search DB and re-instantiate solr please follow the below steps
1. Delete collection (you need to execute the below on just one search node - not all)
sudo su
cd /opt/versa/scripts/van-install
./cluster_install.sh solr delete_collection
./cluster_install.sh solr refresh_config
2. Re-execute vansetup.py on all the search nodes in parallel, as described in Step 4 above
3. Select "y" when asked at "delete DB" and "delete search DB" prompts
Post the activity, check "vsh dbstatus" and confirm if the status is normal, if failure continues move to the next step
Step 6. Clean up solr installation and re-install
Execute the below to delete the current solr installation (you can run this on just the search node which has the failure; it's not needed on all the search nodes)
sudo service monit stop
sudo service solr stop
sudo kill -9 $(ps -ef | grep solr | grep -v grep | awk '{print $2}')
sudo update-rc.d solr disable
sudo rm -rf /etc/solr*
sudo rm -rf /var/lib/solr
sudo rm /etc/init.d/solr
sudo rm /etc/default/solr.in.sh
Execute the below (select "y" when prompt for delete DB)
sudo /opt/versa/scripts/van-scripts/vansetup.py --force
sudo service monit start
Check the "vsh dbstatus" and confirm if the status is fine
If the failure continues please open a Versa TAC case and attach the below log file
root@Search1:/# locate solr.log | grep solr | grep log
/var/lib/solr/data/logs/solr.log
4. Zookeeper status down in fusion
Zookeeper is a service that runs on the Analytic and Search nodes; the zookeeper agents on these nodes act as the channel through which the solr/search nodes and the other cluster nodes identify and access each other.
Check the output of "vsh dbstatus" on all the nodes and confirm if there are any failures seen under zookeeper status
Make sure port 2181 is reachable between the listen addresses of all the nodes in the cluster
sudo su
nc -zvw3 <listen-address> 2181 (use the listen address of the peer node)
Note: the listen address is nothing but the address that you see in the output of "nodetool status" or "vsh dbstatus" or the output of "cat /opt/versa/scripts/van-scripts/vansetup.conf | grep -i listen"
Usually zookeeper is down owing to a reachability issue or because the interface holding the "listen address" went down
You can also try a restart of zookeeper on all the nodes and check if it rectifies the zookeeper status on the problematic node
sudo service monit stop
sudo service zookeeper stop
sudo service zookeeper start
sudo service monit start
Check the below output on all the nodes in your cluster
cat /opt/versa/scripts/van-scripts/vansetup.conf
Confirm that the zookeeper id is unique on each node and that zookeeper_node is set to true
If you see any discrepancy in the zookeeper id (if two nodes have the same id configured) or if the zookeeper_node is set to false - please capture these outputs and open a ticket with Versa TAC
You can also check the zookeeper logs as below
/opt/versa_van/apps/zk/logs/zookeeper--server-<hostname>.out
for the following line:
2021-11-09 00:23:38,394 [myid:2] - WARN [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):Follower@170] - Got zxid 0xf00000001 expected 0x1
This indicates that the zookeeper data is not consistent across all the nodes. To resolve this issue please follow the steps below:
- Stop zookeeper on all the nodes
sudo service zookeeper stop
- Clear the zookeeper data directory on all the nodes
sudo rm /var/tmp/zookeeper/data/version-2/*
- Start zookeeper on all the nodes
sudo service zookeeper start
Following the above steps, you can perform the regular Solr reinitialization steps:
sudo service monit stop
sudo service solr stop
sudo kill -9 $(ps -ef | grep solr | grep -v grep | awk '{print $2}')
sudo update-rc.d solr disable
sudo rm -rf /etc/solr*
sudo rm -rf /var/lib/solr
sudo rm /etc/init.d/solr
sudo rm /etc/default/solr.in.sh
sudo /opt/versa/scripts/van-scripts/vansetup.py --force
Check the "vsh dbstatus" post this step to verify the status
Important note: While upgrading to 21.2.x, the upgrade script modifies the vansetup.conf file, on each node, to change the number of zookeeper servers to 3, to create an odd number of zookeeper servers (this is in-line with the recommendation provided in the Solr community to avoid a state where 50% zookeeper nodes are available, refer to the link/screenshot below) – in this case, .12 was removed as a zookeeper server in the vansetup.conf as a part of this change during an upgrade.
https://solr.apache.org/guide/6_6/setting-up-an-external-zookeeper-ensemble.html
However, please note that this change will only come into effect when vansetup.py is "re-executed" on a node; the changes in vansetup.conf are brought into effect only after an execution of vansetup.py (sudo /opt/versa/scripts/van-scripts/vansetup.py)
If you are planning on re-executing vansetup.py, to affect the changes made in vansetup.conf, please ensure that you follow the below steps
1. Please note that executing vansetup.py will restart the DB services on that node
2. Please make sure that you type "N" whenever prompted for "Delete DB" - please don't type "y" as it would end up deleting the DB
3. Execute vansetup.py on one node at a time, and move on to the next node only after a successful completion
4. You can execute vansetup.py as below
cd /opt/versa/scripts/van-scripts
sudo ./vansetup.py
Adding a new node to the existing cluster
There are times when an existing node goes down permanently owing to server/bare-metal issues or a VM disk issue. In such cases, you have to bring up a new VM or bare-metal server and add it to the cluster to replace the failed node. You will want to retain the same ip-addresses as the failed node on the new VM/bare-metal - make sure that eth0, eth1, eth2, etc. are configured with the same ip-addresses as the ones present on the failed node.
Once you have the ip-addresses configured on the new VM/baremetal, make sure you add all the required routes to ensure reachability with the rest of the cluster.
You can check the existing routes on one of the existing nodes as below (for example, using "route -n")
Configure similar routes on the new VM/baremetal (you can add routes as below)
sudo ip route add 10.10.0.0/16 via 10.192.21.1 dev eth0
Note: To make the routes permanent, you will have to add an entry in /etc/network/interfaces; otherwise, the routes will be removed after a reboot
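As an illustration (using the same example route as above), a "post-up" line under the relevant interface stanza in /etc/network/interfaces re-adds the route at boot:
post-up ip route add 10.10.0.0/16 via 10.192.21.1 dev eth0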
If you had previously installed the cluster using the installation script on the Director, you can skip to the "Add node using installation script" section further below
Add node manually
Also, check the below on an existing node
cat /etc/hosts
Copy all the entries to the new node's /etc/hosts file and also ensure that you add an entry for the local hostname mapped to the local "listen" address
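For example (the hostname and address below are placeholders), the additional entry on the new node would look like:
10.40.41.15   analytics-new-node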
With the ip-addresses and routes in place, you can move on to preparing the vansetup.conf as below
1. Copy the vansetup.conf from another node (of similar personality) to the /home/versa directory of the new node
2. Execute the below on the new node
sudo su
cp /home/versa/vansetup.conf /opt/versa/scripts/van-scripts/vansetup.conf
check if the permissions are as below post copying
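For example, you can list the file and compare the ownership/permissions against the same file on an existing node:
ls -l /opt/versa/scripts/van-scripts/vansetup.conf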
Now modify vansetup.conf (using the vi editor) and ensure that you replace the ip-addresses with the local ones
For Fusion Analytic node:
On the Fusion Analytic personality, change the below to match the local addresses and also ensure that the zookeeper id reflects the correct mapping. Configure "seeds" as the listen address of one of the existing "analytic" personality nodes - when you execute vansetup.py, it will push the DB table information from that node.
For DSE Analytic node:
Change the below addresses to the local ones and ensure the seeds is set to an existing analytic personality node's listen address (the address you see in the "nodetool status" output)
For DSE Search node:
The procedure is the same as above, except that the personality would be search in this case
For Fusion Search node:
Update the below aspect to reflect the local addresses and correct "id" - seeds will be 127.0.0.1
3. Before executing vansetup on the new node, you will have to ensure the below step is executed in case you are adding an Analytic personality node (you can skip this step for search node)
Delete the old node's reference from the cluster; you can do that using the below steps on any one of the existing analytic personality nodes
For ex, if .44 was the failed node, you will have to remove its reference first using "removenode" as shown below (it can take a few mins for the removal to complete; if it takes more than 30 mins, you can retry with the force option at the end)
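As an illustration (the host ID below is a placeholder), the removal is typically done with nodetool from one of the existing analytic nodes:
nodetool status                              << note the Host ID shown against the failed .44 node
nodetool removenode <host-id-of-failed-node>
nodetool removenode status                   << check the progress of the removal
nodetool removenode force                    << only if the removal is stuck for more than ~30 mins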
4. Post deleting the old node as above, you can execute vansetup.py on the new node
sudo su
cd /opt/versa/scripts/van-scripts/
./vansetup.py
You would also have to execute the below on the director to sync the certificates - fill in the correct cluster name (you can get the cluster name from the Director UI under Administration/Analytics-Cluster)
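This is typically the same certificate-sync script covered in the "Unable to access Analytics via the Director UI" section later in this document, run as the versa user on the Active Director, for example:
sudo su versa
/opt/versa/vnms/scripts/vnms-cert-sync.sh --sync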
Make sure you enter "versa123" as the password at the first prompt; at the second prompt, enter the password you have set for the versa user login on the analytics
In 21.1.x (8443) you will also need to pull the cert from analytics/cluster to the director as below
Add node using installation script
If you had installed your cluster using the installation script, you can simply follow the below KB to add the new node to the existing cluster (you would just need to re-execute the cluster install script with the --add-node option)
Troubleshooting log-forwarder connectivity
You can also bring up a log-forwarder manually by following the below steps; if a log-forwarder has already been set up, you can verify its configuration using the same steps
- Open the vandriver.conf file below, update/check the DB_ADDRESS (analytic nodes) and SEARCH_HOSTS (search nodes) addresses, and set LOG_COLLECTOR_ONLY to True
- Enter config mode and update the below configuration (configure the collector address, which has to be the address of the local interface connecting to the controller, along with the port, storage, and format). Make sure this address is reachable from the controller's south-bound interface (the interface connecting to the log-forwarder/cluster) on the configured port (1234 in this case).
- Execute a “vsh restart” (this will cause the vandriver.conf to take effect), please note that you should not execute vansetup.py on a log-forwarder
- You should also update the UI of the analytic and search nodes and add the log-forwarder's address (of the interface that connects to the DB_ADDRESS and SEARCH_HOSTS) to the "van-driver hosts" listing under admin/settings
- Check /var/log/versa/versa-van-driver.log and check if it’s stuck at “validating cluster”, if so there is an issue with reachability from the log-forwarder to the addresses mentioned in DB_ADDRESS and SEARCH_HOSTS – you can troubleshoot as below
- Ping all the addresses mentioned in DB_ADDRESS and SEARCH_HOSTS from the log-forwarder
- If Ping works, check “nc -zvw3 <address> 9042” (replace address with each address present in DB_ADDRESS)
- Also check, “nc -zvw3 <address> 8983” (replace address with each address present in SEARCH_HOSTS)
- If ping doesn't work or if the connection fails (while checking the previous two steps), please troubleshoot the reachability or the port access
Note that all the ip-addresses mentioned in vandriver.conf are validated by the analytics-driver during a restart/reboot, so if any of the nodes is down/unreachable, the analytics-driver will be stuck at "Validating Cluster"
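To run these checks against every address in one pass, a small loop like the below can help (the addresses are placeholders; substitute the actual DB_ADDRESS and SEARCH_HOSTS entries):
for ip in 10.40.41.11 10.40.41.12; do nc -zvw3 $ip 9042; done    << DB_ADDRESS entries
for ip in 10.40.41.13 10.40.41.14; do nc -zvw3 $ip 8983; done    << SEARCH_HOSTS entries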
Side note: You would also need to add the Primary/Secondary Director as “remote-collectors” to ensure that the alarms are exported to the Directors (follow the KB below for the configuration)
Unable to access Analytics via the Director UI
In 16.1R2 and 20.2.x versions, the Director accesses the Analytics UI on port 8080 by default, whereas in 21.1.x (and above) the Director accesses the Analytics UI via port 8443 (ssl connection).
There are two requirements for the Director to access the Analytics UI (i.e., to be able to open the Analytics UI on the Director's Dashboard)
1. The director(s) need to be registered on the Analytics
2. The director needs to access analytics over an https (8443) connection
Each of the above points needs to be validated in order to ensure successful access
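A quick first check for the second point is to confirm that the port is reachable from the Director shell (the hostname below is a placeholder):
nc -zvw3 <analytics-node> 8443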
The "cluster installation script" available starting 20.2.x/21.1.x and above, takes care of both the above steps, and it's recommended that cluster installation be performed using the cluster installation script.
Please refer to the documentation below
Refer to section below
Side note: Please be careful about using the --secure option; it is only needed if you want to enforce analytics hardening/security when your cluster is exposed to the public domain
Ideally, the Director UI should be able to access Analytics after the above installation. If you see any errors during the installation, or if the Analytics UI is not accessible via the Director UI (typically an error "there is a problem logging into analytics" pops up while trying to access the UI), follow the troubleshooting steps below
Step 1: Please validate that the certificate of the Director is installed on the Analytics using the script below; this script is present in the /opt/versa/vnms/scripts directory on the Director (run it from the Active VD)
Note: You will need to run the below script using 'versa' user as below
sudo su versa
You will need to enter the Analytics cluster name (get it from the Director UI Administration/Analytics-Cluster tab)
In the above output, a successful match returns "MD5 hash matches" as the result
If the above script reports that the MD5 sum does not match, please go to step 2
Step 2: Sync the certificate from the Director towards the Analytics as below
Note: Below script needs to be run as "versa" user
versa@Director:~$ /opt/versa/vnms/scripts/vnms-cert-sync.sh --sync
You would need to provide the cluster name as the input (the cluster name can be located on the Director UI under Administration/Analytics-Cluster)
Step 3: Print the certificate on the Director and confirm that the CN and SAN values are used in the /etc/hosts mapping on the Analytics.
Print the certificate on the Director as below
Check the CN and SAN on the Director certificate
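If you prefer to inspect the certificate file directly with openssl, the below generic commands print the CN and SAN (the path is a placeholder; use the actual Director certificate file):
openssl x509 -in /path/to/director-cert.crt -noout -subject
openssl x509 -in /path/to/director-cert.crt -noout -text | grep -A1 "Subject Alternative Name"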
Check the /etc/hosts files on the Analytics nodes (all the analytic and search nodes in your cluster). Make sure that the "names" mapped in /etc/hosts are exactly the same as the CN and SAN names present in the certificate and that the "ip address" mapped is accurate
Step 4: Verify that ports 9182 and 9183 on the Directors are accessible from the Analytics
Login to the shell of the Analytics node and execute "nc -zvw3 <director> 9182" (also check 9183)
Note: Replace <director> with the name present in the /etc/hosts file for the directors
If you get a "connection refused" (or any other error) while executing the above, then please check and make sure there are no firewalls blocking 9182/9183 access and that routes are present to access the director (check "route -n" to confirm the routes on Analytics and Director) - try to ping the Director from Analytics to ensure routing is fine.
Important side note: If you are using a wildcard certificate, for example with CN as *.utt.com, make sure that you use the full domain name while creating the mapping in the /etc/hosts file (for ex, director1.utt.com and director2.utt.com)
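For example (the addresses below are placeholders), the /etc/hosts entries on each Analytics node would then look like:
10.192.21.10   director1.utt.com
10.192.21.11   director2.utt.com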
Step 5: Execute the below steps to ensure that Analytic certificates are installed on the Director
[Content taken from https://docs.versa-networks.com/Getting_Started/Release_Notes_for_Secure_SD-WAN/Release_Notes_for_Secure_SD-WAN_Release_21.1/02_Versa_Analytics_Release_Notes_for_Release_21.1 ]
In an HA Director setup, you should select "y" to postpone the restart as mentioned above. You can then schedule a time to restart the services on your Directors; follow the below steps to do so
You will need to login to the Standby/Secondary VD and perform "vsh stop" to stop the services on the Standby VD
Post that you will need to login to the Active/Primary VD and perform "vsh restart" to restart the services on the Active VD
Post that, login to the Standby Director again and perform a "vsh start" to bring up the services on the Standby VD
You can execute the below on the Active and Standby Directors to ensure that the HA is in sync
<on the cli>
request vnmsha actions check-sync-status
Step 6: After the above steps are in place, try "revoking" and "re-registering" the directors as below by directly accessing the Analytics UI (https://<ip-address> or http://<ip-address>:8080) of any one of the Analytic nodes
Ideally, the registration should succeed
If the registration fails, please capture the below from the Analytic node on which you attempted the registration
sudo su
cat /var/log/versa/tomcat/catalina.log
Or if you continue to face issues connecting to Analytics from the Director please follow the below steps
Execute the below on the shell of the analytic node whose UI you are trying to connect to and on the Director node (enable logging on both putty terminals), then perform multiple attempts to connect to the analytics UI from the Director (as the affected user). As soon as you hit a failure, press Ctrl+C on both terminals and attach the logs to the TAC ticket.
Shell of analytics node:
sudo su
tail -f /var/log/versa/tomcat/*.log
Shell of Director node:
sudo su
cd /var/log/vnms/spring-boot/
tail -f vnms-spring-boot.log /var/log/vnms/web/*.log
Please attach this output while opening the TAC case.
If you need to open a TAC case, please attach all the outputs collected from Step 1 to Step 6. If you have executed the "cluster installation script", please also attach the entire outputs of the script along with the outputs from Step 1 to Step 6
Procedure to update a WAR file
Sometimes a WAR file is provided by engineering as a patch fix; we can update the WAR file using the below procedure
Take a backup of the existing WAR file:
sudo su
mkdir /tmp/backup-war
mv /opt/versa/bin/versa-1.0.war /tmp/backup-war/
Procedure to update the new WAR file:
cd /opt/versa_van/apps/apache-tomcat/webapps/
pwd
The below removals SHOULD BE performed ONLY inside the following directory:
/opt/versa_van/apps/apache-tomcat/webapps/
rm -rf ./*.war
(NOTE: Only at: /opt/versa_van/apps/apache-tomcat/webapps/)
rm -rf ROOT
(NOTE: Only at: /opt/versa_van/apps/apache-tomcat/webapps/)
Copy the new versa-1.0.war file (from the download location) to each of the cluster nodes and place it under the /opt/versa/bin location
cd /opt/versa/bin
chmod 664 versa-1.0.war
chown root:versa_priv versa-1.0.war
ls -lrth << you should now see the new versa-1.0.war file here
Perform: vsh restart << one node at a time
Verify operations on the UI of this node
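Optionally, assuming tomcat re-expands the application on restart, you can confirm that the webapps directory was repopulated after the "vsh restart":
ls -lrth /opt/versa_van/apps/apache-tomcat/webapps/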
Rollback Procedure:
If you face any issues, you can rollback to the old WAR file as below
cp /tmp/backup-war/versa-1.0.war /opt/versa/bin/
vsh restart
Procedure to Install Self-Signed Cert on Analytic/Search nodes
You can install a self-signed certificate on the analytic/search nodes by executing the van-cert-install.sh script as shown below
[root@analytics3: certificates]# cd /opt/versa/var/van-app/certificates
[root@analytics3: certificates]# mkdir tempcerts
[root@analytics3: certificates]# mv versa_analytics* tempcerts/
[root@analytics3: certificates]#
[root@analytics3: certificates]# cd /opt/versa/scripts/van-scripts/
[root@analytics3: van-scripts]# sudo ./van-cert-install.sh
[root@analytics3: certificates]# cd /opt/versa/var/van-app/certificates
[root@analytics3: certificates]# ls -lrth | grep -i analytics <<<< you will see new cert created