Table of Contents
1. Troubleshooting High Availability issues between Versa Director nodes
1.1 Split brain and services are not running
1.2 Check connectivity between Master and Slave VD
1.3 How to restore Versa Director split brain
2. Postgres and NCS database out-of-sync issues
3. Versa Director HA timer
4. Versa Director HA REST API
5. Latency/Throughput issue between the Directors
6. Contact Support
Purpose
Versa Director works in active/standby mode. If the master Versa Director goes down, the slave Director takes over as master. If for any reason both Versa Directors become master or go out of sync, this document helps you debug these kinds of issues and restore normal operation.
1. Troubleshooting High Availability issues between Versa Director nodes
Check the HA status on both the Master and the Slave node:
request vnmsha actions status
On Master VD
On Slave VD
This command displays the status of Versa Director. Both the primary and the backup node can show master status for a couple of reasons. Run the following command on both nodes to check the vnmsha details.
Note: The designated master address should show the same IP address on both Directors.
request vnmsha actions get-vnmsha-details fetch-peer-vnmsha-details true
Make sure all services are in the running state on both VDs.
If any service has stopped, check the logs for that service and see the next section for restoration steps.
1.1 Split brain and services are not running
a) Spring Boot is not up:
If Spring Boot is not up on the VD, check the log below:
/var/log/vnms/spring-boot/vnms-spring-boot.log
Example:
If any HA-related error is present, first take a snapshot of the Master/Slave VD, then disable HA and run vsh restart on the problematic VD. Once all services are healthy, re-enable HA. This can be done from the GUI as well as from the CLI.
Snapshot of VD
Disable HA
Note: If there was a split brain, you may need to disable HA individually on each Director.
request vnmsha actions disable-ha
vsh restart
Once all services are up, re-enable HA.
b) Postgres is not up or Postgres is not in sync: check the following logs
/var/log/postgresql/postgresql-9.5-main.log
/var/log/vnms/ha/postgre-ha.log
/var/log/vnms/ha/sync-status.log
Error logs:
/var/log/vnms/ha/postgre-ha.log
If Postgres is not up, follow the HA disable/re-enable steps described in 1.1 a).
2. Check whether /var/lib/postgresql/9.5 is present. If it is not present, follow the steps below:
cp /opt/versa/vnms/scripts/rem-postgre.sh /tmp
sudo dpkg --purge vnms
sudo /tmp/rem-postgre.sh
Install the old bin file directly (sudo ./versa-director-xxx.bin)
3. Check the permissions of the postgres user. If they are not correct, HA will not come up.
Execute the command below on both Director nodes.
sudo usermod -aG versa postgres
1.2 Check connectivity between Master and Slave VD
1) Check the ping response between the Master and Slave VD.
2) The following required ports should be open.
Communication Type | Protocol | Port | Source and Purpose |
SSH | TCP | 22 | Allows SSH shell access of Versa Director from any machine and from Versa Analytics. Additionally, have this port enabled for communication between Versa Directors to enable High Availability replication. |
HTTPS | TCP | 9182 | Allows REST API access of Versa Director from Versa Analytics and any host using basic or SSO authentication. |
HTTPS | TCP | 9183 | Allows REST API access of Versa Director from Versa Analytics and any host using OAuth based authentication. |
HTTPS | TCP | 443 | Allows Versa Director Web UI access from any host. |
Custom TCP and UDP | TCP and UDP | 20514 | Allows access from Versa Analytics to receive alarms. |
Custom TCP | TCP | 4566 | Allows access between Active and Standby Versa Director for communicating High Availability related information of NCS database. |
Custom TCP | TCP | 4570 | Allows access between Active and Standby Versa Director for communicating High Availability related information of NCS database. |
Custom TCP | TCP | 5432 | Allows access between Active and Standby Versa Director for exchanging High Availability related information of PostgreSQL DB. |
Custom TCP | TCP | 9090 | Allows VNF proxy access ( uCPE deployment ) from Versa Director UI from any host. |
Custom TCP | TCP | 4949 | Allows Munin Agent access if enabled from Versa Director UI from any host. |
Custom TCP | TCP | 6080 | Allows uCPE VM console access from Versa Director UI from any host. |
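The port requirements above can be spot-checked from one Director toward its peer. The following is a minimal sketch, not a Versa tool: it assumes Python 3 is available on the host, and the function names and the port subset (22, 4566, 4570, 5432, the entries the table associates with Director-to-Director HA traffic) are choices made for this illustration.

```python
import socket

# TCP ports the table above lists for HA traffic between Directors
# (SSH replication, NCS database, PostgreSQL). Chosen for this sketch.
HA_TCP_PORTS = [22, 4566, 4570, 5432]


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ha_ports(peer_ip: str, ports=HA_TCP_PORTS) -> dict:
    """Map each port to True (reachable) or False (closed or filtered)."""
    return {port: tcp_port_open(peer_ip, port) for port in ports}
```

Calling `check_ha_ports("<peer Director IP>")` on each Director, pointing at the other one, returns a dictionary whose False entries indicate ports to investigate in firewalls or iptables rules. A False result only shows that the connection attempt failed; it does not say where the traffic was dropped.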
3) The hostname and the hosts entries should be correct on both VDs.
Note: In some scenarios it is necessary to add the hostname of the peer node to /etc/hosts before configuring HA.
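One way to confirm the peer entry is to parse the hosts file and look the peer up by name. The sketch below is illustrative only; the function names are hypothetical, and the hostname and IP address in the usage note are placeholders rather than values from a real deployment.

```python
def hosts_entries(hosts_text: str) -> dict:
    """Parse /etc/hosts-style text into {name: ip}; the first mapping for a name wins."""
    mapping = {}
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping.setdefault(name, ip)
    return mapping


def peer_in_hosts(hosts_text: str, peer_hostname: str, peer_ip: str) -> bool:
    """Return True if peer_hostname is present and maps to peer_ip."""
    return hosts_entries(hosts_text).get(peer_hostname) == peer_ip
```

For example, `peer_in_hosts(open("/etc/hosts").read(), "vd2", "10.0.0.2")` returns False if the placeholder peer name `vd2` is missing or points at a different address.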
4) Check the iptables rules on both VDs.
5) Check that there is enough bandwidth between the Master and Backup Director (at least ~10 Mbps).
Transfer a large file and make sure the transfer speed is good. If it is too slow, the DB will not sync.
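To turn a timed file transfer into a figure that can be compared with the ~10 Mbps guideline, the arithmetic can be sketched as follows. The helper names are hypothetical, and the 10 Mbps threshold is the approximate value quoted above, not a hard limit.

```python
MIN_MBPS = 10.0  # approximate minimum bandwidth quoted in the text above


def throughput_mbps(size_bytes: int, seconds: float) -> float:
    """Convert a transfer of size_bytes completed in seconds to megabits per second."""
    if seconds <= 0:
        raise ValueError("transfer time must be positive")
    return size_bytes * 8 / seconds / 1_000_000


def link_ok(size_bytes: int, seconds: float, minimum: float = MIN_MBPS) -> bool:
    """Return True if the measured throughput meets the minimum."""
    return throughput_mbps(size_bytes, seconds) >= minimum
```

A 125 MB file (125,000,000 bytes) that takes 100 seconds to copy works out to 10 Mbps, which is right at the guideline; the same file taking 200 seconds gives 5 Mbps and falls short.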
6) Check platform disk space, memory, and CPU utilization.
Check whether any partition is full or shows high usage under “Use%”.
Use “sudo du -csh *” to find which directory or file is using the most space on the partition. If files under “/var/log/vnms”, “/var/log/vnms/karaf”, or “/var/log/vnms/spring-boot” show high disk usage, some debug logging may be enabled and causing excessive logging.
df -kH
Check whether “used” is over 80% of total “Mem” and whether there is enough free memory under “Mem” and “Swap”.
free -h (run from the VD shell)
Check the CPU utilization from the Versa Director shell. If the VD runs on a hypervisor, make sure the hypervisor is not oversubscribed.
htop
top -H
In this output, check which process is consuming the most CPU/memory.
For a per-CPU view, run top and press “1”.
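The disk and memory checks above can also be scripted. This is a rough sketch under stated assumptions: Python 3 on a Linux host, an 80% threshold reused for disk as well as memory (the text only gives 80% for memory), and memory use approximated from /proc/meminfo as MemTotal minus MemAvailable, which will not match the “used” column of free exactly. The function names are hypothetical.

```python
import shutil

THRESHOLD = 80.0  # percent; the text gives 80% for memory, reused for disk here as an assumption


def percent_used(used: int, total: int) -> float:
    """Return used as a percentage of total (0.0 if total is zero)."""
    return 100.0 * used / total if total else 0.0


def disk_percent(path: str = "/") -> float:
    """Percentage of the filesystem containing path that is in use."""
    usage = shutil.disk_usage(path)
    return percent_used(usage.used, usage.total)


def mem_percent(meminfo_text: str) -> float:
    """Approximate memory use from /proc/meminfo text as (MemTotal - MemAvailable) / MemTotal."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            fields[key.strip()] = int(rest.split()[0])  # values are reported in kB
    total = fields["MemTotal"]
    return percent_used(total - fields["MemAvailable"], total)
```

Comparing `disk_percent("/var")` or `mem_percent(open("/proc/meminfo").read())` against `THRESHOLD` gives a quick yes/no signal; df and free remain the reference for the exact figures.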
7) Check that all Northbound and Southbound IP addresses are properly configured via the vnms-startup script. This can be verified in vnms.properties.
8) Check the file permissions of the pg-exec script.
1.3 How to restore Versa Director split brain
Use this procedure if both Versa Directors have become master and all vsh services are running. This can also happen if there is a connectivity loss between the Directors, so ensure IP reachability between both nodes.
Step 1) Slave Versa Director is problematic: identify the Versa Director that was master before the split-brain state (say VD1), then run vsh restart on the other Versa Director (VD2), which was the slave earlier.
The following HA logs can be checked for HA communication between both nodes:
/var/log/postgresql/postgresql-9.5-main.log
/var/log/vnms/ha/postgre-ha.log
Step 2) Disable HA and re-enable it. You can do this via the CLI or the GUI.
The disable command (request vnmsha actions disable-ha) returns success if HA is removed from the Master/Slave. Once HA is disabled, verify it.
Also check /var/versa/vnms/data/conf/vnms.properties after disabling HA.
If HA cannot be disabled because of a problem between the two VDs, edit vnms.properties to disable it and then run vsh restart.
Re-enabling HA:
Once HA is enabled, verify it with the commands below:
request vnmsha actions get-vnmsha-details fetch-peer-vnmsha-details true
request vnmsha actions get-vnmsha-postgres-status
request vnmsha actions status
2. Postgres and NCS database out-of-sync issues
If the master and slave Directors are not in sync, use the checks below to restore the status.
request vnmsha actions get-vnmsha-postgres-status
Possible errors
1) Postgres shows no output or shows an error as below
Administrator@versa-director-master> request vnmsha actions get-vnmsha-postgres-status
Error: application communication failure
Administrator@versa-director-master> request vnmsha actions get-vnmsha-postgres-status
status ID | Name | Role | Status | Upstream | Location | Connection string
----+----------------+---------+---------------+----------------+----------+----------------------------------------------
1 | director-node1 | primary | * running | | default | host=10.142.254.13 user=repmgr dbname=repmgr
2 | director-node2 | standby | ? unreachable | director-node1 | default | host=10.142.254.7 user=repmgr dbname=repmgr
WARNING: following issues were detected
- unable to connect to node 'director-node2' (ID: 2)
- node 'director-node2' (ID: 2) is registered as an active standby but is unreachable
Action: Check the services, the Postgres logs, and vnms.properties (NB and SB correctly configured).
2) NCS/Postgres is not in sync
Both NCS and Postgres should be IN_SYNC. If either of them is not in sync, check the NCS transaction ID and the Postgres service.
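When the status output is collected from several Directors, the node table shown above can be scanned programmatically. The sketch below assumes the pipe-separated layout from the sample output (ID, Name, Role, Status in the first four columns); the function name is hypothetical, and the parser is not tied to any Versa or repmgr API.

```python
def nodes_not_running(output: str) -> list:
    """Return the names of nodes whose Status column does not contain 'running'."""
    nodes = []
    for line in output.splitlines():
        cols = [c.strip() for c in line.split("|")]
        # Data rows start with a numeric node ID; this skips the header,
        # the separator line, and the WARNING lines.
        if len(cols) < 4 or not cols[0].isdigit():
            continue
        name, status = cols[1], cols[3]
        if "running" not in status:
            nodes.append(name)
    return nodes
```

Applied to the sample output above, this returns a list containing only `director-node2`, the standby reported as unreachable. An empty list on output that contains no node rows at all (for example, the "application communication failure" error) does not mean the cluster is healthy.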
Administrator@Director1> request vnmsha actions check-sync-status
postgres-status OUT_OF_SYNC
ncs-status OUT_OF_SYNC
Note: This command does not display correct output in 16.1R2S7/S8 due to Bug ID 39849, which is resolved in S9.
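The two-line check-sync-status output shown above is easy to evaluate in a script. This is a minimal sketch that assumes the output keeps the `<name>-status <value>` layout from the sample; the function names are hypothetical.

```python
def parse_sync_status(output: str) -> dict:
    """Collect '<name>-status <value>' lines into a dictionary."""
    status = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].endswith("-status"):
            status[parts[0]] = parts[1]
    return status


def is_in_sync(output: str) -> bool:
    """True only if both postgres-status and ncs-status are present and IN_SYNC."""
    status = parse_sync_status(output)
    return all(status.get(key) == "IN_SYNC" for key in ("postgres-status", "ncs-status"))
```

Requiring both keys to be present means that an error message or empty output is treated as not in sync rather than silently passing.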
The NCS transaction ID should be the same on both VDs.
If it is different, check the following:
- Manually check recent changes in the VD GUI.
- Check the recent commit list.
- Log in to Postgres and check whether the metadata matches on both VDs:
sudo su - postgres
psql -d vnms
select * from template_metadata;
select * from template_binddata;
3. Versa Director HA timer
There are several timers for HA:
Failover Timeout. Timeout period (in seconds), before the slave node promotes itself to the master state.
Slave Start Timeout. When the service starts, the non-designated master node waits for a period of three times the slave start timeout (in seconds), before the designated master promotes itself to the master state.
Auto Switchover Timeout. Wait timeout period (in seconds) before the designated master promotes itself to the master state.
By default, the Director HA implementation is non-revertive: when the designated master is up and running again after recovery, it is not promoted to master unless “Enable Auto Switchover” is checked. If it is enabled, the designated master is promoted to master (revertive) after the configured “Auto Switchover Timeout (in secs)” has elapsed.
HA timer calculation:
In the Versa Director Failover scenario, three attempts are made to establish the Master-Slave switch-over.
For the Slave to become Master, the time taken is 3 times the failover timeout (default setting).
The actual switch-over time = [ failover-timeout + (number of attempts x failover-timeout) ]
Therefore, if you have set failover-timeout of 5 minutes (300 seconds), then:
Switch-over time is: [ 5 + ( 3 x 5 ) ] = 20 minutes.
Check the following API and its response time from the VD shell.
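The switch-over arithmetic from the HA timer calculation can be reproduced with a small helper, which is handy when evaluating other failover-timeout values. The function name is hypothetical; the formula and the default of three attempts are taken from the calculation above.

```python
def switchover_seconds(failover_timeout_s: int, attempts: int = 3) -> int:
    """Switch-over time = failover-timeout + (number of attempts x failover-timeout)."""
    return failover_timeout_s + attempts * failover_timeout_s


# Worked example from the text: a 300-second (5-minute) failover timeout.
minutes = switchover_seconds(300) / 60  # 1200 seconds, i.e. 20 minutes
```

With a 60-second failover timeout, the same formula gives 240 seconds.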
4. Versa Director HA REST API
Refer to the following documentation for HA-related APIs:
https://docs.versa-networks.com/Management_and_Orchestration/Versa_Director/Director_REST_APIs/01_Versa_Director_REST_API_Overview
5. Latency/Throughput issue between the Directors
Transfer a file in both directions between the Directors to check for latency or throughput issues. If such an issue exists, resolve it before enabling HA. Otherwise HA will fail, because enabling HA requires transferring the snapshots and the Postgres DB from the Master to the Backup.
Refer to the following document for the bandwidth and latency requirements between the Directors:
https://support.versa-networks.com/a/solutions/articles/23000023324
6. Contact Support
If the issue is still not resolved after performing the steps above, open a case with Versa TAC and include all the output from the steps above along with a full tech-support dump from both Directors.