Troubleshooting and Recovering Versa Director HA Issue : Versa Support

1. Troubleshooting High availability issues between Versa Director Nodes

1.1 Split brain and service are not running

1.2. Check connectivity between Master and Slave VD

1.3 How to restore Versa director split brain

2. Postgres and NCS database Out-of-Sync issues.

3. Versa Director HA timer

4. Versa Director HA REST API

5. Latency/Throughput issue between the Directors

6. Check whether trace is enabled on global settings.

7. Contact Support

Purpose

Versa Director works in active/standby mode. If the master Versa Director goes down, the slave Director takes over as master. If for any reason both Versa Directors become master or go out of sync, this document helps you debug and recover from these issues.

1. Troubleshooting High availability issues between Versa Director Nodes:

Check the HA status on both the Master and the Slave node:

request vnmsha actions status

On Master VD

On Slave VD

This command displays the status of Versa Director. Both the primary and the backup node can show master status for a couple of reasons. Run the following command on both nodes to check the vnmsha details.

Note: The designated master address should show the same IP address on both Directors.

request vnmsha actions get-vnmsha-details fetch-peer-vnmsha-details true

Make sure all services are in the running state on both VDs.

If any service has stopped, check the logs for that service and see the next section for restoration steps.

1.1 Split brain and service are not running.

a) Spring-boot is not up :

If spring-boot is not up on the VD, check the following log:

/var/log/vnms/spring-boot/vnms-spring-boot.log
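As a quick first pass, the log can be filtered for error lines. The sketch below assumes that lines containing "error" or "exception" are the ones of interest; this pattern is an assumption, not a documented Versa log format, and recent_errors is a hypothetical helper name.

```shell
# recent_errors LOGFILE [COUNT]
# Print the last COUNT (default 20) lines of LOGFILE that mention
# "error" or "exception", ignoring case. The pattern is an assumption,
# not a documented Versa log format.
recent_errors() {
    grep -iE 'error|exception' "$1" | tail -n "${2:-20}"
}

# Example:
#   recent_errors /var/log/vnms/spring-boot/vnms-spring-boot.log
```

Look for HA-related messages among the lines it prints before deciding whether to disable HA.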

If any HA-related error is present, first take a snapshot of the Master/Slave VD, then disable HA and run vsh restart on the problematic VD. Once all services are fine, re-enable HA. This can be done from the GUI as well as from the CLI.

Snapshot of VD

Disable HA

Note: If there was a split brain, you may need to disable HA individually on each Director.

request vnmsha actions disable-ha

vsh restart

Once all services are up, re-enable HA.

b) Postgres is not up or Postgres is not in sync: check the following logs

/var/log/postgresql/postgresql-9.5-main.log
/var/log/vnms/ha/postgre-ha.log
/var/log/vnms/ha/sync-status.log

Error logs:

/var/log/vnms/ha/postgre-ha.log

1. If Postgres is not up, follow the disable/re-enable steps described in 1.1 a).

2. Check whether /var/lib/postgresql/9.5 is present. If it is not present, follow the steps below:

cp /opt/versa/vnms/scripts/rem-postgre.sh /tmp
sudo dpkg --purge vnms
sudo /tmp/rem-postgre.sh
Install the old bin file directly (sudo ./versa-director-xxx.bin)

3. Check the permissions of the postgres user. If they are not correct, HA will not come up.

Execute the command below on both Director nodes.

sudo usermod -aG versa postgres
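To confirm that the group change took effect, the membership can be checked with the standard id utility. This is a minimal sketch; user_in_group is a hypothetical helper name, and the user and group names come from the usermod command above.

```shell
# user_in_group USER GROUP
# Succeeds if USER is a member of GROUP according to `id -nG`.
user_in_group() {
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF -- "$2"
}

# Example (run on both Director nodes after the usermod command):
#   user_in_group postgres versa && echo "postgres is in group versa"
```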

1.2. Check connectivity between Master and Slave VD

1) Please check ping response between master and slave VD.

2) The following required ports should be open.

Communication Type	Protocol	Port	Source and Purpose
SSH	TCP	22	Allows SSH shell access of Versa Director from any machine and from Versa Analytics. Additionally, have this port enabled for communication between Versa Directors to enable High Availability replication.
HTTPS	TCP	9182	Allows REST API access of Versa Director from Versa Analytics and any host using basic or SSO authentication.
HTTPS	TCP	9183	Allows REST API access of Versa Director from Versa Analytics and any host using OAuth based authentication.
HTTPS	TCP	443	Allows Versa Director Web UI access from any host.
Custom TCP and UDP	TCP and UDP	20514	Allows access from Versa Analytics to receive alarms.
Custom TCP	TCP	4566	Allows access between Active and Standby Versa Director for communicating High Availability related information of NCS database.
Custom TCP	TCP	4570	Allows access between Active and Standby Versa Director for communicating High Availability related information of NCS database.
Custom TCP	TCP	5432	Allows access between Active and Standby Versa Director for exchanging High Availability related information of PostgreSQL DB.
Custom TCP	TCP	9090	Allows VNF proxy access ( uCPE deployment ) from Versa Director UI from any host.
Custom TCP	TCP	4949	Allows Munin Agent access if enabled from Versa Director UI from any host.
Custom TCP	TCP	6080	Allows uCPE VM console access from Versa Director UI from any host.

Example:

Log in to the shell of the actual Master Director and run the commands below towards the actual Standby:

In this example, 10.70.92.12 is the lab Standby Director IP.

admin@Snehal-2214-Master:~$ nc -vz 10.70.92.12 22
Connection to 10.70.92.12 22 port [tcp/ssh] succeeded!

admin@Snehal-2214-Master:~$ nc -vz 10.70.92.12 4566
Connection to 10.70.92.12 4566 port [tcp/*] succeeded!

admin@Snehal-2214-Master:~$ nc -vz 10.70.92.12 5432
Connection to 10.70.92.12 5432 port [tcp/postgresql] succeeded!

admin@Snehal-2214-Master:~$ nc -vz 10.70.92.12 9182
Connection to 10.70.92.12 9182 port [tcp/*] succeeded!

Log in to the Standby Director shell and run the commands below towards the actual Master:

In this example, 10.70.92.11 is the lab Master Director IP.

admin@Snehal-2214-Standby:~$ nc -vz 10.70.92.11 22
Connection to 10.70.92.11 22 port [tcp/ssh] succeeded!

admin@Snehal-2214-Standby:~$ nc -vz 10.70.92.11 4566
Connection to 10.70.92.11 4566 port [tcp/*] succeeded!

admin@Snehal-2214-Standby:~$ nc -vz 10.70.92.11 5432
Connection to 10.70.92.11 5432 port [tcp/postgresql] succeeded!

admin@Snehal-2214-Standby:~$ nc -vz 10.70.92.11 9182
Connection to 10.70.92.11 9182 port [tcp/*] succeeded!

admin@Snehal-2214-Standby:~$ nc -vz 10.70.92.11 4570
Connection to 10.70.92.11 4570 port [tcp/*] succeeded!
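The individual nc checks above can be wrapped in a small loop so that every HA-related port is probed in one pass. This is a sketch under assumptions: the port list is taken from the table in this section, check_peer_ports is a hypothetical helper name, and the 3-second timeout is an arbitrary choice.

```shell
# HA-related TCP ports from the table above: SSH, NCS HA (4566, 4570),
# PostgreSQL, and REST API.
PEER_PORTS="22 4566 4570 5432 9182"

# check_peer_ports PEER_IP
# Probe each port on the peer Director; return non-zero if any is blocked.
check_peer_ports() {
    peer="$1"
    failed=0
    for port in $PEER_PORTS; do
        # -z: connect without sending data; -w 3: give up after 3 seconds
        if nc -z -w 3 "$peer" "$port" >/dev/null 2>&1; then
            echo "port $port open"
        else
            echo "port $port BLOCKED"
            failed=1
        fi
    done
    return "$failed"
}

# Example (run on each Director against its peer):
#   check_peer_ports 10.70.92.12
```

Run it in both directions, since a firewall may allow traffic one way only.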

3) The hostname and hosts entries should be correct on both VDs.

Note: In some scenarios it is necessary to enter the hostname of the peer node in /etc/hosts before configuring HA.
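A hosts entry for the peer can be checked with a short awk filter. The sketch below is illustrative only: has_hosts_entry is a hypothetical helper name, and the peer hostname in the example is a placeholder.

```shell
# has_hosts_entry NAME [HOSTS_FILE]
# Succeeds if HOSTS_FILE (default /etc/hosts) contains a non-comment
# line that lists NAME as a hostname or alias.
has_hosts_entry() {
    peer="$1"
    file="${2:-/etc/hosts}"
    awk -v name="$peer" '
        /^[ \t]*#/ { next }                 # skip full-line comments
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break        # stop at an inline comment
                if ($i == name) found = 1
            }
        }
        END { exit !found }
    ' "$file"
}

# Example (placeholder hostname):
#   has_hosts_entry vd-standby || echo "peer missing from /etc/hosts"
```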

4) Check the iptables rules on both VD

5) Check that there is enough bandwidth between the Master and Backup Director (at least ~10 Mbps).

Transfer a large file and make sure the transfer speed is good. If it is too slow, the DB will not sync.
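One way to turn a timed file transfer into a number that can be compared against the ~10 Mbps guideline is sketched below. The helper name mbps, the test file name, its 100 MB size, and the peer address are all placeholders chosen for illustration.

```shell
# mbps BYTES SECONDS
# Integer throughput in megabits per second for a timed transfer.
mbps() {
    if [ "$2" -le 0 ]; then
        echo "elapsed time must be positive" >&2
        return 1
    fi
    echo $(( $1 * 8 / $2 / 1000000 ))
}

# Example (file name, size, and peer address are placeholders):
#   dd if=/dev/urandom of=/tmp/ha-test.bin bs=1M count=100
#   start=$(date +%s)
#   scp /tmp/ha-test.bin admin@10.70.92.12:/tmp/
#   end=$(date +%s)
#   mbps 104857600 $(( end - start ))
```

For instance, a 100 MB file that takes 60 seconds works out to 13 Mbps, which is only just above the guideline.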

6) Check the platform: disk space, memory, and CPU utilization.

Check whether any partition is full or shows high usage under "Use%".

Use "sudo du -csh *" to find which directory or file is using the most space in the partition. If files under "/var/log/vnms", "/var/log/vnms/karaf", or "/var/log/vnms/spring-boot" show high disk usage, some debug setting may be enabled and causing excessive logging.

df -kH
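To pick out heavily used partitions automatically, the df output can be filtered with awk. In this sketch, full_partitions is a hypothetical helper name, the 80% default threshold is an assumption, and df -P is used instead of df -kH because its column layout is fixed by POSIX.

```shell
# full_partitions [LIMIT]
# Read `df -P` output on stdin and print "mountpoint use%" for every
# partition whose usage is at or above LIMIT percent (default 80).
full_partitions() {
    awk -v limit="${1:-80}" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)
            if (use + 0 >= limit + 0) print $6, $5
        }'
}

# Example:
#   df -P | full_partitions 80
```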

Check whether "used" is over 80% of total "Mem" and whether there is enough free memory under "Mem" and "Swap".

free -h (Go to VD shell and run).
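The same 80% check can be computed from the Mem: line of the free output. This is a small sketch; mem_used_percent is a hypothetical helper name, and it uses free -m so the columns are plain integers.

```shell
# mem_used_percent
# Read `free -m` output on stdin and print used memory as an integer
# percentage of total memory (columns 3 and 2 of the "Mem:" line).
mem_used_percent() {
    awk '$1 == "Mem:" && $2 > 0 { printf "%d\n", $3 * 100 / $2 }'
}

# Example:
#   free -m | mem_used_percent
```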

Check the CPU utilization from the Versa Director shell. If the VD runs on a hypervisor, make sure the hypervisor is not oversubscribed.

htop

top -H (in this output, check which process is consuming the most CPU/memory)

For per-CPU output, run top and press "1".

7) Check that all Northbound and Southbound IP addresses are properly configured via the vnms-startup script. This can be verified in vnms.properties.

8) Check the file permissions of the pg-exec script.

1.3 How to restore Versa director split brain.

This section applies when both Versa Directors are in master state (master/master) and all vsh services are running. This can also happen if there is a connectivity loss between the two Directors, so ensure IP reachability between both nodes.

Step 1) Slave Versa Director is problematic: Identify the Versa Director that was master before the split-brain state (say VD1), then run vsh restart on the other Versa Director (VD2), which was the slave earlier.

Following HA logs can be checked for HA communication between both nodes:

/var/log/postgresql/postgresql-9.5-main.log
/var/log/vnms/ha/postgre-ha.log

Step 2) Disable HA and re-enable it. You can do this via the CLI or the GUI.

The disable command (request vnmsha actions disable-ha) returns success if HA is removed from the Master/Slave.

Once HA is disabled, verify it.

Also check /var/versa/vnms/data/conf/vnms.properties after disabling HA.

If HA cannot be disabled because of a problem between the two VDs, edit vnms.properties to disable it, then run vsh restart.

Re-enabling HA:

Once HA is enabled, verify it with the commands below:

request vnmsha actions get-vnmsha-details fetch-peer-vnmsha-details true
request vnmsha actions get-vnmsha-postgres-status
request vnmsha actions status

2. Postgres and NCS database out-of-Sync issues

If the master and slave Directors are not in sync, follow the checks below to restore the status.

request vnmsha actions get-vnmsha-postgres-status

Possible errors

1) Postgres shows no output or shows an error as below

Administrator@versa-director-master> request vnmsha actions get-vnmsha-postgres-status

Error: application communication failure

Administrator@versa-director-master> request vnmsha actions get-vnmsha-postgres-status
status   ID | Name           | Role    | Status        | Upstream       | Location | Connection string
----+----------------+---------+---------------+----------------+----------+----------------------------------------------
1  | director-node1 | primary | * running     |                | default  | host=10.142.254.13 user=repmgr dbname=repmgr
2  | director-node2 | standby | ? unreachable | director-node1 | default  | host=10.142.254.7 user=repmgr dbname=repmgr

WARNING: following issues were detected
- unable to connect to node 'director-node2' (ID: 2)
- node 'director-node2' (ID: 2) is registered as an active standby but is unreachable

Action: Check the services, the Postgres logs, and vnms.properties (NB and SB addresses correctly configured).
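When the status table is captured to a file or piped from a script, the unreachable nodes can be extracted with awk. This sketch assumes the pipe-separated layout shown in the sample output above; unreachable_nodes is a hypothetical helper name.

```shell
# unreachable_nodes
# Read the get-vnmsha-postgres-status table on stdin and print the name
# of every node whose Status column contains "unreachable".
unreachable_nodes() {
    awk -F'|' 'NF >= 7 && $4 ~ /unreachable/ {
        gsub(/ /, "", $2)   # trim padding around the node name
        print $2
    }'
}

# Example (status.txt holds saved command output):
#   unreachable_nodes < status.txt
```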

2) NCS/Postgres is not in sync

Both NCS and Postgres should be IN_SYNC. If either of them is not in sync, check the NCS transaction ID and the Postgres service.

Administrator@Director1> request vnmsha actions check-sync-status
postgres-status   OUT_OF_SYNC
ncs-status   OUT_OF_SYNC

Note: This command does not display correct output in 16.1R2S7/S8 due to Bug ID 39849, which is resolved in S9.
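If the check-sync-status output is saved or piped from a script, the components that are out of sync can be listed as follows. The sketch assumes the two-column "name-status VALUE" layout shown above; out_of_sync_components is a hypothetical helper name.

```shell
# out_of_sync_components
# Read check-sync-status output on stdin and print each component
# (for example "postgres" or "ncs") that reports OUT_OF_SYNC.
out_of_sync_components() {
    awk '$2 == "OUT_OF_SYNC" { sub(/-status$/, "", $1); print $1 }'
}

# Example (sync.txt holds saved command output):
#   out_of_sync_components < sync.txt
```

Empty output means neither line reported OUT_OF_SYNC.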

The NCS transaction ID should be the same on both VDs.

If it is different, check the following:

Manually check recent changes in the VD GUI.
Check the recent commit list.
Log in to Postgres and check whether the metadata matches on both VDs:

sudo su - postgres
psql -d vnms
select * from template_metadata;
select * from template_binddata;

3. Versa Director HA timer

There are various timers for HA:

Failover Timeout. Timeout period (in seconds), before the slave node promotes itself to the master state.

Slave Start Timeout. When the service starts, the non-designated master node waits for a period of three times the slave start timeout (in seconds), before the designated master promotes itself to the master state.

Auto Switchover Timeout. Wait timeout period (in seconds) before the designated master promotes itself to the master state.

By default, the Director HA implementation is non-revertive. That is, if the designated master is up and running again (after recovery), it is not promoted to master unless "Enable Auto Switchover" is enabled/checked. If it is enabled, the designated master is promoted to master (revertive) after the configured "Auto Switchover Timeout (in secs)" has elapsed.

HA timer calculation:

In the Versa Director Failover scenario, three attempts are made to establish the Master-Slave switch-over.

For the Slave to become Master, the time taken is 3 times the failover timeout (default setting).

The actual switch-over time = [ failover-timeout + (number of attempts x failover-timeout) ]

Therefore, if you have set failover-timeout of 5 minutes (300 seconds), then:

Switch-over time is: [ 5 + ( 3 x 5 ) ] = 20 minutes.
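The formula above can be evaluated for other timeout values with a one-line helper. In this sketch, switchover_seconds is a hypothetical name, and the default of 3 attempts follows the description above.

```shell
# switchover_seconds FAILOVER_TIMEOUT [ATTEMPTS]
# Switch-over time in seconds:
#   failover-timeout + (attempts x failover-timeout)
switchover_seconds() {
    timeout="$1"
    attempts="${2:-3}"
    echo $(( timeout + attempts * timeout ))
}

# Example: a 300-second failover timeout gives 1200 seconds (20 minutes).
#   switchover_seconds 300
```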

4. Versa Director HA REST API

Refer to the following documentation for HA-related APIs:

https://docs.versa-networks.com/Management_and_Orchestration/Versa_Director/Director_REST_APIs/01_Versa_Director_REST_API_Overview

5. Latency/Throughput issue between the Directors

Transfer a file in both directions between the Directors to check for any latency or throughput issue. If a throughput or latency issue exists, resolve it before enabling HA. Otherwise HA will fail, because enabling HA requires transferring the snapshots and the Postgres DB from the Master to the Backup.

Refer to the following document for the bandwidth and latency requirements between the Directors:

https://support.versa-networks.com/a/solutions/articles/23000023324

6. Check whether trace is enabled on global settings

Make sure trace is disabled (trace should be set to false) on both Directors. Otherwise it can cause HA issues or NCS CLI hangs.

7. Contact Support

If the issue is still not resolved after performing the steps above, open a case with Versa TAC and include all the output from the steps above along with a full tech-support dump from both Directors.
