Table of Contents

1. Troubleshooting High Availability issues between Versa Director Nodes
1.1 Split brain and services are not running
1.2 Check connectivity between Master and Slave VD
1.3 How to restore Versa Director split brain
2. Postgres and NCS database Out-of-Sync issues
3. Versa Director HA timer
4. Versa Director HA REST API
5. Latency/Throughput issue between the Directors
6. Contact Support

Purpose

Versa Director works in active/standby mode. If the master Versa Director goes down, the slave Director takes over as master. If, for any reason, both Versa Directors become master or go out of sync, this document helps you debug the issue and restore normal operation.

1. Troubleshooting High Availability issues between Versa Director Nodes

Check the HA status on both the Master and Slave nodes:

request vnmsha actions status

Run the command on the Master VD and on the Slave VD. It displays the status of Versa Director. Both the primary and backup nodes can report master status for a couple of reasons. Run the following command on both nodes to check the vnmsha details.

Note: The designated master address should show the same IP address on both Directors.

request vnmsha actions get-vnmsha-details fetch-peer-vnmsha-details true

Make sure all services are in the running state on both VDs.

If any service is stopped, check the logs for that service and see the next section for restoration steps.

1.1 Split brain and services are not running

a) Spring-boot is not up:

If spring-boot is not up on the VD, check the following log:

/var/log/vnms/spring-boot/vnms-spring-boot.log

If any HA-related error is present, first take a snapshot of the Master/Slave VD, then disable HA and run vsh restart on the problematic VD. Once all services are running, re-enable HA. This can be done from the GUI as well as the CLI.

Take a snapshot of the VD.

Disable HA:

Note: If there was a split brain, you may need to disable HA individually on each Director.

request vnmsha actions disable-ha

Restart the services on the problematic VD:

vsh restart

Once all services are up, re-enable HA.


b) Postgres is not up or is not in sync: check the following logs

/var/log/postgresql/postgresql-9.5-main.log
/var/log/vnms/ha/postgre-ha.log
/var/log/vnms/ha/sync-status.log

Error logs:

/var/log/vnms/ha/postgre-ha.log

1. If Postgres is not up, follow the HA disable/re-enable steps described in 1.1 a).


2. Check whether /var/lib/postgresql/9.5 is present. If it is not, follow the steps below:

cp /opt/versa/vnms/scripts/rem-postgre.sh /tmp
sudo dpkg --purge vnms
sudo /tmp/rem-postgre.sh
Install the old bin file directly (sudo ./versa-director-xxx.bin)


3. Check the permissions of the postgres user. If they are not correct, HA will not come up.

Execute the following command on both Director nodes:

sudo usermod -aG versa postgres


 

1.2 Check connectivity between Master and Slave VD

1) Check the ping response between the master and slave VD.

2) The following required ports must be open.

 

Communication Type | Protocol | Port | Source and Purpose
-------------------+-------------+-------+----------------------------------------------
SSH | TCP | 22 | Allows SSH shell access of Versa Director from any machine and from Versa Analytics. Additionally, have this port enabled for communication between Versa Directors to enable High Availability replication.
HTTPS | TCP | 9182 | Allows REST API access of Versa Director from Versa Analytics and any host using basic or SSO authentication.
HTTPS | TCP | 9183 | Allows REST API access of Versa Director from Versa Analytics and any host using OAuth based authentication.
HTTPS | TCP | 443 | Allows Versa Director Web UI access from any host.
Custom TCP and UDP | TCP and UDP | 20514 | Allows access from Versa Analytics to receive alarms.
Custom TCP | TCP | 4566 | Allows access between Active and Standby Versa Director for communicating High Availability related information of NCS database.
Custom TCP | TCP | 4570 | Allows access between Active and Standby Versa Director for communicating High Availability related information of NCS database.
Custom TCP | TCP | 5432 | Allows access between Active and Standby Versa Director for exchanging High Availability related information of PostgreSQL DB.
Custom TCP | TCP | 9090 | Allows VNF proxy access (uCPE deployment) from the Versa Director UI from any host.
Custom TCP | TCP | 4949 | Allows Munin Agent access if enabled from the Versa Director UI from any host.
Custom TCP | TCP | 6080 | Allows uCPE VM console access from the Versa Director UI from any host.
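As a rough aid for the port check above, the following Python sketch tries a TCP connection to each port that the table lists for Director-to-Director HA traffic (22, 4566, 4570, 5432). The function name and the dictionary are illustrative assumptions, not part of any Versa tooling, and a successful connection only shows that the port is reachable, not that HA itself is healthy.

```python
import socket

# TCP ports that the table above lists for Director-to-Director HA traffic.
HA_TCP_PORTS = {
    22: "SSH (HA replication)",
    4566: "NCS database HA",
    4570: "NCS database HA",
    5432: "PostgreSQL HA",
}


def check_tcp_ports(host, ports, timeout=3.0):
    """Return {port: True/False}: whether a TCP connection to host:port succeeds."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or no route.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results

# Example (hypothetical peer address, replace with the peer Director's IP):
#   check_tcp_ports("192.0.2.10", HA_TCP_PORTS)
```

Run it from each Director against the other, since a firewall or iptables rule may block only one direction.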

 

 

3) The hostname and hosts entries must be correct on both VDs.

Note: In some scenarios it is necessary to enter the hostname of the peer node in /etc/hosts before configuring HA.

4) Check the iptables rules on both VDs.

5) Check that there is enough bandwidth between the Master and Backup Directors (at least ~10 Mbps).

Transfer a large file and make sure the transfer speed is adequate. If it is too slow, the DB will not sync.
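To turn a timed file transfer into a number that can be compared with the ~10 Mbps figure above, a minimal Python sketch follows. The function names are hypothetical, and the calculation assumes 1 Mbps = 1,000,000 bits per second and that you have measured the file size and the elapsed time yourself.

```python
MIN_MBPS = 10  # approximate minimum bandwidth mentioned in the text above


def throughput_mbps(bytes_transferred, seconds):
    """Average throughput in megabits per second (1 Mbps = 1,000,000 bits/s)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return bytes_transferred * 8 / seconds / 1_000_000


def link_is_adequate(bytes_transferred, seconds, minimum_mbps=MIN_MBPS):
    """True if the measured transfer meets the minimum throughput."""
    return throughput_mbps(bytes_transferred, seconds) >= minimum_mbps
```

For example, a 500,000,000-byte file copied in 200 seconds works out to 20 Mbps, while the same file taking 800 seconds works out to 5 Mbps, which is below the stated minimum.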

 

6) Check platform disk space, memory, and CPU utilization.

Check whether any partition is full or shows high usage under “Use%”:

df -kH

Use “sudo du -csh *” to find which directory or file is using the most space in the partition. If files under “/var/log/vnms”, “/var/log/vnms/karaf”, or “/var/log/vnms/spring-boot” show high disk usage, debug logging may be enabled and causing excessive logging.
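The partition check can also be scripted. The sketch below uses Python's standard shutil.disk_usage to compute a usage percentage per mount point and flag those above a threshold. The 80% default and the function names are assumptions for illustration; the source does not give a disk threshold, and the result may differ slightly from the rounded “Use%” column of df.

```python
import shutil


def use_percent(used, available):
    """Percentage of usable space consumed (used / (used + available))."""
    usable = used + available
    return 0.0 if usable == 0 else 100.0 * used / usable


def partition_report(paths, threshold=80.0):
    """Return (path, percent_used, over_threshold) for each given path."""
    report = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free
        pct = use_percent(usage.used, usage.free)
        report.append((path, round(pct, 1), pct >= threshold))
    return report

# Example: partition_report(["/", "/var"])
```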

 

From the VD shell, run the following command and check whether “used” is over 80% of the total “Mem”, and whether there is enough free memory under “Mem” and “Swap”:

free -h

 

 

 

Check the CPU utilization from the Versa Director shell. If the VD runs on a hypervisor, make sure the hypervisor is not oversubscribed.

htop

top -H

In the top -H output, check which process is consuming the most CPU or memory.

For more extensive (per-CPU) output, run top and press “1”.

 

 

 

7) Check that all Northbound and Southbound IP addresses are properly configured via the vnms-startup script. This can be verified in vnms.properties.

8) Check the file permissions of the pg-exec script.

 

 

1.3 How to restore Versa Director split brain

Use this procedure when both Versa Directors are in the master state (master/master) and all vsh services are running. This state can also result from a connectivity loss between the two Directors, so ensure IP reachability between both nodes.

Step 1) The slave Versa Director is problematic: Identify the Versa Director that was master before the split-brain state (for example, VD1), then run vsh restart on the other Versa Director (VD2), which was the slave earlier.

The following HA logs can be checked for HA communication between both nodes:

/var/log/postgresql/postgresql-9.5-main.log
/var/log/vnms/ha/postgre-ha.log

 

Step 2) Disable HA and re-enable it. You can do this via the CLI or the GUI.

The disable command returns success if HA is removed from the Master/Slave. Once HA is disabled, verify it.

Also check /var/versa/vnms/data/conf/vnms.properties after disabling HA.

If HA cannot be disabled because of a problem between the two VDs, edit vnms.properties to disable it, then run vsh restart.

Re-enabling HA:

Once HA is enabled, verify it with the commands below:

request vnmsha actions get-vnmsha-details fetch-peer-vnmsha-details true
request vnmsha actions get-vnmsha-postgres-status
request vnmsha actions status


2. Postgres and NCS database Out-of-Sync issues

If the master and slave Directors are not in sync, follow the checks below to restore the status.

request vnmsha actions get-vnmsha-postgres-status


Possible errors

1) Postgres shows no output or shows an error like the ones below:

Administrator@versa-director-master> request vnmsha actions get-vnmsha-postgres-status

Error: application communication failure

Administrator@versa-director-master> request vnmsha actions get-vnmsha-postgres-status
status ID | Name | Role | Status | Upstream | Location | Connection string
----+----------------+---------+---------------+----------------+----------+----------------------------------------------

1 | director-node1 | primary | * running | | default | host=10.142.254.13 user=repmgr dbname=repmgr

2 | director-node2 | standby | ? unreachable | director-node1 | default | host=10.142.254.7 user=repmgr dbname=repmgr


 WARNING: following issues were detected

  - unable to connect to node 'director-node2' (ID: 2)

  - node 'director-node2' (ID: 2) is registered as an active standby but is unreachable

Action: Check the services, the Postgres logs, and vnms.properties (verify that NB and SB are correctly configured).
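If you collect this output from both Directors regularly, a small parser can flag nodes that are not running. The Python sketch below assumes the pipe-separated, seven-column layout shown in the sample output above; the function names are hypothetical and the parser is not part of any Versa tooling.

```python
def parse_cluster_status(text):
    """Parse node rows of the pipe-separated status table into dicts."""
    nodes = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.split("|")]
        # Node rows have seven cells and start with a numeric ID;
        # the header, ruler, and warning lines do not.
        if len(cells) != 7 or not cells[0].isdigit():
            continue
        nodes.append({
            "id": int(cells[0]),
            "name": cells[1],
            "role": cells[2],
            # Drop the leading marker ("*", "?", "!") before the state word.
            "status": cells[3].lstrip("*?! ").strip(),
            "upstream": cells[4],
            "connection": cells[6],
        })
    return nodes


def unhealthy_nodes(nodes):
    """Names of nodes whose status is anything other than 'running'."""
    return [n["name"] for n in nodes if n["status"] != "running"]
```

Applied to the sample output above, this reports director-node2 as the only unhealthy node.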

 

 

2) NCS/Postgres is not in sync

Both NCS and Postgres should report IN_SYNC. If either is out of sync, check the NCS transaction ID and the Postgres service.

Administrator@Director1> request vnmsha actions check-sync-status
postgres-status OUT_OF_SYNC
ncs-status OUT_OF_SYNC


Note: This command does not display correct output in 16.1R2S7/S8 due to Bug ID 39849, which is resolved in S9.
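When comparing the check-sync-status output across both Directors, it can help to reduce it to the components that are not IN_SYNC. The Python sketch below assumes the two-column "name value" layout shown in the sample output above; the function names are hypothetical.

```python
def parse_sync_status(text):
    """Map each '<component>-status' key in the output to its reported value."""
    status = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only two-token lines such as 'postgres-status OUT_OF_SYNC'.
        if len(parts) == 2 and parts[0].endswith("-status"):
            status[parts[0]] = parts[1]
    return status


def out_of_sync(status):
    """Sorted list of components whose value is anything other than IN_SYNC."""
    return sorted(key for key, value in status.items() if value != "IN_SYNC")
```

An empty list from out_of_sync means every reported component is IN_SYNC. On the releases affected by the bug noted above, the command output itself may be unreliable.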

The NCS transaction ID should be the same on both VDs.

If it is different, check the following:

  • Manually check recent changes in the VD GUI.
  • Check the recent commit list.
  • Log in to Postgres and check whether the metadata matches on both VDs:
sudo su - postgres
psql -d vnms
select * from template_metadata;
select * from template_binddata;



3. Versa Director HA timer


Several timers apply to HA:

Failover Timeout. Timeout period (in seconds) before the slave node promotes itself to the master state.

Slave Start Timeout. When the service starts, the non-designated master node waits for a period of three times the slave start timeout (in seconds) before the designated master promotes itself to the master state.

Auto Switchover Timeout. Wait timeout period (in seconds) before the designated master promotes itself to the master state.

By default, the Director HA implementation is non-revertive: after the designated master recovers and is up and running, it is not promoted to master unless “Enable Auto Switchover” is checked. If it is enabled, the designated master is promoted to master (revertive) after the configured “Auto Switchover Timeout (in secs)” elapses.

HA timer calculation:

In a Versa Director failover scenario, three attempts are made to establish the Master-Slave switch-over.

For the Slave to become Master, the time taken is 3 times the failover timeout (default setting).

The actual switch-over time = [ failover-timeout + (number of attempts x failover-timeout) ]
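The formula above can be expressed as a small Python helper for trying out different timeout values. The function name is hypothetical, and the default of three attempts is taken from the description above.

```python
def switchover_seconds(failover_timeout, attempts=3):
    """Switch-over time = failover-timeout + (attempts x failover-timeout)."""
    return failover_timeout + attempts * failover_timeout
```

The input and the result are in the same unit, so pass seconds to get seconds.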

Therefore, if you have set a failover-timeout of 5 minutes (300 seconds), then:

Switch-over time is: [ 5 + ( 3 x 5 ) ] = 20 minutes.

Check the following API and response time from the VD shell.

 

4. Versa Director HA REST API

Refer to the following documentation for HA-related APIs:

 

https://docs.versa-networks.com/Management_and_Orchestration/Versa_Director/Director_REST_APIs/01_Versa_Director_REST_API_Overview


5. Latency/Throughput issue between the Directors


Transfer a file in both directions between the Directors to check for latency or throughput issues. If such an issue exists, resolve it before enabling HA. Otherwise HA will fail, because enabling HA transfers the snapshots and the Postgres DB from the Master to the Backup.

Refer to the following document for the bandwidth and latency requirements between the Directors:

https://support.versa-networks.com/a/solutions/articles/23000023324


6. Contact Support

 

If the issue is still not resolved after performing the steps above, open a case with Versa TAC and include all the output from the steps above along with a full tech-support dump from both Directors.