High Availability Setup

Infrastructure

Production

IP/hostname: netxms-prod

PostgreSQL version: 14.3

PostgreSQL systemd service name: postgresql-14.service

PostgreSQL data directory: /u0fs1/pg-data/14

PostgreSQL port: 5432

NetXMS installation prefix: /opt/netxms

NetXMS system service names: netxmsd.service, nxagentd.service, nxreportd.service

DR

IP/hostname: netxms-dr

PostgreSQL version: 14.2

PostgreSQL systemd service name: postgresql-14.service

PostgreSQL data directory: /u0fs1/pg-data/14

PostgreSQL port: 5432

NetXMS installation prefix: /opt/netxms

NetXMS system service names: netxmsd.service, nxagentd.service, nxreportd.service

Switchover procedure

Switchover steps:

Confirm which node is currency active

Process “netxmsd” should be running only on active node (check with “ps” or “pgrep”)

Run “pg_replica_state” to get the current state of the database on this server. Active node will be marked as “Sender / Primary”.

Stop netxmsd on active node:

Run “systemctl stop netxmsd”

Make sure it’s stopped (with “ps” or “pgrep”)

Switch active database instance to standby (read-only) mode:

Run “sudo -u postgres touch /u0fs1/pg-data/14/standby.signal”

Run “systemctl restart postgresql-14”

Check logs (/u0fs1/pg-data/14/log/postgresql-*.log), it should contain records:

“starting PostgreSQL…”

“consistent recovery state reached at…”

“database system is ready to accept read only connections”

Promote another node as new PostgreSQL sender node:

On second node run sudo -u postgres psql -c ‘select pg_promote()’

Check log file for following records:

“…received promote request”

“selected new timeline ID: …”

“archive recovery complete”

“database system is ready to accept connections” (non-readonly!)

Start netxmsd on another node

Switchover procedure is identical when switching from PROD to DR and from DR to PROD.

Failover procedure

Follow the switchover procedure from item 4 onwards.

Failover recovery

Once a failed server (which was sender before the failover) is up and running, you need to switch it to the replica mode.

Stop PostgreSQL (“systemctl stop postgresql-14”) on the failed node

Run “sudo -u postgres touch /u0fs1/pg-data/14/standby.signal” to switch it to the replica mode

Unwind this DB instance to the state where it’s in sync with the current sending server:

run sudo -u postgres /usr/pgsql-14/bin/pg_rewind –target-pgdata=/u0fs1/pg-data/14 –source-server=”host=ACTIVE_DB user=postgres password=PASSWORD””.

ACTIVE_DB should point to the current sender instance (netxms-prod or netxms-dr).

Start PostgreSQL instance with “systemctl start postgresql-14”

Check logs and make sure that database is started and it’s in read only mode. Once recovery is completed, a switchover procedure might be performed