High Availability Setup

Infrastructure

Production

IP/hostname: netxms-prod

PostgreSQL version: 14.3

PostgreSQL systemd service name: postgresql-14.service

PostgreSQL data directory: /u0fs1/pg-data/14

PostgreSQL port: 5432

NetXMS installation prefix: /opt/netxms

NetXMS system service names: netxmsd.service, nxagentd.service, nxreportd.service

DR

IP/hostname: netxms-dr

PostgreSQL version: 14.2

PostgreSQL systemd service name: postgresql-14.service

PostgreSQL data directory: /u0fs1/pg-data/14

PostgreSQL port: 5432

NetXMS installation prefix: /opt/netxms

NetXMS system service names: netxmsd.service, nxagentd.service, nxreportd.service

Switchover procedure

Switchover steps:

  1. Confirm which node is currently active:

    1. Process “netxmsd” should be running only on the active node (check with “ps” or “pgrep”)

    2. Run “pg_replica_state” to get the current state of the database on this server. The active node will be marked as “Sender / Primary”.

  2. Stop netxmsd on active node:

    1. Run “systemctl stop netxmsd”

    2. Make sure it’s stopped (with “ps” or “pgrep”)

  3. Switch active database instance to standby (read-only) mode:

    1. Run “sudo -u postgres touch /u0fs1/pg-data/14/standby.signal”

    2. Run “systemctl restart postgresql-14”

    3. Check the logs (/u0fs1/pg-data/14/log/postgresql-*.log); they should contain the following records:

      1. “starting PostgreSQL…”

      2. “consistent recovery state reached at…”

      3. “database system is ready to accept read only connections”

  4. Promote the other node to be the new PostgreSQL sender node:

    1. On the second node, run: sudo -u postgres psql -c 'select pg_promote()'

    2. Check the log file for the following records:

      1. “…received promote request”

      2. “selected new timeline ID: …”

      3. “archive recovery complete”

      4. “database system is ready to accept connections” (non-readonly!)

  5. Start netxmsd on the other node (“systemctl start netxmsd”)

The switchover procedure is identical when switching from PROD to DR and from DR to PROD.
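The checks in steps 1, 3 and 4 can be scripted. The following is a minimal sketch and not part of the original procedure. It assumes psql is run locally as the postgres user and uses the built-in pg_is_in_recovery() function as an alternative to “pg_replica_state”. The helper names (node_role, local_db_role, log_has_all, latest_log) are hypothetical. The log message texts are the ones quoted in the steps above; the exact wording may differ between PostgreSQL versions, so the patterns are passed in by the caller.

```shell
#!/bin/sh
# Hypothetical helpers for the switchover checks; all names are illustrative.

PGDATA_DIR="/u0fs1/pg-data/14"

# Map the output of "select pg_is_in_recovery()" (t or f) to a role name.
node_role() {
    case "$1" in
        f) echo "primary" ;;
        t) echo "standby" ;;
        *) echo "unknown" ;;
    esac
}

# Query the local instance and print its role (needs a running PostgreSQL).
local_db_role() {
    node_role "$(sudo -u postgres psql -At -c 'select pg_is_in_recovery()' 2>/dev/null)"
}

# Return 0 only if every given extended-regex pattern occurs in the log file.
# Usage: log_has_all LOGFILE PATTERN...
log_has_all() {
    logfile="$1"
    shift
    for pattern in "$@"; do
        if ! grep -Eq -- "$pattern" "$logfile"; then
            echo "missing in $logfile: $pattern" >&2
            return 1
        fi
    done
    return 0
}

# Print the most recently modified PostgreSQL log file, if any.
latest_log() {
    ls -t "$PGDATA_DIR"/log/postgresql-*.log 2>/dev/null | head -n 1
}
```

For example, after step 3 on the old active node: log_has_all "$(latest_log)" 'consistent recovery state reached' 'ready to accept read[ -]only connections'. The log file also holds older entries, so a match only shows that the message appears somewhere in the file; the timestamps still need to be checked by eye.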

Failover procedure

In a failover the previously active node is unavailable, so steps 1 to 3 are skipped. Follow the switchover procedure from step 4 onwards.
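Steps 4 and 5 of the switchover procedure, as used for a failover, could be wrapped as shown below. This is a sketch under assumptions: it runs on the surviving node with enough privileges for sudo and systemctl, and the function names (probe_recovery, wait_for_primary, failover_promote) are hypothetical. It polls pg_is_in_recovery() instead of reading the log file; checking the log records listed in step 4 is still worthwhile.

```shell
#!/bin/sh
# Prints "t" while the local instance is a standby and "f" once it is primary.
probe_recovery() {
    sudo -u postgres psql -At -c 'select pg_is_in_recovery()'
}

# Poll until the local instance reports that it has left recovery.
# Usage: wait_for_primary [ATTEMPTS] [DELAY_SECONDS]
wait_for_primary() {
    attempts="${1:-30}"
    delay="${2:-2}"
    while [ "$attempts" -gt 0 ]; do
        if [ "$(probe_recovery 2>/dev/null)" = "f" ]; then
            return 0
        fi
        attempts=$((attempts - 1))
        sleep "$delay"
    done
    return 1
}

# Steps 4 and 5 of the switchover procedure, run on the surviving node.
failover_promote() {
    # Ask PostgreSQL to promote this standby.
    sudo -u postgres psql -c 'select pg_promote()' || return 1
    # Do not start netxmsd until the database accepts read-write connections.
    wait_for_primary || return 1
    systemctl start netxmsd
}
```

Calling failover_promote returns non-zero if the promotion request fails or the instance is still in recovery after the polling window, in which case netxmsd is not started.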

Failover recovery

Once the failed server (which was the sender before the failover) is up and running again, you need to switch it to replica mode.

  1. Stop PostgreSQL (“systemctl stop postgresql-14”) on the failed node

  2. Run “sudo -u postgres touch /u0fs1/pg-data/14/standby.signal” to switch it to replica mode

  3. Rewind this DB instance to the state where it is in sync with the current sender:

    Run: sudo -u postgres /usr/pgsql-14/bin/pg_rewind --target-pgdata=/u0fs1/pg-data/14 --source-server="host=ACTIVE_DB user=postgres password=PASSWORD"

    ACTIVE_DB should point to the current sender instance (netxms-prod or netxms-dr).

  4. Start PostgreSQL instance with “systemctl start postgresql-14”

  5. Check the logs and make sure that the database has started and is in read-only mode. Once recovery is complete, a switchover can be performed if needed.
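The pg_rewind call in step 3 can be tried with --dry-run first, which reports what would be done without modifying the target directory. The following is a sketch only: the function names rewind_cmd and run_rewind are hypothetical, and unlike the command in step 3 it leaves the password out of the connection string, assuming it is supplied through the postgres user's ~/.pgpass file.

```shell
#!/bin/sh
PGDATA_DIR="/u0fs1/pg-data/14"
PG_REWIND="/usr/pgsql-14/bin/pg_rewind"

# Print the pg_rewind command line for review, without running it.
# Usage: rewind_cmd ACTIVE_DB [extra pg_rewind options...]
rewind_cmd() {
    active_db="$1"
    shift
    printf '%s --target-pgdata=%s --source-server="host=%s user=postgres"' \
        "$PG_REWIND" "$PGDATA_DIR" "$active_db"
    for opt in "$@"; do
        printf ' %s' "$opt"
    done
    printf '\n'
}

# Run pg_rewind as the postgres user against the current sender.
# PostgreSQL on this node must already be stopped (step 1).
# Usage: run_rewind ACTIVE_DB [extra pg_rewind options...]
run_rewind() {
    active_db="$1"
    shift
    sudo -u postgres "$PG_REWIND" \
        --target-pgdata="$PGDATA_DIR" \
        --source-server="host=$active_db user=postgres" "$@"
}
```

A possible sequence is run_rewind netxms-dr --dry-run followed by run_rewind netxms-dr if the dry run reports no errors. Because pg_rewind synchronizes the target directory with the source, it is worth confirming that standby.signal is still present before step 4; the touch command from step 2 is safe to repeat.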