partially removed pve node / proxmox cluster

Written by Walter Doekes

The case of the stale (removed but not removed) PVE node in our Proxmox cluster.

On one of our virtual machine clusters, a node — pve3 — had been removed on purpose, yet it was still visible in the GUI with a big red cross (because it was unavailable). This was not only ugly, but also caused problems for the node enumeration done by proxmove.

The node had been properly removed, according to the removing a cluster node documentation. Yet it was apparently still there.

# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 pve1 (local)
         2          1 pve2
         3          1 pve4
         5          1 pve5

This listing looked fine: pve3 (nodeid 4) was absent. And all remaining nodes showed the same info.
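
A quick way to double-check that from a single node, assuming root ssh access between the cluster nodes, is a loop like:

# for h in pve2 pve4 pve5; do ssh $h pvecm nodes; done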

But a quick grep through /etc did turn up some references to pve3:

# grep pve3 /etc/* -rl
/etc/corosync/corosync.conf
/etc/pve/.version
/etc/pve/.members
/etc/pve/corosync.conf

Those two corosync.conf files were in sync: identical to each other and to the copies on the other three nodes (a quick checksum comparison is sketched below the excerpt). But they still contained a reference to the removed node:

nodelist {
...
  node {
    name: pve3
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.x.x.x
  }
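
Verifying that the copies really matched was a matter of comparing checksums, along these lines (using this cluster's host names):

# md5sum /etc/corosync/corosync.conf /etc/pve/corosync.conf
# for h in pve2 pve4 pve5; do ssh $h md5sum /etc/corosync/corosync.conf; done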

The .version and .members JSON files were not identical across the nodes, but they were similar, and they all listed 5 nodes (one too many):

# cat /etc/pve/.members
{
"nodename": "pve1",
"version": 77,
"cluster": { "name": "my-clustername", "version": 6, "nodes": 5, "quorate": 1 },
"nodelist": {
  "pve1": { "id": 1, "online": 1, "ip": "10.x.x.x"},
  "pve2": { "id": 2, "online": 1, "ip": "10.x.x.x"},
  "pve3": { "id": 4, "online": 0, "ip": "10.x.x.x"},
  "pve4": { "id": 3, "online": 1, "ip": "10.x.x.x"},
  "pve5": { "id": 5, "online": 1, "ip": "10.x.x.x"}
  }
}

The document versions were all a bit different, but the cluster version was the same across the nodes, except for one node, where the cluster version was 5 instead of 6.
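
Spotting the odd one out is easy with a loop like this, grepping the cluster line from each node:

# for h in pve1 pve2 pve4 pve5; do ssh $h grep cluster /etc/pve/.members; done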

Restarting corosync on that node fixed that problem: the cluster versions were now 6 everywhere.
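
On a stock PVE install, that restart is simply:

# systemctl restart corosync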

With that problem tackled, it was a matter of lowering the expected vote count to 4 and deleting the stale node:

# pvecm expected 4
# pvecm delnode pve3
Killing node 4

All right! Even though the pvecm nodes output did not list nodeid 4, delnode did find the right one. And this properly removed all traces of pve3 from the remaining files, making the cluster happy again.
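
To be sure, the earlier grep can be repeated; it should now come up empty:

# grep pve3 /etc/* -rl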

