Resurrect broken cluster

I have a small 3-node cluster (v2.1.4) whose monitoring was broken for a long while, so I’ve only just noticed that the cluster itself is broken. Only one node is operational. What are my options to recover the data (or most of the data) from that single node? Can I copy the data directory to a different node and join the cluster from that?

Hi @syntern,

Welcome to Cockroach Forum!

In your 3-node cluster, only 1 node is running and the other 2 nodes are dead for whatever reason. A 3-node cluster, and thus the admin-UI, would still be available with only 2 live nodes, because 2 of 3 nodes still form a majority. However, when the 2nd node died, the cluster stopped working and the admin-UI monitoring stopped responding.
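
The availability rule described above comes down to simple majority arithmetic. Here is a minimal sketch, assuming the default replication factor of 3 and majority-based consensus; the function names are hypothetical and are not part of CockroachDB:

```python
def quorum_size(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replicas // 2 + 1


def has_quorum(live: int, replicas: int = 3) -> bool:
    """True if enough replicas are live to form a majority."""
    return live >= quorum_size(replicas)


# Walk through the states this cluster went through: 3, 2, then 1 live node.
for live in (3, 2, 1):
    state = "available" if has_quorum(live) else "unavailable"
    print(f"{live} live node(s): {state}")
```

With 3 replicas the majority is 2, which is why losing one node is survivable but losing two is not.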

Please do NOT copy the data directory from the 1 live node to any of the dead nodes. Instead, please try restarting 1 of the other 2 nodes with the same settings that you initially started it with. Once that 2nd node is live, the admin-UI monitoring should respond. The data on the 1 live node will also be replicated automatically to the 2nd live node.

If the 2nd node does NOT restart, please try restarting the other dead node instead. I would like you to get to the state of having 2 live nodes and a working cluster. However, you will still have Under-Replicated Ranges.

If you can get a 2nd node working, please try restarting the remaining dead node. Once this last node is working, the cluster should automatically replicate the ranges to it, so that Under-Replicated Ranges should equal zero.

Please keep me posted with the status of your cluster and let me know if you have any questions.

Regards,
Florence
Technical Support Engineer

The other two nodes were restarted separately, and they are still not able to connect. What are my options? Can I restart the working one with a replication factor of one?

Hi @syntern,

By “the other two nodes were restarted separately, and they are still not able to connect,” do you mean that the 2 nodes started successfully, but were not able to join the cluster of node 1 as described here? Or do you mean something else? Would you be able to provide any messages (error or info) that were returned when the nodes were restarted?

Would you be able to run cockroach node status as described here and post the results?

CockroachDB is meant to be run as a multi-node cluster; however if you would like to work with only a 1 node cluster, please refer to this page.

Regards,
Florence

Hey @syntern,

If the two other nodes have restarted and are unable to connect, then either:

  1. You have intact stores from your old cluster and a network partition, and resolving the network partition would allow the cluster to get back into a healthy state.
  2. You have empty stores, in which case the cluster can’t initiate because we cannot recover from a single replica.

To confirm which, inspect the logs. If case 1, your errant nodes would have a lot of log errors related to failed connections and you’d need to inspect your join flags and network settings to determine why the nodes cannot establish a bilateral connection. If case 2, you would see all three nodes connect successfully, followed by a variety of other failures.
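
For case 1, a quick way to test the network side is to check from each node whether it can open a TCP connection to the others. The following is a minimal sketch, not a CockroachDB tool: the function names are hypothetical, and it assumes the nodes listen on the default port 26257 (adjust it if you started them on a different port).

```python
import socket


def can_connect(host: str, port: int = 26257, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_peers(peers):
    """Map each 'host:port' address to whether it is reachable from this machine."""
    return {f"{host}:{port}": can_connect(host, port) for host, port in peers}
```

Run it on every node against the addresses listed in that node's join flag, for example check_peers([("node2.invalid", 26257)]) with a placeholder hostname replaced by your own. A bilateral connection requires the check to succeed in both directions, and a successful TCP connection only rules out basic network problems; it does not prove the nodes can join the cluster.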

If case 2, your only recourse is to wipe the cluster entirely or rebuild from a backup. Once a cluster has been initialized with 3 nodes, you can’t switch it to a 1 node cluster.

Best,

Tim