Thanks for your request. I am not sure yet; I will investigate this.
A few questions:
- which version of CockroachDB are you using?
- is this a production system? how many running applications/users are impacted?
- do you have clean backups you can recover from?
I’ve restarted the offline node many times. The cluster has been rebalancing that node’s replica data and is now stable, but I don’t know whether any data is missing.
I’m not using commercial CockroachDB; the version is 2.0.0.
Is it because the interval between upgrading my nodes was too short?
An inconsistency error means exactly what it says: one of the replicas has been found to contain different data from the other replicas. This can be a bug in CockroachDB, an error in your RAM modules, or an error in the underlying storage system. The error is here to protect your data.
There are two ways forward from here:
- either the error occurs on only 1 node and you have 3 or more nodes in total. In that case, if you have quorum (a majority of good replicas), you can simply remove that faulty node (see my other response below), add a fresh node to the cluster, and CockroachDB will recover.
- the error occurs on a majority of nodes in the cluster (e.g. 2 out of 3). In that case recovery will be more difficult and I will need to investigate further.
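For the first case, the remove-and-replace step can be sketched with the standard CLI commands. This is a sketch, not an exact recipe: the hostnames, node ID, and store path are placeholders, and the flags shown assume an insecure cluster.

```shell
# Find the ID of the faulty node.
cockroach node status --insecure --host=<any-live-node>

# Decommission the faulty node so its replicas are moved elsewhere.
cockroach node decommission <node-id> --insecure --host=<any-live-node>

# Start a fresh node on a new machine and join it to the cluster.
cockroach start --insecure --store=/mnt/new-store --join=<any-live-node>:26257
```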
In any case, version 2.0.0 is very old and contains major bugs. You should not be using it. Please upgrade to at least 2.0.4. If you were using CockroachDB 1.1 before, please also ensure you have finalized the upgrade process with these instructions: https://www.cockroachlabs.com/docs/v2.0/upgrade-cockroach-version#finalize-the-upgrade
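If you did come from 1.1, the finalization step in the linked docs boils down to one cluster setting, run once from any node. A sketch via `cockroach sql`, again assuming an insecure cluster and a placeholder hostname:

```shell
# Finalize the 1.1 -> 2.0 upgrade (per the linked docs); run once on any node.
cockroach sql --insecure --host=<any-live-node> \
  -e "SET CLUSTER SETTING version = '2.0';"
```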
In my previous answer I have advised to “simply remove the faulty node”. This is not entirely correct.
The problem is that if you have 3 nodes X, Y and Z, when you observe the error message on node X, it does not mean that the faulty copy is on node X itself. It could be on node Y and X is simply observing that it has a different copy than Y.
So the correct way to recover is as follows:
- try stopping one node. Do not remove the data files / configuration. Wait for the consistency checker to verify the remaining copies are OK.
- if the remaining copies are OK, then the node stopped in step 1 was indeed the one with the faulty copy. In that case (and in that case only!) you can fully decommission that node and start a fresh node in its place.
- if the remaining copies are not OK, then the inconsistency was on one of the other nodes. In that case, restart the node stopped in step 1, wait for the cluster to stabilize again, then choose another node to stop and start again at step 1.
Eventually you’ll find the node with the faulty copy so you can decommission it.
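The rotate-and-check procedure above can be sketched as a manual loop. Hostnames, node ID, and store path are placeholders for your own cluster, and the flags assume an insecure deployment; `cockroach quit` stops a node gracefully without removing its data files.

```shell
# Step 1: stop one node; its data files stay on disk.
cockroach quit --insecure --host=<node-to-test>

# Step 2: wait, then watch the logs of the remaining nodes. If the
# consistency checker reports no new errors, the stopped node held
# the faulty copy: decommission it and replace it.
cockroach node decommission <node-id> --insecure --host=<any-live-node>

# Step 3: otherwise, restart the stopped node, let the cluster
# stabilize, and repeat from step 1 with a different node.
cockroach start --insecure --store=/mnt/store --join=<any-live-node>:26257
```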
Thank you very much for your advice
My cluster has five nodes. Can I delete the data on one of them to restore consistency?
@chaihuo please create a new forum post with your question
Hi, we have encountered the same problem. I’m replying here mainly because I have questions about the recovery method above.
1. Say we have more than 3 nodes in the cluster and a replication factor of 3. If we are unlucky and stop the node that contains the correct data, then of the 2 remaining replicas one is good and one is bad. How will CockroachDB decide which one to replicate?
2. If a consistency check happens before replicating, will it crash another node when it detects the inconsistency? Given that there are only 2 replicas at that point (up-replication has not happened yet), crashing another node means there would be no majority for that data?
To answer your first question: CockroachDB will not replicate in that scenario.
The consistency check will not crash another node, as it would then lose a majority of its replicas.
Thanks for your explanation!
What version of CRDB are you using? Also, are you still seeing these consistency errors?
If you are, would you be willing to provide us with a data dump of your stores?