Node decommissioned but 2 ranges are broken/unavailable


I have a 6-node cluster with 22637 ranges and (just for the moment) a replication factor of 1.

I tried to decommission the 2nd node. Draining went well until I reached the last 160 ranges. The draining process slowed down over time (the node drained something like 3000 ranges in two hours, then hung for an hour). After several restarts of the node the draining procedure continued; honestly, I don't know why. At the very end of the process the node said the decommissioning was complete, but I see there are still 2 ranges on the node.

Now the weirdness begins. If I stop the node, the two ranges are marked as unavailable. And even with the node up, if I try to access some of the values stored inside one of these ranges, the query hangs indefinitely.

I identified the two ranges in #/reports/problemranges . It looks to me like the two ranges exist on at least one other node, but they are marked as having an invalid lease, as you can see from these two screenshots:
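(In case it helps others debugging something similar: the same information can also be pulled from the SQL console; a rough sketch, assuming the `crdb_internal.ranges` virtual table available in this version, with hypothetical range IDs substituted for the ones the report shows:)

```sql
-- Hypothetical range IDs; substitute the ones shown on /reports/problemranges.
SELECT range_id, start_pretty, end_pretty, replicas, lease_holder
FROM crdb_internal.ranges
WHERE range_id IN (1234, 5678);
```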

Apparently i cannot fix the problem using the sql console. At least i don’t know how. Is there any way to fix this?

Hey @leonardo - this sounds like a bug. We should not report a node as decommissioned and leave it in a bad state. I’ve created a GitHub issue for this here. Can you add the logs from the troubled node there?

We’ve also been talking internally, and we really shouldn’t allow a multi-node cluster to be set to a replication factor of 1. I created a separate feature request to disallow this.

Could you let me know what you’re trying to do here by decreasing the replication factor? There might be a safer way of going about this.
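(For reference, the replication factor is controlled through zone configs; a rough sketch of the kind of change that would set it to 1, assuming this release's `cockroach zone set` CLI and an insecure local cluster - flags are illustrative, not a recommendation:)

```shell
# Sketch only: lower the default replication factor to 1,
# i.e. the risky setting discussed above.
echo 'num_replicas: 1' | cockroach zone set .default --insecure -f -
```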

Hi @tim-o

I’ve created a github issue for this here. Can you add the logs from the troubled node there?

Sure, I’ll add them as soon as possible.

Could you let me know what you’re trying to do here by decreasing the replication factor? There might be a safer way of going about this.

Well, I was just trying to understand what was happening after the first time a query hung. After realizing there was still more than one replica in the cluster, I thought that decreasing the replication factor might force the nodes to decide which of the replicas should survive and which should really be removed. I also tried to recommission node n2, stop node n4, and decommission node n2 again, hoping that this could lead to some fix by means of rebalancing. I clearly failed.

My first concern right now is to return to a correct state, even if it means deleting the corrupted ranges, since I can easily reimport them. The problem is, I don’t know how to do anything else. Is there a proper way to do it?

Hey @leonardo,

Unfortunately, the only way to recover from a corrupted range at the moment is to rebuild the cluster. There isn’t a valid lease on the range, nor a method of forcing one. This does seem to have been caused by the down-replication; we’ll definitely add some preventative measures to keep a cluster from getting into this state in the future.

I’m happy to advise on any questions you might have about how to go about the rebuild; let me know how I can help.

Oh boy. This is unexpected.

However: every node of my cluster is now running on Docker. I suppose I can bring up a second instance of CockroachDB alongside the existing one on each machine. Is there any way to dump the ranges stored on a node and reload them into the new one? Or do I have to reload all the data from scratch? I can waste some disk space if necessary.

Hi Leonardo,

If you can reload the data from scratch, that would be the safest option.

If you need to dump data from the old cluster, there is documentation for dumping tables and databases here:

I would only use this on tables that are unaffected by the problem ranges (e.g. those you can run a SELECT COUNT(*) on).
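(A sketch of what that looks like in practice, assuming the `cockroach dump` command from this release and an insecure cluster; the database and table names are hypothetical:)

```shell
# Sanity-check a table first; if this hangs, the table touches a problem range.
cockroach sql --insecure -e "SELECT COUNT(*) FROM mydb.mytable"

# Dump an unaffected table to a SQL file that can be replayed on the new cluster.
cockroach dump mydb mytable --insecure > mytable.sql
```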

Hi @radu,

yeah, I ended up starting to rebuild the cluster from zero, since I am not sure which tables are OK.

I was a bit depressed about reloading the tables from zero, because the first time it took me two weeks to populate my database, transferring all the data from MySQL to CRDB. I have to say I was unaware of the RETURNING NOTHING clause, and now that I’ve perfected my scripts using it (and moved to a faster machine) I am confident it would take less time. Of course, if any of you have suggestions on how to speed up the entire process, let me know. Right now I’m using a Python 2 + mysqlclient + psycopg2cffi multithreaded script, with a pool of 100 connections and 80 threads, generating 1000 queries per transaction.
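(In case it’s useful to anyone doing a similar migration: the batching I’m describing boils down to grouping rows into multi-row INSERT statements with RETURNING NOTHING appended, as handed to a psycopg2 cursor. A minimal, self-contained sketch with hypothetical helper names, not my exact script:)

```python
# Sketch of batching rows into multi-row INSERT ... RETURNING NOTHING
# statements, as one would pass them to a psycopg2 cursor.

def chunk(rows, size):
    """Yield successive slices of `rows` with at most `size` elements."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def build_insert(table, columns, batch):
    """Build one multi-row INSERT statement plus its flattened parameters.

    RETURNING NOTHING tells CockroachDB not to return per-row results,
    which is what sped up my import scripts.
    """
    placeholders = ", ".join(
        "(" + ", ".join(["%s"] * len(columns)) + ")" for _ in batch
    )
    sql = "INSERT INTO {} ({}) VALUES {} RETURNING NOTHING".format(
        table, ", ".join(columns), placeholders
    )
    params = [value for row in batch for value in row]
    return sql, params

rows = [(1, "a"), (2, "b"), (3, "c")]
for batch in chunk(rows, 2):
    sql, params = build_insert("t", ("id", "name"), batch)
    # With a real connection: cur.execute(sql, params)
    print(sql, params)
```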

That said, I still think CRDB should provide some debugging tools for manually operating on the cluster. In my case, it was a bit frustrating knowing that I could have had a functioning database by removing just 40,000 records out of the ~100 million stored in the cluster, and then reimporting them. It would be easier if there were a command that sent a message to the nodes like “fellas, if by any chance any of you are maintaining a copy of range <x>, please remove it and let us forget about its existence, thanks”. Is this a challenging request due to the distributed nature of CRDB?

I filed an issue to track implementing functionality to help deal with such problems.

Regarding copying the data: are you aware of the IMPORT statement? If you can put your data in CSV or TSV format, using IMPORT should be much faster than doing inserts.
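(A sketch of what that would look like, with hypothetical table, column, and file names; the exact IMPORT syntax and supported data sources depend on the CockroachDB version:)

```sql
-- Hypothetical example: load a CSV exported from MySQL into a new table.
IMPORT TABLE mydb.mytable (
    id INT PRIMARY KEY,
    name STRING
)
CSV DATA ('nodelocal:///mytable.csv');
```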