Node replaced but some ranges are not replicated

I replaced a node with another one in a 3-node cluster and everything went well; the ranges replicated completely. Then I replaced another node. Most of the ranges replicated, but 19 ranges are now stuck in the UNDER-REPLICATED state.

I don’t know if it is related or not, but I get lots of logs like the following:

storage/raft_transport.go:281 unable to accept Raft message from (n1,s1):?: no handler registered for (n2,s2):?

What is the problem? What can I do to fix it?

Thanks

Hey @newcockroacher,

How did you replace the node? Are you following this document?

What is your replication factor and what version of CRDB are you running?
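
If you’re not sure, you can check both from a SQL shell with something like the following (a sketch using what I believe is the v19.1 syntax):

-- replication factor of the default zone (look at num_replicas in the output)
SHOW ZONE CONFIGURATION FOR RANGE default;
-- server version
SELECT version();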

Thanks,
Matt

Replication factor is the default of 3. CRDB version is v19.1.4.

I had 3 nodes running on normal HDDs. I stopped one and brought up a new node on an SSD machine. Everything worked perfectly, and the under-replicated ranges replicated to the new node after about 30 minutes. Then I continued with the next node: I stopped the next HDD machine and brought up a new instance on another SSD machine. Replication of the under-replicated ranges started but never finished; 19 ranges remain under-replicated. I don’t know what to do next.

Hey @newcockroacher,

How are you stopping the nodes?

I just stopped the Docker container for that node. That worked for the first node, but not the second time.

I suggest you use cockroach quit --decommission in the future to stop the nodes.

This allows all replicas (including system replicas) to move to the other nodes in your cluster before the node shuts down.
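
For example, something along these lines (a sketch for an insecure cluster; substitute your own host, port, and certificates):

# drain the node and move all of its replicas to other nodes before shutdown
cockroach quit --decommission --insecure --host=<address-of-node-to-remove>

# from another node, watch the decommissioning node's replica count drop to 0
cockroach node status --decommission --insecure --host=<address-of-live-node>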

Unfortunately, I believe just stopping the Docker container can lead to complications within the system ranges.

Can you query your cluster?
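
If you can, a query along these lines should show which ranges are still under-replicated (a sketch against crdb_internal.ranges, assuming the v19.1 columns and your replication factor of 3):

-- ranges that currently have fewer than 3 replicas
SELECT range_id, start_pretty, replicas
FROM crdb_internal.ranges
WHERE array_length(replicas, 1) < 3;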

Yes I can query.

The replicas are currently on 2 nodes, but are not propagating to the 3rd node.

This is the main use case for CockroachDB, and it isn’t doing what it’s supposed to do.

Hi @newcockroacher,

There are best practices for performing node replacements like this.

Were you using external storage for your nodes’ store directories in your Docker containers?

If you’re able to query, it might be best to export your data and import it into a fresh cluster.
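
Something like this would work (a sketch; cockroach dump writes the schema and data out as SQL statements that you can replay against the new cluster):

# on the old cluster: dump schema and data for your database
cockroach dump <yourdb> --insecure --host=<old-node> > backup.sql

# on the fresh cluster: recreate the database and replay the dump
cockroach sql --insecure --host=<new-node> -e 'CREATE DATABASE <yourdb>'
cockroach sql --insecure --host=<new-node> --database=<yourdb> < backup.sql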

What version of CRDB are you running?

Thanks,
Matt

Thanks Matt for trying to help. Each time I try to reply, my message is held until a moderator approves it. This is killing me and makes the experience a lot worse.

I read the best practices. They say that when decommissioning, I should first bring up another node and only then decommission the old one. I didn’t do that; I assumed the situation was exactly like losing one of the nodes. And it worked for the first node.

I use bind mounts for the Docker containers.

CRDB@v19.1.4

Anyway, now it is very slow and this error is killing me:
pq: transaction is too large to complete; try splitting into pieces

I searched for it, and it seems the problem is fixed, but the fix will only ship in v19.2. That is an alpha version, and I can’t switch to it.

Right now I have a big problem with the DB, and I’m thinking of migrating my data to PostgreSQL.

Hi Sam,

My apologies for the forum experience; the approval system is there to prevent spam from new users. I have modified your trust level, so you should not run into post confirmation anymore.

Could you elaborate on what in particular is slow, and on what statements you are issuing when you run into that error?

Thanks,
Matt

Dear Matt,

I searched for the error message I quoted above, and I found that the CockroachDB dev team agreed this error was being thrown more often than intended and decided to remove it. However, since the fix didn’t make it into the v19.1 betas, the team lead didn’t allow it to be released in v19.1, and it was postponed to v19.2. v19.2 is currently in alpha and will be released in 3 months.

I can’t wait for 3 months, I can’t update to the v19.2 alpha, and I get lots of errors for this reason. The failing requests are retried, most of them hit the same error, and they retry again. This multiplies my requests to the DB well beyond what they need to be, making everything slower. And really, my transactions are not that big; I believe this should not be happening.

About the forum experience: because of the big difference in timezones, when I posted a message no moderator was around, and by the time one came online and approved it, I was offline. This stretches the conversation over many days, slowing everything down even more.

This is just feedback for the Cockroach team.

Thanks.

Hey Sam,

Thank you for the feedback.

As I said before, I’m happy to help you troubleshoot the slowness, but I need more information.

I understand your transactions are not that large; however, in the context of distributed SQL, certain limitations are imposed in order to maintain the integrity of the data. If you can provide more insight into the operations your transactions perform, I’ll do my best to find a suitable workaround.
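
As a general pattern, the usual workaround for that error is to bound the write set of each transaction, for example by chunking large INSERT ... SELECT or UPDATE statements into smaller pieces, each in its own transaction (a sketch with made-up table and column names; repeat until no rows are affected):

-- instead of one large INSERT ... SELECT, copy rows in bounded chunks,
-- carrying the last copied id forward between transactions
INSERT INTO dst (id, payload)
SELECT id, payload
FROM src
WHERE id > $1 -- the last id copied so far
ORDER BY id
LIMIT 1000;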

With that said, v19.2 is scheduled for a mid-October release.

Regarding the forum experience, I understand your frustration and where you are coming from. We are working on providing faster support across time zones. As of my previous reply, you should no longer need approval for your posts.

Thanks,
Matt

Thanks Matt,

Here is the rough structure of the transactions that get lots of errors:

begin;
select … from table1 where (x, y, z) = (X, Y, Z) and w > W order by w desc limit 1;
insert into table2 …;
insert into table1 …;
insert into table3 … (select … from table4 …);
insert into table3 … (select … from table4 …);
update table4 …;
update table4 …;
select … from table4 …;
select … from table4 …;
select … from table4 …;
select … from table4 …;
commit;