Restarting a dead node doesn't work

I have a cluster of three nodes (version 19.1) and am running an availability test.

Under a load of 1000 TPS (10% inserts, 90% selects), one node was killed.
Three minutes after the node went down, it was marked dead.
Two minutes later, the dead node was restarted.
However, the node did not return to the live state, and the following log was recorded:

W190626 04:46:02.219662 240 storage/node_liveness.go:523 [n3,hb] slow heartbeat took 4.5s
W190626 04:46:02.219693 240 storage/node_liveness.go:463 [n3,hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s

If the node is restarted immediately after it is marked dead, it quickly returns to the live state.
Adding a new node also starts without any problem.

In the previous version, a dead node came back up without issue.
Is there a problem with this version?

SET CLUSTER SETTING server.time_until_store_dead = '3m';
SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '32MiB';
SET CLUSTER SETTING kv.snapshot_recovery.max_rate = '32MiB';

Hi @yeri,

I’m not quite sure I understand your question.

After the node died and was restarted, did the Admin UI show that it was still dead?

Could you check the health of the node using these endpoints: curl <HOST>:<PORT>/health as well as curl <HOST>:<PORT>/_admin/v1/health
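As a quick sketch of that check, something like the following should work; HOST and HTTP_PORT are placeholders for your node's address and HTTP port (8080 by default):

```shell
# Placeholders: substitute your node's address and HTTP port.
HOST=localhost
HTTP_PORT=8080

# Basic liveness probe: prints the HTTP status code (200 when the node
# is up and able to serve requests).
curl -s -o /dev/null -w "%{http_code}\n" "http://${HOST}:${HTTP_PORT}/health"

# More detailed health information from the admin API.
curl -s "http://${HOST}:${HTTP_PORT}/_admin/v1/health"
```

If the node is secure, you may need `--cacert` (or `-k` for testing only) and https:// URLs.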

Also, is there any reason you changed the server.time_until_store_dead setting to 3m? We advise care when changing this setting: setting it too low causes increased network and disk I/O costs, as CockroachDB rebalances data around temporary outages.
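If the shorter interval was only for testing, you can put the setting back to its default. A sketch, assuming an insecure cluster reachable on localhost (adjust --host and drop --insecure as needed):

```shell
# Reset the setting to its default value.
cockroach sql --insecure --host=localhost:26257 \
  -e "RESET CLUSTER SETTING server.time_until_store_dead;"

# Verify the current value.
cockroach sql --insecure --host=localhost:26257 \
  -e "SHOW CLUSTER SETTING server.time_until_store_dead;"
```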



Hi Ron,
Yes, the Admin UI showed that it was still dead.
I checked the URLs you mentioned; they also didn't respond.
Then I stopped the stress test, killed the node, and restarted the dead node again. This time the Admin UI showed it as a live node.
Does that clarify the situation?
What does that log mean? (the one I mentioned in my first question)
What was that log? (as I mentioned the first question)

The setting was just to speed up testing.
I understand what it does, as you mentioned.
But I don't think it would have affected the dead node's failure to come back.

Have you tested availability with three nodes on this version? Did the cluster stay healthy?

Hi @yeri,

This does seem to be a bug, as I was able to reproduce what you've reported. I've created a GitHub issue so that our engineering team can take a look.