I have a three-node cluster (version 19.1).
I am running an availability test.
Under a load of 1000 TPS (10% inserts, 90% selects), I killed one node.
Three minutes after the node was killed, it was marked as dead.
Two minutes later, I restarted the dead node.
However, the node did not return to live status, and the following log messages were recorded:
W190626 04:46:02.219662 240 storage/node_liveness.go:523 [n3,hb] slow heartbeat took 4.5s
W190626 04:46:02.219693 240 storage/node_liveness.go:463 [n3,hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s
If the node is restarted immediately after it is marked dead, it quickly becomes live again.
Adding a new node to the cluster also works without any problem.
In the previous version, a dead node came back to live status without trouble.
Is there a problem with this version?
Cluster settings in use:
SET CLUSTER SETTING server.time_until_store_dead = '3m';
SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '32MiB';
SET CLUSTER SETTING kv.snapshot_recovery.max_rate = '32MiB';
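
In case it helps with reproducing this, the values above can be read back to confirm they are in effect. This is a minimal sketch, assuming a SQL shell connected to any live node of the cluster; it only reads the settings and changes nothing:

```sql
-- Read back the three settings listed above to confirm the values in effect.
SHOW CLUSTER SETTING server.time_until_store_dead;
SHOW CLUSTER SETTING kv.snapshot_rebalance.max_rate;
SHOW CLUSTER SETTING kv.snapshot_recovery.max_rate;
```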