Replace a dead node with a new node while keeping the IP address

I have a 19.2.0 cluster with 9 nodes. Today one of the nodes died due to a hardware failure and now shows up as a dead node. I want to decommission it and add a new node with the same hostname and IP as the dead one (the background is that we have lots of “application” nodes with iptables rules that I do not want to update… therefore I would prefer to add a node with exactly the same IP as the dead one).

Is this approach possible, or is the IP “blocked” in any way once it shows up in the decommissioned list? Is reusing the same IP bad practice?

Thanks for any advice before I start fixing the situation.
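For anyone following along, the decommission-and-replace flow looks roughly like this with the v19.x CLI. This is only a sketch: the node ID (5), the `--certs-dir` path, and the `--join` addresses are placeholders, not values from this cluster.

```shell
# On any live node: find the dead node's ID, then decommission it.
# Node ID 5 and the certs dir are placeholders; substitute your own.
cockroach node status --certs-dir=/cockroach
cockroach node decommission 5 --certs-dir=/cockroach

# On the replacement machine (same hostname/IP, but a fresh, empty store):
cockroach start \
  --certs-dir=/cockroach \
  --store=/cockroach/cockroach-data \
  --join=<existing-node-1>,<existing-node-2>,<existing-node-3>
```

Because the store directory is empty, the new process joins as a brand-new node and receives the next free node ID, even though it reuses the old hostname and IP.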

Hi there,

The IP should not be blocked; you can actually recommission the node.

See this.

Thanks for the fast reply… recommissioning seems to be no option, as the new machine will be a blank, fresh machine that just shares the hostname/IP with the old one…
I will just add this machine to the cluster and assume it will show up with the next free node ID…

Hi @honne,

Sounds good.

Can you share a debug.zip with me over https://cockroachlabs.sendsafely.com/u/mvardi after you have added the new node that shares the hostname/ip with the old node?

Thank you!

Matt

Hi @mattvardi,
I decommissioned the dead node and added a fresh node with the same name and (OpenStack floating) IP.
All worked fine… I just made a typo and added the node with the wrong locality; after restarting it with the updated locality, all is fine now.
The debug.zip is uploaded as requested.
cheers
Heiko

P.S. Happy the cluster is in good shape again, as it might be one of the longest-running CockroachDB clusters on this planet… :wink:
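For anyone hitting the same locality typo: locality is set at startup via the `--locality` flag, so the fix is simply a restart with the corrected value, as described above. A minimal sketch (the tier names and paths below are placeholders, not this cluster's actual values):

```shell
# Stop the node, then restart it with the corrected locality tiers.
# region/datacenter values here are assumptions for illustration.
cockroach start \
  --certs-dir=/cockroach \
  --store=/cockroach/cockroach-data \
  --locality=region=eu-west,datacenter=dc2 \
  --join=<existing-node-1>,<existing-node-2>
```

The rebalancer picks up the new locality tiers once the node rejoins; no data migration step is needed.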

That is awesome to hear.

Thanks for sending the debug.zip. I’m particularly interested in the logs after the new node joins, and I notice that the debug.zip you sent only has one node and no logs for that node.

Are you logging to stderr? Is there a way you can get the logs from all of the nodes to me?
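For context on where logs end up: when `cockroach start` runs in the foreground without a log directory configured, logs go to stderr; otherwise they are written under `<first-store-dir>/logs` by default, and `--log-dir` can override that. A quick way to check (the store path below is an assumption):

```shell
# Default log location is the "logs" subdirectory of the first store.
ls /cockroach/cockroach-data/logs
tail -n 100 /cockroach/cockroach-data/logs/cockroach.log
```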

Thank you!
Matt

Just checked… it looks like the debug zip creation had errors.

debug/nodes/1/status.json
debug/nodes/1/crdb_internal.feature_usage.txt
^- resulted in dial tcp 10.96.214.109:26257: connect: no route to host
debug/nodes/1/crdb_internal.gossip_alerts.txt
^- resulted in dial tcp 10.96.214.109:26257: connect: no route to host
debug/nodes/1/details.json
^- resulted in rpc error: code = Unknown desc = unable to look up descriptor for n1
debug/nodes/1/gossip.json
^- resulted in rpc error: code = Unknown desc = unable to look up descriptor for n1
debug/nodes/1/enginestats.json

The IP it is trying to access is that of the node that was decommissioned.

The cluster itself has 9 nodes across 3 DCs.

I did not find a flag to bypass it…

Is the debug.zip process hanging, or has it just timed out on those particular steps and moved on?

It’s retrying and then terminating without any further info…
Here is the full log:

root@api-prd-2:/cockroach# ./cockroach debug zip --certs-dir=/cockroach debug.zip
writing debug.zip
  debug/events.json
  debug/rangelog.json
  debug/liveness.json
  debug/settings.json
  debug/reports/problemranges.json
  debug/crdb_internal.cluster_queries.txt
  debug/crdb_internal.cluster_sessions.txt
  debug/crdb_internal.cluster_settings.txt
  debug/crdb_internal.jobs.txt
  debug/system.jobs.txt
  debug/system.descriptor.txt
  debug/system.namespace.txt
  debug/crdb_internal.kv_node_status.txt
  debug/crdb_internal.kv_store_status.txt
  debug/crdb_internal.schema_changes.txt
  debug/crdb_internal.partitions.txt
  debug/crdb_internal.zones.txt
  debug/nodes/1/status.json
  debug/nodes/1/crdb_internal.feature_usage.txt
  ^- resulted in dial tcp 10.96.214.109:26257: connect: no route to host
  debug/nodes/1/crdb_internal.gossip_alerts.txt
  ^- resulted in dial tcp 10.96.214.109:26257: connect: no route to host
  debug/nodes/1/crdb_internal.gossip_liveness.txt
  ^- resulted in dial tcp 10.96.214.109:26257: connect: no route to host
  debug/nodes/1/crdb_internal.gossip_network.txt
  ^- resulted in dial tcp 10.96.214.109:26257: connect: no route to host
  debug/nodes/1/crdb_internal.gossip_nodes.txt
  ^- resulted in dial tcp 10.96.214.109:26257: connect: no route to host
  debug/nodes/1/crdb_internal.leases.txt
  ^- resulted in dial tcp 10.96.214.109:26257: connect: no route to host
  debug/nodes/1/crdb_internal.node_build_info.txt
  ^- resulted in dial tcp 10.96.214.109:26257: connect: no route to host
  debug/nodes/1/crdb_internal.node_metrics.txt
  ^- resulted in dial tcp 10.96.214.109:26257: connect: no route to host
  debug/nodes/1/crdb_internal.node_queries.txt
  ^- resulted in dial tcp 10.96.214.109:26257: connect: no route to host
  debug/nodes/1/crdb_internal.node_runtime_info.txt
  ^- resulted in dial tcp 10.96.214.109:26257: connect: no route to host
  debug/nodes/1/crdb_internal.node_sessions.txt
  ^- resulted in dial tcp 10.96.214.109:26257: connect: no route to host
  debug/nodes/1/crdb_internal.node_statement_statistics.txt
  ^- resulted in dial tcp 10.96.214.109:26257: connect: no route to host
  debug/nodes/1/crdb_internal.node_txn_stats.txt
  ^- resulted in dial tcp 10.96.214.109:26257: connect: no route to host
  debug/nodes/1/details.json
  ^- resulted in rpc error: code = Unknown desc = unable to look up descriptor for n1
  debug/nodes/1/gossip.json
  ^- resulted in rpc error: code = Unknown desc = unable to look up descriptor for n1
  debug/nodes/1/enginestats.json
  ^- resulted in rpc error: code = Unknown desc = unable to look up descriptor for n1
  debug/nodes/1/stacks.txt
  ^- resulted in rpc error: code = Unknown desc = unable to look up descriptor for n1
  debug/nodes/1/heap.pprof
  ^- resulted in rpc error: code = Unknown desc = unable to look up descriptor for n1
  debug/nodes/1/heapprof
  ^- resulted in rpc error: code = Unknown desc = unable to look up descriptor for n1
  /goroutines
  ^- resulted in rpc error: code = Unknown desc = unable to look up descriptor for n1

Same behaviour when creating the debug zip on another cluster node…

This is interesting. The next step would be to SSH to the machines directly and grab the logs. Can you do this?
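One possible way to do that, assuming SSH access to each node and the default store layout (the hostnames and the `/cockroach/cockroach-data` path below are assumptions, not values from this cluster):

```shell
# Pull the logs directory from every node over SSH into per-host tarballs.
# Hostnames and store path are placeholders; substitute your own.
for host in node1 node2 node3; do
  ssh "$host" 'tar czf - -C /cockroach/cockroach-data logs' > "logs-$host.tgz"
done
```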