Node has connected to cluster via gossip BUT unable to contact the other nodes

Hi everybody,

As a result of an out-of-memory condition, one of the three cluster nodes was rebooted. Since then, no matter what I try to rejoin the node, I always get:

2018-07-09 08:32:52.141592 I180709 08:32:52.141572 125 gossip/client.go:129  [n?] started gossip client to ip.ip.ip.ip:26257
2018-07-09 08:33:06.278205 I180709 08:33:06.278154 74 storage/store.go:1260  [n1,s1] ran legacy tombstone migration in 13.612404387s
2018-07-09 08:33:22.123994 W180709 08:33:22.123969 412 server/server.go:1267  The server appears to be unable to contact the other nodes in the cluster. Please try
2018-07-09 08:33:22.124020 - starting the other nodes, if you haven't already
2018-07-09 08:33:22.124023 - double-checking that the '--join' and '--host' flags are set up correctly
2018-07-09 08:33:22.124025 - running the 'cockroach init' command if you are trying to initialize a new cluster
2018-07-09 08:33:22.124028 If problems persist, please see https://www.cockroachlabs.com/docs/v2.0/cluster-setup-troubleshooting.html.
2018-07-09 08:33:23.628137 I180709 08:33:23.628122 27 gossip/gossip.go:1306  [n1] node has connected to cluster via gossip

and no more messages.

  • the node is started with the correct --host and --advertise-host
  • the node is started with the correct --join for each of the other two nodes
  • I can telnet to either of the other nodes on tcp ip.ip.ip.ip:26257

On the other node, the first one in the --join list, I see:

2018-07-09 10:20:18.680161 I180709 10:20:18.680152 4415653 storage/raft_transport.go:459  [n3] raft transport stream to node 1 established
2018-07-09 10:20:19.155176 W180709 10:20:19.155152 2775 storage/store.go:3920  [n3,s18] handle raft ready: 0.5s [processed=0]

cockroach node status executed on the two live nodes shows the node with id 1 (the problematic node) as is_live false.
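
For reference, the check I ran looked roughly like this (a sketch; the certs dir is the one from the start command below, and the host is one of the live nodes):

/usr/local/bin/cockroach node status \
  --certs-dir=/etc/cockroach/ssl \
  --host=ip.ip.ip.2 \
  --port=26257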

As you can imagine, I can’t connect to CockroachDB via the SQL console either, as the command just freezes (probably due to the unsuccessful startup).

All three nodes are running CockroachDB v2.0.2.

CockroachDB is started like this:

/usr/local/bin/cockroach start \
  --certs-dir=/etc/cockroach/ssl \
  --http-host=127.0.0.1 \
  --http-port=8080 \
  --cache=5% \
  --max-sql-memory=80% \
  --store=path=/mnt/sda4,attrs=ssd --store=path=/mnt/sdb4,attrs=ssd --store=path=/mnt/sdc,attrs=ssd --store=path=/mnt/sdd,attrs=ssd --store=path=/mnt/sde,attrs=ssd --store=path=/mnt/sdf,attrs=ssd \
  --locality=region=US,datacenter=CHI \
  --port=26257 \
  --join=ip.ip.ip.2:26257 --join=ip.ip.ip.3:26257 \
  --logtostderr=INFO \
  --host=ip.ip.ip.1 \
  --advertise-host=ip.ip.ip.1

It has been running like this for the past two hours; the last message shown is below, with nothing new after that:

2018-07-09 08:33:23.628137 I180709 08:33:23.628122 27 gossip/gossip.go:1306 [n1] node has connected to cluster via gossip

node_exporter and Grafana are showing abnormally high disk reads on ip.ip.ip.2:26257, which is the first --join IP, even though our workload is mostly writes.

Any clues how to debug this issue further?

P.S. I tried decommission and recommission but without any luck. The only way I managed to return the failed node to the cluster was to decommission it, erase all the data from the storage dirs, and start cockroach again, which effectively re-synchronized all the data, as the old node now appears to the cluster as a new node.
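
For completeness, the decommission/recommission attempts looked roughly like this (a sketch; node ID 1 is the problematic node and the host is one of the live nodes):

/usr/local/bin/cockroach node decommission 1 \
  --certs-dir=/etc/cockroach/ssl \
  --host=ip.ip.ip.2 \
  --port=26257

/usr/local/bin/cockroach node recommission 1 \
  --certs-dir=/etc/cockroach/ssl \
  --host=ip.ip.ip.2 \
  --port=26257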

Hey @valski - seems like whatever is happening isn’t making it into the logs. Can you add --vmodule=rocksdb=3 to the start command and see if anything informative starts showing up in the logs?
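
Based on the start command you pasted, that would look something like the following (other flags kept as they are, trimmed here for brevity):

/usr/local/bin/cockroach start \
  --certs-dir=/etc/cockroach/ssl \
  --store=path=/mnt/sda4,attrs=ssd ... \
  --join=ip.ip.ip.2:26257 --join=ip.ip.ip.3:26257 \
  --host=ip.ip.ip.1 \
  --advertise-host=ip.ip.ip.1 \
  --vmodule=rocksdb=3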

Hey @tim-o. Thank you for getting back to me. Unfortunately, to keep the cluster steady, I have already decommissioned the node, erased the data, and joined it as a new one (new node id), so I can’t test with --vmodule=rocksdb=3 as advised (for now). Decommission and recommission did not help, so the only option was a fresh start.

However, while this issue was present I increased the cockroach verbosity level on one of the restarts of the problematic node, and below are the logs I got from the moment cockroach started. I’m not really sure whether they will help, but take a look and let me know if they are of any use.

*snip logs from metrics boot*
2018-07-09 08:06:03.889834 I180709 08:06:03.889830 23 util/metric/registry.go:77  Added metric: queue.replicate.rebalancereplica (*metric.Counter)
2018-07-09 08:06:03.889847 I180709 08:06:03.889837 23 util/metric/registry.go:77  Added metric: queue.replicate.transferlease (*metric.Counter)
2018-07-09 08:06:03.889919 I180709 08:06:03.889914 23 util/metric/registry.go:77  Added metric: queue.replicagc.removereplica (*metric.Counter)
2018-07-09 08:06:03.985703 I180709 08:06:03.985667 23 base/node_id.go:60  [n1,s1] NodeID set to 1
2018-07-09 08:06:04.131040 I180709 08:06:04.131022 74 gossip/client.go:230  [n1] received [store:13 distsql-draining:3 liveness:2 store:10 liveness:1 distsql-draining:2 distsql-version:2 distsql-version:3 liveness:3 store:11 cluster-id store:7 distsql-version:1 node:3 store:16 store:9 sentinel store:8 first-range store:14 store:15 store:17 store:18 system-db distsql-draining:1 store:12 node:2] from node 2 (1 fresh)
2018-07-09 08:06:04.131063 I180709 08:06:04.131056 165 gossip/gossip.go:1249  [n1] distantHops: 2 from 3
2018-07-09 08:06:04.154553 I180709 08:06:04.154474 74 gossip/client.go:230  [n1] received [distsql-version:3 sentinel store:14 store:9 distsql-draining:2 distsql-draining:1 store:15 distsql-draining:3 distsql-version:1 distsql-version:2 liveness:1 node:2 store:11 store:12 cluster-id store:17 system-db liveness:3 node:3 store:13 store:18 store:7 first-range liveness:2 store:10 store:16 store:8] from node 2 (3 fresh)
2018-07-09 08:06:04.154606 I180709 08:06:04.154579 263 gossip/gossip.go:758  [n1] updateStoreMap called on "store:13" with desc {StoreID:13 Attrs:ssd Node:{NodeID:3 Address:{NetworkField:tcp AddressField:ip.ip.ip.1:26257} Attrs: Locality:region=US,datacenter=CHI ServerVersion:2.0} Capacity:disk (capacity=782 GiB, available=742 GiB, used=104 MiB, logicalBytes=36 KiB), ranges=3, leases=3, writes=0.08, bytesPerReplica={p10=0.00 p25=0.00 p50=250.00 p75=36830.00 p90=36830.00}, writesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.08 p90=0.08}}
2018-07-09 08:06:04.154655 I180709 08:06:04.154647 74 gossip/client.go:230  [n1] received [distsql-draining:3 liveness:2 node:2 store:15 store:12 store:14 distsql-draining:2 distsql-version:1 store:11 store:9 distsql-draining:1 store:16 cluster-id liveness:1 liveness:3 store:10 store:13 distsql-version:2 sentinel store:17 store:8 store:18 store:7 system-db distsql-version:3 first-range node:3] from node 2 (1 fresh)
2018-07-09 08:06:04.154708 I180709 08:06:04.154699 165 gossip/gossip.go:1249  [n1] distantHops: 2 from 3
2018-07-09 08:06:04.154723 I180709 08:06:04.154706 263 gossip/gossip.go:758  [n1] updateStoreMap called on "store:14" with desc {StoreID:14 Attrs:ssd Node:{NodeID:3 Address:{NetworkField:tcp AddressField:ip.ip.ip.1:26257} Attrs: Locality:region=US,datacenter=CHI ServerVersion:2.0} Capacity:disk (capacity=782 GiB, available=742 GiB, used=1.0 MiB, logicalBytes=0 B), ranges=3, leases=3, writes=0.21, bytesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00}, writesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.21 p90=0.21}}
2018-07-09 08:06:04.154732 I180709 08:06:04.154725 165 gossip/gossip.go:1249  [n1] distantHops: 2 from 3
2018-07-09 08:06:04.154781 I180709 08:06:04.154773 263 gossip/gossip.go:758  [n1] updateStoreMap called on "store:15" with desc {StoreID:15 Attrs:ssd Node:{NodeID:3 Address:{NetworkField:tcp AddressField:ip.ip.ip.1:26257} Attrs: Locality:region=US,datacenter=CHI ServerVersion:2.0} Capacity:disk (capacity=880 GiB, available=835 GiB, used=205 KiB, logicalBytes=9.5 KiB), ranges=3, leases=3, writes=0.00, bytesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=9711.00 p90=9711.00}, writesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00}}
2018-07-09 08:06:04.154847 I180709 08:06:04.154835 264 gossip/gossip.go:758  [n1] updateStoreMap called on "store:16" with desc {StoreID:16 Attrs:ssd Node:{NodeID:3 Address:{NetworkField:tcp AddressField:ip.ip.ip.1:26257} Attrs: Locality:region=US,datacenter=CHI ServerVersion:2.0} Capacity:disk (capacity=880 GiB, available=835 GiB, used=32 MiB, logicalBytes=204 KiB), ranges=3, leases=3, writes=1.50, bytesPerReplica={p10=0.00 p25=0.00 p50=30349.00 p75=178743.00 p90=178743.00}, writesPerReplica={p10=0.00 p25=0.00 p50=0.32 p75=1.18 p90=1.18}}
2018-07-09 08:06:04.156268 I180709 08:06:04.156207 74 gossip/client.go:230  [n1] received [store:7 cluster-id distsql-draining:1 store:10 store:8 distsql-draining:3 distsql-version:2 liveness:1 node:3 distsql-draining:2 store:12 store:18 system-db first-range store:17 store:9 node:2 sentinel store:11 store:16 distsql-version:1 store:13 store:14 distsql-version:3 liveness:2 liveness:3 store:15] from node 2 (1 fresh)
2018-07-09 08:06:04.156279 I180709 08:06:04.156248 165 gossip/gossip.go:1249  [n1] distantHops: 2 from 3
2018-07-09 08:06:04.156350 I180709 08:06:04.156340 307 gossip/gossip.go:758  [n1] updateStoreMap called on "store:17" with desc {StoreID:17 Attrs:ssd Node:{NodeID:3 Address:{NetworkField:tcp AddressField:ip.ip.ip.1:26257} Attrs: Locality:region=US,datacenter=CHI ServerVersion:2.0} Capacity:disk (capacity=880 GiB, available=833 GiB, used=2.7 GiB, logicalBytes=13 GiB), ranges=352, leases=118, writes=256.11, bytesPerReplica={p10=33490423.00 p25=33554819.00 p50=36678184.00 p75=49674631.00 p90=61561589.00}, writesPerReplica={p10=0.00 p25=0.00 p50=1.01 p75=1.21 p90=1.21}}
2018-07-09 08:06:04.171889 I180709 08:06:04.171813 17 storage/engine/rocksdb.go:80  [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_job.cc:1153] [default] [JOB 3] Generated table #12150: 122909 keys, 13808773 bytes
2018-07-09 08:06:04.171956 I180709 08:06:04.171945 17 storage/engine/rocksdb.go:80  EVENT_LOG_v1 {"time_micros": 1531123564171897, "cf_name": "default", "job": 3, "event": "table_file_creation", "file_number": 12150, "file_size": 13808773, "table_properties": {"data_size": 13624496, "index_size": 44920, "filter_size": 153669, "raw_key_size": 3731492, "raw_average_key_size": 30, "raw_value_size": 30557798, "raw_average_value_size": 248, "num_data_blocks": 1039, "num_entries": 122909, "filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "8", "kMergeOperands": "18"}}
2018-07-09 08:06:04.172434 I180709 08:06:04.172426 17 storage/engine/rocksdb.go:80  [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_job.cc:1227] [default] [JOB 3] Compacted 2@0 + 1@5 files to L5 => 13808773 bytes
2018-07-09 08:06:04.172805 I180709 08:06:04.172795 17 storage/engine/rocksdb.go:80  (Original Log Time 2018/07/09-03:06:04.172718) [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_job.cc:632] [default] compacted to: base level 5 max bytes base 67108864 files[0 0 0 0 0 4 4] max score 0.96, MB/sec: 21.5 rd, 21.5 wr, level 5, files in(2, 1) out(1) MB in(0.0, 13.2) out(13.2), read-write-amplify(10848.4) write-amplify(5423.7) OK, records in: 122911, records dropped: 2
2018-07-09 08:06:04.172822 I180709 08:06:04.172816 17 storage/engine/rocksdb.go:80  (Original Log Time 2018/07/09-03:06:04.172743) EVENT_LOG_v1 {"time_micros": 1531123564172730, "job": 3, "event": "compaction_finished", "compaction_time_micros": 643093, "output_level": 5, "num_output_files": 1, "total_output_size": 13808773, "num_input_records": 122911, "num_output_records": 122909, "num_subcompactions": 1, "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 0, 0, 0, 0, 4, 4]}
2018-07-09 08:06:04.172957 I180709 08:06:04.172948 17 storage/engine/rocksdb.go:80  EVENT_LOG_v1 {"time_micros": 1531123564172928, "job": 3, "event": "table_file_deletion", "file_number": 12145}
2018-07-09 08:06:04.173015 I180709 08:06:04.173008 17 storage/engine/rocksdb.go:80  EVENT_LOG_v1 {"time_micros": 1531123564172994, "job": 3, "event": "table_file_deletion", "file_number": 12142}
2018-07-09 08:06:04.174717 I180709 08:06:04.174682 17 storage/engine/rocksdb.go:80  EVENT_LOG_v1 {"time_micros": 1531123564174668, "job": 3, "event": "table_file_deletion", "file_number": 12140}
2018-07-09 08:06:04.556418 I180709 08:06:04.556375 74 gossip/client.go:230  [n1] received [node:2 store:13 store:14 cluster-id distsql-draining:3 first-range distsql-draining:2 store:12 store:16 store:7 store:9 store:11 distsql-draining:1 distsql-version:2 liveness:3 store:17 system-db liveness:1 node:3 sentinel store:15 store:8 distsql-version:1 distsql-version:3 liveness:2 store:18 store:10] from node 2 (1 fresh)
2018-07-09 08:06:04.556445 I180709 08:06:04.556417 165 gossip/gossip.go:1249  [n1] distantHops: 2 from 3
2018-07-09 08:06:04.556553 I180709 08:06:04.556545 221 gossip/gossip.go:758  [n1] updateStoreMap called on "store:18" with desc {StoreID:18 Attrs:ssd Node:{NodeID:3 Address:{NetworkField:tcp AddressField:ip.ip.ip.1:26257} Attrs: Locality:region=US,datacenter=CHI ServerVersion:2.0} Capacity:disk (capacity=880 GiB, available=376 GiB, used=459 GiB, logicalBytes=1.1 TiB), ranges=37251, leases=13826, writes=1877.85, bytesPerReplica={p10=33554323.00 p25=33554364.00 p50=33554431.00 p75=33554499.00 p90=33554540.00}, writesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00}}
2018-07-09 08:06:05.532359 I180709 08:06:05.532318 74 gossip/client.go:230  [n1] received [sentinel first-range liveness:2 distsql-draining:2 store:10 store:12 store:18 node:2 store:17 system-db liveness:3 store:14 cluster-id distsql-draining:1 distsql-version:1 store:16 store:8 store:15 node:3 store:11 store:13 store:7 store:9 distsql-draining:3 distsql-version:2 distsql-version:3 liveness:1] from node 2 (1 fresh)
2018-07-09 08:06:05.532385 I180709 08:06:05.532360 165 gossip/gossip.go:1249  [n1] distantHops: 2 from 3
2018-07-09 08:06:06.689499 I180709 08:06:06.689441 152 rpc/clock_offset.go:176  [n1] update offset: ip.ip.ip.2:26257 off=708.14µs, err=355.544µs, at=2018-07-09 08:06:06.68938153 +0000 UTC
2018-07-09 08:06:06.689512 I180709 08:06:06.689497 152 rpc/clock_offset.go:231  [n1] 1 of 1 nodes are within the maximum clock offset of 500ms
2018-07-09 08:06:06.798237 I180709 08:06:06.798201 311 rpc/clock_offset.go:176  update offset: ip.ip.ip.1:26257 off=-57.963458ms, err=58.949099ms, at=2018-07-09 08:06:03.798194399 +0000 UTC
2018-07-09 08:06:08.630797 I180709 08:06:08.630759 74 gossip/client.go:230  [n1] received [store:17 store:7 liveness:2 store:12 distsql-draining:3 distsql-version:2 store:11 store:13 store:14 node:2 sentinel distsql-draining:1 distsql-draining:2 node:3 store:16 cluster-id distsql-version:1 liveness:3 store:8 store:9 first-range liveness:1 store:18 system-db distsql-version:3 store:10 store:15] from node 2 (1 fresh)
2018-07-09 08:06:08.630824 I180709 08:06:08.630802 165 gossip/gossip.go:1249  [n1] distantHops: 2 from 3
2018-07-09 08:06:09.691064 I180709 08:06:09.691046 152 rpc/clock_offset.go:176  [n1] update offset: ip.ip.ip.2:26257 off=708.14µs, err=355.544µs, at=2018-07-09 08:06:06.68938153 +0000 UTC
2018-07-09 08:06:09.691083 I180709 08:06:09.691078 152 rpc/clock_offset.go:231  [n1] 2 of 2 nodes are within the maximum clock offset of 500ms
2018-07-09 08:06:09.799405 I180709 08:06:09.799369 301 rpc/clock_offset.go:176  update offset: ip.ip.ip.1:26257 off=684.97µs, err=436.089µs, at=2018-07-09 08:06:06.799329185 +0000 UTC
2018-07-09 08:06:10.032770 I180709 08:06:10.032732 74 gossip/client.go:230  [n1] received [liveness:2 store:15 sentinel distsql-draining:1 distsql-draining:3 first-range node:3 cluster-id distsql-version:3 store:18 store:9 node:2 store:17 store:12 store:16 store:8 system-db store:10 store:7 liveness:1 liveness:3 store:11 store:14 distsql-draining:2 distsql-version:1 distsql-version:2 store:13] from node 2 (1 fresh)
2018-07-09 08:06:10.032808 I180709 08:06:10.032801 165 gossip/gossip.go:1249  [n1] distantHops: 2 from 3
2018-07-09 08:06:12.477927 I180709 08:06:12.477915 26 storage/raft_transport.go:251  [n1] stats:
2018-07-09 08:06:12.477944          qlen   qmax   qdropped client-sent client-recv server-sent server-recv
2018-07-09 08:06:12.691833 I180709 08:06:12.691818 152 rpc/clock_offset.go:176  [n1] update offset: ip.ip.ip.2:26257 off=632.59µs, err=264.42µs, at=2018-07-09 08:06:12.691792048 +0000 UTC
2018-07-09 08:06:12.691893 I180709 08:06:12.691889 152 rpc/clock_offset.go:231  [n1] 2 of 2 nodes are within the maximum clock offset of 500ms
2018-07-09 08:06:12.800056 I180709 08:06:12.800044 303 rpc/clock_offset.go:176  update offset: ip.ip.ip.1:26257 off=423.925µs, err=362.531µs, at=2018-07-09 08:06:09.800162848 +0000 UTC
2018-07-09 08:06:13.130926 I180709 08:06:13.130895 74 gossip/client.go:230  [n1] received [distsql-version:1 node:2 store:12 cluster-id liveness:2 node:3 store:11 store:14 store:16 distsql-version:3 store:9 distsql-version:2 store:10 store:8 distsql-draining:2 store:7 system-db first-range sentinel store:13 store:17 distsql-draining:3 liveness:1 liveness:3 store:15 store:18 distsql-draining:1] from node 2 (1 fresh)
2018-07-09 08:06:13.130974 I180709 08:06:13.130953 165 gossip/gossip.go:1249  [n1] distantHops: 2 from 3
2018-07-09 08:06:14.535625 I180709 08:06:14.535610 74 gossip/client.go:230  [n1] received [cluster-id distsql-version:2 store:11 store:9 store:12 store:16 distsql-draining:2 liveness:1 liveness:2 distsql-draining:3 liveness:3 store:13 store:8 node:2 store:18 system-db node:3 sentinel store:14 store:17 store:10 distsql-draining:1 first-range store:15 distsql-version:1 distsql-version:3 store:7] from node 2 (1 fresh)
2018-07-09 08:06:14.535704 I180709 08:06:14.535678 165 gossip/gossip.go:1249  [n1] distantHops: 2 from 3
2018-07-09 08:06:15.692729 I180709 08:06:15.692706 152 rpc/clock_offset.go:176  [n1] update offset: ip.ip.ip.2:26257 off=632.59µs, err=264.42µs, at=2018-07-09 08:06:12.691792048 +0000 UTC
2018-07-09 08:06:15.692751 I180709 08:06:15.692745 152 rpc/clock_offset.go:231  [n1] 2 of 2 nodes are within the maximum clock offset of 500ms
2018-07-09 08:06:15.800782 I180709 08:06:15.800764 305 rpc/clock_offset.go:176  update offset: ip.ip.ip.1:26257 off=467.822µs, err=238.277µs, at=2018-07-09 08:06:12.800759324 +0000 UTC
2018-07-09 08:06:17.631029 I180709 08:06:17.631018 74 gossip/client.go:230  [n1] received [distsql-draining:2 store:13 distsql-version:1 store:14 liveness:2 node:3 sentinel store:18 distsql-draining:3 liveness:1 store:15 store:16 store:17 store:8 liveness:3 store:11 distsql-version:3 first-range node:2 store:12 cluster-id distsql-draining:1 distsql-version:2 store:9 store:10 store:7 system-db] from node 2 (1 fresh)
2018-07-09 08:06:17.631056 I180709 08:06:17.631045 165 gossip/gossip.go:1249  [n1] distantHops: 2 from 3
2018-07-09 08:06:18.693660 I180709 08:06:18.693605 152 rpc/clock_offset.go:176  [n1] update offset: ip.ip.ip.2:26257 off=632.59µs, err=264.42µs, at=2018-07-09 08:06:12.691792048 +0000 UTC
2018-07-09 08:06:18.693680 I180709 08:06:18.693650 152 rpc/clock_offset.go:231  [n1] 2 of 2 nodes are within the maximum clock offset of 500ms
2018-07-09 08:06:18.801656 I180709 08:06:18.801620 327 rpc/clock_offset.go:176  update offset: ip.ip.ip.1:26257 off=467.822µs, err=238.277µs, at=2018-07-09 08:06:12.800759324 +0000 UTC
2018-07-09 08:06:19.032630 I180709 08:06:19.032589 74 gossip/client.go:230  [n1] received [store:16 store:18 system-db liveness:2 sentinel store:13 distsql-draining:1 distsql-draining:2 distsql-version:1 liveness:1 node:2 store:17 store:7 store:9 store:8 node:3 liveness:3 store:10 store:12 distsql-version:2 store:11 distsql-draining:3 cluster-id first-range store:14 store:15 distsql-version:3] from node 2 (1 fresh)
2018-07-09 08:06:19.032671 I180709 08:06:19.032664 165 gossip/gossip.go:1249  [n1] distantHops: 2 from 3
2018-07-09 08:06:21.694589 I180709 08:06:21.694575 152 rpc/clock_offset.go:176  [n1] update offset: ip.ip.ip.2:26257 off=632.59µs, err=264.42µs, at=2018-07-09 08:06:12.691792048 +0000 UTC
2018-07-09 08:06:21.694607 I180709 08:06:21.694601 152 rpc/clock_offset.go:231  [n1] 2 of 2 nodes are within the maximum clock offset of 500ms
2018-07-09 08:06:21.802545 I180709 08:06:21.802508 306 rpc/clock_offset.go:176  update offset: ip.ip.ip.1:26257 off=467.822µs, err=238.277µs, at=2018-07-09 08:06:12.800759324 +0000 UTC
2018-07-09 08:06:22.130780 I180709 08:06:22.130744 74 gossip/client.go:230  [n1] received [node:2 store:10 store:15 liveness:2 store:9 liveness:1 cluster-id distsql-draining:1 distsql-version:2 node:3 system-db sentinel store:8 store:11 store:17 store:18 distsql-version:3 first-range distsql-version:1 liveness:3 store:12 store:13 distsql-draining:2 store:14 store:16 distsql-draining:3 store:7] from node 2 (1 fresh)
2018-07-09 08:06:22.130817 I180709 08:06:22.130810 165 gossip/gossip.go:1249  [n1] distantHops: 2 from 3
2018-07-09 08:06:22.477960 I180709 08:06:22.477938 26 storage/raft_transport.go:251  [n1] stats:
2018-07-09 08:06:22.477991          qlen   qmax   qdropped client-sent client-recv server-sent server-recv
2018-07-09 08:06:23.532950 I180709 08:06:23.532935 74 gossip/client.go:230  [n1] received [distsql-version:2 liveness:3 sentinel store:11 distsql-draining:3 distsql-draining:2 node:3 store:12 store:14 store:15 store:16 store:18 first-range liveness:2 store:9 store:10 distsql-version:3 store:7 cluster-id store:17 store:8 system-db liveness:1 store:13 node:2 distsql-draining:1 distsql-version:1] from node 2 (1 fresh)
2018-07-09 08:06:23.532980 I180709 08:06:23.532971 165 gossip/gossip.go:1249  [n1] distantHops: 2 from 3
2018-07-09 08:06:24.656817 I180709 08:06:24.656759 74 gossip/client.go:230  [n1] received [liveness:2 sentinel store:15 store:10 store:13 store:17 store:8 system-db cluster-id node:2 store:14 distsql-draining:1 store:11 store:12 distsql-draining:2 liveness:3 store:16 distsql-draining:3 distsql-version:1 first-range node:3 store:18 distsql-version:3 store:7 store:9 distsql-version:2 liveness:1] from node 2 (1 fresh)
2018-07-09 08:06:24.656828 I180709 08:06:24.656825 165 gossip/gossip.go:1249  [n1] distantHops: 2 from 3
2018-07-09 08:06:24.695303 I180709 08:06:24.695264 152 rpc/clock_offset.go:176  [n1] update offset: ip.ip.ip.2:26257 off=644.696µs, err=236.963µs, at=2018-07-09 08:06:24.695244133 +0000 UTC
2018-07-09 08:06:24.695313 I180709 08:06:24.695298 152 rpc/clock_offset.go:231  [n1] 2 of 2 nodes are within the maximum clock offset of 500ms
2018-07-09 08:06:24.803458 I180709 08:06:24.803421 332 rpc/clock_offset.go:176  update offset: ip.ip.ip.1:26257 off=467.822µs, err=238.277µs, at=2018-07-09 08:06:12.800759324 +0000 UTC
2018-07-09 08:06:26.632960 I180709 08:06:26.632948 74 gossip/client.go:230  [n1] received [store:16 cluster-id node:3 store:10 distsql-draining:1 first-range liveness:3 store:12 distsql-version:3 liveness:2 store:7 store:8 distsql-draining:2 liveness:1 node:2 sentinel store:11 store:17 store:9 distsql-version:2 store:14 distsql-draining:3 distsql-version:1 store:13 store:15 store:18 system-db] from node 2 (1 fresh)
2018-07-09 08:06:26.632983 I180709 08:06:26.632977 165 gossip/gossip.go:1249  [n1] distantHops: 2 from 3
2018-07-09 08:06:27.696016 I180709 08:06:27.696004 152 rpc/clock_offset.go:176  [n1] update offset: ip.ip.ip.2:26257 off=644.696µs, err=236.963µs, at=2018-07-09 08:06:24.695244133 +0000 UTC
2018-07-09 08:06:27.696028 I180709 08:06:27.696023 152 rpc/clock_offset.go:231  [n1] 2 of 2 nodes are within the maximum clock offset of 500ms
2018-07-09 08:06:27.807795 I180709 08:06:27.807749 333 rpc/clock_offset.go:176  update offset: ip.ip.ip.1:26257 off=467.822µs, err=238.277µs, at=2018-07-09 08:06:12.800759324 +0000 UTC
2018-07-09 08:06:28.032782 I180709 08:06:28.032744 74 gossip/client.go:230  [n1] received [node:3 store:10 liveness:1 store:16 store:14 system-db store:15 cluster-id store:11 store:13 distsql-version:1 distsql-version:2 distsql-version:3 first-range store:18 store:8 distsql-draining:2 distsql-draining:3 liveness:2 store:12 store:7 liveness:3 store:17 store:9 distsql-draining:1 node:2 sentinel] from node 2 (1 fresh)
2018-07-09 08:06:28.032834 I180709 08:06:28.032784 165 gossip/gossip.go:1249  [n1] distantHops: 2 from 3
2018-07-09 08:06:30.697069 I180709 08:06:30.697056 152 rpc/clock_offset.go:176  [n1] update offset: ip.ip.ip.2:26257 off=644.696µs, err=236.963µs, at=2018-07-09 08:06:24.695244133 +0000 UTC
2018-07-09 08:06:30.697082 I180709 08:06:30.697076 152 rpc/clock_offset.go:231  [n1] 2 of 2 nodes are within the maximum clock offset of 500ms
2018-07-09 08:06:30.808500 I180709 08:06:30.808463 349 rpc/clock_offset.go:176  update offset: ip.ip.ip.1:26257 off=467.822µs, err=238.277µs, at=2018-07-09 08:06:12.800759324 +0000 UTC
2018-07-09 08:06:31.130896 I180709 08:06:31.130884 74 gossip/client.go:230  [n1] received [store:15 node:3 store:13 distsql-draining:2 store:14 system-db store:18 store:7 store:10 store:17 distsql-draining:1 distsql-version:1 store:9 cluster-id distsql-version:2 store:12 liveness:1 liveness:2 liveness:3 store:16 store:8 first-range node:2 sentinel distsql-draining:3 distsql-version:3 store:11] from node 2 (1 fresh)
2018-07-09 08:06:31.130919 I180709 08:06:31.130913 165 gossip/gossip.go:1249  [n1] distantHops: 2 from 3
2018-07-09 08:06:32.477905 I180709 08:06:32.477870 26 storage/raft_transport.go:251  [n1] stats:
2018-07-09 08:06:32.477925          qlen   qmax   qdropped client-sent client-recv server-sent server-recv
2018-07-09 08:06:32.532888 I180709 08:06:32.532793 74 gossip/client.go:230  [n1] received [distsql-version:2 distsql-draining:2 distsql-version:3 liveness:2 sentinel store:10 store:18 store:11 store:8 node:3 store:15 store:17 distsql-version:1 node:2 store:12 store:16 system-db liveness:3 distsql-draining:1 distsql-draining:3 first-range liveness:1 store:13 store:7 store:9 cluster-id store:14] from node 2 (1 fresh)
2018-07-09 08:06:32.532900 I180709 08:06:32.532890 165 gossip/gossip.go:1249  [n1] distantHops: 2 from 3
2018-07-09 08:06:33.675675 W180709 08:06:33.675636 352 server/server.go:1267  The server appears to be unable to contact the other nodes in the cluster. Please try
2018-07-09 08:06:33.675684 - starting the other nodes, if you haven't already
2018-07-09 08:06:33.675687 - double-checking that the '--join' and '--host' flags are set up correctly
2018-07-09 08:06:33.675688 - running the 'cockroach init' command if you are trying to initialize a new cluster
2018-07-09 08:06:33.675690 If problems persist, please see https://www.cockroachlabs.com/docs/v2.0/cluster-setup-troubleshooting.html.
2018-07-09 08:06:33.698060 I180709 08:06:33.698047 152 rpc/clock_offset.go:176  [n1] update offset: ip.ip.ip.2:26257 off=644.696µs, err=236.963µs, at=2018-07-09 08:06:24.695244133 +0000 UTC
2018-07-09 08:06:33.698075 I180709 08:06:33.698069 152 rpc/clock_offset.go:231  [n1] 2 of 2 nodes are within the maximum clock offset of 500ms
2018-07-09 08:06:33.811070 I180709 08:06:33.811049 360 rpc/clock_offset.go:176  update offset: ip.ip.ip.1:26257 off=467.822µs, err=238.277µs, at=2018-07-09 08:06:12.800759324 +0000 UTC
2018-07-09 08:06:35.630988 I180709 08:06:35.630969 74 gossip/client.go:230  [n1] received [first-range liveness:3 node:3 store:16 distsql-version:2 liveness:1 store:13 system-db sentinel store:10 store:11 store:12 store:15 store:7 distsql-draining:3 store:17 cluster-id store:18 distsql-draining:2 distsql-version:1 distsql-version:3 liveness:2 node:2 distsql-draining:1 store:14 store:8 store:9] from node 2 (1 fresh)
2018-07-09 08:06:35.631012 I180709 08:06:35.631005 165 gossip/gossip.go:1249  [n1] distantHops: 2 from 3
2018-07-09 08:06:36.699004 I180709 08:06:36.698990 152 rpc/clock_offset.go:176  [n1] update offset: ip.ip.ip.2:26257 off=644.696µs, err=236.963µs, at=2018-07-09 08:06:24.695244133 +0000 UTC
2018-07-09 08:06:36.699016 I180709 08:06:36.699011 152 rpc/clock_offset.go:231  [n1] 2 of 2 nodes are within the maximum clock offset of 500ms
2018-07-09 08:06:36.811936 I180709 08:06:36.811925 319 rpc/clock_offset.go:176  update offset: ip.ip.ip.1:26257 off=467.822µs, err=238.277µs, at=2018-07-09 08:06:12.800759324 +0000 UTC
2018-07-09 08:06:37.033009 I180709 08:06:37.032997 74 gossip/client.go:230  [n1] received [distsql-draining:3 distsql-version:1 first-range store:10 store:7 liveness:1 store:14 store:18 distsql-draining:2 sentinel node:2 store:12 store:17 store:8 distsql-version:3 liveness:2 liveness:3 node:3 store:9 system-db cluster-id distsql-version:2 store:11 distsql-draining:1 store:13 store:15 store:16] from node 2 (1 fresh)

Hey @valski - is this the full log after the problematic restart? Nothing in there jumps out to me. If this is the tail of the log and it was up long past 08:06:33, this just confirms that the signal we’re looking for isn’t present. The earlier chatter doesn’t look problematic at first pass, but I’ll leave this open for others to comment in case I’ve missed something.

Glad to hear wiping the node removed the immediate blocker to a clean restart, though I’m concerned we haven’t found the underlying cause of the issue in the first place.

@tim-o No, this is not the full log; I cut it. After “If problems persist”, the log lines looked identical to me, and I had the impression that the process was printing identical messages at regular intervals.

Alas, I can’t give you the full one any more. It turned out that the logs were rotated this morning and I did not make a copy. Sorry, my bad :frowning:

BTW, this is actually the second time this has happened on the same cluster. My colleague was chasing the very same problem after a machine reboot several weeks ago and fixed it in the same manner. If it happens again I will take notes and report back in this issue. Sorry for the incomplete info.

Once again thank you for your help.

My pleasure! Sorry we weren’t able to get to the bottom of it. If it happens again, collecting logs from all nodes and running --vmodule=rocksdb=3 on the node having issues might give us a better clue as to what’s gone wrong.
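
If it does recur, one convenient way to grab logs and diagnostics from every node in one go is cockroach debug zip, run against any reachable node (a sketch, reusing the certs dir and hosts from your start command):

/usr/local/bin/cockroach debug zip ./crdb-debug.zip \
  --certs-dir=/etc/cockroach/ssl \
  --host=ip.ip.ip.2 \
  --port=26257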

@tim-o

Hi there,

Node1 was failing to join after a restart; instead, I was only getting:

2018-07-31 07:59:25.687304 I180731 07:59:25.687260 4276 storage/store.go:3514 [n4,s21,r3/10:/System/NodeLiveness{-Max}] added to replica GC queue (peer suggestion)

When I then restarted Node2 and Node3, the cluster went dead and has failed to synchronize and form a membership ever since. Restarts did not help, and since then all the data has been permanently inaccessible.

Logs below:

node1 https://pastebin.com/raw/9TgfiePQ
node2 https://pastebin.com/raw/KQqeQSs3
node3 https://pastebin.com/raw/ETJHxKsH

Not sure if it is exactly the same issue

The log files suggest that you have decommissioned (and then replaced) 2 nodes out of 3, but without waiting for the cluster to stabilize/recover after each decommission. It looks like there are now permanently unavailable ranges.

Is that possible? What was the detail of your decommissioning steps?

(In general we advise to wait until the “underreplicated” range count drops to zero in the web UI before decommissioning more nodes. In a future version of CockroachDB we’ll automate that further.)
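
If you prefer the command line over the web UI, the same signal should also be available from each node’s Prometheus endpoint. A sketch, assuming the exported metric name is ranges_underreplicated and using the HTTP address from the start command earlier in this thread:

curl -sk https://127.0.0.1:8080/_status/vars | grep ranges_underreplicated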

Thank you for getting back to me so quick.

I actually did not decommission or replace any of the nodes.

Rather, the events looked like this:

  • there were under-replicated ranges (from a previous outage, once again caused by a full disk; see below)

  • one of the disks on node 2 was almost 100% full again, and cockroach was failing to rebalance data, probably due to storage being spread over multiple disks. I restarted it several times specifying a storage limit in % or in GB (see the example after this list), but cockroach did not migrate any data. (The logs do not have info from this run.)

  • I restarted node 1 with 2.0.4, hoping that “Prevented a situation in which ranges repeatedly fail to perform a split. #26944”, quoted in the change log, would make the rebalancing work.

  • When node1 was restarted, the database was inaccessible from it but was running flawlessly from node2 and node3. At that time node1 was spilling multiple messages like the one below, but the db remained inaccessible. I waited at least 10 minutes and probably hit https://github.com/cockroachdb/cockroach/issues/26952

    2018-07-30 19:47:50.855444 I180730 19:47:50.855431 100 storage/store.go:1303 [n4,s19] [n4,s19,r14903/?:/Table/51/1/359032{291???-490???}]: added to replica GC queue

  • I decided that at that point I could replace the other binaries with 2.0.4 as well, so I SIGTERMed all cockroach processes on all nodes one after another.

  • Once the processes had exited, I replaced the 2.0.2 binaries with 2.0.4 one after another

  • Started instances one after another
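
For reference, the storage-limit attempts mentioned in the second bullet above looked roughly like this (a sketch; the size field is the part I varied, everything else is as in my start command):

  --store=path=/mnt/sdb4,attrs=ssd,size=90%
  --store=path=/mnt/sdb4,attrs=ssd,size=500GB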

It seems that in this state they forgot who the leader was and what the consistent state was, which led to the loss of all the data.

I do not know if this is of any help.

Hey @valski,

I think there are a couple concerns we want to address:

  • Unavailability of data
  • Balancing of replicas & data across stores

It looks like @knz was addressing the balancing of data over here, so I’ll leave that question alone. Regarding unavailability: the logs indicate that the node count incremented (n2, n3, and n4 are mentioned in the logs). We’d only count up to n4 if one of the first three nodes was decommissioned, or if a node joined the cluster with an empty store. Based on the timeline you described, it seems like you were working with only three nodes, and took one down. Presumably up until that point you were only seeing n1 - n3 in the logs. Afterward, we start seeing n4 and data unavailability. Is it possible that n1 was restarted with the store flags in a different order, with different stores specified, or that the store was wiped in the restart?

It’s a bit difficult to piece together a full timeline from the logs, but the logs indicate there are a number of issues with the cluster in its current state. n2 and n1/n4 are having issues keeping up with their workloads (context deadline exceeded errors), and can’t find themselves in the cluster (can't determine lease status due to node liveness error: node not in the liveness table).

We did notice that there are a lot more ranges than are typical in a three node cluster. As a next step, would you be open to a quick conversation to discuss your use case so we can come back with suggestions?

About balancing - check.

n1 was decommissioned some time ago when I initially opened this case. I rejoined it as n4 around Jul 9. So the cluster is still 3 nodes, and the pastes should read like this (sorry for the confusion):

node1 -> n4
node2 -> n2
node3 -> n3

The other sequence of events remains the same.

They failed to form a membership so they were unable to handle any work at all :slight_smile:

Despite my efforts I was unable to find a way to edit this table, or any tooling for recovery in such situations. I understand I am probably not supposed to do that at all, but the under-replicated ranges and the sequence of restarts led to data loss. It may well be a mistake in the way I handled all those events, but I decided to share this with you in case it is a more serious issue with the service itself.

Sure. Please advise on a channel that we can use. It is mainly a write-oriented workload for storing normalized log data. You know, GDPR is all around :smiley: So far we have built a solution that works for us, but I don’t mind discussing it in case you are interested or my insight can help make CRDB more robust.