Node crashed in the early stage of booting up

bug

(Andrew D) #1

I have 3 nodes cluster running CRDB 2.0.6, but after one of the nodes reboot, it was struggling to come back. The whole log was as below. What catches my eye is: “0 of 1 are within the offset”. It seems it recognize itself as the only one node in the cluster.

I190130 22:50:54.461610 15 server/server.go:784  [n?] Sleeping till wall time 1548888654461587609 to catches up to 1548888654926557650 to ensure monotonicity. Delta: 464.970041ms
F190130 22:50:54.470201 141 server/server.go:219  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
goroutine 141 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc420546400, 0xc420546420, 0x3328800, 0x10)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:959 +0xcf
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x3bd7aa0, 0xc400000004, 0x33288ef, 0x10, 0xdb, 0xc4200ea510, 0x88)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:835 +0x753
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x2787660, 0xc4207ec030, 0x4, 0x2, 0x0, 0x0, 0xc420cb1c58, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:154 +0x2e5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x2787660, 0xc4207ec030, 0x1, 0x4, 0x0, 0x0, 0xc420cb1c58, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:55 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatal(0x2787660, 0xc4207ec030, 0xc420cb1c58, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:158 +0x6c
github.com/cockroachdb/cockroach/pkg/server.NewServer.func1()
        /go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:219 +0xb0
github.com/cockroachdb/cockroach/pkg/rpc.(*Context).runHeartbeat(0xc42035f680, 0xc4203f1630, 0xc420284e60, 0x16, 0xc420547860, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:698 +0x667
github.com/cockroachdb/cockroach/pkg/rpc.(*Context).GRPCDial.func1.1.1(0x2787660, 0xc420bdd1a0)
        /go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:564 +0x74
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc42027bd20, 0xc4205cc2d0, 0xc420bdd170)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:196 +0xe9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:189 +0xad

Here is another full log for this issue. I originally thought the issue might be related to delta for ensuring monotonicity exceeding the time offset threshold. But the in log below, the delta is 245.793767ms which is supposed to within the threshold.

W190203 09:40:07.188022 26 gossip/gossip.go:1293  [n?] no incoming or outgoing connections
I190203 09:40:07.188109 26 server/server.go:784  [n?] Sleeping till wall time 1549186807188082639 to catches up to 1549186807433876406 to ensure monotonicity. **Delta: 245.793767ms**
F190203 09:40:07.197481 81 server/server.go:219  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
goroutine 81 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc4200a4b00, 0xc4200a4b40, 0x3328800, 0x10)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:959 +0xcf
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x3bd7aa0, 0xc400000004, 0x33288ef, 0x10, 0xdb, 0xc420d06090, 0x88)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:835 +0x753
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x2787660, 0xc42093c7b0, 0x4, 0x2, 0x0, 0x0, 0xc420d19c58, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:154 +0x2e5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x2787660, 0xc42093c7b0, 0x1, 0x4, 0x0, 0x0, 0xc420d19c58, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:55 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatal(0x2787660, 0xc42093c7b0, 0xc420d19c58, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:158 +0x6c
github.com/cockroachdb/cockroach/pkg/server.NewServer.func1()
        /go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:219 +0xb0
github.com/cockroachdb/cockroach/pkg/rpc.(*Context).runHeartbeat(0xc4200e5680, 0xc4208f4960, 0xc420330460, 0x16, 0xc4204c51a0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:698 +0x667
github.com/cockroachdb/cockroach/pkg/rpc.(*Context).GRPCDial.func1.1.1(0x2787660, 0xc420b7aae0)
        /go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:564 +0x74
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc4202fdcf0, 0xc42082e090, 0xc420b7aab0)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:196 +0xe9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:189 +0xad

Why the node went to crash so early in this case? I have checked my ntp setting, they are all good in sync with the time server.

Thank you!


(Tim O'Brien) #2

Hi @AndrewJd,

I see you found the solution on your github issue. Sounds like it was due to ntp issues, but let me know if you need further help.


(Andrew D) #3

Hi Tim, it was resolved by using a different ntp server.

Thanks!