Upgrade to 20.2 gets repetitive batch job error: context deadline exceeded

I upgraded my 6 node cluster from 20.2.9 in a rolling upgrade to 21.1.0

After the 6th node was upgraded, the Batch Job display in the admin website has this error many times:
running migration for 20.2-22: running iterate callback: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Any ideas?

Thanks…

Log says:
I210518 00:49:33.579982 212 2@server/status/runtime.go:553 ⋮ [n6] 1468 runtime stats: 619 MiB RSS, 342 goroutines (stacks: 4.7 MiB), 80 MiB/228 MiB Go alloc/total (heap fragmentation: 16 MiB, heap reserved: 109 MiB, heap released: 47 MiB), 296 MiB/372 MiB CGO alloc/total (9.8 CGO/sec), 13.7/5.8 %(u/s)time, 0.0 %gc (0x), 17 MiB/1.6 MiB (r/w)net
I210518 00:49:43.493201 145 2@gossip/gossip.go:567 ⋮ [n6] 1469 gossip status (ok, 6 node‹s›)
I210518 00:49:43.493201 145 2@gossip/gossip.go:567 ⋮ [n6] 1469 +gossip client (1/3 cur/max conns)
I210518 00:49:43.493201 145 2@gossip/gossip.go:567 ⋮ [n6] 1469 + 2: ‹172.22.130.110:26257› (33m0s: infos 5878/9305 sent/received, bytes 1178765B/2043168B sent/received)
I210518 00:49:43.493201 145 2@gossip/gossip.go:567 ⋮ [n6] 1469 +gossip server (0/3 cur/max conns, infos 5878/9305 sent/received, bytes 1178765B/2043168B sent/received)
I210518 00:49:43.578193 212 2@server/status/runtime.go:553 ⋮ [n6] 1470 runtime stats: 620 MiB RSS, 341 goroutines (stacks: 5.1 MiB), 87 MiB/181 MiB Go alloc/total (heap fragmentation: 11 MiB, heap reserved: 60 MiB, heap released: 93 MiB), 296 MiB/372 MiB CGO alloc/total (4.7 CGO/sec), 15.3/9.0 %(u/s)time, 0.0 %gc (0x), 17 MiB/1.5 MiB (r/w)net
I210518 00:49:53.579751 212 2@server/status/runtime.go:553 ⋮ [n6] 1471 runtime stats: 621 MiB RSS, 338 goroutines (stacks: 6.3 MiB), 105 MiB/182 MiB Go alloc/total (heap fragmentation: 7.3 MiB, heap reserved: 46 MiB, heap released: 92
MiB), 296 MiB/372 MiB CGO alloc/total (1.1 CGO/sec), 15.4/7.2 %(u/s)time, 0.0 %gc (0x), 17 MiB/1.7 MiB (r/w)net
I210518 00:50:03.579561 212 2@server/status/runtime.go:553 ⋮ [n6] 1472 runtime stats: 623 MiB RSS, 338 goroutines (stacks: 5.2 MiB), 122 MiB/190 MiB Go alloc/total (heap fragmentation: 5.4 MiB, heap reserved: 39 MiB, heap released: 84
MiB), 296 MiB/372 MiB CGO alloc/total (0.5 CGO/sec), 12.2/5.5 %(u/s)time, 0.0 %gc (0x), 17 MiB/1.5 MiB (r/w)net
I210518 00:50:13.579829 212 2@server/status/runtime.go:553 ⋮ [n6] 1473 runtime stats: 624 MiB RSS, 339 goroutines (stacks: 4.2 MiB), 55 MiB/189 MiB Go alloc/total (heap fragmentation: 16 MiB, heap reserved: 95 MiB, heap released: 85 MiB), 296 MiB/372 MiB CGO alloc/total (0.9 CGO/sec), 14.1/5.2 %(u/s)time, 0.0 %gc (0x), 17 MiB/1.5 MiB (r/w)net
I210518 00:50:23.580344 212 2@server/status/runtime.go:553 ⋮ [n6] 1474 runtime stats: 626 MiB RSS, 341 goroutines (stacks: 6.2 MiB), 88 MiB/167 MiB Go alloc/total (heap fragmentation: 13 MiB, heap reserved: 41 MiB, heap released: 107 MiB), 296 MiB/372 MiB CGO alloc/total (0.1 CGO/sec), 13.3/7.9 %(u/s)time, 0.0 %gc (0x), 17 MiB/1.8 MiB (r/w)net
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 MIGRATION job ‹659385678523269126›: stepping through
state reverting with error: running migration for 20.2-22: running iterate callback: ‹rpc error: code = DeadlineExceeded desc = context deadline exceeded›
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 +(1) attached stack trace
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + – stack trace:
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | github.com/cockroachdb/cockroach/pkg/migration/migrationjob.resumer.Resume
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | /go/src/github.com/cockroachdb/cockroach/pkg/migration/migrationjob/migration_job.go:95
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | […repeated from below…]
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 +Wraps: (2) running migration for 20.2-22
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 +Wraps: (3) attached stack trace
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + – stack trace:
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Iterate
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:562
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | github.com/cockroachdb/cockroach/pkg/migration/migrationcluster.(*Cluster).IterateRangeDescriptors.func1
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | /go/src/github.com/cockroachdb/cockroach/pkg/migration/migrationcluster/cluster.go:142
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn.func1
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | /go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:773
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:828
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | github.com/cockroachdb/cockroach/pkg/kv.(*DB).TxnI210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | /go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:772
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | github.com/cockroachdb/cockroach/pkg/migration/migrationcluster.(*Cluster).IterateRangeDescriptors
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | /go/src/github.com/cockroachdb/cockroach/pkg/migration/migrationcluster/cluster.go:136
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | github.com/cockroachdb/cockroach/pkg/migration/migrations.truncatedStateMigration
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | /go/src/github.com/cockroachdb/cockroach/pkg/migration/migrations/truncated_state.go:44
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | github.com/cockroachdb/cockroach/pkg/migration.(*KVMigration).Run
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | /go/src/github.com/cockroachdb/cockroach/pkg/migration/kv_migration.go:152
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | github.com/cockroachdb/cockroach/pkg/migration/migrationjob.resumer.Resume
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | /go/src/github.com/cockroachdb/cockroach/pkg/migration/migrationjob/migration_job.go:82
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine.func1
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | /go/src/github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1193
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | /go/src/github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1194
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).runJob
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | /go/src/github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:255
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).resumeJob.func1
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | /go/src/github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:214
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:351
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | runtime.goexit
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | /usr/local/go/src/runtime/asm_amd64.s:1374
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 +Wraps: (4) running iterate callback
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 +Wraps: (5)
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | (opaque error wrapper)
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | type name: github.com/cockroachdb/cockroach/pkg/util/contextutil/*contextutil.TimeoutError
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 +Wraps: (6) ‹rpc error: code = DeadlineExceeded desc = context deadline exceeded›
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + |
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | (opaque error leaf)
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 + | type name: google.golang.org/grpc/internal/status/*status.Error
I210518 00:50:25.591248 89610 jobs/registry.go:1175 ⋮ [n6] 1475 +Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errbase.opaqueWrapper (6) *errbase.opaqueLeaf
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 MIGRATION job ‹659385678523269126›: stepping through
state failed with error: running migration for 20.2-22: running iterate callback: ‹rpc error: code = DeadlineExceeded
desc = context deadline exceeded›
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 +(1) attached stack trace
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + – stack trace:
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | github.com/cockroachdb/cockroach/pkg/migration/migrationjob.resumer.Resume
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | /go/src/github.com/cockroachdb/cockroach/pkg/migration/migrationjob/migration_job.go:95
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | […repeated from below…]
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 +Wraps: (2) running migration for 20.2-22
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 +Wraps: (3) attached stack trace
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + – stack trace:
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Iterate
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:562
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | github.com/cockroachdb/cockroach/pkg/migration/migrationcluster.(*Cluster).IterateRangeDescriptors.func1
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | /go/src/github.com/cockroachdb/cockroach/pkg/migration/migrationcluster/cluster.go:142
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn.func1
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | /go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:773
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:828
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | github.com/cockroachdb/cockroach/pkg/kv.(*DB).TxnI210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | /go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:772
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | github.com/cockroachdb/cockroach/pkg/migration/migrationcluster.(*Cluster).IterateRangeDescriptors
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | /go/src/github.com/cockroachdb/cockroach/pkg/migration/migrationcluster/cluster.go:136
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | github.com/cockroachdb/cockroach/pkg/migration/migrations.truncatedStateMigration
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | /go/src/github.com/cockroachdb/cockroach/pkg/migration/migrations/truncated_state.go:44
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | github.com/cockroachdb/cockroach/pkg/migration.(*KVMigration).Run
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | /go/src/github.com/cockroachdb/cockroach/pkg/migration/kv_migration.go:152
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | github.com/cockroachdb/cockroach/pkg/migration/migrationjob.resumer.Resume
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | /go/src/github.com/cockroachdb/cockroach/pkg/migration/migrationjob/migration_job.go:82
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine.func1
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | /go/src/github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1193
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | /go/src/github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1194
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).runJob
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | /go/src/github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:255
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).resumeJob.func1
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | /go/src/github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:214
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:351
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | runtime.goexit
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | /usr/local/go/src/runtime/asm_amd64.s:1374
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 +Wraps: (4) running iterate callback
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 +Wraps: (5)
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | (opaque error wrapper)
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | type name: github.com/cockroachdb/cockroach/pkg/util/contextutil/*contextutil.TimeoutError
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 +Wraps: (6) ‹rpc error: code = DeadlineExceeded desc = context deadline exceeded›
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + |
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | (opaque error leaf)
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 + | type name: google.golang.org/grpc/internal/status/*status.Error
I210518 00:50:25.827676 89610 jobs/registry.go:1175 ⋮ [n6] 1476 +Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errbase.opaqueWrapper (6) *errbase.opaqueLeaf
E210518 00:50:26.439733 89610 jobs/adopt.go:260 ⋮ [n6] 1477 job ‹659385678523269126›: adoption completed with error running migration for 20.2-22: running iterate callback: ‹rpc error: code = DeadlineExceeded desc = context deadline exceeded›
I210518 00:50:27.285513 701 server/auto_upgrade.go:75 ⋮ [n6] 1478 error when finalizing cluster version upgrade: ‹set-version›: running migration for 20.2-22: running iterate callback: ‹rpc error: code = DeadlineExceeded desc = context deadline exceeded›

The entire cluster has no user SQL activity at this time, and no batch jobs (other than the automated migration job).

Network latency is good except 1 node (N6) that is physically far away with network latency in admin website showing 200-300ms.

Out of curiosity, how much data do you have running in this cluster? To be able to diagnose further we’d likely need to take a look at the debug zip. You can send it through Cockroach Labs, and do reference this forum post. We should be able to pick it up from there.

Will send info via that link. Current usage is 112GiB disk space used. Not terribly huge?

Just had 2 nodes hard-crash recently for unknown reason, killing my SSH access - will have to physically reboot them before I can attempt debug.zip.

A prior attempt at debug.zip was MANY gigabytes, so will have to see what I can do to make it smaller.