How to cancel a never-ending revert

I have an import that failed (context canceled error), and the revert seems to be running forever. I have posted before about how revert takes a long time for a large dataset, but that was on the order of 24 hours; this revert has been running for 9 days.

After a day or two I noticed errors like this in the logs (primary keys obscured):

W210524 15:38:01.962924 186995257 kv/kvclient/kvcoord/dist_sender.go:1514 ⋮ [n1,job=‹659740178040946689›] 2313269  slow range RPC: have been waiting 96.33s (1 attempts) for RPC RevertRange [‹/Table/54/2/"pk1"›,‹/Table/54/2/"pk2"›) to r527949:‹/Table/54/2/"pk3"› [(n20,s115):1, (n16,s94):2, (n13,s76):3, next=4, gen=1293]; resp: ‹(err: <nil>), *roachpb.RevertRangeResponse›

Now, I had increased my max range size to 4GB, since these are large nodes that will be storing quite a bit of data. I figured that might be causing these RevertRange commands to take too long, so I changed it down to 1GB and let it do the splits. That got rid of this error, but still, after several days, the revert continues…
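
For anyone following along, here is a minimal sketch of the kind of statement involved in that range-size change. It assumes the limit is set per table through the zone config field `range_max_bytes`, and that "1GB" means 1 GiB. The helper name `range_max_bytes_sql` is made up for illustration, and the thread does not say whether the change was made on the table or on the default zone.

```python
import re

GIB = 1 << 30  # assumption: "GB" in the post is read as GiB


def range_max_bytes_sql(table: str, max_gib: int) -> str:
    """Build an ALTER TABLE ... CONFIGURE ZONE statement (hypothetical helper)."""
    # Accept only a plain identifier so the table name is safe to interpolate.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"unexpected table name: {table!r}")
    if max_gib <= 0:
        raise ValueError("max_gib must be positive")
    return (
        f"ALTER TABLE {table} CONFIGURE ZONE "
        f"USING range_max_bytes = {max_gib * GIB};"
    )


# Statement for lowering the limit from 4 GiB to 1 GiB on the table in the logs.
print(range_max_bytes_sql("mytable", 1))
```

The printed statement would be run from a SQL shell. Existing ranges larger than the new limit are then split in the background, which matches the "let it do the splits" step above.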

Now the only revert-related logging I see suggests that it is repeatedly retrying the revert of the import:

seu-pid206.pdx.turnitin.io: I210527 18:15:46.322240 407841 sql/revert.go:89 ⋮ [n18,job=‹659740178040946689›] 112575  reverting table ‹mytable› (54) to time 1621407124.604026590,2147483647
seu-pid206.pdx.turnitin.io: I210528 00:53:42.657859 407841 sql/revert.go:89 ⋮ [n18,job=‹659740178040946689›] 118699  reverting table ‹mytable› (54) to time 1621407124.604026590,2147483647
seu-pid206.pdx.turnitin.io: I210528 07:30:22.302704 407841 sql/revert.go:89 ⋮ [n18,job=‹659740178040946689›] 124821  reverting table ‹mytable› (54) to time 1621407124.604026590,2147483647
seu-pid206.pdx.turnitin.io: I210528 14:13:06.184212 407841 sql/revert.go:89 ⋮ [n18,job=‹659740178040946689›] 131030  reverting table ‹mytable› (54) to time 1621407124.604026590,2147483647
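
To see how often the revert restarts, the timestamps in those lines can be compared. The following is a rough sketch that assumes the log header format visible above (a severity letter followed by `yymmdd hh:mm:ss.ffffff`); the function names are invented for this example.

```python
import re
from datetime import datetime

# Severity letter (I, W, E or F) followed by the yymmdd hh:mm:ss.ffffff stamp.
LOG_TS = re.compile(r"\b[IWEF](\d{6} \d{2}:\d{2}:\d{2}\.\d{6})\b")

SAMPLE = [
    "seu-pid206.pdx.turnitin.io: I210527 18:15:46.322240 407841 sql/revert.go:89 ⋮ [n18,job=‹659740178040946689›] 112575  reverting table ‹mytable› (54) to time 1621407124.604026590,2147483647",
    "seu-pid206.pdx.turnitin.io: I210528 00:53:42.657859 407841 sql/revert.go:89 ⋮ [n18,job=‹659740178040946689›] 118699  reverting table ‹mytable› (54) to time 1621407124.604026590,2147483647",
    "seu-pid206.pdx.turnitin.io: I210528 07:30:22.302704 407841 sql/revert.go:89 ⋮ [n18,job=‹659740178040946689›] 124821  reverting table ‹mytable› (54) to time 1621407124.604026590,2147483647",
    "seu-pid206.pdx.turnitin.io: I210528 14:13:06.184212 407841 sql/revert.go:89 ⋮ [n18,job=‹659740178040946689›] 131030  reverting table ‹mytable› (54) to time 1621407124.604026590,2147483647",
]


def revert_attempt_times(lines):
    """Return the timestamp of every 'reverting table' log line, in input order."""
    times = []
    for line in lines:
        if "reverting table" not in line:
            continue
        match = LOG_TS.search(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%y%m%d %H:%M:%S.%f"))
    return times


def gaps(times):
    """Return the time elapsed between consecutive attempts."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


for gap in gaps(revert_attempt_times(SAMPLE)):
    print(gap)
```

On the four lines above, each gap comes out at a bit over six and a half hours, so the job appears to start the revert from scratch on a fairly regular cycle.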

I started the import on v21.1.0, and I just tried upgrading to v21.1.1 to see if that helps.

It’d be nice to fix whatever is preventing this thing from finishing, but it would also be nice to know whether there is a workaround if that can’t happen, like a way to just kill the revert entirely, drop the table, and restore it from backup. (I had issues before, so I have already been told how to make the table visible in order to drop it if necessary.)

Start by pausing the job. That won’t cancel it, but it will at least stop the retries until we can work out the problems.
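
As a sketch of what pausing looks like in practice, the snippet below builds a `PAUSE JOB` statement and a follow-up status query for the job ID shown in the logs. The helper names are hypothetical, and running the statements (for example from `cockroach sql`) is left to whatever client you already use.

```python
def pause_job_sql(job_id) -> str:
    """Build a PAUSE JOB statement for a numeric job ID (hypothetical helper)."""
    job = int(job_id)  # rejects anything that is not a plain integer
    if job <= 0:
        raise ValueError("job ID must be positive")
    return f"PAUSE JOB {job};"


def job_status_sql(job_id) -> str:
    """Build a query that reads the job's current status from SHOW JOBS."""
    job = int(job_id)
    return (
        "SELECT job_id, status, fraction_completed "
        f"FROM [SHOW JOBS] WHERE job_id = {job};"
    )


JOB_ID = 659740178040946689  # job ID taken from the log lines in this thread

print(pause_job_sql(JOB_ID))
print(job_status_sql(JOB_ID))
```

After the pause is requested, the status query should eventually report the job as paused rather than reverting. Whether it gets there quickly on a job that is stuck this badly is something to check rather than assume.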

If you can gather stack traces from the nodes while it’s running, they’re likely to have some clues. Also, this feels very worthy of a GitHub issue.

Let’s use the existing GitHub issue "import: rolling back IMPORT INTO is slow" (Issue #62799 in cockroachdb/cockroach) for this. Sorry for the noise.