I’m sorry to hear that your experience hasn’t been as smooth as we’d like it to be. I’d love to be able to troubleshoot the rest of these issues for you.
CockroachDB makes heavy use of the disks you provide it, so a faster disk will result in better performance. We suggest using SSD or NVMe devices, with a recommended volume size of 300-500 GB. HDDs are slow and aren’t optimized for database workloads.
How often has this happened? Does it eventually stabilize? I have a hunch as to what may be happening here, but having logs to confirm it would be great. You could send over a debug zip the next time this happens, or, if you still have the logs from the last time, we could take a look at those. I would just need to know the approximate time that it started.
What volume size are you using? I would suggest using our recommended volume size if you aren’t already doing so. Also, if you’re importing with the experimental features, the import will not scatter the replicas. With the normal import feature, we make sure to run the scatter as tables are being imported, which prevents one node from getting slammed and the cluster from ending up with imbalanced nodes. Was there any specific reason you were using the experimental import?
Thanks for filing this one! This looks like a straightforward bug. I spoke with our developers, and what seems to be happening is that when you pause the job, one of the workers might not get the message that the job was paused. That worker continues to work and eventually fails, instead of pausing and throwing away any work that it did after the initial pause.
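To make that a bit more concrete, here is a toy Python model of the behavior. It is only a sketch and not our actual job code: the names `Worker`, `receives_pause`, and `run_job` are made up for illustration, and the real system is distributed rather than a simple loop.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    # Hypothetical flag: False models a worker that never sees the pause message.
    receives_pause: bool = True
    paused: bool = False
    committed: list = field(default_factory=list)

    def notify_pause(self):
        # Only a worker that actually receives the message stops working.
        if self.receives_pause:
            self.paused = True

    def process(self, chunk):
        # A paused worker refuses new work; an unpaused one keeps going.
        if self.paused:
            return False
        self.committed.append(chunk)
        return True


def run_job(workers, chunks, pause_at):
    """Hand out chunks round-robin and request a pause before chunk index pause_at."""
    pause_requested = False
    for i, chunk in enumerate(chunks):
        if i == pause_at:
            pause_requested = True
            for w in workers:
                w.notify_pause()
        workers[i % len(workers)].process(chunk)
    if not pause_requested:
        return "succeeded"
    # If any worker kept running after the pause, the job ends up failed
    # rather than cleanly paused (the behavior described above).
    still_running = [w for w in workers if not w.paused]
    return "failed" if still_running else "paused"
```

In this model, a job whose workers all receive the pause ends in the "paused" state, while a single worker created with `receives_pause=False` keeps committing chunks after the pause and the job ends as "failed".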
Let me know if you have any questions.