Import Range Replication Skew

replication
(Philippe Laflamme) #1

Hi,

During an import of a large table, I’m seeing a large skew in the distribution of ranges across the nodes. I have a 12-node cluster, but the ranges are disproportionately concentrated on only 3 of them: 35K ranges vs ~4K on the other nodes.

Now, one of those 3 nodes failed due to lack of disk space. This is unfortunate since there’s plenty of space left on the cluster, just not on this particular node given the range skew.

Is there anything that can be done to avoid this skew during import?

FYI: I’m on v19.1.0-rc.2 and using experimental_direct_ingestion.
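
For context, the statement I’m running is shaped roughly like this; the table, columns, and file URLs below are placeholders rather than my actual schema:

    -- Rough shape only: names, column types, and source files are placeholders.
    IMPORT TABLE big_table (
        id UUID PRIMARY KEY,
        payload JSONB
    )
    CSV DATA ('gs://my-bucket/part1.csv', 'gs://my-bucket/part2.csv')
    WITH experimental_direct_ingestion;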

(Ron Arévalo) #2

Hey @Phil,

Could you send over a screenshot of your Data Distribution page?

It can be found in the admin ui at the link <host>/#/data-distribution.

Thanks,

Ron

(Philippe Laflamme) #3

Looking at that screen doesn’t seem useful for this problem: the table doesn’t show up because it’s still in the process of importing (i.e. IMPORT table_name(...) ...). Is there a way to show a table mid-import on that screen? Here it is anyway:

In the meantime, here’s a screenshot of the Replicas per Node metric. Please note that the numbers are different from what I mentioned earlier, because I destroyed the previous cluster and created a new one before restarting the failed import…

[screenshot: Replicas per Node]
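
In case it helps, I can also pull an approximate replica count per store straight from SQL (store ≈ node in my setup); this assumes crdb_internal.ranges exposes a replicas array of store IDs on this version:

    -- Approximate replicas per store; assumes crdb_internal.ranges has a
    -- replicas array column of store IDs (one store per node here).
    SELECT store_id, count(*) AS replica_count
    FROM (SELECT unnest(replicas) AS store_id FROM crdb_internal.ranges) AS r
    GROUP BY store_id
    ORDER BY replica_count DESC;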

(Ron Arévalo) #4

Hey @Phil,

Thanks for the screenshot. Did the import ever finish? If it did, did the replicas rebalance afterwards?

I was discussing this issue with our engineers, and the scattering of ranges is something we are working on making better. At the moment, neither lease scattering nor replica scattering is performing well, as you can tell.

You can follow along with this issue.

Thanks,

Ron

(Philippe Laflamme) #5

Hey @ronarev,

Thanks for the information. Honestly, I’ve been failing to import this table for the past few weeks. It looks like I’m hitting different limitations at different times:

  • using HDDs, I had to set the environment variable COCKROACH_ENGINE_MAX_SYNC_DURATION=24h; otherwise the import would fail with disk stall errors;
  • when the number of ranges per node gets to around 20-30K, the cluster starts thrashing: the list of “problem ranges” (for various reasons) becomes gigantic and everything comes to a grinding halt;
  • using (smaller) SSDs, I run out of disk space on only a subset of nodes because of the skew described here;
  • to get the cluster to rebalance during the import, I tried pausing the job (see the sketch after this list), but got this error: https://github.com/cockroachdb/cockroach/issues/36900
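
For reference, the pause attempt in that last bullet was just the standard job statements, roughly this (assuming PAUSE JOBS accepts a select clause on this version):

    -- Pause the in-flight import; assumes PAUSE JOBS with a select clause works here.
    PAUSE JOBS (
        SELECT job_id FROM [SHOW JOBS]
        WHERE job_type = 'IMPORT' AND status = 'running'
    );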

So, unfortunately, my CDB experience has been pretty bad so far. To be fair, at this point, I’m using rc releases and experimental features, but that’s because it also failed using the latest release and normal settings.

Importing a big chunk of data in CDB has lots of sharp edges at the moment; it makes it less likely for us to move over existing workloads.

(Ron Arévalo) #6

Hey @Phil,

I’m sorry to hear that your experience hasn’t been as smooth as we’d like it to be. I’d love to be able to troubleshoot the rest of these issues for you.

Cockroach makes heavy use of the disks you provide it, so using a faster disk will result in better performance. We suggest using SSD or NVMe devices with a recommended volume size of 300-500 GB. HDDs are slow, and aren’t optimized for database workloads.

How often has this happened? Does it eventually stabilize? I have a hunch as to what may be happening here, but having logs to confirm it would be great. You could send over a debug zip the next time this happens, or, if you still have the logs from the last occurrence, we could take a look at those. I would just need to know the approximate time it started.

What is the volume size you’re using? I would suggest using our recommended volume size if you aren’t already doing so. Also, if you’re importing with the experimental feature, it will not scatter the replicas; with the normal import path, we make sure to run a scatter as tables are being imported, to prevent one node from getting slammed and the cluster from becoming imbalanced. Was there any specific reason you were using the experimental import?
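
Concretely, the difference is just dropping the WITH clause; the same kind of placeholder statement, without the option, goes through the regular path that scatters:

    -- Same placeholder statement, minus the experimental option;
    -- this path scatters ranges as tables are ingested.
    IMPORT TABLE big_table (id UUID PRIMARY KEY, payload JSONB)
        CSV DATA ('gs://my-bucket/part1.csv');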

Thanks for filing this one! It seems like a straightforward bug. I spoke with our developers, and what appears to be happening is that when you pause the job, one of the workers might not get the message that the job was paused; it continues to work and eventually fails, instead of pausing and throwing away any work it did after the initial pause.

Let me know if you have any questions.

Thanks,

Ron

(Philippe Laflamme) #7

Yep, I’m aware of the disk recommendations. I’m going through a proof of concept, so this is by no means for production purposes (I am using production data, though). FWIW, I gave up on using HDDs for this POC.

The thrashing would happen consistently on every try. The only way I found to mitigate it was to double the number of nodes from 12 to 24. FWIW, I haven’t seen it on SSDs yet.

I’m now using a 12 node cluster with 250GB provisioned SSDs on GKE.

As for why I used the experimental import: nothing specific beyond “let’s see if this makes the import work.” I’ve launched the import job again without this feature (but now on SSDs). It was doing great at first but now seems to have stalled entirely; all the cluster metrics are flat. Here’s the debug.zip file: https://storage.googleapis.com/cockroachdb-debug/2019-04-17-debug.zip

My pleasure! Happy to help make this product better :slight_smile:

(Philippe Laflamme) #8

@ronarev Any idea why the import would stall like it did? It’s still reported as pending / running, yet is clearly making no progress whatsoever. I can provide additional debug.zip files if that’s helpful.

Cheers

(Ron Arévalo) #9

Hey @phil,

There’s nothing in the logs that would indicate any reason for the job to fail.

If you run SHOW JOBS, does it show the import as a failure?
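
A quick way to check just the import’s row, assuming the usual SHOW JOBS columns on your version:

    -- Show only the import job's status and progress.
    SELECT job_id, status, fraction_completed, error
    FROM [SHOW JOBS]
    WHERE job_type = 'IMPORT';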

Thanks,

Ron

(Philippe Laflamme) #10

Hey @ronarev,

No, it shows up as running.

Cheers,
Philippe

(Ron Arévalo) #11

@Phil,

Has it made any progress since you noticed it stalled? Or is it just stuck? Also, none of your nodes have failed, correct?

It just sounds like this is a really slow-moving import.

Thanks,

Ron

(Philippe Laflamme) #12

It has made zero progress. No node is marked as failed or suspect; the cluster looks perfectly healthy.

Also, all metrics are basically flat: no ranges are being created, “live bytes” is not moving at all, etc.

The job clearly stopped, but it’s not reflected in its state. Is there anything I can do to further investigate this? I can restart the job to see if this happens again.

Cheers,
Philippe

(Ron Arévalo) #13

Hey @Phil,

I would suggest canceling the job if you haven’t done so already. You can cancel it by running CANCEL JOB <job_id>; you can find the job_id by running SHOW JOBS.
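
Something like this, where the id is just a placeholder for whatever SHOW JOBS returns for your import:

    -- Find the import's job id, then cancel it (the id below is a placeholder).
    SELECT job_id, status FROM [SHOW JOBS] WHERE job_type = 'IMPORT';
    CANCEL JOB 123456789;  -- replace with the job_id from the previous query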

If you restart the job and it fails again at around the same location, let us know.

Thanks,

Ron

(Philippe Laflamme) #14

Hi @ronarev,

I destroyed the cluster, created a fresh one, and restarted the import. The same thing seems to have happened: the import made progress for a while, and now all the metrics look flat while the job’s status is still “running”.

Is there anything I can do to figure out if some node somewhere is actually doing something for this job?

Here’s the debug.zip file: https://storage.googleapis.com/cockroachdb-debug/2019-04-23-debug.zip

Cheers,
Philippe

(Ron Arévalo) #15

Hey @Phil,

I’ll need to escalate this to our devs, as this seems to be a bug with import. However, we need some more information from you.

So far, I know that you have a 12-node cluster with 250GB provisioned SSDs on GKE.

Can you tell me what machine type you’re using? If it’s a custom machine, can you tell me how many vCPUs and how much memory it’s provisioned with; if it’s not custom, then the machine name will do. Could you also provide the DDL?

Lastly, if it’s possible, could you provide us with the import data? Is this test data that you’re able to share?

You can upload all that information here.

Thanks,

Ron

(Philippe Laflamme) #16

Hey @ronarev ,

I’ve uploaded the DDL as well as a screenshot of the GKE node pool.

Unfortunately, I cannot share the data since it’s actually from production. Perhaps I can mangle it somehow, but it’s basically just some JSON blobs under a UUID key (I’m trying to migrate an HBase workload). I provided the details of each individual file being imported.
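
To give a rough idea without posting the real DDL here, the shape is more or less this (names and exact types are illustrative only; the actual DDL is in the shared folder):

    -- Illustrative only: JSON blobs keyed by UUID; real names/types differ.
    CREATE TABLE blobs (
        id UUID PRIMARY KEY,
        payload JSONB
    );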

Please let me know if I can provide anything else to help investigate this.

Cheers,
Philippe

(Ron Arévalo) #17

Hey @Phil,

Thanks for the data. Our engineers advised trying the import again, and when it does get stuck, checking the goroutines on each node to see which stacks are in importccl; then we can see what the import is waiting on.

Thanks,

Ron

(Philippe Laflamme) #18

Hey @ronarev ,

I’ll do that ASAP. Is the process of dumping goroutines documented anywhere? I’m not particularly familiar with Go tooling.

It looks like they’ll be included in debug.zip from now on (https://github.com/cockroachdb/cockroach/issues/33318). What’s the best way for me to provide them in the meantime?

Cheers,
Philippe

(Ron Arévalo) #19

Hey @Phil,

If the cluster is running, you can visit the admin UI page <host>/debug/pprof/ui/goroutine/<node>. That will take you to a page where you can see what’s running. You could also look in the debug zip, but this endpoint is a bit more visual; if you sort it by “top”, it might help show what the import is getting stuck on.

Thanks,

Ron

(Philippe Laflamme) #20

Hey @ronarev ,

I’ve added a dump of goroutines from each node, taken at approximately the same time, to the shared folder. None of them show any stack within importccl, and I ran the dump a few times.

I also wrote another script that keeps dumping a node’s stack traces until it finds evidence of import in any of them. Again, it couldn’t find anything.

Is there anything else I can provide short of the actual data or access to the running cluster?

Cheers,
Philippe