Slow performance after upgrade


In our multi-tenant system we have 200+ databases.
We started on v19.1.5, and performance was better until we migrated to v20.1.17.
A query that usually takes half a second on v19.1.5 takes minutes to complete on v20.1.17.

Issues

  • high latency
  • slow queries, each taking minutes to complete (for example, a “show tables;” query in any database takes a minute or more)
  • memory usage always above 80 to 90 percent on the nodes
  • the DBeaver client takes a significant amount of time to load data when connected to the cluster

Is there any hotfix or configuration change that would make this better?

I have also posted this in Polynomial latency growth for database isolated ("Bridge") multi-tenant models · Issue #63206 · cockroachdb/cockroach · GitHub
Is this due to an architectural limitation?
Thanks


The memory being high is not, in itself, a problem. Cockroach eagerly uses memory for its buffer caches.

I’m curious to understand the CPU utilization when this is happening. Could you provide screenshots of the hardware dashboard? Just as importantly, could you grab a CPU profile during such an incident (/debug/pprof/profile?seconds=30)?
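For anyone following along, here is a minimal sketch of grabbing that profile from a script. It assumes the node's HTTP port is reachable at 8080 and that no authentication is needed (a secure cluster would need HTTPS and a session); the function names, host, and output path are hypothetical, and only the /debug/pprof/profile?seconds=30 path comes from the request above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def profile_url(host: str, port: int = 8080, seconds: int = 30,
                scheme: str = "http") -> str:
    """Build the URL of the CPU profile endpoint on one node.

    The port and scheme defaults are assumptions; adjust them for your cluster.
    """
    query = urlencode({"seconds": seconds})
    return f"{scheme}://{host}:{port}/debug/pprof/profile?{query}"


def fetch_profile(host: str, out_path: str, port: int = 8080,
                  seconds: int = 30) -> int:
    """Download a CPU profile to out_path and return the bytes written."""
    # The server samples for `seconds` before it responds, so the client
    # timeout has to be longer than the sampling window.
    with urlopen(profile_url(host, port, seconds), timeout=seconds + 30) as resp:
        data = resp.read()
    with open(out_path, "wb") as f:
        f.write(data)
    return len(data)
```

Calling `fetch_profile("node1", "cpu.pprof")` while a slow SHOW TABLES is running should leave a file that can be attached to the thread or opened with `go tool pprof`.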

Also, for a statement that is slow, can you run EXPLAIN ANALYZE (DEBUG) on it and consider providing me with the Jaeger trace it contains?

Things got a bit less efficient in 20.1 as we tried to prune the buggy, old schema change system in favor of a more robust but less efficient jobs-based one. It shouldn’t be this bad though. I wonder if there’s something else going on. One possibility is that there’s contention on the jobs table or with your transactions in general.

I’d like to think we’ve made things marginally better from an efficiency perspective over the last few releases. We’ve definitely made things much better on a number of other dimensions. I highly encourage you to keep upgrading to 21.1.

We have architectural plans to make large schemas more efficient in upcoming releases. I’d like to confirm that you’re bumping up against such things. My hope is that there’s something else going on here. Any more insight you can give me in the shape of the workload and queries being issued when you run into problems would be helpful.

@ajwerner Thanks for checking this post.

I ran this for SHOW TABLES on defaultdb, which has 0 tables, but it errored out with a message.

SHOW TABLES alone took 60 seconds to complete. This happens all the time, in all databases.
DML queries respond as usual.
We use the Flyway migration tool for schema migrations and validations. After the upgrade, Flyway validation takes 13+ hours to validate all 200 databases (already migrated). This used to complete within 40 minutes on v19.1.5.
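To put the Flyway numbers in per-database terms, here is a rough calculation. It assumes exactly 200 databases and treats "13+ hours" as 13 hours, so the post-upgrade figures are lower bounds.

```python
DATABASES = 200            # assumption: "200+" rounded down to 200
before_minutes = 40        # v19.1.5: validation finished within 40 minutes
after_minutes = 13 * 60    # v20.1.17: "13+ hours", so this is a lower bound

# Average validation time per database, in seconds.
per_db_before_s = before_minutes * 60 / DATABASES
per_db_after_s = after_minutes * 60 / DATABASES

# Overall slowdown factor between the two versions.
slowdown = after_minutes / before_minutes

print(f"before: {per_db_before_s:.0f} s per database")
print(f"after:  at least {per_db_after_s:.0f} s per database")
print(f"slowdown: at least {slowdown:.1f}x")
```

That works out to about 12 seconds per database before the upgrade and at least 234 seconds (roughly 4 minutes) after it, a slowdown of at least 19.5x.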

We have 1400 active DB connections, and the average load is around 300 queries per second.
Hardware configuration:
Azure E-series VM, 16 CPUs, 128 GB RAM, Premium SSD with 1100 IOPS and 125 MB/s throughput.

I ran SHOW JOBS; it returned 0 rows and took 19 seconds to complete.
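In case it helps to collect these timings systematically, here is a small sketch of a timing harness. It assumes a Python DB-API 2.0 cursor from whatever PostgreSQL-compatible driver is already in use; the helper names are hypothetical and nothing here is specific to our setup.

```python
import time
from typing import Any, Dict, Iterable, Tuple


def time_statement(cursor: Any, sql: str) -> Tuple[float, int]:
    """Run one statement and return (elapsed seconds, number of rows)."""
    start = time.perf_counter()
    cursor.execute(sql)
    rows = cursor.fetchall()
    return time.perf_counter() - start, len(rows)


def time_show_tables(cursor: Any, databases: Iterable[str]) -> Dict[str, float]:
    """Time SHOW TABLES FROM <db> for each database name given."""
    results = {}
    for db in databases:
        # Double-quote the identifier so unusual database names still parse.
        quoted = '"' + db.replace('"', '""') + '"'
        elapsed, _ = time_statement(cursor, f"SHOW TABLES FROM {quoted}")
        results[db] = elapsed
    return results
```

Running `time_statement(cursor, "SHOW JOBS")` and `time_show_tables(cursor, ["defaultdb"])` on both versions would give directly comparable numbers.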

That’s the plan. But it would be great if we could figure out any other reason for this issue.

Thanks again @ajwerner