Performance degrades over time

Can anyone please provide some tuning advice?

I have a single node CRDB server coupled with Nakama as a game server. Over a period of several weeks, the performance degraded until it impacted the game. I did a backup and recreated the database after which performance was back to it’s best but it is now degrading again.

The symptoms are:

  • increased use of disk space although the actual data stored is very much lower
  • Dramatic increase in database reads
  • Increase in CPU load

I set gc.ttlseconds=3600 before restarting the server when I recreated the database but it’s unclear if this had a major beneficial effect.

I would appreciate any useful pointers. Thanks!

Sorry to hear that you’re having problems. What is the nature of the workload? Is it update-heavy? Given you’re saying that it took several weeks to degrade, I wonder if any of your queries are performing full table scans and thus getting slower because each invocation requires reading more data.

If you go to the statements page in the UI, what are the common queries and what are the slow queries?

Hi Andrew, thanks for getting back to me.

The workload is update heavy (typical for a game server), although I explicitly read each record before updating it. There are relatively few inserts, and the tables are not growing significantly over time.

There are no slow queries, the increased workload appears to be a background operation in the database which becomes more intensive over time. I attach the first page of the statements list as it stands now, although the system is running very cleanly right now (see below).

The game plays fine, with no slow operations, until this background task consumes all the resources (esp. CPU) and suddenly all queries are slow.

This occurred yesterday. I rebooted the server at 20:20 last night and then today at 13:45 I wiped the database and reimported from a backup. You can see the effect on system resources from the attached charts. In particular, Disk I/O is back to normal with much lower data read rates. Also disk and memory usage is much lower.

Does this give any clues to what is happening? Thanks again.

Huh, that’s confusing. When the CPU utilization gets high, it’d be great to see a CPU profile. You can get one of those at <url>/debug/pprof/profile?seconds=30. Also a stack trace might help (<url>/debug/pprof/goroutine?debug=2).

The fact that the disk usage was so high makes me think that garbage has something to do with the problem here. Perhaps even shorter GC TTLs will help.

What version is this?

I’m on version 20.2.0.

I thought garbage too, which is why I shortened the TTL to 3600 previously. This time I’m on 600 seconds. Am I right in thinking that this needs to be longer than the time taken to complete a backup?
As I’m on a single node there’s no syncing nodes, I imagine the TTL is used for that too (similar to journalling on mongoDB?)

I’d like to resolve this before it gets back into a poor performance state again, but if not then I’ll take more diagnostics and grab the data you’ve asked for. Thanks.

No, the TTL isn’t used for replication in cockroach. It is used for incremental backups, time travel queries, and changefeeds. For update-heavy workloads full backups generally make more sense than incrementals. Full backups are not really affected. While a backup is being taken, the gc of the overlapping data is paused (the data is “protected”).

Were there any long-running jobs? Do you continue to see the data size growing?

Also, 20.2.7 has a number of fixes over 20.2.0; if you find an opportunity to upgrade, it won’t hurt.

The data size isn’t significantly changed since my last restart, I will monitor it over the coming days. There are no long running jobs.

Thanks for the clarification on the TTL. Is it safe to set it extremely low as I’m not using those features? What is the minimum value?

I’ll upgrade at the next restart, thanks for that.

It’s generally not advisable to set it below around 10 minutes. We don’t have much in the way of testing shorter than that. I should also note that it will, in effect, cap the duration of transactions.

1 Like