TimeSeries queries timing out, some way to fix or reset it?

Hi there,

I have recently been experimenting with migrating one of our workloads from Cassandra to CRDB and have run into a confusing issue with the metrics in the Admin UI.

TL;DR: all requests to /ts/query return a 504 Gateway Timeout after 30 seconds (context deadline exceeded), so the Admin UI metric graphs never load. The logs don’t seem to say anything useful about it. Is there some way to reset the timeseries data? This happens even if I try to load just 1 metric in the custom time series chart.
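
A minimal sketch of how the timeout could be reproduced outside the Admin UI, by timing a single-metric request against /ts/query. Everything here is an assumption rather than something from my setup: an insecure cluster serving HTTP on localhost:8080, a JSON body with `start_nanos`, `end_nanos` and `queries` fields, and `cr.node.sql.select.count` as the example metric name. The helper names `build_ts_query` and `probe_ts_query` are hypothetical.

```python
import json
import time
import urllib.error
import urllib.request


def build_ts_query(metric_name, window_seconds=600, now_nanos=None):
    """Build the JSON body for a one-metric time series query (assumed field names)."""
    if now_nanos is None:
        now_nanos = time.time_ns()
    return {
        "start_nanos": now_nanos - window_seconds * 1_000_000_000,
        "end_nanos": now_nanos,
        "queries": [{"name": metric_name}],
    }


def probe_ts_query(base_url, metric_name, timeout_seconds=35):
    """POST the query and return (http_status_or_None, elapsed_seconds)."""
    body = json.dumps(build_ts_query(metric_name)).encode("utf-8")
    request = urllib.request.Request(
        base_url.rstrip("/") + "/ts/query",
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )
    started = time.monotonic()
    try:
        with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses such as the 504 land here.
        status = err.code
    except OSError:
        # Connection refused, DNS failure, or client-side timeout.
        status = None
    return status, time.monotonic() - started


if __name__ == "__main__":
    # Assumed address of one node's HTTP port on an insecure cluster.
    status, elapsed = probe_ts_query("http://localhost:8080", "cr.node.sql.select.count")
    print(f"status={status} elapsed={elapsed:.1f}s")
```

The client timeout is set slightly above 30 seconds so that a server-side deadline shows up as a 504 instead of a client-side abort.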

Things I’ve tried that didn’t work:

My cluster seems healthy otherwise, I can insert and query normal data.

This problem came up when I started adding many nodes. Initially I tested on a single node (which happened to be a larger node with quite a few SSDs), then I quickly added 20 smaller nodes with HDDs and let the cluster rebalance. One of those nodes had a disk error and died, so I had to decommission it. According to the Admin UI, that went fine.

Any help would be appreciated.

I just had to rebuild this; it seemed unfixable.

When I started the bad node back up I was able to get metrics back, but the node would crash after a little while because it had corrupt data. It seemed that the timeseries metrics call was timing out because it needed data from a range the bad node had. Trying to recover that data wasn’t working, so at least 1 range wasn’t able to replicate elsewhere. CRDB doesn’t seem to give any option to just drop a range entirely and write off the data loss. Lesson learned.

Note that on the node with the bad disk, I could copy some data, but the RocksDB manifest was corrupt. I tried repairing it using https://github.com/facebook/rocksdb/wiki/RocksDB-Repairer, which succeeded, but then CRDB wouldn’t start since it uses a custom comparator.