Should we significantly reduce the default TTL?

Hey comrades,

I’d like to follow up on something a couple of us were discussing the other week - should we significantly decrease the default data TTL from the current 24h for the 1.1 release?
The TTL controls when the GC process can clean up deleted rows or old versions of rows. Keeping old data around is nice because it lets you run time-travel queries (i.e. SELECT … AS OF SYSTEM TIME t): you can only time-travel within the TTL. I doubt that many people use this feature directly, but I believe that backups use it. Of course, every query is technically a “time-travel query”, in the sense that it needs to read data from the txn’s snapshot, so the TTL also limits the maximum duration of transactions (though very long transactions that want to write anything are probably limited even more by the amount of memory allocated to the timestamp cache - we only keep so much state in there, and overflow is handled by an advancing low-water mark).
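To make that concrete, here’s roughly what a time-travel query looks like (sketch only; the table name and timestamp are made up):

    -- Read the table as it was at some point in the past. This only works while
    -- that timestamp is still within the GC TTL; beyond that, the old versions
    -- may already have been garbage collected and the query fails.
    SELECT id, balance FROM accounts AS OF SYSTEM TIME '2017-08-01 10:00:00';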
For all the nice things, there are also significant issues with keeping a lot of garbage around. First, it costs disk space (imagine getting an alert about running low on disk space and thinking that deleting your huge scratch table will help, but shit, after you delete it you have even less space because of all the tombstones you’ve just written).
Second, it makes the runtime of scans unpredictable, divorced from people’s intuitive model that the cost is proportional to the amount of data currently in the table. Users have already reported several times that particular use cases were behaving terribly, and the answer, albeit not obvious, was that they were scanning over loads of tombstones. This happens in workloads with high row churn, such as when a table is used as a work queue. In those cases the number of live rows was << the amount of garbage, and full table scans (sometimes chosen because our query planner isn’t very smart, sometimes genuinely necessary) were slow.
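To make the work-queue pattern concrete, here’s a hedged sketch (the table, names and values are made up for illustration):

    -- Illustrative work-queue table.
    CREATE TABLE queue (
        id SERIAL PRIMARY KEY,
        payload STRING
    );

    -- Producers enqueue jobs...
    INSERT INTO queue (payload) VALUES ('job-1');

    -- ...and consumers delete them shortly afterwards, leaving MVCC tombstones
    -- behind that stick around for the full GC TTL.
    DELETE FROM queue WHERE id = (SELECT min(id) FROM queue);

    -- After hours of churn the table holds few live rows but piles of tombstones,
    -- so a scan like this can be far slower than the live row count suggests.
    SELECT * FROM queue ORDER BY id LIMIT 1;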

So I think we should change this arbitrarily chosen 24h default to something way shorter (say, 10min). Large backups that take longer than that would be a problem, but I think that backups could move away from using time-travel queries and do something more custom - for example, contact each node in advance and open a RocksDB snapshot on which they’ll later operate. Some thought would need to be put into retries.
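For context, the TTL we’re talking about is the per-zone gc.ttlseconds setting, so whatever default we pick can still be overridden per database or table. As a hedged sketch (around 1.1 this is done via cockroach zone set with a YAML snippet; newer versions expose it in SQL, which is what’s shown here):

    -- Sketch only: lower the GC TTL for one table to the proposed 10 minutes.
    ALTER TABLE scratch CONFIGURE ZONE USING gc.ttlseconds = 600;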

What do y’all think?

cc @bdarnell @tschottdorf @dan @david

I’ve also wondered whether the default TTL is unnecessarily large, but I think that the tombstone issue isn’t a compelling reason for reducing the TTL by default (instead, it calls for better discoverability of the condition, for example it should show up in traces). Similarly, a dropped table doesn’t need to remain available for historical reads (right?), so we should do something internally that wipes the data more eagerly.

I do think that the right solution for backups is to make sure the GC threshold is large enough (after all, with any workaround you put into place, you get pretty much the same outcome). Going to nodes and opening long-lived snapshots would be a messy solution and is probably unworkable due to the interaction with RocksDB internals.

I could be persuaded to reduce the default TTL, but I don’t think it’s a good idea to switch backups to be based on long-lived RocksDB snapshots. That just adds a lot of complexity and fragility (if any node restarts, the backup would have to be aborted because that node could no longer get its snapshot back). So the TTL needs to stay high enough that backups are still possible.

RocksDB supports lightweight “checkpoints”. These are persistent (and thus avoid the major caveat with snapshots). I haven’t thought at all about how backup could use them.

Interesting, those could be useful.

These checkpoints look interesting indeed.
Another thing I was thinking (which I guess ends up being pretty similar) is that TTLs could be temporarily increased while a backup is in progress - they’d extend to the start of the backup. So when a backup starts, something is gossiped about how no GC can be done until the backup is done (or some lease expires). Restarting nodes would not do any GC until they figure out what backups are in progress.

Being able to use time-travel queries is nice. However, they would be far more useful if they could be used in subqueries, such as INSERT INTO … SELECT …

Personally, I like the default of 24 hours, but I think anything between 1 hour and 1 week is reasonable as long as it can be configured. If you set it to 10 minutes, you might as well set it to 0… 10 minutes is going to run into edge cases, and you might as well make them stick out sooner.

@jlauro That’s a good point about subqueries. It’s tricky to allow different parts of the same query to operate at different timestamps, but at least for the case of INSERT INTO... SELECT AS OF SYSTEM TIME it would be very useful if we could make that work.
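Purely to illustrate the shape being asked for (hypothetical; this isn’t supported, and the table is made up):

    -- Hypothetical: copy rows as they existed at an earlier timestamp back into
    -- the live table, i.e. the read time-travels while the write happens now.
    INSERT INTO orders
        SELECT * FROM orders AS OF SYSTEM TIME '2017-08-01 10:00:00'
        WHERE customer_id = 42;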

So when a backup starts, something is gossiped about how no GC can be done until the backup is done (or some lease expires). Restarting nodes would not do any GC until they figure out what backups are in progress.

I’m -1 on dynamically changing settings like this based on backup activity. You have to provision for the worst case, so you’ll need enough disk space for the old versions, queries that can tolerate the presence of additional MVCC versions, etc. And if you’re doing that you should just set the GC TTL to the maximum allowable backup duration anyway, instead of taking the risk that starting a backup could cause a formerly-working system to break down.

The RocksDB checkpoints are more interesting because they work at a lower level (so the versions become invisible to MVCC iteration), but I think it’s going to be complicated to use them. They’re unreplicated state, so rebalancing during the backup could reduce the availability of these checkpoints (or the backup would have to keep going back to former range members that have them?).

@andrei w.r.t. increasing it just during a backup op – that might work for regular (i.e. “full”) backups, just extending it long enough to do a consistent read of the whole DB.

However, incremental backups use MVCC revisions to capture only what has changed since a previous backup. They are thus much cheaper to produce than full backups, e.g. in a large DB with a smaller working set currently seeing writes. The GC window caps the interval between incremental backups (each one must run before the previous backup ages out of the window), so making it order-of-minutes would make incremental backups operationally more annoying.

Just to echo David… an incremental backup can only run if the last backup (either full or incremental) is still within the GC window. In other words, if you want to use incremental backups, you have to run them at least once per whatever period the GC TTL is set to. Doing this every 10 minutes is pretty crazy given the current performance of backup.
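For reference, a hedged sketch of the flow we’re talking about (the destinations are placeholders):

    -- Occasional full backup.
    BACKUP DATABASE bank TO 'nodelocal:///backups/full';

    -- Incremental backups capture only the MVCC revisions written since the
    -- previous backup, so each one has to start before that previous backup's
    -- timestamp falls out of the GC window.
    BACKUP DATABASE bank TO 'nodelocal:///backups/inc-1'
        INCREMENTAL FROM 'nodelocal:///backups/full';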

I’m -1 on this for that reason. Everything that has to be re-tuned from our default settings to make CockroachDB truly production-ready (and I personally include incremental backups in that) is something that we won’t be testing as often.

Second, it makes the runtime of scans unpredictable, divorced from people’s intuitive model that the cost is proportional to the amount of data currently in the table.

We could, theoretically, also move older, non-live data out of the “live” keyspace and only read it for historical queries, which would address this problem (of course, that too isn’t a trivial change).

I wonder if we could encode something in sstable metadata to allow quickly skipping large sections of tombstones. For example, we could encode the first and last non-deleted key in the sstable metadata and then somehow use that information when scanning for a non-deleted key.

Yeah, ideas around this have come up - for example, we already maintain statistics at the range level about the number of “live keys” and the time when the statistics were generated, so we could maintain additional info about the time of the last change and skip a range entirely if there have been no recent changes and it only contains tombstones. But this line of thought seems dirty; I was hoping we wouldn’t need to explore it.

I subscribe to this logic generally, but it seems to me that in the case of backups, they can be scheduled such that they occur during natural periods of lower overall pressure. So just because you want to do backups sometimes doesn’t necessarily mean that you need to provision for peak load + backup. Also, temporarily increasing the GC TTL need not be the norm for backups - it’d only be activated in hopefully exceptional cases where incremental backups don’t finish within the configured TTL - so it’d be similar to any number of forms of backpressure that get activated when things in the system are bad.

I don’t know if I agree with this - the default TTL is just a default; I think you’re expected, and even encouraged, to change it. The same goes for other aspects of zone configs and provisioning. A database such as ours needs to be flexible enough for all sorts of workloads that we don’t test ourselves.
If the problem is incremental backups, I’d say that, given that incremental backup is a feature only available to enterprise license users, and most of our current users are not that, how about we make two defaults - one for people who activate incremental backups, and one for people who don’t. For the incremental backup people, I’d ask @david and @dan what the default should be; I think it’s unlikely that the answer is the current 24h, which was chosen long before incremental backup existed.
This is all moot if we don’t like the dynamic TTL idea for full backups and we’re not willing to work on the RocksDB checkpoints.

I think we need to try something out, though, because I think a lot of people are going to run into problems with the current TTL. Perhaps the moving-history-out-of-the-regular-keyspace idea… Do more people like that?

As you’re discovering, adjusting the default TTL is also problematic. We should remain open to all of the options.

it seems to me that in the case of backups, they can be scheduled such that they occur during natural periods of lower overall pressure.

Yes, tweaking settings during (full) backups is more acceptable than doing so based on other factors. It still seems like a last resort to me.

I’d say that, given that incremental backup is a feature only available to enterprise license users, and most of our current users are not that, how about we make two defaults - one for people who activate incremental backups, and one for people who don’t.

We may not have many enterprise users yet, but we don’t want to set up a choice where you have to make major performance sacrifices (in certain workloads) to enable incremental backups.

I think we need to try something out, though, because I think a lot of people are going to run into problems with the current TTL. Perhaps the moving-history-out-of-the-regular-keyspace idea… Do more people like that?

Yes, I’d like to explore options for minimizing the use of tombstones without changing the global TTL. The SQL partitioning RFC had some discussion about migrating older data to cold storage, although I think this would end up being very coarse-grained and not suitable for getting the number of tombstones down to where we need it. I think a more interesting/viable option would be to have separate TTLs for indexes compared to the main table.
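To make that last idea concrete, a purely hypothetical sketch of what a per-index TTL might look like, borrowing the table zone-config syntax (names are made up):

    -- Hypothetical: keep a short GC TTL on a high-churn secondary index while
    -- the table itself keeps the longer default.
    ALTER INDEX orders@orders_status_idx CONFIGURE ZONE USING gc.ttlseconds = 600;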

FWIW, Spanner’s SQL paper says that their new data layout, Ressi, separates data into an active file and an inactive one, with queries at the current timestamp only needing to read the active one. No further details.