I’d like to follow up on something that a couple of us were discussing the other week: should we significantly decrease the default data TTL (currently 24h) for the 1.1 release?
The TTL controls when the GC process can clean up deleted rows or old versions of rows. Old data is nice because you can run time-travel queries (i.e. SELECT … AS OF SYSTEM TIME t); you can only time-travel within the TTL. I doubt that too many people use this feature directly, but I believe that backups use it. Of course, every query is technically a “time-travel query”, in the sense that it needs to read data at the txn’s snapshot. So the TTL also limits the maximum duration of transactions (but very long transactions that want to write anything are probably limited even more by the amount of memory allocated to the timestamp cache - we only keep so much state in there, and the overflow is handled by an advancing low-water mark).
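For concreteness, a time-travel read looks something like this (the table, columns and timestamp are made up; the read only succeeds if the timestamp is still within the TTL):

    -- Read the table as of an earlier timestamp; versions older than the
    -- GC TTL (24h by default today) are no longer available to such queries.
    SELECT id, status
      FROM jobs AS OF SYSTEM TIME '2017-08-28 10:00:00'
      WHERE status = 'pending';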
For all the nice things, there are also significant issues with keeping a lot of garbage around. First, it costs disk space (imagine getting an alert about running low on disk space and thinking that deleting your huge scratch table will help, but shit, after you delete it you have even less space because of all the tombstones you’ve just written).
Second, it makes the runtime of scans unpredictable and divorced from people’s intuitive model that the cost is proportional to the amount of data currently in the table. Users have already reported several times that particular use cases were performing terribly, and the answer, albeit not an obvious one, was that they were scanning over loads of tombstones. This happens in workloads with high row churn, such as when a table is used as a work queue (sketched below). The number of live rows was << the amount of garbage, and full table scans (sometimes caused by our query planner not being very smart, sometimes unavoidable) were slow.
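To make the work-queue case concrete, here’s a sketch of the pattern that generates all this garbage (the table and its schema are made up):

    -- Hypothetical work-queue table: producers insert rows, consumers delete
    -- them as soon as they're processed.
    CREATE TABLE job_queue (id INT PRIMARY KEY, payload STRING);

    -- High-churn usage: every processed job leaves behind a deletion
    -- tombstone that sticks around for the full TTL.
    INSERT INTO job_queue VALUES (1, 'do something');
    DELETE FROM job_queue WHERE id = 1;

    -- A "grab the next job" scan has to step over every tombstone written
    -- within the TTL, so its cost tracks the churn rate rather than the
    -- handful of live rows.
    SELECT id, payload FROM job_queue ORDER BY id LIMIT 1;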
So I think we should change this arbitrarily chosen 24h default to something way shorter (say, 10min). Large backups that take longer than that would be a problem, but I think that backups could move away from using time-travel queries and do something more custom - for example, contact each node in advance and have it open a RocksDB snapshot that it will later operate on. Some thought would need to be put into retries.
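For reference, the knob here is the gc.ttlseconds field in the zone config, so concretely the proposal would make the default look something like this (rest of the config elided; the exact number is up for debate), and anyone who really needs longer retention on a particular table could still override it per zone:

    # Relevant fragment of the default zone config: the proposed TTL,
    # down from today's 86400 (24h).
    gc:
      ttlseconds: 600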
What do y’all think?