I’ve been investigating some problems we have with CREATE TABLE statements sometimes returning retryable errors to clients. This is a problem particularly when the client is not prepared to handle such an error (e.g. when the CREATE is done by an ORM). There are specific improvements we can make so that a CREATE is less likely to return an error, but generally retries can be necessary for many reasons, for creates and other statements.
Clients do not observe these errors when they use implicit transactions (single statements outside of a txn), or when the error is encountered by a statement sent in the same batch of statements as its BEGIN; in these cases transactions are automatically retried at the server. The reasoning for the latter is that we know none of the statements in the BEGIN’s batch were the result of conditional logic the client might have based on reads performed in the transaction.
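To make the current rule concrete, here is a toy sketch of the eligibility check; the function name and parameters are mine, not the actual server code:

```python
def can_auto_retry(implicit_txn: bool, stmt_batch: int, begin_batch: int) -> bool:
    """Hypothetical sketch: may the server transparently retry on a
    retryable error? Not CockroachDB's actual implementation."""
    # Implicit transactions (a single statement outside any txn) are always
    # safe to retry: the client has received no results from them yet.
    if implicit_txn:
        return True
    # Explicit transactions: only statements sent in the same batch as the
    # BEGIN qualify, since none of them can be conditional on reads
    # performed earlier in the transaction.
    return stmt_batch == begin_batch
```

Under this rule, a CREATE TABLE sent in the batch *after* the BEGIN (say `stmt_batch=2`, `begin_batch=1`) does not qualify, and the client sees the retryable error.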
Ben had a good observation the other day, and I want to see if anyone has any thoughts: the reasoning about the client logic not being conditional extends to statements in the first batch after the BEGIN, if the BEGIN is trailing (or alone in) a batch. More generally, it extends to any statements sent by the client before any results of previous statements from the same txn have been sent back to it. So we could also auto-retry such first batches. This would prevent clients from seeing retryable errors when they do, say, CREATE TABLE, then COMMIT, with the CREATE in a batch separate from the BEGIN.
It is obviously true that statements in such a first batch are not conditional on reads (or, generally, statement results) generated by their transaction. They could, however, be conditioned on things observed through other concurrent transactions - and it’s theoretically possible that, if the client were in charge of directing a retry, it would refuse to perform it (or perform different logic in the retry). This is already true with the existing automatic retries, and I can’t see a clear difference with extending the retries in the way proposed - apart from the fact that, with this proposal, the server could start a retry an arbitrarily long time after the BEGIN has been sent, whereas before, the timing was controlled exclusively by how long the server took to execute statements.
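The proposed generalization amounts to keying the decision on whether any of the transaction’s results have been delivered to the client, rather than on batch boundaries; again, a hypothetical sketch with made-up names:

```python
def can_auto_retry_extended(implicit_txn: bool,
                            results_delivered_since_begin: bool) -> bool:
    # A retry is safe whenever the client cannot have conditioned any
    # in-flight statement on this transaction's results -- i.e. the server
    # has not yet sent back any results produced after the BEGIN.
    return implicit_txn or not results_delivered_since_begin
```

Note that the batch-based rule is a special case of this one: statements in the BEGIN’s own batch are necessarily sent before any of the transaction’s results could have been delivered.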
Does anyone have any thoughts on this?
A related question is when we should choose a timestamp for (the first attempt of) a transaction whose BEGIN is in a separate batch. Currently, we choose the timestamp when we see the BEGIN (as opposed to deferring it until the first KV operation is performed). This is because we want functions like cluster_logical_timestamp(), which might be evaluated early, to return a consistent timestamp. It seems weird that it’s possible for the reads in that txn not to observe state that the client has observed (through other concurrent txns) before issuing them. So a proposal would be to defer the timestamp assignment at least until we see the first statement in the txn after the BEGIN. However, @tschottdorf was telling me in another context that choosing the timestamp early allows us to guarantee to clients that, if they wait after doing the BEGIN and only then issue statements, they get… But now I can’t reconstruct what that guarantee might have been exactly…
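To make the two timestamp policies concrete, here is a toy model with a logical clock; the class and method names are mine, not the real implementation:

```python
import itertools

class Txn:
    """Toy model contrasting eager vs. deferred timestamp assignment."""

    def __init__(self, clock):
        self.clock = clock
        self.ts = None

    def begin_eager(self):
        # Current behavior: fix the timestamp when BEGIN is seen, so that
        # cluster_logical_timestamp() returns a stable value even if it is
        # evaluated before any KV operation.
        self.ts = next(self.clock)

    def first_statement_deferred(self):
        # Proposal: defer assignment until the first statement after BEGIN,
        # so the transaction's reads can observe state the client saw
        # (through other concurrent txns) between BEGIN and that statement.
        if self.ts is None:
            self.ts = next(self.clock)
        return self.ts

clock = itertools.count(100)
eager = Txn(clock)
eager.begin_eager()           # timestamp fixed at BEGIN time
later_write_ts = next(clock)  # a concurrent txn commits after our BEGIN
deferred = Txn(clock)
deferred.first_statement_deferred()  # timestamp chosen after that commit
# The eager txn's reads cannot see the concurrent commit; the deferred one's can.
assert eager.ts < later_write_ts < deferred.ts
```

The eager policy trades visibility of concurrent commits for a stable, early timestamp; whatever guarantee early assignment buys us is exactly the open question above.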