Transaction mysteriously fails


(Ivan Dubrov) #1

I’m sometimes getting the following error when trying to run “RELEASE SAVEPOINT” at the end of the transaction (the error is from the Rust postgres driver):

Error(Db(DbError { severity: "ERROR", parsed_severity: None, code: SqlState("25P02"), message: "Expected \"ROLLBACK TO SAVEPOINT COCKROACH_RESTART\": current transaction is aborted, commands ignored until end of transaction block", detail: None, hint: None, position: None, where_: None, schema: None, table: None, column: None, datatype: None, constraint: None, file: Some("sql/sqlbase/errors.go"), line: Some(62), routine: Some("NewTransactionAbortedError") }))

What is confusing is that this transaction is effectively read-only (it does a couple of SELECT statements and that's it). It's not explicitly marked as read-only, but it doesn't do any writes.

This only happens under somewhat heavy load (multiple concurrent transactions).

We are also observing a similar error, which may or may not be related:

Error(Db(DbError { severity: "ERROR", parsed_severity: None, code: SqlState("XX000"), message: "SAVEPOINT COCKROACH_RESTART needs to be the first statement in a transaction", detail: None, hint: None, position: None, where_: None, schema: None, table: None, column: None, datatype: None, constraint: None, file: None, line: None, routine: None }))

This one happens when we detect a serialization error in a transaction and try to retry. It occurs when we run "SAVEPOINT COCKROACH_RESTART" immediately after running "ROLLBACK TO SAVEPOINT COCKROACH_RESTART" in response to a serialization error (40001).

It also happens under load. It seems related to transaction serialization errors, but restarting transactions works fine most of the time; only some of the restarts cause this error.

The question I have: is there any way to capture detailed logs on the CockroachDB side to help troubleshoot this behavior?


(Ron Arévalo) #2

Hey @idubrov,

Sorry for the delay in response.

Regarding the retries, how often are they happening and under what schema? What are your testing conditions, and is the workload only reads? I know you mentioned that this specific transaction was read-only, but are there any test cases involving writes? Retries shouldn't happen frequently; however, they do happen, and we usually encourage users to manage them client-side.

Also, what version of CRDB are you using?

Thanks,

Ron


(Ivan Dubrov) #3

We have a very simple schema: pretty much three tables, two of which are essentially key-value (key to JSON blob). The third table is where we store our "search indices", and it is queried for all kinds of searches. It's essentially a secondary-index table mapping values to keys. During searches, we join this table on itself (though most of our searches are trivial).

The workload is reads + writes. We process a stream of data (~300k requests a day, I think?). Each request is a transaction of ~10 SQL statements, roughly 50/50 reads and writes. Each transaction takes ~300ms (which is a bit too long, but that's a separate issue).

While this is happening, we also have separate low-frequency read-only requests, which were the ones failing. We later found that CRDB wasn't using the correct index for them and opted for what was essentially a full scan, so the conflict is not surprising. We forced the index and that solved the problem.

We have code to restart transactions, and it works most of the time. The puzzling part is that sometimes it does not, failing with one of the two errors above. There are two cases I tested, and they seem to work fine:

  1. A serialization error is returned while running a regular SQL statement (an "INSERT" of some sort, say). In this case, we execute "ROLLBACK TO SAVEPOINT cockroach_restart", then proceed to #3.
  2. A serialization error is returned while running "RELEASE SAVEPOINT cockroach_restart". In this case, we do nothing and proceed to #3.
  3. We restart the loop by running "SAVEPOINT cockroach_restart" and re-run the transaction.

However, sometimes I get one of those two errors.
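In pseudocode, the retry loop above looks roughly like this. This is a minimal sketch with the database interaction abstracted behind a closure; `TxnError`, `retry_txn`, and the attempt counting are illustrative, not the actual application code:

```rust
#[derive(Debug, PartialEq)]
enum TxnError {
    Retryable, // serialization failure, SQLSTATE 40001
    Fatal(String),
}

/// Drive `body` through a CockroachDB-style SAVEPOINT retry loop.
/// Conceptually: BEGIN; SAVEPOINT cockroach_restart; run the body;
/// on a 40001, ROLLBACK TO SAVEPOINT cockroach_restart and try again
/// (step 3 above); on success, RELEASE SAVEPOINT and COMMIT.
fn retry_txn<F>(mut body: F, max_attempts: u32) -> Result<(), TxnError>
where
    F: FnMut(u32) -> Result<(), TxnError>,
{
    for attempt in 0..max_attempts {
        match body(attempt) {
            // Success: RELEASE SAVEPOINT cockroach_restart; COMMIT.
            Ok(()) => return Ok(()),
            // Case 1 or 2 above: roll back to the savepoint and loop.
            Err(TxnError::Retryable) => continue,
            // Anything else aborts the transaction for good.
            Err(e) => return Err(e),
        }
    }
    Err(TxnError::Fatal("too many retries".into()))
}

fn main() {
    // Succeeds on the third attempt after two simulated 40001 errors.
    let result = retry_txn(
        |attempt| if attempt < 2 { Err(TxnError::Retryable) } else { Ok(()) },
        5,
    );
    assert_eq!(result, Ok(()));
    println!("committed after retries");
}
```

The failures happen when the mapping between the driver-level statements and these conceptual steps breaks down, which is what the rest of this post is about.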

My theory is that we get the Expected "ROLLBACK TO SAVEPOINT COCKROACH_RESTART": current transaction is aborted, commands ignored until end of transaction block error when the last statement of our transaction succeeds, but the transaction then times out before we run "RELEASE SAVEPOINT cockroach_restart". Could that be the case?

However, I have no idea how SAVEPOINT COCKROACH_RESTART needs to be the first statement in a transaction could be happening. According to the logs, we detect the serialization error, run "ROLLBACK TO SAVEPOINT cockroach_restart", and then run "SAVEPOINT cockroach_restart".

It is still possible that our connection pool manager is playing tricks on us, so if there were a way to make CockroachDB emit extra logs (maybe log every statement in the transaction?) whenever this happens, it would help.

We use CRDB 2.1.4.

P.S. Some numbers for the last hour:

Serialization errors: 8
"SAVEPOINT COCKROACH_RESTART needs to be the first statement in a transaction": 4 (these happen on read-write transactions, I think).
"ROLLBACK TO SAVEPOINT COCKROACH_RESTART" errors: 0 (I think this one was linked to that read-only transaction with the non-ideal index; we seem to have fixed that, and it's gone).
Transactions: ~6500


(Ron Arévalo) #4

Hey @idubrov,

Thank you for being so thorough in your response! Following the steps you took, I attempted to reproduce this but was not able to. Regarding logging, you could try running with --vmodule=conn_executor=2, which will log statement info. However, this could impact performance, so I suggest turning it back down as soon as you're done testing.

Also, are you using the execute_txn function from our docs or have you implemented the retry logic yourself?

Thanks,

Ron


(Ivan Dubrov) #5

We have our own code to restart transactions (originally based on the one in the documentation, but rewritten because we needed to hold a transaction reference in a lifetimeless struct, so we used rental): https://gist.github.com/idubrov/4bfffd2ecb10575ca2acd27aa371f35d


(Ivan Dubrov) #6

I turned on the logging, but it doesn’t seem to print “BEGIN”/“COMMIT”/“SAVEPOINT”/“RELEASE” statements, which makes it not very useful.

I want to see those to understand whether we have an issue on the client side (like sending a rogue "SELECT 1" to test connection liveness right between "ROLLBACK TO SAVEPOINT" and "SAVEPOINT").

P.S. Okay, got some logs from the failure:

First record from CRDB logs (<read query> is a pretty bad read query which scans over lots of rows):

I190213 22:41:18.859708 13524243 sql/exec_log.go:164 [n3,client=10.120.0.18:48392,user=xxx] 13732 exec "" {} "" {…} 843.556 0 "HandledRetryableTxnError: ReadWithinUncertaintyIntervalError: read at time 1550097678.014283549,0 encountered previous write with future timestamp 1550097678.014283549,1 within uncertainty interval t <= 1550097678.017017538,0; observed timestamps: [{2 1550097678.017017538,0} {3 1550097678.014283549,0}]"

Second record from CRDB logs, same query (transaction was restarted & same query was executed):

I190213 22:41:19.674577 13524243 sql/exec_log.go:164 [n3,client=10.120.0.18:48392,user=xxx] 13969 exec "" {} "" {…} 1658.513 0 ""

Application logs:

WARN startup xxx 2019-02-13 22:41:19.285672023+00:00 - serialization error – restarting transaction
ERROR startup log_panics 2019-02-13 22:41:19.917391033+00:00 - thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error(Db(DbError { severity: "ERROR", parsed_severity: None, code: SqlState("XX000"), message: "SAVEPOINT COCKROACH_RESTART needs to be the first statement in a transaction", detail: None, hint: None, position: None, where_: None, schema: None, table: None, column: None, datatype: None, constraint: None, file: None, line: None, routine: None }))'
stack backtrace:

What is interesting is that I also see the following line in the CRDB logs (same connection):

I190213 22:41:25.329754 13524243 sql/exec_log.go:164 [n3,client=10.120.0.18:48392,user=bend] 16266 exec "" {} "" {…} 1540.270 0 "HandledRetryableTxnError: ReadWithinUncertaintyIntervalError: read at time 1550097683.787311066,0 encountered previous write with future timestamp 1550097683.787311066,1 within uncertainty interval t <= 1550097683.792518573,0; observed timestamps: [{1 1550097683.792518573,0} {3 1550097683.787311066,0}]"

This entry does not correspond to any serialization error in our application server logs.

Could it be that CRDB tries to restart the statement by itself, doing a "server-side transaction restart", and that this somehow conflicts with our client-side retries? I would expect to see our application attempt a transaction restart for that last entry.


(Andrei Matei) #7

@idubrov, did you enable logging with --vmodule=conn_executor=2 or with set cluster setting sql.trace.log_statement_execute = true;? Your logs look like the latter, and that one indeed does not seem to log BEGIN, etc. (it is meant more as an audit log, I believe).


(Ivan Dubrov) #8

I think you are right. I did both at some point, but then just used "set cluster setting sql.trace.log_statement_execute = true;".

I didn’t realize those were different.

I rewrote our transaction code to avoid issuing multiple "SAVEPOINT" statements (the documentation says it's enough to do "ROLLBACK TO SAVEPOINT"; there is no need to run another statement after that), and that seems to have removed the issue for us. I'm still curious what mystery statement happens between our "ROLLBACK TO SAVEPOINT" and the "SAVEPOINT" on the next iteration, but I think we're good for now.
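To illustrate the difference, here is a sketch of the statement sequence one transaction now sends under the simplified protocol; the function and the `<application statements>` placeholder are made up for illustration, not our actual code. The point is that "ROLLBACK TO SAVEPOINT cockroach_restart" re-establishes the savepoint, so "SAVEPOINT cockroach_restart" is issued only once, right after "BEGIN":

```rust
// Build the statement sequence for a transaction that hits `failures`
// serialization errors (SQLSTATE 40001) before finally succeeding.
fn statement_sequence(failures: usize) -> Vec<&'static str> {
    let mut stmts = vec!["BEGIN", "SAVEPOINT cockroach_restart"];
    for _ in 0..failures {
        stmts.push("<application statements>"); // one of these returns 40001
        stmts.push("ROLLBACK TO SAVEPOINT cockroach_restart");
        // Previously we pushed a second "SAVEPOINT cockroach_restart" here;
        // the rollback already re-arms the savepoint, so it is not needed.
    }
    stmts.push("<application statements>");
    stmts.push("RELEASE SAVEPOINT cockroach_restart");
    stmts.push("COMMIT");
    stmts
}

fn main() {
    let seq = statement_sequence(1);
    // "SAVEPOINT cockroach_restart" appears exactly once, right after BEGIN.
    assert_eq!(
        seq.iter().filter(|&&s| s == "SAVEPOINT cockroach_restart").count(),
        1
    );
    println!("{}", seq.join("; "));
}
```

This removes the window in which a stray statement between "ROLLBACK TO SAVEPOINT" and a second "SAVEPOINT" could trigger the "needs to be the first statement in a transaction" error.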