Running into TransactionRetryError with a WildFly Application using JPA

I started to migrate my applications from PostgreSQL to CockroachDB. In the first tests everything seemed to work fine, but now we have run into a problem with a production system: we receive a TransactionRetryWithProtoRefreshError.

The error messages look like this:

Caused by: javax.transaction.RollbackException: ARJUNA016053: Could not commit transaction.
	at org.jboss.jts//com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.commitAndDisassociate(TransactionImple.java:1299)
	at org.jboss.jts//com.arjuna.ats.internal.jta.transaction.arjunacore.BaseTransaction.commit(BaseTransaction.java:126)
	at org.jboss.jts.integration//com.arjuna.ats.jbossatx.BaseTransactionManagerDelegate.commit(BaseTransactionManagerDelegate.java:94)
	at org.wildfly.transaction.client@1.1.11.Final//org.wildfly.transaction.client.LocalTransaction.commitAndDissociate(LocalTransaction.java:75)
	at org.wildfly.transaction.client@1.1.11.Final//org.wildfly.transaction.client.ContextTransactionManager.commit(ContextTransactionManager.java:71)
	at org.jboss.as.ejb3@20.0.1.Final//org.jboss.as.ejb3.tx.CMTTxInterceptor.endTransaction(CMTTxInterceptor.java:89)
	... 114 more
	Suppressed: org.jboss.jca.core.spi.transaction.local.LocalXAException: IJ001156: Could not commit local transaction
		at org.jboss.ironjacamar.impl@1.4.22.Final//org.jboss.jca.core.tx.jbossts.LocalXAResourceImpl.commit(LocalXAResourceImpl.java:182)
		at org.jboss.jts//com.arjuna.ats.internal.jta.resources.arjunacore.XAOnePhaseResource.commit(XAOnePhaseResource.java:120)
		at org.jboss.jts//com.arjuna.ats.internal.arjuna.abstractrecords.LastResourceRecord.topLevelOnePhaseCommit(LastResourceRecord.java:172)
		at org.jboss.jts//com.arjuna.ats.arjuna.coordinator.BasicAction.onePhaseCommit(BasicAction.java:2395)
		at org.jboss.jts//com.arjuna.ats.arjuna.coordinator.BasicAction.End(BasicAction.java:1497)
		at org.jboss.jts//com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.end(TwoPhaseCoordinator.java:96)
		at org.jboss.jts//com.arjuna.ats.arjuna.AtomicAction.commit(AtomicAction.java:162)
		at org.jboss.jts//com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.commitAndDisassociate(TransactionImple.java:1287)
		... 119 more
	Caused by: javax.resource.ResourceException: SQLException
		at org.jboss.ironjacamar.jdbcadapters@1.4.22.Final//org.jboss.jca.adapters.jdbc.BaseWrapperManagedConnection.checkException(BaseWrapperManagedConnection.java:1187)
		at org.jboss.ironjacamar.jdbcadapters@1.4.22.Final//org.jboss.jca.adapters.jdbc.local.LocalManagedConnection.commit(LocalManagedConnection.java:107)
		at org.jboss.ironjacamar.impl@1.4.22.Final//org.jboss.jca.core.tx.jbossts.LocalXAResourceImpl.commit(LocalXAResourceImpl.java:172)
		... 126 more
	Caused by: org.postgresql.util.PSQLException: ERROR: restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh): "sql txn" meta={id=d273fd45 key=/Table/162/1/"46f364b4-0f74-45fa-bee1-f54cd8f6ec4c"/0 pri=0.00787410 epo=0 ts=1624280820.040931699,1 min=1624280819.597028110,0 seq=2} lock=true stat=PENDING rts=1624280819.597028110,0 wto=false gul=1624280820.097028110,0
  Hint: See: https://www.cockroachlabs.com/docs/v21.1/transaction-retry-error-reference.html#retry_serializable
		at org.postgresql@42.2.5//org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2440)
		at org.postgresql@42.2.5//org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2183)
		at org.postgresql@42.2.5//org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:308)
		at org.postgresql@42.2.5//org.postgresql.jdbc.PgConnection.executeTransactionCommand(PgConnection.java:755)
		at org.postgresql@42.2.5//org.postgresql.jdbc.PgConnection.commit(PgConnection.java:777)
		at org.jboss.ironjacamar.jdbcadapters@1.4.22.Final//org.jboss.jca.adapters.jdbc.local.LocalManagedConnection.commit(LocalManagedConnection.java:96)
		... 127 more

2021-06-21 15:07:00,121 ERROR [org.jboss.as.ejb3.timer] (EJB default - 2) WFLYEJB0020: Error invoking timeout for timer: [id=73e67e1c-750f-42b8-9db2-b1bae3b23c31 timedObjectId=office-alexander-logistics.office-alexander-logistics.AsyncEventScheduler auto-timer?:false persistent?:false timerService=org.jboss.as.ejb3.timerservice.TimerServiceImpl@382055fc previousRun=Mon Jun 21 06:50:04 CEST 2021 initialExpiration=Sun Jun 20 17:30:04 CEST 2021 intervalDuration(in milli sec)=5000 nextExpiration=Mon Jun 21 15:07:04 CEST 2021 timerState=IN_TIMEOUT info=Imixs-Workflow AsyncEventScheduler]: javax.ejb.EJBTransactionRolledbackException: javax.transaction.RollbackException: ARJUNA016053: Could not commit transaction.

I already took a look at the recommendations on the page Transaction Retry Error Reference | CockroachDB Docs.

But as I am running a Java EE/JPA application, I fear that I do not have much chance to hack into the transaction handling of JPA. And I also do not really want to do this.

Can anybody share experiences with Java EE/JPA applications? Is it a problem to run them with CockroachDB?

I still hope I can tweak something in the EJB/transaction module or in the PostgreSQL JDBC driver settings.

I would be grateful for any kind of hint.

===
Ralph

Hi @Ralph,

The transaction retry protocol isn't specific to CockroachDB. You get the same thing if you put Postgres into serializable transaction isolation mode. You can find more documentation about it and the possible anomalies for Postgres here: PostgreSQL: Documentation: 13: 13.2. Transaction Isolation. You could try to find a solution for Postgres first; I think this situation is not so uncommon.

For Hibernate you can look at this example: Build a Java App with CockroachDB and Hibernate | CockroachDB Docs. But we switched from Java to TypeScript, so I can't assist you further.

Cheers,
Anton

Thanks for your reply!
My applications typically run on PostgreSQL/Oracle without any problems. I am running a Java EE app with a lot of transactions, all covered by EJB/JPA. I don't want to touch this code.
I just migrated my existing PostgreSQL data into CockroachDB and hoped that I could run the same application with the latest PostgreSQL JDBC driver against CockroachDB.
In simple situations this worked fine. But in some more complex transactions - in my case taking roughly one second with reads and writes - I run into this error.

To me this means that at the moment CockroachDB is not comparable to PostgreSQL concerning transaction handling. It does not seem to be easy to switch to CockroachDB with complex EJB/JPA transactions.

I found this page from the Camunda project dealing with the same problems: CockroachDB Database Configuration | docs.camunda.org

The Camunda team changed their implementation specifically for CockroachDB, which is not an option for me. My code is based on Imixs-Workflow, which is a Java EE workflow engine.

So my question is: is there any chance to solve this problem through configuration? I have already played around with the cluster setting kv.closed_timestamp.target_duration, which I increased to 30s, but this seems to have no effect.

Looking at the error type (RETRY_SERIALIZABLE), I suspect that your application works fine on Postgres because Postgres by default uses a lower level of transaction isolation than CockroachDB. At an extremely high level, what I think is happening is that multiple concurrent transactions are conflicting with each other. Postgres will allow these conflicting transactions to go through, whereas Cockroach requires you to retry one transaction to fix the conflict.

CockroachDB intentionally cannot be configured to use lower levels of transaction isolation because we believe it is too easy for this to cause incorrect behavior in application code. If you want to go deeper into this, check out our blog post, Real Transactions are Serializable.

Is there a way to configure Java EE/JPA to automatically retry transactions that fail with serializability errors? Is this blog post relevant?

Thanks for your response. Interesting point!

At the core I am using EclipseLink JPA. And I think the default behaviour of the JPA implementation is dirty reads, which means that uncommitted reads are possible within a transaction. This is what I expect in our workflow engine. See also here.

But the JPA/EclipseLink layer should not affect the database layer, or am I wrong? So I do not think it makes sense to play with the JPA isolation level here.

But maybe the interesting part is what happens in my application. In the concrete situation, what I can see from the log files is something like this:

  • EJB-A starts a transaction, updates an existing dataset, and inserts a new one. The transaction is closed when the EJB method terminates (stateless session EJB).

  • Within the next 100-300ms a different thread (an EJB TimerService triggered every second) starts EJB-B with a transaction, reads the newly generated data, and updates it. The transaction is closed when the EJB method terminates. (A rough sketch of this pattern follows below.)
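
A rough sketch of what I mean, assuming two stateless session beans with container-managed transactions (the names EjbA, EjbB and Dataset are invented for illustration; this is not the actual Imixs-Workflow code):

import java.util.List;
import javax.ejb.Schedule;
import javax.ejb.Stateless;
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.Id;
import javax.persistence.PersistenceContext;

// Illustrative entity standing in for our workflow data.
@Entity
class Dataset {
    @Id Long id;
    String status;
}

@Stateless
class EjbA {
    @PersistenceContext EntityManager em;

    // Container-managed transaction begins when the method is invoked.
    public void process(Long existingId, Dataset fresh) {
        em.find(Dataset.class, existingId).status = "UPDATED"; // update existing row
        em.persist(fresh);                                     // insert a new row
    } // transaction commits when the method returns
}

@Stateless
class EjbB {
    @PersistenceContext EntityManager em;

    // Fired every second by the EJB TimerService, in a separate
    // transaction on a different pooled JDBC connection.
    @Schedule(hour = "*", minute = "*", second = "*", persistent = false)
    public void onTimer() {
        List<Dataset> open = em.createQuery(
                "SELECT d FROM Dataset d WHERE d.status = 'NEW'", Dataset.class)
            .getResultList();
        open.forEach(d -> d.status = "SCHEDULED"); // may overlap with EjbA's commit
    } // transaction commits when the method returns
}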

To me, none of this sounds very unusual.

But what I also observed today is that the problem does not occur every time in this situation. It occurs only sometimes.
Can it be that the issue is related to the fact that the second thread (on a different JDBC connection from the connection pool) starts updating the dataset after a very short delay, maybe within the next 100ms? Or is this nonsense in your eyes?

We have been developing the Imixs workflow engine for several years and I think its behaviour is very solid and proven. This is why I was so surprised today when I saw the issue in production for the first time.

===
Ralph

I am sorry to say that this Enterprise Java is way over my head and I’m not able to provide as much advice as I’d like to.

But the JPA/EclipseLink layer should not affect the database layer, or am I wrong? So I do not think it makes sense to play with the JPA isolation level here.

That sounds right to me; I don’t think it should be possible to avoid these errors by changing the isolation level.

Can it be that the issue is related to the fact that the second thread (on a different JDBC connection from the connection pool) starts updating the dataset after a very short delay, maybe within the next 100ms? Or is this nonsense in your eyes?

A transaction serializability error should occur only when there are two transactions that are happening at the same time, and one transaction writes data that would affect the output of the other transaction. So, it might only occur sometimes based on the content or timing of your two transactions. I’m not completely sure that I understand your comment, but it shouldn’t be possible to cause this error unless you have multiple transactions happening at the same time.

We have been developing the Imixs workflow engine for several years and I think its behaviour is very solid and proven. This is why I was so surprised today when I saw the issue in production for the first time.

I think you will also see this behavior in Postgres if you set the transaction isolation level to serializable (ALTER DATABASE <db name> SET DEFAULT_TRANSACTION_ISOLATION TO 'serializable';).

I see a few main ways to resolve this problem:

  1. Configure your application to retry any transaction that fails with a serializable error. I could provide an example of how to do this in Java (see the sketch after this list), but I don't know if that will apply to Java EE.
  2. Ensure that you don’t ever run two possibly-conflicting transactions at the same time. Maybe this could be done by adjusting how your timer and threads work.
  3. Restructure your database schema to avoid possible conflicts. This would probably be too much work.
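
Here is a minimal sketch of option 1 in plain Java/JDBC (the helper names runWithRetry and Work are invented, not from any framework). CockroachDB reports retryable transaction failures with SQLState 40001, so the loop retries only on that code:

import java.sql.Connection;
import java.sql.SQLException;

public final class RetryHelper {

    @FunctionalInterface
    public interface Work {
        void run(Connection conn) throws SQLException;
    }

    // Run the unit of work in a transaction, retrying on serialization
    // failures (SQLState 40001) with simple exponential backoff.
    public static void runWithRetry(Connection conn, Work work, int maxRetries)
            throws SQLException {
        for (int attempt = 1; ; attempt++) {
            try {
                conn.setAutoCommit(false);
                work.run(conn);
                conn.commit();
                return;
            } catch (SQLException e) {
                conn.rollback();
                // Anything other than 40001 is a real error; so is
                // exhausting the retry budget.
                if (!"40001".equals(e.getSQLState()) || attempt >= maxRetries) {
                    throw e;
                }
                try {
                    Thread.sleep(100L << attempt); // 200ms, 400ms, 800ms, ...
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }
}

The catch in your stack is that the EJB container, not the application, owns the transaction boundary, so I believe a loop like this would have to run outside of the container-managed transaction.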

Thanks again for your reply. I understand your arguments, and I have now also read more about how to run CockroachDB with JPA, as explained here.
And yes, I think that in my concrete scenario I have exactly this problem of overlapping transactions. And it looks like I have a timing problem here. I agree that it can be an option to control this inside the application. I have to think about the whole situation again.

But I have now noticed a second point which really surprised me. It is about reading large data columns nearly in parallel with inserting/updating the same data. In the application which I switched from PostgreSQL to CockroachDB, binary data of documents (0.5 to 2MB in size) is stored in a blob column of a table. And there are many parallel threads (10-30) accessing the data.

What I observed is the following:

  • Thread-A inserts some big data (1MB) into the table and commits
  • Thread-B starts reading this data after a very short delay (again below 100ms)

It looks like Thread-B is not receiving all the data, as if the data is still being written by the commit of Thread-A?

Can this be possible? Maybe if Thread-A and Thread-B are connected to different nodes in my 3-node cluster? (A rough sketch of the access pattern I mean follows below.)
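
To make the scenario concrete, a rough sketch of the access pattern I mean (table and column names are invented; in reality the reads and writes happen on different pooled connections in different threads):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class BlobReadAfterWrite {
    public static void main(String[] args) throws Exception {
        byte[] payload = new byte[1_000_000]; // ~1MB of document data

        try (Connection writer = DriverManager.getConnection(args[0]);
             Connection reader = DriverManager.getConnection(args[0])) {

            // Thread-A: insert the blob and commit.
            writer.setAutoCommit(false);
            try (PreparedStatement ps = writer.prepareStatement(
                    "INSERT INTO documents (id, data) VALUES (1, ?)")) {
                ps.setBytes(1, payload);
                ps.executeUpdate();
            }
            writer.commit();

            // Thread-B: read the blob back within a few milliseconds.
            try (PreparedStatement ps = reader.prepareStatement(
                    "SELECT data FROM documents WHERE id = 1");
                 ResultSet rs = ps.executeQuery()) {
                rs.next();
                byte[] readBack = rs.getBytes(1);
                System.out.println("wrote " + payload.length
                        + " bytes, read back " + readBack.length);
            }
        }
    }
}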

I have now also read more about how to run CockroachDB with JPA, as explained here.

Ah great! I’m sure that you saw that that example includes retrying. Thanks for digging that up; now I can use that to help others in the future. :wink:

I believe it is possible for statements to be processed in an order different from real-world time because CockroachDB does not guarantee linearizability. However, Thread-B should never read only part of the blob. For example, if you're writing a new row hello world to your table, it's okay for Thread-B either not to see the new row or to see hello world, but it should never see something like hello w.

Sorry, this was incorrect. If thread B begins a transaction after thread A commits, then thread B will see the data that thread A wrote.

To be honest, I lost a little bit of trust in CockroachDB when I followed the JPA example here. JPA is a way to abstract the database access away from the developer in an object-oriented view. JPA code is - especially in a trivial example like this bank account demo - very lean and simple. As a developer you should not have to deal with the database layer behind it in your code. That is the idea of JPA.

In a business application based on JPA you usually have a lot of code creating, updating or changing the relationships between objects. From my personal point of view, it is not recommended to pollute all this code with a database-vendor-specific implementation. This makes the code no longer interoperable.

From the marketing behind the Cockroach project it seems that, because Cockroach supports the Postgres wire protocol, you should not be forced to change your implementation if you switch from PostgreSQL to CockroachDB. But for JPA I think this is not true.

As it seems to be necessary to change the way you access data stored in CockroachDB, I suggest that the Cockroach project implement its own JPA OR mapper. It should be the task of the OR mapper to deal with the different handling of isolation in ACID transactions.

Then, with such a Cockroach OR mapper, a Java project could switch from PostgreSQL or other SQL databases to CockroachDB by replacing the OR mapper. Note: the OR mapper is independent from the JPA code, like the JDBC driver.
Only in this way can the developer keep the code clean.

Hi @Ralph. I thought I'd weigh in on a few of your comments…

From my personal point of view, it is not recommended to pollute all this code with a database-vendor-specific implementation. This makes the code no longer interoperable.

Nothing suggested above, or that I would suggest to you now, is vendor specific. Any database that runs with SERIALIZABLE isolation will recommend that your application contain retry logic. For example, Postgres calls this out quite clearly (PostgreSQL: Documentation: 9.5: Transaction Isolation) in section "13.2.3. Serializable Isolation Level": "applications using this level must be prepared to retry transactions due to serialization failures". As stated above, we've made a decision to only support SERIALIZABLE isolation for many reasons, and as such applications using CockroachDB need to (or at least should consider) implementing retry logic. With modern, lightweight frameworks like Spring and AOP this is quite simple to do and requires very little code modification. Unfortunately EJBs and your particular stack make this a bit more challenging, but not impossible. Again, if you were to set your current Postgres instance to SERIALIZABLE isolation you would run into the same issue.
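
As a rough illustration of the Spring approach (assuming the spring-retry library and @EnableRetry on a configuration class; the service and method names are invented), the retry concern can be declared as an annotation so the business code stays untouched:

import org.springframework.dao.ConcurrencyFailureException;
import org.springframework.retry.annotation.Backoff;
import org.springframework.retry.annotation.Retryable;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class TransferService {

    // Spring's exception translation surfaces SQLState 40001 as a
    // ConcurrencyFailureException subclass; the retry aspect then
    // re-runs the whole method in a fresh transaction.
    @Retryable(value = ConcurrencyFailureException.class,
               maxAttempts = 3,
               backoff = @Backoff(delay = 100, multiplier = 2))
    @Transactional
    public void transfer(long fromId, long toId, long amount) {
        // ... ordinary JPA/JDBC work
    }
}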

From the marketing behind the Cockroach project it seems that, because Cockroach supports the Postgres wire protocol, you should not be forced to change your implementation if you switch from PostgreSQL to CockroachDB. But for JPA I think this is not true.

Again, your implementation should change not because you are moving to CockroachDB but because you are moving from a weak isolation level to the strongest isolation level, SERIALIZABLE. These changes would also be required if you made this change and remained with Postgres. In other words, throwing this error is part of our adherence to the Postgres wire protocol, not a violation of it.

As it seems to be necessary to change the way you access data stored in CockroachDB, I suggest that the Cockroach project implement its own JPA OR mapper. It should be the task of the OR mapper to deal with the different handling of isolation in ACID transactions.

I disagree with this assertion. We fully support most modern ORMs in Java (Hibernate and jOOQ). I don't think the purpose of ORMs is to handle isolation-driven application behaviors, but rather to properly map objects to queries, etc. These decisions are almost always left to the application developer, not the ORM framework or the JPA spec.

Then, with such a Cockroach OR mapper, a Java project could switch from PostgreSQL or other SQL databases to CockroachDB by replacing the OR mapper. Note: the OR mapper is independent from the JPA code, like the JDBC driver.
Only in this way can the developer keep the code clean.

Again, I disagree. ORMs change behavior by switching dialects. A single ORM, like Hibernate, supports dozens of different databases, including Cockroach, by simply changing a single property. You are correct that JPA as a specification is different from its implementation (Hibernate is perhaps the most common implementation of the JPA spec, but there are others), and any well-written application, should it desire to be portable, should be written to the spec, not the implementation. The trick here is that the JPA specification does not attempt to solve for retry logic (it is not in the spec or any implementation). It is therefore understood and assumed that this should be handled by the application layer, not by JPA/EJB/ORM, etc.
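
To make that concrete in Java EE terms, here is a rough, untested sketch of what application-layer retry could look like with bean-managed transactions, so that the retry loop sits outside the transaction boundary (all names are invented; this is not a recommendation of a specific Imixs change):

import java.sql.SQLException;
import javax.annotation.Resource;
import javax.ejb.Stateless;
import javax.ejb.TransactionManagement;
import javax.ejb.TransactionManagementType;
import javax.transaction.UserTransaction;

@Stateless
@TransactionManagement(TransactionManagementType.BEAN)
public class RetryingFacade {

    @Resource
    private UserTransaction utx;

    // Retry the unit of work up to three times on serialization failures.
    public void runWithRetry(Runnable work) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                utx.begin();
                work.run();
                utx.commit();
                return;
            } catch (Exception e) {
                try { utx.rollback(); } catch (Exception ignore) { }
                if (attempt >= 3 || !isSerializationFailure(e)) {
                    throw e;
                }
                Thread.sleep(100L * attempt); // simple backoff
            }
        }
    }

    // Look for SQLState 40001 in the cause chain and in suppressed
    // exceptions (WildFly nests the PSQLException, as in the trace above).
    private boolean isSerializationFailure(Throwable t) {
        if (t == null) return false;
        if (t instanceof SQLException
                && "40001".equals(((SQLException) t).getSQLState())) {
            return true;
        }
        for (Throwable s : t.getSuppressed()) {
            if (isSerializationFailure(s)) return true;
        }
        return isSerializationFailure(t.getCause());
    }
}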

Having said all of this, I'd love to see you get Imixs working with CockroachDB. We've previously worked closely with the team at Flowable (another BPM product) to support CockroachDB. It is certainly possible to get Imixs there too, but to do so you have to be willing to make application changes… not because they are vendor specific, but because real transactions are SERIALIZABLE, SERIALIZABLE requires retry, and that's the only isolation level we support.

Thank you for your reply. I'm assuming that your knowledge of database development is much deeper than mine, so I will not be able to have a hard discussion with you.

I have already switched one of our customer projects to Cockroach and we see no problems with Imixs-Workflow there.
But the second project was a little bit horrible. The RETRY_SERIALIZABLE issue was not the worst part: users only saw an error message and repeated the process (thanks to the transaction rollback).

But what scared me, and forced me to immediately migrate back to PostgreSQL, was the fact that there were several situations where data was not completely read. In the concrete case it was a blob column with a 0.5-1 MB data block which was not transferred correctly. And I did not recognize this because no error was thrown. The data blocks represented PDF documents, and the documents were corrupted afterwards.
It is certainly my mistake not to have tested these scenarios in more detail. But the whole thing was no longer fun.

I have written blog posts about Cockroach and have also added the database to the Imixs-Cloud project: GitHub - imixs/imixs-cloud: A Lightweight Kubernetes Environment
But for now I cannot recommend using Imixs-Workflow with CockroachDB. And I wouldn't say that this is due to a faulty implementation of our workflow engine. As you said, it's because of a transaction isolation level that is not supported by our project. I can accept this. But on the other hand, I have read many books about JPA, and I was not aware that all the examples are only valid for a 'weak' isolation level of the underlying database.

Just one more question about the isolation level SERIALIZABLE:
Does it mean that a select query will lock the full table? If I am understanding this topic right, this means that if a Thread-A reads data from the table, no other thread can update or modify the data. Wouldn't that hurt performance if a lot of parallel transactions and read accesses take place?

I’d really like to drill into this missing data issue. Do you by chance have more details you could share? Some test harness which recreates this?

No, I am sorry, I currently have no system to reproduce the issue and generate more data. If I see it again I will provide more information.