Performance Issues with JDBC batch inserts

Hi there,

I am seeing issues where some batched statements are not getting rewritten and executed as a batch.

  • The queries that are not getting rewritten are UPSERT queries and INSERT INTO … ON CONFLICT … DO NOTHING queries.

Example queries:


If you look at the screenshot, the mean rows-affected value is 1, when in fact the test is inserting ~1024 rows per run. (The high execution count is across multiple runs, but if batching were working correctly, I would have expected that count to be much lower.)

Here’s another example:

Full Query from the diagnostics page:
INSERT INTO profiles.profile_versions(profile_id, profileoperation_id, "timestamp") VALUES ($1, $2, __more1__) ON CONFLICT (profile_id, "timestamp") DO NOTHING RETURNING NOTHING

The queries use prepared statements:
PreparedStatement ps = db.prepareStatement(queryString);
// ... set parameters and call ps.addBatch() for each row
ps.executeBatch();

I am using Postgres JDBC driver version 42.2.9, as per the recommendation in the docs, and I am setting
reWriteBatchedInserts to true in the connection string.
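
For reference, the setup looks roughly like this (host, database, and credentials are placeholders; reWriteBatchedInserts is the relevant part):

// java.sql.Connection / java.sql.DriverManager
String url = "jdbc:postgresql://localhost:26257/profiles?reWriteBatchedInserts=true";
Connection db = DriverManager.getConnection(url, "appuser", "secret");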

I can confirm that this works for simple insert queries like:
INSERT INTO profiles.profile_updates(profileoperation_id, profile_id, "timestamp", action, details) VALUES ($1, $2, __more3__), (__more100__)

I ended up just creating multi-row upserts and inserts for some of these cases to get around this for now. Is there some setting I am missing, or is this expected behavior for these queries?

Table structures for some of the tables I mentioned in the queries above:

CREATE TABLE public.profile_versions (
  profile_id INT8 NOT NULL,
  profileoperation_id INT8 NOT NULL,
  "timestamp" INT8 NOT NULL,
  CONSTRAINT "primary" PRIMARY KEY (profile_id ASC, "timestamp" ASC),
  INDEX profileoperation_id (profileoperation_id ASC),
  INDEX "timestamp" ("timestamp" ASC),
  FAMILY "primary" (profile_id, profileoperation_id, "timestamp")
)

CREATE TABLE public.profile_versions_latest (
  profile_id INT8 NOT NULL,
  profileoperation_id INT8 NOT NULL,
  "timestamp" INT8 NOT NULL,
  CONSTRAINT "primary" PRIMARY KEY (profile_id ASC),
  FAMILY "primary" (profile_id, profileoperation_id, "timestamp")
)

Thanks,
Vinay.

Hi Vinay!

The RETURNING clause prevents pgjdbc from being able to use batched inserts. (See here in the pgjdbc source code.) I’m pretty sure this logic also prevents UPSERT from being batched, since that wouldn’t have a statement type of INSERT.
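
To illustrate with a simplified example (table t is hypothetical), the rewrite only kicks in for plain INSERTs:

// Eligible for rewriting into one multi-row statement:
String rewritable = "INSERT INTO t (a, b) VALUES (?, ?)";
// Not eligible: the RETURNING clause disqualifies it (and an UPSERT wouldn't parse as an INSERT at all):
String notRewritable = "INSERT INTO t (a, b) VALUES (?, ?) ON CONFLICT (a) DO NOTHING RETURNING NOTHING";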

Thanks for the prompt response. That makes sense!

Is manually creating multi-row upsert and insert statements the preferred approach in this case?
What batch size would you recommend for these multi-row upsert/insert statements, if any?

Thanks,
Vinay.

Yeah, I think you’d have to make these batches manually.

We have some guidance about choosing a batch size here: Build a Java App with CockroachDB and JDBC | CockroachDB Docs

Ultimately it depends on your workload, though, so I’d recommend testing with different powers of two. (But if you are making the batches manually, then I don’t think there’s any particular reason for it to be a power of two.)
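
A rough sketch of building one of these manually (Row here is a hypothetical holder for your values; db, rows, and batchSize are placeholders):

// Build a single multi-row INSERT ... ON CONFLICT DO NOTHING with batchSize value tuples.
StringBuilder sql = new StringBuilder(
    "INSERT INTO profiles.profile_versions (profile_id, profileoperation_id, \"timestamp\") VALUES ");
for (int i = 0; i < batchSize; i++) {
    sql.append(i == 0 ? "(?, ?, ?)" : ", (?, ?, ?)");
}
sql.append(" ON CONFLICT (profile_id, \"timestamp\") DO NOTHING");

try (PreparedStatement ps = db.prepareStatement(sql.toString())) {
    int p = 1;
    for (Row row : rows) { // rows must contain exactly batchSize entries
        ps.setLong(p++, row.profileId);
        ps.setLong(p++, row.profileOperationId);
        ps.setLong(p++, row.timestamp);
    }
    ps.executeUpdate();
}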

So, I tried batch sizes of 500 and 1000 and got almost identical latency numbers.
I also updated the table to use hash-based sharding.
Here’s the query:
INSERT INTO profiles.profile_versions(profile_id, profileoperation_id, "timestamp") VALUES ($1, $2, __more1__), (__more900__) ON CONFLICT (profile_id, "timestamp") DO NOTHING RETURNING NOTHING
Here’s the updated table schema:

CREATE TABLE public.profile_versions (
  profile_id INT8 NOT NULL,
  profileoperation_id INT8 NOT NULL,
  "timestamp" INT8 NOT NULL,
  CONSTRAINT "primary" PRIMARY KEY (profile_id ASC, "timestamp" ASC) USING HASH WITH BUCKET_COUNT = 8,
  UNIQUE INDEX profile_versions_profile_id_timestamp_key (profile_id ASC, "timestamp" ASC),
  INDEX profileoperation_id (profileoperation_id ASC),
  INDEX "timestamp" ("timestamp" ASC),
  FAMILY "primary" (profile_id, profileoperation_id, "timestamp", crdb_internal_profile_id_timestamp_shard_4, crdb_internal_profile_id_timestamp_shard_8)
)

I am seeing latencies to the tune of 12-13 secs for this query.
I have pasted below the logical plan for this query from the diagnostics page:

• Insert
      arbiter indexes = profile_versions_profile_id_timestamp_key
      into = profile_versions(profile_id, profileoperation_id, timestamp, crdb_internal_profile_id_timestamp_shard_4, crdb_internal_profile_id_timestamp_shard_8)
  • Render
    • Filter
          filter = profile_id IS NULL
      • Lookup join (left outer)
            table = profile_versions@profile_versions_profile_id_timestamp_key
            equality = (column1, column3) = (profile_id, timestamp)
            equality cols are key
        • Render
          • Values
                size = 3 columns, 969 rows

Execution Stats:


Is there anything more I could do to optimize this query?

I think the issue is likely related to this: INDEX "timestamp" ("timestamp" ASC)

We recommend against indexing sequential values, since that creates a hotspot: all of the writes in a batch land next to each other in the index, so the load gets directed at a single range.

Is that timestamp index necessary?
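
If it is, one option would be to make that index hash-sharded as well, so sequential timestamps are spread across ranges. A sketch (the index name is arbitrary):

CREATE INDEX timestamp_sharded_idx ON profiles.profile_versions ("timestamp") USING HASH WITH BUCKET_COUNT = 8;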

So I followed the guidelines, which said we should index everything used in a WHERE clause.
It's not relevant to the insert, but it is used in one of the fetch queries:

SELECT …
FROM profiles.profiles p
JOIN (
  SELECT pv.profile_id, MAX(pv.timestamp) AS timestamp
  FROM profiles.profile_versions pv
  WHERE pv.profile_id IN (…) AND pv.timestamp <= [VAL]
  GROUP BY pv.profile_id
) max ON max.profile_id = p.id
JOIN profiles.profile_versions pv ON pv.profile_id = max.profile_id
  AND pv.timestamp = max.timestamp
JOIN profiles.profile_operations po ON po.id = pv.profileoperation_id

Should I consider using a hash-sharded index for this secondary one too?

WHERE pv.profile_id IN (…) AND pv.timestamp <= [VAL]

Since the WHERE clause filters by profile_id AND timestamp, this query would be satisfied by an index over (profile_id, timestamp). You can see an example in the docs on multi-column keys. Those docs are for primary keys, but you can also create a multi-column secondary index.
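
Something like this, as a sketch (the index name is arbitrary; note that the unique index added by your hash-sharded primary key change already has this shape):

CREATE INDEX profile_id_timestamp_idx ON profiles.profile_versions (profile_id ASC, "timestamp" ASC);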

Depending on your other queries, you could also make the primary key (profile_id, timestamp, profileoperation_id); then the primary index could be used to satisfy the query you shared.
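
If you go that route, the change would look roughly like this (note that changing the primary key rewrites the table):

ALTER TABLE profiles.profile_versions ALTER PRIMARY KEY USING COLUMNS (profile_id, "timestamp", profileoperation_id);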