Deleting while inserting


We’re hitting a big performance issue when trying to delete old entries while inserting newer ones.
In some cases we’ve seen absurd slowdowns, with a batch of 100 new entries taking up to 35s to complete the insert, about 200-500x slower than normal.
This issue is more pronounced with larger tables with more indices, but it also impacts small tables.

Does anyone have any advice on running deletes and inserts concurrently? We only delete old entries, so there shouldn’t be conflicting queries.

Our workflow requires inserting a lot of (potentially useless) data, running jobs to mark the useful ones and delete the rest.

Is there a way to do this without affecting insert speed?

Can anyone share their DELETE performance numbers?
I can delete about 100 rows in 300-600ms which feels a lot longer than what I can achieve with inserts. The main issue is I can insert in parallel without worrying about contention, but multiple DELETE ... WHERE ... LIMIT ... will hit the same rows and deadlock very easily.
I’ll try first selecting the primary keys of the rows I want to delete and then deleting them with a DELETE ... WHERE pk IN (...) and I’ll be able to run those in parallel.
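A minimal sketch of that two-step approach (table and column names here are illustrative placeholders, not the real schema):

```sql
-- Step 1: collect the primary keys of rows eligible for deletion.
SELECT id
  FROM test
 WHERE pinned = false AND broadcasted = false AND "timestamp" <= '2019-01-01'
 LIMIT 100;

-- Step 2: delete strictly by primary key. Disjoint PK batches never
-- touch the same rows, so they can run in parallel without deadlocking.
DELETE FROM test WHERE id IN ('uuid-1', 'uuid-2' /* , ... */);
```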


@CMajeri can you share your DDL and the DELETE statement?

It sounds like you intuited that selecting to get the primary keys is best. If the data is old, you can speed this up further and avoid contention with inserts by running the select query AS OF SYSTEM TIME -1m
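A hedged sketch of such a stale read (the table name, predicate, and offset are examples):

```sql
-- Reading at a timestamp one minute in the past keeps this query out of
-- the way of in-flight inserts: it cannot conflict with current writes.
SELECT id
  FROM test AS OF SYSTEM TIME '-1m'
 WHERE pinned = false AND broadcasted = false AND "timestamp" <= '2019-01-01';
```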

Here’s my table:

	id UUID NOT NULL DEFAULT gen_random_uuid(),
	idx INT8 NULL,
	amount DECIMAL(38) NULL,
	"timestamp" TIMESTAMPTZ NULL,
	pinned BOOL NOT NULL,
	broadcasted BOOL NOT NULL,
	UNIQUE INDEX test_hash_idx_key (hash ASC, idx ASC),
	INDEX test_from_timestamp_idx (from ASC, "timestamp" ASC),
	INDEX test_to_timestamp_idx (to ASC, "timestamp" ASC),
	INDEX test_pinned_broadcasted_timestamp_idx (pinned ASC, broadcasted ASC, "timestamp" ASC),
	INDEX test_from_pinned_idx (from ASC, pinned ASC),
	INDEX test_to_pinned_idx (to ASC, pinned ASC)

And my delete query:

DELETE FROM test WHERE pinned = false AND broadcasted = false AND created_at <= X AND timestamp <= Y LIMIT Z

The inserts are just INSERT ... ON CONFLICT DO NOTHING with timestamp and created_at close to time.Now()

Basically I insert a lot of data, but am only interested in those that get pinned/broadcasted, which happens after the insert. I then periodically try to GC everything, which is where things go south.
I’ve collected metrics showing huge spikes in insert latency while concurrent deletes are running, so I tried inserting only after the GC was done, but the deletes take an order of magnitude longer than the inserts and I can’t afford this kind of stop-the-world GC…

My other idea is to simply shard my data so that I organize it in 3 tables:

  • permanent
  • current
  • previous

with writes only going to current, pinning/broadcasting exporting data from previous/current into permanent, and periodically truncating previous and logically swapping previous and current. But that’s a lot of potentially unnecessary work.
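The periodic rotation could be sketched roughly like this, assuming the three tables share a schema (statement details are illustrative):

```sql
-- Keep the rows that matter, then drop everything else wholesale.
INSERT INTO permanent SELECT * FROM previous WHERE pinned OR broadcasted;
TRUNCATE previous;

-- Logically swap previous and current via renames.
ALTER TABLE previous RENAME TO swap_tmp;
ALTER TABLE current RENAME TO previous;
ALTER TABLE swap_tmp RENAME TO current;
```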

If you have any ideas to make this kind of workflow easier, I’ll gladly take them.
I still have to try explicitly selecting something like 100k rows, and then running deletes in parallel batches that specifically target PKs instead of potentially conflicting sets. I’ll come back with results when I do.

Can you elaborate on the AS OF SYSTEM TIME? Can I use it with inserts, or is it only for selects? I don’t select all that much so I didn’t really look into it.

I now realize you mean selecting the rows to delete with an as of system time select, and then run my delete on that! I’ll try it, it sounds faster!

Yeah, let me know how your test goes - running the select AOST should avoid any conflicts with the writes, and the delete should then be fast since it only uses the primary key. The issue with the current delete is the created_at filter: it causes a full scan of the primary key (i.e. a full table scan):

               tree              | field  |                  description
  count                          |        |
   └── delete                    |        |
        │                        | from   | del_test
        └── limit                |        |
             │                   | count  | 1
             └── render          |        |
                  └── index-join |        |
                       ├── scan  |        |
                       │         | table  | del_test@test_pinned_broadcasted_timestamp_idx
                       │         | spans  | /0/0/!NULL-/0/0/2019-01-01T00:00:00.000000001Z
                       └── scan  |        |
                                 | table  | del_test@primary
                                 | filter | created_at <= '2019-01-01 00:00:00+00:00'

So initially, the second scan didn’t worry me, since I figured it was only applied after the first scan on the first index.
The vast majority of records with timestamp < X also have created_at < X (which covers most of my requests). A similar select query is lightning fast.

I guess it would help to at least have the index cover created_at
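One way to do that in CockroachDB is to add created_at as a stored column of the index the plan is using, so the filter can be answered without the join back to the primary index (the index name here is illustrative):

```sql
-- STORING keeps created_at in the index payload, so the created_at
-- filter no longer forces a scan of the primary index.
CREATE INDEX test_pbt_created_idx
    ON test (pinned, broadcasted, "timestamp")
    STORING (created_at);
```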

Oh my god it’s insane how fast it is.
I can delete batches of 100 rows, 20 in parallel.
The select takes about 300ms to return 50k entries, and then my deletes are just lightning fast.
20 batches of 100 rows take a few ms.
Definitely a MASSIVE improvement over what I had before (a delete of 1k took multiple seconds).

So… definitely thanks!
I’ll improve my indices to maybe make the select faster, although at that point the issue is mostly fixed.
The as of system time also helps immensely, it’s really neat to simply stop caring about inserts while reading.

Really happy about this!
I’ll tune my batch sizes and all, but that’s an improvement of more than 100x where I was looking for maybe 10x to keep up with the GC.

Icing on the cake, I don’t have to do the sharded setup with truncate table.

Just another performance question:
smaller batches seem to perform much better than larger ones. Is this because a smaller batch modifies a single shard rather than several, and so can be executed entirely by the leaseholder?
If so, does it apply to other types of queries?


You can use the SQL tracing facility to determine how many ranges/shards are touched by a single delete. You can also use it to investigate the difference between different sizes of deletes. What you say is a likely explanation but only a trace would tell for sure.
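In CockroachDB that tracing looks roughly like this (the DELETE is a placeholder for whichever statement you want to inspect):

```sql
-- Record a trace for the session, run the statement, then inspect it.
SET tracing = on;
DELETE FROM test WHERE id IN ('uuid-1', 'uuid-2');
SET tracing = off;
SELECT * FROM [SHOW TRACE FOR SESSION];
```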

CockroachDB tries very hard to parallelize the work on multiple ranges for SELECT and other bulk I/O operations. SQL mutations like DELETE are not yet as optimized, which is why you found them more sensitive to fine-grained tuning.

So yes there are general concepts that apply similarly, but not necessarily this one (however, without knowing exactly what has the most influence in your case, it is hard to say for sure. Again, traces would help.)

That’s great to hear! Let us know if you need any further assistance.