Is there a maximum number of rows that I can insert at once using an INSERT ... VALUES statement?

The `cockroach dump` command creates INSERT statements with 100 rows each, but that is too slow not only for inserting millions of rows but even for inserting tens of thousands of rows.

So I merged everything into one statement using a regex. That works for tens of thousands of rows, but it failed on millions of rows; I think the connection was aborted by CockroachDB.

It seems PostgreSQL has a limit on query size (https://dba.stackexchange.com/questions/131399/is-there-a-maximum-length-constraint-for-a-postgres-query), but I didn’t get a detailed error message like that.

Is there any limitation?

The failed query file is 892 MB.
I used the psql Windows executable client (psql.exe -f, v9.6.3) and the Windows binary of the latest CockroachDB, v1.1.2, on Windows 10, with plenty of disk and memory space.

Our current implementation limits the size of a transaction to:

A) 100,000 write intents, in your case one per column family being inserted. If your schema has all the tables in one column family, this corresponds to a maximum of 100,000 rows; if you have 2 column families, that’s a maximum of 50,000 rows, etc.

B) whatever fits in the SQL memory budget, which you configure with --max-sql-memory (the default is 128MB). To stay under both limits, split a large load into smaller batches, as in the sketch below.
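For example, rather than concatenating everything into one giant statement, a loader can issue the rows in modest multi-row batches, each in its own transaction. The table and values here are made up purely for illustration:

```sql
-- Hypothetical sketch: load rows in modest batches, each in its own
-- transaction, instead of one giant statement. Table/columns are made up.
BEGIN;
INSERT INTO measurements (id, device_id, value) VALUES
  (1, 'dev-a', 0.1),
  (2, 'dev-b', 0.2);
  -- ... a few thousand rows per batch keeps each transaction
  -- well under the write-intent and memory limits ...
COMMIT;
-- Then repeat with the next batch in a new transaction.
```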

That being said, if you need to ingest a large amount of data from another database, INSERT statements are not the optimal way to go. You’d get 20x the performance (or more) by ingesting from CSV using our (experimental) IMPORT CSV statement.
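Since the statement is still experimental in 1.1, the exact options are version-specific, but the general shape is roughly the following (the table definition, file URL, and temp location are placeholders, not working paths; check the IMPORT documentation for your version):

```sql
-- Rough sketch of the experimental CSV import; the table definition,
-- CSV URL, and temp storage location below are placeholders only.
IMPORT TABLE measurements (
    id        INT PRIMARY KEY,
    device_id STRING,
    value     FLOAT
)
CSV DATA ('nodelocal:///measurements.csv')
WITH temp = 'nodelocal:///import-tmp';
```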


Thank you for your response 🙂

By the way, I successfully inserted 20,952 rows at once into a table that has one column family, as you can see in the screenshot:
"INSERT 0 20952"
That is more than 10,000 rows. Am I misunderstanding?

Yes, I made a mistake, my apologies: the maximum is 100,000 write intents, not 10,000.

This is configurable, by the way, with the cluster setting kv.transaction.max_intents; see the documentation for SET CLUSTER SETTING. Note that increasing this number risks making the transaction commit much more slowly, and possibly not at all.
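For instance, something along these lines (the new value here is arbitrary, chosen only to illustrate the statement):

```sql
-- Example only: inspect the current intent limit, then raise it.
-- 200000 is an arbitrary illustrative value, not a recommendation.
SHOW CLUSTER SETTING kv.transaction.max_intents;
SET CLUSTER SETTING kv.transaction.max_intents = 200000;
```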


Hi Raphael ‘kena’ Poss,

I’m confused by your answer. If I understand correctly, for the 100K limit to apply, all TABLES in a given schema must be in the same column family. Did you mean that all columns in a given table must be in the same column family? Do UUIDs used for primary keys have to be unique across tables? If that is the case, are there plans (or configuration) to relax that constraint and require UUIDs to be unique only within a given column family/table primary key?

I’m currently working on a project that records data coming from devices, and does so in large batches for performance reasons. The plan is to gather five minutes’ worth of records and insert them in one batch. On top of that, devices share data with each other, and this is where I really need to insert many records at once, since I also need inserts into the many-to-many table between data records and device records.

My schema already has over 20 tables, and that number will grow as I implement more features. Does this mean that if my schema has 50 tables, each one being its own column family (given that no table has a one-to-one relation to another), I can only insert 2K records into my many-to-many table per second per node?

If that is the case, and there is a core technical reason that prevents UUIDs from being reused across tables (I just don’t know enough about this), then it would be a show stopper and force me to manually implement caching of big ints per inserting app node for primary keys. (I don’t think that big ints unique per node per timestamp, which, if I understand correctly, is what your bigserial implementation is, would give me enough range for my worst-case scenario.)

So, yet another question: is it possible to provide 64-bit integers that are guaranteed to be unique (because they are allocated in blocks via a fake sequence block-increment mechanism) as primary keys without breaking the database? Would that successfully take the place of the usual bigserial, or would a UUID be automatically inserted as the rowid in that case? If UUIDs were inserted, would they break anything (even if they are never used by the calling app)?

Thank you very much for your time and effort!

Do UUIDs used for primary keys have to be unique across tables?

No, primary key values for a table need only be unique within the table. You can use the same UUID as the primary key value across tables.

Are there plans to relax that constraint and require UUIDs to be unique only for a given column family/table PK?

Nothing about UUIDs inherently guarantees uniqueness (although they’re probabilistically unique if generated properly according to the standard). In CockroachDB, you can require any column(s) of any type (UUID, integer, string) to be unique by adding a UNIQUE constraint on the column or set of columns. The primary key has to be unique by default.
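For example (the table and column names here are made up):

```sql
-- Hypothetical example: the primary key is unique by definition, and a
-- UNIQUE constraint enforces uniqueness on any other column(s) you choose.
CREATE TABLE devices (
    id        UUID PRIMARY KEY,
    serial_no STRING UNIQUE,
    name      STRING
);
```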

Does this mean that if my schema has 50 tables, each one being its own column family (given that no table has one-to-one relation to another one), then I can only insert 2K records into my many-to-many table per second per node?

If you’re doing all of these INSERTs/UPDATEs in one transaction, then you can write/update at most 100k column family rows (every table has at least one column family, for the primary key; each index is an additional column family; explicit column families count as well). So if you have a table with a primary key and a secondary index, each write to a logical row writes to 2 column family rows.
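A sketch of that case, assuming a version with the UUID type and gen_random_uuid() (the table itself is hypothetical):

```sql
-- Hypothetical table: inserting one logical row costs two write intents,
-- one for the primary key's column family and one for the secondary index.
CREATE TABLE readings (
    id        UUID PRIMARY KEY,
    device_id UUID,
    value     FLOAT,
    INDEX (device_id)
);

INSERT INTO readings (id, device_id, value)
  VALUES (gen_random_uuid(), gen_random_uuid(), 1.5);
```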

This is not a per-second or per-node limitation: it applies per transaction, for however long it takes that transaction to commit. You can of course issue more transactions concurrently; the big caveat is contention between transactions (i.e. if two or more transactions are writing/updating the same rows, or reading/writing the same rows - also called write-read and write-write conflicts), in which case your transactions may be aborted/retried and transaction times will be severely affected.

The rule of thumb here is to keep transactions as lean and self-contained as possible. This is not just a limitation of CockroachDB but of all databases that guarantee some form of serializable isolation.

Would that successfully take place of the usual bigserial or would UUID be automatically inserted as the rowid in that case?

If you don’t specify a primary key for a table, a unique primary key will be assigned to each row. This limits your ability to query for individual rows. See the primary keys docs for more info.
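As a rough sketch, a table declared without a primary key behaves as if a hidden rowid column were the key; the second table below spells out roughly the equivalent yourself (names are made up):

```sql
-- Without an explicit primary key, CockroachDB adds a hidden rowid column
-- filled with unique_rowid(); declaring the equivalent explicitly:
CREATE TABLE events_implicit (payload STRING);    -- gets a hidden rowid primary key
CREATE TABLE events_explicit (
    id      INT DEFAULT unique_rowid() PRIMARY KEY,
    payload STRING
);
```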

Hi Richard,

Thank you for your reply. If I understood correctly, each column family is allowed 100K writes/reads per transaction. That is enough for me, thank you!

Hi Russ,

Sorry if my explanation wasn’t clear (I’ve updated my comment about the 100k write intents): you can have at most 100k writes/updates per transaction, where each write/update is counted per column family.

So if you have a table with a primary key and a secondary index, and you write to 1 row with ID = 5, that counts as 2 writes (because you have 2 column families).

If you wrote to 50 rows from the same table, that would count as 100 writes.

Hi Richard,

Thank you for the clarification. So, with the assumption that one table is always one column family for the data plus one column family per index, I’m still golden. In my case, the biggest number of rows goes into the many-to-many table, which only has the IDs of the joined tables and hence only a primary key with no additional indexes.
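For reference, such a link table could be as simple as the sketch below (names are hypothetical); with only the composite primary key and no secondary indexes, each inserted row costs a single write intent:

```sql
-- Hypothetical many-to-many link table: only the composite primary key,
-- no secondary indexes, so each inserted row is a single write intent.
CREATE TABLE device_records (
    device_id UUID,
    record_id UUID,
    PRIMARY KEY (device_id, record_id)
);
```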

Thanks! 🙂