Alternative to JDBC's setFetchSize for Pagination

Let’s say I have a table containing 20 million rows and I have to scan through the entire table.

  1. How can I do this efficiently in CockroachDB?
  2. How does CockroachDB control the number of rows returned? With Postgres we used setFetchSize to control the batch size. How do I control the batch size with CockroachDB?
  3. I have seen LIMIT and OFFSET, as mentioned here. Let’s say I am processing in batches of 10,000 by making multiple queries with LIMIT and OFFSET:
SELECT id, name FROM accounts LIMIT 10000 OFFSET 10000;

How does CockroachDB maintain the offset? If I specify OFFSET 10000, does CockroachDB have to traverse the first 10,000 rows to return the next batch?

Hi @tuk. OFFSET 10000 does mean we’d have to scan the first 10,000 rows. A more efficient way to do this is something like SELECT id, name FROM accounts ORDER BY {primary key} LIMIT 10000, then for each subsequent batch, add a WHERE {primary key} > {last pk value in the previous batch}.
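
For anyone wiring this up via JDBC, here is a minimal sketch of that keyset-pagination loop. The connection URL, the accounts table, and the integer primary key column id are illustrative assumptions, not something from this thread:

import java.sql.*;

public class KeysetPagination {
    static final int BATCH_SIZE = 10000;

    public static void main(String[] args) throws SQLException {
        // Connection details are placeholders; adjust for your cluster.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:26257/bank?user=root")) {
            // Assumes id is an integer primary key; start below any real value.
            long lastId = Long.MIN_VALUE;
            String sql = "SELECT id, name FROM accounts "
                       + "WHERE id > ? ORDER BY id LIMIT ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                boolean more = true;
                while (more) {
                    ps.setLong(1, lastId);
                    ps.setInt(2, BATCH_SIZE);
                    int rows = 0;
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            // The last PK seen is the pagination token.
                            lastId = rs.getLong("id");
                            rows++;
                        }
                    }
                    // A short batch means the table has been fully scanned.
                    more = (rows == BATCH_SIZE);
                }
            }
        }
    }
}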

And instead of using a transaction to keep your batches consistent (it’s not a good idea to keep transactions open for a very long time), you may also want to use the AS OF SYSTEM TIME feature.

Hi @dan,

Is the suggested approach (the client keeps track of the last fetched PK and uses it as a pagination token) the most efficient way to scan a large number of rows from CRDB via JDBC, assuming the client does not have large amounts of memory?

Second question: what if I just issue SELECT id, name FROM accounts on a table containing 20M rows? How will CRDB/JDBC handle it? Will CRDB fetch all 20M rows and send them as a single batch to the client? Or is there some implicit pagination underneath?


thanks,
gaurav

Yeah, paginating by PK is the best way to do this. In fact, it’s how our dump command is implemented.

what if I just issue SELECT id, name FROM accounts on a table containing 20M rows? How will CRDB/JDBC handle it? Will CRDB fetch all 20M rows and send them as a single batch to the client? Or is there some implicit pagination underneath?

We do internal pagination and the results are streamed back to the client as they come in, but it’s generally a bad idea to have any transactions or queries that take more than 10 seconds.
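
One driver-side caveat worth knowing: the PostgreSQL JDBC driver that clients typically use with CockroachDB buffers an entire result set in client memory by default, and only streams rows in fetch-size chunks when autocommit is off and a fetch size is set on a forward-only statement. Here is a hedged sketch of those pgJDBC settings; whether a given CockroachDB version honors the fetch size over the wire is an assumption you should verify:

import java.sql.*;

public class StreamingScan {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:26257/bank?user=root")) {
            // pgJDBC only uses a cursor (streaming rows in chunks instead of
            // buffering everything client-side) when autocommit is off and a
            // fetch size is set on a forward-only statement.
            conn.setAutoCommit(false);
            try (Statement st = conn.createStatement()) {
                st.setFetchSize(10000);
                try (ResultSet rs = st.executeQuery(
                        "SELECT id, name FROM accounts")) {
                    while (rs.next()) {
                        // Rows arrive in chunks of up to the fetch size.
                        System.out.println(
                            rs.getLong("id") + " " + rs.getString("name"));
                    }
                }
            }
            conn.commit();
        }
    }
}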

Thank you for taking the time to answer these queries. One last follow-up on this: if I used AS OF SYSTEM TIME in the above SELECT query, would it be equivalent to or worse than using client-side pagination (which results in multiple queries being executed serially)?

AS OF SYSTEM TIME doesn’t change the cost of a query. It just lets you run two selects that are consistent with each other without having to run them in a transaction. This is useful in pagination, for example, if records are being inserted while the pagination is running.
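
To make the keyset-paginated scan from earlier in this thread read one consistent snapshot, the timestamp can be captured once and every batch pinned to it. Two assumptions in this sketch to verify against your CockroachDB version: that cluster_logical_timestamp() returns a timestamp usable with AS OF SYSTEM TIME, and that the clause requires a constant expression, which is why the timestamp is spliced into the SQL text instead of bound as a parameter:

import java.sql.*;

public class SnapshotPagination {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:26257/bank?user=root")) {
            // Capture one timestamp up front; every batch then reads as of
            // that instant, so concurrent inserts cannot skew the pagination.
            String ts;
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT cluster_logical_timestamp()")) {
                rs.next();
                ts = rs.getString(1);
            }
            // AS OF SYSTEM TIME must be a constant expression, so the captured
            // timestamp is spliced into the query text rather than bound as ?.
            String sql = "SELECT id, name FROM accounts AS OF SYSTEM TIME " + ts
                       + " WHERE id > ? ORDER BY id LIMIT 10000";
            long lastId = Long.MIN_VALUE;
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                boolean more = true;
                while (more) {
                    ps.setLong(1, lastId);
                    int rows = 0;
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            lastId = rs.getLong("id");
                            rows++;
                        }
                    }
                    more = (rows == 10000);
                }
            }
        }
    }
}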