Name resolution and prepared statements

postgresql
sql

(Antonio Manuel Macías Ojeda) #1

Hi,

I need help understanding https://www.cockroachlabs.com/docs/stable/sql-name-resolution.html and prepared statements.

In PostgreSQL, prepared statements are per connection and database, so I can’t reuse prepared statements across several databases, even when they have the same tables and schemas. This is because PostgreSQL works that way, a connection is to a database, in contrast to MySQL USE with which you can switch databases prior to running a query.

Our use case seems natural for the distributed nature of CRDB: To have data isolation, we have different customers on different databases, but all on the same CRDB cluster. We would like our microservices to use prepared statements for performance, but considering it’s wire-compatible with Postgres, I’m not sure this is the right way to do it. For now we are doing one-shot queries doing string interpolation databaseName.tableName and that works, but seems like a waste of performance when those are repeated over and over again.


(Antonio Manuel Macías Ojeda) #2

An approach we could do is to prepare the statement the first time it’s used against DB X, and keep it for reuse when that particular query runs against that particular DB. The problem is if the connection on CRDB is per-database like in Postgres, since it would be a waste to disconnect from DB Y to prepare and run a query on DB X, then disconnect from DB X loosing that prepared statement and do the same on Y, back an forth.


(Antonio Manuel Macías Ojeda) #3

Also a bit more context: It might be that the whole “each customer has their own DB” approach is a bogus way of doing multi-tenancy on this scenario, but given that Cockroach doesn’t support per row-level access-control, this seemed like a good approach. So an alternative better approach for the CRDB world would be welcome as well :slight_smile:


(Raphael 'kena' Poss) #4

You can use prepared statements across multiple databases in cockroachdb [*].

The object names are expanded “internally” during prepare, so you can reuse them even if you switch to a different database.

[*] CockroachDB knows how to handle object names that are regular SQL identifiers, for example FROM t will be translated to FROM db.public.t so you still get the right thing when you change to db2. However, the following two things are not handled automatically:

  • references to information_schema and pg_catalog in SQL views. These will have different results depending on the current database.

  • references to database objects via SQL strings, For example currval('seqname') will not be recognized as a name relative to the database. You must modify this explicitly in your app code to use a fully qualified name e.g. currval('db.public.seqname').

Does this help?


(Antonio Manuel Macías Ojeda) #5

Thanks for your reply!

The object names are expanded “internally” during prepare, so you can reuse them even if you switch to a different database.

[*] CockroachDB knows how to handle object names that are regular SQL identifiers, for example FROM t will be translated to FROM db.public.t

I see the name resolution will work, but does the prepared statement provide any advantage here? Can Cockroach somehow reuse the execution plan when the prepared statement changes the “target” schema? Or when the database changes it has to re-parse and re-plan the query? If it has to re-parse/re-plan, does it keep the two plans in the connection or just the latest one?

Even though the name resolution would work, it’s not clear to me if prepared statements would provide an advantage in this scenario.


(Raphael 'kena' Poss) #6

I don’t think there’s a clear advantage, no. I was just pointing out that the behavior should not be surprising/incorrect.


(Antonio Manuel Macías Ojeda) #7

Got it. Thanks for your reply!

Still not clear on what’s the best way to get multi-tenancy out of Cockroach then. No row level access control makes putting all customers data on the same database and schema a bit risky if a programmer forgets to add “where customer_id = $x” on each query, which also becomes a burden to write and maintain. But with separate databases and no prepared statements there’s a noticeable performance hit when the same queries are re-parsed/re-planed over and over again. Am I missing something?


(Raphael 'kena' Poss) #8

I think you’ve confused yourself.

  1. if you do implement multiple tenancy using separate database per app, you can absolutely benefit from prepared statements. This is because in a typical multi-tenant scenario, each app only uses one database. What you would do then is to embed the app’s database in the client connection URL, or use SET database =... at the beginning of each session.

    If you do this, and you do not change the current database in the middle of a session, you can absolutely utilize prepared statements. There is no pitfall, that’s exactly what they are made for.

    But the fact you’re asking is making me curious - have you even tried this scenario? What makes you think there is a problem?

  1. Regardless of the discussion on prepared statements, you should realize that CockroachDB is not super well suited for multi-tenancy. The larger problem is that there is no resource isolation: load by one app will impact performance of another app, and there is no way to isolate against that.

    You may find more use in having multiple CockroachDB clusters running side-by-side. For example if you have beefy hardware server, you can run multiple CockroachDB nodes (belonging to separate logical clusters) on each server, using OS-level resource isolation.


(Antonio Manuel Macías Ojeda) #9

The fact is that 1) is not our scenario. We have multiple customers running the same app, and our microservices serve all of them. So we don’t want each microservice to hold X connections per customer to Cockroach, each of them with multiple prepared statements doing exactly the same queries against exactly the same DB structures (just on separate databases for data separation, but all of them have exactly the same structure).


(Antonio Manuel Macías Ojeda) #10

As for 2), that’s unfortunate to hear, since CockroachDB has being marketed as multi-tenant from the very beginning. Searching for github issues gives for example https://github.com/cockroachdb/cockroach/issues/633 with team members answering about it, so the assumption has always been that CockroachDB will be multi-tenant. I’m just asking what’s the best way to structure/solve it, not if CRDB is a good match since the assumption was always that, yes, CRDB is ideal for this. Has the roadmap/vision changed?


(Raphael 'kena' Poss) #11

Thank you for explaining further.

Regarding multi-tenancy, let me be clear: CockroachDB is not yet well suited for multi-tenancy when the tenants have non-aligned incentives. We don’t really call “multiple apps using the same cluster” a “multi-tenant” scenario if the apps come from the same vendor and serve the same end-users. What we call “multi-tenant” is when the cluster is used by separate organizations that somehow compete with each other or where you need strong guarantees of internal confidentiality between apps. That’s where CockroachDB will not help yet.

Anyway, to come back to prepared statements.

Something that wasn’t clear to me before is that you want prepared statements that persist / are shared across different client connections. So yes that is not supported by CockroachDB.

But what about this: keep a pool of connections, then in each connection prepare your statements for each value of database, but give them each time a different prepared statement name. For example:

set database=db1;
prepare q1_db1 as select x...;
prepare q2_db1 as select y...;
set database=db2;
prepare q1_db2 as select x...;
prepare q2_db2 as select y...;
set database=db3;
prepare q1_db3 as select x...;
prepare q2_db3 as select y...;

etc

Then every time you serve a client for a given db, you’d use:

set database = dbM;
execute qx_dbM(...);

This way you get most of the benefits of prepared statements: each prepared statement is executed in the database where it was prepared.

(Note: crdb 2.1 does not optimize prepared statements too much, but we’ll improve this further in 19.1)

Would that work?


(Antonio Manuel Macías Ojeda) #12

Ah sorry, vocabulary then, I have seen “multi-tenant” used for just a “tenant_id” in a table column, for example… so I used the term loosely, sorry.

Yes, that would work, and it’s sort of my workaround for now. I’m hoping prepared statements will give a boost in performance in the future and giving that we do know which queries we are going to execute, it seems like a waste to not take advantage of that.

However, if I setup the prepared statements upfront upon connection (for each connection, in a connection pool) I would have to do #customers (DBs) * #queries-on-this-service for each service. Since we provide a service to multiple customers, #customers can be a high number. Also with the best practice of restarting idle connections after some time, a lot of those might potentially be a waste. I’m thinking that for this it would be better to not use prepare statements except the first time you do a query against a database on that connection, short of like a cache, but then there’s no guarantee that it will be reused ever before the connection is removed from the pool, as there’s no guarantee that the next query of the same customer will hit that connection on the pool.

Probably the best design is then to move all customers to a single database and having just one set of prepared statements. But appending “where customer_id = $x” to all queries is quite error prone, and with views I can’t do updates. I would appreciate any further ideas here, I don’t mind migrating everything to a new design :slight_smile:


(Raphael 'kena' Poss) #13

Maybe you can have a per-connection cache of prepared statement IDs in your load balancer / db API client-side. Then only run the PREPARE if the statement has not been prepared yet. This way you only need to prepare when a query is actually being used.

The scenario adding where customer_id = $x is something we know well. Our native support for geo-partitioning will evolve in this direction towards the 19.2 release.


(Antonio Manuel Macías Ojeda) #14

Thank you for your insights. After considering the options, some POCs, and reading related forum replies and github issues, I guess we’ll move to that pattern of where customer_id = $x. We’ll implemented a linter pass on our pipeline (some regex checks) to ensure there can’t be queries without it. I guess that’s the best we can do and seems like that’s the CRDB way to do it in the planned future :slight_smile: