SQL logical/physical plan optimizations


I am learning ditributed SQL optimization techniques. I read cockroachdb RFC [1] about it and cannot realize one thing. From the document I understood that a cost-based optimizer (CBO) works during logical query planning ignoring all information about data distribution on nodes in a cluser. After logical planning it produces only one optimized logical plan. Next, the logical plan is transformed to a physical plan. I cannot fully understand why an optimal logical plan always leads to an optimal physical plan. Is it so? Why is it so? Or did I get how the optimizer works wrong?

[1] https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20160421_distributed_sql.md

Hi Ivan,

Thanks for the interest!

Your observation is correct - the distribution could matter in terms of getting the best logical plan; in addition, when creating a distributed plan from a logical plan, we have to make certain decisions (e.g. on how many nodes would we run a distributed join or aggregation?) which should also be cost-based.

The current architecture is a byproduct of how our system has evolved over time. We have started developing the optimizer relatively recently (2.1 was the first release with an optimizer) and we are still working on parity with the old heuristic planning code.

In the future, we plan to make the optimizer more aware of distribution, and pull more aspects of the physical planning part in the optimizer. Our upcoming 19.1 release does have a relevant feature: zone-aware index costing - if you have a geo-distributed cluster and create similar indexes in each locality, the optimizer will choose the local index.