Thanks for the question. You're correct that CDB can be accessed from SparkSQL through the Postgres driver: a Spark deployment can operate on DataFrames produced by running SQL queries against a live CDB cluster. As you noted, though, without a specialized driver some optimizations are unavailable, notably those pertaining to data locality.
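To make this concrete, here is a minimal sketch of wiring Spark to a CDB cluster through the generic JDBC data source and the Postgres driver. The host, port, database, table name, and credentials are all placeholders, and you'd need the Postgres JDBC jar on Spark's classpath; this is a connection sketch, not something tied to any particular CDB release.

```python
# Sketch only: reading a CDB table into a Spark DataFrame over the
# Postgres wire protocol, using Spark's built-in JDBC data source.
# All connection details below are placeholders for your cluster.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cdb-jdbc-example")
         .getOrCreate())

df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://cdb-gateway.example.com:5432/mydb")
      .option("driver", "org.postgresql.Driver")
      .option("dbtable", "public.orders")   # placeholder table
      .option("user", "spark_reader")
      .option("password", "...")
      .load())

df.printSchema()
```

From here the DataFrame behaves like any other Spark source, so you can join it against data from elsewhere in your pipeline.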
That said, all is not lost, because any SQL submitted through SparkSQL is executed by CDB's own distributed query engine. That engine performs many of the same kinds of optimizations that make Spark so powerful: it pushes filtering and computation to the data, rather than the other way around, and it uses the compute resources of your entire CDB cluster to perform large joins and other expensive aggregations. The caveat is that all results still flow through a single SQL gateway node at some point during the processing of a Spark flow, but careful tuning of where this Spark/CDB boundary falls should help you make the most of what both systems have to offer and avoid much of the need for a locality-aware driver.
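One way to move that boundary in your favor is to push an entire query into CDB rather than pulling raw tables into Spark. A sketch, using the JDBC source's `query` option (available in Spark 2.4+); the tables, columns, and connection details are illustrative, not from your schema:

```python
# Sketch only: pushing a whole join + aggregation down into CDB's
# distributed engine. Only the small aggregated result set crosses
# the gateway node into Spark, instead of every row of both tables.
agg = (spark.read
       .format("jdbc")
       .option("url", "jdbc:postgresql://cdb-gateway.example.com:5432/mydb")
       .option("driver", "org.postgresql.Driver")
       .option("query", """
           SELECT c.region,
                  count(*)     AS order_count,
                  sum(o.total) AS revenue
           FROM orders AS o
           JOIN customers AS c ON o.customer_id = c.id
           GROUP BY c.region
       """)  # hypothetical tables/columns
       .option("user", "spark_reader")
       .option("password", "...")
       .load())
```

The design choice here is that the expensive join and GROUP BY run where the data lives, on the CDB cluster, and Spark only handles the downstream steps that genuinely need its execution model.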
I’d encourage you to play around with this and report back with your findings. That will help us guide our development efforts going forward.