CDB + Spark Support

Hello there,

I am trying to find out whether CDB is supported in Apache Spark. I can currently access CDB through Spark using the Postgres driver, but I would like to know whether there are any specialized drivers that optimize for predicate pushdown and other optimizations pertaining to data locality.


Hi @muthu,

Thanks for the question. You are correct that CDB can be accessed through SparkSQL using the Postgres driver. This allows a Spark deployment to perform operations on DataFrames produced by running SQL queries against a running CDB cluster. However, as you noted, without a specialized driver some optimizations will be unavailable, such as those pertaining to data locality.
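For reference, a minimal PySpark read over the stock PostgreSQL JDBC driver looks something like the sketch below. All connection details (host, port, database, table, user) are illustrative placeholders, not CDB defaults:

```python
# Sketch: loading a CDB table into a Spark DataFrame through the
# generic PostgreSQL JDBC driver. All connection details below
# (host, port, database, table, user) are illustrative placeholders.

from typing import Dict


def cdb_jdbc_options(host: str, port: int, db: str,
                     table: str, user: str) -> Dict[str, str]:
    """Build the option map for Spark's JDBC data source."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{db}",
        "dbtable": table,
        "user": user,
        "driver": "org.postgresql.Driver",
    }


def read_table(spark, host: str, port: int, db: str,
               table: str, user: str):
    """Return a DataFrame backed by a CDB table. `spark` is an
    existing SparkSession with the Postgres JDBC driver on its
    classpath (e.g. via spark.jars.packages=org.postgresql:postgresql)."""
    return (spark.read.format("jdbc")
            .options(**cdb_jdbc_options(host, port, db, table, user))
            .load())
```

The options are built in a plain function so they can be inspected and tested independently of a running cluster.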

That said, all is not lost: the SQL issued by SparkSQL is run through CDB’s own internal distributed query execution engine, which performs many of the same kinds of optimizations that make Spark so powerful. For starters, it pushes filtering and computation to the data, rather than the other way around. It also uses the compute resources available across your entire CDB cluster to perform large joins and other expensive aggregations. All data will still pass through a single SQL gateway node at some point during the processing of a Spark flow, but careful tuning of where this Spark/CDB barrier sits should help you make the most of what both systems have to offer and avoid much of the need for a locality-aware driver.
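One way to control where that Spark/CDB barrier sits is to push a whole aggregation into CDB using the JDBC source's `query` option (available in Spark 2.4+), so only the much smaller result set crosses the gateway node. A sketch, again with placeholder connection details and a made-up query:

```python
# Sketch: pushing an aggregation down to CDB via the JDBC source's
# "query" option. CDB's distributed engine runs the GROUP BY across
# the cluster; only the aggregated rows flow back through the single
# gateway node into Spark.

from typing import Dict


def pushdown_options(url: str, user: str, sql: str) -> Dict[str, str]:
    """Option map that ships `sql` verbatim to CDB for execution."""
    return {
        "url": url,
        # Hypothetical example query; any SELECT CDB accepts works here,
        # e.g. "SELECT region, sum(balance) FROM accounts GROUP BY region".
        "query": sql,
        "user": user,
        "driver": "org.postgresql.Driver",
    }


def read_aggregate(spark, url: str, user: str, sql: str):
    """DataFrame over the result of `sql`, computed inside CDB."""
    return (spark.read.format("jdbc")
            .options(**pushdown_options(url, user, sql))
            .load())
```

Note that `query` and `dbtable` are mutually exclusive in Spark's JDBC source, which is why the option map above sets only one of them.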

I’d encourage you to play around with this and report back with your findings. That will help us guide our development efforts going forward.

Hello Nathan,

Thank you for the response. As you pointed out, reads and writes through the gateway node can become a bottleneck.
I shall keep you posted on performance as we progress with Spark + CDB.
Our current use case centers on using Spark to run UDFs and UDAFs over data residing in CDB.
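For that use case, one common pattern is to keep the UDF body as a plain Python function and wrap it for Spark separately, so the logic can be unit-tested without a cluster. A sketch (the column name and bucketing logic are hypothetical, for illustration only):

```python
# Sketch: running a Python UDF in Spark over data read from CDB.
# CDB cannot execute the Python function itself, so Spark pulls the
# rows (through the gateway node) and applies the UDF on its executors.

def risk_bucket(balance):
    """Pure Python logic destined for a Spark UDF; trivially testable
    without a SparkSession. (Hypothetical example logic.)"""
    if balance is None:
        return "unknown"
    return "high" if balance < 0 else "low"


def with_risk_column(df):
    """Add a 'risk' column to a DataFrame that has a 'balance' column."""
    # Imported lazily so the pure logic above needs no Spark install.
    from pyspark.sql import functions as F, types as T
    bucket_udf = F.udf(risk_bucket, T.StringType())
    return df.withColumn("risk", bucket_udf(df["balance"]))
```

Because the UDF runs in Spark rather than in CDB, it pays the gateway transfer cost discussed above, so filtering rows inside CDB first (via pushdown) before applying the UDF tends to help.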