Distributed Query Execution Plan: Aggregator Communication

In the above distributed query plan, the nodes have flow lines between each pair of nodes from the “by hash” operator to the “unordered” operator.

My question is: if I added some more nodes to this cluster, would this query become faster or slower given that this node communication is occurring ?


In general, each node would be processing less data so I would expect a query like this to be faster. Note that in this particular case, we are sending very few rows after the first aggregation stage (total of 6 rows in the picture), so that part of the plan doesn’t matter much. It’s the table scanning and filtering (TableReader) and the first stage of Aggregator where most of the work is happening.

Thanks for the explanation. I have another question though (just curious): what do the different layers of aggregator do?

Does the first set perform the max operation on the node data, then the second layer (where all the cross aggregator comma is happening) is about comparing each initial aggregator’s interim results before passing the final result?

If this is the case, I assume the second aggregator layer comms are efficient and distributed than collating the interim aggregations to a single node?