Know the content of nodes

Hi,
I am currently working on a project where it is critical to know how my data is distributed among the nodes, for instance in which nodes are my elements from table “user” where “age<30”, etc. I know it will be on about 3 nodes since I have 3 replicas, but I would like to know precisely which of them.

First I tried to use the monitoring UI, but I couldn’t find a way to see what data is in each node.
Then I tried launching SQL queries targeting precise sets of data and check on the monitoring which nodes would react to these queries, but my UI aways say that the first node receives the queries, no matter the query I put in entry, which I find really strange.

So here are my questions in decreasing order of priority :

  1. Is these any way I can possibly know which data is in which node?
  2. Do you know why on monitoring UI only the first node receives the queries?
  3. Is it possible to know which are the primary ranges and which are the replicas?

Thank you.

Hi Manu,

The “first node” that you see receiving the query (and thus producing latency data points in the time series) in the monitoring UI is called the “gateway node” (this is briefly described in our architecture docs: they are definitely worth a skim through at the very least). This is the node you connect to via cockroach sql --host=somenode:1234.

There is also the “leaseholder” (some node that has a replica for a given range) which actually performs the consistent reads/writes from its replica. So if your gateway node is node 1 and the leaseholder is node 2, it will still read/write data via node 2 (and send the output back to node 1). The leaseholder will however re-locate and adapt to workloads received by your cluster as described in this blog post.

  1. In short, you can think of your data being stored and accessed on the leaseholder node. To see which node is the leaseholder for each range, you can use the (experimental/internal) SQL query

    SHOW TESTING_RANGES FROM TABLE foo
    

    The “start/end keys” contain the primary key values of your table. So a range with start and end key /1 to /5 contain the rows with primary key value 1 to 4 (the end key is exclusive).
    Of course your leaseholder can move around (see the linked Follow the Workload blog post).

  2. Since the gateway node is the final interface between the cluster and a client or connection, latencies are shown with respect to the gateway. You can place a load-balancer in front of your cluster to distribute connections (you can use something like HAProxy as described in our docs).

    The neat thing about Cockroach is the distributed execution engine (which works with almost all read-only queries): it offloads as much computation to the leaseholder(s) so that as little data and deferred computation are sent over the wire to the gateway node.

  3. See #1.

1 Like

Hi @richardwu, thank you for your detailled and clear answer. I will test these without delay.
Have nice Christmas Holidays !

Let me know if you run into any other issues. Happy holidays to you too!