CockroachDB running in Kubernetes with Longhorn or Ceph

I have an architecture question about CockroachDB running in Kubernetes:
I am running a self-managed Kubernetes cluster with Longhorn as the storage system, but my question also applies to a storage system based on Ceph.

Currently I am running single PODs with PostgreSQL and a persistent volume provided by Longhorn. All my Longhorn volumes are replicated across 3 nodes.

Now I am considering switching to CockroachDB with a cluster of 3 nodes. I guess that it does not make sense to also replicate each CockroachDB volume via Longhorn into 3 additional replicas on 3 nodes. As far as I understand, this would mean that I store each piece of data 9 times: databases are replicated by CockroachDB across 3 PODs, and each POD uses a distributed filesystem (Longhorn) replicated over 3 nodes.

So what is the recommendation for deploying CockroachDB in combination with a distributed filesystem like Longhorn or Ceph? Should I reduce the replicas in my PersistentVolume from 3 to 1? Or is it a bad idea to use Longhorn/Ceph in combination with CockroachDB at all?
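
For reference, reducing the replicas would probably mean a dedicated Longhorn StorageClass like the following. This is only a minimal sketch: the class name is my own choice, and the provisioner string depends on the Longhorn version.

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: longhorn-single-replica   # illustrative name
provisioner: driver.longhorn.io   # older Longhorn releases use rancher.io/longhorn
parameters:
  numberOfReplicas: "1"           # let CockroachDB handle the replication instead
  staleReplicaTimeout: "30"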

Thanks for your help.

I think I found an answer here

At this time, orchestrations of CockroachDB with Kubernetes use external persistent volumes that are often replicated by the provider. Because CockroachDB already replicates data automatically, this additional layer of replication is unnecessary and can negatively impact performance. High-performance use cases on a private Kubernetes cluster may want to consider a DaemonSet deployment until StatefulSets support node-local storage.

and also here.
But I think some parts of the documentation that talk about DaemonSets and version 1.1.0 are outdated?

I will run some tests to see whether it is possible to use local persistent volumes with StatefulSets…

That makes sense. I think running on Ceph with a replica count of 1 is likely to just cause extra overhead and operational complexity. I believe that StatefulSets do now have the storage support, though I am not an expert. Let me know if you run into any trouble.

I succeeded in setting up CockroachDB with a StatefulSet using a hostPath volume instead of a local persistent volume:

  volumes:
  - name: datadir
    hostPath:
      path: /cockroachdb
      type: Directory
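      # note: with type Directory, the /cockroachdb directory must already exist on each node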

I did not figure out how to define local PersistentVolumes for each node. Also, I do not yet understand what the advantage of a local PersistentVolume would be compared to a hostPath.

I published a blog post explaining how to set up CockroachDB in Kubernetes using local disks.
https://ralph.blog.imixs.com/2021/04/22/cockroachdb-kubernetes/

I still cannot see any advantage in using a StatefulSet instead of a DaemonSet when running Cockroach in Kubernetes. Also, I think using hostPath is the most effective way to configure the cluster nodes.

Running Cockroach as a DaemonSet is much more fragile than using a StatefulSet in most Kubernetes environments. You certainly can run Cockroach as a DaemonSet and possibly eke out a bit of extra performance, but we don’t recommend it due to the fragility of the setup. On CockroachCloud, we’ve opted to use StatefulSets for the stability and ease of administration.

In most Kubernetes environments, nodes are considered ephemeral. GKE is probably the most painful example of this: it will automatically upgrade and patch your nodes, which results in a full node turnover every now and again. Without using some form of persistent storage, DaemonSet deployments would lose all their data when nodes are replaced in a cluster.

Even restarting nodes can be a bit disruptive. StatefulSets provide a stable network identity to each pod; with a DaemonSet, unless you’ve assigned static IPs to each Kubernetes node, you run the risk of Cockroach nodes being unable to contact each other after a restart.
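
To make that concrete: the stable identity comes from pairing the StatefulSet with a headless Service, so each pod gets a predictable DNS name such as cockroachdb-0.cockroachdb.<namespace>.svc.cluster.local. A minimal sketch, assuming the StatefulSet is named cockroachdb and its serviceName points at this Service:

apiVersion: v1
kind: Service
metadata:
  name: cockroachdb
spec:
  clusterIP: None    # headless: gives each StatefulSet pod a stable DNS entry
  selector:
    app: cockroachdb
  ports:
  - name: grpc
    port: 26257      # CockroachDB inter-node and SQL traffic
  - name: http
    port: 8080       # admin UI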

By definition, a DaemonSet pod can’t be rescheduled to another node. If you discover a hardware defect on a specific Kubernetes node, you won’t be able to reschedule the CockroachDB node to another Kubernetes node. So you’re risking data loss in the case of hardware failure.

I would strongly recommend against using a DaemonSet in cases where fault tolerance and data survivability are important. For your specific case, it seems safe, though you could likely achieve a similar setup with improved stability by using a StatefulSet and one of Kubernetes’ volume plugins. I’ll chat with the team about updating our docs to better showcase the drawbacks of using a DaemonSet.

Thanks for your response.
Your argument about the stability of a DaemonSet may be valid for a GKE environment; I do not know it. But I am running a self-managed Kubernetes cluster. Replacing a node means I swap the hardware in a rack, and this, of course, does not happen unexpectedly. Each node also has a stable IP address in my cluster. As a result, a single node in such a Kubernetes cluster is a very stable and clearly defined thing, not ephemeral.

And with that background in mind, I think the argument for using volume plugins instead of hostPath volumes does not apply to me. A hostPath is a clearly defined device on a specific piece of hardware in my cluster. The performance should also be good, as there is no extra layer between the DB and the disk.
To me, the CockroachDB cluster seems very similar in concept to Longhorn (a distributed filesystem). Longhorn also uses a DaemonSet and hostPath to define where to allocate the disk space on a node. I can add and remove nodes as I need. And I understand that this is also the core idea behind Cockroach? Until now I have used Longhorn to run my PostgreSQL PODs, and my plan is to migrate all these single PODs into one CockroachDB cluster.

I wonder why you are worried about fault tolerance and data survivability. You say “…DaemonSet pod can’t be rescheduled to another node…”: this is exactly what I expect. I can’t see the disadvantage here.

Of course, I can run CockroachDB in a StatefulSet using my existing volume plugin based on Longhorn. But then each CockroachDB dataDir is replicated on 3 Longhorn nodes, which results in 9 redundant data stores.

To avoid this, my idea was to run the StatefulSet (which is recommended) with local PersistentVolumes. But there seemed to be no way to combine a local PV with a StatefulSet: I could not figure out how to define separate local PVs to be used by the PODs running on different nodes. So finally I came to the conclusion that the StatefulSet would only work with hostPath volumes. And yes, this worked. But then I realized that the DaemonSet offers much more flexibility than the StatefulSet.

I agree with you that everything I am talking about here becomes difficult in a Kubernetes cluster hosted by AWS, GKE or Microsoft Azure, because they all need to abstract the hardware and disks in some way. But sometimes I get the impression that many users believe a Kubernetes cluster is only ever run by Amazon, Google and Microsoft…

Please correct me if I am telling nonsense here.

Ah! I’m sorry, I think I had misread something along the way. I thought your goal was to run Cockroach on a replicated disk, but that the replication was happening outside of Kubernetes. You and Andrew are 100% correct that you’d have 9 redundant copies and that it would be overkill.

Aside from the nuance of DaemonSet vs StatefulSet, have you looked at Kubernetes’ blog post about local volume provisioning in conjunction with StatefulSets? If I’m reading it and your question correctly this time, it sounds like your exact use case. I can’t say I’ve gotten a chance to try it out personally yet, but I’m a bit interested in trying it with local SSDs in GKE.

Would you mind saying more about what you found to be more flexible with a DaemonSet? StatefulSets certainly have some oddities and sharp edges, but for the most part their guarantees are very helpful when it comes to running stateful systems.

It’s less that users think all Kubernetes clusters are hosted by a cloud and more about the guarantees provided by Kubernetes. When deploying on Kubernetes, in a general sense, the configuration of the nodes is an implementation (or infrastructure?) detail. It could change from cluster to cluster or even node to node. If a user, developer, operator, etc., wants to network together a set of pods, the “Kubernetes way” would be to use Services or the network identities provided by a StatefulSet. This keeps the required manual intervention low and the portability high.

When addressing the general question of “should I run Cockroach as a StatefulSet or DaemonSet?” I will always answer StatefulSet. In addition to the various reasons I gave, a StatefulSet is a more correct representation of the service being deployed, again in the general sense of how Kubernetes thinks about things.

Reading that question in regard to your specific case, I think the only advantage would be portability. If you ever decide to build a second cluster and want to deploy Cockroach onto it, you could just reuse your StatefulSet definitions from the first cluster. A DaemonSet would require some back and forth with various configuration values. If you’re willing to spend 5 extra minutes setting up each cluster, I don’t see any reason your cluster would benefit from moving to a StatefulSet.

Sorry for some of the misreadings on my part, but this has been a very thought-provoking conversation! Provided we can get a working example of local persistent volumes, we’ll likely replace the DaemonSet references in our docs with them.

OK, finally I succeeded in running the CockroachDB cluster as a StatefulSet with local volumes. My mistake was that I did not create a separate volume for each node. So my local-volume.yaml file has to look like the following: one PersistentVolume per node, each with its own name and a nodeAffinity for that node.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: datadir-1
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  #persistentVolumeReclaimPolicy: Delete
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /cockroachdb
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - test-worker-1

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: datadir-2
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  #persistentVolumeReclaimPolicy: Delete
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /cockroachdb
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - test-worker-2

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: datadir-3
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  #persistentVolumeReclaimPolicy: Delete
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /cockroachdb
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - test-worker-3

Now, together with this volumeClaimTemplates section in my StatefulSet:

  volumeClaimTemplates:
  - metadata:
      name: datadir
    spec:
      accessModes:
        - "ReadWriteOnce"
      storageClassName: "local-storage"
      resources:
        requests:
          storage: 10Gi

The PODs are scheduled on my nodes. That’s fine.
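
For completeness: the local-storage class referenced above also has to exist. A minimal sketch of it, assuming the usual static (no-provisioner) setup for local volumes:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer   # bind the PVC only when the POD is scheduled

WaitForFirstConsumer delays the binding until the POD is scheduled, so the scheduler can take the nodeAffinity of the local volumes into account.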

But now I must again come back to the point of StatefulSet vs DaemonSet. I am still not convinced that the StatefulSet has an advantage over the DaemonSet in a self-managed cluster (not a cloud solution like GKE, AWS or Azure).

To sum things up:

The StatefulSet defines a fixed number of PODs. The PODs are scheduled by the Kubernetes scheduler based on internal decisions derived from the current cluster load.
This means that if a node goes down, Kubernetes will schedule the missing POD from my StatefulSet on any other available node in my cluster. If I am using hostPath volumes, then I will of course lose all the data in the datadir of the dead node. This is bad, as you explained.
To avoid this situation it is recommended to use PersistentVolumes in combination with a StatefulSet. The PersistentVolumeClaim mechanism abstracts the volume and guarantees that the POD is scheduled together with its volume.

But now, I do not want to use my default volume system (Longhorn or Ceph) for CockroachDB. For better performance I want to use local disks. How to solve this? Here the local volume comes into play. A local volume is bound to the disk of a specific node (as you can see in my .yaml example above).
With the PersistentVolumeClaim mechanism and the StatefulSet, all my PODs are now scheduled correctly on the nodes where I have defined the local volumes.

But the whole thing is just a crutch to avoid PODs being scheduled on the wrong nodes. If a node goes down in this scenario, Kubernetes cannot reschedule the POD on another node until a matching PV is provided. So if I want to reschedule the missing Cockroach node on another Kubernetes node, I first need to create a new local volume. And this makes absolutely no sense to me, I am sorry to say. But I am running a self-managed Kubernetes cluster, and each Kubernetes node is a specific piece of hardware to which I have root access. So, for example, I can back up the /cockroachdb/ directory and move it to another node. (Should I?)

In this whole scenario I get no benefit of a dynamically scaling Cockroach cluster. Of course I can use the static local volume provisioner, but that is still not the point.

Now let’s look at the DaemonSet. The DaemonSet defines a set of PODs to be scheduled on every node that fulfils the selection criteria defined by node affinities and taints. The Kubernetes scheduler guarantees that a POD will be scheduled on each matching node. If I add a new Kubernetes node, Kubernetes will schedule a new Cockroach POD on it.
There is no need for local volumes. I can use hostPath volumes because the link between the POD and the local disk will not change. This is similar to the deployment of cAdvisor or the Longhorn manager. It is important that the POD sees the disk of the node it is running on.

I don’t think you should condemn the DaemonSet.

Sorry for the long post. Maybe GitHub would be a better place. (I always need to wait one day until my post is approved.)

Thanks again for your feedback! And thanks for the link. I will give the StatefulSet a second chance now :wink:

I think a DaemonSet has the following advantage: with the concepts of node affinities and taints and tolerations you can define on which of your nodes the Cockroach nodes should run. If I need more nodes, I can add one more node to my cluster, or I can label one of my existing nodes so that it becomes a member of my CockroachDB cluster.

I think it is very convenient that Kubernetes takes care of the deployment of new PODs automatically, just by checking the description of my nodes. For example, I can define that only nodes with SSDs or with a minimum amount of disk space should schedule CockroachDB. I don’t see this feature in a StatefulSet.
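
A minimal sketch of what I mean; the storage=ssd label, the image tag and the join addresses are only illustrative placeholders, not a complete configuration:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cockroachdb
spec:
  selector:
    matchLabels:
      app: cockroachdb
  template:
    metadata:
      labels:
        app: cockroachdb
    spec:
      # only nodes carrying this (illustrative) label will run a CockroachDB POD
      nodeSelector:
        storage: ssd
      hostNetwork: true          # so the POD uses the node's own network identity
      containers:
      - name: cockroachdb
        image: cockroachdb/cockroach:v20.2.8    # version tag is illustrative
        command:
        - /cockroach/cockroach
        - start
        - --insecure
        - --join=10.0.0.1,10.0.0.2,10.0.0.3     # placeholder node IPs
        - --store=path=/cockroach/cockroach-data
        volumeMounts:
        - name: datadir
          mountPath: /cockroach/cockroach-data
      volumes:
      - name: datadir
        hostPath:
          path: /cockroachdb
          type: Directory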