Help running Jepsen test on my CRDB cluster

I am trying to run this Jepsen test on my CockroachDB cluster:

Here is my cluster (deployed with Helm on Kubernetes):

I’ve installed Leiningen and pulled down the repo mentioned above.
When I run lein run test --help, I get this error:

Here is what project.clj looks like:

If I remove the UseFastAccessorMethods flag, it throws this error instead:

edonsra@elx7041939q:~/.lein/jepsen/cockroachdb$ lein run test --help
Compiling jepsen.cockroach.runner
java.lang.ExceptionInInitializerError, compiling:(runner.clj:1:1)
Exception in thread "main" java.lang.ExceptionInInitializerError, compiling:(runner.clj:1:1)
	at clojure.lang.Compiler$InvokeExpr.eval(Compiler.java:3657)
	at clojure.lang.Compiler.compile1(Compiler.java:7474)
	at clojure.lang.Compiler.compile1(Compiler.java:7464)
	at clojure.lang.Compiler.compile(Compiler.java:7541)
	at clojure.lang.RT.compile(RT.java:406)
	at clojure.lang.RT.load(RT.java:451)
	at clojure.lang.RT.load(RT.java:419)
	at clojure.core$load$fn__5677.invoke(core.clj:5893)
	at clojure.core$load.invokeStatic(core.clj:5892)
	at clojure.core$load.doInvoke(core.clj:5876)
	at clojure.lang.RestFn.invoke(RestFn.java:408)
	at clojure.core$load_one.invokeStatic(core.clj:5697)
	at clojure.core$compile$fn__5682.invoke(core.clj:5903)
	at clojure.core$compile.invokeStatic(core.clj:5903)
	at clojure.core$compile.invoke(core.clj:5895)
	at user$eval20$fn__29.invoke(form-init10123414058161471011.clj:1)
	at user$eval20.invokeStatic(form-init10123414058161471011.clj:1)
	at user$eval20.invoke(form-init10123414058161471011.clj:1)
	at clojure.lang.Compiler.eval(Compiler.java:6927)
	at clojure.lang.Compiler.eval(Compiler.java:6917)
	at clojure.lang.Compiler.eval(Compiler.java:6917)
	at clojure.lang.Compiler.load(Compiler.java:7379)
	at clojure.lang.Compiler.loadFile(Compiler.java:7317)
	at clojure.main$load_script.invokeStatic(main.clj:275)
	at clojure.main$init_opt.invokeStatic(main.clj:277)
	at clojure.main$init_opt.invoke(main.clj:277)
	at clojure.main$initialize.invokeStatic(main.clj:308)
	at clojure.main$null_opt.invokeStatic(main.clj:342)
	at clojure.main$null_opt.invoke(main.clj:339)
	at clojure.main$main.invokeStatic(main.clj:421)
	at clojure.main$main.doInvoke(main.clj:384)
	at clojure.lang.RestFn.invoke(RestFn.java:421)
	at clojure.lang.Var.invoke(Var.java:383)
	at clojure.lang.AFn.applyToHelper(AFn.java:156)
	at clojure.lang.Var.applyTo(Var.java:700)
	at clojure.main.main(main.java:37)
Caused by: java.lang.ExceptionInInitializerError
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:398)
	at clojure.lang.RT.classForName(RT.java:2168)
	at clojure.lang.RT.classForName(RT.java:2177)
	at clojure.lang.RT.loadClassForName(RT.java:2196)
	at clojure.lang.RT.load(RT.java:443)
	at clojure.lang.RT.load(RT.java:419)
	at clojure.core$load$fn__5677.invoke(core.clj:5893)
	at clojure.core$load.invokeStatic(core.clj:5892)
	at clojure.core$load.doInvoke(core.clj:5876)
	at clojure.lang.RestFn.invoke(RestFn.java:408)
	at clojure.core$load_one.invokeStatic(core.clj:5697)
	at clojure.core$load_one.invoke(core.clj:5692)
	at clojure.core$load_lib$fn__5626.invoke(core.clj:5737)
	at clojure.core$load_lib.invokeStatic(core.clj:5736)
	at clojure.core$load_lib.doInvoke(core.clj:5717)
	at clojure.lang.RestFn.applyTo(RestFn.java:142)
	at clojure.core$apply.invokeStatic(core.clj:648)
	at clojure.core$load_libs.invokeStatic(core.clj:5774)
	at clojure.core$load_libs.doInvoke(core.clj:5758)
	at clojure.lang.RestFn.applyTo(RestFn.java:137)
	at clojure.core$apply.invokeStatic(core.clj:648)
	at clojure.core$require.invokeStatic(core.clj:5796)
	at clojure.core$require.doInvoke(core.clj:5796)
	at clojure.lang.RestFn.invoke(RestFn.java:805)
	at jepsen.web$loading__6434__auto____4685.invoke(web.clj:1)
	at jepsen.web__init.load(Unknown Source)
	at jepsen.web__init.<clinit>(Unknown Source)
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:398)
	at clojure.lang.RT.classForName(RT.java:2168)
	at clojure.lang.RT.classForName(RT.java:2177)
	at clojure.lang.RT.loadClassForName(RT.java:2196)
	at clojure.lang.RT.load(RT.java:443)
	at clojure.lang.RT.load(RT.java:419)
	at clojure.core$load$fn__5677.invoke(core.clj:5893)
	at clojure.core$load.invokeStatic(core.clj:5892)
	at clojure.core$load.doInvoke(core.clj:5876)
	at clojure.lang.RestFn.invoke(RestFn.java:408)
	at clojure.core$load_one.invokeStatic(core.clj:5697)
	at clojure.core$load_one.invoke(core.clj:5692)
	at clojure.core$load_lib$fn__5626.invoke(core.clj:5737)
	at clojure.core$load_lib.invokeStatic(core.clj:5736)
	at clojure.core$load_lib.doInvoke(core.clj:5717)
	at clojure.lang.RestFn.applyTo(RestFn.java:142)
	at clojure.core$apply.invokeStatic(core.clj:648)
	at clojure.core$load_libs.invokeStatic(core.clj:5778)
	at clojure.core$load_libs.doInvoke(core.clj:5758)
	at clojure.lang.RestFn.applyTo(RestFn.java:137)
	at clojure.core$apply.invokeStatic(core.clj:648)
	at clojure.core$require.invokeStatic(core.clj:5796)
	at clojure.core$require.doInvoke(core.clj:5796)
	at clojure.lang.RestFn.invoke(RestFn.java:551)
	at jepsen.cli$loading__6434__auto____180.invoke(cli.clj:1)
	at jepsen.cli__init.load(Unknown Source)
	at jepsen.cli__init.<clinit>(Unknown Source)
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:398)
	at clojure.lang.RT.classForName(RT.java:2168)
	at clojure.lang.RT.classForName(RT.java:2177)
	at clojure.lang.RT.loadClassForName(RT.java:2196)
	at clojure.lang.RT.load(RT.java:443)
	at clojure.lang.RT.load(RT.java:419)
	at clojure.core$load$fn__5677.invoke(core.clj:5893)
	at clojure.core$load.invokeStatic(core.clj:5892)
	at clojure.core$load.doInvoke(core.clj:5876)
	at clojure.lang.RestFn.invoke(RestFn.java:408)
	at clojure.core$load_one.invokeStatic(core.clj:5697)
	at clojure.core$load_one.invoke(core.clj:5692)
	at clojure.core$load_lib$fn__5626.invoke(core.clj:5737)
	at clojure.core$load_lib.invokeStatic(core.clj:5736)
	at clojure.core$load_lib.doInvoke(core.clj:5717)
	at clojure.lang.RestFn.applyTo(RestFn.java:142)
	at clojure.core$apply.invokeStatic(core.clj:648)
	at clojure.core$load_libs.invokeStatic(core.clj:5774)
	at clojure.core$load_libs.doInvoke(core.clj:5758)
	at clojure.lang.RestFn.applyTo(RestFn.java:137)
	at clojure.core$apply.invokeStatic(core.clj:648)
	at clojure.core$require.invokeStatic(core.clj:5796)
	at clojure.core$require.doInvoke(core.clj:5796)
	at clojure.lang.RestFn.invoke(RestFn.java:1289)
	at jepsen.cockroach.runner$loading__5569__auto____36.invoke(runner.clj:1)
	at clojure.lang.AFn.applyToHelper(AFn.java:152)
	at clojure.lang.AFn.applyTo(AFn.java:144)
	at clojure.lang.Compiler$InvokeExpr.eval(Compiler.java:3652)
	... 35 more
Caused by: java.lang.ClassNotFoundException: javax.xml.bind.DatatypeConverter
	at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
	at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:398)
	at clojure.lang.RT.classForName(RT.java:2168)
	at clojure.lang.RT.classForNameNonLoading(RT.java:2181)
	at org.httpkit.server$loading__6434__auto____5096.invoke(server.clj:1)
	at org.httpkit.server__init.load(Unknown Source)
	at org.httpkit.server__init.<clinit>(Unknown Source)
	... 120 more
Compilation failed: Subprocess failed (exit code: 1)

I would be really grateful if I could get some help configuring this and running the test on my cluster. This seems really complex to me.

/Andreas

I think your JDK is too new. Try setting up Java 8 and running lein again. We haven’t upgraded the pieces in that repo in a long time.
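
As a rough illustration of that suggestion, here is a minimal shell sketch for checking which Java major version is on the PATH. The trace above ends in ClassNotFoundException: javax.xml.bind.DatatypeConverter, and the java.xml.bind module was removed from the JDK in Java 11, which fits the "too new" diagnosis. The java_major helper is hypothetical (written only for this example), and the JDK 8 path in the final comment is an assumption based on a typical Ubuntu OpenJDK install, so adjust it for your machine.

```shell
#!/bin/sh
# Hypothetical helper: reduce a Java version string to its major version.
#   "1.8.0_292" -> 8   (legacy 1.x numbering, used through Java 8)
#   "11.0.11"   -> 11
java_major() {
  v=$1
  case "$v" in
    1.*) v=${v#1.} ;;
  esac
  echo "${v%%[._-]*}"
}

if command -v java >/dev/null 2>&1; then
  # `java -version` writes to stderr; pull out the quoted version string.
  raw=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
  major=$(java_major "$raw")
  echo "java on PATH reports major version: ${major:-unknown}"
  if [ "${major:-0}" -gt 8 ] 2>/dev/null; then
    echo "Newer than Java 8; this old Jepsen checkout may fail to compile."
  fi
fi

# Assumed path for a typical Ubuntu OpenJDK 8 install; adjust as needed.
# Leiningen's launcher script honors JAVA_CMD when choosing the JVM:
#   JAVA_CMD=/usr/lib/jvm/java-8-openjdk-amd64/bin/java lein run test --help
```

Setting JAVA_CMD for a single invocation leaves the system default Java untouched, which may be convenient if other tools on the machine need the newer JDK.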

Another option is to try to nix this line (jepsen/project.clj at main · jepsen-io/jepsen · GitHub) and see if that gets lein to work for you.

Hello. Yes, I was able to run lein run test after changing from Java 11 to Java 8.

I’ve spent 8 hours trying to understand how the communication between my Jepsen host and my CockroachDB cluster (which runs on Kubernetes) is supposed to work. There is nowhere that I enter an IP address of the cluster. I’ve set up SSH key access between the Jepsen host and the CockroachDB cluster, but that doesn’t solve the communication piece. What is supposed to be listed in the nodes flag? The nodes aren’t running on the Jepsen host.

Here is what my cluster looks like on k8s:

How do I configure my Jepsen test to communicate with the cluster?

That’s a heck of a lot of time spent reading that program. The jepsen suite (not just for cockroach, just generally) assumes that it can control things at a pretty low level. It is in no way set up to just be pointed at a cluster. In fact, it wouldn’t really be able to do very much if it was. Jepsen needs to be able to partition and kill nodes to run its tests. The repo you’re looking at has a bunch of code to download and start cockroach binaries on the local machine. See jepsen/auto.clj at 0a1e11810c4cc601d3963e864f66d64be4745609 · jepsen-io/jepsen · GitHub and more generally just that whole file.

I worry that you may be barking up the wrong tree here trying to just drop in that repo on a k8s cluster. It really is not clear that that makes a lot of sense with the Jepsen test. You could make the concepts align, but it would require writing a good bit of Clojure.

Was there some way that our documentation was misleading and made it seem like the Jepsen test is a thing you can point at existing clusters?

I think my basic knowledge of CockroachDB, Kubernetes and Jepsen is poor, since I’ve only just started using them for the thesis I am doing.

What threw me off was this:


That made me believe that I should install the Jepsen framework locally and send requests to my CRDB cluster.

I worry that you may be barking up the wrong tree here trying to just drop in that repo on a k8s cluster.

We were about to try that, but so far we have just cloned the repo locally and tried to point the test at the cluster. We never dropped the repo into the CRDB cluster.
I don’t really know how to proceed from here. Do you recommend that we run Minikube and the Jepsen framework on the same machine, and run the tests locally?

Sure, you are correct, the clojure file I pointed you at (auto.clj) also has logic to SSH across to other nodes to run cockroach. Note these three modes: jepsen/auto.clj at 0a1e11810c4cc601d3963e864f66d64be4745609 · jepsen-io/jepsen · GitHub

The long and the short of it is that this is not going to “just work” with a cluster in k8s. The Jepsen test assumes that it fully controls the cluster and that it sets up the cluster. It is a reasonably big project to try to get the Jepsen test to understand and interact with k8s. If that’s a path you want to walk down, I encourage you to study and get comfortable with the test as it is. I suspect that you might be in a bit over your head trying to get this to run against k8s. Nothing small and simple with flags or anything like that is going to get the Jepsen test to run against a k8s cluster, whether it’s in minikube or on some remote cluster.

Maybe that was too harsh, the jepsen repo does seem to have some control logic for k8s. I don’t know much about it. jepsen/k8s.clj at 5628ee5ff308885670cc854575a6a24967d4c32c · jepsen-io/jepsen · GitHub. I don’t think you’re going to get much support from us trying to make that work. It’s uncharted waters. Good luck. Feel free to report back.

Yes, I am in way over my head at this point. For now I would just be happy to get the Jepsen test to execute on the CockroachDB cluster without k8s involved. I did run some TPC-C tests on my CRDB cluster that was deployed on k8s; I guess I was hoping that the configuration would be similar, but it surely wasn’t.

Any tutorials or guidelines to recommend in order to do just a plain “basic” test?

I really want to share that Jepsen isn’t a fun or interesting thing to just go and run. It makes lots of low-level assumptions. TPC-C is a benchmark, and we work hard to make it easy to run because it can validate hardware configurations and demonstrate the capabilities of the database. We have many other workloads like that. Jepsen isn’t a workload; it’s a suite for checking correctness in the face of chaos. It is very slow, and it’s not something where you compare one run to another. It’s certainly not something where you will get value from looking at different databases and their Jepsen tests without stepping back pretty far and asking what these databases are offering and whether they actually deliver what they claim. You don’t need to take us at our word but, at the same time, this thing is not fun to adapt to new environments. I don’t know what it is you’re trying to do, and I’ll try to be as helpful as I can be, but I do worry you’re misplacing your efforts.

The following file has a sketch of running the test: cockroach/jepsen.go at master · cockroachdb/cockroach · GitHub

All this code seems to assume you’ve got some set of nodes which are running ubuntu and you can SSH into with a private key. If you want to violate that assumption then you’re going to need to change some clojure.
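
One way to check that assumption before touching any Clojure is a small preflight like the sketch below. It is only an illustration: is_ubuntu and check_node are hypothetical helpers written for this example, and the address, user name and key path in the usage comment are placeholders. The ssh options used (-i, BatchMode, ConnectTimeout) are standard OpenSSH options.

```shell
#!/bin/sh
# Succeeds if the os-release text on stdin identifies the OS as Ubuntu.
is_ubuntu() {
  grep -q '^ID=ubuntu$'
}

# Hypothetical helper; usage: check_node HOST USER KEYFILE
# BatchMode=yes makes ssh fail instead of prompting for a password,
# so success means key-based login works and the node runs Ubuntu.
check_node() {
  ssh -i "$3" -o BatchMode=yes -o ConnectTimeout=5 "$2@$1" cat /etc/os-release | is_ubuntu
}

# Example (placeholder address, user and key path):
#   check_node 10.0.0.11 ubuntu "$HOME/.ssh/id_rsa" && echo "ok"
```

If check_node fails for a node, the Jepsen control host is unlikely to be able to manage that node either.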

  1. Clone the jepsen repo.
  2. Make the change to match the tarball structure in the jepsen code (or make a corresponding tarball): cockroach/jepsen.go at 7892e685c967a06601b5b7ce0cd3950da5a56e24 · cockroachdb/cockroach · GitHub
  3. Run lein install in the jepsen root (the jepsen subdir).
  4. rm invoke.log from the cockroach subdir.
  5. Run the test you want to run like this: cockroach/jepsen.go at 7892e685c967a06601b5b7ce0cd3950da5a56e24 · cockroachdb/cockroach · GitHub

Note that you’ll need to provide the addresses of the Ubuntu nodes.
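
The steps above, together with the note about node addresses, could be sketched in shell roughly as follows. This is a sketch under stated assumptions, not a verified recipe: the host addresses, the ubuntu user name and the key path are placeholders, node_flags is a hypothetical helper, and --node, --username and --ssh-private-key are options from Jepsen's generic command line, so confirm them with lein run test --help in your checkout. The test-specific options used in jepsen.go (step 5) and the tarball change (step 2) are deliberately left out. The script only prints the final command so it can be inspected before running.

```shell
#!/bin/sh
# Placeholder addresses of Ubuntu machines reachable over SSH (assumption).
NODES="10.0.0.11 10.0.0.12 10.0.0.13"
SSH_USER="ubuntu"                 # assumed login name
SSH_KEY="$HOME/.ssh/id_rsa"       # assumed private key path

# Hypothetical helper: "a b c" -> "--node a --node b --node c".
node_flags() {
  flags=""
  for host in $1; do
    flags="$flags --node $host"
  done
  echo "${flags# }"
}

# Steps 1, 3 and 4 from the list above, as commands to run by hand
# (step 2, the tarball change, has to be done manually):
#   git clone https://github.com/jepsen-io/jepsen.git
#   (cd jepsen/jepsen && lein install)
#   (cd jepsen/cockroachdb && rm -f invoke.log)

# Step 5, dry run: print the command instead of executing it.
echo "lein run test $(node_flags "$NODES") --username $SSH_USER --ssh-private-key $SSH_KEY"
```

The addresses listed here are the machines Jepsen will SSH into and control, which is why pods behind a Kubernetes service do not fit this model without extra work.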

With enough effort I’m sure you can get this all to work. Good luck.

Thanks for a thorough explanation. The thesis is “Evaluation of CockroachDB in a cloud-native environment”, in which we have tested scaling up/out, stress testing, and measuring throughput (with various cluster configurations). One research question was about ‘consistency’, for which we wanted to run this chaos test. However, now that I have weighed what you pointed out, I think we can put our efforts elsewhere.

A home-brewed script that simulates transactions while taking down nodes isn’t really reliable, so I was hoping this Jepsen test could help us in that regard.

We will rethink our next move. Thanks again for taking the time :slight_smile: