Trying to split same key range


#21

@benesch @nathan - Please find the output of the below command attached. output4.ttf (35.2 KB)

curl "http://10.153.189.146:8091/debug/logspy?count=100&duration=30s&grep=split,n2,s2,r186" >> output4.txt

#22

@benesch @nathan - Can you also let me know

  1. What could have caused this issue?
  2. How can we avoid this?
  3. How can we recover the system from this and restore the normal state?

(nathan) #23

Aha! It looks like the following line is the culprit:

I180501 02:24:15.005467 26305083 internal/client/db.go:556 [split,n2,s2,r186/3:/Table/54/{3/10107…-4/1921-…}] failed batch: command is too large: 87076188 bytes (max: 67108864)

I’m going to open an issue in Github to explore why we’re seeing this, but for now, I think I know how we can fix it. Could you run the following command and watch too see if the error goes away?

SET CLUSTER SETTING kv.raft.command.max_size='100000000';


#24

@nathan - I am not seeing this more after executing the command.

Can you explain this a bit more what does this command do?

I also have a follow up question . I was seeing log like below

initiating a split of this range at key /Table/54/4/1921-09-04T06:01:02.831999999Z/1010711/605/

But I don’t see any time range in 1921;

root@:26257/vnera> select updatetimestamp from config_str order by updatetimestamp asc limit 1;
+-------------------------------+
|        updatetimestamp        |
+-------------------------------+
| 2018-04-26 19:13:25.678+00:00 |
+-------------------------------+
(1 row)

(nathan) #25

Can you explain this a bit more what does this command do?

The command bumps the maximum size of a Raft command. We place a limit on this size for stability reasons, but bumping it up a little bit is usually perfectly safe. I’m not sure at the moment what was causing this command to grow so large, but I’ve opened https://github.com/cockroachdb/cockroach/issues/25233 to track this.

What version of CockroachDB are you using?

But I don’t see any time range in 1921;

Could you return the result of SHOW CREATE TABLE config_str? This might be explained by the ordering of updatetimestamp in index 4. It’s possible that the key looks like that because the index is descending.


#26

@nathan - I am using cockroach v2.0.0

root@:26257/vnera> show create table config_str;
+------------+--------------------------------------------------------------------------+
|   Table    |                               CreateTable                                |
+------------+--------------------------------------------------------------------------+
| config_str | CREATE TABLE config_str (                                                |
|            |                                                                          |
|            |     customerid INTEGER NOT NULL,                                         |
|            |                                                                          |
|            |     objecttype INTEGER NOT NULL,                                         |
|            |                                                                          |
|            |     objectid BIGINT NOT NULL,                                            |
|            |                                                                          |
|            |     propertyname STRING NOT NULL,                                        |
|            |                                                                          |
|            |     timeinstant BIGINT NOT NULL,                                         |
|            |                                                                          |
|            |     updatetimestamp TIMESTAMP WITH TIME ZONE NULL DEFAULT now(),         |
|            |                                                                          |
|            |     value STRING NULL,                                                   |
|            |                                                                          |
|            |     CONSTRAINT "primary" PRIMARY KEY (customerid ASC, objecttype ASC,    |
|            | propertyname ASC, objectid ASC, timeinstant DESC),                       |
|            |                                                                          |
|            |     INDEX cid_ot_prop_oid_time_uts (customerid ASC, objecttype ASC,      |
|            | propertyname ASC, objectid ASC, timeinstant ASC, updatetimestamp ASC),   |
|            |                                                                          |
|            |     INDEX cid_uts (customerid ASC, updatetimestamp ASC),                 |
|            |                                                                          |
|            |     INDEX uts_desc_cid_ot (updatetimestamp DESC, customerid ASC,         |
|            | objecttype ASC),                                                         |
|            |                                                                          |
|            |     INDEX uts_cid_ot (updatetimestamp ASC, customerid ASC, objecttype    |
|            | ASC),                                                                    |
|            |                                                                          |
|            |     INDEX ot_uts_cid (objecttype ASC, updatetimestamp ASC, customerid    |
|            | ASC),                                                                    |
|            |                                                                          |
|            |     FAMILY "primary" (customerid, objecttype, objectid, propertyname,    |
|            | timeinstant, updatetimestamp, value)                                     |
|            |                                                                          |
|            | )                                                                        |
+------------+--------------------------------------------------------------------------+
(1 row)

Time: 12.168264ms

(Nikhil Benesch) #27

@nathan good intuition. Looks like index 4 is indeed descending on updatetimestamp:

INDEX uts_desc_cid_ot (updatetimestamp DESC, customerid ASC, objecttype ASC)


#28

@nathan @benesch - It is still not clear to me. How is 1921 showing up in the index name

initiating a split of this range at key /Table/54/4/1921-09-04T06:01:02.831999999Z/1010711/605/

As I have posted above the earliest date in updatetimestamp is 2018-04-26.

root@:26257/vnera> select updatetimestamp from config_str order by updatetimestamp asc limit 1;
+-------------------------------+
|        updatetimestamp        |
+-------------------------------+
| 2018-04-26 19:13:25.678+00:00 |
+-------------------------------+
(1 row)

Can you please explain this a little more?


(nathan) #29

@tuk keys (byte arrays) in CockroachDB’s distributed key value store are sorted lexicographically. This allows the SQL layer to perform scans of the keys and get back ordered results, which is how scans on indexes are made efficient. This is explained in detail in this blog post, although we have made a few changes to this SQL<->KV mapping since that was published, so don’t take it as a single source of truth.

The blog post gives an example of the table test, which has an index on key INT. This means that the value of key is part of every key. If a table has a row where key=10 and a row where key=11, the keys of the rows will look something like /test/1/10 and /test/1/11 (the 1 means index 1). Note that these are pretty printed key representations - the real keys are just encoded byte arrays. However, regardless of pretty printing, we take care to ensure that the relative sorting of the keys are the same whether they are in SQL values (10 < 11) or in KV encoded values (encode(10) < encode(11)). This is what allows us to scan ordered keys in their encoded format and get back ordered SQL values. Based on some of your previous comments, I think you actually knew all that!

Things get a touch more difficult with descending indexes. With these indexes, larger values actually come before smaller values. For instance, we want 11 to sort before 10 so that we can still perform a forward scan to retrieve SQL values in descending sorted order. But our KV map only has one ordering direction, ascending. To get around this, we play a little trick. When encoding SQL values to byte array keys in descending indexes, we first invert the SQL value. For instance, 10 is inverted to -10, and 11 is inverted to -11. In fact, there is an inversion function for all data types such that f(f(x)) = x and if x<y then f(x) > f(y). This ensures that the keys that we encode give the desired ordering (11 < 10 and (encode(-11) < encode(-10))). The only other consideration is that we have to remember to invert the SQL value again when decoding.

This works in all cases where we have information about the table readily available - if we see a key in a descending index then we know to invert the value when decoding. However, when a key is accessed deep inside our system, like it was in this log message, we don’t have easy access to schema information and don’t know we’re reading a descending index, so we don’t bother inverting the SQL value when pretty printing. This doesn’t cause any issues, other than the occasional confusion.

So long story short, 1921-09-04T06:01:02.831999999Z must be the inverted value of 2018-04-26 19:13:25.678+00:00. Whenever this value is accessed through SQL, the key will be correctly inverted. However, in these logs, we’re left with something one step closer to the raw key.


#30

@nathan- Should this be fixed to show the actual (non-inverted) key in logs to avoid confusion?


(nathan) #31

Ideally yes it would be nice, but in practice, I don’t think it will be changed any time soon. The problem is that we have no easy access to metadata about the SQL schema while at that level of the system. Retrieving this information would not only be a breach of the SQL/KV abstraction barrier, but could also require communicating with a computer half-way across the world.