Count(*) seems to return incorrect result

Hi, I’m trying to understand why count(*) returns a different row count than I get with a plain SELECT *.

I am running a single node in insecure mode on my laptop with:
cockroach start --insecure --listen-addr=localhost -s /opt/crdb

The count query:

 select count(*) from "MemberMessage";
  count  
+-------+
  26202  
(1 row)

Time: 70.411608ms

When doing a select *:

select * from "MemberMessage"; 
<row content>
(22 rows)

Time: 28.520901ms

When selecting just the id, I get all the rows.

select id from "MemberMessage";         
warning: buffering more than 10000 result rows in client - RAM usage growing, consider another formatter instead
   id    
+-------+
  32346  
  32349  
<many more>
(26202 rows)

Time: 69.946766ms

Prior to this I did try to delete all “MemberMessage” rows at once, which resulted in a message that the transaction was too large and that I needed to split the delete up.
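Batching the delete with a LIMIT clause, as I end up doing further down, works around that; roughly (the batch size is arbitrary):

-- delete in batches instead of one huge transaction;
-- repeat until the statement reports DELETE 0
delete from "MemberMessage" limit 4000;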

update:
I also do not seem to be able to delete these “hidden” rows with DELETE FROM "MemberMessage" WHERE 1=1;
(I guess the title is wrong, as it is the SELECT statement that does not return all rows.)

Hey,

What version of CockroachDB are you running?

Just to confirm, there are supposed to be 26202 rows and select * is returning 22 rows?

Can you elaborate on the “hidden rows”?

Thanks,
Matt

Hi Matt, I’m running v19.1.5; this is the full output when starting:

*
* WARNING: RUNNING IN INSECURE MODE!
* 
* - Your cluster is open for any client that can access localhost.
* - Any user, even root, can log in without providing a password.
* - Any user, connecting as root, can read or write any data in your cluster.
* - There is no network encryption nor authentication, and thus no confidentiality.
* 
* Check out how to secure your cluster: https://www.cockroachlabs.com/docs/v19.1/secure-a-cluster.html
*
CockroachDB node starting at 2019-10-23 16:10:00.634570411 +0000 UTC (took 1.1s)
build:               CCL v19.1.5 @ 2019/09/23 14:12:16 (go1.11.6)
webui:               http://localhost:8080
sql:                 postgresql://root@localhost:26257?sslmode=disable
client flags:        cockroach <client cmd> --host=localhost:26257 --insecure
logs:                /opt/crdb/logs
temp dir:            /opt/crdb/cockroach-temp300492176
external I/O path:   /opt/crdb/extern
store[0]:            path=/opt/crdb
status:              restarted pre-existing node
clusterID:           65556e3c-3a9f-42bf-a6ef-9d695a5168ac
nodeID:              1

By hidden rows I mean rows that I cannot get to show up with select *, but which are there when selecting only the id or when using count(*).

I have tested this a bit more and it looks like I am seeing rows that are still queued to be deleted.
I can reproduce it by:

  • Importing the test database;
  • Deleting, say, 8000 rows a couple of times;
  • Checking how many rows remain with select count(*);
  • Checking the actual data in the rows with select id or select body.

Example:

c1nodejshosdev=# delete from "MemberMessage" limit 7000;
DELETE 7000
c1nodejshosdev=# select count(*) from "MemberMessage";
 count 
-------
 26885
(1 row)

c1nodejshosdev=# delete from "MemberMessage" limit 8000;
DELETE 708
c1nodejshosdev=# select count(*) from "MemberMessage";
 count 
-------
 26458
(1 row)

c1nodejshosdev=# select count(*) from "MemberMessage";
 count 
-------
 26458
(1 row)

c1nodejshosdev=# select body from "MemberMessage";
 body 
------
(0 rows)

c1nodejshosdev=# select count(*) from "MemberMessage";
 count 
-------
     0
(1 row)

So the problem goes away after a while, but I find it strange that, after all rows are deleted and the delete statement has returned, a select count(*) still returns rows.

And even though it is not visible in this sequence of queries, sometimes I can select the id field but not the body field after having supposedly deleted all rows. So apparently there is a window after a delete operation during which the database returns inconsistent results.
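In case it helps, one check I could run while this is happening is to look at where each count is being read from (just a sketch: EXPLAIN and the table@index hint are standard CockroachDB syntax, and "primary" is the default name of the primary index):

-- show which index each count scans
explain select count(*) from "MemberMessage";
explain select count(body) from "MemberMessage";

-- force the count to read from the primary index
select count(*) from "MemberMessage"@primary;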

Thanks for bringing this up, I’ll investigate why this is occurring.

Can you share your DDL with me?

Thanks,
Matt

Certainly, what is the best way to send you this file?

Another example: the first query shows that every row has a body field.
Once rows are deleted, the difference between select count(*) and select count(body) becomes clear. After a couple of seconds all rows are really gone and the correct counts are returned; in some of my tests under higher load this took much longer.

c1nodejshosdev=# IMPORT PGDUMP 'nodelocal:////dbexports.pgsql';
       job_id       |  status   | fraction_completed |  rows  | index_entries | system_records |   bytes   
--------------------+-----------+--------------------+--------+---------------+----------------+-----------
 497414611555975169 | succeeded |                  1 | 612962 |       1888034 |              0 | 295202223
(1 row)

c1nodejshosdev=# select id as missing_body from "MemberMessage" where body IS NULL OR body = '';
select count(*) as count_x from "MemberMessage";
delete from "MemberMessage" limit 4000;
select count(*) as count_x from "MemberMessage";
delete from "MemberMessage" limit 4000;
select count(*) as count_x from "MemberMessage";
delete from "MemberMessage" limit 4000;
select count(*) as count_x from "MemberMessage";
select count(body) as count_body from "MemberMessage";
delete from "MemberMessage" limit 4000;
select count(*) as count_x from "MemberMessage";
delete from "MemberMessage" limit 4000;
select count(*) as count_x from "MemberMessage";
select count(body) as count_body from "MemberMessage";
delete from "MemberMessage" limit 4000;
select count(*) as count_x from "MemberMessage";
select count(body) as count_body from "MemberMessage";
delete from "MemberMessage" limit 4000;
select count(*) as count_x from "MemberMessage";
select count(body) as count_body from "MemberMessage";
delete from "MemberMessage" limit 4000;
select count(*) as count_x from "MemberMessage";
select count(body) as count_body from "MemberMessage";
delete from "MemberMessage" limit 4000;
select count(*) as count_x from "MemberMessage";
select count(body) as count_body from "MemberMessage";
delete from "MemberMessage" limit 4000;
select count(*) as count_x from "MemberMessage";
select count(body) as count_body from "MemberMessage";
delete from "MemberMessage" limit 4000;
select count(*) as count_x from "MemberMessage";
select count(body) as count_body from "MemberMessage";
delete from "MemberMessage" limit 4000;
select count(*) as count_x from "MemberMessage";
select count(body) as count_body from "MemberMessage";
 missing_body 
--------------
(0 rows)

 count_x 
---------
   42708
(1 row)

DELETE 4000
 count_x 
---------
   41170
(1 row)

DELETE 4000
 count_x 
---------
   39642
(1 row)

DELETE 4000
 count_x 
---------
   38046
(1 row)

 count_body 
------------
      30708
(1 row)

DELETE 4000
 count_x 
---------
   36487
(1 row)

DELETE 4000
 count_x 
---------
   34923
(1 row)

 count_body 
------------
      22708
(1 row)

DELETE 4000
 count_x 
---------
   33324
(1 row)

 count_body 
------------
      18708
(1 row)

DELETE 4000
 count_x 
---------
   31658
(1 row)

 count_body 
------------
      14708
(1 row)

DELETE 4000
 count_x 
---------
   29986
(1 row)

 count_body 
------------
      10708
(1 row)

DELETE 4000
 count_x 
---------
   28093
(1 row)

 count_body 
------------
       6708
(1 row)

DELETE 4000
 count_x 
---------
   26456
(1 row)

 count_body 
------------
       2708
(1 row)

DELETE 2708
 count_x 
---------
   25251
(1 row)

 count_body 
------------
          0
(1 row)

DELETE 0
 count_x 
---------
   25251
(1 row)

 count_body 
------------
          0
(1 row)

c1nodejshosdev=# 
c1nodejshosdev=# 
c1nodejshosdev=# 
c1nodejshosdev=# select count(*) as count_x from "MemberMessage";
 count_x 
---------
   25251
(1 row)

c1nodejshosdev=# select count(body) as count_body from "MemberMessage";
 count_body 
------------
          0
(1 row)

c1nodejshosdev=# 

You can paste it here or send me a private message.

I was able to narrow it down to a circular (self-referencing) foreign key in the table. After removing this constraint (a “parent_id” column referencing the primary “id” column in the same table), all statements were fast and consistent.
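For illustration, the shape of the table was roughly like this (a simplified sketch, not the actual DDL; the column types, the secondary index, and the constraint name fk_parent are placeholders of my own):

create table "MemberMessage" (
    id        int8 primary key,
    parent_id int8,
    body      string,
    index (parent_id),
    -- the self-referencing foreign key that triggered the problem
    constraint fk_parent foreign key (parent_id) references "MemberMessage" (id)
);

-- removing this constraint is what made the statements fast and consistent again
alter table "MemberMessage" drop constraint fk_parent;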

Interesting, thanks for the update.

Interesting. That’s a common design pattern for enterprise Java programs that use JPA subclassing; I’ll have to keep that in mind if we use CockroachDB for JPA data in the future.

Hi @Arend

Could you message me the DDL for the table that was affected, with the self-referencing foreign key? I am trying to replicate the issue in-house to identify the actual cause, and this would be really appreciated.

Cheers,
Ricardo

Hi Ricardo,

Thanks for the follow-up.
Please see DM.

Hey @Arend

I have tried to reproduce the issue using similar schemas on a geo-replicated cluster, and I have been unable to do so. I would like to troubleshoot your database further to determine why yours is behaving differently. Please see my DM, and we will continue to work on this through that medium.

Cheers,
Ricardo