Insert queries very slow after upgrading from 2.1.6 to 19.1.4

Cloud Provider: google
Setup: three region GCE deployments (us-west, us-central, us-east) with a n1-highmem-8 machine in each region (total 3 nodes)
Load balancer: redundant HAProxy

We’ve been using cockroach for over a year now and we recently upgraded to 19.1.4. One of our business critical tables is taking over 300 seconds to insert 10 rows in a single statement, when in the past it would be 300ms in the 99th percentile for 1000 rows in a single statement.

We’re also seeing major spikes in service latency, GC Pause Time, CPU Time, and Memory Usage even after disconnecting the database from all production clients.

This issue only seems to affect our largest table which is currently 150GB with 1968 ranges. We have an empty clone of the same table in another database and insertions behave as expected there, very quickly.

Insert Query (shortened for brevity)

INSERT INTO sessions (event_id, geo_city, geo_continent, geo_continent_code, geo_country, geo_country_code, geo_latitude, geo_location_accuracy, geo_longitude, geo_metro_code, geo_postal_code, geo_region, geo_region_code, geo_time_zone, session_id, session_ip, site_id, time, user_avatar, user_created, user_email, user_id, user_name, user_service, user_uid, user_username)
    VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26),
 ($27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45, $46, $47, $48, $49, $50, $51, $52),
 ($53, $54, $55, $56, $57, $58, $59, $60, $61, $62, $63, $64, $65, $66, $67, $68, $69, $70, $71, $72, $73, $74, $75, $76, $77, $78),
 ($79, $80, $81, $82, $83, $84, $85, $86, $87, $88, $89, $90, $91, $92, $93, $94, $95, $96, $97, $98, $99, $100, $101, $102, $103, $104)
 ON CONFLICT (site_id, session_id, event_id) DO NOTHING;

Explain of the insert query

 count
└── insert
│               into     sessions(event_id, geo_city, geo_continent, geo_continent_code, geo_country, geo_country_code, geo_latitude, geo_location_accuracy, geo_longitude, geo_metro_code, geo_postal_code, geo_region, geo_region_code, geo_time_zone, time, session_id, session_ip, site_id, user_avatar, user_created, user_email, user_id, user_name, user_service, user_uid, user_username)
│               strategy inserter
└── render
└── filter
│               filter   site_id IS NULL
└── lookup-join
│               table    sessions@user_id_idx
│               type     left outer
│               pred     (@15 = @28) AND (@1 = @27)
└── values
                size     26 columns, 32 rows

Session Table Below

CREATE TABLE sessions (
	event_id STRING NOT NULL,
	geo_city STRING NULL,
	geo_continent STRING NULL,
	geo_continent_code STRING NULL,
	geo_country STRING NULL,
	geo_country_code STRING NULL,
	geo_latitude FLOAT8 NULL,
	geo_location_accuracy INT NULL,
	geo_longitude FLOAT8 NULL,
	geo_metro_code STRING NULL,
	geo_postal_code STRING NULL,
	geo_region STRING NULL,
	geo_region_code STRING NULL,
	geo_time_zone STRING NULL,
	"time" TIMESTAMP NULL,
	session_id STRING NOT NULL,
	session_ip STRING NULL,
	site_id STRING NOT NULL,
	user_avatar STRING NULL,
	user_created TIMESTAMP NULL,
	user_email STRING NULL,
	user_id STRING NULL,
	user_name STRING NULL,
	user_service STRING NULL,
	user_uid STRING NULL,
	user_username STRING NULL,
	CONSTRAINT "primary" PRIMARY KEY (site_id ASC, session_id ASC, event_id ASC),
	INDEX user_id_idx (site_id ASC, user_id ASC),
	INDEX user_uid_idx (site_id ASC, user_uid ASC),
	INDEX user_edit_idx (site_id ASC, user_id ASC) STORING (user_avatar, user_created, user_email, user_name, user_service, user_uid, user_username),
	INDEX user_id_session_id_idx (user_id ASC, session_id ASC) STORING (user_avatar, user_created, user_email, user_name, user_service, user_uid, user_username),
	FAMILY "primary" (event_id, geo_city, geo_continent, geo_continent_code, geo_country, geo_country_code, geo_latitude, geo_location_accuracy, geo_longitude, geo_metro_code, geo_postal_code, geo_region, geo_region_code, geo_time_zone, "time", session_id, session_ip, site_id, user_avatar, user_created, user_email, user_id, user_name, user_service, user_uid, user_username)
)

Any help would be appreciated.





Hi @Steve_Maestro,

Sorry you’re experiencing this. Can you please send a debug.zip from your cluster? That will help us be able to diagnose what’s going on. https://www.cockroachlabs.com/docs/stable/debug-zip.html

Thanks,
Jordan

It’s possible that this is being caused by a bad plan generated by the cost-based optimizer, which plans all mutations by default in 19.1. In 2.1, the optimizer only planned non-mutations by default - so it’s possible that the new mutation plans for your queries are causing bad performance.

As a test, you might want to try turning off the cost-based optimizer entirely, to see if that resolves the problem. Please note that this might have other adverse effects on queries, but validating the root cause would be a good first step. You can do that with:

SET CLUSTER SETTING sql.defaults.optimizer=off

To revert, use on instead of off.

Jordan

Also, during this test, please note that the cluster setting I mentioned only changes the behavior for new sessions - pre-existing sessions won’t get the behavior change.

Hi @Steve_Maestro,

Sorry again that you’re experiencing this. It seems that you are probably running into a bug that we’ve fixed for the upcoming 19.2 release, but is not yet back ported to 19.1. As a workaround, with the 19.1 release you could run the following instead of your INSERT query:

INSERT
INTO
    sessions
        (
            event_id,
            geo_city,
            geo_continent,
            geo_continent_code,
            geo_country,
            geo_country_code,
            geo_latitude,
            geo_location_accuracy,
            geo_longitude,
            geo_metro_code,
            geo_postal_code,
            geo_region,
            geo_region_code,
            geo_time_zone,
            session_id,
            session_ip,
            site_id,
            "time",
            user_avatar,
            user_created,
            user_email,
            user_id,
            user_name,
            user_service,
            user_uid,
            user_username
        )
SELECT
    vals.*
FROM
    (
        VALUES
            ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26),
            ($27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45, $46, $47, $48, $49, $50, $51, $52),
            ($53, $54, $55, $56, $57, $58, $59, $60, $61, $62, $63, $64, $65, $66, $67, $68, $69, $70, $71, $72, $73, $74, $75, $76, $77, $78),
            ($79, $80, $81, $82, $83, $84, $85, $86, $87, $88, $89, $90, $91, $92, $93, $94, $95, $96, $97, $98, $99, $100, $101, $102, $103, $104)
    )
        AS vals
            (
                event_id,
                geo_city,
                geo_continent,
                geo_continent_code,
                geo_country,
                geo_country_code,
                geo_latitude,
                geo_location_accuracy,
                geo_longitude,
                geo_metro_code,
                geo_postal_code,
                geo_region,
                geo_region_code,
                geo_time_zone,
                session_id,
                session_ip,
                site_id,
                "time",
                user_avatar,
                user_created,
                user_email,
                user_id,
                user_name,
                user_service,
                user_uid,
                user_username
            )
    LEFT JOIN sessions@primary AS s ON vals.site_id = s.site_id AND vals.session_id = s.session_id AND vals.event_id = s.event_id
WHERE
    s.site_id IS NULL;

This query is basically a hand-written version of the plan the optimizer creates to support ON CONFLICT DO NOTHING. By writing it this way, it allows us to add the sessions@primary index hint to ensure that the optimizer will use the correct index.

Please let me know if this query fixes the performance issues you’re seeing.

@becca, your workaround worked! Just one modification, I had to add types to the placeholders so instead of $1 I used $1::STRING

@Steve_Maestro glad to hear it! If you upgrade to 19.2 when it’s released later this fall, you should be able to go back to using the original query without seeing a performance drop. In the mean time, this modified query will be a good workaround.