Performance Issue


#1

Hi there,

As we mentioned in our questions on Gitter, the performance is not what we expected: on the recommended VMs with a 3-node cluster, the transaction speed is really disappointing, at 10-20 transactions per second. We have worked hard to address this issue because it is something of an emergency, but we cannot fix it. Could anyone offer advice or a guess as to what the cause might be? We have tried writing directly to the database nodes without passing through our servers, but the problem still exists.

Also, can anyone share some testing programs for reference? Thanks for any contribution!

Best Wishes


(Marc) #2

Hello @Lionxx, could you provide us with the cockroach version, schema, queries, and approximate data sizes/number of results?

It may also be helpful to run EXPLAIN or SHOW TRACE on your statements to see the planning and execution details.
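
For example, a minimal sketch from Python (the bank database, accounts table, and connection parameters here are just placeholders for your own setup):

import psycopg2

conn = psycopg2.connect(database='bank', user='root', host='localhost', port=26257)
with conn.cursor() as cur:
    # EXPLAIN prints the plan for the statement, including whether an
    # index is used or the whole table is scanned.
    cur.execute("EXPLAIN SELECT balance FROM accounts WHERE id = 1")
    for row in cur.fetchall():
        print(row)
conn.close()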


#3

Hi, @marc
Thank you so much for your reply; it could be a life saver. The cockroach version is 2.0.0. Three database nodes are deployed on Azure, but we have tested with a single node as well; the VMs we used are all L4s machines.
Here is the SHOW TRACE log:



Here is part of the testing code. We thought the problem might be bugs in the testing code, so we also tested using dbstress, which reports 100 threads executed in 7 seconds:

# Connect to the cluster.
# Import the driver.
import psycopg2
import psycopg2.errorcodes
conn = psycopg2.connect(database='bank', user='root', host='51.140.4.100', port=26257)


def onestmt(conn, sql):
    with conn.cursor() as cur:
        cur.execute(sql)


# Wrapper for a transaction.
# This automatically re-calls "op" with the open transaction as an argument
# as long as the database server asks for the transaction to be retried.
def run_transaction(conn, op):
    with conn:
        onestmt(conn, "SAVEPOINT cockroach_restart")
        while True:
            try:
                # Attempt the work.
                op(conn)

                # If we reach this point, commit.
                onestmt(conn, "RELEASE SAVEPOINT cockroach_restart")
                break

            except psycopg2.OperationalError as e:
                if e.pgcode != psycopg2.errorcodes.SERIALIZATION_FAILURE:
                    # A non-retryable error; report this up the call stack.
                    raise e
                # Signal the database that we'll retry.
                onestmt(conn, "ROLLBACK TO SAVEPOINT cockroach_restart")


# The transaction we want to run.
def transfer_funds(txn, frm, to, amount):
    with txn.cursor() as cur:
        # Check the current balance.
        cur.execute("SELECT balance FROM accounts WHERE id = %s", (frm,))
        from_balance = cur.fetchone()[0]
        if from_balance < amount:
            # Raising a bare string is invalid; raise an exception instance.
            raise ValueError("Insufficient funds")

        # Perform the transfer.
        cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s",
                    (amount, frm))
        cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s",
                    (amount, to))


# Execute the transaction.
count = 1000
while count > 0:
    run_transaction(conn, lambda conn: transfer_funds(conn, 1, 2, 1))
    print(count)
    count = count - 1

with conn:
    with conn.cursor() as cur:
        # Check account balances.
        cur.execute("SELECT id, balance FROM accounts")
        rows = cur.fetchall()
        print('Balances after transfer:')
        for row in rows:
            print([str(cell) for cell in row])

# Close communication with the database.
conn.close()

Thank you so much again Marc, really appreciate it.


(Marc) #4

This is just the slightly tweaked Python example from our docs. It is provided as a sample use case and is not optimized for much of anything, although it should still manage more than a dozen txns/s.

The programs we use for performance testing can be found in the loadgen repo, though they're all in Go, so they won't really help for developing Python programs.
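
If you want something quick in Python, a rough threaded driver along these lines can at least give you a throughput number (this is just a sketch, not one of our load generators; the host, table, and thread/iteration counts are assumptions to adapt to your own setup):

import threading
import time
import psycopg2

THREADS = 10
TXNS_PER_THREAD = 100

def worker():
    conn = psycopg2.connect(database='bank', user='root',
                            host='localhost', port=26257)
    conn.autocommit = True  # one implicit transaction per statement
    with conn.cursor() as cur:
        for _ in range(TXNS_PER_THREAD):
            cur.execute("SELECT balance FROM accounts WHERE id = %s", (1,))
            cur.fetchone()
    conn.close()

threads = [threading.Thread(target=worker) for _ in range(THREADS)]
start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
total = THREADS * TXNS_PER_THREAD
print('%d statements in %.2fs (%.1f stmts/s)' % (total, elapsed, total / elapsed))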


#5

Hi, @marc
Thank you for your advice. We also used the testing tool "dbstress" to analyse the performance, but it did not help much. So, setting the code itself aside, what could cause such unsatisfactory results? Can you guess at anything in terms of the Azure configuration? We just followed the docs you provided for the Azure deployment. Is there anything we should be aware of in particular?
Best Wishes


(Tim O'Brien) #6

Hey lionxx,

I think @knz got back to you on Gitter about this, but in case you’re still up in the air:

If you want to test CRDB’s performance, it’s probably best to get a sample dataset from your application and test with actual queries, and ensure that they comply with identified SQL best practices (it’s worth reviewing our SQL Best Practices docs). The test code we provided in the Python examples isn’t intended to demonstrate performance - all it does in practice is pull a single ID.

If you want to test performance on a single table, you’d need to do a couple things:

  1. Ensure that the table is set up with indexes.
  2. Ensure that the table is populated with a representative dataset.
  3. Ensure that the queries you’re testing are representative.

The challenge with the example above is that it’s not clear you’re pulling from an indexed column, and run_transaction is only pulling a single value. So the script is really only testing how fast we can retrieve an unindexed value over and over again, and even then only how quickly we can access that single value from a single range. It’s also likely scanning the whole table in order to retrieve that value.
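
As a rough illustration (assuming the accounts schema from the script above, and that id is not already the table’s primary key), you could index the filter column and re-check the plan with EXPLAIN to confirm the full-table scan becomes an index lookup:

import psycopg2

conn = psycopg2.connect(database='bank', user='root', host='localhost', port=26257)
conn.autocommit = True
with conn.cursor() as cur:
    # Give the column used in the WHERE clause a secondary index.
    cur.execute("CREATE INDEX IF NOT EXISTS accounts_id_idx ON accounts (id)")
    # The plan should now show an index lookup instead of a full scan.
    cur.execute("EXPLAIN SELECT balance FROM accounts WHERE id = 1")
    for row in cur.fetchall():
        print(row)
conn.close()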