Performance Issue


#1

Hi there,

As we mentioned in our questions on Gitter, the performance does not seem satisfactory: on the recommended VMs with a 3-node cluster, transaction throughput is really disappointing, only 10-20 transactions per second. We have worked hard to address this, as it is something of an emergency, but we cannot fix it. Could anyone offer advice, or a guess at what the issue might be? We have tried direct writes to the database nodes without passing through our servers, but the problem persists.

Also, can anyone share some of their testing programs for reference? Thanks for any contribution!

Best Wishes


(Marc) #2

Hello @Lionxx, could you provide us with the cockroach version, schema, queries, and approximate data sizes/number of results?

It may also be helpful to try an EXPLAIN or SHOW TRACE on your statements to see the planning and execution details.
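
For example, both can be run straight through psycopg2; a minimal sketch, assuming an open connection like the one in the sample code below (the accounts query is just a placeholder):

# Sketch: inspect planning and execution details via psycopg2.
# Assumes `conn` is an open psycopg2 connection; the query is illustrative.
with conn.cursor() as cur:
    # Show the plan CockroachDB picks for the statement.
    cur.execute("EXPLAIN SELECT balance FROM accounts WHERE id = 1")
    for row in cur.fetchall():
        print(row)

    # Show an execution trace for the same statement (v2.0 syntax).
    cur.execute("SHOW TRACE FOR SELECT balance FROM accounts WHERE id = 1")
    for row in cur.fetchall():
        print(row)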


#3

Hi @marc,
Thank you so much for your reply; it could be a life saver. The cockroach version is 2.0.0. Three database nodes are deployed on Azure, but we have tested with a single node as well; the VMs we used are all L4s machines.
Here is the show trace log:

[show trace output not captured in this transcript]
Here is part of the testing code; we thought it might be a bug in our testing code, but we have also tested with dbstress, which ran 100 threads in 7 seconds:

# Import the driver.
import psycopg2
import psycopg2.errorcodes

# Connect to the cluster.
conn = psycopg2.connect(database='bank', user='root', host='51.140.4.100', port=26257)


# Execute a single statement on a fresh cursor.
def onestmt(conn, sql):
    with conn.cursor() as cur:
        cur.execute(sql)


# Wrapper for a transaction.
# This automatically re-calls "op" with the open transaction as an argument
# as long as the database server asks for the transaction to be retried.
def run_transaction(conn, op):
    with conn:
        onestmt(conn, "SAVEPOINT cockroach_restart")
        while True:
            try:
                # Attempt the work.
                op(conn)

                # If we reach this point, commit.
                onestmt(conn, "RELEASE SAVEPOINT cockroach_restart")
                break

            except psycopg2.OperationalError as e:
                if e.pgcode != psycopg2.errorcodes.SERIALIZATION_FAILURE:
                    # A non-retryable error; report this up the call stack.
                    raise e
                # Signal the database that we'll retry.
                onestmt(conn, "ROLLBACK TO SAVEPOINT cockroach_restart")


# The transaction we want to run.
def transfer_funds(txn, frm, to, amount):
    with txn.cursor() as cur:
        # Check the current balance.
        cur.execute("SELECT balance FROM accounts WHERE id = %s", (frm,))
        from_balance = cur.fetchone()[0]
        if from_balance < amount:
            raise Exception("Insufficient funds")

        # Perform the transfer.
        cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s",
                    (amount, frm))
        cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s",
                    (amount, to))


# Execute the transfer transaction 1,000 times, counting down.
for count in range(1000, 0, -1):
    run_transaction(conn, lambda conn: transfer_funds(conn, 1, 2, 1))
    print(count)

with conn:
    with conn.cursor() as cur:
        # Check account balances.
        cur.execute("SELECT id, balance FROM accounts")
        rows = cur.fetchall()
        print('Balances after transfer:')
        for row in rows:
            print([str(cell) for cell in row])

# Close communication with the database.
conn.close()
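
A simple way to turn the loop above into a transactions-per-second measurement (the timing code is an illustrative sketch, not part of the original test, and assumes `conn` is still open):

import time

# Time N transfer transactions over the single connection above.
start = time.time()
N = 1000
for _ in range(N):
    run_transaction(conn, lambda c: transfer_funds(c, 1, 2, 1))
elapsed = time.time() - start
print('%d transactions in %.1f s = %.1f txn/s' % (N, elapsed, N / elapsed))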

Thank you so much again, Marc. Much appreciated.


(Marc) #4

This is just the slightly tweaked Python example from our docs. It is provided as a sample use case and is not optimized for much of anything, although it should still manage more than a dozen txns/s.
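
For instance, the sample drives one transaction at a time over a single connection, so throughput is bounded by per-statement round-trip latency; running several connections in parallel usually raises aggregate throughput. A rough sketch, reusing run_transaction and transfer_funds from the sample above (the thread count, connection parameters, and account numbering are illustrative assumptions):

import threading
import psycopg2

def worker(thread_id, iterations):
    # One connection per thread: psycopg2 connections should not be
    # shared across threads.
    conn = psycopg2.connect(database='bank', user='root',
                            host='51.140.4.100', port=26257)
    for _ in range(iterations):
        # Spread transfers across distinct account pairs; if every worker
        # hit the same two rows, contention would force constant retries.
        frm, to = 2 * thread_id + 1, 2 * thread_id + 2
        run_transaction(conn, lambda c: transfer_funds(c, frm, to, 1))
    conn.close()

threads = [threading.Thread(target=worker, args=(i, 100)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()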

The programs we use for performance testing can be found in the loadgen repo, though they're all in Go, so they won't really help for developing Python programs.


#5

Hi @marc,
Thank you for your advice. We also used the testing tool "dbstress" to analyse the performance, but it did not help much. So, setting the code itself aside, what could be causing such an unsatisfying result? Is there anything you would guess at in terms of the Azure configuration? We just followed the docs you provided for Azure deployment. Is there anything in particular we should be aware of?
Best Wishes