How to access other databases like Oracle and PostgreSQL from CockroachDB?

Hi, I am investigating a migration from Oracle to PostgreSQL, or maybe even CockroachDB. For that I need to transfer my data. In PostgreSQL I used oracle_fdw to pull the data over. How can I accomplish the same from CockroachDB?

thanks,
Ronald.

There’s nothing like this built into CockroachDB currently. Instead, the best way to transfer data from another database into Cockroach is to dump the data from the other database, transform it into INSERT or COPY statements that Cockroach can understand, and then pipe those statements into cockroach (perhaps using the cockroach sql shell), as in the sketch below. There’s a tool called pgloader that facilitates migrations like this into PostgreSQL; I haven’t tried it with Cockroach, but it might work.
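
For example, a minimal sketch of that pipeline (the database, table, and file names are only placeholders):

$ cat inserts.sql
INSERT INTO accounts (id, balance) VALUES (1, 100.00), (2, 250.50);
INSERT INTO accounts (id, balance) VALUES (3, 75.25), (4, 10.00);

$ cockroach sql --insecure --database=mydb < inserts.sql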

Thanks for your reply, Ben.
It’s a small test case I intend to use, with about 200M rows and about 100 tables. In Oracle, executing single insert statements row by row would take ages unless you use array inserts. I could write a little Python app that pulls the data from Oracle using big fetch sizes. Can we do array inserts into CockroachDB?
I guess I do need something similar to reach an acceptable speed…
Using oracle_fdw I could migrate within 1 hour, which would be OK for this case. Do you have any more tips/advice on this?

thanks,
Ronald.

Cockroach’s INSERT statement accepts multiple VALUES tuples (see the example below). For best performance, we recommend inserting about 500 rows per INSERT statement. You can also speed things up by removing all indexes (except the primary key) and foreign key constraints during the import, and adding them back once the data is all there.
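
A rough sketch of both points (table and column names are just placeholders; real batches would be closer to 500 rows per statement):

CREATE TABLE customers (id INT PRIMARY KEY, name STRING, created_at DATE);

-- multi-row INSERT: many VALUES tuples per statement
INSERT INTO customers (id, name, created_at) VALUES
  (1, 'alice', '2017-01-01'),
  (2, 'bob',   '2017-01-02'),
  (3, 'carol', '2017-01-03');

-- add secondary indexes (and FK constraints) only after the load finishes
CREATE INDEX ON customers (name);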

Thanks again, Ben.
Normally I do build the indexes after the data is present; that saves ages of time. I will see if I can rewrite my migration to PostgreSQL, which used oracle_fdw, into a pure client implementation that fetches from one database and stores into the other.
Is database connectivity on the roadmap?
I do like the FDW mechanism in PostgreSQL…

FDW-style internal connections to other databases are not on the roadmap; instead, we’re planning to work on migration tools that live outside the database itself (similar to pgloader).

That’s a pity. Thus far I have not seen many databases that had no connections to other databases at all. Of course this can be done in the middleware, but for the best performance it is often smarter to let the database sort this out. Is there a design philosophy that rules out inter-database connectivity?

We don’t have any philosophical objection to it; it just seems easier to do it outside the database itself. I have never personally used these features in other databases, so I’m curious: why do you find them so much better than an external process? What makes it possible to achieve higher performance this way?

Mostly because it allows smarter use of the database optimizer to find the plans to access the data, no matter where the data is located. And yes, reading the data from another source will always bring some extra fun. The point is, data is best managed where it lives: in the database.
From Oracle, I can quickly access all data behind a database link without having to mess with any client code for that specific source. Just SQL/PLSQL.
From PostgreSQL, using FDW allows for the same. Very powerful.

Having this makes the data location transparent.

Oh, and don’t forget the consistency issues when wrestling with transactions in multiple databases. From databases that are connected you might expect them to have transactions and a 2PC protocol. Having control in one place makes life a lot easier for the developers.

Won’t the PostgreSQL COPY API work? I use an implementation in Go documented here:

(there are probably Java implementations too).

It is very fast and I use it a lot against PG… but I haven’t tried it against CDB yet. I extract from the source in CSV format, then bulk load into PG.
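
For reference, the bulk-load side with plain psql against PG looks roughly like this (the connection string, table, and file names are only placeholders, and I have not verified this same COPY path against CDB):

$ psql "postgresql://user@pghost:5432/bulk" \
    -c "\copy testcsv FROM 'testdates.csv' WITH CSV HEADER"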

I was going to try this with CDB using the Windows binary, but couldn’t get it started (this issue will be another topic in a few minutes).

Thanks for your reply, Cecile.
I am not so fond of having to extract to files first before I can insert. Right now I am in a small test setup with only about 100 tables and some 200M rows, so here extract-and-load could be doable. With a distributed option it is a lot easier. One might even see a scenario in which the data is migrated in steps, with the app working through views over a database link for the tables that have yet to be migrated, as sketched below.
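
Something like this on the PostgreSQL side, where the view name is what the app queries and the foreign table (or database link) still points at the old source; all names here are placeholders:

-- the app keeps querying emp; the data still lives in Oracle behind the FDW
CREATE VIEW emp AS SELECT * FROM fdw_oracle_emp;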

I think you have a pretty good argument for supporting these kinds of connections. We are aware that there is currently no very fast or easy way to get large amounts of existing data into CockroachDB. We do have fast ways to get existing CRDB data back in, and we had hoped to document and write an intermediate format that could feed that path. Then we and users could write the various exporters from other systems into that format and get fast inserts. Using these connectors could reasonably be that method, since it would remove a lot of the complication around what the intermediate format is. I’ll bring this up in the appropriate weekly meeting today.

Glad to see your response. Since looking wider than Oracle, I came across FDW and have used it. I think this is a nice way to let developers write vendor-specific FDW implementations, like oracle_fdw. Doing so would not only enable easier transactions over multiple databases but also give the best of both worlds for migrations: combining the fastest insert speed with the fastest fetch speed from the source database. I have not taken a deep dive into CDB yet, but given the way it looks, there are some similarities with PostgreSQL. One might even think that adopting the FDW approach would be a feasible one to implement in CDB.

Makes sense. But for completeness, I did test it, and the PG bulk loader works fine. Below is a transcript. The “pgcopy” code can be found at https://github.com/mandolyte/csv-utils.

Some commands to get started using the built-in command-line client:

$ winpty ./cockroach sql --insecure
# Welcome to the cockroach SQL interface.
# All statements must be terminated by a semicolon.
# To exit: CTRL + D.
root@:26257/> create database bulk;
CREATE DATABASE
root@:26257/> create table bulk.testcsv ( id int primary key, datex text);
CREATE TABLE
root@:26257/>

Now the bulk loader:

$ export PGCRED=postgresql://root@localhost:26257?sslmode=disable
$ ./pgcopy -input testdates.csv -schema bulk -table testcsv -urlvar PGCRED
Start Time: 2017-05-15 13:31:06.0099273 -0400 EDT
Changing all headers to lowercase!
Stop Time: 2017-05-15 13:31:08.7209332 -0400 EDT
Total run time: 2.7110059s
Inserted 5 rows
$

Now back to the SQL prompt:

root@:26257/> select * from bulk.testcsv
           -> ;
+----+------------+
| id |   datex    |
+----+------------+
|  1 | 2017-01-01 |
|  2 | 2016-12-25 |
|  3 | 2017-09-01 |
|  4 | \N         |
|  5 | 2017-04-01 |
+----+------------+
(5 rows)
root@:26257/>

Why not put PostgreSQL in the middle?
Create an FDW to Oracle to read the data and an FDW for Cockroach (use the PostgreSQL FDW here) to write to Cockroach.
Then it’s just a single SQL statement: insert into FDW_Cockroach select * from FDW_Oracle.
I did the same to move data from Cassandra to Oracle: again, an FDW for Cassandra as the source and an FDW for Oracle as the target… it worked like a charm. All it takes is to verify that the PostgreSQL FDW for PostgreSQL works with Cockroach. A rough sketch is below.
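
Roughly, the setup on the PostgreSQL side would look like this (server names, credentials, and column definitions are only illustrative placeholders, and I have not verified postgres_fdw against Cockroach myself):

-- source: oracle_fdw pointing at the Oracle database
CREATE EXTENSION oracle_fdw;
CREATE SERVER ora_src FOREIGN DATA WRAPPER oracle_fdw
  OPTIONS (dbserver '//orahost:1521/ORCL');
CREATE USER MAPPING FOR CURRENT_USER SERVER ora_src
  OPTIONS (user 'scott', password 'tiger');
CREATE FOREIGN TABLE fdw_oracle_emp (id int, name text)
  SERVER ora_src OPTIONS (schema 'SCOTT', table 'EMP');

-- target: postgres_fdw pointing at the CockroachDB cluster
CREATE EXTENSION postgres_fdw;
CREATE SERVER crdb_tgt FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'crdb-host', port '26257', dbname 'bulk');
CREATE USER MAPPING FOR CURRENT_USER SERVER crdb_tgt
  OPTIONS (user 'root');
CREATE FOREIGN TABLE fdw_cockroach_emp (id int, name text)
  SERVER crdb_tgt OPTIONS (schema_name 'public', table_name 'emp');

-- the migration itself is then a single statement
INSERT INTO fdw_cockroach_emp SELECT * FROM fdw_oracle_emp;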

A nice idea, thanks. It is going to look like knitting with databases :wink:
I think any database should have the option to connect to other databases. It relieves a developer who is updating data in multiple databases from having to re-invent transactions. The database already has that task.