DB design for historical data via interleaving

sql

#1

I am trying to design db model that takes advantage of interleaved tables. The project has several resources (i.e users, orders, items), and it also needs to keep historical changes for each of these resources.

Option 1:
We can have one table events, which is written to when any of the resource is created or updated.

So, when order is created, we insert both into orders table, as well as into events (name: “order_created”) via one transaction. Any further updates to this order updates order row in orders and creates new row in events (name: “order_updated”, data: only_delta) via one transaction.

Having one events table allows it to be interleaved only once. For example:

/users/1
/users/1/orders/1
/users/1/orders/2
...
/users/1/items/1
/users/1/items/2
...
/users/1/events/1
/users/1/events/1

In this case, we can’t keep events right next to the object to which they relate to, only interleaved under user.

Option 2:
On the other hand, instead of one events table, we can have user_events, order_events, items_events, which allows interleaving like this:

/users/1
/users/1/user_events/1
/users/1/user_events/2
...
/users/1/items/1
/users/1/items/1/items_events/1
...
/users/1/orders/1
/users/1/orders/1/order_events/1
/users/1/orders/1/order_events/2

Now, if I go with option 2, is my understanding correct that it offers:

  1. Faster reads for users inner joined with user_events than Option 1;

  2. Faster reads for orders inner joined with order_events (where user is given) than Option 1;

  3. Faster reads for items inner joined with items_events (where user is given) than Option 1;

  4. Faster writes for any new resource (user, order, item) + event row (user_events, order_events, items_events) via 1 transaction, as it inserts values right next to each other (same range) than Option 1;

  5. Slower reads/deletes for user_events, order_events, items_events when not joined or filtered by user and/or the resource to which event belongs to than Option 1;

  6. Generally, if my reads most of the time require resource and its events history, Option 2 is better option? My thinking is that if user will have lots of events, they will likely be in different ranges, and so making join of any resource+its history will be slower.


(Bob Vawter) #2

If your most common use case will be to access an object and the events associated with it, then Option 2 does sound reasonable. The main drawback is that if you wind up scanning a parent table, you’ll pay the performance cost of skipping the interleaved rows. This would only be exacerbated by having the two levels of interleaving if you need to scan the users table.

Option 2 sounds promising, but running a benchmark against synthetic data is the best way to know if it will be performant for your specific queries.