Not sure why this is downvoted. For a critical tool like DB cloning, I’d very much appreciate it if it were hand written, simply because that means it’s also hand reviewed at least once (by definition).
We wouldn’t have called it reviewed in the old world, but in the AI coding world we’re now in it makes me realise that yes, it is a form of reviewing.
I use Claude a lot btw. But I wouldn’t trust it on mission critical stuff.
It's being downvoted because the commenter is asking for something that is already in the readme. Furthermore, it's ironic that the person raising the issue is making the same mistake they are calling out: neglecting to read something they didn't write.
App migrations that may fail and need a rollback have a problem: you may not be allowed to wipe any transactions, so you may want to keep putting data into a parallel world that didn't migrate.
> Eh, DB branching is mostly only necessary for testing - locally
For local DBs, when I break them, I stop the Docker image and wipe the volume mounts, then restart + apply the "migrations" folder (minus whatever new broken migration caused the issue).
Uff, I had no idea that Postgres v15 introduced WAL_LOG and changed the default from FILE_COPY. For (parallel CI) test envs, it makes so much sense to switch back to the FILE_COPY strategy ... and I previously actually relied on that behavior.
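For reference, here is a minimal sketch of opting back into FILE_COPY when creating a per-test database from a template. The helper name `create_db_from_template` and the database names are made up for illustration. Only the `STRATEGY` option of `CREATE DATABASE` (available since v15) comes from Postgres itself. The function only builds the statement; sending it is left to whatever driver you use, and `CREATE DATABASE` has to run outside a transaction block (autocommit).

```python
def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier by doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def create_db_from_template(new_db: str, template: str,
                            strategy: str = "FILE_COPY") -> str:
    """Build a CREATE DATABASE statement that clones `template`.

    Hypothetical helper: it returns SQL text only and never connects.
    """
    strategy = strategy.upper()
    if strategy not in {"WAL_LOG", "FILE_COPY"}:
        raise ValueError(f"unknown strategy: {strategy}")
    return (
        f"CREATE DATABASE {quote_ident(new_db)} "
        f"TEMPLATE {quote_ident(template)} "
        f"STRATEGY = {strategy}"
    )


# Example: one database per parallel CI worker (names are placeholders).
for worker in range(3):
    print(create_db_from_template(f"test_{worker}", "app_template"))
```

Whether FILE_COPY is actually faster for a given setup depends on template size and filesystem, so it is worth measuring before switching.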
In theory, a database that uses immutable data structures (the hash array mapped trie popularized by Clojure) could allow instant clones on any filesystem, not just ZFS/XFS, and allow instant clones of any subset of the data, not just the entire db. I say "in theory" but I actually built this already so it's not just a theory. I never understood why there aren't more HAMT based databases.
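To illustrate why immutable structures make clones cheap, here is a small sketch of a persistent hash trie with path copying. It is a simplification and not a full HAMT (no bitmap-compressed nodes, fixed depth, plain dicts as nodes), and it is not the commenter's implementation. The names `PersistentMap`, `_set`, and `_hash` are invented for this example.

```python
BITS = 5                  # 32-way branching, as in Clojure-style HAMTs
MASK = (1 << BITS) - 1
DEPTH = 6                 # 6 levels x 5 bits = 30 hash bits


def _hash(key):
    # Keep 30 non-negative bits so shifting and masking are predictable.
    return hash(key) & ((1 << (BITS * DEPTH)) - 1)


def _set(node, h, level, key, value):
    """Return a new node with `key` set; only nodes on the path are copied."""
    copy = dict(node) if node is not None else {}
    if level == DEPTH:
        copy[key] = value  # leaf bucket, keyed by the real key
        return copy
    slot = (h >> (level * BITS)) & MASK
    copy[slot] = _set(copy.get(slot), h, level + 1, key, value)
    return copy


class PersistentMap:
    def __init__(self, root=None):
        self._root = root

    def set(self, key, value):
        # Never mutates self: returns a new map sharing untouched subtrees.
        return PersistentMap(_set(self._root, _hash(key), 0, key, value))

    def get(self, key, default=None):
        node, h = self._root, _hash(key)
        for level in range(DEPTH):
            if node is None:
                return default
            node = node.get((h >> (level * BITS)) & MASK)
        return default if node is None else node.get(key, default)

    def clone(self):
        # O(1) regardless of size: the clone shares the immutable root.
        return PersistentMap(self._root)


base = PersistentMap().set(1, "a").set(2, "b")
branch = base.clone().set(1, "z")   # write to the branch only
print(base.get(1), branch.get(1), branch.get(2))
```

A write copies only the handful of nodes between the root and the changed leaf, so the branch and the base keep sharing everything else. That is the same property that lets a clone of any subtree (a "subset of the data") be taken without copying it.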
Is anyone aware of something like this for MariaDB?
Something we've been trying to solve for a long time is having instant DB resets between acceptance tests (in CI or locally) back to our known fixture state, but right now it takes decently long (like half a second to a couple seconds, I haven't benchmarked it in a while) and that's by far the slowest thing in our tests.
I just want fast snapshotted resets/rewinds to a known DB state, but I need to be using MariaDB since it's what we use in production, we can't switch DB tech at this stage of the project, even though Postgres' grass looks greener.
Restarting the DB is unfortunately way too slow. We run the DB in a docker container with a tmpfs (in-memory) volume which helps a lot with speed, but the problem is still the raw compute needed to wipe the tables and re-fill them with the fixtures every time.
For anyone looking for a simple GUI for local testing/development of Postgres-based applications: I built a tool a few years ago that simplifies the process: https://github.com/BenjaminFaal/pgtt
Aurora clones are copy-on-write at the storage layer, which solves part of the problem, but RDS still provisions you a new cluster with its own endpoints, etc., which is slow (~10 mins), so it's not really practical for the integration testing use case.
As an aside, I just jumped around and read a few articles. This entire blog looks excellent. I’m going to have to spend some time reading it. I didn’t know about Postgres’s range types.
OP here - yes, this is my use case too: integration and regression testing, as well as providing learning environments. It makes working with larger datasets a breeze.
We do this, preview deploys, and migration dry runs using Neon Postgres’s branching functionality - seems one benefit of that vs this is that it works even with active connections which is good for doing these things on live databases.
OP here - I still have to try it (I generally operate at the VM/bare-metal level), but my understanding is that the ioctl call would get passed to the underlying volume, i.e. you would have to mount the volume
This is really cool, looking forward to trying it out.
Obligatory mention of Neon (https://neon.com/) and Xata (https://xata.io/) which both support “instant” Postgres DB branching on Postgres versions prior to 18.
Assuming I'd like to replicate my production database for either staging, or to test migrations, etc,
and that most of my data is either:
- business entities (users, projects, etc)
- and "event data" (sent by devices, etc)
where most of the database size is in the latter category, and that I'm fine with "subsetting" those (eg getting only the last month's "event data")
what would be the best strategy to create a kind of "staging clone"? ideally I'd like to tell the database (logically, without locking it expressly): do as though my next operations only apply to items created/updated BEFORE "currentTimestamp", and then:
- copy all my business tables (any update to those after currentTimestamp would be ignored magically even if they happen during the copy)
- copy a subset of my event data (same constraint)
Works with any PG version today. Each branch is a fully isolated PostgreSQL container with its own port. ~2-5 seconds for a 100GB database.
https://github.com/elitan/velo
Main difference from PG18's approach: you get complete server isolation (useful for testing migrations, different PG configs, etc.) rather than databases sharing one instance.
Mind you, I'm not saying it's bad per se. But shouldn't we be open and honest about this?
I wonder if this is the new normal. Somebody says "I built Xyz" but then you realize it's vibe coded.
https://github.com/elitan/velo/blame/12712e26b18d0935bfb6c6e...
And are we really doing this? Do we need to admit how every line of code was produced? Why? Are you expecting to see "built with the influence of Stackoverflow answers" or "google searches" on every single piece of software ever? It's an exercise of pointlessness.
Or at least I cannot come up with a use case for prod.
From that perspective, it feels like it'd be a perfect use case to embrace the LLM-guided development jank.
Raised an issue in my previous pet project for doing concurrent integration tests with real PostgreSQL DBs (https://github.com/allaboutapps/integresql) as well.
Also, the Docker link seems to be broken.
what's the best way to do this?
Something like:
https://www.postgresql.org/docs/current/sql-copy.html
It'd be really nice if pg_dump had a "data sample"/"data subset" option but unfortunately nothing like that is built in that I know of.
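A rough sketch of what that could look like for the "staging clone" question: run every COPY inside one REPEATABLE READ transaction, so all tables are read from the same snapshot and writes committed during the copy are not seen, with no explicit locking. The snapshot is taken at the first query in the transaction rather than at an arbitrary "currentTimestamp". The table names, the `created_at` column, and the helper `staging_copy_statements` are assumptions for illustration. The function only builds SQL text, and how `COPY ... TO STDOUT` is streamed depends on your driver.

```python
def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier by doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: str) -> str:
    """Quote a string literal (simplified: assumes standard_conforming_strings)."""
    return "'" + value.replace("'", "''") + "'"


def staging_copy_statements(full_tables, subset_tables, window="1 month"):
    """Build the statements for a snapshot-consistent partial export.

    full_tables:   tables copied entirely (business entities).
    subset_tables: {table: timestamp_column} copied only for the last `window`.
    """
    stmts = ["BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ, READ ONLY"]
    for table in full_tables:
        stmts.append(f"COPY {quote_ident(table)} TO STDOUT")
    for table, ts_col in subset_tables.items():
        stmts.append(
            f"COPY (SELECT * FROM {quote_ident(table)} "
            f"WHERE {quote_ident(ts_col)} >= now() - interval {quote_literal(window)}) "
            "TO STDOUT"
        )
    stmts.append("COMMIT")
    return stmts


# Hypothetical schema: two business tables plus one large event table.
for stmt in staging_copy_statements(["users", "projects"],
                                    {"device_events": "created_at"}):
    print(stmt + ";")
```

Foreign keys from the subset tables into the business tables stay valid here because the business tables are copied whole. Subsetting the business tables as well would need more care.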