Show HN: Streambed – Stream Postgres to Iceberg on S3, Supports Postgres Wire

(github.com)

64 points | by vira28 7 hours ago

6 comments

buremba 53 minutes ago
Looks interesting! It reminds me of pg_lake, which we evaluated for our startup https://lobu.ai, but it was missing a lot of pushdown capabilities, which made OLAP queries expensive.
I also tried DuckLake, but that required us to move away from a PG-first approach. I was thinking of using Debezium to create Iceberg tables on S3 for our append-only PG tables and querying them with DuckDB. I will try Streambed out as well!
vira28 2 hours ago
Author here. For context, I was the tech lead for the Postgres team at Cloudflare, and this came directly out of a challenge I kept hitting there: BI and dashboard teams needed to run long-running analytical queries, and the answer was always to spin up another bespoke read replica or stand up an ETL dump into an analytical database and query that.
So the question I started with was: what's the fewest components I could get away with? That led to the architecture here: Streambed connects to Postgres as a logical replication subscriber (same mechanism as a read replica) and streams WAL changes straight into Apache Iceberg on S3, queryable from psql via an embedded DuckDB. There are a lot of edge cases to handle, and it's very much early days.
Any feedback is welcome.
- kikimora 57 minutes ago
  To me, being able to query over psql is secondary. I’m fine with any SQL. What is very important is being able to transform the data to better suit analytical queries. That is, to define custom transformations, how the data is partitioned, and which indexes are available.
- ashtuchkin 1 hour ago
  Just wanted to say thank you! Very relevant to our use cases. I'll report if I find any issues.
cpard 4 hours ago
Replicating the Postgres WAL to S3 and Iceberg reliably is a hard problem, but it’s not accurate to say that no ETL is needed here.
Maybe you can say it’s more of an ELT pattern, but anyone who’s interested in using this for realistic analytics will have to transform the data at some point.
If an org is early enough to think that they can use a solution like this and just get into DuckDB and start spitting out reports, they will be in for a really bad experience.
Please educate people to do the right thing and realize the scope of the work they are facing. It might feel like it hurts your growth in the short term, but it will benefit you greatly in the mid to long term as a vendor.
- kikimora 3 hours ago
  IDK, AWS Zero ETL from Aurora into Redshift really helped us at some point. You're right that data transformation is very limited, if possible at all. But having data in an analytical store, being able to experiment with queries, understanding what is wrong with your OLTP schema, and then building ETL is way better than doing an upfront design.
  - cpard 1 hour ago
    Of course it is. What you describe is one of the reasons that ELT became popular. If you couple it with a variant type and schema on read, you have a very powerful and flexible architecture.
    But there’s no free lunch: building and maintaining data infrastructure that is reliable requires work. Many companies don’t realise that when they start their analytical journey, and aggressive marketing doesn’t help. That’s the point I was trying to make.
    - kikimora 1 hour ago
      I don’t disagree, just placing emphasis on a different aspect.
      In an ideal world there is a tool that moves your schema into an analytical store “as is” with a single click. Then the same tool lets you add arbitrary transformations of the data. Surprisingly, I have not come across such a tool. It is either “one click to move your data” or “any transformation you want”, but only after a significant upfront investment :(
      - cpard 37 minutes ago
        I think I didn’t express myself very well in my reply. I actually wanted to say that I agree with you, and to emphasise again the need to educate users about the complexity of these projects.
        What you describe has been pitched by many different products for different parts of the data platform. Fivetran, for example, claims to do that for the extraction and loading part; good old Informatica offered ETL in a graphical interface, etc.
        The problem that many teams ended up having is the explosion of the tooling needed by data teams.
karakanb 3 hours ago
Hi, this looks interesting, thanks for sharing. I am the builder of ingestr (https://github.com/bruin-data/ingestr), so I am very much in the same space.
I really like that you did this in Go, and I'll definitely dig a bit more into the source code to see how you tackled the CDC stuff, given that there are not many reliable CDC libraries in Go, and there are quite a few gotchas when it comes to doing CDC right. We also hand-rolled ours in ingestr, or I must say clanker-rolled, and we got quite a few things wrong in the first place.
Curious about the Postgres-compatible query option: what's the use case you have in mind there? My perception is that any org that would use Iceberg also has one or a few query engines in place. Is this more for debugging?
Quite cool stuff, keep it up!
- vira28 2 hours ago
  Hello, I checked out the ingestr repo, and it is in my bookmarks. Small world.
  Agreed, CDC is like death by a thousand cuts. I believe Debezium has a Java library.
  My initial need was Postgres compatibility. I wanted to provide an endpoint that BI and dashboard teams can query as if they were querying a Postgres replica. Added more context here https://news.ycombinator.com/item?id=48350820
oa335 1 hour ago
nice work! we have hand-rolled something similar at work.
do you have any perf metrics? throughput, end-to-end latency, etc.?