Databricks announces Zerobus - a Kafka-like message bus for writing data directly to your Lakehouse in the Unity table format. Zero copy. High throughput. Near real-time latency. Unlike Kafka, this is useful when your only destination is the lakehouse. I posted a while ago about the idea of open table formats reducing the need for Kafka - if S3 and table formats become the default place to store data, you can omit a lot of read-fanout jobs that simply copied the data to another system for use. Just store it in one place. Interesting development. It seems like every company is releasing its own niched-down Kafka alternative. (See Cloudflare Pipelines.)
You can do this with BoilStream into DuckLake over Flight RPC, directly from DuckDB with plain SQL inserts. We also have derived topics (materialised views) as well as a Postgres interface for real-time analytics.
I don't know. Maybe collaborating to simplify data ingestion from existing technologies and protocols (specifically OSS) is a better approach than yet another ingestion technology. Let's see how it evolves.
Ken Chen, time for Nativelog to be unleashed with an API.
I think they should be more specific about "near real-time" (everyone claims to be near real-time). Beyond that, I don't know what "zero copy" actually means here. They still need to copy the data into the system, since they are not transforming it at the source (it's ELT, not ETL). I feel they might want to go back to ETL by shifting everything to the left.
Link to the presentation: https://www.youtube.com/watch?v=wrH5wWmFT94 I'd be really curious to know which delivery semantics they support (if any).
~~Zeromq has been the sleeper protocol behind so much for ages, including Jupyter notebooks. Very smart choice as it already has drivers in lots of languages.~~ I mixed up an OSS lib I was looking at and this; I have no insider info.
This sounds like a game changer for data management.
This is similar to a Pub/Sub BigQuery subscription: push data into a topic and it lands in a table.
Is it time for Open Stream Formats now? S2 is an attempt, but it's an API - and also proprietary. Northguard's range splitting is a really cool idea for providing real elasticity. Right now we have the Kafka protocol, which is open and extremely widely adopted, but that's a behavioral specification, not a structural one. It also comes from a different time and was designed with self-hosting in LinkedIn's data centers in mind. And while we see cloud-native ideas for reimplementing it, the Kafka protocol seems a bit outdated to me. I'm curious whether we'll see innovation in this direction. Do we even need it? Maybe the available solutions are good enough and the effort to improve them would be better spent elsewhere.