If Apache Iceberg adoption is fast, the ecosystem around it is growing even faster, and that's creating one of the cheapest lakehouse infrastructure options available today. A big part of that story is ClickHouse, whose recent updates are shaping our stack of choice.

The Cost-Effective Stack: OLake by Datazip + Apache Iceberg + ClickHouse

Here's what I'm seeing teams build for maximum cost efficiency:

OLake handles the heavy lifting of real-time database replication, streaming from PostgreSQL, MySQL, and MongoDB directly into Apache Iceberg at 46K+ records/second. It's open source, requires minimal infrastructure (no Spark, no Flink, no Debezium), and supports all major Iceberg catalogs.

Iceberg, the hero of the stack, provides the open table format that eliminates vendor lock-in while delivering ACID transactions, schema evolution, and time travel.

ClickHouse delivers the analytics layer, with proven 5-15x cost advantages over traditional warehouses, and its recent updates are what I want to highlight. ClickHouse v25.8 shipped major updates for Iceberg support (full source in comments):
✅ Native Write Support - full CRUD operations, not just reads
✅ Production-Ready Catalogs - REST, Glue, and Unity all promoted to beta
✅ Schema Evolution - add/drop/modify columns seamlessly
✅ Better Deletes - position deletes merged efficiently
✅ Near Real-time Streaming - perfect for ingestion platforms like OLake

Why this stack works: real-time ingestion → open storage → fast analytics = maximum performance at minimum cost.

#ApacheIceberg #OpenTableFormats #ClickHouse #DataLakehouse
How to Build a Cost-Effective Stack with Apache Iceberg and ClickHouse
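To make the analytics layer concrete, here is a minimal sketch of querying an Iceberg table from ClickHouse through the clickhouse-connect Python client. The host, credentials, S3 path, and column names are placeholders, and it assumes a ClickHouse version with the iceberg() table function available.

```python
# A minimal sketch of reading an Iceberg table from ClickHouse with the
# clickhouse-connect Python client. Host, credentials, bucket path, and
# column names are placeholders -- adapt them to your environment.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123)

# The iceberg() table function reads an Iceberg table directly from object
# storage; the S3 URL and keys below are illustrative only.
result = client.query(
    """
    SELECT toStartOfHour(event_time) AS hour, count() AS events
    FROM iceberg('https://my-bucket.s3.amazonaws.com/warehouse/events/',
                 'ACCESS_KEY', 'SECRET_KEY')
    WHERE event_time >= now() - INTERVAL 1 DAY
    GROUP BY hour
    ORDER BY hour
    """
)

for hour, events in result.result_rows:
    print(hour, events)
```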
More Relevant Posts
Apache Iceberg: a database without a query engine. Apache Iceberg, true to its name, is more complex than it first appears; after all, the Parquet format has been around for 10 years now. Parquet was the evolution of the *.csv format: an open columnar storage format where, unlike *.csv, data is stored in columns, designed for analytics and with far better compression. Parquet can also store metadata, so we had an open data format.

Apache Iceberg is transactional Parquet storage with snapshot isolation: new data gets appended and the previous copy is kept, hence ACID compliance. Transactional...ish - one can see how this snapshot-based approach can be overwhelmed by a high volume of single-row transactions.

A database without an SQL engine - bring your own SQL. The problem with not having a SQL engine is also not having the metadata layer that every database has; MySQL's information_schema is a good example. Iceberg needs to store that metadata somewhere, and hence the Iceberg catalogs - and there are plenty of them, not exactly interoperable, and confusing. Everyone has their own catalog:
✔️ AWS Glue
✔️ Hive Metastore
✔️ GCP BigLake catalog
✔️ REST catalog
✔️ Polaris catalog
✔️ JDBC catalog
✔️ Nessie
✔️ internal catalogs
... and a lot more

Which one should I use? Lots of decisions and planning, and lots of variables in object creation; Iceberg recently added data lineage and new data types, so quite a bit of thinking ahead is required. One thing is for sure: there has to be a unification of catalogs and better interoperability. Today Polaris seems to be the leading Iceberg catalog with the most interoperability.

In conclusion, Apache Iceberg requires many architectural decisions and a lot of planning. Post deployment, Apache Iceberg will require administration and routine maintenance to keep up performance and agility.

#apacheiceberg #nessie #polaris #jdbc #git #dataengineering #pulsar #analytics #dataanalytics #cloudnative #opensource #aws #datalake #spark #trino #flink #postgresql #database #dba #linux #opensource #devops #cloudcomputing #itInfrastructure #clouddatabase #TechBlog #MediumBlog #Tutorial #Learning #CareerGrowth #TechnicalWriting #KnowledgeSharing #MySQL #MSSQL #Mongodb #DataSecurity #Backend #TechWriteUp #DevCommunity
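To make the catalog question concrete, here is a minimal sketch of connecting to a REST catalog with PyIceberg and reading a table. The URI, warehouse path, and table identifier are placeholders, and it assumes a REST catalog is already running.

```python
# A minimal sketch of talking to an Iceberg REST catalog with PyIceberg.
# The URI, warehouse, and table identifier are placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "my_catalog",
    **{
        "type": "rest",
        "uri": "http://localhost:8181",          # REST catalog endpoint (assumed)
        "warehouse": "s3://my-bucket/warehouse",  # assumed warehouse location
    },
)

# The catalog is the metadata layer the post is talking about: it, not the
# query engine, knows which snapshot and which data files make up a table.
print(catalog.list_namespaces())

table = catalog.load_table("analytics.events")
print(table.current_snapshot())

# Scans are planned locally; bring your own engine to actually run SQL.
batch = table.scan(limit=10).to_arrow()
print(batch)
```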
Apache Doris is fast: 6x faster than ClickHouse, 30x faster than PostgreSQL, and 100x faster than MongoDB. These results come straight from RTABench, a benchmark designed for real-time analytics performance 🚀

If you're building real-time applications, such as live dashboards, trading platforms, or tracking systems, common benchmark tests like TPC-H, TPC-DS, and ClickBench may not fully reflect workloads that demand millisecond responsiveness (although Doris scores well on these too). That's why RTABench (created by TigerData/TimescaleDB) was designed to test databases using real-world patterns such as:
- Normalized schemas with multi-table joins (reflecting typical app data models)
- Selective filtering on recent time windows (testing indexing and partition pruning)
- Pre-aggregated query performance for instant results

Why does this matter? In high-concurrency, real-time analytical scenarios where every millisecond counts, Apache Doris proves it can deliver both speed and cost efficiency.

🤔 Check out the benchmark at rtabench[dot]com
🔗 Read our detailed blog (link in the comments)
👉 Follow VeloDB (Powered by Apache Doris) for more Apache Doris use cases and benchmark results.

#ApacheDoris #RTABench #ClickHouse #MongoDB
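Apache Doris speaks the MySQL wire protocol, so an RTABench-style query pattern (a recent time window plus a multi-table join) can be tried from any MySQL client. The sketch below uses PyMySQL; the host, credentials, and the orders/customers schema are hypothetical.

```python
# A rough sketch of an RTABench-style query pattern against Apache Doris:
# selective filtering on a recent time window plus a multi-table join.
# Doris exposes a MySQL-compatible port, so PyMySQL works as a client.
# Host, credentials, and the orders/customers tables are all hypothetical.
import pymysql

conn = pymysql.connect(
    host="127.0.0.1", port=9030,      # 9030 is the Doris FE MySQL port by default
    user="root", password="", database="demo",
)

with conn.cursor() as cur:
    cur.execute(
        """
        SELECT c.region, COUNT(*) AS orders, SUM(o.amount) AS revenue
        FROM orders o
        JOIN customers c ON o.customer_id = c.id
        WHERE o.created_at >= NOW() - INTERVAL 1 HOUR
        GROUP BY c.region
        ORDER BY revenue DESC
        """
    )
    for row in cur.fetchall():
        print(row)

conn.close()
```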
🚀 Apache Doris crushes real-time analytics speed—6x faster than ClickHouse, 30x than PostgreSQL, 100x than MongoDB (per RTABench, the real-world real-time benchmark)!
Happy Monday everyone! 🌟 Exciting news in the tech world as Apache Flink CDC releases version 3.5.0! This update brings some game-changing features. (WARNING! I pressed the re-write-with-AI button because the formatting options are unusable on LinkedIn these days.)

🚀 Introducing pipeline-level #Postgres as a source and Apache Fluss (Incubating) as a sink, plus enhanced support for multi-table jobs and schema evolution, reducing friction and streamlining operations for real-time platforms. Check it out at flink.apache.org.

🔍 What's new and useful:
- Fluss pipeline sink: a streaming storage layer tailored for real-time analytics, offering columnar streaming reads, changelog tracking, and high-QPS primary-key lookups. Perfect for dimension tables and low-latency enrichment. Flink CDC now integrates with Fluss as a first-class pipeline sink. For bonus points, this gives you full #paimon (and soon #iceberg) support out of the box :D
- #PostgreSQL pipeline source: Postgres now serves as a declarative pipeline source, simplifying end-to-end replication with schema evolution. No more worrying about newly created tables getting lost on their way to the analytical side. Also, Postgres 18 just launched, so AWESOME all around!!!

🔧 Production enhancements for long-running CDC:
- Stateless and schema-correct transforms, with schema re-issued from source state during failover.
- Improved handling of case-sensitive identifiers and DATE/TIME precision in transforms/UDTFs.
- Safer termination logic for incremental source splits to prevent job blockages.

🛠 Connector fixes & ecosystem updates:
- Safety improvements such as MySQL GTID out-of-order handling and Postgres partitioned-table discovery.
- A shift to Binlog Service for OceanBase, and a Paimon upgrade to 1.2.0 for more flexible downstream writes and column comments.

🔓 Unlocking new architecture choices: with Flink CDC → Fluss, tables can function as both streams and live, queryable state with primary-key point lookups and changelog semantics. This setup eliminates the need for separate Kafka pipes and cache layers like Redis, offering a streamlined approach to both transport and serving.
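For a feel of what a declarative pipeline definition looks like, here is a rough sketch of a Postgres-to-Fluss pipeline config. The source/sink/pipeline layout follows the general Flink CDC YAML shape, but the specific Postgres and Fluss option keys below are assumptions, not copied from the 3.5.0 docs, so verify them against the release notes before use.

```python
# A rough sketch of a declarative Flink CDC pipeline definition, written out
# from Python for convenience. The exact Postgres and Fluss option names are
# assumptions -- check the 3.5.0 documentation before relying on them.
from pathlib import Path

pipeline_yaml = """\
source:
  type: postgres                # new pipeline-level source in 3.5.0
  hostname: localhost
  port: 5432
  username: flinkcdc
  password: secret
  tables: app_db.public.\\.*    # regex so newly created tables are picked up
  slot.name: flink_cdc_slot     # assumed option name

sink:
  type: fluss                   # new pipeline sink in 3.5.0
  bootstrap.servers: fluss-coordinator:9123   # assumed option name

pipeline:
  name: postgres-to-fluss-sync
  parallelism: 2
"""

Path("postgres-to-fluss.yaml").write_text(pipeline_yaml)
# Submit with the Flink CDC CLI, e.g.:  bash bin/flink-cdc.sh postgres-to-fluss.yaml
print("wrote postgres-to-fluss.yaml")
```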
Data replication into Apache Iceberg can be troublesome. From handling CDC to managing schema changes, finding a reliable source of guidance is essential. That's exactly what we've been building at OLake: not just the fastest replication framework for Iceberg, but also the resources to help you get started the right way.

We've published a family of blogs covering the three most common sources:
⚡ MySQL
⚡ MongoDB
⚡ Postgres

Alongside our detailed documentation, these guides will help you experiment with your existing stack, identify common issues, explore how OLake addresses them with open table formats, and experience Iceberg replication in action.

👉 Links in the comments.
👉 And if you hit roadblocks, our Slack community (with engineers from the core team) is always there to help.

#dataengineering #datareplication
⚡ Small tweaks = big gains. We cut PostgreSQL P90 latency from 120ms → 80ms. How?
- Materialized views for heavy aggregations
- Stored procedures to reduce network trips
- Index tuning for high-traffic queries

Lesson: backend performance often comes from the database layer.

👉 What’s your favorite trick for query optimization?

#PostgreSQL #BackendEngineering #Databases
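As a small illustration of the first and third tricks, here is a sketch using psycopg2 to create a materialized view for a heavy aggregation and an index for a hot query path. The DSN, table, view, and column names are made up.

```python
# A sketch of two of the tricks above -- a materialized view for a heavy
# aggregation and an index for a high-traffic query -- using psycopg2.
# The DSN, table, and column names are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=app user=app_user password=secret host=localhost")
conn.autocommit = True   # let each DDL/maintenance statement commit on its own
cur = conn.cursor()

# 1) Materialized view: precompute a heavy aggregation so dashboards read a
#    small result set instead of scanning the raw table on every request.
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS daily_order_totals AS
    SELECT customer_id,
           date_trunc('day', created_at) AS day,
           SUM(amount) AS total
    FROM orders
    GROUP BY 1, 2
""")

# A unique index lets us refresh CONCURRENTLY, i.e. without blocking readers.
cur.execute("""
    CREATE UNIQUE INDEX IF NOT EXISTS daily_order_totals_pk
    ON daily_order_totals (customer_id, day)
""")
cur.execute("REFRESH MATERIALIZED VIEW CONCURRENTLY daily_order_totals")

# 2) Index tuning: cover the hot lookup path of a high-traffic query.
cur.execute("""
    CREATE INDEX IF NOT EXISTS idx_orders_customer_created
    ON orders (customer_id, created_at DESC)
""")

cur.close()
conn.close()
```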
🚀 PostgreSQL 18 just dropped and it's a GAME CHANGER! After diving deep into the release notes, I'm genuinely excited about these features that will transform how we work with databases:

🔥 Asynchronous I/O System - This isn't just an improvement, it's revolutionary. Sequential scans, bitmap heap scans, and vacuum operations are now dramatically faster. Your queries will thank you.
⚡ Skip Scan for B-tree Indexes - Finally! Multi-column indexes can now be used efficiently even without conditions on leading columns. This solves a pain point we've had for YEARS.
🆔 UUIDv7 Function - Time-ordered UUIDs that maintain uniqueness AND reduce index fragmentation. Perfect for distributed systems and microservices.
💡 Virtual Generated Columns (now default) - Compute on-demand, save storage, always up-to-date. Smart design choice by the PostgreSQL team.
🔐 Native OAuth Authentication - Enterprise-grade security that integrates with your existing auth infrastructure.
📊 Enhanced Monitoring - Per-backend I/O stats, WAL activity tracking, vacuum timing details. Observability just got so much better.

The performance improvements alone make this a must-upgrade release. If you're running PostgreSQL in production, you need to see these features.

I wrote a detailed breakdown of everything new in PostgreSQL 18 - link in comments! 👇

What feature are you most excited about? Drop a comment below!

#PostgreSQL #Database #Performance #DevOps #SQL #PostgreSQL18 #TechNews #SoftwareEngineering #DatabaseManagement #Backend
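Here is a small sketch of two of those features against a PostgreSQL 18 instance: uuidv7() primary keys and a multicolumn index that skip scan can make useful for queries on a non-leading column. The connection string and schema are placeholders.

```python
# A small sketch against a PostgreSQL 18 instance: time-ordered UUIDv7 keys
# and a query that can benefit from B-tree skip scan. Connection string and
# table names are placeholders; uuidv7() exists only on PostgreSQL 18+.
import psycopg2

conn = psycopg2.connect("dbname=app user=app_user host=localhost")
conn.autocommit = True
cur = conn.cursor()

# UUIDv7 primary keys: globally unique but time-ordered, so inserts land at
# the "right edge" of the index instead of fragmenting it.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        id         uuid PRIMARY KEY DEFAULT uuidv7(),
        tenant_id  int  NOT NULL,
        kind       text NOT NULL,
        created_at timestamptz NOT NULL DEFAULT now()
    )
""")

# Multicolumn index led by tenant_id. Before v18 a filter on kind alone would
# typically ignore this index; skip scan lets the planner use it by stepping
# through the distinct tenant_id values.
cur.execute("CREATE INDEX IF NOT EXISTS idx_events_tenant_kind ON events (tenant_id, kind)")

cur.execute("EXPLAIN SELECT count(*) FROM events WHERE kind = 'login'")
for (line,) in cur.fetchall():
    print(line)

cur.close()
conn.close()
```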
Ever felt like you're flying blind with your PostgreSQL monitoring on Kubernetes? I've definitely been there. So, I put together a step-by-step guide on a setup I'm really loving: combining CloudNativePG (CNPG) with the Pigsty exporter. The result? Amazing, out-of-the-box dashboards loaded with metrics. They make it easy to track down slow queries and get a ton of insight from your query performance data. We cover the whole thing, from managing secrets with OpenBao to seeing it all come to life in Grafana. Hope you find it useful! Check it out: #PostgreSQL #Kubernetes #K8s #DevOps #Observability #OpenSource #CNPG #Grafana #Pigsty https://lnkd.in/dfzpqV99
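For context on what the metrics plumbing looks like, here is a rough sketch of a CloudNativePG Cluster manifest with the metrics exporter exposed to Prometheus via a PodMonitor, generated from Python. The field names follow the CNPG Cluster CRD as I recall it, and the custom-queries ConfigMap (e.g. one carrying Pigsty-style queries) is an assumption; verify against the CNPG docs for your version.

```python
# A rough sketch of a CloudNativePG Cluster manifest with Prometheus
# scraping enabled. Names, namespace, and the custom-queries ConfigMap are
# placeholders/assumptions -- check the CNPG documentation before applying.
import yaml

cluster = {
    "apiVersion": "postgresql.cnpg.io/v1",
    "kind": "Cluster",
    "metadata": {"name": "pg-demo", "namespace": "databases"},
    "spec": {
        "instances": 3,
        "storage": {"size": "20Gi"},
        "monitoring": {
            "enablePodMonitor": True,            # lets Prometheus scrape each pod
            "customQueriesConfigMap": [          # extra exporter queries (assumed name)
                {"name": "pigsty-queries", "key": "queries.yaml"}
            ],
        },
    },
}

print(yaml.safe_dump(cluster, sort_keys=False))
```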
🚀 PostgreSQL 18 is here, and it's packed with features that developers and DBAs will love! 🐘

This latest release focuses on performance, developer experience, and observability. Here are some of my favorite highlights:

🔹 Asynchronous I/O (AIO): This is a game-changer for performance, especially for read-heavy workloads. AIO allows PostgreSQL to issue multiple I/O requests concurrently, leading to significant throughput gains.
🔹 Native UUIDv7 Support: Finally, a built-in way to generate time-ordered UUIDs! This should be fantastic for performance, especially when it comes to indexing and caching.
🔹 VIRTUAL Generated Columns: You can now create generated columns that are computed on read, saving disk space and speeding up INSERT and UPDATE operations.

This is just a glimpse of what's new in PostgreSQL 18; the full list of features is available on the official website. Wallarm is migrating to version 17 right now, so it will be interesting to compare performance between 17 and 18...

And how do you feel about PostgreSQL in your production? Let me know in the comments! 👇

#PostgreSQL #Postgres #Database #OpenSource #DevOps #Databases #NewRelease
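To show the VIRTUAL generated column feature in practice, here is a sketch via psycopg2 against a PostgreSQL 18 server. The table and column names are made up; virtual columns are computed at read time, so they take no disk space and don't slow down writes.

```python
# A sketch of a VIRTUAL generated column on PostgreSQL 18, via psycopg2.
# Connection string and schema are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=app user=app_user host=localhost")
conn.autocommit = True
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS line_items (
        id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        quantity   int           NOT NULL,
        unit_price numeric(10,2) NOT NULL,
        -- computed on read; in PostgreSQL 18 VIRTUAL is the default kind
        total numeric(12,2) GENERATED ALWAYS AS (quantity * unit_price) VIRTUAL
    )
""")

cur.execute("INSERT INTO line_items (quantity, unit_price) VALUES (3, 19.99)")
cur.execute("SELECT quantity, unit_price, total FROM line_items")
print(cur.fetchall())

cur.close()
conn.close()
```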
MongoDB is the NoSQL database that every backend developer should know! This micro-guide covers the basics, real code examples, and pro tips to get you started. Save it and comment which backend topic you want next!
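In that spirit, here is a tiny PyMongo sketch covering the basics such a guide usually starts with: connect, insert, query, and index. The connection string, database, and collection names are placeholders.

```python
# A tiny PyMongo warm-up covering the basics: connect, insert, query, index.
# The connection string, database, and collection names are placeholders.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
db = client["app"]
users = db["users"]

# Insert a document (schema-less: fields can vary per document).
users.insert_one({"name": "Ada", "email": "ada@example.com", "plan": "pro"})

# Query with a filter and a projection.
for doc in users.find({"plan": "pro"}, {"_id": 0, "name": 1, "email": 1}):
    print(doc)

# Pro tip: index the fields you filter on most.
users.create_index([("email", ASCENDING)], unique=True)
```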