How file formats impact Snowflake costs and performance

Kunwarraj Meena

Snowflake Data Engineer | AWS Cloud Engineer | Snowflake | dbt | Matillion | Python | SQL | Power BI

From 1 TB to 56 GB: How file formats & transformations can make or break your Snowflake bill (and speed)

When you’re working in Snowflake with dbt transformations, file size isn’t just a storage number: it directly impacts query cost, load times, and performance. The same dataset might take up 1 TB in CSV format, but shrink dramatically if you use optimized columnar formats like Parquet or ORC before loading into Snowflake.

Typical size impact:
╰┈➤ CSV: 1 TB
╰┈➤ Avro: ~240 GB (~76% smaller)
╰┈➤ Parquet: ~104 GB (~89.6% smaller)
╰┈➤ ORC: ~56 GB (~94.4% smaller)

That’s nearly 95% smaller, before you even start optimizing queries in dbt.

⸻

📄 CSV in Snowflake
• Plain text, no compression.
• Slower bulk loads into Snowflake stages.
• Higher cloud storage and compute costs when transforming.
• In dbt, transformations have to scan all columns, even unused ones.

⸻

📦 Avro in Snowflake
• Row-based binary format, great for event ingestion.
• Works well with Snowpipe or streaming loads into Snowflake.
• Supports schema evolution (helpful for dbt incremental models).
• But for analytics, columnar formats beat it on query speed.

⸻

📊 Parquet in Snowflake
• Columnar format, so Snowflake can prune unneeded columns when scanning staged files.
• Highly compressed, faster to load via COPY INTO.
• Perfect for dbt’s incremental models: less data scanned = lower costs.
• Ideal for analytical workloads on large datasets.

⸻

📘 ORC in Snowflake
• Also columnar, with excellent compression.
• Stores statistics (min/max) to speed up predicate pushdown.
• Great for batch loads before dbt runs transformations.

⸻

💡 Snowflake + dbt takeaway:
• Stage data as Parquet/ORC for maximum compression and column pruning (see the COPY INTO sketch after this post).
• Use dbt incremental models to avoid reprocessing unchanged data (see the model sketch after this post).
• Leverage Snowflake’s micro-partition metadata (min/max per column) so queries prune partitions instead of scanning everything.

⸻

💰 Before-and-After Cost Simulation

Let’s say your warehouse is Medium (4 credits/hour) and Snowflake charges $2/credit.

Scenario 1 – CSV + Full dbt Refresh
• Data scanned: 1 TB
• Full scan every run (no incremental logic)
• Query time: ~1 hour
• Cost per run: 4 credits x $2 = $8
• Monthly (daily run): $8 x 30 = $240

Scenario 2 – Parquet + dbt Incremental
• Total data size: 104 GB (~89.6% smaller)
• Only ~10% of the data changes each run (incremental model)
• Effective scan: ~10 GB
• Query time: ~5 minutes
• Cost per run: ~0.33 credits x $2 ≈ $0.66
• Monthly (daily run): $0.66 x 30 ≈ $19.80

💥 Savings: from $240 to $19.80/month, roughly a 92% cost reduction.

⸻

Choosing the right file format isn’t just about storage: it’s about controlling your Snowflake bill and speeding up dbt runs.

#Snowflake #dbt #DataEngineering #Parquet #BigData #seniormanagers #topITforce

Author - Kunwar-Sa
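Two quick sketches for the takeaways above. They are minimal examples under assumed names: parquet_ff, raw_stage, raw_events, event_id, event_ts, and payload are illustrative, not objects from the post.

A minimal Snowflake sketch of staging Parquet and loading it with COPY INTO:

-- Assumes a table raw_events with matching columns already exists.
CREATE FILE FORMAT IF NOT EXISTS parquet_ff TYPE = PARQUET;

CREATE STAGE IF NOT EXISTS raw_stage FILE_FORMAT = parquet_ff;

-- COPY with a transform: select only the Parquet fields you actually need.
COPY INTO raw_events (event_id, event_ts, payload)
FROM (
    SELECT $1:event_id::NUMBER,
           $1:event_ts::TIMESTAMP_NTZ,
           $1:payload::VARIANT
    FROM @raw_stage
)
FILE_FORMAT = (FORMAT_NAME = 'parquet_ff');

And a minimal dbt incremental model along the lines the post describes, filtering to rows newer than what is already in the target so daily runs scan only the changed slice (the ~10 GB in Scenario 2):

-- models/stg_events.sql (illustrative path)
{{ config(materialized='incremental', unique_key='event_id') }}

select
    event_id,
    event_ts,
    payload
from {{ source('raw', 'raw_events') }}

{% if is_incremental() %}
  -- Incremental runs only pick up rows newer than the current max timestamp in the target.
  where event_ts > (select max(event_ts) from {{ this }})
{% endif %}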

Kunwarraj Meena

Snowflake Data Engineer | AWS Cloud Engineer | Snowflake | dbt | Matillion | Python | SQL | Power BI

2mo

Good approach to selecting file formats for cost effectiveness.

SAIRAM ADABOINA

ML Data Associate | Amazon | PL-300 Certified | Transforming Data into Insights | Data Analyst | Power BI | SQL | Tableau | Business Intelligence | Data Visualization

2mo

Thanks for sharing, Kunwarraj

Vijay Kumar Jagannadham

technical specialist at Birlasoft

2mo

💡 Great insight

Thyago Manhaes

Data Engineer | Analytics Engineer | Python | SQL | DBT | PySpark | API | Web Scraping | GCP | Databricks

2mo

Amazing! Very useful, thanks!

Sai Krishna Chivukula

Principal Data Engineer @ Altimetrik | Writes to 39K+ Followers | 🌟Top 1% Data Engineering Voice | Certified in SNOWFLAKE | 1x FABRIC | 2x AZURE | 2x Databricks | SQL | Informatica | Databricks | Ex Carelon, ADP, CTS

2mo

Definitely worth reading
