How file formats impact Snowflake costs and performance

Kunwarraj Meena

Snowflake Data Engineer | AWS Cloud Engineer | Snowflake | dbt | Matillion | Python | SQL | Power BI

From 1 TB to 56 GB: How file formats & transformations can make or break your Snowflake bill (and speed)

When you’re working in Snowflake with dbt transformations, file size isn’t just a storage number: it directly impacts query cost, load times, and performance. The same dataset might take up 1 TB in CSV format, but shrink dramatically if you use optimized columnar formats like Parquet or ORC before loading into Snowflake.

Typical size impact:
╰┈➤ CSV: 1 TB
╰┈➤ Avro: ~240 GB (~76% smaller)
╰┈➤ Parquet: ~104 GB (~89.6% smaller)
╰┈➤ ORC: ~56 GB (~94.4% smaller)

That’s nearly 95% smaller, before you even start optimizing queries in dbt.

⸻

📄 CSV in Snowflake
• Plain text, no compression.
• Slower bulk loads into Snowflake stages.
• Higher cloud storage and compute costs when transforming.
• In dbt, transformations have to scan all columns, even unused ones.

⸻

📦 Avro in Snowflake
• Row-based binary format, great for event ingestion.
• Works well with Snowpipe or streaming loads into Snowflake.
• Supports schema evolution (helpful for dbt incremental models).
• But for analytics, columnar formats beat it on query speed.

⸻

📊 Parquet in Snowflake
• Columnar format, so Snowflake can prune unneeded columns when scanning staged files.
• Highly compressed, faster to load via COPY INTO.
• Perfect for dbt’s incremental models: less data scanned = lower costs.
• Ideal for analytical workloads on large datasets.

⸻

📘 ORC in Snowflake
• Also columnar, with excellent compression.
• Stores statistics (min/max) to speed up predicate pushdown.
• Great for batch loads before dbt runs transformations.

⸻

💡 Snowflake + dbt takeaway:
• Stage data as Parquet/ORC for maximum compression and column pruning (see the COPY INTO sketch after this post).
• Use dbt incremental models to avoid reprocessing unchanged data (see the model sketch after this post).
• Leverage Snowflake’s micro-partition metadata (min/max per column) so queries prune partitions instead of scanning everything.

⸻

💰 Before-and-After Cost Simulation

Let’s say your warehouse is Medium (4 credits/hour) and Snowflake charges $2/credit.

Scenario 1 – CSV + Full dbt Refresh
• Data scanned: 1 TB
• Full scan every run (no incremental logic)
• Query time: ~1 hour
• Cost per run: 4 credits x $2 = $8
• Monthly (daily run): $8 x 30 = $240

Scenario 2 – Parquet + dbt Incremental
• Total data size: 104 GB (~89.6% smaller)
• Only ~10% of the data changes each run (incremental model)
• Effective scan: ~10 GB
• Query time: ~5 minutes
• Cost per run: ~0.33 credits x $2 ≈ $0.66
• Monthly (daily run): $0.66 x 30 ≈ $19.80

💥 Savings: from $240 to $19.80/month, roughly a 92% cost reduction.

⸻

Choosing the right file format isn’t just about storage: it’s about controlling your Snowflake bill and speeding up dbt runs.

#Snowflake #dbt #DataEngineering #Parquet #BigData #seniormanagers #topITforce

Author - Kunwar-Sa
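Two quick sketches for the takeaways above. They are minimal examples under assumed names: parquet_ff, raw_stage, raw_events, event_id, event_ts, and payload are illustrative, not objects from the post.

A minimal Snowflake sketch of staging Parquet and loading it with COPY INTO:

-- Assumes a table raw_events with matching columns already exists.
CREATE FILE FORMAT IF NOT EXISTS parquet_ff TYPE = PARQUET;

CREATE STAGE IF NOT EXISTS raw_stage FILE_FORMAT = parquet_ff;

-- COPY with a transform: select only the Parquet fields you actually need.
COPY INTO raw_events (event_id, event_ts, payload)
FROM (
    SELECT $1:event_id::NUMBER,
           $1:event_ts::TIMESTAMP_NTZ,
           $1:payload::VARIANT
    FROM @raw_stage
)
FILE_FORMAT = (FORMAT_NAME = 'parquet_ff');

And a minimal dbt incremental model along the lines the post describes, filtering to rows newer than what is already in the target so daily runs scan only the changed slice (the ~10 GB in Scenario 2):

-- models/stg_events.sql (illustrative path)
{{ config(materialized='incremental', unique_key='event_id') }}

select
    event_id,
    event_ts,
    payload
from {{ source('raw', 'raw_events') }}

{% if is_incremental() %}
  -- Incremental runs only pick up rows newer than the current max timestamp in the target.
  where event_ts > (select max(event_ts) from {{ this }})
{% endif %}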

Kunwarraj Meena

Snowflake Data Engineer | AWS Cloud Engineer | Snowflake | dbt | Matillion | Python | SQL | Power BI

2mo

Good approach to selecting file formats for cost effectiveness.

SAIRAM ADABOINA

ML Data Associate | Amazon | PL-300 Certified | Transforming Data into Insights | Data Analyst | Power BI | SQL | Tableau | Business Intelligence | Data Visualization

2mo

Thanks for sharing, Kunwarraj

Vijay Kumar Jagannadham

technical specialist at Birlasoft

2mo

💡 Great insight

Thyago Manhaes

Data Engineer | Analytics Engineer | Python | SQL | DBT | PySpark | API | Web Scraping | GCP | Databricks

2mo

Amazing! Very useful, thanks!

Sai Krishna Chivukula

Principal Data Engineer @ Altimetrik | Writes to 39K+ Followers | 🌟Top 1% Data Engineering Voice | Certified in SNOWFLAKE | 1x FABRIC | 2x AZURE | 2x Databricks | SQL | Informatica | Databricks | Ex Carelon, ADP, CTS

2mo

Definitely worth reading
