Pete Soderling’s Post

When a 75GB CSV brings everything else to its knees… engineers reach for DuckDB.

A Redditor recently shared their struggle:

- 75GB CSV, ~400 columns, 2B rows
- Dirty data, special characters, duplication
- On-prem SQL Server target
- Tried SSIS and Python; best case: 8 days to process a single file. (eight!)

The "best" reply: "You can query the file with DuckDB if you only need a subset of the columns and it will handle out-of-memory processing. It can query CSVs directly with FROM read_csv('[path]')... and then dump the result to Parquet."

And it worked. The OP processed all their CSVs into Parquet, cut them down to the fields they actually needed, and loaded them in a fraction of the time.

It's awesome to see engineers start to default to DuckDB. When a product becomes the obvious answer in threads like this, it stops being "just a tool" and starts becoming "infrastructure". DuckDB is doing for data engineers what Postgres once did for app developers: becoming the trusted, go-to backbone for hairy data projects. And this is still the best type of marketing and product-market fit: engineers recommending your product to other engineers.

(MotherDuck, built around DuckDB, also happens to be a Zero Prime Ventures portfolio company. Seeing the community rally around tech like this is genuinely exciting!)
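For anyone who wants to try the same trick, here's a minimal sketch of the workflow that reply describes, in DuckDB SQL. The file path, column names, and the ignore_errors setting are illustrative assumptions, not the OP's actual schema:

```sql
-- Hypothetical CSV -> Parquet conversion with DuckDB.
-- 'big_file.csv' and the column names are placeholders for illustration.
COPY (
    SELECT DISTINCT           -- collapse exact duplicate rows
        customer_id,          -- project only the columns you actually need
        order_date,
        amount
    FROM read_csv(
        'big_file.csv',
        ignore_errors = true  -- skip unparseable rows instead of aborting (dirty data)
    )
) TO 'big_file.parquet' (FORMAT parquet);
```

Because DuckDB streams the scan and can spill to disk, this works even when the CSV is far larger than RAM, and the resulting Parquet file is a much smaller, faster source to load into SQL Server.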

Tomasz Chudzik

Head of Delivery & Area Director at Unit8 SA

1mo

Love DuckDB. But sometimes plain old Unix tools are the best solution. awk/sort/cut/uniq/sed can go through a CSV of any size.

Jared Lander

Chief Data Scientist at Lander Analytics, Columbia Professor, Author of R for Everyone, Keynote Speaker, and Organizer of the World's Largest R Meetup.

1mo

I had a similar experience recently, using DuckDB to run a query against a 500-million-row, 48 GB CSV in about half a second. The replies to my Twitter post about it were full of people telling me to install various databases, and I was dumbfounded that they thought any of those would be faster than the half second DuckDB took.

DuckDB is the absolute tops. But what was in that CSV file, exactly?

Ángel Narciso

Chief Data Officer | Economist | Advisor

1mo

Seems that parallel computing still needs to be discovered by some "engineers". 8 days...

Kyle Walker

Demographics | Geospatial | AI | Open Source

1mo

While everyone is consumed by generative AI, DuckDB is a superpower for those of us in the data world who know how to use it.

What a great example of the power of DuckDB handling massive data.

Benjamin SICARD

Lead Data Architect & Engineer | Scaling Data systems at the intersection of Fintech and AI

1mo

DuckDB for the win <3

Akash Deshpande

Engineering @ BlueCargo | Scaling Early-Stage Startups | Co-Founder | CTO | Investor

1mo

DuckDB is awesome!
