Pete Soderling’s Post

When a 75GB CSV brings everything else to its knees… engineers reach for DuckDB.

A Redditor recently shared their struggle:

- 75GB CSV, ~400 columns, 2B rows
- Dirty data, special characters, duplication
- On-prem SQL Server target
- Tried SSIS and Python; best case: 8 days to process a single file. (eight!)

The "best" reply: "You can query the file with DuckDB if you only need a subset of the columns and it will handle out-of-memory processing. It can query CSVs directly with FROM read_csv('[path]')... and then dump the result to Parquet."

And it worked. The OP processed all their CSVs into Parquet, cut them down to the fields they actually needed, and loaded them in a fraction of the time.

It's awesome to see engineers start to default to DuckDB. When a product becomes the obvious answer in threads like this, it stops being "just a tool" and starts becoming "infrastructure". DuckDB is doing for data engineers what Postgres once did for app developers: becoming the trusted, go-to backbone for hairy data projects. And this is still the best type of marketing and product-market fit: engineers recommending your product to other engineers.

(MotherDuck, built around DuckDB, also happens to be a Zero Prime Ventures portfolio company. Seeing the community rally around tech like this is genuinely exciting!)
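For anyone who wants to try the same trick, here's a minimal sketch of the workflow that reply describes, in DuckDB SQL. The file path, column names, and the ignore_errors setting are illustrative assumptions, not the OP's actual schema:

```sql
-- Hypothetical CSV -> Parquet conversion with DuckDB.
-- 'big_file.csv' and the column names are placeholders for illustration.
COPY (
    SELECT DISTINCT           -- collapse exact duplicate rows
        customer_id,          -- project only the columns you actually need
        order_date,
        amount
    FROM read_csv(
        'big_file.csv',
        ignore_errors = true  -- skip unparseable rows instead of aborting (dirty data)
    )
) TO 'big_file.parquet' (FORMAT parquet);
```

Because DuckDB streams the scan and can spill to disk, this works even when the CSV is far larger than RAM, and the resulting Parquet file is a much smaller, faster source to load into SQL Server.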

Tomasz Chudzik

Head of Delivery & Area Director at Unit8 SA

1mo

Love DuckDB. But sometimes plain old Unix tools are the best solution. awk/sort/cut/uniq/sed can go through a CSV of any size.

Jared Lander

Chief Data Scientist at Lander Analytics, Columbia Professor, Author of R for Everyone, Keynote Speaker, and Organizer of the World's Largest R Meetup.

1mo

I had a similar experience recently, using DuckDB to run a query against a 500-million-row, 48 GB CSV in about half a second. The replies to my Twitter post about it were full of people telling me to install various databases, and I was dumbfounded that they thought any of those would be faster than the half second DuckDB took.

DuckDB is the absolute tops. But what was in that CSV file, exactly?

Ángel Narciso

Chief Data Officer | Economist | Advisor

1mo

Seems that parallel computing still needs to be discovered by some "engineers". 8 days...

Kyle Walker

Demographics | Geospatial | AI | Open Source

1mo

While everyone is consumed by generative AI, DuckDB is a superpower for those of us in the data world who know how to use it.

What a great example of the power of DuckDB handling massive data.

Benjamin SICARD

Lead Data Architect & Engineer | Scaling Data systems at the intersection of Fintech and AI

1mo

DuckDB for the win <3

Akash Deshpande

Engineering @ BlueCargo | Scaling Early-Stage Startups | Co-Founder | CTO | Investor

1mo

DuckDB is awesome!
