When a 75GB CSV brings everything else to its knees… engineers reach for DuckDB.

A Redditor recently shared their struggle:

- 75GB CSV, ~400 columns, 2B rows
- Dirty data, special characters, duplication
- On-prem SQL Server target
- Tried SSIS and Python; best case: 8 days to process a single file. (Eight!)

The "best" reply: "You can query the file with DuckDB if you only need a subset of the columns and it will handle out-of-memory processing. It can query CSVs directly with FROM read_csv('[path]')... and then dump the result to Parquet."

And it worked. The OP processed all their CSVs into Parquet, cut them down to the fields they actually needed, and loaded them in a fraction of the time.

It's awesome to see engineers start to default to DuckDB. When a product becomes the obvious answer in threads like this, it stops being "just a tool" and starts becoming infrastructure. DuckDB is doing for data engineers what Postgres once did for app developers: becoming the trusted, go-to backbone for hairy data projects.

And this is still the best kind of marketing and product-market fit: engineers recommending your product to other engineers.

(MotherDuck, built around DuckDB, also happens to be a Zero Prime Ventures portfolio company. Seeing the community rally around tech like this is genuinely exciting!)
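In DuckDB SQL, the approach from the thread looks roughly like this. A minimal sketch, not the OP's actual commands: the file name, column names, and the `ignore_errors` option for skipping malformed rows are illustrative assumptions.

```sql
-- Sketch: pull only the needed columns from a huge CSV and write Parquet.
-- File and column names below are hypothetical placeholders.
COPY (
    SELECT customer_id, order_date, amount      -- keep just the fields you need
    FROM read_csv('huge_file.csv',
                  ignore_errors = true)         -- assumption: skip rows DuckDB can't parse
) TO 'huge_file.parquet' (FORMAT PARQUET);
```

Deduplication and filtering can happen in the same pass (e.g. `SELECT DISTINCT ...` with a `WHERE` clause), and because DuckDB streams the CSV rather than loading it all into RAM, a single machine can chew through a file this size. The resulting Parquet files can then be bulk-loaded into the SQL Server target.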
I had a similar experience recently using DuckDB to run a query against a 500 million row, 48 GB CSV in about half a second. The responses to my Twitter post about it were full of people telling me to install various databases, and I was dumbfounded that they thought that would be faster than the half second DuckDB took.
DuckDB is the absolute tops. But what was in that CSV file, exactly??
Seems that parallel computing still needs to be discovered by some "engineers". 8 days...
While everyone is consumed by generative AI, DuckDB is a superpower for those of us in the data world who know how to use it.
What a great example of the power of DuckDB handling massive data.
DuckDB for the win <3
DuckDB is awesome!
Love DuckDB. But sometimes plain old Unix tools are the best solution. awk/sort/cut/uniq/sed can go through a CSV of any size.