Apache DataFusion

Apache DataFusion · 2025-08-15T14:39:31.623Z

How to use external indexes with Apache DataFusion to make your queries even faster through improved predicate pushdown and reduced I/O. Andrew Lamb gives an outstanding deep dive into external Parquet indexes: https://lnkd.in/gaGn5Cim Acknowledgements: Thank you to Qi Zhu, Adam Reeve, Jigao Luo, Oleksandr Voievodin 🇺🇦, Shehab Amin, Nuno Faria and Bruce Ritchie for their insightful feedback on this blog post.

Software Development

Apache DataFusion is a fast, feature rich and extensible query engine built on the Apache Arrow memory model.


About us

Apache DataFusion is a fast, feature-rich and extensible query engine built on the Apache Arrow memory model. “Out of the box,” DataFusion offers SQL and DataFrame APIs, excellent performance, built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community. Python bindings are also available. DataFusion features a full query planner, a columnar, streaming, multi-threaded, vectorized execution engine, and partitioned data sources. You can customize DataFusion at almost all points, including additional data sources, query languages, functions, custom operators and more. See the Architecture section for more details.

Website: https://datafusion.apache.org
Industry: Software Development
Company size: 51-200 employees
Type: Nonprofit
Founded: 2020


Updates

Apache DataFusion reposted this
Andy Grove

Original creator of Apache DataFusion. Apache Arrow & Apache DataFusion PMC Member.
3w
This is an excellent example of the value of Apache Arrow & Apache DataFusion as the foundation for building new high-performance specialized databases.
Mo Sarwat

CEO @ Wherobots | We are hiring!
3w Edited

Today, we launched SedonaDB, a new open-source, single-node analytical database engine built in Rust that's designed to treat spatial data as a first-class citizen. Unlike its distributed counterparts, such as SedonaSpark, SedonaDB is optimized for small-to-medium data analytics, offering simplicity and speed for single-machine environments. Wherobots is donating SedonaDB to the open-source Apache Sedona community, to be released under the Apache License 2.0.

SedonaDB offers several features that make it a powerful tool for spatial analysis:
- Spatial-Native Processing: SedonaDB is built from the ground up to handle spatial data side by side with non-spatial data. It supports spatial types, joins, coordinate reference systems (CRS), and functions without needing extensions or plugins.
- Performance: It uses query optimizations, indexing, and data pruning to ensure high-performance spatial operations.
- Ease of Use: It is easy to download, install, and embed into applications. It also provides familiar Python and SQL interfaces, with additional APIs for R and Rust.
- Modern Engine: SedonaDB is built on top of Apache Arrow and Apache DataFusion, providing a modern, vectorized query engine.
- Integration: It integrates with GeoArrow, GeoParquet, and GeoPandas, making it easy to use with other popular geospatial libraries. It can query data stored locally or remotely in cloud storage such as AWS S3.

SedonaDB and SedonaSpark are both necessary because they cater to different spatial data processing and AI needs based on scale and environment. SedonaSpark is ideal for large-scale workloads and production environments that already use Spark, such as joining vector datasets ranging from 100 GB to PBs with large raster datasets. Its distributed nature, however, introduces unnecessary overhead for smaller datasets, making local computations slower and more complex. In contrast, SedonaDB is optimized for smaller datasets and local computations, providing a faster and simpler solution.

The two projects are being developed for full interoperability, ensuring that functions and SQL code can be easily transferred between them.

SedonaDB GitHub repo: https://lnkd.in/eB8suErW
Apache Sedona blog: https://lnkd.in/eaRWJ2ug
Wherobots announcement blog: https://lnkd.in/e7dhKSsi
Apache DataFusion reposted this
Kyle Barron

Cloud Engineer at Development Seed
1mo
New Lonboard release and new demo! Integrating marimo and Apache DataFusion to visualize the NYC taxi dataset. https://lnkd.in/egQh7pa7 I've been working on geospatial extensions for the Apache DataFusion SQL query engine, using GeoArrow as the underlying compute layout. It's early, but I'm working on fleshing out the PostGIS API. And there are Python bindings too! https://lnkd.in/eX8eApKZ. This was my first time using marimo and it was a joy to use! And its interactivity plays really nicely with Lonboard. Lonboard's 0.12 release improved the support for GeoArrow data types, and is moving towards being fully GeoArrow-native. Shapely is no longer a required dependency! https://lnkd.in/ergExzmu

Apache DataFusion reposted this
Dipankar Mazumdar

Director-Data+GenAI @Cloudera | Apache Iceberg, Hudi Contributor | Author of “Engineering Lakehouses”
5d
Apache DataFusion Comet ☄️

For years, there have been various initiatives to accelerate #ApacheSpark execution. For example, "Whole Stage Code Generation" was introduced in Spark 2.0 to replace the Volcano model, achieving a 2x speedup. Comet integrates into Apache Spark, replacing Spark’s execution engine with DataFusion’s native execution path for better performance.

So, what does DataFusion Comet bring?
✅ Comet is a high-performance accelerator for Spark, built on top of the Apache DataFusion query engine.
✅ It integrates with the Spark ecosystem without requiring any code changes, allowing Spark users to benefit from faster performance with zero modifications.
✅ It enables dual execution paths: JVM (Spark) and native (DataFusion). If a query cannot be processed natively, it falls back to Spark’s standard execution.
✅ The Apache Arrow columnar format is used to share data between the JVM and the native execution space.
✅ The native execution path uses SIMD (Single Instruction, Multiple Data) to accelerate columnar execution.
✅ Comet runs on commodity hardware, eliminating the need for specialized hardware accelerators like GPUs or FPGAs, which keeps Spark deployments cost-effective and scalable.

Comet's architecture allows Spark workloads to benefit from optimized columnar execution without requiring any changes to existing Spark jobs.

Benchmark: Running TPC-H queries on 100 GB of #Parquet data using a single executor (8 cores) has shown close to 2x speed improvements.

Comet is a relatively new project, but it has already had multiple releases, with 0.11.0 being the latest. I would highly encourage checking out the GitHub repo. I’ll drop some detailed reading links in the comments. #dataengineering #softwareengineering
Apache DataFusion

2,098 followers
3w
User Defined Types in DataFusion. 🚀 https://lnkd.in/gefEsusY

Implementing User Defined Types and Custom Metadata in DataFusion datafusion.apache.org

Apache DataFusion

2,098 followers
1mo
🚀 Introducing Comet 0.10.0, the latest release of Apache DataFusion’s Spark accelerator, with Spark 4.0.0 support, better Spark compatibility, enhanced Iceberg support, and improved SQL function coverage. Find out more: https://lnkd.in/gyxdimGQ

Apache DataFusion Comet 0.10.0 Release datafusion.apache.org

Apache DataFusion

2,098 followers
1mo
Improving TopK queries with Apache DataFusion? Dynamic filters can help https://lnkd.in/giatY_WS

Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries datafusion.apache.org
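
To make the idea in the post title concrete, here is a minimal sketch of a TopK query that tightens a filter while it runs. It is plain Python, not DataFusion's implementation: the function name `top_k_largest`, the `(batch_max, values)` layout, and the sample data are all assumptions made for illustration. The point it shows is that once `k` values are held, the smallest of them becomes a threshold, and any batch whose precomputed maximum cannot beat that threshold can be skipped without reading its rows.

```python
import heapq


def top_k_largest(batches, k):
    """Return the k largest values plus the number of batches skipped.

    `batches` is an iterable of (batch_max, values) pairs, where batch_max
    stands in for a precomputed statistic (hypothetical layout).
    """
    heap = []            # min-heap holding the best k values seen so far
    batches_skipped = 0
    for batch_max, values in batches:
        # Dynamic filter: once the heap is full, heap[0] is the threshold
        # a value must exceed to enter the top k.
        if len(heap) == k and batch_max <= heap[0]:
            batches_skipped += 1
            continue
        for value in values:
            if len(heap) < k:
                heapq.heappush(heap, value)
            elif value > heap[0]:
                heapq.heapreplace(heap, value)
    return sorted(heap, reverse=True), batches_skipped


# Illustrative data only: four batches with their maximum values.
sample_batches = [
    (90, [10, 90, 50]),
    (40, [40, 5]),
    (95, [95, 20]),
    (30, [30, 1]),
]
top_values, skipped = top_k_largest(sample_batches, k=2)
```

In this sketch the threshold only ever rises, so the later a batch arrives, the more likely it is to be skipped; the linked post describes how the real engine passes such information between operators during execution.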

Apache DataFusion reposted this
Andrew Lamb
2mo
Join Nga Tran and Dewey Dunnington on November 12 in Boston for a pre-🦃 Apache DataFusion meetup. Thanks to Datadog for sponsoring the space and food. What better way to spend an evening than talking about databases, geospatial data, and more? https://lu.ma/w9pw5rce

Boston Apache DataFusion Meetup · Luma lu.ma

Apache DataFusion reposted this
Apache DataFusion

2,098 followers
2mo Edited
How to use external indexes with Apache DataFusion to make your queries even faster through improved predicate pushdown and reduced I/O. Andrew Lamb gives an outstanding deep dive into external Parquet indexes: https://lnkd.in/gaGn5Cim Acknowledgements: Thank you to Qi Zhu, Adam Reeve, Jigao Luo, Oleksandr Voievodin 🇺🇦, Shehab Amin, Nuno Faria and Bruce Ritchie for their insightful feedback on this blog post.

Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Apache Parquet datafusion.apache.org
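
As a rough illustration of how an external index can reduce I/O, the sketch below keeps per-file minimum and maximum values outside the data files and consults them before any file is opened. It is plain Python and does not use DataFusion's API: `FileStats`, `files_to_scan`, and the `part-*.parquet` names are hypothetical, and a real index could hold other structures than min/max ranges.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """One entry of a hypothetical external index: value range per file."""
    path: str
    min_value: int
    max_value: int


def files_to_scan(index, lower, upper):
    """Return paths whose [min, max] range can overlap lower <= x <= upper.

    Files whose range falls entirely outside the predicate are pruned,
    so they are never opened or read.
    """
    return [
        entry.path
        for entry in index
        if entry.max_value >= lower and entry.min_value <= upper
    ]


# Illustrative index only; the file names and ranges are made up.
external_index = [
    FileStats("part-0.parquet", 0, 99),
    FileStats("part-1.parquet", 100, 199),
    FileStats("part-2.parquet", 200, 299),
]
selected = files_to_scan(external_index, lower=150, upper=210)
```

Pruning of this kind is conservative: a file that survives the check may still contain no matching rows, so the predicate must still be applied when the remaining files are read.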

Apache DataFusion reposted this
Nga Tran

Database and Distributed System
2mo Edited
The Apache DataFusion Meetup is coming to Datadog in Boston on Nov 12, 2025. Join us for an evening of DataFusion, community, and pizza! Hear from Carl Yeksigian on Datadog use cases, Andrew Lamb on DataFusion history, Dewey Dunnington on integrating DataFusion at Wherobots, and more. Register at https://lu.ma/w9pw5rce. If you are interested in speaking, let us know at https://lnkd.in/eXe7JR_W #DataFusion #OpenSource #BostonTech #Database #DataEngineering #MCP

Boston Apache DataFusion Meetup · Luma lu.ma


Employees at Apache DataFusion

Sufian Qayyum

Senior AI/ML Engineer & Consultant | Agentic AI Systems · LLM Orchestration · Autonomous Support Platforms

Jiashu Hu

Computer Science, Open to work at United States or Mexico

Aditya Singh Rathore

GSSoC’25 |Fabric certified Data Engineer | 5 x Azure certified |Data engineering top voice 💡


Similar pages

Apache Arrow

Velox

marimo

Apache Gluten

Apache Iceberg

InfluxData

bauplan

DuckDB

Apache Doris

Apache Sedona