Variant ratified in Apache Parquet community for open data

1,051,295 followers

Exciting news for the open data community: Variant, the native data type for semi-structured data, is now ratified in the Apache Parquet™ community — with support across Delta Lake, Apache Iceberg™, and Apache Spark™. Variant brings a unified, open standard to how the lakehouse stores and queries flexible data—making it faster, simpler, and more consistent across formats and engines. https://lnkd.in/gESP97Nm

23 Comments

Awadelrahman Ahmed

Data & AI Architect – Strategy, Platforms & Solutions | Databricks MVP | Databricks Technical Council Member | MLflow Ambassador

Thanks to Variant, they just made Delta, Iceberg, and Spark agree on something🙈

8 Reactions

Gaurav Harpale

This is a huge step forward for the open data ecosystem! Standardizing Variant means true interoperability for semi-structured data. Faster queries, simpler pipelines, and consistent handling of flexible data types; this will unlock massive value for lakehouse architectures and AI-driven analytics!

3 Reactions

Sagar Shinde

Director @ CloudEthiX

Exciting progress toward true interoperability in the open data ecosystem!

2 Reactions

ziggiz

We’ve been using this feature for some time at ziggiz, and it’s been a game changer. As a managed Cyber Lakehouse, we’re thrilled to see the open ecosystem moving so quickly — Variant gives us incredible flexibility when working with variable fields across diverse security and IT telemetry sources. Exciting times ahead for truly open, analytical data models.

2 Reactions

iForce.uk

Fantastic progress for open standards and the lakehouse community! Excited to see how Variant improves flexibility and performance. 🚀

2 Reactions

Kobai

Fantastic milestone for the open data ecosystem and for everyone building smarter data architectures! At Kobai we’re extending the power of the Lakehouse with a semantic layer that understands Variant natively — connecting structured and semi-structured data into meaningful knowledge graphs. The result? * Faster context-aware insights * Unified data semantics across Delta, Iceberg, and Spark * A foundation for truly intelligent, explainable AI Learn how Kobai makes your Lakehouse even smarter: kobai.io #Databricks #KnowledgeGraph #Lakehouse #AI #DataEngineering #OpenData

1 Reaction

TopSource Global

"Unified open standard for semi-structured data" — words that shouldn't be this exciting, but here we are 😄 Seriously though, this fixes a real problem. Nice work!

Amit Gupta

Senior Data Architect | Cloud Data Migration Consultant | Databricks Engineer | 2 x Azure | 1 x GCP Data Practitioner

It's amazing 👏

Gaive Gandhi

Sr. Director - Strategy, Consulting and Delivery | Data and AI | Building Delivery Teams | I Am Here To Learn

This is awesome. I have a few questions though: 1. Is the Variant data type supported with custom Serializers like Kryo? 2. Since there is interoperability amongst Delta, Hudi and Iceberg, I am presuming Variant data written using one open table format can be read and processed by another.

Esdras Rocha

Engenheiro de Dados Sênior | Databricks, ADF, PySpark | Multi-Cloud (AWS & Azure) | Arquitetura Lakehouse | ETL & Governança de Dados

Huge step forward for the Lakehouse world! Variant is a game changer for handling semi-structured data seamlessly across Delta, Iceberg, and Spark.


More Relevant Posts

Sriram Reddy

Co-Founder @ Byte Analytics | Ex-Microsoft | Big Data & ML Specialist
1w
Exciting milestone for the open data ecosystem! 🚀 The ratification of Variant as a native data type in Apache Parquet™ is a big step forward for managing semi-structured data. With unified support across Delta Lake, Apache Iceberg™, and Apache Spark™, this truly strengthens interoperability in the open lakehouse world. A major leap toward simplifying how we store, process, and query flexible data — making analytics faster and more consistent across platforms. #DataEngineering #ApacheParquet #DeltaLake #ApacheSpark #OpenSource #Lakehouse
Anji Palla

Lead Architect at Fractal Analytics || MLOps || LLMOps || Azure || AWS
1w
The Databricks Variant type sets a new standard for speed and efficiency with semi-structured data, offering up to 8x faster performance compared to traditional JSON string storage.
- What stands out most is both the remarkable query speed and the reduced storage requirements: Variant uses 22% less storage than plain strings, saving significant time and cost.
- Real-world benchmarks show dramatic gains: ETL jobs that once took hours now finish in minutes, and 1TB queries dropped from over 4 hours to just 20 minutes.
This combination of ultra-fast querying and lower storage overhead makes Variant a clear leader in big data analytics.
Dave Herrald

security leader and storyteller | adviser | former Splunk SURGe | Boss of the SOC (BOTS) co-creator | former Google | Google Cybersecurity Certificate contributing author/instructor | former CISO | GIAC GSE #79
1w
Huge news for anyone building a Security Lakehouse.

Security data like logs, events, and telemetry from various sources (endpoints, cloud infra, SaaS) is inherently semi-structured. A single security event might contain dozens of nested fields, which can change frequently with new product versions or changes to logging config.

Before Variant, handling this required kludgey workarounds:
👎 Storing all the JSON as a massive, opaque string. This is slow to query and wastes compute power.
👎 Trying to force a rigid schema on the data. This leads to brittle pipelines that constantly break when a new field appears.

The new Variant type solves this by providing a unified, open standard for storing this kind of data natively and efficiently within Parquet. This means you can now:
🔥 Simplify Ingestion: Security pipelines become more resilient, as you don't need to preemptively flatten or strictly validate every piece of semi-structured data.
🔥 Accelerate Investigations: You can query nested or evolving fields much faster without complex JSON parsing at query time. Quicker queries mean faster threat detection and response.
🔥 Reduce Costs: More efficient storage and faster queries often translate directly into lower compute costs for your security platform.

This move brings the flexibility needed for modern security data alongside the high performance and open standards of the Lakehouse architecture.
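
The brittleness of the rigid-schema workaround can be illustrated with a small, plain-Python sketch. It does not use Variant or any lakehouse engine; `EXPECTED_FIELDS`, `parse_strict`, and `get_path` are hypothetical names invented for this illustration, and the sample events are made up. It only shows why strict validation breaks when a new field appears, while path-based access on a flexible value keeps working.

```python
import json

# Hypothetical rigid schema: the only top-level fields the pipeline accepts.
EXPECTED_FIELDS = {"timestamp", "host", "event"}


def parse_strict(raw: str) -> dict:
    """Rigid-schema ingestion: reject any record with an unknown field."""
    record = json.loads(raw)
    unexpected = set(record) - EXPECTED_FIELDS
    if unexpected:
        raise ValueError(f"unexpected fields: {sorted(unexpected)}")
    return record


def get_path(record: dict, path: str, default=None):
    """Flexible access: walk a dotted path, tolerating missing fields."""
    current = record
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


# An event from an older agent, and one from a newer agent that adds fields.
old_event = '{"timestamp": "2024-01-01T00:00:00Z", "host": "web-1", "event": {"type": "login"}}'
new_event = (
    '{"timestamp": "2024-01-02T00:00:00Z", "host": "web-1", "agent_version": "2.0", '
    '"event": {"type": "login", "geo": {"country": "DE"}}}'
)
```

With these definitions, `parse_strict(new_event)` raises a `ValueError` because of the added `agent_version` field, whereas `get_path(json.loads(new_event), "event.geo.country")` still resolves and the same call on the old event simply returns the default.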
Wang Ryder

Account Executive
1w
[New Release] Variant is a really good feature for managing semi-structured data like JSON and XML with Delta and Iceberg!
Sylvain Chambon

Senior Solutions Architect, FSI at Databricks
1w
JSON and similar document formats rule the application world, but are very inefficient for analytics. Bridging the gap required clever engineering and federating multiple open source communities. On the technical details, Variant itself is a no-brainer (a binary encoding of JSON with up-front offsets to allow efficient skipping and traversal, similar in concept if not in detail to BSON and others), but shredding (the ability to extract common fields into their own columns within a column chunk) is the game changer for query performance at scale.
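
The "up-front offsets" idea can be sketched with a toy encoding. This is not the actual Variant binary format, whose layout is defined by the Parquet specification; the header layout and the names `encode_object` and `get_field` are assumptions made up for this illustration. The point is only that a reader can locate one field from the offset table without parsing the others.

```python
import json
import struct

# Toy layout (hypothetical, NOT the real Variant encoding):
#   [field_count: uint32]
#   [field_count x (key_off, key_len, val_off, val_len): 4 x uint32]
#   [key bytes and value bytes, back to back]
# Offsets are absolute positions in the buffer; keys are stored sorted.


def encode_object(obj: dict) -> bytes:
    """Encode a flat view of a JSON object with an up-front offset table."""
    keys = sorted(obj, key=lambda k: k.encode("utf-8"))
    header_size = 4 + 16 * len(keys)
    entries = []
    body = b""
    for key in keys:
        key_bytes = key.encode("utf-8")
        # Values are kept as JSON text here purely to keep the sketch short.
        val_bytes = json.dumps(obj[key]).encode("utf-8")
        key_off = header_size + len(body)
        body += key_bytes
        val_off = header_size + len(body)
        body += val_bytes
        entries.append((key_off, len(key_bytes), val_off, len(val_bytes)))
    header = struct.pack("<I", len(keys))
    header += b"".join(struct.pack("<4I", *entry) for entry in entries)
    return header + body


def get_field(buf: bytes, name: str):
    """Binary-search the offset table and decode only the requested value.

    Returns None when the field is absent (a stored JSON null also decodes
    to None in this simplified sketch).
    """
    target = name.encode("utf-8")
    (count,) = struct.unpack_from("<I", buf, 0)
    lo, hi = 0, count - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        key_off, key_len, val_off, val_len = struct.unpack_from("<4I", buf, 4 + 16 * mid)
        key = buf[key_off:key_off + key_len]
        if key == target:
            return json.loads(buf[val_off:val_off + val_len])
        if key < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return None
```

For example, after `buf = encode_object({"user": "alice", "action": "login"})`, a call to `get_field(buf, "action")` reads the table, jumps to the value bytes, and skips everything else. A plain JSON string offers no such shortcut: the reader has to scan the text to find the field.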
Alexander Tsuman

Senior Data/Software Engineer, TL
1w
🚀 Exciting update for the data community! Variant, the new native data type for semi-structured data, has been ratified in the Apache Parquet™ ecosystem — unifying support across Delta Lake, Apache Iceberg™, and Apache Spark™. Variant makes it dramatically easier to store, query, and analyze flexible data like JSON, telemetry, and logs without complex transformations. It introduces schema-on-read efficiency, type inference, and nested field indexing, allowing engines to access data directly with consistent semantics across formats. Early benchmarks show 8x–30x faster performance with new shredding and column projection optimizations — a major step toward simplifying how lakehouses handle semi-structured data at scale. A big win for open data standards and interoperability. 👉 Read more: Introducing Variant – Databricks Blog #Databricks #ApacheParquet #DeltaLake #ApacheSpark #ApacheIceberg #DataEngineering #OpenSource #Lakehouse #BigData #DataPerformance
FengYuan Zhang

Medical beauty institutional investors
6d
A huge milestone for the open data ecosystem! 🌍 The ratification of Variant as a native Parquet data type marks a major step forward in unifying semi-structured data handling across Delta Lake, Apache Iceberg, and Apache Spark. This will simplify pipelines, improve performance, and accelerate innovation across open lakehouse architectures.
Ismael Medina Muñoz

Empowering Businesses through AI: Data & AI Specialist | Solutions Architect | Business Strategist | Master’s in Data Science
1w Edited
Shredding, a technique to columnarize commonly occurring fields in Variant data, improves read performance 8x compared to using regular Variant and 30x compared to using string. Introducing Variant: A New Open Standard for Semi-Structured Data in Apache Parquet™, Delta Lake, and Apache Iceberg™ | Databricks Blog. https://lnkd.in/gB2sNUFt
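
As a rough illustration of what "columnarizing commonly occurring fields" means, here is a minimal Python sketch. It is a conceptual toy, not the shredding scheme defined by the Parquet specification (which is typed and works at the storage level); the function name `shred` and the `min_fraction` threshold are assumptions invented for this example.

```python
from collections import Counter


def shred(rows: list[dict], min_fraction: float = 0.5):
    """Split rows into per-field columns for common keys plus a residual.

    A key is treated as "common" when it appears in at least `min_fraction`
    of the rows. Common keys become dense columns (None where absent);
    everything else stays in a per-row residual dict.
    """
    if not rows:
        return {}, []
    counts = Counter(key for row in rows for key in row)
    common = sorted(k for k, c in counts.items() if c / len(rows) >= min_fraction)
    columns = {k: [row.get(k) for row in rows] for k in common}
    residual = [{k: v for k, v in row.items() if k not in columns} for row in rows]
    return columns, residual


# Made-up sample rows: "ts" and "user" are frequent, the rest are rare.
rows = [
    {"ts": 1, "user": "a", "extra": {"x": 1}},
    {"ts": 2, "user": "b"},
    {"ts": 3, "debug": True},
]
columns, residual = shred(rows)
```

A query that only needs `ts` can then read `columns["ts"]` as a single contiguous column and never touch the residual values, which is the intuition behind the read-performance gains quoted above.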
Gordon Murray

Staff AWS Systems Engineer | Modernizing & Rebuilding Cloud Infrastructure | Terraform | Automation | Security
4d
Two years ago I built a small CDC pipeline using Flink and Hudi, then mostly forgot about it. I noticed the repo still gets a few regular visits and clones, so tonight I updated it to Flink 1.19.1 and Hudi 1.0.2. Hudi is a good choice if you want your data lake to behave a bit more like a database, one that can handle updates and deletes and keep things consistent. It's a complete, working example with Docker Compose, MariaDB CDC, MinIO instead of S3, and a Flink SQL job handling real-time updates. Nice to see it's still helping a few people out there. Repo: https://lnkd.in/ex7zMpzd
Gerard Grundler

Passionate Servant Leader. Bar raising (Direct, Channel, Alliance) Leader. Expertise in Services Development/Delivery & Solution Sales. I build partner ecosystems that multiply GTM Success! * GenAI Patent Pending *!
1w
Exciting news for the open data community: Variant, the native data type for semi-structured data, is now ratified in the Apache Parquet™ community — with support across Delta Lake, Apache Iceberg™, and Apache Spark™. Variant brings a unified, open standard to how the lakehouse stores and queries flexible data—making it faster, simpler, and more consistent across formats and engines. https://lnkd.in/ebpzE7b8 https://search.app/tw5td

Introducing Variant: A New Open Standard for Semi-Structured Data in Apache Parquet™, Delta Lake, and Apache Iceberg™ databricks.com
