
Businesses have always relied on data, but they have never been able to extract its full value while that data remained siloed by structure, system, or storage. Today’s enterprises demand flexible, well-governed environments that support both operational and analytical workloads, and that shift has made AI an integral part of business strategy.
“The question is no longer if you will adopt AI, but how fast and how effectively you can use it,” declares Geeta Banda, head of outbound product management for data and analytics at Google Cloud.
Businesses understand this, as evidenced by the outlook for AI growth. McKinsey believes that AI could increase US productivity growth by 1.5 percent annually over a ten-year period. But participating in that growth requires a new way to tap into the value of data.
That’s where Google Cloud’s open lakehouse comes in. It is the latest development in the lakehouse, an architecture that combines structured and unstructured data. Google built its version on BigLake, a storage engine that provides the foundation for open data lakehouses. The platform uses open data formats and is engineered for AI deployments at scale; Google Cloud promises it will help accelerate model development, improve data governance, and simplify complex tool chains.
A Faulty Foundation For Enterprise Success
Google Cloud believes we need this because too many companies are trying to build AI on broken foundations. Decades of technical debt and architectural complexity have created multiple barriers to AI success.
In most enterprises, data is scattered across multiple clouds, SaaS applications, and legacy systems. Stitching it all together for even a single use case becomes an arduous task. The explosion of multi-modal data compounds this complexity.
“Your most valuable data is no longer just in rows and columns,” Banda notes. “It’s in customer call transcripts, product images, PDF contracts, and video feeds.”
“Traditional data warehouses are not up to the task of managing all that because they only work with highly structured data,” she asserts. The BI systems built in the early 2000s could not support unstructured data, and they have proved both inflexible and expensive to scale.
Companies tried to rectify these shortfalls with data lakes that could ingest massive amounts of raw data, but the lakes’ lack of governance often turned them into “data swamps.”
The Emergence Of The Open Lakehouse
Enterprises with AI aspirations needed to combine the capabilities of a data warehouse and data lake in one solution. That’s why Google developed the open lakehouse.
Early lakehouse versions brought transactional capabilities to data lake storage, though they had limitations of their own. CIOs had to decide whether to adopt open formats like Iceberg and self-manage the complex infrastructure, or to give up the flexibility of open, interoperable services in exchange for a fully managed one.
Banda hails Google’s open lakehouse as the best of both worlds: “a new standard to store, manage, and activate data for AI projects.”
Sitting atop innovations like BigLake, Iceberg-native storage, serverless Apache Spark, and the Dataplex Universal Catalog, this managed platform embodies Google’s commitment to unifying structured, semi-structured, and unstructured data across the entire data life cycle.
The Components Of A 360-Degree Perspective
Gaurav Saxena, product leader of BigQuery, identifies three main characteristics that differentiate Google’s open lakehouse architecture from others:
- Google brings “planetary-scale” infrastructure to open source. “We’re bringing the best of Google infrastructure to open source,” he says.
- It uses governance to direct AI to work on all relevant data. “What we’re helping enterprises do is get rid of silos, connecting all data to all use cases, fully tapping into all data whether it’s structured or unstructured,” he adds. “That’s where the value lies.”
- Open lakehouse supports multimodal use cases, giving enterprises insight into data coming in through different sources and channels. “Google understands speech, audio, and all types of data, and we can extend that to a data platform that is multimodal, providing a 360-degree perspective on all data,” he concludes.
Open lakehouse integrates several interconnected components, including Apache Iceberg, which serves as the foundational open table format. It brings warehouse reliability to data lake storage in the form of ACID transactions, schema evolution, and time travel (which lets users query historical snapshots).
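Time travel in particular can be exercised straight from SQL. The sketch below is illustrative only: it assumes the google-cloud-bigquery client library and a hypothetical sales.orders table, and it uses BigQuery’s FOR SYSTEM_TIME AS OF clause to read the table as it existed an hour earlier.

```python
# A minimal sketch of time travel using the BigQuery Python client.
# The project, dataset, and table names are hypothetical placeholders.
# pip install google-cloud-bigquery
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# Read a historical snapshot: FOR SYSTEM_TIME AS OF queries the table
# as it existed at a point within its time-travel window.
sql = """
SELECT order_id, status
FROM `my-project.sales.orders`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
"""

for row in client.query(sql).result():
    print(row.order_id, row.status)
```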
BigLake manages unified storage, providing fine-grained access controls, performance acceleration, and data life cycle management without sacrificing openness.
The platform supports interoperable engines including BigQuery for high-performance SQL analytics and serverless Spark for large-scale data processing and machine learning. Crucially, both engines operate on the same Iceberg data managed by BigLake, eliminating data movement and duplication.
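As a rough sketch of that dual-engine pattern, the snippet below reads the same hypothetical table from Spark, here via the open-source spark-bigquery connector; the session setup and table name are assumptions, not prescribed configuration.

```python
# A rough sketch of a second engine over the same governed table.
# Assumes a Spark session with the open-source spark-bigquery
# connector on the classpath; the table name is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-demo").getOrCreate()

# Spark reads the same table that the BigQuery SQL above queries,
# so engineers and analysts share one copy of the data.
orders = spark.read.format("bigquery").load("my-project.sales.orders")

orders.groupBy("status").count().show()
```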
Dataplex Universal Catalog layers governance onto the unified data store through automatic discovery, cataloging, and metadata enrichment.
Unifying Diverse Data Types
As unstructured data has traditionally been siloed, extracting value from it required both a deep understanding of metadata and a way to bring it into a unified data platform. Multimodal tables eliminate that headache because they can combine unstructured and structured data, extending all governance capabilities seamlessly.
Dataplex Universal Catalog helps companies unify their governance by centralizing scattered, passive systems. It creates comprehensive catalogs spanning all data assets. Instead of a static inventory, the catalog uses AI to automate discovery, ensure data quality, and track data lineage.
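For a sense of how such a catalog might be searched programmatically, here is a loosely hedged sketch using the google-cloud-dataplex client library; the project scope, query string, and response fields are assumptions that may differ across library versions.

```python
# An illustrative sketch of programmatic catalog search with the
# google-cloud-dataplex client library. The project scope, query,
# and response fields are assumptions, not documented guarantees.
# pip install google-cloud-dataplex
from google.cloud import dataplex_v1

client = dataplex_v1.CatalogServiceClient()

request = dataplex_v1.SearchEntriesRequest(
    name="projects/my-project/locations/global",  # hypothetical scope
    query="support call transcripts",             # free-text search
)

# Each result wraps a catalog entry enriched with metadata.
for result in client.search_entries(request=request):
    print(result.dataplex_entry.name)
```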
This capability changes the questions users can ask their systems because they are no longer limited to the information contained in structured data. Saxena offers the example of a retailer asking, “Which customers are complaining about performance issues on support calls?”
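A question like that could plausibly be expressed with BigQuery ML’s ML.GENERATE_TEXT function over a transcript table. The sketch below is hypothetical: the remote model, dataset, and column names are placeholders, not real resources.

```python
# A hypothetical sketch: screening support-call transcripts with
# BigQuery ML's ML.GENERATE_TEXT. Model, dataset, and column names
# are placeholders for illustration only.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

sql = """
SELECT
  customer_id,
  ml_generate_text_llm_result AS complaint_assessment
FROM ML.GENERATE_TEXT(
  MODEL `my-project.support.text_model`,
  (
    SELECT
      customer_id,
      CONCAT(
        'Does this support call mention performance problems? ',
        'Answer YES or NO, then summarize briefly: ', transcript
      ) AS prompt
    FROM `my-project.support.call_transcripts`
  ),
  STRUCT(TRUE AS flatten_json_output)
)
"""

for row in client.query(sql).result():
    print(row.customer_id, row.complaint_assessment)
```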
Flexibility And Interoperability
Ease of use is not just built in for the end user but for the developer as well. “The platform is designed to meet developers where they are, allowing them to collaborate without forcing them into one rigid tool chain,” Banda explains.
For example, “data analysts can use high-performance SQL and continue to use BigQuery, while data engineers and scientists can also use advanced analytics, whatever they want to use,” Banda says. The platform supports BigQuery Studio, Jupyter notebooks, and Looker connections, so developers aren’t locked into specific tools.
The open formats are key to interoperability, as Saxena points out. “Apache Iceberg has become a leading open table format. We have made it part of our native formats and brought enterprise capabilities to it,” he says.
The open lakehouse also integrates with Vertex AI, Google Cloud’s fully managed, unified AI development platform. Governed, cataloged data provides trusted input for training models, while metadata grounds large language models, reducing hallucinations and improving accuracy.
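As a minimal sketch of that grounding pattern, the snippet below pulls facts from a hypothetical lakehouse table and passes them to a Vertex AI model as context; the project, table, and model names are all assumptions.

```python
# A minimal sketch of grounding a Vertex AI model with governed
# warehouse data. Project, table, and model names are hypothetical.
# pip install google-cloud-aiplatform google-cloud-bigquery
import vertexai
from vertexai.generative_models import GenerativeModel
from google.cloud import bigquery

vertexai.init(project="my-project", location="us-central1")

# Pull trusted, cataloged facts from the lakehouse...
bq = bigquery.Client(project="my-project")
rows = bq.query(
    "SELECT product, complaint_count "
    "FROM `my-project.support.daily_summary` LIMIT 20"
).result()
context = "\n".join(f"{r.product}: {r.complaint_count}" for r in rows)

# ...and pass them to the model as grounding context.
model = GenerativeModel("gemini-1.5-flash")  # model name is an assumption
response = model.generate_content(
    f"Using only this data:\n{context}\n\nWhich product needs attention first?"
)
print(response.text)
```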
Third-party support rests on open standards like Iceberg and open APIs, keeping the platform engine-agnostic. Organizations can use other Iceberg-compatible engines and train models anywhere, not just in Vertex AI.
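To illustrate that engine-agnosticism, here is a hedged sketch that reads an Iceberg table with the open-source PyIceberg library through an Iceberg REST catalog endpoint; the catalog URI and table identifier are hypothetical placeholders.

```python
# A hedged sketch of engine-agnostic access with PyIceberg, reading a
# table through an Iceberg REST catalog. The endpoint URI and table
# identifier are hypothetical placeholders.
# pip install "pyiceberg[pyarrow]"
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lakehouse",
    **{
        "type": "rest",
        "uri": "https://example.com/iceberg/rest",  # hypothetical endpoint
    },
)

table = catalog.load_table("sales.orders")

# Scan into an Arrow table; any Iceberg-compatible engine can read
# the same underlying files because the format is open.
arrow_table = table.scan(limit=10).to_arrow()
print(arrow_table.to_pydict())
```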
AI Accelerates Time-To-Value
“Simplifying architecture and reducing overhead, accelerating data management, democratizing development by providing flexibility for developers to use the tools of their choice, and optimizing cost and performance all accelerate results and value from AI,” Banda asserts.
The unified data foundation removes silos. Consequently, Saxena explains, “you can seamlessly connect data with any use case at scale without scarce engineering resources being a bottleneck.”
AI also accelerates coding, augmenting human capabilities to enhance productivity. As a result, Saxena points out, “what used to take months now takes mere days to complete.” This is key to competitive advantage: “Organizations now have more ability to experiment and take their products to market faster.”
Responding In Real Time
Acceleration is not just the product of automation but of open lakehouse’s capability to adapt quickly. AI can respond to real-world events in real time, allowing businesses to identify problems and fix them right away. Combining AI with human-in-the-loop capabilities for rapid response builds the confidence needed for broader deployment.
This level of real-time insight is what businesses are coming to demand. Their data queries used to be limited to reports on what happened. But now, as Banda has observed from her conversations with customers, people want their data systems to answer the question: “What should I do next?”
The combination of a unified data platform and AI assistance enables them to get the right answer to that question. Google is hoping that as people strive for increasingly sophisticated AI use cases, its open lakehouse architecture will help support those applications by unlocking the value in structured and unstructured data while minimizing complexity.
Sponsored by Google.