Ebook: The Data Store For AI
Contents
01 Introduction
02 The current state of data architecture
03 The data lakehouse defined
04 Components of the architecture
05 Cost optimization opportunities
06 Analytics and data science enhancements
07 IBM watsonx.data
08 Next steps
01
Introduction
This ebook will examine the latest open data management solution for data and analytics leaders who want to significantly reduce cost, simplify data access and automate unified governance to scale AI. It’s time for the data lakehouse.

Data is at the center of every business. It keeps applications running, powers predictive insights and enables better experiences for customers and employees. But the full benefit of data is elusive because of the way that data is stored and accessed for analytics and AI.

You’re not alone if you rely on monolithic repositories with multiple data warehouses and data lakes, on premises and on cloud; 82% of organizations are inhibited by data silos.¹ And it’s about to get worse: according to IDC, the amount of stored data is expected to grow 250% by 2025.²

The data lake was supposed to fix all these issues; just land your data in a centralized place and process it. But it’s not so easy to update the lakes, properly catalog data or ensure good governance, and the skillsets required for these tasks are specific, rare and expensive. As a result, data lakes have proven costly to build and maintain. A data warehouse does offer high performance for processing terabytes of structured data. But warehouses can become expensive, too, especially for new and evolving workloads. Most organizations run analytics and AI workloads in ecosystems that are complex and cost inefficient. It’s time for a change.
Figure: warehouse versus lake.
Warehouse (SaaS cloud warehouse or on premises): high performance, high cost, small structured data.
Lake (SaaS cloud lake or on premises): low performance, low cost, big unstructured data.
[Figure labels: metastore, object storage, unstructured data]
04
Components of the architecture

Infrastructure
The infrastructure is where your lakehouse will be deployed: fully managed across any cloud or on-premises environment.

Storage
The storage layer is where the data physically resides. Data is stored as files, which can use open data formats such as Apache Parquet and Avro. Open data formats are file specifications and protocols made available to the open-source community so that anyone can ingest and enhance them.

Fit-for-purpose query engines
Multiple, fit-for-purpose query engines enable queries to be efficiently executed for large data sets. With workload optimization across multiple query engines and storage tiers, you can minimize the cost of your data warehouse while providing fast, reliable, and efficient big data processing at scale.

Open table formats
Open table formats, such as Apache Iceberg, help you provide structure and deliver the reliability and simplicity of SQL with big data. These formats allow different engines to access the same data at the same time, which helps avoid vendor lock-in. Share data across multiple tools and data repositories, such as your data warehouse; a single copy of data lets you reduce data duplication and break down silos.

Metadata store
The metadata store is where you keep track of the structures defined over the files, such as tables (Iceberg, Hive, Hudi). It is based on the open-source Hive Metastore, so any external tools can easily connect to and access the metadata within watsonx.data, ensuring interoperability.
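
As a rough illustration of how a metadata store lets several engines share a single copy of data, the sketch below models a catalog that maps a table name to its columns and the files holding its rows. The `Catalog` class, the two engine functions, and the file names are hypothetical stand-ins for a real metastore, real query engines, and real Parquet or Avro files; this is not the watsonx.data or Hive Metastore API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TableEntry:
    """Metadata for one table: its column names and the files that hold its rows."""
    columns: list[str]
    files: list[str] = field(default_factory=list)


class Catalog:
    """Hypothetical in-memory stand-in for a metadata store."""

    def __init__(self) -> None:
        self._tables: dict[str, TableEntry] = {}

    def register(self, name: str, columns: list[str], files: list[str]) -> None:
        self._tables[name] = TableEntry(columns, list(files))

    def lookup(self, name: str) -> TableEntry:
        return self._tables[name]


# Toy "object storage": file name -> rows. Real files would be Parquet or Avro.
OBJECT_STORE = {
    "sales/part-0": [("emea", 120), ("amer", 300)],
    "sales/part-1": [("emea", 80), ("apac", 50)],
}


def scan(catalog: Catalog, table: str):
    """Yield rows as dicts by reading every file the catalog lists for a table."""
    entry = catalog.lookup(table)
    for path in entry.files:
        for row in OBJECT_STORE[path]:
            yield dict(zip(entry.columns, row))


def interactive_engine_total(catalog: Catalog, table: str, region: str) -> int:
    """Engine A (assumed): a filtered aggregate, as an interactive engine might run."""
    return sum(r["amount"] for r in scan(catalog, table) if r["region"] == region)


def batch_engine_totals(catalog: Catalog, table: str) -> dict[str, int]:
    """Engine B (assumed): a full group-by, as a batch engine might run."""
    totals: dict[str, int] = {}
    for r in scan(catalog, table):
        totals[r["region"]] = totals.get(r["region"], 0) + r["amount"]
    return totals


catalog = Catalog()
catalog.register("sales", ["region", "amount"], ["sales/part-0", "sales/part-1"])

# Both engines resolve the same table through the same catalog entry,
# so neither needs its own copy of the data.
emea_total = interactive_engine_total(catalog, "sales", "emea")
all_totals = batch_engine_totals(catalog, "sales")
print(emea_total)
print(all_totals)
```

The design point is that the engines never hard-code file locations: each one asks the catalog where the table lives, which is what allows tools to be added or swapped without duplicating data.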
05
Cost optimization opportunities

If your organization has existing on-premises big data implementations, a lakehouse offers a less-expensive alternative for storing data in open formats on object storage. You’ll lower the cost of analytics, decrease complexity and improve time to value.

If you have an existing warehouse implementation, a lakehouse approach can represent a massively scalable, lower-cost alternative for your large analytics workloads that are less sensitive to service-level agreements (SLAs). Warehouses are often expensive and proprietary, but with a lakehouse, you can dramatically reduce storage and compute costs. You can optimize warehouse workloads using fit-for-purpose engines that are based on your workload requirements. The open nature of a lakehouse frees you from proprietary warehouse technology, which means less vendor lock-in and a reduction in IT infrastructure overhead costs.

Watsonx.data helps you save on data
querying and processing by pairing the
right workload to the right engine across
multiple storage tiers. You can optimize
for price-performance with fit-for-purpose
query engines such as Presto C++, Presto
and Spark, together with built-in query
optimization technology.
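
To make the idea of pairing a workload with an engine concrete, here is a minimal sketch of a rule-based router. Only the engine names come from the text; the `Workload` fields, the workload kinds, and the 1,000 GB threshold are invented for illustration and do not describe how watsonx.data actually assigns workloads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    kind: str          # "interactive", "batch" or "ml" (labels assumed for this sketch)
    scanned_gb: float  # estimated volume of data the workload scans


def choose_engine(w: Workload) -> str:
    """Pick an engine for a workload using illustrative, made-up rules."""
    if w.kind in ("batch", "ml"):
        return "Spark"
    if w.scanned_gb >= 1000:  # arbitrary threshold, an assumption of this sketch
        return "Presto C++"
    return "Presto"


workloads = [
    Workload("dashboard refresh", "interactive", 5),
    Workload("ad hoc scan of history", "interactive", 4000),
    Workload("nightly ETL", "batch", 800),
]

# Map each workload to the engine the rules select for it.
plan = {w.name: choose_engine(w) for w in workloads}
for name, engine in plan.items():
    print(f"{name} -> {engine}")
```

A real deployment would base the choice on measured price-performance rather than fixed rules; the sketch only shows the shape of the decision.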
07
IBM watsonx.data

Scale AI workloads, for all your data, anywhere. Watsonx.data is an open, hybrid, governed data store optimized for all data, analytics, and AI workloads, built on a data lakehouse architecture (see figure 1).

Access all of your data and maximize workload coverage across all your hybrid-cloud environments. Expect seamless deployment of a fully managed service across any cloud or on-premises environment. Access any data source, wherever it resides, through a single point of entry and combine it using open data formats. Integrate into your existing environment, including popular IBM databases and z/OS Mainframes, with open source and open standards, and interoperability with IBM and third-party services.

Accelerate time to trusted insights. Start fast with built-in governance and automation; strengthen enterprise compliance and security with unified governance across your entire ecosystem. A clear UX and click-and-go console help your teams ingest, access and transform data and run workloads. Watch how quickly they’ll embrace a dashboard that makes it easier for them to save money and deliver fresh, trusted insights.

Reduce the cost of your data warehouse by up to 50%⁴ through workload optimization across multiple query engines and storage tiers. Optimize costly warehouse workloads with fit-for-purpose engines that scale up and scale down automatically. Reduce costs by eliminating duplication of data when you use low-cost object storage; extract more value from the data in ineffective data lakes.

Prepare data for AI. Unify, curate, and prepare vectorized embeddings for generative AI applications at scale across trusted, governed data. Enhance the relevance and precision of AI outputs, including chatbots, personalized recommendation systems, and image similarity search applications. Seamlessly connect to trusted data in watsonx.data from IBM watsonx.ai or another AI tool.
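
For readers unfamiliar with vectorized embeddings, the sketch below shows the core operation behind a similarity search: ranking stored vectors by cosine similarity to a query vector. The three-dimensional vectors and item names are toy values made up for this example; real embeddings come from a model and have many more dimensions, and this is not a watsonx.data API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, index, k=2):
    """Return the names of the k stored vectors most similar to the query."""
    scored = sorted(index.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [name for name, _ in scored[:k]]


# Toy embedding index: item name -> vector (values are invented).
index = {
    "red shoe": [0.9, 0.1, 0.0],
    "red boot": [0.8, 0.2, 0.1],
    "blue hat": [0.0, 0.1, 0.9],
}

nearest = top_k([1.0, 0.0, 0.0], index, k=2)
print(nearest)
```

A linear scan like this is fine for a handful of vectors; at scale, similarity search typically relies on an approximate nearest-neighbor index instead.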
08
Next steps
Request a demo
1. Why Unstructured Data is the Future of Data Management, VentureBeat, July 2021.
2. Worldwide IDC Global DataSphere Forecast, 2022-2026, IDC, May 2022.
3. The rise of the data lakehouse: A new era of data value, CIO Magazine, 18 August 2022.
4. When comparing published 2023 list prices normalized for VPC hours of IBM watsonx.data to several major cloud data warehouse vendors. Savings may vary depending on configurations, workloads and vendors.

© Copyright IBM Corporation 2023

IBM Corporation
New Orchard Road
Armonk, NY 10504

Produced in the United States of America
May 2023

IBM, the IBM logo, and watsonx.data are trademarks or registered trademarks of International Business Machines Corporation, in the United States and/or other countries. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on ibm.com/trademark.

Statement of Good Security Practices: No IT system or product should be considered completely secure, and no single product, service or security measure can be completely effective in preventing improper use or access. IBM does not warrant that any systems, products or services are immune from, or will make your enterprise immune from, the malicious or illegal conduct of any party.

The client is responsible for ensuring compliance with all applicable laws and regulations. IBM does not provide legal advice nor represent or warrant that its services or products will ensure that the client is compliant with any law or regulation.