Ebook: The Data Store For AI

The data store

↳ for AI
Contents
01 → Introduction
02 → The current state of data architecture
03 → The data lakehouse defined
04 → Components of the architecture
05 → Cost optimization opportunities
06 → Analytics and data science enhancements
07 → IBM watsonx.data
08 → Next steps
01

Introduction

This ebook will examine the latest open data management solution for data and analytics leaders who want to significantly reduce cost, simplify data access and automate unified governance to scale AI. It’s time for the data lakehouse.

Data is at the center of every business. It keeps applications running, powers predictive insights and enables better experiences for customers and employees. But the full benefit of data is elusive because of the way that data is stored and accessed for analytics and AI.

You’re not alone if you rely on monolithic repositories with multiple data warehouses and data lakes, on premises and on cloud; 82% of organizations are inhibited by data silos.1 And it’s about to get worse: according to IDC, the amount of stored data is expected to grow 250% by 2025.2

The data lake was supposed to fix all these issues; just land your data in a centralized place and process it. But it’s not so easy to update the lakes, properly catalog data or ensure good governance, and the skillsets required for these tasks are specific, rare and expensive. As a result, data lakes have proven costly to build and maintain. A data warehouse does offer high performance for processing terabytes of structured data. But warehouses can become expensive, too, especially for new and evolving workloads. Most organizations run analytics and AI workloads in ecosystems that are complex and cost inefficient. It’s time for a change.



↑250%
The amount of stored
data is expected to
grow 250% by 2025.2



02

The current state of data architecture

A combination of on-premises and cloud-native warehouses and bespoke data lakes is common for enterprise architecture today. You likely find that juggling cost, siloed data and data governance is a constant challenge.

Types of workloads, by cloud orientation

                  Warehouse                 Lake
SaaS              Cloud warehouse           Cloud lake
On premises       Warehouse                 Lake
Performance       High                      Low
Cost              High                      Low
Data              Small structured data     Big unstructured data



The data lakehouse is an emerging paradigm shift in how enterprises surface insights.3



03

The data lakehouse defined

Seek out a lakehouse solution that provides a modern data foundation to scale AI.

The data lakehouse is an emerging architecture that offers the flexibility of a data lake with the performance and structure of a data warehouse. Most lakehouse solutions offer a high-performance query engine over low-cost storage in conjunction with a metadata governance layer. Intelligent metadata layers make it easier for users to categorize and classify unstructured data, such as video and voice, and semi-structured data, such as XML, JSON and emails.

The best data lakehouse will offer open-source technologies that reduce data duplication and simplify complex ETL pipelines. Be aware that some first-generation lakehouses have key constraints that limit their ability to address the challenges of cost and complexity. For example, a single query engine that’s designed for business intelligence or machine learning (ML) workloads could well be ineffective when it’s used for another workload type.

The IBM data and AI team believes that every workload is unique and should be optimized with the best-suited environment that keeps cost at a minimum and performance at a maximum. Choose a lakehouse that delivers an optimal level of performance for better decision-making, along with the flexibility that’s necessary to unlock value from all types of data.



Figure 1. How to best scale watsonx.data and accelerate the impact of AI
[Diagram labels: multiple query engines; metastore; object storage; data warehouse; data lake; structured data; unstructured data; semi-structured data]



04

Components of
the architecture
Infrastructure
The infrastructure is where your lakehouse will be deployed, fully managed across any cloud or on-premises environment.

Storage
The storage layer is where the data is physically stored as files, which can use open data formats such as Apache Parquet and Avro. Open data formats are file specifications and protocols made available to the open-source community so that anyone can ingest and enhance them.

Fit-for-purpose query engines
Multiple, fit-for-purpose query engines enable queries to be efficiently executed for large data sets. With workload optimization across multiple query engines and storage tiers, you can minimize the cost of your data warehouse while providing fast, reliable, and efficient big data processing at scale.

Open table formats
Open table formats, such as Apache Iceberg, help you provide structure and deliver the reliability and simplicity of SQL with big data. These formats allow different engines to access the same data at the same time, which helps avoid vendor lock-in. Share data across multiple tools and data repositories, such as your data warehouse; a single copy of data lets you reduce data duplication and break down silos.

Metadata store
The metadata store is where you keep track of various structures defined in the files, such as tables (Iceberg, Hive, Hudi). It is based on the open-source Hive Metastore, so any external tools can easily connect to and access the metadata within watsonx.data, ensuring interoperability.
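The idea of several engines sharing one copy of data through a metadata store can be sketched in a few lines. The following is a minimal illustration only, not how watsonx.data, Iceberg or the Hive Metastore are implemented: a temporary local directory stands in for object storage, a CSV file stands in for an open data format such as Parquet, and the `metastore.json` layout, the `scan` helper and both "engine" functions are hypothetical names invented for this example.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical stand-in for object storage: a local temporary directory.
storage = Path(tempfile.mkdtemp())

# One data file. CSV stands in here for an open data format such as Parquet.
data_file = storage / "orders-0001.csv"
with data_file.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["order_id", "region", "amount"])
    writer.writerows([[1, "EU", 120.0], [2, "US", 80.0], [3, "EU", 40.0]])

# Hypothetical metadata store: table name -> column types and data files.
metastore_file = storage / "metastore.json"
metastore_file.write_text(json.dumps({
    "orders": {
        "columns": {"order_id": "int", "region": "str", "amount": "float"},
        "files": [data_file.name],
    }
}))

CASTS = {"int": int, "str": str, "float": float}


def scan(table):
    """Yield typed rows for a table, using the metastore to find and read files."""
    meta = json.loads(metastore_file.read_text())[table]
    for name in meta["files"]:
        with (storage / name).open(newline="") as f:
            for row in csv.DictReader(f):
                yield {col: CASTS[kind](row[col])
                       for col, kind in meta["columns"].items()}


def reporting_engine_totals():
    """Engine 1 (illustrative): a reporting-style aggregate by region."""
    totals = {}
    for row in scan("orders"):
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


def feature_engine_rows():
    """Engine 2 (illustrative): a feature pass over the same files."""
    return [(row["order_id"], row["amount"] / 100.0) for row in scan("orders")]


totals = reporting_engine_totals()
features = feature_engine_rows()
print(totals)
print(features)
```

Both functions read the same file through the same metadata, so no copy of the data is made for the second workload. That is the property the open table format and metadata store are meant to provide.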



Governance
Metadata is also stored with open table formats; it serves to define the file formats for any tool that can read or write open data formats.

Technical metadata service
This component is required to understand what data is available in the storage layer. The query engine requires the metadata for the data and tables to provide full lineage and know where it’s located, what it looks like and how to read it.

Data catalogs
This component helps users find the correct data for the job and delivers semantic information for policies and rules. Expect to store business metadata such as business terminologies and tags to enable search and data protection.

Policy engine
This component enables users to define data protection policies and enables the engine to enforce those policies. To create a governance framework that's scalable, a policy engine is often deployed with the technical metadata service and the data catalog.

Query engine
This component is at the heart of the open data lakehouse. A query engine, which can be open source or proprietary, accesses data in open table format and is often known as the compute component. Query engines typically come in two types: an SQL-based query engine, such as the open-source Presto, or an open-source Apache Spark engine or its equivalent.

In an open lakehouse architecture, the query engine is fully modular, which means that the engine can be dynamically scaled to meet workload demands and concurrency. Query engines can also attach to any catalog and storage.
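To make the relationship between the data catalog and the policy engine concrete, here is a small sketch under stated assumptions. The `CatalogEntry` and `Policy` classes, the `enforce` function, the `pii` tag and the role names are all hypothetical and invented for this example; they do not describe any IBM API. The sketch only shows the general pattern: tags live in the catalog, policies refer to tags, and enforcement happens when rows are returned.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    """Business metadata for one table: column name -> set of tags."""
    table: str
    column_tags: dict


@dataclass
class Policy:
    """A data protection rule: columns with `tag` are masked for other roles."""
    tag: str
    allowed_roles: frozenset
    mask: str = "****"


def enforce(rows, entry, policies, role):
    """Return copies of the rows with protected columns masked for this role."""
    masked_columns = {}
    for column, tags in entry.column_tags.items():
        for policy in policies:
            if policy.tag in tags and role not in policy.allowed_roles:
                masked_columns[column] = policy.mask
    return [
        {col: masked_columns.get(col, value) for col, value in row.items()}
        for row in rows
    ]


# Illustrative catalog entry, policy and data (all values are made up).
catalog = CatalogEntry(
    table="customers",
    column_tags={"name": {"pii"}, "email": {"pii"}, "country": set()},
)
policies = [Policy(tag="pii", allowed_roles=frozenset({"compliance"}))]
rows = [{"name": "Ada", "email": "ada@example.com", "country": "UK"}]

analyst_view = enforce(rows, catalog, policies, role="analyst")
compliance_view = enforce(rows, catalog, policies, role="compliance")
print(analyst_view)
print(compliance_view)
```

Because the policy refers to a tag rather than to specific columns, tagging a new column in the catalog is enough to bring it under the existing rule.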



↓50%
Now it’s possible to achieve faster,
trusted insights while you cut data
warehouse costs in half.4



05

Cost optimization
opportunities
If your organization has existing on-premises big data implementations, a lakehouse offers a less-expensive alternative for storing data in open formats on object storage. You’ll lower the cost of analytics, decrease complexity and improve time to value.

If you have an existing warehouse implementation, a lakehouse approach can represent a massively scalable, lower-cost alternative for your large analytics workloads that are less sensitive to service-level agreements (SLAs). Warehouses are often expensive and proprietary, but with a lakehouse, you can dramatically reduce storage and compute costs. You can optimize warehouse workloads using fit-for-purpose engines that are based on your workload requirements. The open nature of a lakehouse frees you from proprietary warehouse technology, which means less vendor lock-in and a reduction in IT infrastructure overhead costs.

Watsonx.data helps you save on data querying and processing by pairing the right workload to the right engine across multiple storage tiers. You can optimize for price-performance with fit-for-purpose query engines such as Presto C++, Presto, and Spark and built-in query optimization technology.
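Pairing the right workload to the right engine is, at its simplest, a cost comparison. The sketch below is one way to express that comparison and is not a description of the query optimization technology in watsonx.data. The engine names, hourly costs, supported workload types and runtime estimates are all invented for illustration and do not reflect any real product or price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Engine:
    name: str
    cost_per_hour: float   # hypothetical price, in arbitrary units
    supports: frozenset    # workload kinds this engine can run


@dataclass(frozen=True)
class Workload:
    name: str
    kind: str
    est_hours: dict        # engine name -> estimated runtime in hours


def pick_engine(workload, engines):
    """Choose the engine with the lowest estimated cost for this workload."""
    candidates = [
        e for e in engines
        if workload.kind in e.supports and e.name in workload.est_hours
    ]
    if not candidates:
        raise ValueError(f"no engine can run workload {workload.name!r}")
    return min(
        candidates,
        key=lambda e: e.cost_per_hour * workload.est_hours[e.name],
    )


# Illustrative engines and workloads (every number here is made up).
engines = [
    Engine("fast-sql", 4.0, frozenset({"interactive_sql"})),
    Engine("general-sql", 3.0, frozenset({"interactive_sql", "batch_sql"})),
    Engine("batch-ml", 2.0, frozenset({"batch_sql", "ml"})),
]
dashboard = Workload("dashboard", "interactive_sql",
                     {"fast-sql": 0.5, "general-sql": 1.0})
nightly_etl = Workload("nightly-etl", "batch_sql",
                       {"general-sql": 3.0, "batch-ml": 4.0})

dashboard_choice = pick_engine(dashboard, engines).name
etl_choice = pick_engine(nightly_etl, engines).name
print(dashboard_choice, etl_choice)
```

In this toy setup the most expensive engine per hour wins the dashboard workload because it finishes sooner, while the cheapest engine per hour wins the batch job despite running longer. Price-performance depends on both factors, not on the hourly rate alone.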



IBM watsonx.data is
an open, hybrid, and
governed data store
optimized for
all data, analytics,
and AI workloads.



06

Analytics and data science enhancements

“We are moving in the direction where the data lakehouse becomes a best practice.”3
Adam Ronthal, Vice President, Gartner

Proprietary data formats and high storage costs limit AI and ML model collaboration and deployments within a data warehouse environment; data lakes are challenged with low-performing data science workloads. The isolation of these technologies has led to downstream infrastructure challenges, along with the security and governance implications that come with the duplication and movement of data for development of AI and ML models.

A data lakehouse is a great way to help colleagues who are hungry for the insights that lie waiting in your organization’s data. If you’re serious about extracting business value from the firehose of data that’s coming at you, do consider the lakehouse strategy.

Adam Ronthal, vice president and analyst at Gartner, says that “We are moving in the direction where the data lakehouse becomes a best practice.”3 The best approach will offer an open, collaborative and governed environment for the end-to-end management of data science workloads.

Let’s examine IBM® watsonx.data™, the open, hybrid, and governed data store that’s optimized for all data, analytics, and AI workloads.



07

IBM watsonx.data

Scale AI workloads, for all your data, anywhere. Watsonx.data is an open, hybrid, governed data store optimized for all data, analytics, and AI workloads, built on a data lakehouse architecture (see Figure 1).

Access all of your data and maximize workload coverage across all your hybrid-cloud environments. Expect seamless deployment of a fully managed service across any cloud or on-premises environment. Access any data source, wherever it resides, through a single point of entry and combine it using open data formats. Integrate into your existing environment, including popular IBM databases and z/OS Mainframes, with open source and open standards, and interoperability with IBM and third-party services.

Accelerate time to trusted insights. Start fast with built-in governance and automation; strengthen enterprise compliance and security with unified governance across your entire ecosystem. A clear UX and click-and-go console help your teams ingest, access and transform data and run workloads. Watch how quickly they’ll embrace a dashboard that makes it easier for them to save money and deliver fresh, trusted insights.

Reduce the cost of your data warehouse by up to 50%4 through workload optimization across multiple query engines and storage tiers. Optimize costly warehouse workloads with fit-for-purpose engines that scale up and scale down automatically. Reduce costs by eliminating duplication of data when you use low-cost object storage; extract more value from the data in ineffective data lakes.

Prepare data for AI. Unify, curate, and prepare vectorized embeddings for generative AI applications at scale across trusted, governed data. Enhance the relevance and precision of AI outputs, including chatbots, personalized recommendation systems, and image similarity search applications. Seamlessly connect to trusted data in watsonx.data from IBM watsonx.ai or another AI tool.
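For readers unfamiliar with vectorized embeddings, the sketch below shows the basic operation behind similarity search: documents are represented as vectors, and the closest vectors to a query are returned. It is a generic illustration using cosine similarity, not the watsonx.data or watsonx.ai API. The three-dimensional vectors, the document names and the `top_k` helper are made up for the example; real embeddings come from a model and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the ids of the k vectors in the index most similar to the query."""
    scored = [(cosine_similarity(query, vec), doc_id)
              for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy embeddings (hypothetical values chosen by hand for illustration).
index = {
    "returns-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "gift-cards": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.0]

results = top_k(query, index, k=2)
print(results)
```

A chatbot or recommendation system built on this pattern retrieves the nearest documents first and then passes them to the model. This is why the quality and governance of the underlying data affect the relevance of the output.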



08

Next steps

Take advantage of the IBM team’s data management and optimization knowledge honed by decades of handling the world’s most demanding data workloads. See how quickly you can gain value from watsonx.data.

Start your free trial

Request a demo

1. Why Unstructured Data is the Future of Data Management, VentureBeat, July 2021.
2. Worldwide IDC Global DataSphere Forecast, 2022-2026, IDC, May 2022.
3. The rise of the data lakehouse: A new era of data value, CIO Magazine, 18 August 2022.
4. When comparing published 2023 list prices normalized for VPC hours of IBM watsonx.data to several major cloud data warehouse vendors. Savings may vary depending on configurations, workloads and vendors.

© Copyright IBM Corporation 2023

IBM Corporation
New Orchard Road
Armonk, NY 10504

Produced in the United States of America
May 2023

IBM, the IBM logo, and watsonx.data are trademarks or registered trademarks of International Business Machines Corporation, in the United States and/or other countries. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on ibm.com/trademark.

It is the user’s responsibility to evaluate and verify the operation of any other products or programs with IBM products and programs.

The performance data and client examples cited are presented for illustrative purposes only. Actual performance results may vary depending on specific configurations and operating conditions. THE INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS” WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided.

Statement of Good Security Practices: No IT system or product should be considered completely secure, and no single product, service or security measure can be completely effective in preventing improper use or access. IBM does not warrant that any systems, products or services are immune from, or will make your enterprise immune from, the malicious or illegal conduct of any party.

The client is responsible for ensuring compliance with all applicable laws and regulations. IBM does not provide legal advice nor represent or warrant that its services or products will ensure that the client is compliant with any law or regulation.
