Ebook: The Data Store For AI


The data store for AI
Contents

01 Introduction
02 The current state of data architecture
03 The data lakehouse defined
04 Components of the architecture
05 Cost optimization opportunities
06 Analytics and data science enhancements
07 IBM watsonx.data
08 Next steps

01 Introduction

This ebook will examine the latest open data management solution for data and analytics leaders who want to significantly reduce cost, simplify data access and automate unified governance to scale AI. It’s time for the data lakehouse.

Data is at the center of every business. It keeps applications running, powers predictive insights and enables better experiences for customers and employees. But the full benefit of data is elusive because of the way that data is stored and accessed for analytics and AI.

You’re not alone if you rely on monolithic repositories with multiple data warehouses and data lakes, on premises and on cloud; 82% of organizations are inhibited by data silos.[1] And it’s about to get worse: according to IDC, the amount of stored data is expected to grow 250% by 2025.[2]

The data lake was supposed to fix all these issues; just land your data in a centralized place and process it. But it’s not so easy to update the lakes, properly catalog data or ensure good governance, and the skillsets required for these tasks are specific, rare and expensive. As a result, data lakes have proven costly to build and maintain. A data warehouse does offer high performance for processing terabytes of structured data. But warehouses can become expensive, too, especially for new and evolving workloads. Most organizations run analytics and AI workloads in ecosystems that are complex and cost inefficient. It’s time for a change.

↑250%: The amount of stored data is expected to grow 250% by 2025.[2]

02 The current state of data architecture

A combination of on-premises and cloud-native warehouses and bespoke data lakes is common for enterprise architecture today. You likely find that juggling cost, siloed data and data governance is a constant challenge.

Figure: warehouses and lakes by cloud orientation and type of workload.

                  Warehouse                 Lake
SaaS              Cloud warehouse           Cloud lake
On premises       Warehouse                 Lake
Performance       High                      Low
Cost              High                      Low
Data              Small structured data     Big unstructured data

The data lakehouse is an emerging paradigm shift in how enterprises surface insights.[3]

03 The data lakehouse defined

Seek out a lakehouse solution that provides a modern data foundation to scale AI.

The data lakehouse is an emerging architecture that offers the flexibility of a data lake with the performance and structure of a data warehouse. Most lakehouse solutions offer a high-performance query engine over low-cost storage in conjunction with a metadata governance layer. Intelligent metadata layers make it easier for users to categorize and classify unstructured data, such as video and voice, and semi-structured data, such as XML, JSON and emails.

The best data lakehouse will offer open-source technologies that reduce data duplication and simplify complex ETL pipelines. Be aware that some first-generation lakehouses have key constraints that limit their ability to address the challenges of cost and complexity. For example, a single query engine that’s designed for business intelligence or machine learning (ML) workloads could well be ineffective when it’s used for another workload type.

The IBM data and AI team believes that every workload is unique and should be optimized with the best-suited environment that keeps cost at a minimum and performance at a maximum. Choose a lakehouse that delivers an optimal level of performance for better decision-making, along with the flexibility that’s necessary to unlock value from all types of data.

Figure 1. How to best scale watsonx.data and accelerate the impact of AI. Layers shown, top to bottom: multiple query engines, metastore, object storage. Data shown: data warehouse and data lake sources holding structured, semi-structured and unstructured data.

04 Components of the architecture

Infrastructure
The infrastructure is where your lakehouse will be deployed, fully managed across any cloud or on-premises environment.

Storage
The storage layer is where the data is physically stored as files, which can use open data formats such as Apache Parquet and Avro. Open data formats are file specifications and protocols made available to the open-source community so that anyone can ingest and enhance them.

Fit-for-purpose query engines
Multiple, fit-for-purpose query engines enable queries to be efficiently executed for large data sets. With workload optimization across multiple query engines and storage tiers, you can minimize the cost of your data warehouse while providing fast, reliable, and efficient big data processing at scale.

Open table formats
Open table formats, such as Apache Iceberg, help you provide structure and deliver the reliability and simplicity of SQL with big data. These formats allow different engines to access the same data at the same time, which helps avoid vendor lock-in. Share data across multiple tools and data repositories, such as your data warehouse; a single copy of data lets you reduce data duplication and break down silos.

Metadata store
The metadata store is where you keep track of the various structures defined in the files, such as tables (Iceberg, Hive, Hudi). It is based on the open-source Hive Metastore, so any external tool can easily connect to and access the metadata within watsonx.data, ensuring interoperability.
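
To make the components in this chapter more concrete, here is a minimal sketch of how a shared metadata store lets more than one engine work from a single copy of the data. Everything in it is an assumption made for illustration: the `Metastore`, `TableEntry` and `Engine` classes, the engine labels and the `s3://` paths are hypothetical, and none of it is the Hive Metastore, Iceberg or watsonx.data API.

```python
from dataclasses import dataclass, field


@dataclass
class TableEntry:
    """Metastore record: where a table's files live and how to read them."""
    name: str
    table_format: str   # illustrative label, e.g. "iceberg"
    file_format: str    # illustrative label, e.g. "parquet"
    data_files: list = field(default_factory=list)


class Metastore:
    """Toy stand-in for a shared metadata store (hypothetical, not a real API)."""

    def __init__(self):
        self._tables = {}

    def register(self, entry: TableEntry) -> None:
        self._tables[entry.name] = entry

    def lookup(self, name: str) -> TableEntry:
        return self._tables[name]


class Engine:
    """Toy query engine: resolves a table through the metastore, then plans a scan."""

    def __init__(self, label: str, metastore: Metastore):
        self.label = label
        self.metastore = metastore

    def plan_scan(self, table_name: str) -> list:
        entry = self.metastore.lookup(table_name)
        # Every engine gets the same file list, so the data is stored once.
        return [(self.label, path) for path in entry.data_files]


metastore = Metastore()
metastore.register(TableEntry(
    name="sales",
    table_format="iceberg",
    file_format="parquet",
    data_files=[
        "s3://example-bucket/sales/part-0.parquet",  # made-up paths
        "s3://example-bucket/sales/part-1.parquet",
    ],
))

sql_engine = Engine("sql-engine", metastore)
batch_engine = Engine("batch-engine", metastore)

sql_files = [path for _, path in sql_engine.plan_scan("sales")]
batch_files = [path for _, path in batch_engine.plan_scan("sales")]
```

Both engines resolve `sales` to the same two files, which is the point of the design: adding an engine means attaching it to the metadata store, not copying the data.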

Governance
Metadata is also stored with open table formats; it serves to define the file formats for any tool that can read or write open data formats.

Technical metadata service
This component is required to understand what data is available in the storage layer. The query engine requires the metadata for the data and tables to provide full lineage and know where it’s located, what it looks like and how to read it.

Data catalogs
This component helps users find the correct data for the job and delivers semantic information for policies and rules. Expect to store business metadata such as business terminologies and tags to enable search and data protection.

Policy engine
This component enables users to define data protection policies and enables the engine to enforce those policies. To create a governance framework that's scalable, a policy engine is often deployed with the technical metadata service and the data catalog.

Query engine
This component is at the heart of the open data lakehouse. A query engine, which can be open source or proprietary, accesses data in open table format and is often known as the compute component. Query engines typically come in two types: an SQL-based query engine, such as the open-source Presto, or an open-source Apache Spark engine or its equivalent.

In an open lakehouse architecture, the query engine is fully modular, which means that the engine can be dynamically scaled to meet workload demands and concurrency. Query engines can also attach to any catalog and storage.
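
As a rough illustration of how the governance components can work together, the sketch below has a data catalog that tags columns and a policy engine that masks tagged columns for roles without clearance. The catalog contents, the role names, the `pii` tag and the `apply_policy` function are all hypothetical; this is a sketch under those assumptions, not how any particular product enforces policies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnInfo:
    name: str
    tags: frozenset  # business tags from the data catalog, e.g. {"pii"}


# Data catalog: business metadata (tags) per column. Contents are made up.
CATALOG = {
    "customers": [
        ColumnInfo("customer_id", frozenset()),
        ColumnInfo("email", frozenset({"pii"})),
        ColumnInfo("country", frozenset()),
    ]
}

# Policy: which tags each role may read unmasked. Roles are hypothetical.
POLICY = {
    "analyst": frozenset(),
    "privacy_officer": frozenset({"pii"}),
}


def apply_policy(table: str, row: dict, role: str) -> dict:
    """Return a copy of row, masking columns whose tags the role may not read."""
    allowed_tags = POLICY.get(role, frozenset())
    result = {}
    for column in CATALOG[table]:
        restricted = column.tags - allowed_tags
        result[column.name] = "***" if restricted else row[column.name]
    return result


row = {"customer_id": 42, "email": "ana@example.com", "country": "PT"}
analyst_view = apply_policy("customers", row, "analyst")
officer_view = apply_policy("customers", row, "privacy_officer")
```

Here `analyst_view` has the `email` value replaced by `***`, while `officer_view` is unchanged. Keeping the tags in the catalog and the rules in the policy table means a new rule applies to every tagged column without touching the data.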

↓50%: Now it’s possible to achieve faster, trusted insights while you cut data warehouse costs in half.[4]

05 Cost optimization opportunities

If your organization has existing on-premises big data implementations, a lakehouse offers a less-expensive alternative for storing data in open formats on object storage. You’ll lower the cost of analytics, decrease complexity and improve time to value.

If you have an existing warehouse implementation, a lakehouse approach can represent a massively scalable, lower-cost alternative for your large analytics workloads that are less sensitive to service-level agreements (SLAs). Warehouses are often expensive and proprietary, but with a lakehouse, you can dramatically reduce storage and compute costs. You can optimize warehouse workloads using fit-for-purpose engines that are based on your workload requirements. The open nature of a lakehouse frees you from proprietary warehouse technology, which means less vendor lock-in and a reduction in IT infrastructure overhead costs.

Watsonx.data helps you save on data querying and processing by pairing the right workload to the right engine across multiple storage tiers. You can optimize for price-performance with fit-for-purpose query engines such as Presto C++, Presto, and Spark and built-in query optimization technology.
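
The idea of pairing each workload with the right engine can be sketched as a simple routing rule plus a cost comparison. The engine names, hourly rates, workload list and `choose_engine` rule below are all invented for illustration; they are not real prices or product behavior, and the resulting figure says nothing about the savings you would see in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    kind: str             # "interactive_sql" or "batch_etl" (illustrative kinds)
    sla_sensitive: bool
    compute_hours: float


# Hypothetical hourly rates per engine; NOT real prices for any product.
RATE_PER_HOUR = {"warehouse": 4.0, "sql_engine": 2.0, "batch_engine": 1.5}


def choose_engine(w: Workload) -> str:
    """Keep SLA-sensitive work on the warehouse; offload the rest by kind."""
    if w.sla_sensitive:
        return "warehouse"
    if w.kind == "interactive_sql":
        return "sql_engine"
    return "batch_engine"


def total_cost(workloads, router) -> float:
    """Sum compute_hours times the hourly rate of the engine the router picks."""
    return sum(w.compute_hours * RATE_PER_HOUR[router(w)] for w in workloads)


workloads = [
    Workload("exec dashboard", "interactive_sql", True, 100.0),
    Workload("ad hoc exploration", "interactive_sql", False, 200.0),
    Workload("nightly ETL", "batch_etl", False, 300.0),
]

baseline = total_cost(workloads, lambda w: "warehouse")  # everything on the warehouse
optimized = total_cost(workloads, choose_engine)
savings_fraction = 1 - optimized / baseline
```

To explore your own situation, replace the rates and workloads with measured values; the routing rule is the part worth debating, since it encodes which workloads are sensitive enough to SLAs to stay on the warehouse.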

IBM watsonx.data is an open, hybrid, and governed data store optimized for all data, analytics, and AI workloads.

06 Analytics and data science enhancements

Proprietary data formats and high storage costs limit AI and ML model collaboration and deployments within a data warehouse environment; data lakes are challenged with low-performing data science workloads. The isolation of these technologies has led to downstream infrastructure challenges, along with the security and governance implications that come with the duplication and movement of data for development of AI and ML models.

A data lakehouse is a great way to help colleagues who are hungry for the insights that lie waiting in your organization’s data. If you’re serious about extracting business value from the firehose of data that’s coming at you, do consider the lakehouse strategy.

Adam Ronthal, vice president and analyst at Gartner, says that “We are moving in the direction where the data lakehouse becomes a best practice.”[3] The best approach will offer an open, collaborative and governed environment for the end-to-end management of data science workloads.

Let’s examine IBM® watsonx.data™, the open, hybrid, and governed data store that’s optimized for all data, analytics, and AI workloads.

07 IBM watsonx.data

Scale AI workloads, for all your data, anywhere. Watsonx.data is an open, hybrid, governed data store optimized for all data, analytics, and AI workloads, built on a data lakehouse architecture (see Figure 1).

Access all of your data and maximize workload coverage across all your hybrid-cloud environments. Expect seamless deployment of a fully managed service across any cloud or on-premises environment. Access any data source, wherever it resides, through a single point of entry and combine it using open data formats. Integrate into your existing environment, including popular IBM databases and z/OS mainframes, with open source and open standards, and interoperability with IBM and third-party services.

Accelerate time to trusted insights. Start fast with built-in governance and automation; strengthen enterprise compliance and security with unified governance across your entire ecosystem. A clear UX and click-and-go console help your teams ingest, access and transform data and run workloads. Watch how quickly they’ll embrace a dashboard that makes it easier for them to save money and deliver fresh, trusted insights.

Reduce the cost of your data warehouse by up to 50%[4] through workload optimization across multiple query engines and storage tiers. Optimize costly warehouse workloads with fit-for-purpose engines that scale up and scale down automatically. Reduce costs by eliminating duplication of data when you use low-cost object storage; extract more value from the data in ineffective data lakes.

Prepare data for AI. Unify, curate, and prepare vectorized embeddings for generative AI applications at scale across trusted, governed data. Enhance the relevance and precision of AI outputs, including chatbots, personalized recommendation systems, and image similarity search applications. Seamlessly connect to trusted data in watsonx.data from IBM watsonx.ai or another AI tool.
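
For readers new to vectorized embeddings, the similarity search mentioned in this chapter comes down to comparing vectors. The sketch below ranks a few stored embeddings against a query using cosine similarity. The document IDs and three-dimensional vectors are made up for illustration (real embeddings have hundreds of dimensions and come from a model), and the in-memory dictionary is a stand-in, not how watsonx.data stores or searches vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up embeddings keyed by hypothetical document IDs.
EMBEDDINGS = {
    "doc_returns_policy": [0.9, 0.1, 0.0],
    "doc_shipping_times": [0.1, 0.9, 0.1],
    "doc_refund_form": [0.8, 0.2, 0.1],
}


def top_k(query_vector, k=2):
    """Return the k document IDs most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_vector, vec), doc_id)
        for doc_id, vec in EMBEDDINGS.items()
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


query = [1.0, 0.0, 0.0]
nearest = top_k(query)
```

With these vectors, `nearest` holds the returns policy first and the refund form second. A chatbot or recommendation system would pass the matching documents on to the next step; at scale, the linear scan would be replaced by a vector index.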

08 Next steps

Take advantage of the IBM team’s data management and optimization knowledge honed by decades of handling the world’s most demanding data workloads. See how quickly you can gain value from watsonx.data.

Start your free trial

Request a demo

1. Why Unstructured Data is the Future of Data Management, VentureBeat, July 2021.

2. Worldwide IDC Global DataSphere Forecast, 2022-2026, IDC, May 2022.

3. The rise of the data lakehouse: A new era of data value, CIO Magazine, 18 August 2022.

4. When comparing published 2023 list prices normalized for VPC hours of IBM watsonx.data to several major cloud data warehouse vendors. Savings may vary depending on configurations, workloads and vendors.

© Copyright IBM Corporation 2023

IBM Corporation
New Orchard Road
Armonk, NY 10504

Produced in the United States of America
May 2023

IBM, the IBM logo, and watsonx.data are trademarks or registered trademarks of International Business Machines Corporation, in the United States and/or other countries. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on ibm.com/trademark.

It is the user’s responsibility to evaluate and verify the operation of any other products or programs with IBM products and programs.

The performance data and client examples cited are presented for illustrative purposes only. Actual performance results may vary depending on specific configurations and operating conditions. THE INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS” WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided.

Statement of Good Security Practices: No IT system or product should be considered completely secure, and no single product, service or security measure can be completely effective in preventing improper use or access. IBM does not warrant that any systems, products or services are immune from, or will make your enterprise immune from, the malicious or illegal conduct of any party.

The client is responsible for ensuring compliance with all applicable laws and regulations. IBM does not provide legal advice nor represent or warrant that its services or products will ensure that the client is compliant with any law or regulation.
