The Power of Data
May 2019
About me
• BI, Data Warehousing and Big Data Evangelist since 1983.
• Before joining Ultimate, I was Chief Big Data Architect at Visa, and before that I was VP of Data Architecture at Fidelity Investments.
• My first job was with Bob Earle, “the father of OLAP”.
• I worked in the Finance group at Coors Brewing Company, where we created some of the first data warehouses.
• I have given many presentations at IOUW, RMOUG, TDWI, Collaborate, Gartner Group, Oracle Open World and the BI Summit. I have also given Metadata and Data Governance presentations for HIMSS.
• I have a degree in statistics, an MBA in Finance and a Master's in Computer Science.
• I have authored Oracle Essbase & Oracle OLAP: A Guide to Oracle’s Multidimensional Solutions, published by Oracle Press, and Oracle Data Warehousing, published by SAMS.
Agenda
• Challenge
• Analytics – What is it?
• The Power of Data
• Data Governance
• Solutions
• The Data Lake – Cloudera – HDP 3.1
• LLAP and Vectorization
• DataPlane – Ranger and Atlas – DSS and DLM
"Good judgment comes from experience, and a lot of that comes from bad judgment." - Will Rogers
You keep using that word. I do not think it means what you think it means.
What do you mean by “analytics”?
Challenge – Analytics and Data Governance
There are two parts to “analytics”
• The mathy stuff
• The query & reporting stuff
With Analytics I can Predict Behavior
Benford’s Law
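Benford's Law says that in many naturally occurring sets of numbers the leading digit d appears with probability log10(1 + 1/d), so about 30% of values start with 1. The following is a minimal sketch of how observed leading digits could be compared with that expectation. The function names and the sample (powers of 2, a sequence known to track Benford's Law closely) are assumptions chosen for illustration, not part of the original slides.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit 1-9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(value):
    """First significant digit of a non-zero number."""
    # Scientific notation puts the first significant digit first, e.g. '4.539e+03'
    return int(f"{abs(value):.15e}"[0])


def observed_shares(values):
    """Observed share of each leading digit in a list of numbers."""
    counts = Counter(leading_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


# Hypothetical sample for illustration only: 2^1 .. 2^199
sample = [2 ** n for n in range(1, 200)]
expected = benford_expected()
observed = observed_shares(sample)

for d in range(1, 10):
    print(d, round(expected[d], 3), round(observed[d], 3))
```

A large gap between the observed and expected shares is a common first flag for data that deserves a closer look, for example in fraud screening.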
Tesla and LinkedIn Think Resumes Are Overrated.
They Use Neuroscience-Based Games Instead
www.inc.com/kevin-j-ryan/pymetrics-replacing-resumes-with-brain-games.html
That's the philosophy touted by Frida Polli, co-founder and CEO of hiring startup
Pymetrics. The company makes games meant to determine whether a candidate
would be a good fit in a specific role at your company. Polli says that so far, the
platform has been more effective at finding the right hires than traditional resumes.
The results have been promising. Polli says that some companies have more than
doubled the percentage of candidates they hire out of those they invite for in-person
interviews. One-year retention rates have increased by between 30 and 60 percent.
And companies are reporting that job performance has improved among newly hired
candidates.
How to tell if someone will repay a loan?
What’s the smartest way to predict loan payback?
What is a Data Lake?
A single place to store every type of data in its native format with no fixed limits on account size or file
size, high throughput to increase analytic performance and native integration with the Hadoop
ecosystem.
An architectural shift in the BI World that uses Hadoop to deliver deep insight across a large,
broad, diverse set of data at efficient scale.
The primary view of BI: self-service is publishing data
The Power of Data
Find Any Business Data in Sub-second
• Each CPU scans local in-memory columns
• Scans use super fast SIMD vector instructions
• Billions of rows/sec scan rate per CPU core
GDPR – What is it?
• In effect since May 25, 2018
• Potential penalty per infraction: 4% of annual worldwide turnover or €20M, whichever is higher
• Global impact

5 Key General Data Protection Regulation Obligations
• Rights of EU Data Subjects
• Security of Personal Data
• Consent
• Accountability of Compliance
• Data Protection by Design and by Default

www.eugdpr.org
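To make the penalty figures concrete, here is a small sketch of the upper bound, assuming the higher of the two amounts applies (the GDPR ceiling is the greater of €20M and 4% of total worldwide annual turnover). The function name and the turnover figures are hypothetical.

```python
FLAT_CAP_EUR = 20_000_000      # fixed ceiling from the slide
TURNOVER_SHARE_PERCENT = 4     # percentage ceiling from the slide


def gdpr_max_fine_eur(annual_turnover_eur):
    """Upper bound of the fine: the higher of EUR 20M and 4% of turnover."""
    percent_based = annual_turnover_eur * TURNOVER_SHARE_PERCENT / 100
    return max(FLAT_CAP_EUR, percent_based)


# Hypothetical companies, for illustration only
print(gdpr_max_fine_eur(100_000_000))    # 4% is EUR 4M, so the EUR 20M cap applies
print(gdpr_max_fine_eur(2_000_000_000))  # 4% is EUR 80M, which exceeds EUR 20M
```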
Pillars of our comprehensive Data Governance Solution
• Access: Defining what users and applications can do with data. Technical concepts: Data Policies, Authorization.
• Data Protection: Protecting data in the cluster from unauthorized visibility. Technical concepts: Encryption, tokenization, data masking.
• Visibility: Reporting on where data came from and how it’s being used. Technical concepts: Auditing, Lineage.
• Identity: Guarding access to the cluster itself. Technical concepts: Authentication, Network isolation.
• Discovery: Finding Data Assets and Definitions. Technical concepts: Business Glossary, Technical Glossary and Search.
Tool labelled on the slide: Knox.
Pillars of our comprehensive Data Governance Solution
The same five pillars as on the previous slide (Access, Data Protection, Visibility, Identity, Discovery), now with tool labels:
• Knox/Active LDAP Kerberos
• Ranger
• Hardware, File and Column Encryption
• DataPlane & Atlas
ACCESS - Establish and Implement Data Policies
▪ Accomplish: Manage and automate the information lifecycle from ingestion to purge, cradle to grave, based on the unified metadata catalog
- Role-Based Authorization
- Allow an Analyst to see PII data but not a Developer
- Allow for Masking of Data
- Allow for automated enforcement of Data Retention Policies, such as 7 days in Kafka
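As a rough illustration of the retention example, a 7-day policy on a Kafka topic is usually expressed through the topic-level `retention.ms` setting, which takes milliseconds. The sketch below only computes that value. The helper name and the topic name are hypothetical, and applying the setting to a cluster is out of scope here.

```python
from datetime import timedelta


def retention_ms(days):
    """Convert a retention period in days to milliseconds."""
    return int(timedelta(days=days).total_seconds() * 1000)


# Hypothetical topic name; `retention.ms` is the Kafka topic-level setting
topic_config = {"topic": "pii_events", "retention.ms": retention_ms(7)}
print(topic_config)
```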
Dynamic Row Filtering & Column Masking: Apache Ranger with Apache Hive

Source table ww_customers:
Country | National ID | CC No | DOB | MRN | Name | Policy ID
US | 232323233 | 4539067047629850 | 9/12/1969 | 8233054331 | John Doe | nj23j424
US | 333287465 | 5391304868205600 | 8/13/1979 | 3736885376 | Jane Doe | cadsd984
Germany | T22000129 | 4532786256545550 | 3/5/1963 | 876452830A | Ernie Schwarz | KK-2345909

Ranger Policy Enforcement: the query is rewritten based on dynamic Ranger policies, which filter rows by region and apply the relevant column masking.

User 1: Joe (Location: US, Group: Analyst)
Original Query: SELECT country, nationalid, ccnumber, mrn, name FROM ww_customers
Result:
Country | National ID | CC No | MRN | Name
US | xxxxx3233 | 4539 xxxx xxxx xxxx | null | John Doe
US | xxxxx7465 | 5391 xxxx xxxx xxxx | null | Jane Doe
Users from the US Analyst group see data for US persons, with CC and National ID (SSN) as masked values and MRN nullified.

User 2: Ivanna (Location: EU, Group: HR)
Original Query: SELECT country, nationalid, name, mrn FROM ww_customers
Result:
Country | National ID | Name | MRN
Germany | T22000129 | Ernie Schwarz | 876452830A
EU HR Policy Admins can see unmasked values but are restricted by row filtering policies to data for EU persons only.

Groups shown: Analysts, HR, Marketing
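The effect of the two policies can be imitated in a few lines of plain Python. This is only a sketch of the behaviour shown on the slide, not how Ranger or Hive implement it: the policy table, the helper names and the one-country EU set are assumptions, and the policies are keyed by group only, whereas the slide also ties them to user location.

```python
# Sample rows from the slide (DOB and Policy ID omitted for brevity)
ROWS = [
    {"country": "US", "nationalid": "232323233", "ccnumber": "4539067047629850",
     "mrn": "8233054331", "name": "John Doe"},
    {"country": "US", "nationalid": "333287465", "ccnumber": "5391304868205600",
     "mrn": "3736885376", "name": "Jane Doe"},
    {"country": "Germany", "nationalid": "T22000129", "ccnumber": "4532786256545550",
     "mrn": "876452830A", "name": "Ernie Schwarz"},
]

EU_COUNTRIES = {"Germany"}  # assumption: only what the sample data needs


def show_last4(value):
    """Mask everything except the last four characters."""
    return "x" * (len(value) - 4) + value[-4:]


def show_first4_cc(value):
    """Keep the first four card digits, mask the rest."""
    return value[:4] + " xxxx xxxx xxxx"


def nullify(value):
    """Replace the value with null."""
    return None


# Hypothetical policy table: one row filter and a set of column masks per group
POLICIES = {
    "Analyst": {
        "row_filter": lambda row: row["country"] == "US",
        "masks": {"nationalid": show_last4, "ccnumber": show_first4_cc, "mrn": nullify},
    },
    "HR": {
        "row_filter": lambda row: row["country"] in EU_COUNTRIES,
        "masks": {},
    },
}


def run_query(group, columns):
    """Apply the group's row filter, then mask each selected column."""
    policy = POLICIES[group]
    result = []
    for row in ROWS:
        if not policy["row_filter"](row):
            continue
        masked = {}
        for col in columns:
            mask = policy["masks"].get(col)
            masked[col] = mask(row[col]) if mask else row[col]
        result.append(masked)
    return result


joe = run_query("Analyst", ["country", "nationalid", "ccnumber", "mrn", "name"])
ivanna = run_query("HR", ["country", "nationalid", "name", "mrn"])
print(joe)
print(ivanna)
```

Both users issue an ordinary SELECT against the same table; the difference in what they see comes entirely from the policy attached to their group.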
Visibility - Apache Ranger Audits - Data Access
⬢ Comprehensive scalable audit logging
⬢ Audits for:
⬢ Resource Access Events with user context
⬢ Policy Edits/Creation/Deletion
⬢ User session information
⬢ Component plugin policy sync operations
Tag (Classification) Based Masking
Masking Policy
For any Hive columns tagged as containing PII:
• Allow HR to see data in the clear for any type of
PII
• Apply ‘Nullify’ mask to columns classified as
type ‘MRN’ for Analysts
• Apply ‘Hash’ as masking option to columns
classified as type ‘Password’
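The three rules above can be read as a small decision function over a column's tag and the user's group. The sketch below is an illustration under assumptions: the column-to-tag mapping and function names are hypothetical, HR is assumed to see every tagged column in the clear (including passwords), and SHA-256 stands in for the 'Hash' option without claiming it is the algorithm Ranger uses.

```python
import hashlib

# Hypothetical classification of Hive columns (in practice tags come from the catalog)
COLUMN_TAGS = {"mrn": "MRN", "password": "Password", "nationalid": "PII"}


def mask_value(tag, group, value):
    """Return the value a user in `group` would see for a column tagged `tag`."""
    if group == "HR":
        return value                      # HR sees PII in the clear
    if tag == "MRN" and group == "Analyst":
        return None                       # 'Nullify' mask
    if tag == "Password":
        # 'Hash' mask; SHA-256 is an assumption for this sketch
        return hashlib.sha256(value.encode("utf-8")).hexdigest()
    return value


def mask_row(row, group):
    """Apply tag-based masking to every tagged column in a row."""
    return {col: mask_value(COLUMN_TAGS[col], group, val) if col in COLUMN_TAGS else val
            for col, val in row.items()}


row = {"name": "John Doe", "mrn": "8233054331", "password": "s3cret"}
print(mask_row(row, "Analyst"))
print(mask_row(row, "HR"))
```

Because the rules hang off the tag rather than the column name, a newly tagged column picks up the same masking without a new policy.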
CONSUMABILITY: Understand shape of Hive column data with statistical profiler, example:
Profile shows box plot and histogram for distribution of column values
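A box plot is drawn from a five-number summary and a histogram from bin counts, so the profile of a numeric column boils down to two small computations. The sketch below shows them with the Python standard library. The function names and the sample values are hypothetical, and this is not the profiler's actual implementation.

```python
import statistics
from collections import Counter


def five_number_summary(values):
    """Min, quartiles and max: the numbers behind a box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}


def histogram(values, bins):
    """Counts per equal-width bin; the maximum falls into the last bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # avoid a zero width when all values are equal
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    return [counts.get(i, 0) for i in range(bins)]


# Hypothetical column values, for illustration only
column = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(five_number_summary(column))
print(histogram(column, bins=4))
```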
Data Steward Studio (DSS)
DataPlane DSS - Understanding
DataPlane DSS – Where is my PII?
DataPlane DSS – What Tables are Accessed?
DataPlane DSS – PII Trends
DataPlane DSS – Data Lineage
DataPlane DLM as Backup and DR
Questions and Answers
