
WHY MOVE TO A CLOUD DATA LAKE?

Motivating factors for moving to a cloud-built data lake include the
business need to:

• Minimize capital expenses for hardware and software.
• Get new analytic solutions to market quickly.
• Eliminate data silos by consolidating multiple data types into a
single, unified, and infinitely scalable platform.
• Capture batch and streaming data in a common repository with
robust governance, security, and control.
• Simultaneously execute multiple workloads — data loading,
analytics, reporting, and data science.
• Establish a robust, fully managed, and extensible environment.

Democratizing Your Analytics

Executives and professionally trained analysts thrive on data
exploration and data-driven decision-making. But to realize the
full potential of your data, your organization needs to make
analytic activities available to the other 90 percent of business
users. Here are four benefits of cloud-built analytics:

»» Data exploration: This entails discovering trends and
patterns, but it's difficult to know in advance precisely what
amount of computing resources you'll need to analyze huge
data sets. The cloud offers on-demand, elastic scalability ideal
for this type of exploratory analysis.

»» Interactive data analysis: This pursuit provides the answer
to a single, ad hoc business question, which may lead to
other questions. The dynamic elasticity of the cloud gives
you the flexibility and adaptability to perform these additional
queries without slowing down other workloads.

»» Batch processing: This refers to scheduling and sending a
specific set of queries to the data lake or data warehouse for
execution. Batch jobs can be huge, and can be a drain on
performance. With a traditional data lake that doesn’t scale,
you often must schedule these large batch jobs at off-hour
times, when more compute resources are available.
»» Event-driven analytics: These demand constant data. They
incorporate new data to refresh reports and dashboards on a
continual basis, so managers can monitor business processes.
Ingesting and processing streaming data requires an elastic
data lake to handle variations and spikes in data flow.

A growing trend is to build analytics into cloud business
applications, which must serve many types of users and support
the queries (workloads) those users run to analyze that data.

REPLACING HADOOP

Accordant Media makes advertising investments more successful for
marketers by unlocking the value of audience data. Its business
model involves analyzing billions of data points about audience
profiles, ad impressions, clicks, and bids to help clients determine
what works for targeted audience segments.

Speed and accuracy are crucial to optimizing ad visibility and
response rates, but load performance was difficult to achieve with
Accordant Media's Hadoop environment, especially when ingesting
data sets approaching 100 terabytes. Formerly, the analytics team
used MapReduce (a programming model within the Hadoop framework
used to access big data) to evaluate, translate, and load queries into a
SQL table and had to manually verify data quality.

Now, Accordant Media no longer needs the resource-intensive and
error-prone SQL-to-MapReduce translation activities and subsequent
data-duplication processes. Its more nimble cloud-based data
warehouse uses standard SQL database comparison methodologies
instead, boosting performance and improving accuracy.

By shifting analytic workloads from Hadoop into a modern cloud-built
environment, the company has increased its analytic performance a
hundredfold, can manage five times as many client environments with
the same staff, has eliminated error-prone data-translation processes,
and has minimized data quality risks.

IN THIS CHAPTER
»» Planning your data lake implementation
»» Complying with privacy regulations
»» Instituting robust data governance
»» Establishing comprehensive data security
»» Improving data retention, protection, and availability

Chapter 3
Reducing Risk, Protecting Data

Your organization's data is incredibly valuable, and this book
is all about maximizing that value with the latest technologies
for storing data, analyzing it, and gaining useful insights.
But if your data is valuable to you, it's also valuable to those who
entrust you with their data and to other, malevolent actors. This
chapter explores how one of your most valuable assets can also be
one of your greatest risks, and what to do about it. It discusses the
need to plan carefully and deliberately as you set up your data lake
to deliver the best in data quality, security, governance, and legal
and regulatory compliance. As recent news indicates, sensitive
information can get into the wrong hands or be improperly
expunged. That can lead to regulatory penalties, lawsuits, and even
jail time. You may also lose your valuable customers.

If you're planning to build or deploy a data lake, it is important to
identify several key issues:

»» What data should your data lake store?
»» How much effort will be required to secure and govern it?
»» How will you meet the European Union's General Data Protection
Regulation (GDPR), the California Consumer Privacy Act (CCPA),
and other data sovereignty and data protection regulations?

»» What data retention policies are important, and how do you
manage the data in the lake to meet the requirements of
those policies?
»» How will you enforce data governance/access controls and
ensure business users can access only the data they’re
authorized to see?
»» How will you mitigate data quality or integrity problems that
can compromise vital business processes?

Implementing Compliance and Governance

Privacy regulations are increasingly rigorous, and organizations
can't ignore them. Leading the way are Europe's GDPR, the
United States' Health Insurance Portability and Accountability Act
(HIPAA), and the CCPA, which PwC has called "the beginning of
America's GDPR."

Data governance ensures data is properly classified, accessed,
protected, and used. It also involves establishing strategies and
policies to ensure the data lake processing environment complies
with necessary regulatory requirements. Such policies also verify
data quality and standardization to ensure the data is properly
prepared to meet the needs of your organization. For example,
data governance policies define access to and control of personally
identifiable information (PII). The types of information that fall
under these specific guidelines include credit card information,
Social Security numbers, names, dates of birth, and other such data.
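
To make the idea concrete, here is a minimal, hypothetical sketch of
column-level PII classification and masking in Python. The column
names, sensitivity tags, and masking rules are illustrative
assumptions; a real governance program would drive them from a data
catalog and enforce them inside the platform itself.

```python
# A minimal sketch of column-level PII classification and masking.
# Column names and tags are hypothetical examples, not a standard.
import hashlib

PII_COLUMNS = {
    "ssn": "restricted",          # Social Security numbers
    "credit_card": "restricted",  # payment card data
    "full_name": "sensitive",
    "date_of_birth": "sensitive",
}

def mask_value(column: str, value: str) -> str:
    """Return a masked or hashed value for PII columns, else the raw value."""
    tag = PII_COLUMNS.get(column)
    if tag == "restricted":
        return "****"  # never expose restricted fields
    if tag == "sensitive":
        # A one-way hash preserves joinability without revealing the value
        return hashlib.sha256(value.encode()).hexdigest()[:12]
    return value

print(mask_value("ssn", "123-45-6789"))     # ****
print(mask_value("full_name", "Jane Doe"))  # a 12-character hash
print(mask_value("zip_code", "94103"))      # passed through unchanged
```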

Implementing effective governance early in your data lake project
will help you avoid potential pitfalls, such as poor access control
and metadata management, unacceptable data quality, and
insufficient data security.

Data governance isn't a technology. Rather, it's an organizational
commitment that involves people, processes, and tools. There are
five basic steps to formulating a strong data governance practice:

1. Establish a core team of stakeholders and data stewards to
create a data governance framework. This begins with an
audit to identify issues with current data management
policies and areas needing improvement.
2. Define the problems you’re hoping to solve, such as better
regulatory compliance, increased data security, and improved
data quality. Then determine what you need to change, such
as fine-tuning access rights, protecting sensitive data, or
consolidating data silos.
3. Assess what tools and skills you will need to execute the data
governance program. This may include people with skills in
data modeling, data cataloging, data quality, and reporting.
4. Inventory your data to see what you have, how it’s classified,
where it resides, who can access it, and how it is used.
5. Identify capabilities and gaps. Then figure out how to fill
those gaps by hiring in-house specialists or by using partner
tools and services.

A data lake achieves effective governance by following proven data
management principles, including adding context to metadata
(data about the data) to make it easier to track where data is
coming from, who touched that data, and how various data sets
relate to one another; ensuring quality data is delivered across
business processes; and providing a means to catalog enterprise data.

Ensuring Data Quality

Like regulatory compliance, data security hinges on traceability.
You must know where your data comes from, where it is, who has
access to it, how it's used, and how to delete it when required.
Data governance also involves oversight to ensure the quality of
the data your organization shares with its constituents. Bad data
can lead to missed or poor business decisions, loss of revenue, and
increased costs. Data stewards — people charged with overseeing
data quality — can identify when data is corrupt or inaccurate,
when it's not being refreshed often enough to be relevant, or
when it's being analyzed out of context.

Ideally, you'll assign these tasks to the business users who own
and manage the data, because they're the people in the best position
to note inaccuracies and inconsistencies. These data stewards
can work with IT professionals and data scientists to establish
data quality rules and processes.
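
As a rough illustration, rules like the ones stewards define can be
codified as simple, automated checks. This Python sketch assumes
records arrive as plain dictionaries; the field names and the
24-hour freshness threshold are hypothetical choices, not fixed
standards.

```python
# A minimal sketch of codified data quality rules. Field names and
# thresholds are hypothetical; real pipelines run such checks at scale.
from datetime import datetime, timedelta, timezone

def check_record(record: dict) -> list[str]:
    """Return a list of data quality violations for one record."""
    problems = []
    # Completeness: required fields must be present and non-empty
    for field in ("customer_id", "amount", "updated_at"):
        if not record.get(field):
            problems.append(f"missing field: {field}")
    # Validity: amounts must be non-negative numbers
    amount = record.get("amount")
    if isinstance(amount, (int, float)) and amount < 0:
        problems.append("negative amount")
    # Freshness: data older than 24 hours may no longer be relevant
    updated = record.get("updated_at")  # assumed timezone-aware datetime
    if isinstance(updated, datetime):
        if datetime.now(timezone.utc) - updated > timedelta(hours=24):
            problems.append("stale record (>24h old)")
    return problems
```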

Incorporating Protection, Availability, and Data Retention

Cloud infrastructure can fail, accidental data deletions can occur,
other human errors can happen, and bad actors can attack —
resulting in data loss, data inconsistencies, and data corruption.
That's why a cloud data lake must incorporate redundant
processes and procedures to keep your data available and
protected. Regulatory compliance and certification requirements
may also dictate that data is retained for a certain minimum
length of time, which can be years.

All cloud data lakes should protect data and ensure business
continuity by performing periodic backups. If a particular storage
device fails, the analytic operations and applications that
need that data can automatically switch to a redundant copy of
that data on another device. Data retention requirements call for
maintaining copies of all your data.

That's not all. A complete data-protection strategy should go
beyond merely duplicating data within the same physical region
or zone of a cloud compute and storage provider. It's important to
replicate that data among multiple geographically dispersed
locations to offer the best possible data protection.

This is because the much-vaunted "triple redundancy" offered by
some cloud vendors won't do you any good if all three copies of
your data are in the same cloud region when an unforeseen
disaster strikes.

Finally, pay attention to performance. Data backup and replication
procedures are important, but if you don't have the right
technology, these tasks can consume valuable compute resources
and interfere with production analytic workloads. To ensure the
durability, resiliency, and availability of your data, a modern
cloud data lake should manage replication programmatically in
the background, without interfering with whatever workloads are
executing at the time. Good data backup, protection, and replication
procedures minimize, if not prevent, performance degradation
and data availability interruptions.
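
For contrast, here is a hedged sketch of what setting up cross-region
replication by hand can look like on S3-style object storage using
boto3. The bucket names and IAM role ARN are hypothetical, and
versioning must already be enabled on both buckets; a managed cloud
data lake service handles the equivalent for you in the background.

```python
# A hedged sketch of enabling cross-region object replication with
# boto3 on S3-style storage. All names below are placeholders.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_replication(
    Bucket="datalake-us-east-1",  # source bucket (hypothetical)
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/replication-role",
        "Rules": [
            {
                "ID": "replicate-everything",
                "Prefix": "",      # empty prefix: replicate all objects
                "Status": "Enabled",
                "Destination": {
                    # The copy lands in a geographically distant region
                    "Bucket": "arn:aws:s3:::datalake-eu-west-1",
                },
            }
        ],
    },
)
```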

Protecting Your Data with End-to-End Security

All aspects of a data lake — its architecture, implementation, and
operation — must center on protecting your data in transit and at
rest. This should be part of a multilayered security strategy that
considers both current and emerging security threats.

Your protection strategy should address external interfaces,
access control, data storage, and physical infrastructure, in
conjunction with comprehensive monitoring, alerts, and cybersecurity
practices. Read on to find out more about the primary aspects
of data lake security.

Encrypting everywhere

Encrypting data, which means applying an encryption algorithm
to translate the clear text into cipher text, is a fundamental
aspect of security. Data should be encrypted when it is stored on
disk, when it is moved into a staging location for loading into the
data lake, when it is placed within a database object in the data
lake itself, and when it's cached within a virtual data lake. Query
results must also be encrypted. End-to-end encryption should be
the default, with security methods that keep the customer in control,
such as customer-managed keys. This type of "always on"
security is not a given with most data lakes, as many highly
publicized on-premises and cloud security breaches have revealed.
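
As a point of reference, this is roughly what authenticated
encryption looks like at the library level. The sketch uses
AES-256-GCM from Python's cryptography package; a well-built cloud
data lake performs the equivalent transparently at every stage
listed above.

```python
# A minimal sketch of authenticated encryption with AES-256-GCM.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)  # 256-bit data encryption key
aesgcm = AESGCM(key)

nonce = os.urandom(12)                     # must be unique per encryption
ciphertext = aesgcm.encrypt(nonce, b"clear text to protect", None)

# Decryption fails loudly if the ciphertext was tampered with
plaintext = aesgcm.decrypt(nonce, ciphertext, None)
assert plaintext == b"clear text to protect"
```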

Managing the key

Once you encrypt your data, you'll decrypt it with an encryption
key (a random string of bits generated specifically to scramble
and unscramble data). To fully protect the data, you also
have to protect the key that decodes your data.

The best data lakes employ AES 256-bit encryption with a
hierarchical key model rooted in a dedicated hardware security
module. This method encrypts the encryption keys and institutes
key-rotation processes that limit the time during which
any single key can be used. Data encryption and key management
should be entirely transparent to the user but not interfere with
performance.
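
The hierarchical key model can be pictured with a simplified
envelope-encryption sketch: a master key wraps per-object data keys,
so rotating the master key means re-wrapping a few small keys rather
than re-encrypting all the data. This is an illustrative sketch only;
in practice the master key lives in the hardware security module and
never appears in application code.

```python
# A simplified sketch of a hierarchical (envelope) key model.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

master_key = AESGCM.generate_key(bit_length=256)  # stand-in for an HSM key

def wrap_data_key(master: bytes) -> tuple[bytes, bytes, bytes]:
    """Generate a fresh data key and encrypt ("wrap") it under the master key."""
    data_key = AESGCM.generate_key(bit_length=256)
    nonce = os.urandom(12)
    wrapped = AESGCM(master).encrypt(nonce, data_key, None)
    return data_key, wrapped, nonce

def rotate_master(old: bytes, new: bytes,
                  wrapped: bytes, nonce: bytes) -> tuple[bytes, bytes]:
    """Re-wrap an existing data key under a new master key.

    The bulk data encrypted with the data key is untouched.
    """
    data_key = AESGCM(old).decrypt(nonce, wrapped, None)
    new_nonce = os.urandom(12)
    return AESGCM(new).encrypt(new_nonce, data_key, None), new_nonce
```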

Automating updates and logging

Security updates should be applied automatically to all pertinent
software components of your modern cloud data lake solution
as soon as those updates are available. If you use a cloud provider,
that vendor should perform periodic security testing (also known
as penetration testing) to proactively check for security flaws.

As an added protection, file integrity monitoring (FIM) tools can
ensure that critical system files aren't tampered with. All security
events should be automatically logged in a tamper-resistant
security information and event management (SIEM) system. The
vendor must administer these measures consistently and
automatically, and they must not affect query performance.
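
In spirit, FIM boils down to comparing cryptographic fingerprints of
critical files against a trusted baseline, as this bare-bones Python
sketch shows. Real FIM products add tamper-resistant baseline
storage, scheduling, and SIEM alerting on top of this idea.

```python
# A bare-bones sketch of file integrity monitoring via SHA-256 hashes.
import hashlib
from pathlib import Path

def fingerprint(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def detect_changes(baseline: dict[str, str], files: list[Path]) -> list[str]:
    """Return the files whose current hash differs from the baseline."""
    return [str(f) for f in files if fingerprint(f) != baseline.get(str(f))]

# Usage: record hashes once into `baseline`, then raise an alert
# whenever detect_changes() returns a non-empty list.
```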

Controlling access

For authentication, make sure your connections to the cloud provider
leverage standard security technologies such as Transport
Layer Security (TLS) 1.2 and IP whitelisting. (A whitelist, in this
context, is the set of IP addresses or address ranges from which
connections are allowed.) A cloud data lake should also support
the SAML 2.0 standard so you can leverage your existing
password security requirements as well as existing user roles.
Regardless, multifactor authentication (MFA) should be required to
prevent users from logging in with stolen credentials. With
MFA, users are challenged with a secondary verification request,
such as a one-time security code sent to a mobile phone.
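
The one-time codes behind many MFA prompts follow the time-based
one-time password (TOTP) standard. This minimal sketch uses the
third-party pyotp package to show the flow; enrollment and secret
storage details are omitted.

```python
# A minimal sketch of the TOTP flow behind many MFA code prompts.
import pyotp

secret = pyotp.random_base32()  # shared once at enrollment, e.g. via QR code
totp = pyotp.TOTP(secret)

code = totp.now()               # what the user's authenticator app displays
assert totp.verify(code)        # what the service checks at login time
```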

Once a user has authenticated, it's important to enforce authorization
to specific data based on each user's "need to know." A
modern data lake must support multilevel, role-based access control
(RBAC) functionality so each user requesting access to the data lake
is authorized to access only data that he or she is explicitly
permitted to see. Discretionary and role-based access control should
be applied to all database objects, including tables, schemas, and
any virtual extensions to the data lake. As an added restriction,
secure views can be used to further limit access — for example,
to prevent access to highly sensitive information that most users
won't need.
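
The authorization check itself reduces to a small set-membership
test, as in this hypothetical Python sketch of RBAC. All role, user,
and object names are invented for illustration; a real data lake
enforces these grants inside the platform rather than in application
code.

```python
# A minimal sketch of role-based access control: users hold roles,
# roles hold grants on objects, and a request is allowed only if some
# role of the user carries the needed privilege.
ROLE_GRANTS = {
    "analyst":    {("sales.orders", "SELECT")},
    "engineer":   {("sales.orders", "SELECT"), ("raw.events", "INSERT")},
    # Only "pii_reader" unlocks the secure view over sensitive columns
    "pii_reader": {("sales.customers_secure_view", "SELECT")},
}
USER_ROLES = {"ana": {"analyst"}, "omar": {"engineer", "pii_reader"}}

def is_authorized(user: str, obj: str, privilege: str) -> bool:
    """True if any of the user's roles grants the privilege on the object."""
    return any((obj, privilege) in ROLE_GRANTS.get(role, set())
               for role in USER_ROLES.get(user, set()))

assert is_authorized("omar", "sales.customers_secure_view", "SELECT")
assert not is_authorized("ana", "raw.events", "INSERT")
```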

Certifying compliance and attestations

Data breaches can cost millions of dollars to remedy and permanently
damage relationships with customers. Industry-standard
attestation reports verify that cloud vendors use appropriate
security controls and features. For example, your cloud vendors
need to demonstrate that they adequately monitor and respond to
threats and security incidents, and that they have sufficient
incident response procedures in place.

In addition to industry-standard technology certifications such
as ISO/IEC 27001 and SOC 1/SOC 2 Type II, verify your cloud provider
also complies with all applicable government and industry
regulations. Depending on your business, this could include
PCI, HIPAA/Health Information Trust Alliance (HITRUST), and
FedRAMP certifications. Ask your providers to supply the attestation
reports that back up these claims, and make sure they provide a
copy of the entire report for each pertinent standard, not just the
cover letters.

Isolating your data

If your data lake runs in a multi-tenant cloud environment, you
may want it isolated from all other data lakes. If this added
protection is important to you, make sure your cloud vendor offers
this premium service.

Isolation should extend to the virtual machine layer. The vendor
should isolate each customer's data storage environment from
every other customer's storage environment, with independent
directories encrypted using customer-specific keys.

Work only with cloud providers that can demonstrate they
uphold industry-sanctioned, end-to-end security practices (see
Figure 3-1). Security mechanisms should be built into the
foundation of the cloud-built data lake-as-a-service. You shouldn't
have to do anything extra to secure your data.

FIGURE 3-1: A complete security strategy considers both data and users.

Facing Facts about Data Security

Effective security is complex and costly to implement, and
cybersecurity professionals are hard to come by. Cloud-built data
lakes shift the responsibility for data center security to the SaaS
cloud vendor. A properly architected and secured cloud data lake
can be more secure than your on-premises data center.

But beware. Security capabilities vary widely among vendors. The
most basic cloud data lakes provide only rudimentary security
capabilities, leaving things such as encryption, access control,
and security monitoring to the customer.

COMPLYING WITH HIPAA GUIDELINES

Amino helps people find the best possible health care by sharing
detailed insights about providers, estimates, and costs. The company
deals with sensitive information and must adhere to HIPAA guidelines
governing the storage, processing, and exchange of patient data.

Amino maintains more than 1 petabyte of data, including information
about 15 million people, 900,000 providers, and 5 billion patient-doctor
interactions. Previously, Amino used Apache Hadoop for batch
processing, in conjunction with MySQL and Postgres for interactive
analytics. To boost security and improve its analytic capabilities,
the IT team consolidated its Hadoop/Hive cluster into a cloud-built
data lake.

A batch job that took seven days to execute with Amino's Hadoop/
Hive cluster takes less than an hour in the cloud, and the company's
new data lake secures all data, at rest or in motion, to comply with
government mandates.
