Performance and Tuning - 6

Performance and Tuning - Overview

Query Performance Analysis Tools


Snowflake is a managed service, with most low-level tasks such as data partitioning, compression, and encryption handled automatically by the system.
But there are still areas we need to watch closely to ensure optimal query performance.
Writing some poorly optimized SQL that keeps a warehouse busy for an additional hour, preventing it from going into the suspended state, could contribute significantly to your monthly bill.
It's therefore really important that we have some methods to identify and analyze when something isn't running optimally.
There are three places to go to get a better idea of the
performance of queries.
The first is query history, found in both the classic console and the Snowsight UI.
The second are the query history views and table functions
for programmatic analysis of query performance.
And lastly, the query profile, which provides a breakdown
of the steps involved in executing a query to pinpoint any
issues.
There are a couple of facts that could come up in the exam regarding the query history page.
Firstly, it only stores query history for the last 14 days, and although users can view other users' query text, they cannot view their query results.
The query history page will be the first place to look to
identify a query and get some high level information about
it.
It's available to all users regardless of what role they
currently have active during the session.
It provides an interactive tabular view of queries
executed in the last 14 days.
You can see other users' queries if you have the operate or
monitor privilege on the warehouse used to execute that
query. However, you can't see another user's results.
The table entries can be filtered by a number of different criteria; for example, we can see all failed queries or identify long-running queries.
In the duration column of the table, there's a high-level breakdown of the query steps and how long each took.
Using this, we could compare the compilation time versus
the actual execution time.
The compilation time is usually very quick; this is the query engine performing operations like cost-based optimization and micro-partition pruning. Execution time is made up of the steps to actually process the data: the processing time on the CPU and reading and writing to local and remote disk.
We could add additional columns, like bytes scanned and rows, to give us an idea of the size of the data we're computing with that query.
If we click the query we executed at the beginning, we get the query details page, with the query results available at the bottom. We can export results locally, but bear in mind there is a limit of a hundred megabytes, and the results can only be exported as CSV or TSV.
In the next tab over, we have the query profile. This is the most important tool for analyzing query performance.
It breaks down a single query into the steps the query
engine took to execute it.
It's really helpful for understanding the mechanics of a
query, helping to diagnose any issues and areas for
improvement.
It's very similar to what you would get by running an EXPLAIN command: EXPLAIN returns an execution plan for a query, detailing the operations that Snowflake would perform to execute it.
The query profile, in a very similar way, details the operations that Snowflake actually did perform, represented graphically.
Each selectable box here is an operator node. There's a line
representing a relationship between each operator node
with how many records passed between them shown over
the arrow together they form an operator tree.
If we have the operator nodes deselected, we have three
boxes on the right.
The first is the most expensive nodes box, which gives us an idea of which of the operator nodes was the most expensive in terms of duration to complete. This can be quite helpful if there are many nodes.
The next box is the profile overview. Here we can see a more granular breakdown of the execution time shown in the duration column of the query history page. Processing is the percentage of the total execution time spent on data processing by the CPU of the virtual warehouse. Local disk IO is the percentage of time processing was blocked by local disk access; in other words, when the virtual warehouse had to read or write data from the local SSD storage.
Remote disk IO is when the virtual warehouse had to read or write data from the remote storage layer.
Going down, in the statistics box we get some statistics on the query as a whole: things like input/output, operations performed, and information around pruning of micro-partitions.
Each operator has some high-level information on the card itself: the operator type, such as TableScan, which represents an operation to read from a single table. It also has the operator ID in square brackets and the percentage of query time each node took.
If we click on our nodes, we get some additional
information with each operator type having its own
relevant statistics and attributes.
If you want more than 14 days of query history, or a bit more information than the history tab provides, you can make use of the account usage views and the information schema views and table functions.
The query history view in the account usage schema in the
Snowflake database gives us the ability to investigate
programmatically the query history within the last year.
By default, only the account admin has permissions to read
this view. The query history view has a latency of about 45
minutes. This means if a query is executed, there's no
guarantee it will appear until at least 45 minutes later.
Let's do a select star on this view.
If we scroll across, we can see the sheer number of columns at our disposal. But let's see how we could use this view to investigate query performance.
For example, we can get the 10 longest running queries, expressed in seconds. The same information can be found in the information schema of each database.
However, the information schema table function QUERY_HISTORY only has data going back seven days, although this table function has practically zero latency.
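As a sketch, a query for those 10 longest running queries against the account usage view might look like the following (assuming your role has access to the SNOWFLAKE database; elapsed time is reported in milliseconds):
SELECT QUERY_ID,
QUERY_TEXT,
WAREHOUSE_NAME,
TOTAL_ELAPSED_TIME / 1000 AS DURATION_SECONDS -- convert milliseconds to seconds
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE START_TIME > DATEADD(DAY, -30, CURRENT_TIMESTAMP())
ORDER BY TOTAL_ELAPSED_TIME DESC
LIMIT 10;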

SQL Tuning
One of the first things to consider when writing SQL queries, especially if you're joining many tables, using subqueries, or creating a set of steps in a pipeline, is the order of execution.
The order of execution tells us in what order the query optimizer will apply an operation.
It will first execute row operations, such as FROM, JOIN, and WHERE, to prune away micro-partition files.
If there's grouping functionality with GROUP BY and HAVING, it will apply those next.
Next come statements like SELECT, DISTINCT, ORDER BY, and LIMIT, which shape the result a user sees.
This is why if you include a limit on a query that also has a
group by, it won't limit the underlying records used to
calculate the group by results. It just limits the result that is
returned to the user.
The reason row operations are executed first is to decrease
the total amount of data used to do more expensive
operations, such as group by. Less data equals faster
operations.
One concrete recommendation is to prioritize including a WHERE clause and to focus on where exactly you're using it.
If I have a subquery, am I filtering in the subquery or in the outer query? Ensure that you use a WHERE clause as early as possible.
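As a minimal sketch using Snowflake's sample CUSTOMER table (which also appears later in this section), compare filtering in the outer query against filtering inside the subquery. The optimizer can often push predicates down itself, but writing the filter early documents the intent and guarantees less data reaches the GROUP BY:
-- Filter applied late, in the outer query, after the aggregation
SELECT * FROM (
SELECT C_NATIONKEY, SUM(C_ACCTBAL) AS TOTAL_BAL
FROM CUSTOMER
GROUP BY C_NATIONKEY
) WHERE C_NATIONKEY = 10;

-- Filter applied early, in the subquery, before the aggregation
SELECT * FROM (
SELECT C_NATIONKEY, SUM(C_ACCTBAL) AS TOTAL_BAL
FROM CUSTOMER
WHERE C_NATIONKEY = 10
GROUP BY C_NATIONKEY
);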

Best practices around joining

Let's say we have a table of orders, where each row is a unique order. We also have a separate table which records the price of each product. Using these tables together, we can calculate the total cost of each order.
In theory, the following query should do that for us:

SELECT *, (O.ORDER_AMOUNT * P.PRODUCT_PRICE) FROM ORDERS O
LEFT JOIN PRODUCTS P ON O.PRODUCT = P.PRODUCT;

A user might use a left join to ensure there is only one row for each order. However, the output is not the expected four rows, but six. An additional two rows have been generated. This is because the column used in the predicate of the join is not unique.
In the products table, there are two values for product
name, one for the most up-to-date cost and the other
outdated.
If we tried to sum up how much money we made from all
the orders, it would produce an incorrect result because of
this.
Now at this scale, the performance is fine. Yes, the result is likely to be wrong, but it will complete in good time.
Let's look at the query profile's operator tree for this query. We can see the number of rows output to the next operator. Right above our join operator, it shows an additional two rows have been added.
You could imagine if we were to join tables with millions of
rows, this problem would be compounded, considerably
increasing query time.
So what can we do about this?
We can join on unique columns so that the row from the left
table joins to exactly one row in the right table.
In the real world, tables may not be perfectly normalized or could come in weird structures. It's a good idea to understand how fields in multiple tables relate to each other and the impact of joining on them.
Limit and order by operation
If we have an order by operation, which is ordering a lot of
records, one way to speed this up is by including a
limit clause.
I have here a screen capture from the query profile, for a
select query on a large table in Snowflake's sample data
set.
We're ordering on account balance in the CUSTOMER table, which is about 10 GB in size. This query took about five minutes for an extra-small warehouse to execute.
SELECT * FROM CUSTOMER ORDER BY C_ACCTBAL;
If we take a look at the operator tree we can see there is an
operator called sort, which takes 46.4% of the total
duration of this query. This is the order by clause doing its
work, and ideally, we'd like to get this number down.
When calculating a sort, the intermediate results need to be kept somewhere. If a virtual warehouse's in-memory storage runs out, it'll start spilling to local SSD storage.
But if we include a limit:
SELECT * FROM CUSTOMER ORDER BY C_ACCTBAL LIMIT 10;

This produces a new operator in the query profile called sort with limit. It finishes in a fraction of the time the original query took, and you can see there was no spilling to local storage.

Spilling
There could be a legitimate scenario when you would like to order thousands or millions of rows, but your query profile keeps indicating that you're spilling tons of data to the slower local storage.
This simply means the in-memory store is not large enough to hold the intermediate results of a query, which leads to performance degradation. In the query profile this shows up as "bytes spilled to local storage."
If your virtual warehouse's local storage also fills up, Snowflake will begin to spill to remote storage, shown as "bytes spilled to remote storage."
If you're on AWS, behind the scenes, Snowflake will have provisioned a bucket where it can spill intermediate results to. This, again, is even slower than local storage.
There are two solutions to this.
The first is simply to process less data. Consider throwing
in a where clause to reduce the number of micro-partitions
the query is scanning.
The second is to increase the size of your virtual
warehouse.
As we know, the larger the size of the virtual warehouse,
the more memory and more CPU the underlying servers
themselves have, so you'll be less likely to spill to local or
remote disks.
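As a sketch, resizing a warehouse (MY_WH is a placeholder name) is a single ALTER statement:
-- More memory per node makes spilling to local or remote storage less likely
ALTER WAREHOUSE MY_WH SET WAREHOUSE_SIZE = 'MEDIUM';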
order by Position.
The position of an order by could be changed to improve
performance.
For example, if you have two order by, one in a subquery
and another in an outer query, the order by in the subquery
would be redundant and a waste of compute resources.
SELECT C_NATIONKEY, R.R_NAME, TOTAL_BAL FROM
(
SELECT
C_NATIONKEY,
COUNT(C_ACCTBAL) AS TOTAL_BAL
FROM CUSTOMER
GROUP BY C_NATIONKEY
ORDER BY C_NATIONKEY
) C JOIN REGION R ON (C.C_NATIONKEY = R.R_REGIONKEY)
ORDER BY TOTAL_BAL;

This is because you're reordering the result of the subquery again in the outer query. ORDER BY is an expensive operation, so the recommendation would be to have just one ORDER BY, in the top-level select.
Group by
The GROUP BY clause allows us to calculate aggregates based on rows that share the same value.
For example, here we're grouping by country and counting all the customers' account balances.
When using a GROUP BY, it's important to consider the cardinality of the column you're grouping by, as this can drastically impact query performance.
Cardinality is a measure of how unique a column is. If we
use something like country, the values aren't very unique
as, of course, there's a limited number of countries in the
world.

SELECT C_NATIONKEY, COUNT(C_ACCTBAL)
FROM CUSTOMER
GROUP BY C_NATIONKEY; -- Low Cardinality

This is considered low to medium cardinality. If we look at the query profile operator tree and profile overview, we can see that the aggregate only took 3.5% of the total duration the query ran.
However, if we were to group by something with very high cardinality, that is, a column with many distinct values, such as a unique ID, timestamp, or full name, the GROUP BY operation will become very expensive.
It would have to perform an aggregate, such as a count, on many small groups.
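For contrast, a sketch of a high-cardinality grouping on the same sample table; C_CUSTKEY is unique per customer, so the aggregate has to build one group per row:
SELECT C_CUSTKEY, COUNT(C_ACCTBAL)
FROM CUSTOMER
GROUP BY C_CUSTKEY; -- High cardinality: one group per customer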

Caching
There are three storage mechanisms behind the scenes that have caching-like behavior.
In the services layer, we have the Metadata Cache and the
Results Cache.
The Results Cache is the most textbook-like cache in
Snowflake.
It stores the results of queries for reuse by subsequent
queries.
The metadata service, or Metadata Cache, refers to the store of data and statistics on database objects: things like what the structure of a table is or how many rows are in a table.
Snowflake has a high-availability metadata store which maintains metadata object information and statistics.
We transparently make use of this when executing certain queries.
And finally, we have the Local Disk Cache, or the
warehouse cache. This refers to the local SSD storage of
the nodes in a virtual warehouse cluster. These keep a local
version of the raw micro partitions retrieved from the
storage layer used for computing the results of queries.
The Local Disk Cache can be reused by subsequent queries
to the same warehouse.
Metadata Cache
The Metadata Cache is a highly available service residing in
the cloud services layer, which maintains metadata on
object information and statistics.
For each cache type, the exam might use a variety of
names used interchangeably.
The Metadata Cache is sometimes also referred to
as the metadata store, cloud services layer, or just
services layer.
So what exactly does the Metadata Cache store?
Well, it stores important metadata for various entities in
Snowflake.
It will maintain an understanding of what tables we've
created, what databases those are in, what are the micro
partitions that make up those tables, clustering information
and much more.
This enables queries to be executed without the need of a
virtual warehouse.
For example, the row count of each table is stored in the
Metadata Cache.
Therefore, executing something like select count(*)
from table doesn't make use of a virtual warehouse.
System functions, context functions, describe
commands, and show commands also make use of
the Metadata Cache.
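A few examples of statements that are typically answered from the Metadata Cache without a running warehouse (the object names are placeholders):
SELECT COUNT(*) FROM MY_TABLE; -- row count held in table metadata
SELECT CURRENT_USER(), CURRENT_ROLE(); -- context functions
DESCRIBE TABLE MY_TABLE; -- table structure
SHOW TABLES IN DATABASE MY_DATABASE; -- object listing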
Results Cache
The second layer of caching is referred to as the Results Cache, which also sits in the cloud services layer. It's also referred to as the result set cache, the 24-hour cache, or the query result cache.
Look for the keywords result, 24 hour, or query in the exam. This cache stores the results of queries for a non-configurable duration of 24 hours.
Each time a persisted result is reused, the 24 hour
retention period is extended up to a maximum of 31 days,
after which the result is purged from the cache.
The main benefit is avoiding having to regenerate costly
results when nothing has changed.
There are many rules which govern
when the Results Cache will be used.
1. The new query must syntactically match a previously
executed query. Even adding a limit will cause the
result cache not to be used.
2. If the table data contributing to the query result has
changed, the Results Cache won't be used.
3. The same role is used as the previous query.
4. If time context functions are used such as
CURRENT_TIME, the Results Cache will not be used.
By default, result reuse is enabled, but it can be overridden
at the account, user, and session level using the
USE_CACHED_RESULT parameter.
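For example, result reuse can be switched off for the current session only, which is handy when benchmarking query performance:
ALTER SESSION SET USE_CACHED_RESULT = FALSE; -- disable the Results Cache for this session
ALTER SESSION SET USE_CACHED_RESULT = TRUE; -- re-enable it afterwards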
Warehouse Cache
While a virtual warehouse is running, it keeps micro-partition files on its local SSD as queries are processed.
This is referred to in the documentation and exam as the SSD cache, data cache, or raw data cache.
This is because the data isn't stored in an aggregated form; it's the raw micro-partition files themselves.
This enables improved performance for subsequent queries
as they can read from the cache, instead of having to read
from the slower remote blob storage.
And as you increase the size of a virtual warehouse, you
also increase the size of the local cache.
Even for the really large virtual warehouses, the cache is finite in size.
Virtual Warehouses have local SSD storage which
maintains raw table data used for processing a
query.
It is purged when the virtual warehouse is resized,
suspended or dropped.
The larger the virtual warehouse the greater the
local cache.
Can be used partially, retrieving the rest of the data
required for a query from remote storage.
When a query cannot be satisfied by the Metadata Cache,
Results Cache, or Local Disk Cache, it'll retrieve the
required data from the remote disk in the storage layer.
This is the most costly operation.
Materialized Views
A materialized view is a pre-computed and persisted dataset derived from a SELECT query.
When you query a standard view, it's equivalent to running the SQL defined in the view yourself: you have to read the data from the base table and process the results using compute.
Materialized views, on the other hand, actually store the result of the query defined in the view, so the result isn't generated when you query the view; it's already available.
A background maintenance process will periodically
recompute the view query, so the data is current and
consistent with the base table.
The automatic maintenance of a materialized view happens
transparently to the user.
In other words, no user managed virtual warehouse is
required to maintain a materialized view.
However, while the maintenance process is in progress, queries against the view might be slower. This is a trade-off between freshness of results and speed.
If the materialized view doesn't have the latest data, because a periodic refresh hasn't occurred yet, it'll go ahead and retrieve the most up-to-date data from the base table at query runtime.
For this reason, Snowflake doesn't recommend creating materialized views on base tables with high churn, that is, lots of inserts, updates, and deletes.
Maintaining a materialized view on a large table with a lot of changes could consume a lot of credits to keep it up to date.
Materialized views are best used to make readily available the result of a query which is complex and therefore quite computationally costly to run as a standard view.
Materialized views are an Enterprise edition and higher feature.
Materialized views incur costs for both storage and compute resources.
The background maintenance process we just discussed
uses serverless compute resources.
Serverless features like materialized views operate on
snowflake managed compute resources and cloud services,
both of which are charged by the number of compute hours
used.
It costs 10 Snowflake credits per hour for the Snowflake-managed compute and five credits per hour for the cloud services used. And because we store the result of the query defined in the materialized view, this adds to the monthly storage usage for your account.
We can use the MATERIALIZED_VIEW_REFRESH_HISTORY table function in the information schema, or a view of the same name in the account usage views, to track credit consumption.
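A sketch of both options (MV1 is a placeholder view name):
-- Information schema table function: recent refresh history, near zero latency
SELECT * FROM TABLE(INFORMATION_SCHEMA.MATERIALIZED_VIEW_REFRESH_HISTORY(MATERIALIZED_VIEW_NAME => 'MV1'));

-- Account usage view: up to a year of history, with some latency
SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.MATERIALIZED_VIEW_REFRESH_HISTORY;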
A common use case is to put a materialized view on top of
an external table.
an external table is a type of table which references data in
an external location such as an S3 bucket.
Querying an external table is slower than querying a
permanent table,
so creating a materialized view on top of one will increase
the query performance.
Now for some of the limitations of materialized views.
A materialized view can only query a single table: not multiple tables, and not another materialized view, standard view, or UDTF, just one table.
A materialized view cannot include joins in the body of the view, and they're also limited in what expressions they can use.
They can't include UDFs, HAVING, ORDER BY, LIMIT, or window functions.
The creation syntax is the same as for a standard view, but with the addition of the MATERIALIZED keyword.
Suspending a materialized view pauses the updates, and
also makes it inaccessible and resuming does the opposite
of this.
It restarts the background process to keep the materialized
view up to date.
And finally here, the show command will give you a lot of
information about the materialized views in the current
namespace.
The output of this command includes a column called refreshed_on, which references the time the materialized view was last updated.
It also includes a column called behind_by, which gives an indication of how far out of sync a materialized view is with the base table it was created on.

“A Materialized View is a pre-computed & persisted data set derived from a SELECT query.”
MVs are updated via a background process ensuring data is
current and consistent with the base table.
MVs improve query performance by making complex
queries that are commonly executed readily available.
MVs are an enterprise edition and above serverless feature.
MVs use compute resources to perform automatic
background maintenance.
MVs use storage to store query results, adding to the
monthly storage usage for an account.
MATERIALIZED_VIEW_REFRESH_HISTORY
MVs can be created on top of External Tables to improve
their query performance.

MVs are limited in the following ways: they can query only a single table, and cannot include JOINs, UDFs, HAVING, ORDER BY, LIMIT, or window functions.

CREATE OR REPLACE MATERIALIZED VIEW MV1
AS SELECT COL1, COL2 FROM T1;

ALTER MATERIALIZED VIEW MV1 SUSPEND;
ALTER MATERIALIZED VIEW MV1 RESUME;
SHOW MATERIALIZED VIEWS LIKE 'MV1%';
Clustering

Clustering is a way of describing the distribution of a column's values.
These 32 random letters have been distributed into four groups in alphabetical order. The letters are arranged alphabetically, so that between each group there is no overlap: one letter doesn't appear in more than one group.
If I wanted to find the letter D, I could quite easily exclude three of the groups and focus the search on one of the groups.
This is an example of when data is optimally clustered, or distributed among the groups.

And here is an example of poor clustering. The values are not ordered.
So to find all the instances of the letter D using the summarized group ranges, each group would have to be checked.

Table data is partitioned and stored in the order it was loaded. Because of this, data ordered prior to ingestion will be naturally clustered, as it'll be stored in micro-partitions that are next to each other.
Let's take a look at an example.
We have a simple CSV file made up of three columns:
ORDER_ID, PRODUCT_ID, and ORDER_DATE.
Let's say we were to load this data file using the COPY INTO
statement, producing three micro-partitions. Once the file
is loaded into a table, we measure clustering on a per
column basis.
A sequential numerical type such as order ID here which
arrives ordered will be more evenly distributed amongst
micro-partitions. You can see here micro-partition one has
ORDER_ID values one to three, micro-partition two has
ORDER_IDs four to six and so on.
In other words, it's unlikely that an individual ORDER_ID will
appear in multiple micro-partitions. There isn't a lot of
overlap. This is also the case for ORDER_DATE, which will
also be more closely grouped. And therefore, a single date
will not span all the micro-partitions of a table, but just a
subset.
Snowflake maintains clustering metadata for the micro-partitions in all tables.
This includes the total number of micro-partitions that comprise a table.
And for a table column, the number of micro-partitions containing values that overlap with each other. Like we saw with the PRODUCT_ID, there were values which overlapped all micro-partitions.
Snowflake also keeps clustering metadata around the depth of the overlapping micro-partitions. Depth is correlated with the average overlap, but measures how many micro-partitions overlap at any given point in the column's range of values.
Understanding these metrics can help us assess the quality of clustering of a table column.
So then how can we, as users, get our hands on this metadata?
Snowflake has made two system functions available.
Snowflake maintains the following clustering metadata for micro-partitions in a table:
 Total number of micro-partitions
 Number of overlapping micro-partitions
 Depth of overlapping micro-partitions

SELECT SYSTEM$CLUSTERING_INFORMATION('table', '(col1, col3)');
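The second function, SYSTEM$CLUSTERING_DEPTH, returns the average clustering depth for the specified columns; a sketch using the same placeholder names:
SELECT SYSTEM$CLUSTERING_DEPTH('table', '(col1, col3)');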

Automatic Clustering

Over time, natural clustering can degrade, particularly if the table is large and experiences a lot of DML statements, rearranging the micro-partitions so that clustering becomes worse.
We could manually sort data and reinsert it into a table, creating a natural ordering and improving clustering again. But this would be costly and time consuming.
Snowflake supports automating this task by
designating one or more table columns or
expressions as clustering keys for a table.
Clustering keys can also be defined on materialized
views.
Once a clustering key is applied to a table, it's considered
clustered. A background process will then periodically
reorder data aiming to co-locate data into the same micro
partitions.
Clustering is not suitable for all tables, however. It
improves performance of queries that frequently filter or
sort on the cluster keys.
Columns used in joins, WHERE conditions, GROUP BY, and ORDER BY can benefit from clustering, with more benefit generally coming from WHERE conditions and join operations.
If you have a table that is queried infrequently and
not using many of these types of operations, you
might not see a lot of benefit from automatic
clustering.
Tables in the terabyte range or those which are
suffering from increasingly poor performance, are
generally good candidates for specifying clustering
keys.
The true indication of when a clustering key is required is
whether queries are taking unreasonably long to execute,
especially when filtering and sorting on frequently used
columns.
Larger tables benefit from clustering simply because even
if small tables are poorly clustered, their performance is
still good.
Clustering is most effective for tables that are frequently
queried and change infrequently.
The more frequently a table is queried, the more benefit
you'll get from clustering.
However, the more frequently a table changes, the
higher the cost will be to maintain the clustering.
Choosing the correct clustering key is important because it can have a significant effect on performance and cost.
Snowflake recommends a maximum of three or four columns or expressions per key.
Defining more than one column or expression has an impact on how the data is clustered in the micro-partitions.
Adding more than three or four tends to increase costs more than benefits.
We should select columns using common queries which
perform filtering and sorting operations. This is where you
would see most benefit from a clustering key as filtering
and sorting well clustered micro partitions is a lot more
efficient.
We should also consider the cardinality of clustering
keys.
A high cardinality column would be something like a unique
ID or a timestamp down to the millisecond.
Low cardinality would be something like country or gender.
Choosing an extreme of either cardinality is discouraged.
If it's too low, it won't allow for effective pruning, because the values will exist in many different micro-partitions. Finding an F, for example, would require me to go through all three micro-partitions.
A cardinality that is too high will also negatively impact
pruning
because the values are too unique to be grouped into micro
partitions.
If multiple columns are chosen to be the clustering
key, it's suggested that the lowest cardinality is
selected first, followed by the higher cardinality.
Here are some code examples
of how you would apply automatic clustering to a table.

CREATE TABLE T1 (C1 date, c2 string, c3 number) CLUSTER BY (C1, C2);


CREATE TABLE T1 (C1 date, c2 string, c3 number) CLUSTER BY
(MONTH(C1), SUBSTRING(C2,0,10));
ALTER TABLE T1 CLUSTER BY (C1, C3);
ALTER TABLE T2 CLUSTER BY (SUBSTRING(C2,5,10), MONTH(C1));

You can specify a key at the time of table creation. In the first example, the key is made up of two columns.
You can also use expressions in the key definition. In the second example, we'd be ordering our data in the micro-partitions on the month of a date column and a substring of another column.
You can add clustering to a table which already exists with an ALTER statement.
As DML operations (insert, update, delete, merge, and copy) are performed on a clustered table, the data in the table might become less clustered.

In this simplified example, you insert three values into a clustered table, and those values exist in all the pre-existing micro-partitions.
Re-clustering is a background process of reorganizing the data in the micro-partitions to restore the clustering along the columns or expressions specified in a key.
Automatic clustering is a serverless feature and comes at a
cost. Just like the initial clustering, when a clustering key
is specified, re-clustering consumes Snowflake
Managed and Cloud Services Compute.
At the time of recording, this currently runs at two credits
per hour for Snowflake Managed Compute and one credit
per hour for Cloud Services Compute.
Re-clustering also results in storage costs.
As we saw in the above example, each time data is
reclustered, the rows are physically grouped based on the
clustering key for the table.
This results in Snowflake generating new micro partitions
for the table. Adding even a small number of rows to a
table can cause all micro partitions that contain those
values to be recreated.
Clustering is more expensive on large tables with a large
amount of DML operations performed on it,
so it's recommended to go for large tables, which don't
change too much, and the more frequently a table is
queried, the more benefit clustering provides.
However, the more frequently a table changes,
the more expensive it'll be to keep it clustered.
Therefore, clustering is on the whole more cost
effective for tables that are queried frequently and
don't change frequently.
Snowflake supports specifying one or more table
columns/expressions as a clustering key for a table.
Clustering aims to co-locate data of the clustering key in
the same micro-partitions.
Clustering improves performance of queries that frequently
filter or sort on the clustered keys.
Clustering should be reserved for large tables in the multi-
terabyte range.
Snowflake recommends a maximum of 3 or 4 columns (or expressions) per key.
Columns used in common queries which perform filtering
and sorting operations.
As DML operations are performed on a clustered table, the
data in the table might become less clustered.
Reclustering is a background process which transparently
reorganizes data in the micro-partitions by the clustering
key.
Initial clustering and subsequent reclustering operations
consume compute & storage credits.
Clustering is recommended for large tables which do not
frequently change and are frequently queried.
Search Optimization
Snowflake is best suited for analytic workloads, operations
such as aggregations on large amounts of data.
However, there are some scenarios when a user might
need to look up an individual value. On a very large table,
say terabytes in size, this operation can be costly and time
consuming.
The search optimization service is a table level
property aimed at improving the performance of
these types of queries called selective point lookup
queries.
These typically return a single row or a small group of rows.
The search optimization service speeds up equality
searches.
These take two forms.
The first uses the equal sign to find a specific row for a
value, like in this example,
SELECT NAME, ADDRESS FROM USERS
WHERE USER_EMAIL = 'semper.google.edu';

The second uses the IN operator to test whether a column value is a member of an explicit list of values.
SELECT NAME, ADDRESS FROM USERS
WHERE USER_ID IN (4, 5);

It supports searching the following data types: fixed-point numbers (INTEGER and NUMERIC), DATE, TIME, TIMESTAMP, VARCHAR, and BINARY.
The search optimization service is an enterprise
edition and higher feature.
Let's take a look at how this works.
A background process creates and maintains something called a search access path. The search access path records metadata about the entire table to understand where all of the data resides in the underlying micro-partitions.
Selective point lookup queries can use this metadata to find the relevant micro-partitions faster than the usual pruning mechanism.
It's quite simple to implement as it's a table level
property,
we can run the following ALTER TABLE command to add
search optimization to the table,
ALTER TABLE MY_TABLE ADD SEARCH OPTIMIZATION;

And here is the opposite command to remove search optimization from a table.
ALTER TABLE MY_TABLE DROP SEARCH OPTIMIZATION;

And lastly, we can run a SHOW TABLES command to check the status of search optimization and its progress.
SHOW TABLES LIKE 'MY_TABLE';

This shows the percentage of the table that has been optimized so far.
To run these commands, you must have the ownership
privilege on the table. As well as this, you must have the
add search optimization privilege on the schema that
contains the table.
This is a serverless feature, so it has its own unique credit cost.
The access path data structure requires space for each
table on which search optimization is enabled.
This depends on the number of distinct values in the table,
however, it's normally around a quarter of the original table
size.
So if it's a big table with a lot of columns with distinct
values, you could end up paying quite a bit.
Creating and maintaining search optimization also
consumes compute resources.
This increases as there's more DML on the table.
It costs 10 credits per Snowflake-managed compute hour
and five credits per cloud services compute hour.
Search optimization service is a table level property aimed
at improving the performance of selective point lookup
queries.
A background process creates and maintains a search
access path to enable search optimization.
The access path data structure requires space for each
table on which search optimization is enabled. The larger
the table, the larger the access path storage costs.
10 Snowflake credits per Snowflake-managed compute
hour 5 Snowflake credits per Cloud Services compute hour.
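Credit consumption can be tracked in a similar way to other serverless features; a sketch using the SEARCH_OPTIMIZATION_HISTORY table function in the information schema (the table name is a placeholder):
SELECT * FROM TABLE(INFORMATION_SCHEMA.SEARCH_OPTIMIZATION_HISTORY(
DATE_RANGE_START => DATEADD(DAY, -7, CURRENT_DATE()),
TABLE_NAME => 'MY_TABLE'));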
Data Loading and Unloading
Data loading and unloading is an important topic, coming in at about 10 to 15% of questions on the exam. Here we cover stages, data loading methods such as INSERT and COPY INTO <table>, and the serverless loading feature, Snowpipe.
Data Loading Simple Methods

Data Movement

In this section, we learn about various methods of data ingress and egress in Snowflake:
bulk data loading via the COPY INTO <table> statement, continuous data loading with the serverless feature Snowpipe, and data unloading with the COPY INTO <location> command.
Now, the simplest way to get data into a table in Snowflake is surely the INSERT statement. The INSERT statement allows us to append records to a table.
The first method is inserting into a table from a SELECT statement, not selecting from a table, but just creating values in the command itself.
INSERT INTO MY_TABLE SELECT '001', 'John Doughnut', '10/10/1976';

To load specific columns, individual columns can be specified.
INSERT INTO MY_TABLE (ID, NAME) SELECT '001', 'John Doughnut';

Moving on, we can make use of the VALUES keyword.
INSERT INTO MY_TABLE (ID, NAME, DOB) VALUES
('001', 'John Doughnut', '10/10/1976'),
('002', 'Lisa Snowflake', '21/01/1934'),
('003', 'Oggle Berry', '01/01/2001');

We can specify the values of each row, enclosed in brackets. The same can be done to insert multiple rows, each separated by a comma.

Another table can be used to insert rows into a table.
INSERT INTO MY_TABLE SELECT * FROM MY_TABLE_2;

And finally, for insert, let's take a look at OVERWRITE. The OVERWRITE keyword can be added to truncate the table and insert either from values or a select statement, effectively clearing a table down and repopulating it.
The keyword OVERWRITE will truncate a table before new values are inserted into it.
INSERT OVERWRITE INTO MY_TABLE SELECT * FROM MY_TABLE_2;
So let's take a look at uploading files via the UI. We'll go into the databases tab, switch our session role to SYSADMIN, and click on our film database and then our film table.
From here, select the default warehouse. This will do the processing task required to load the data into the storage layer.
Select the CSV file included in the course files from your local file system.
For now, what we need to know is that a file format object contains information about what data is expected to be loaded, for example, whether it's CSV or JSON.
This helps Snowflake parse the data files.
In this wizard, Snowflake has included some handy defaults for the various file formats we can load.
We'll need Delimited Files (CSV).
Once we have this selected, under the heading Edit Schema, we have the option to change the mapping of the columns in the CSV file to the table, and it should confirm we have now loaded five rows. Bear in mind, there is currently a 250 megabyte limit on each file you upload via Snowsight.

Stages
Stages are temporary storage locations for data files used
in the data loading and unloading process.

Stages form a crucial step in the data movement process. They're an area to temporarily hold raw data files, either before being loaded into a table or after being unloaded from a table.
Stages come in two broad groups: internal and external.

Internal stages are stages that Snowflake provisions and manages, with data files being stored internally within Snowflake.
Much like table storage, they also use the underlying blob storage of the cloud platform your account is deployed into.
External stages are cloud storage areas that are managed by users outside of Snowflake.
Think of an S3 bucket, for example. You can create your own S3 bucket in your own AWS account, give Snowflake permissions to read from it, and use that in the process of loading data files into Snowflake tables.
Internal stages are further subdivided into user stages, table stages and named stages.
Every user and table comes with a stage automatically.
A user stage is only accessible to the user, so only they can stage data files to be loaded.
Likewise for the table stage: it's an area where multiple users can stage data files for loading, but those files can only be loaded into that table.

The named internal stage is not automatically allocated by Snowflake.
This is something we as users can create with a CREATE STAGE command. This is the most flexible option, as potentially any user can stage files to be uploaded to any table, given the correct privileges.
External stage objects only come in one form,
named stages, basically meaning they can only be
created by users.
User stage.
By default, each user is automatically allocated an internal stage which is only accessible to that user: the user stage.
You don't need to execute a create statement to create a user stage; it's automatically allocated when a user is created.
To get a file into an internal stage from your local machine, Snowflake provides the PUT command.
PUT file://<path_to_file> @~;
User stages cannot be altered or dropped.
A user stage isn't appropriate if multiple users need access to a stage.
Table stage.
Each table in Snowflake is automatically allocated an internal stage, created when the table is created.
To get a file into a table stage, Snowflake provides the PUT command.
PUT file://<path_to_file> @%MY_TABLE;
Table stages cannot be altered or dropped.
To access a table stage, a user must have ownership privileges on the table.
Internal named stages.
Internal named stages are database objects which you can create and name yourself: a user-created database object.
These are generally more flexible in the options you can set and fit a greater number of use cases, as they're not restricted to one user or table.
We can use the PUT command to get a file into a named internal stage.
PUT file://<path_to_file> @MY_STAGE;
Named stages are securable objects, meaning privileges can be granted to roles to manage access to an internal named stage.
Uncompressed files are automatically compressed
using GZIP when loaded into an internal stage,
unless explicitly set not to.
Supports copy transformations and applying file
formats.
Stage files are automatically encrypted using 128 bit keys.
External stages.
These reference data files stored in a location outside of Snowflake, in a cloud storage service which we manage ourselves. These could be Amazon S3 buckets, Google Cloud Storage buckets or Microsoft Azure containers.
External named stages are user-created database objects, created with a CREATE STAGE DDL.
Files are uploaded using the cloud utilities of the cloud provider. For example, to upload a file to an S3 bucket, you might use the AWS console UI or the AWS CLI, and then list the staged files from Snowflake:
ls @MY_STAGE;
Like named internal stages, things called copy options can be set on stages, which dictate behavior around loading data.
For example, controlling the behavior when errors are encountered during loading, or whether we should delete files which have been successfully loaded from a bucket.
Copy options such as ON_ERROR and PURGE can be set on stages.
Let's take a look at a create statement for an external stage.
The DDL is comprised of an identifier for the stage and the cloud storage service URL, in this case a URL for an S3 bucket.
CREATE STAGE MY_EXT_STAGE
URL='S3://MY_BUCKET/PATH/'
CREDENTIALS=(AWS_KEY_ID='' AWS_SECRET_KEY='')
ENCRYPTION=(MASTER_KEY='');

And if the bucket is protected, we'll need to specify access credentials for the bucket so that Snowflake has permissions to read its contents.
You can hard-code credentials like this,
or you can attach to the stage an object called a
storage integration.
CREATE STORAGE INTEGRATION MY_INT
TYPE=EXTERNAL_STAGE
STORAGE_PROVIDER=S3
STORAGE_AWS_ROLE_ARN='ARN:AWS:IAM::98765:ROLE/MY_ROLE'
ENABLED=TRUE
STORAGE_ALLOWED_LOCATIONS=('S3://MY_BUCKET/PATH/');

A storage integration is a reusable and securable Snowflake


object which can be applied across stages and is
recommended to avoid having to explicitly set sensitive
information for each stage definition.
A storage integration object encapsulates all the required
information to authenticate and gain access to a private
external storage service to perform actions like reading,
writing and deleting, typically using a generated identity
entity such as a role in AWS.
It can be applied across stages and is recommended to
avoid having to explicitly set sensitive information for each
stage definition.
The code here shows us creating a storage integration with
authentication information configured.
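A sketch of attaching that integration to a stage instead of hard-coding credentials:
CREATE OR REPLACE STAGE MY_EXT_STAGE
URL='S3://MY_BUCKET/PATH/'
STORAGE_INTEGRATION=MY_INT;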

So how do we interact with stages within Snowflake? Let's run through three helper commands.
The LIST command.
This lists the contents of a stage and includes in the output the path of a staged file, the size of a staged file, the hash of the staged file and the timestamp the file was last updated.
We can also append a path to the end of the LIST command to return files within a specific directory of a stage.
LIST/ls @MY_STAGE;
LIST/ls @~;
LIST/ls @%MY_TABLE;
List the contents of a stage:
• Path of staged file
• Size of staged file
• MD5 Hash of staged file
• Last updated timestamp

All stages except user stages can include a database and schema global pointer if you'd like to reference them outside of the schema or database they were created in.
The SELECT command.
We can query the contents of staged files directly using standard SQL, for both internal and external stages.
SELECT
metadata$filename,
metadata$file_row_number,
$1, $2
FROM @MY_STAGE
(FILE_FORMAT => 'MY_FORMAT');

Snowflake exposes metadata columns such as file name and row number for staged files, which you can include in the query output.
Query the contents of staged files directly using standard SQL for both internal and external stages.
Useful for inspecting files prior to data loading/unloading.
Reference metadata columns such as filename and row number for a staged file.
Remove command.
This command removes files from either an external or
internal stage.
Like with the list command, we can optionally specify a
path for specific folders or files.
And again, we can optionally include a database and
schema global pointer.
REMOVE/rm @MY_STAGE;
REMOVE/rm @~;
REMOVE/rm @%MY_TABLE;

Remove files from either an external or internal stage.
Can optionally specify a path for specific folders or files.
Named and internal table stages can optionally include a database and schema global pointer.

Put command.
A user can execute a put command to upload data
files from a local directory on a client machine to
any of the three types of internal stages.
PUT cannot be executed from within worksheets.
Duplicate files uploaded to a stage via PUT are
ignored.
Uploaded files are automatically encrypted with a
128-bit key with optional support for a 256-bit key.
The top most command you see here shows us loading
from a Unix-type system, uploading a CSV file to a named
internal stage,
macOS / Linux
PUT FILE:///FOLDER/MY_DATA.CSV @MY_INT_STAGE;
PUT FILE:///FOLDER/MY_DATA.CSV @~;
PUT FILE:///FOLDER/MY_DATA.CSV @%MY_TABLE;
Windows
PUT FILE://c:\\FOLDER\\MY_DATA.CSV @MY_INT_STAGE;

Bulk Loading with the COPY INTO Command

Now if you have a file which is greater than the size threshold of the UI, or if you'd just like your upload process to be programmatic, you can use the COPY INTO <table> command.
This is the most important loading method to understand for the exam.
The COPY INTO <table> command copies the contents of an internal or external stage, or an external location, directly into a table.

This means copying the staged data into the storage layer, storing it in the columnar format, and querying it via the logical structure of a table.
Snowflake natively supports loading a variety of data
formats. Delimited files such as CSV, as well as semi-
structured file formats like JSON, Avro, ORC, Parquet and
XML.
The following file formats can be loaded into Snowflake:
• Delimited files (CSV, TSV, etc.)
• JSON
• Avro
• ORC
• Parquet
• XML
The different types of files are parsed using file
format options and file format objects.
The COPY INTO <table> statement requires a user
created virtual warehouse to execute.
Load history is stored in the metadata of the target table
for 64 days.
This metadata is used to deduplicate files as they're
loaded. If a file has the same name and contents,it will not
be loaded again.
Let's walk through some code examples
COPY INTO MY_TABLE FROM @MY_INT_STAGE;
Copy all the contents of a stage into a table.
COPY INTO MY_TABLE FROM @MY_INT_STAGE/folder1;
COPY INTO MY_TABLE FROM @MY_INT_STAGE/folder1/file1.csv;

Copy the contents of a stage from a specific folder/file path.

COPY INTO MY_TABLE FROM @MY_INT_STAGE
FILES=('folder1/file1.csv', 'folder2/file2.csv');

COPY INTO <table> has an option to provide a list of one or more files to copy.

COPY INTO MY_TABLE FROM @MY_INT_STAGE
PATTERN='people/.*[.]csv';

COPY INTO <table> has an option to provide a regular expression to select the files to load.
To recap: the first statement will attempt to copy all the contents of the stage MY_INT_STAGE into the table MY_TABLE.
The path of a folder or a specific file can be provided. In the case of a folder, it will attempt to load everything in that folder.
In the third example, we're using the FILES option to provide a list of one or more file names, separated by commas, to upload from the stage.
There's also an option to provide a regular expression pattern, which will select the files to load from the stage.

COPY INTO <table> Load Transformations

Snowflake also allows us to perform some simple transformations on data as it's loaded into a table.
Using this, we can make the data in the raw input files conform to the structure of the target table.

Load transformations allow the user to perform:
• Column reordering.
• Column omission (deciding not to bring in every column available in the underlying file).
• Casting (casting values to other data types, for example one column to DOUBLE and another to TIMESTAMP).
• Truncating text strings that exceed the target column length (using the ENFORCE_LENGTH or TRUNCATECOLUMNS options).
COPY INTO MY_TABLE
FROM ( SELECT TO_DOUBLE(T.$1),
T.$2,
T.$3,
TO_TIMESTAMP(T.$4)
FROM @MY_INT_STAGE T);

Let's take a closer look at this query. Instead of referencing a stage after the FROM keyword, we have a subquery. The subquery selects from a stage.
We specify a set of fields, separated by commas, to load from the staged data files.
The dollar sign and column numbers you see here are relevant for files of type CSV.
In the semi-structured section, we'll take a look at how we access files like JSON. The only requirement here is that the number of columns in the select query must match the number of columns in the target table.

COPY from External Stage/Location

Files are loaded from external stages in much the same way as internal stages. Here, all the external security and authentication settings required to execute this COPY INTO <table> command are encapsulated in the integration object which has been applied to the stage.
COPY INTO MY_TABLE FROM @MY_EXTERNAL_STAGE;
Some data transfer billing charges may apply when loading data from files in a cloud storage service in a different region or cloud platform from your Snowflake account.
There's also a method to load directly from S3 using the COPY INTO statement, bypassing the need for a stage.
COPY INTO MY_TABLE
FROM S3://MY_BUCKET/
STORAGE_INTEGRATION=MY_INTEGRATION
ENCRYPTION=(MASTER_KEY='');

Snowflake recommends encapsulating cloud storage locations in an external stage.

Copy Options

These are properties which can be set to alter the behavior of the COPY INTO command.
Here is the list of copy options available.
1. The first to look at is ON_ERROR.
This specifies the type of error handling for the load
operation.
The default for bulk loading such as we've shown so far is
'ABORT_STATEMENT'. - This will abort the load operation
if any error is found in a data file.
We also have CONTINUE. - Setting the option to this value
will keep loading a file if any errors are found.
The SKIP_FILE value comes in three flavors.
Just SKIP_FILE on its own will skip a particular file if
multiple files are being uploaded if an error is detected.
SKIP_FILE_<num> allows you to set a numerical upper
limit on how many failures a file can tolerate before it's
skipped.
SKIP_FILE_<num> % - It allows you to set a percentage
upper limit on how many failures a file can tolerate. For
example, if 10% of rows in a file have errors, skip that file
and do not load it.
2. SIZE_LIMIT - Number that specifies the maximum
size of data loaded by a COPY command (default null
(no size limit))
3. The next option to mention is PURGE. - Boolean that
specifies whether to remove the data files from the
stage automatically after the data is loaded
successfully. (default FALSE)
4. RETURN_FAILED_ONLY - Boolean that specifies
whether to return only files that have failed to load in
the statement result. (default FALSE)
5. MATCH_BY_COLUMN_NAME - String that specifies
whether to load semi-structured data into columns in
the target table that match corresponding columns
represented in the data.
6. ENFORCE_LENGTH - Boolean that specifies whether to
truncate text strings that exceed the target column
length. (default True)
7. TRUNCATECOLUMNS - Boolean that specifies whether
to truncate text strings that exceed the target column
length. (default FALSE)
8. FORCE - Boolean that specifies to load all files,
regardless of whether they’ve been loaded previously
and have not changed since they were loaded.
(default FALSE).
9. LOAD_UNCERTAIN_FILES - Boolean that specifies to
load files for which the load status is unknown. The
COPY command skips these files by default.
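As a sketch, several of these copy options can be combined on a single statement:
COPY INTO MY_TABLE FROM @MY_INT_STAGE
ON_ERROR = SKIP_FILE_10 -- skip a file once 10 error rows are seen
PURGE = TRUE -- delete files from the stage after a successful load
FORCE = FALSE; -- respect the 64-day load history and skip already-loaded files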

COPY INTO <table> Output

When you execute a COPY INTO <table> command, it returns the output we see here.
FILE (TEXT): Name of source file and relative path to the file.
STATUS (TEXT): Status: loaded, load failed or partially loaded.
ROWS_PARSED (NUMBER): Number of rows parsed from the source file.
ROWS_LOADED (NUMBER): Number of rows loaded from the source file.
ERROR_LIMIT (NUMBER): If the number of errors reaches this limit, then abort.
ERRORS_SEEN (NUMBER): Number of error rows in the source file.
FIRST_ERROR (TEXT): First error of the source file.
FIRST_ERROR_LINE (NUMBER): Line number of the first error.
FIRST_ERROR_CHARACTER (NUMBER): Position of the first error character.
FIRST_ERROR_COLUMN_NAME (TEXT): Column name of the first error.

It'll break down, for each file, the rows loaded and various metadata concerning errors encountered during the load.
This can help in validating the output of a load operation and troubleshooting any issues.
COPY INTO <table> Validation

Let's look at validation.
When testing your COPY INTO statement, Snowflake provides a couple of methods to validate an execution.
1. The first is validation mode.
Validation mode is an optional parameter for the COPY INTO <table> statement that allows you to perform a dry run of the load process, so no files are actually loaded.
It'll test the files for errors and return a result based on one of three values.
RETURN_N_ROWS validates a specified number of rows; at the first error, the copy statement will fail.
RETURN_ERRORS returns all errors (parsing, conversion, etc.) across all files specified in the copy statement.
RETURN_ALL_ERRORS returns all errors across all files specified in the copy statement, including files with errors that were partially loaded during an earlier load because the ON_ERROR copy option was set to CONTINUE during the load.
COPY INTO MY_TABLE
FROM @MY_INT_STAGE
VALIDATION_MODE = 'RETURN_ERRORS';

Validation mode cannot be used with load transformations.
2. The other method is the VALIDATE table function.
VALIDATE is a table function to view all errors encountered during a previous COPY INTO execution. VALIDATE is like validation mode in its functionality; however, it's intended to be used against a past execution of the COPY INTO <table> command.
VALIDATE accepts the job id of a previous query, or the last load operation executed.

SELECT * FROM
TABLE(VALIDATE(MY_TABLE, JOB_ID=>'5415FA1E-59C9-4DDA-B652-533DE02FDCF1'));
File Formats

How does Snowflake know how to parse all the potential file types it natively supports during the loading process?
For example, how does it identify column delimiters or newline characters?
We need some mechanism to tell Snowflake what types of
files we've stored in a stage and what its properties are.
One way of achieving this is setting what is called
file_format options
directly on an internal stage or on the COPY INTO
statement.
File format options can be set on a named stage or
COPY INTO statement.
Explicitly declared file format options can all be
rolled up into independent File Format Snowflake
objects.
File Formats can be applied to both named stages
and COPY INTO statements. If set on both COPY
INTO will take precedence.
CREATE STAGE MY_STAGE
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

CREATE FILE FORMAT MY_CSV_FF
TYPE = 'CSV'
SKIP_HEADER = 1;

CREATE OR REPLACE STAGE MY_STAGE
FILE_FORMAT = MY_CSV_FF;

In the example code here, we can see we're describing the type of file in this stage. In this case, it's storing CSV files and they include a header, so let's instruct Snowflake to skip that when we load.
An alternative and recommended method is to bundle all the information required to parse and make sense of files residing in the stage into a separate object called a file format.
File formats can be set on both named stages and COPY
INTO statements.

In the File Format object, the file format you're expecting to load is set via the 'type' property with one of the following values: CSV, JSON, AVRO, ORC, PARQUET or XML.
Each type has its own set of properties related to parsing that specific file format.
On the right is a screen capture of a DESCRIBE FILE FORMAT command (a sketch of the command is shown after the notes below).
The output of this command shows the default properties for a file_format object of type CSV.
I won't go through them all, but let's look at a couple of key properties that are good to know. As we've already seen briefly, we have skip_header.
Each 'type' has its own set of properties related to parsing that specific file format.
If a File Format object or options are not provided to
either the stage or COPY statement, the default
behaviour will be to try and interpret the contents of
a stage as a CSV with UTF-8 encoding.
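As a hedged sketch of the command mentioned above (the file format name MY_CSV_FF is an assumed placeholder), you can inspect a file format's properties and their defaults like this:
-- Shows every property of the file format and its current/default value
DESCRIBE FILE FORMAT MY_CSV_FF;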
Snowpipe and Loading Best Practices

Snowpipe is a serverless, continuous data ingestion service.
It's used to automatically load data files in near real time
into a table as soon as they're available in a stage.
For example, you upload a file to an S3 external stage via
AWS console.
Snowflake will be listening for a notification of this event
and will then spin up some compute to copy that new data
into a table.
The only action you perform as a user is to upload the
initial data file.
That's how it works at a high level.
Smaller batches of data are being generated at a much quicker pace as streaming systems like Kafka, Beam and Kinesis become more popular, which is exactly the pattern Snowpipe is designed for.

This functionality is enabled with an object called a pipe.
Here is a DDL for the pipe object.
CREATE PIPE MY_PIPE
AUTO_INGEST=TRUE
AS
COPY INTO MY_TABLE
FROM @MY_STAGE
FILE_FORMAT = (TYPE = 'CSV');
A pipe defines a COPY statement similar to what you would manually run, specifying a stage and a target table. This is what will execute on your behalf.
SnowPipe works in two modes, which are configured
by setting the auto ingest option to either true or
false.
These specify two different methods for detecting when a
new file has been uploaded to a stage.
If set to true, you'll be using SnowPipe, this method
involves configuring cloud blob storage like an S3 bucket to
send Snowflake a notification telling it that a new file has
been uploaded.
This will act as a trigger for Snowflake to go ahead and
execute the copy into statement defined in the pipe. This
method only works with external stages.
If, on the other hand, you set auto_ingest to false, you're telling Snowflake that you'll notify it yourself when a new file is uploaded, via a call to a Snowflake REST endpoint.
This method works both with internal and external
stages.

Automating SnowPipe
Snowpipe: Cloud Messaging
Let's look at automating Snowpipe using cloud messaging, with AWS as an example.
So firstly, we upload a file to an S3 bucket using the cloud
providers utilities. Via event notifications, the S3 bucket is
configured to send a per object notification to an SQS
Queue when a file is uploaded.
This SQS Queue is managed by Snowflake and was created
when you created the pipe object.

The put-object notification will hold information on which files to load.
It's via the SQS Queue that the pipe will know to spin up
some compute and perform the copy into table statement
configured in the pipe body.
Within a minute of uploading the file,
we should see data appear in the target table.
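As a hedged sketch (the pipe name MY_PIPE is an assumed placeholder), the ARN of the Snowflake-managed SQS queue that the S3 bucket's event notifications should target is exposed in the notification_channel column of the pipe's metadata:
-- The notification_channel column holds the ARN of the Snowflake-managed SQS queue
SHOW PIPES LIKE 'MY_PIPE';
DESC PIPE MY_PIPE;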

Snowpipe: REST Endpoint

Auto-ingestion via event notifications is probably the most commonly used method to trigger the loading of files.
However, we can also make use of a Snowflake rest
endpoint to notify a pipe that a file has been uploaded to a
stage.
This applies to both internal and external stages. Client applications, for which Java and Python SDKs are provided, can call Snowflake via the public insertFiles REST endpoint, providing a list of file names that were uploaded to the stage along with a reference to a pipe.

From this point, the previous flow is the same.


Let's go through some facts we'll need to know for the
exam.
SnowPipe is intended to load many small files quickly.
Snowflake estimates it will typically take about a minute for a file to be loaded once it receives a notification that a new file is present in the stage.
In contrast with bulk loading, which requires a user created
virtual warehouse to execute,
SnowPipe is a serverless feature. This means it will perform
the copy into operation behind the scenes using Snowflake
managed compute resources, not a user managed virtual
warehouse.
It'll scale the compute to meet the demands of the copy
operation.
SnowPipe load history is stored in the metadata of the pipe
for 14 days. This is used to prevent reloading the same files
and duplicating data in a table.
And finally, pipes can be paused. When a pipe is paused,
event messages received for the pipe enter a limited
retention period, which allows them to be processed when
the pipe is resumed. The period is 14 days by default.
If a pipe is paused for longer than 14 days it is considered stale.
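As a hedged sketch (MY_PIPE is an assumed placeholder), pausing and resuming a pipe, and checking its execution state, can look like this:
-- Pause the pipe; queued event messages enter the limited retention period
ALTER PIPE MY_PIPE SET PIPE_EXECUTION_PAUSED = TRUE;

-- Check the pipe's current execution state
SELECT SYSTEM$PIPE_STATUS('MY_PIPE');

-- Resume the pipe so retained event messages can be processed
ALTER PIPE MY_PIPE SET PIPE_EXECUTION_PAUSED = FALSE;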
You might be asked in the exam to contrast bulk
loading and continuous data loading.
Let's summarize some of the key differences between the
two.

Feature: Authentication
  Bulk loading: Relies on the security options supported by the client for authenticating and initiating a user session.
  Snowpipe: When calling the REST endpoints, requires key pair authentication with a JSON Web Token (JWT). JWTs are signed using a public/private key pair with RSA encryption.
Feature: Load History
  Bulk loading: Stored in the metadata of the target table for 64 days.
  Snowpipe: Stored in the metadata of the pipe for 14 days.
Feature: Compute Resources
  Bulk loading: Requires a user-specified warehouse to execute COPY statements.
  Snowpipe: Uses Snowflake-supplied compute resources.
Feature: Billing
  Bulk loading: Billed for the amount of time each virtual warehouse is active.
  Snowpipe: Snowflake tracks the resource consumption of loads for all pipes in an account, with per-second/per-core granularity, as Snowpipe actively queues and processes data files. In addition to resource consumption, an overhead is included in the utilization costs charged for Snowpipe: 0.06 credits per 1000 files notified or listed via event notifications or REST API calls.

Best Practices: Data Loading

Let's go over best practices for data loading to keep costs down and ensure good performance.
These recommendations apply to both bulk loading and
continuous data loading with SnowPipe.
We should split large files into multiple files around
a hundred to 250 megabytes of compressed data.
This is because a single server in a warehouse can process up to eight files in parallel. So providing one large file will not utilize the whole warehouse's capacity.
Files exceeding a hundred gigabytes are not recommended
to be loaded and should be split.
The reverse is also true - if you have many files which are
very small,for example, messages of 10 kilobytes, it would
be a good idea to bundle these up into larger files to avoid
including too much processing overhead.
Organizing your stage data files by path lets you copy any
fraction of the data into Snowflake with a single command.
This allows you to execute concurrent copy statements that
match a subset of files taking advantage of parallel
operations.
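For example, here's a hedged sketch of loading just one partition of staged data by path and pattern; the stage layout and object names are assumptions for illustration.
-- Load only the June 2024 files under the sales path that match the pattern
COPY INTO MY_TABLE
FROM @MY_STAGE/sales/2024/06/
PATTERN = '.*[.]csv[.]gz'
FILE_FORMAT = MY_CSV_FF;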
Snowflake recommends having separate virtual warehouses for data loading tasks and other tasks. This enables workload isolation so that COPY INTO <table> statements are not being queued or causing other queries to queue.
If you cast your mind back to our section on micro
partitioning, the ordering of data as it's copied in is
important.
Organizing stage data has the benefit of creating a natural
order of data, which after copying into Snowflake,
translates into micro partitions that are more evenly
distributed and therefore have the potential for improved
pruning.
In other words, pre-sorting your data will have a positive impact on pruning potential.
As SnowPipe aims to load a file within one minute, it's
recommended not to upload files at a shorter frequency
than that.
If you were to load very small files more often than once per minute, Snowpipe would not be able to keep up with the processing overhead and the files would back up in the queue, incurring costs.

Data Unloading Overview

Data unloading is getting data out of Snowflake, using the COPY INTO <location> command along with the GET command.
You can unload the contents of a table to any internal stage, external stage, or external location with the COPY INTO <location> command.
Table data can be unloaded to a stage via the COPY INTO
<location> command.
COPY INTO @MY_STAGE
FROM MY_TABLE;
Table - Stage
You can also copy the output of a query.
The file formats you can copy into a stage are a bit more
limited than those supported by the COPY INTO table
command.
COPY INTO <location> supports delimited file formats like CSV and TSV, as well as the semi-structured file formats JSON and Parquet.
If you've copied a file to an internal stage, we can use the
GET command to retrieve it.
This specifies a source stage and a target directory on a
local file system.
Like the PUT command, it's a tool executed on a client
machine, not within Snowflake, so it won't work in the
worksheets tab with the UI.
The GET command is used to download a staged file to the
local file system.
GET @MY_STAGE
file:///folder/files/;

Stage - local

For files in external stages, we can instead use the utilities provided by the cloud service.


By default, results unloaded to a stage using the COPY INTO <location> command are split into multiple files.
The exact number of files generated is impacted by the size of the virtual warehouse executing the COPY INTO command.
By default, the generated files are in CSV file format, compressed using gzip, and are always encoded using UTF-8.
All data files unloaded to a Snowflake internal location are
automatically encrypted using 128 bit keys.
The files are unencrypted when they're downloaded to a
local directory.
Data files unloaded to cloud storage can be encrypted if a
security key is provided to Snowflake.
Examples
Output files can be prefixed by specifying a string at the
end of a stage path.
COPY INTO @MY_STAGE/RESULT/DATA_
FROM (SELECT * FROM T1)
FILE_FORMAT = MY_CSV_FILE_FORMAT;

COPY INTO includes a PARTITION BY copy option to partition unloaded data into a directory structure.
COPY INTO @%T1
FROM T1
PARTITION BY ('DATE=' || TO_VARCHAR(DT))
FILE_FORMAT=MY_CSV_FILE_FORMAT;

COPY INTO can copy table records directly to an external cloud provider's blob storage.
COPY INTO 'S3://MYBUCKET/UNLOAD/'
FROM T1
STORAGE_INTEGRATION = MY_INT
FILE_FORMAT=MY_CSV_FILE_FORMAT;

The first snippet of code shows an example: the files generated from executing this command can be prefixed with a string by including it at the end of the stage path.
COPY INTO <location> Copy Options
We have a number of copy options to play around with. Let's review four of the most important ones we should know.
1. OVERWRITE (default FALSE) - Boolean that specifies whether the COPY command overwrites existing files with matching names, if any, in the location where files are stored.
2. SINGLE (default FALSE) - Boolean that specifies whether to generate a single file or multiple files.
3. MAX_FILE_SIZE (default 16777216, i.e. files of about 16 megabytes compressed) - Number (> 0) that specifies the upper size limit (in bytes) of each file to be generated in parallel per thread.
4. INCLUDE_QUERY_ID (default FALSE) - Boolean that specifies whether to uniquely identify unloaded files by including a universally unique identifier (UUID) in the filenames of unloaded data files.
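Here's a hedged sketch combining a couple of these unload options; the stage and table names are assumed placeholders.
-- Unload to multiple gzip-compressed CSV files of roughly 50 MB each,
-- with a UUID appended to each filename
COPY INTO @MY_STAGE/EXPORT/DATA_
FROM MY_TABLE
FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'GZIP')
MAX_FILE_SIZE = 52428800
INCLUDE_QUERY_ID = TRUE;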

GET command.
The GET command is essentially the reverse of PUT, you
specify a source stage in a target local directory to
download the file to.
Typically, this command is executed after using the COPY INTO <location> command.
GET cannot be used for external stages.
You would access those via their own utilities like the AWS
console or CLI.
GET can only be used with internal stages.
The GET command cannot be executed from within a
worksheet in the UI.
Downloaded files are automatically decrypted when using
GET.
There are two parameters to point out that you can use
with GET,
First the parallel parameter specifies the number of
threads to use for downloading the files.
Increasing the number of threads can improve performance
when downloading large files, the default is 10 threads.
The second is PATTERN. This specifies a regular expression pattern to select files to download from the stage.
If no pattern option is specified, the GET command will try
and GET all the files in the stage.
GET @MY_STAGE
FILE:///TMP/DATA/ PARALLEL=99;
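And a hedged sketch using the PATTERN parameter instead; the pattern itself is an assumed example.
-- Download only the gzip-compressed CSV files from the stage
GET @MY_STAGE file:///tmp/data/ PATTERN = '.*[.]csv[.]gz';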

On the classic UI there is a limit of a hundred megabytes, whereas with Snowsight there's no hard limit.

Semi-structured Overview
Semi-structured data
Data formats such as JSON and XML are often categorized as semi-structured if they share some typical features: they have a flexible schema, so the number of fields can vary from one entity to the next.
Historically, semi-structured data has been difficult to
accommodate in data analysis systems, particularly data
warehouses.
Snowflake has extended the standard SQL language to
include purpose-built, semi-structured data types. These
allow us to store semi-structured data inside a relational
table.
Semi-structured data types can hold complex and variable
structures, like arrays and objects in a table column.
Snowflake have also extended SQL to include semi-
structured functions and notation for accessing data stored
in those types.
semi-structured data types
Beginning with array, then, array is its own data type. It's
a collection that can hold zero or more elements of data.
Each element in an array can be a different data type, and
is accessed by its position in the array.
Just like string, numerical, and date data types, the array
data type can be used to specify a type of a column in a
table.
Here, we have a very simple table composed of two
columns,

CREATE TABLE MY_ARRAY_TABLE
( NAME VARCHAR,
  HOBBIES ARRAY
);

INSERT INTO MY_ARRAY_TABLE
SELECT 'Alina Nowak', ARRAY_CONSTRUCT('Writing', 'Tennis', 'Baking');
The second semi-structured data type is object.
This is analogous to JSON object, a collection of key value
pairs.
CREATE TABLE MY_OBJECT_TABLE (
NAME VARCHAR,
ADDRESS OBJECT
);
INSERT INTO MY_OBJECT_TABLE
SELECT 'Alina Nowak', OBJECT_CONSTRUCT('postcode', 'TY5
7NN', 'first_line', '67 Southway Road');

From the output, we can see we've stored in one column an even more complex data structure comprised of keys and values.
The final data type can be a bit harder to grasp. It's called a variant.
The basic idea with the variant is that it can hold any other
data type, objects, arrays, numbers, anything.
Objects and arrays are really just restrictions of the variant
type.
For example, you could load an entire JSON or Parquet
document into a variant column,or split its elements up
into separate rows during loading.
Like any other data type, we can set a column in a table as
type VARIANT. We can then insert any value into that
column.
• VARIANT is the universal semi-structured data type of Snowflake for loading data in semi-structured data formats.
• VARIANT is used to represent arbitrary data structures.
• Snowflake stores the VARIANT type internally in an
efficient compressed columnar binary representation.
• Snowflake extracts as much of the data as possible to a
columnar form, based on certain rules.
• VARIANT data type can hold up to 16MB compressed data
per row.
• A VARIANT column can contain SQL NULLs as well as VARIANT NULLs, which are stored as a string containing the word "null".
This is why you might see it referred to as a universal semi-
structured data type in the documentation.

CREATE TABLE MY_VARIANT_TABLE (
  NAME VARIANT,
  ADDRESS VARIANT,
  HOBBIES VARIANT
);

INSERT INTO MY_VARIANT_TABLE
SELECT 'Alina Nowak'::VARIANT,
       OBJECT_CONSTRUCT('postcode', 'TY5 7NN', 'first_line', '67 Southway Road'),
       ARRAY_CONSTRUCT('Writing', 'Tennis', 'Baking');

However, you'll have to explicitly cast values whose types aren't OBJECT or ARRAY.
As the output shows, we're storing various types of data in
the variant columns.
Semi-structured Data Formats
Snowflake have taken a few of the most popular semi-
structured file formats and figured out a way to load these
into their storage, intelligently extracting columns from
their complex structures and transforming them into
Snowflake's column and file format.
JSON
Plain-text data-interchange format based on a subset of the JavaScript programming language. (Load and Unload)
AVRO
Binary row-based storage format originally developed for use with Apache Hadoop. (Load)
ORC
Highly efficient binary format used to store Hive data. (Load)
PARQUET
Binary format designed for projects in the Hadoop ecosystem. (Load and Unload)
XML
Consists primarily of tags <> and elements. (Load)
Loading and Unloading Semi-structured Data
The data loading process is broadly the same for
structured and semi-structured data. Data files are
uploaded to a stage and then a copy into command is
executed either by the user or via Snowpipe to move those
stage files into table storage.
When loading, files in any of the semi-structured data formats undergo a very similar process to structured data like a CSV.
Repeating attributes, like keys in a JSON file, are automatically identified and extracted into a columnar format.
However, there are certain rules which can prevent this
from happening.
For example, when the value for a repeating key has different data types between elements, or when elements have a string value of "null".
Despite these limitations, query performance is comparable between semi-structured data types like VARIANT and standard data types.
As well as extracting columns into Snowflake's columnar file format, loading also performs the familiar process of splitting input data into micro-partitions.
It also performs compression, encryption and gathering
statistics.
And how about file format objects?
Like structured data, the file format type must match the files in the stage so that Snowflake knows how to parse them.
As we've covered, each of the supported semi-structured
data formats are uniquely structured.
Some are binary and some are plain text.
Because of this, each file format object for each file format type, say JSON or ORC, has its own set of options, just like a CSV file format.
We can apply a file format either to the stage, the copy into
table command or the table itself, however it's typically
found on the stage or the copy into table command.

Let's take a quick look at a couple of example properties of a JSON file format.

JSON File Format options

DATE_FORMAT - Used only for loading JSON data into separate columns. Defines the format of date string values in the data files.
TIME_FORMAT - Used only for loading JSON data into separate columns. Defines the format of time string values in the data files.
COMPRESSION - Specifies the compression algorithm an already-compressed data file was compressed with, so that it can be decompressed during loading. Supported algorithms: GZIP, BZ2, BROTLI, ZSTD, DEFLATE, RAW_DEFLATE, NONE. If files are compressed with BROTLI, AUTO cannot be used.
ALLOW_DUPLICATE - Only used for loading. If TRUE, allows duplicate object field names (only the last one will be preserved).
STRIP_OUTER_ARRAY - Only used for loading. If TRUE, the JSON parser will remove the outer brackets. Strip outer array is quite an important option, as it's a very convenient way to bypass the 16 megabyte limit on a VARIANT column during data loading.
STRIP_NULL_VALUES - Only used for loading. If TRUE, the JSON parser will remove object fields or array elements containing NULL.
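Since STRIP_OUTER_ARRAY is highlighted above, here's a hedged sketch of a JSON file format that uses it; the object name is an assumed placeholder.
-- Remove the outer [ ] so each element of the array becomes its own row,
-- avoiding the 16 MB limit on a single VARIANT value
CREATE FILE FORMAT MY_JSON_FF
  TYPE = 'JSON'
  STRIP_OUTER_ARRAY = TRUE;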

There are three main approaches to loading semi-structured data files.
The first is an ELT-style approach, which stands for extract, load, transform. So we're loading before transforming any of the data.
CREATE TABLE MY_TABLE
( V VARIANT );

COPY INTO MY_TABLE
FROM @MY_STAGE/FILE1.JSON
FILE_FORMAT = FF_JSON;

Via the COPY INTO <table> command, you can load an entire semi-structured data file directly
into a single variant column. This really shows the power of
the variant data type.
It allows schemaless storage of hierarchical data in a table.
The idea here is that you load the file and later down the
line you can query it using semi-structured SQL extensions,
or alternatively, pick out columns from the variant column
and load those into a new table.
ELT is a good option when the structure of the data, or the operations to be performed on it, are not known upfront.
According to Snowflake, tables used to store semi-structured data typically consist of a single VARIANT column.
The next option is an extract, transform load
approach.
It's quite similar to how we saw the extraction of columns
from a CSV file, but instead using semi-structured SQL
extensions.
The notation is slightly different between semi-structured data formats due to how they're organized internally.
In this example, we're extracting some elements from a JSON file.
CREATE TABLE MY_TABLE
(
NAME STRING,
AGE NUMBER,
DOB DATE
);

COPY INTO MY_TABLE FROM
(
  SELECT $1:name,
         $1:age,
         $1:dob
  FROM @MY_STAGE/FILE1.JSON
)
FILE_FORMAT = FF_JSON;

Values such as dates and timestamps are stored as strings when loaded into a variant column.
By following the ETL approach and separating them into columns, you can store these values as their native data types.
This can improve pruning and decrease storage
consumption.
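Building on that point, here's a hedged variation of the statement above with explicit casts, so the values land in their native types; the names and paths are the same assumed placeholders.
COPY INTO MY_TABLE FROM
(
  SELECT $1:name::STRING,
         $1:age::NUMBER,
         $1:dob::DATE
  FROM @MY_STAGE/FILE1.JSON
)
FILE_FORMAT = FF_JSON;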
The third method is automatic schema detection, which automatically detects what columns are in a semi-structured data file.
The main mechanism is the INFER_SCHEMA table function: it detects the column definitions in a set of staged data files, which can then be passed to USING TEMPLATE in a CREATE TABLE DDL to define the columns.

CREATE TABLE MY_TABLE
USING TEMPLATE
( SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
  FROM TABLE( INFER_SCHEMA
  ( LOCATION => '@MYSTAGE', FILE_FORMAT => 'FF_PARQUET' )
));
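You can also run INFER_SCHEMA on its own to inspect what it detects before creating the table; a hedged sketch using the same assumed stage and file format names:
-- Returns one row per detected column: name, inferred type, nullability, etc.
SELECT *
FROM TABLE( INFER_SCHEMA( LOCATION => '@MYSTAGE', FILE_FORMAT => 'FF_PARQUET' ) );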

MATCH_BY_COLUMN_NAME

COPY INTO MY_TABLE
FROM @MY_STAGE/FILE1.JSON
FILE_FORMAT = (TYPE = 'JSON')
MATCH_BY_COLUMN_NAME = CASE_SENSITIVE;

MATCH_BY_COLUMN_NAME is an option for the COPY INTO command.
Using match by column name, we can match columns in
the semi-structured data file to the corresponding columns
in a table during the loading process.
This is without needing to manually specify each element
like we did in the ETL approach.
However, it can be a bit tricky to use in practice as the
column names in your data file need to match exactly the
column naming convention you're using in Snowflake.
It can either match with case sensitivity or without.
Unloading Semi-structured Data
Unloading your data works much in the same way for semi-
structured data as it does for structured.
You first execute a copy into location command to copy
data from a table or a query into a stage as a file.
This is followed by the get command.
To effectively download that data to your local machine.
You can apply a file format to either the copy into
statement, the stage, or the definition of the table.
However, as previously mentioned, only two semi-structured file formats can be unloaded to a stage: JSON and Parquet.
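As a hedged sketch of unloading a table's rows as JSON (names are assumed placeholders), OBJECT_CONSTRUCT(*) turns each row into a JSON object before the copy:
COPY INTO @MY_STAGE/EXPORT/
FROM (SELECT OBJECT_CONSTRUCT(*) FROM MY_TABLE)
FILE_FORMAT = (TYPE = 'JSON');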
Accessing Semi-structured Data

Let's assume, as users, we've selected the ELT approach to loading semi-structured data and loaded this JSON document into a single variant column.
Once this copy operation is complete and we have data in the variant column, how do we query it and extract its nested elements?
The variant type exposes the semi-structured data in its
native form, meaning when we query that variant column,
we'll see essentially the same structure as how it looked
prior to ingestion.
Behind the scenes, however, Snowflake is storing it in a
columnar form, which makes it more performant for us to
query.
To traverse semi-structured data, there are two
methods we can use.
The first to look at is dot notation.
With dot notation, a colon is placed after the variant column name to extract the first-level element.
Any subsequent elements in the path can be accessed with
a dot.
So using our ingested JSON, we can grab the name of the
employee with the following command,
select variant column, colon, first level element,so the
employee object, and then child object name with a dot.
SELECT src:employee.name FROM EMPLOYEES;

The column name is case-insensitive, like all SQL column names, but the element names are case-sensitive.
This means something like the following would be incorrect, given the first-level element name is employee with a lowercase e:
SELECT SRC:Employee.name FROM EMPLOYEES;
The above query would not return the expected result, because the path does not match.
So the next option is bracket notation, which is how you might access complex data types in a high-level programming language.
The variant column name is followed by square brackets enclosing single-quoted element names.
SELECT SRC['employee']['name'] FROM EMPLOYEES;
If you want to select an individual element from within an array, which might exist inside a variant or an object data type, the following syntax can be used.

For a repeating element: first the variant column, followed by an element, and then square brackets with an integer specifying the position in the array.
SELECT SRC:skills[0] FROM EMPLOYEES;

This will extract just a single element from an array.
Snowflake has also included a function called GET, not to be confused with the GET command used to download staged files. This function allows us to perform the same tasks as dot and bracket notation, as well as accessing repeating elements, just in the form of a function call.

SELECT GET(SRC, 'employee') FROM EMPLOYEE;

SELECT GET(SRC, 'skills')[0] FROM EMPLOYEE;

Casting Semi-structured Data

JSON files can represent the following types of values: string literals, numbers, arrays, objects, Booleans and null values.
Snowflake maps the available JSON data types to SQL data
types.
For example, if we select from our variant column, the
number from the input JSON file is accurately displayed as
a number in the output.
This is true of Boolean and null values as well.
However, an issue arises
when we select JSON string literal values. The return result
we see in Snowflake isn't a string as we might expect, but a
string literal that is a sequence of characters enclosed in
double quotes.
SELECT src:employee.name, src:joined_on, src:employee.age,
src:is_manager, src:base_location FROM EMPLOYEE;

As JSON doesn't have a native date type, this is also true for the joined_on column.

This is particularly important to consider if you're comparing structured and semi-structured columns.
For example, doing something like a join between two
tables.
Say for example, you wanted to join on that JOINED_ON
date column. If we left this how it was, it would try to join
with those double quotes included.
To get around this issue, it's best practice to explicitly cast
the results from a variant column.
Here are three examples of doing this.
The most shorthand version is the double colon
notation.
This will cast a column value to whatever data type is on
the right-hand side of the double colon.
SELECT src:employee.joined_on::DATE FROM EMPLOYEE;

In this case, we're casting a string to a date.


The value can also be wrapped in a TO datatype
function, which effectively achieves the same thing.
TO_< datatype >()
SELECT TO_DATE(src:employee.joined_on) FROM EMPLOYEE;

The AS_<datatype> functions are similar in functionality to the TO_<datatype> functions.
AS_< datatype > ()
SELECT AS_VARCHAR(src:employee.name) FROM EMPLOYEE;
Semi-structured Functions

So, here's a complete list of the functions Snowflake has made available for working with semi-structured data.

JSON and XML Parsing: CHECK_JSON, CHECK_XML (preview feature), JSON_EXTRACT_PATH_TEXT, PARSE_JSON, PARSE_XML (preview feature), STRIP_NULL_VALUE
Array/Object Creation and Manipulation: ARRAY_AGG (see also Aggregate functions), ARRAY_APPEND, ARRAY_CAT, ARRAY_COMPACT, ARRAY_CONSTRUCT, ARRAY_CONSTRUCT_COMPACT, ARRAY_CONTAINS, ARRAY_DISTINCT, ARRAY_EXCEPT, ARRAY_FLATTEN, ARRAY_GENERATE_RANGE, ARRAY_INSERT, ARRAY_INTERSECTION, ARRAY_MAX, ARRAY_MIN, ARRAY_POSITION, ARRAY_PREPEND, ARRAY_REMOVE, ARRAY_REMOVE_AT, ARRAY_REVERSE, ARRAY_SIZE, ARRAY_SLICE, ARRAY_SORT, ARRAY_TO_STRING, ARRAY_UNION_AGG (see also Aggregate functions), ARRAY_UNIQUE_AGG (see also Aggregate functions), ARRAYS_OVERLAP, ARRAYS_TO_OBJECT, ARRAYS_ZIP, OBJECT_AGG (see also Aggregate functions), OBJECT_CONSTRUCT, OBJECT_CONSTRUCT_KEEP_NULL, OBJECT_DELETE, OBJECT_INSERT, OBJECT_PICK
Higher-order Functions: FILTER, REDUCE, TRANSFORM (see also Use lambda functions on data with Snowflake higher-order functions)
Map Creation and Manipulation: MAP_CAT, MAP_CONTAINS_KEY, MAP_DELETE, MAP_INSERT, MAP_KEYS, MAP_PICK, MAP_SIZE
Extraction: FLATTEN (table function), GET, GET_IGNORE_CASE, GET_PATH and the : notation (variations of GET), OBJECT_KEYS (extracts keys from key/value pairs in an OBJECT), XMLGET (preview feature)
Conversion/Casting: AS_<object_type>, AS_ARRAY, AS_BINARY, AS_CHAR / AS_VARCHAR, AS_DATE, AS_DECIMAL / AS_NUMBER, AS_DOUBLE / AS_REAL, AS_INTEGER, AS_OBJECT, AS_TIME, AS_TIMESTAMP_*, STRTOK_TO_ARRAY, TO_ARRAY, TO_JSON, TO_OBJECT, TO_VARIANT, TO_XML
Type Predicates: IS_<object_type>, IS_ARRAY, IS_BOOLEAN, IS_BINARY, IS_CHAR / IS_VARCHAR, IS_DATE / IS_DATE_VALUE, IS_DECIMAL, IS_DOUBLE / IS_REAL, IS_INTEGER, IS_NULL_VALUE, IS_OBJECT, IS_TIME, IS_TIMESTAMP_*, TYPEOF

These functions are specifically for handling semi-structured data.
I've highlighted the ones we'll cover, and what I think is most useful to know for the exam:
PARSE_JSON
ARRAY_CONSTRUCT
OBJECT_CONSTRUCT
FLATTEN
GET
AS_CHAR , AS_VARCHAR
FLATTEN Table Function

The FLATTEN table function takes a nested data structure and explodes, or flattens, it. Which is to say, it produces a row for each item in a nested data structure.
Flatten is a table function that accepts compound values
(VARIANT, OBJECT & ARRAY) and produces a row for each
item.
SELECT VALUE FROM TABLE(FLATTEN(INPUT => (SELECT src:skills FROM EMPLOYEE)));

This can take several input parameters. The only required one is INPUT: the thing you'd like to flatten. The expression must be of data type VARIANT, OBJECT or ARRAY.
Here we're passing in the skills array, accessed from our variant column. We wrap the FLATTEN table function in a function called TABLE.
The output shows our skills array no longer as an array in a
single row, but three rows for each element.

PATH allows us to specify a path inside an object to flatten. So, not flattening the whole thing indiscriminately, but taking a specific value, which in this code example would be an array.

Path
Specify a path inside object to flatten.
SELECT VALUE FROM TABLE(FLATTEN( INPUT =>
PARSE_JSON('{"A":1, "B":[77,88]}'), PATH => 'B'));

Recursive
The recursive option takes a boolean. And will determine if
the flatten function is performed on sub elements, as well
as the top level provided as input.
Flattening is performed for all sub-elements recursively.
SELECT VALUE FROM TABLE(FLATTEN( INPUT =>
PARSE_JSON('{"A":1, "B":[77,88]}'), RECURSIVE => true));

FLATTEN Output
The output of the flatten function contains six columns we
can use.
SELECT * FROM TABLE(FLATTEN(INPUT =>
(ARRAY_CONSTRUCT(1,45,34))));

The first column is SEQ. This contains a unique sequence number associated with the input record. The sequence is not guaranteed to be gap-free or ordered in any particular way.
The KEY column holds the key of the object being flattened; in our case, because we're flattening an array without keys, it is null.
The PATH column shows the path to the element within the data structure being flattened. If it's an array, the INDEX column will hold the position of the element.
The VALUE column is generally what we want to select; it contains the value of the element of the flattened array or object.
Finally, the THIS column contains the element being flattened, which is useful in recursive flattening.

LATERAL FLATTEN
The combination of a lateral join, and the flattened
function, can be a little bit complicated.
So, let's take a look at an example first,
SELECT src:employee.name::varchar,
src:employee._id::varchar, src:skills FROM EMPLOYEE;

This query extracts the name, ID and skills array from our
variant column.
Let's say we have a requirement to normalize this table, by
flattening the values of the skills array.
At the moment, this query violates first normal form, as it contains a composite or multi-value attribute.
Here we have a query that makes use of the LATERAL FLATTEN construct.
SELECT src:employee.name, src:employee._id, f.value FROM
EMPLOYEE e, LATERAL FLATTEN(INPUT => e.src:skills) f;

It will flatten the array, duplicating the other column values from the employee table, to create three rows.
So what's happening here? The job of the lateral join is to take each row of the employee table (it could also be a view or subquery) and pass it to the FLATTEN table function. From the row we passed, we select the thing we want to flatten.
Summary of Snowflake Functions

So let's take a step back and go through each group, taking a look at a few examples of how functions enable us to transform our table data.
In this lecture, we'll skip user-defined functions and external functions as we've already covered these, and focus on the groups of functions: scalar, aggregate, window, table, and system functions.

Scalar functions.
A scalar function is a function that returns one
value per invocation; these are mostly used for
returning one value per row.
They can be further divided into the groups you can see here.
Bitwise expression functions - Perform bitwise operations on expressions.
Conditional expression functions - Manipulate conditional expressions.
Context functions - Provide contextual information about the current environment, session, and object.
Conversion functions - Convert expressions from one data type to another data type.
Data generation functions - Generate random or sequential values.
Date & time functions - Manipulate dates, times, and timestamps.
Differential privacy functions - Work with data protected by differential privacy.
Encryption functions - Perform encryption and decryption on VARCHAR or BINARY values.
File functions - Access files staged in cloud storage.
Geospatial functions - Work with geospatial data.
Hash functions - Hash values to signed 64-bit integers using a deterministic algorithm.
Metadata functions - Retrieve data or metadata about database objects (e.g. tables) or files (e.g. staged files).
Notification functions - Produce JSON-formatted strings that you pass to SYSTEM$SEND_SNOWFLAKE_NOTIFICATION when sending a notification to a queue or email address.
Numeric functions - Perform rounding, truncation, exponent, root, logarithmic, and trigonometric operations on numeric values.
Semi-structured and structured data functions - Work with semi-structured data (JSON, Avro, etc.).
String & binary functions - Manipulate and transform string input.
String functions (regular expressions) - Subset of string functions for performing operations on items that match a regular expression.

SELECT UUID_STRING();

This sample command invokes the UUID_STRING function. It produces a single universally unique identifier, which you can see an example of in the output here.
Aggregate functions
Aggregate functions take multiple rows and produce a
single output. You'll most commonly see them used to
perform mathematical calculations like sum, average and
counting.
An aggregate function takes multiple rows (actually, zero,
one, or more rows) as input and produces a single output.
An aggregate function always returns exactly one
row, even when the input contains zero rows.
Typically, if the input contains zero rows, the output is
NULL. However, an aggregate function could return 0, an
empty string, or some other value when passed zero rows.

List of functions (by sub-category)


General Aggregation: ANY_VALUE, AVG, CORR, COUNT, COUNT_IF, COVAR_POP, COVAR_SAMP, LISTAGG, MAX, MAX_BY, MEDIAN, MIN, MIN_BY, MODE, PERCENTILE_CONT (uses different syntax than the other aggregate functions), PERCENTILE_DISC (uses different syntax than the other aggregate functions), STDDEV / STDDEV_SAMP (aliases), STDDEV_POP, SUM, VAR_POP, VAR_SAMP, VARIANCE_POP (alias for VAR_POP), VARIANCE / VARIANCE_SAMP (aliases for VAR_SAMP)
Bitwise Aggregation: BITAND_AGG, BITOR_AGG, BITXOR_AGG
Boolean Aggregation: BOOLAND_AGG, BOOLOR_AGG, BOOLXOR_AGG
Hash: HASH_AGG
Semi-structured Data Aggregation: ARRAY_AGG, OBJECT_AGG
Linear Regression: REGR_AVGX, REGR_AVGY, REGR_COUNT, REGR_INTERCEPT, REGR_R2, REGR_SLOPE, REGR_SXX, REGR_SXY, REGR_SYY
Statistics and Probability: KURTOSIS, SKEW
Counting Distinct Values: ARRAY_UNION_AGG, ARRAY_UNIQUE_AGG, BITMAP_BIT_POSITION, BITMAP_BUCKET_NUMBER, BITMAP_COUNT, BITMAP_CONSTRUCT_AGG, BITMAP_OR_AGG
Cardinality Estimation (using HyperLogLog): APPROX_COUNT_DISTINCT (alias for HLL), HLL, HLL_ACCUMULATE, HLL_COMBINE, HLL_ESTIMATE (not an aggregate function; uses scalar input from HLL_ACCUMULATE or HLL_COMBINE), HLL_EXPORT, HLL_IMPORT
Similarity Estimation (using MinHash): APPROXIMATE_JACCARD_INDEX (alias for APPROXIMATE_SIMILARITY), APPROXIMATE_SIMILARITY, MINHASH, MINHASH_COMBINE
Frequency Estimation (using Space-Saving): APPROX_TOP_K, APPROX_TOP_K_ACCUMULATE, APPROX_TOP_K_COMBINE, APPROX_TOP_K_ESTIMATE (not an aggregate function; uses scalar input from APPROX_TOP_K_ACCUMULATE or APPROX_TOP_K_COMBINE)
Percentile Estimation (using t-Digest): APPROX_PERCENTILE, APPROX_PERCENTILE_ACCUMULATE, APPROX_PERCENTILE_COMBINE, APPROX_PERCENTILE_ESTIMATE (not an aggregate function; uses scalar input from APPROX_PERCENTILE_ACCUMULATE or APPROX_PERCENTILE_COMBINE)
Aggregation Utilities: GROUPING (not an aggregate function, but can be used in conjunction with aggregate functions to determine the level of aggregation for a row produced by a GROUP BY query), GROUPING_ID (alias for GROUPING)

SELECT MAX(AMOUNT) FROM ACCOUNT;

Window functions.
Window functions are analytic functions that you can use
for various calculations such as running totals, moving
averages, and rankings.
Instead of using all the values of a column, we'll calculate
the max amount per account_ID.

Sub-categories of window functions:
General window: ANY_VALUE, AVG, CONDITIONAL_CHANGE_EVENT, CONDITIONAL_TRUE_EVENT, CORR, COUNT, COUNT_IF, COVAR_POP, COVAR_SAMP, LISTAGG (uses WITHIN GROUP syntax), MAX, MEDIAN, MIN, MODE, PERCENTILE_CONT (uses WITHIN GROUP syntax), PERCENTILE_DISC (uses WITHIN GROUP syntax), RATIO_TO_REPORT, STDDEV / STDDEV_SAMP (aliases), STDDEV_POP, SUM, VAR_POP, VAR_SAMP, VARIANCE_POP (alias for VAR_POP), VARIANCE / VARIANCE_SAMP (aliases for VAR_SAMP)
Ranking: CUME_DIST, DENSE_RANK, FIRST_VALUE, LAG, LAST_VALUE, LEAD, NTH_VALUE, NTILE, PERCENT_RANK (supports only RANGE BETWEEN window frames without explicit offsets), RANK, ROW_NUMBER
Bitwise aggregation: BITAND_AGG, BITOR_AGG, BITXOR_AGG
Boolean aggregation: BOOLAND_AGG, BOOLOR_AGG, BOOLXOR_AGG
Hash: HASH_AGG
Semi-structured data aggregation: ARRAY_AGG, OBJECT_AGG
Counting distinct values: ARRAY_UNION_AGG, ARRAY_UNIQUE_AGG
Linear regression: REGR_AVGX, REGR_AVGY, REGR_COUNT, REGR_INTERCEPT, REGR_R2, REGR_SLOPE, REGR_SXX, REGR_SXY, REGR_SYY
Statistics and probability: KURTOSIS
Cardinality estimation (using HyperLogLog): APPROX_COUNT_DISTINCT (alias for HLL), HLL, HLL_ACCUMULATE, HLL_COMBINE, HLL_ESTIMATE (not an aggregate function; uses scalar input from HLL_ACCUMULATE or HLL_COMBINE), HLL_EXPORT, HLL_IMPORT
Similarity estimation (using MinHash): APPROXIMATE_JACCARD_INDEX (alias for APPROXIMATE_SIMILARITY), APPROXIMATE_SIMILARITY, MINHASH, MINHASH_COMBINE
Frequency estimation (using Space-Saving): APPROX_TOP_K, APPROX_TOP_K_ACCUMULATE, APPROX_TOP_K_COMBINE, APPROX_TOP_K_ESTIMATE (not an aggregate function; uses scalar input from APPROX_TOP_K_ACCUMULATE or APPROX_TOP_K_COMBINE)
Percentile estimation (using t-Digest): APPROX_PERCENTILE, APPROX_PERCENTILE_ACCUMULATE, APPROX_PERCENTILE_COMBINE, APPROX_PERCENTILE_ESTIMATE (not an aggregate function; uses scalar input from APPROX_PERCENTILE_ACCUMULATE or APPROX_PERCENTILE_COMBINE)

We do this by specifying the ACCOUNT_ID column after PARTITION BY.
SELECT ACCOUNT_ID, AMOUNT, MAX(AMOUNT) OVER
(PARTITION BY ACCOUNT_ID) FROM ACCOUNT;

This will calculate the max amount per window (in this case, per ACCOUNT_ID) and apply it to every row of that window.
Table functions
For each input row, a table function can output multiple
rows, although it can technically return zero, one or
multiple rows.
Table functions return a set of rows for each input row. The
returned set can contain zero, one, or more rows. Each row
can contain one or more columns.
Simple examples of table functions
The following are appropriate as table functions:
 A function that accepts an account number and a
date, and returns all charges billed to that account on
that date. (More than one charge might have been
billed on a particular date.)
 A function that accepts a user ID and returns the
database roles assigned to that user. (A user might
have multiple roles, including “sysadmin” and
“useradmin”.)
There are two types of table function: built-in table functions and user-defined table functions.
User-defined table functions are called UDTFs; we'll see those later.

List of system-defined table functions


Snowflake provides the following system-defined (i.e. built-
in) table functions:
Data Loading: INFER_SCHEMA (for more information, refer to Load data into Snowflake), VALIDATE
Data Generation: GENERATOR
Data Conversion: SPLIT_TO_TABLE, STRTOK_SPLIT_TO_TABLE
Differential Privacy: CUMULATIVE_PRIVACY_LOSSES
Object Modeling: GET_OBJECT_REFERENCES
Parameterized Queries: TO_QUERY
Semi-structured Queries: FLATTEN (refer to Querying Semi-structured Data)
Query Results: RESULT_SCAN (can be used to perform SQL operations on the output from another SQL operation, e.g. SHOW)
Query Profile: GET_QUERY_OPERATOR_STATS
Historical & Usage Information (includes the Snowflake Information Schema, Account Usage, and LOCAL):
  User Login: LOGIN_HISTORY, LOGIN_HISTORY_BY_USER
  Queries: QUERY_HISTORY, QUERY_HISTORY_BY_*, QUERY_ACCELERATION_HISTORY (refer to Using the Query Acceleration Service)
  Warehouse & Storage Usage: DATABASE_STORAGE_USAGE_HISTORY, WAREHOUSE_LOAD_HISTORY, WAREHOUSE_METERING_HISTORY, STAGE_STORAGE_USAGE_HISTORY
  Column-level & Row-level Security: POLICY_REFERENCES
  Object Tagging: TAG_REFERENCES (Information Schema table function), TAG_REFERENCES_ALL_COLUMNS (Information Schema table function), TAG_REFERENCES_WITH_LINEAGE (Account Usage table function)
  Account Replication: REPLICATION_GROUP_REFRESH_HISTORY (refer to Introduction to replication and failover across multiple accounts), REPLICATION_GROUP_REFRESH_PROGRESS, REPLICATION_GROUP_REFRESH_PROGRESS_BY_JOB, REPLICATION_GROUP_USAGE_HISTORY
  Alerts: ALERT_HISTORY (refer to Setting up alerts based on data in Snowflake), SERVERLESS_ALERT_HISTORY
  Database Replication: DATABASE_REFRESH_HISTORY (refer to Replicating databases across multiple accounts), DATABASE_REFRESH_PROGRESS, DATABASE_REFRESH_PROGRESS_BY_JOB, DATABASE_REPLICATION_USAGE_HISTORY
  Data Loading & Transfer: COPY_HISTORY, DATA_TRANSFER_HISTORY, PIPE_USAGE_HISTORY, STAGE_DIRECTORY_FILE_REGISTRATION_HISTORY, VALIDATE_PIPE_LOAD
  Data Clustering (within Tables): AUTOMATIC_CLUSTERING_HISTORY (refer to Automatic Clustering)
  Dynamic Tables: DYNAMIC_TABLES (refer to Working with dynamic tables), DYNAMIC_TABLE_GRAPH_HISTORY, DYNAMIC_TABLE_REFRESH_HISTORY
  External Functions: EXTERNAL_FUNCTIONS_HISTORY (refer to Writing external functions)
  External Tables: AUTO_REFRESH_REGISTRATION_HISTORY (refer to Working with external tables), EXTERNAL_TABLE_FILES, EXTERNAL_TABLE_FILE_REGISTRATION_HISTORY
  Iceberg Tables: ICEBERG_TABLE_FILES (Information Schema table function), ICEBERG_TABLE_SNAPSHOT_REFRESH_HISTORY (Information Schema table function)
  Listings: AVAILABLE_LISTING_REFRESH_HISTORY, LISTING_REFRESH_HISTORY
  Materialized Views Maintenance: MATERIALIZED_VIEW_REFRESH_HISTORY (refer to Working with Materialized Views)
  Notifications: NOTIFICATION_HISTORY (refer to Using SYSTEM$SEND_EMAIL to send email notifications)
  SCIM Maintenance: REST_EVENT_HISTORY (refer to Auditing SCIM API requests)
  Search Optimization Maintenance: SEARCH_OPTIMIZATION_HISTORY (refer to the Search Optimization Service)
  Streams: SYSTEM$STREAM_BACKLOG (refer to Introduction to Streams)
  Tasks: COMPLETE_TASK_GRAPHS (refer to Introduction to tasks), CURRENT_TASK_GRAPHS, SERVERLESS_TASK_HISTORY, TASK_DEPENDENTS, TASK_HISTORY
  Network Rules: NETWORK_RULE_REFERENCES (Information Schema table function; see Network rules)
  Data Quality Monitoring: DATA_METRIC_FUNCTION_REFERENCES (see Introduction to Data Quality and data metric functions), DATA_QUALITY_MONITORING_RESULTS, SYSTEM$DATA_METRIC_SCAN
  Data Lineage: GET_LINEAGE (SNOWFLAKE.CORE) (refer to Data Lineage in Snowsight)
  Cortex Search: CORTEX_SEARCH_DATA_SCAN (refer to Cortex Search)
Syntax

There are seven important categories of table functions:
Data Loading, Data Generation, Data Conversion, Object Modelling, Semi-structured, Query Results, and Usage Information.
Let's take a look at data generation. We can combine scalar and tabular functions here to produce a table of synthetic test data.
SELECT RANDSTR(5, RANDOM()), RANDOM() FROM
TABLE(GENERATOR(ROWCOUNT => 3));

This command will create a tabular result with two columns and three rows.
On the left, I'm selecting the scalar functions RANDSTR and RANDOM, and on the right,
I'm invoking the table function, generator. Generator
creates rows of data based either on a specified number of
rows, like in this example, or you can specify a number of
seconds to run and it will generate as many rows as
possible.

System functions.
These come in three broad classes.
Snowflake provides the following types of system functions:
 Control functions that allow you to execute actions in
the system (e.g. aborting a query - cancel a running
query, accepting a query ID as input).
 Information functions that return information about
the system (e.g. calculating the clustering depth of a
table).
 Information functions that return information about
queries (e.g. information about EXPLAIN plans).

Control functions:
SYSTEM$ABORT_SESSION, SYSTEM$ABORT_TRANSACTION, SYSTEM$ADD_EVENT (for Snowflake Scripting), SYSTEM$AUTHORIZE_PRIVATELINK, SYSTEM$AUTHORIZE_STAGE_PRIVATELINK_ACCESS, SYSTEM$BLOCK_INTERNAL_STAGES_PUBLIC_ACCESS, SYSTEM$CANCEL_ALL_QUERIES, SYSTEM$CANCEL_QUERY, SYSTEM$CLEANUP_DATABASE_ROLE_GRANTS, SYSTEM$COMMIT_MOVE_ORGANIZATION_ACCOUNT, SYSTEM$CONVERT_PIPES_SQS_TO_SNS, SYSTEM$CREATE_BILLING_EVENT, SYSTEM$CREATE_BILLING_EVENTS, SYSTEM$DEPROVISION_PRIVATELINK_ENDPOINT, SYSTEM$DISABLE_BEHAVIOR_CHANGE_BUNDLE, SYSTEM$DISABLE_PREVIEW_ACCESS, SYSTEM$DISABLE_DATABASE_REPLICATION, SYSTEM$ENABLE_BEHAVIOR_CHANGE_BUNDLE, SYSTEM$ENABLE_PREVIEW_ACCESS, SYSTEM$FINISH_OAUTH_FLOW, SYSTEM$GLOBAL_ACCOUNT_SET_PARAMETER, SYSTEM$INITIATE_MOVE_ORGANIZATION_ACCOUNT, SYSTEM$LINK_ACCOUNT_OBJECTS_BY_NAME, SYSTEM$MIGRATE_SAML_IDP_REGISTRATION, SYSTEM$PIPE_FORCE_RESUME, SYSTEM$PIPE_REBINDING_WITH_NOTIFICATION_CHANNEL, SYSTEM$PROVISION_PRIVATELINK_ENDPOINT, SYSTEM$REGISTER_CMK_INFO, SYSTEM$REGISTER_PRIVATELINK_ENDPOINT, SYSTEM$RESTORE_PRIVATELINK_ENDPOINT, SYSTEM$REVOKE_PRIVATELINK, SYSTEM$REVOKE_STAGE_PRIVATELINK_ACCESS, SYSTEM$SEND_NOTIFICATIONS_TO_CATALOG, SYSTEM$SET_APPLICATION_RESTRICTED_FEATURE_ACCESS, SYSTEM$SET_EVENT_SHARING_ACCOUNT_FOR_REGION, SYSTEM$SNOWPIPE_STREAMING_UPDATE_CHANNEL_OFFSET_TOKEN, SYSTEM$START_OAUTH_FLOW, SYSTEM$TASK_DEPENDENTS_ENABLE, SYSTEM$UNBLOCK_INTERNAL_STAGES_PUBLIC_ACCESS, SYSTEM$UNREGISTER_PRIVATELINK_ENDPOINT, SYSTEM$UNSET_EVENT_SHARING_ACCOUNT_FOR_REGION, SYSTEM$USER_TASK_CANCEL_ONGOING_EXECUTIONS, SYSTEM$WAIT

Information functions:
EXTRACT_SEMANTIC_CATEGORIES, GET_ANACONDA_PACKAGES_REPODATA, SHOW_PYTHON_PACKAGES_DEPENDENCIES, SYSTEM$ALLOWLIST, SYSTEM$ALLOWLIST_PRIVATELINK, SYSTEM$BEHAVIOR_CHANGE_BUNDLE_STATUS, SYSTEM$CLIENT_VERSION_INFO, SYSTEM$CLUSTERING_DEPTH, SYSTEM$CLUSTERING_INFORMATION, SYSTEM$CLUSTERING_RATIO (deprecated; use the other clustering functions instead), SYSTEM$CURRENT_USER_TASK_NAME, SYSTEM$DATA_METRIC_SCAN, SYSTEM$DATABASE_REFRESH_HISTORY (deprecated; use DATABASE_REFRESH_HISTORY instead), SYSTEM$DATABASE_REFRESH_PROGRESS and SYSTEM$DATABASE_REFRESH_PROGRESS_BY_JOB (deprecated; use DATABASE_REFRESH_PROGRESS and DATABASE_REFRESH_PROGRESS_BY_JOB instead), SYSTEM$ESTIMATE_AUTOMATIC_CLUSTERING_COSTS, SYSTEM$ESTIMATE_SEARCH_OPTIMIZATION_COSTS, SYSTEM$EXTERNAL_TABLE_PIPE_STATUS, SYSTEM$GENERATE_SAML_CSR, SYSTEM$GENERATE_SCIM_ACCESS_TOKEN, SYSTEM$GET_AWS_SNS_IAM_POLICY, SYSTEM$GET_CLASSIFICATION_RESULT, SYSTEM$GET_CMK_AKV_CONSENT_URL, SYSTEM$GET_CMK_CONFIG, SYSTEM$GET_CMK_INFO, SYSTEM$GET_CMK_KMS_KEY_POLICY, SYSTEM$GET_COMPUTE_POOL_STATUS, SYSTEM$GET_DIRECTORY_TABLE_STATUS, SYSTEM$GET_GCP_KMS_CMK_GRANT_ACCESS_CMD, SYSTEM$GET_ICEBERG_TABLE_INFORMATION, SYSTEM$GET_LOGIN_FAILURE_DETAILS, SYSTEM$GET_PREDECESSOR_RETURN_VALUE, SYSTEM$GET_PREVIEW_ACCESS_STATUS, SYSTEM$GET_PRIVATELINK_AUTHORIZED_ENDPOINTS, SYSTEM$GET_PRIVATELINK_CONFIG, SYSTEM$GET_PRIVATELINK, SYSTEM$GET_PRIVATELINK_ENDPOINTS_INFO, SYSTEM$GET_PRIVATELINK_ENDPOINT_REGISTRATIONS, SYSTEM$GET_SERVICE_DNS_DOMAIN, SYSTEM$GET_SERVICE_LOGS, SYSTEM$GET_SERVICE_STATUS (deprecated; use the SHOW SERVICE CONTAINERS IN SERVICE command instead), SYSTEM$GET_SNOWFLAKE_PLATFORM_INFO, SYSTEM$GET_TAG_ALLOWED_VALUES, SYSTEM$GET_TAG_ON_CURRENT_COLUMN, SYSTEM$GET_TAG_ON_CURRENT_TABLE, SYSTEM$GET_TAG, SYSTEM$GET_TASK_GRAPH_CONFIG, SYSTEM$INTERNAL_STAGES_PUBLIC_ACCESS_STATUS, SYSTEM$IS_APPLICATION_INSTALLED_FROM_SAME_ACCOUNT, SYSTEM$IS_APPLICATION_SHARING_EVENTS_WITH_PROVIDER, SYSTEM$IS_LISTING_PURCHASED, SYSTEM$IS_LISTING_TRIAL, SYSTEM$LAST_CHANGE_COMMIT_TIME, SYSTEM$LIST_APPLICATION_RESTRICTED_FEATURES, SYSTEM$LIST_ICEBERG_TABLES_FROM_CATALOG, SYSTEM$LIST_NAMESPACES_FROM_CATALOG, SYSTEM$LOG and SYSTEM$LOG_<level> (for Snowflake Scripting), SYSTEM$PIPE_STATUS, SYSTEM$QUERY_REFERENCE, SYSTEM$REFERENCE, SYSTEM$REGISTRY_LIST_IMAGES (deprecated; use the SHOW IMAGES IN IMAGE REPOSITORY command instead), SYSTEM$SET_RETURN_VALUE, SYSTEM$SET_SPAN_ATTRIBUTES (for Snowflake Scripting), SYSTEM$SHOW_ACTIVE_BEHAVIOR_CHANGE_BUNDLES, SYSTEM$SHOW_BUDGETS_FOR_RESOURCE, SYSTEM$SHOW_BUDGETS_IN_ACCOUNT, SYSTEM$SHOW_EVENT_SHARING_ACCOUNTS, SYSTEM$SHOW_MOVE_ORGANIZATION_ACCOUNT_STATUS, SYSTEM$SHOW_OAUTH_CLIENT_SECRETS, SYSTEM$STREAM_BACKLOG (this function is a table function), SYSTEM$STREAM_GET_TABLE_TIMESTAMP, SYSTEM$STREAM_HAS_DATA, SYSTEM$TASK_RUNTIME_INFO, SYSTEM$TYPEOF, SYSTEM$VALIDATE_STORAGE_INTEGRATION, SYSTEM$VERIFY_CATALOG_INTEGRATION, SYSTEM$VERIFY_CMK_INFO, SYSTEM$VERIFY_EXTERNAL_OAUTH_TOKEN, SYSTEM$VERIFY_EXTERNAL_VOLUME, SYSTEM$WHITELIST (deprecated; use SYSTEM$ALLOWLIST instead), SYSTEM$WAIT_FOR_SERVICES, SYSTEM$WHITELIST_PRIVATELINK (deprecated; use SYSTEM$ALLOWLIST_PRIVATELINK instead)
Query Information

1. System functions that provide a way to execute actions in the system:
SELECT SYSTEM$CANCEL_QUERY('01a65819-0000-2547-0000-94850008c1ee');

2. System functions that provide information about the system:
SELECT SYSTEM$PIPE_STATUS('my_pipe');

3. System functions that provide information about queries:
SELECT SYSTEM$EXPLAIN_PLAN_JSON('SELECT AMOUNT FROM ACCOUNT');
Estimation Functions

Estimation allows us to perform a calculation more quickly but with less accuracy.

The cardinality estimation group enables us to estimate the number of distinct values in a set of values.
Similarity estimation enables us to estimate the
similarity of two or more sets of values.
Frequency estimation enables us to estimate with what
frequency a certain value appears in a set.
Percentile estimation enables us to estimate the
percentile of a set of values.
Cardinality estimation
Snowflake uses HyperLogLog to estimate the approximate
number of distinct values in a data set. HyperLogLog is a
state-of-the-art cardinality estimation algorithm, capable of
estimating distinct cardinalities of trillions of rows with an
average relative error of a few percent. which returns an
approximation of the distinct number of values of a column.
HyperLogLog can be used in place of COUNT(DISTINCT
…) in situations where estimating cardinality is acceptable.

SELECT APPROX_COUNT_DISTINCT(L_ORDERKEY) FROM LINEITEM;


Output: 1,491,111,415
Execution Time: 44 Seconds

SELECT COUNT(DISTINCT L_ORDERKEY) FROM LINEITEM;


Output: 1,500,000,000
Execution Time: 4 Minutes 20 Seconds

So why use an estimation function if we already have the DISTINCT keyword, which, when used in conjunction with COUNT, will give us an accurate and exact count of the number of distinct values in a column?
Executing a count distinct operation requires an amount of
memory proportional to the cardinality, which could be
quite costly for very large tables.
So Snowflake have implemented something called the
HyperLogLog cardinality estimation algorithm. This returns
an approximation of the distinct number of values of a
column.
The main benefit being that it consumes significantly less
memory and is therefore suitable for when input is quite
large and an approximate result is acceptable.
When compared with COUNT DISTINCT, the average relative error of Snowflake's HyperLogLog implementation is approximately 1.6%.
This means that if count distinct return 1 million,
HyperLogLog would typically return a result in the following
range, which is plus or minus 1.6% of 1 million.
There are six functions available to us.
cardinality estimating Functions
 HLL: Returns an approximation of the distinct
cardinality of the input.
 HLL_ACCUMULATE: Skips the final estimation step and
returns the HyperLogLog state at the end of an
aggregation.
 HLL_COMBINE: Combines (i.e. merges) input states
into a single output state.
 HLL_ESTIMATE: Computes a cardinality estimate of a
HyperLogLog state produced by HLL_ACCUMULATE
and HLL_COMBINE.
 HLL_EXPORT: Converts HyperLogLog states from
BINARY format to an OBJECT (which can then be
printed and exported as JSON).
 HLL_IMPORT: Converts HyperLogLog states from
OBJECT format to BINARY format.

The main one we'll take a look at is called HLL, Short for
HyperLogLog.
The others allow you to perform more advanced use cases
like incremental cardinality estimation, an advanced topic
that's out of scope for this lecture.
In this code example, we're seeing the more human
friendly alias
for the HLL function called Approx_Count_Distinct. We're
basing our calculation on the order key from the line item
table from the Snowflake sample data, which is
approximately 160 gigabytes in size.
I executed this command and got the following value,
completing in 44 seconds.
If we now compare this to a COUNT(DISTINCT) on the same
column, we see the accurate number of distinct values is
exactly one and a half billion rows. As you can see, though,
it took significantly longer to compute.
So if this margin of error is acceptable, this function is great
for getting a rough idea of how many distinct values are in
a column.

Estimating similarity.
Snowflake have implemented a two-step process to
estimate similarity, without the need to compute the
intersection or union of two sets.

Snowflake uses MinHash for estimating the approximate
similarity between two or more data sets. The MinHash
scheme compares sets without computing the intersection
or union of the sets, which enables efficient and effective
estimation.

We might come across a need to compare two sets of rows
and give some indication of how similar they are.
The Jaccard similarity coefficient is a method used to find
the similarity and difference between two sets of values.
To compute its result, we find the ratio of the size of the
intersection to the size of the union of two sets of elements.
We don't need to worry about the mathematics too much,
but what we should bear in mind is this can be quite a
computationally expensive operation.
For that reason, Snowflake have implemented a two-step
process to estimate similarity without the need to compute
the intersection or union of two sets.
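To make the Jaccard idea concrete, here is a rough sketch of what the exact calculation could look like in SQL, using the same key columns as the MINHASH examples below; it is precisely this full INTERSECT/UNION scan that MINHASH avoids:

-- Exact Jaccard similarity: |A ∩ B| / |A ∪ B|.
-- Expensive on large tables, which is the motivation for MINHASH.
SELECT
  (SELECT COUNT(*) FROM (SELECT C_CUSTKEY FROM CUSTOMER
                         INTERSECT
                         SELECT O_CUSTKEY FROM ORDERS) i)
  /
  (SELECT COUNT(*) FROM (SELECT C_CUSTKEY FROM CUSTOMER
                         UNION
                         SELECT O_CUSTKEY FROM ORDERS) u)::FLOAT
  AS exact_jaccard_similarity;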
The first step is to run the MINHASH function on two sets of
input rows. The output of this function is something called
the MINHASH state.
This is an array of hash values derived from the set of input
rows and forms the basis for the comparison between the
two data sets.
MINHASH takes two input parameters, K and an expression. K
determines how many hash values you'd like to be calculated
over the input rows.
The higher this number, the more accurate an estimation
can be.
But bear in mind, increasing this number also increases the
computation time.
The max value you can set is 1024.
SELECT MINHASH(5, C_CUSTKEY) FROM CUSTOMER;

--------------------------
| MINHASH(5, C_CUSTKEY)  |
--------------------------
| {                      |
|   "state": [           |
|     557181968304,      |
|     67530801241,       |
|     1909814111197,     |
|     8406483771,        |
|     34962958513        |
|   ],                   |
|   "type": "minhash",   |
|   "version": 1         |
| }                      |
--------------------------

SELECT APPROXIMATE_SIMILARITY(MH) FROM
( (SELECT MINHASH(5, C_CUSTKEY) MH FROM CUSTOMER)
  UNION
  (SELECT MINHASH(5, O_CUSTKEY) MH FROM ORDERS) );

-------------------------------
| APPROXIMATE_SIMILARITY(MH)  |
-------------------------------
| 0.8                         |
-------------------------------

Once we have the two MINHASH states representing the
sets of rows we'd like to compute the similarity of, we pass
those to a function called APPROXIMATE_SIMILARITY.
In this code example, we union together two MINHASH
states and pass that column to the approximate similarity
function.
You can see here we get a floating point number returned
with a possible value of 0 to 1.
1 indicates that the sets are identical and 0 meaning the
sets have no overlap.
Here we're looking at a customer key ID from related
tables, hence the high similarity.
Frequency estimation.
Snowflake uses the Space-Saving algorithm, a space and
time efficient way of estimating approximate frequent
values in data sets.
The principal function among them is called APPROX_TOP_K. This
function implements something called the Space-Saving
algorithm, used to produce an approximation of values and
their frequencies.
Snowflake provides an implementation of the Space-Saving
algorithm presented in Efficient Computation of Frequent
and Top-k Elements in Data Streams by Metwally, Agrawal
and Abbadi. It is implemented through
the APPROX_TOP_K family of functions.
Additionally, the APPROX_TOP_K_COMBINE function utilizes
the parallel Space-Saving algorithm outlined by Cafaro,
Pulimeno and Tempesta.

SELECT APPROX_TOP_K(P_SIZE, 3, 100000) FROM PART;

----------------------------------------
| APPROX_TOP_K(P_SIZE, 3, 100000)      |
----------------------------------------
| [[13,401087],[38,401074],[35,401033]]|
----------------------------------------

 APPROX_TOP_K: Returns an approximation of frequent
values in the input.
 APPROX_TOP_K_ACCUMULATE: Skips the final
estimation step and returns the Space-Saving state at
the end of an aggregation.
 APPROX_TOP_K_COMBINE: Combines (i.e. merges)
input states into a single output state.
 APPROX_TOP_K_ESTIMATE: Computes the approximate
frequent values from a Space-Saving state produced by
APPROX_TOP_K_ACCUMULATE and
APPROX_TOP_K_COMBINE.

APPROX_TOP_K

SELECT P_SIZE, COUNT(P_SIZE) AS C FROM PART
GROUP BY P_SIZE
ORDER BY C DESC
LIMIT 3;

----------------------
| P_SIZE |    C      |
----------------------
| 13     | 401,087   |
| 38     | 401,074   |
| 35     | 401,033   |
----------------------
Let's look at an example to make this clearer. This function
has three input parameters: the column you'd like to
calculate the frequency of values for, the number of values
you'd like to be approximated, and the maximum number
of distinct values that can be tracked at a time during the
estimation process.
Increasing this maximum makes the estimation more
accurate and, in theory, uses more memory,
although setting it to the max still has good performance in
my experience. If we take a look at the output, we're
seeing the three most frequent values of the P_SIZE column.
If you wanted an exact result instead of an
estimation, you could run the GROUP BY query shown above.
However, as you can see, in this case the approximation
was accurate.

Percentile Estimation
Estimate the percentile of values. To do this, Snowflake
have implemented the t-digest algorithm for the functions
you see here.
Again, ACCUMULATE, COMBINE and ESTIMATE are either helper
commands or produce intermediate steps of the algorithm,
so we'll stick with the main function, which does it all:
APPROX_PERCENTILE.
So taking a step back, what is a percentile?
It's a statistical method to express what percentage of a
set of values is below a certain value.
This can be a bit difficult to conceptualize if you're not
familiar with statistics.
There are four functions:
APPROX_PERCENTILE
APPROX_PERCENTILE_ACCUMULATE
APPROX_PERCENTILE_COMBINE
APPROX_PERCENTILE_ESTIMATE

-- Assumes a TEST_SCORES table with a single SCORE column.
INSERT OVERWRITE INTO TEST_SCORES VALUES
(23),(67),(2),(3),(9),(19),(45),(81),(90),(11);

SELECT APPROX_PERCENTILE(score, 0.8)
FROM TEST_SCORES;

---------------------------------
| APPROX_PERCENTILE(score,0.8)  |
---------------------------------
| 74                            |
---------------------------------
Let's take a look at an example.
We'll insert test scores out of 100 for 10 students. We can
then run the Approx_Percentile function providing as input
the score column and the percentile value between 0 and
1.
You can think of 0.1 being the 10th percentile, 0.2 being
the 20th percentile, and so on.
If we wanted to know the 80th percentile, meaning in other
words, what score you'd have to achieve to get as good or
better than 80% of other students, we'd plug in 0.8.
For this group of 10 students, you'd have to achieve at
least 74 to be at the 80th percentile.

Table Sampling
Table sampling is a convenient way to read a random
subset of rows from a table.
We could use LIMIT to restrict the number of rows we
return, but using sampling is a helpful way of getting a
more representative group of rows to work with.
There are two ways we can determine how our sample of
rows is generated.
1. The first we'll cover is fraction-based.
We indicate we'd like to take a sample of the results of a
query by specifying either the keyword SAMPLE or TABLE
SAMPLE.
SELECT * FROM LINEITEM TABLESAMPLE/SAMPLE [samplingMethod]
(<probability>);

We follow this with the sampling method and then a
probability expressed as a percentage in brackets.
If we don't include a sampling method, by default the
probability would be calculated for each individual row.
However, we could choose to make this explicit by
including either the keyword ROW or BERNOULLI, named
after the Swiss mathematician Jacob Bernoulli, who did
work in probability.
This is followed by a decimal number ranging from 0 to
100.
If the sampling method is set to ROW, this number
represents a probability expressed as a percentage that a
specific row will appear in the result.
SELECT * FROM LINEITEM SAMPLE BERNOULLI/ROW (50);

For example, if I were to put 50 here, this would mean
there would be a 50-50 chance that an individual row
would be present in the result.
So the larger the number, generally speaking, the more
rows.
The resulting sample size is approximately the probability
divided by 100, times the total number of rows this query
would ordinarily produce.
The other sampling method we can use is indicated by
including the keywords BLOCK or SYSTEM.
The idea here is that instead of applying the probability
percentage to individual rows, it'll be to larger blocks of
rows.
SELECT * FROM LINEITEM SAMPLE SYSTEM/BLOCK (50);

This is why you might see it produce less random results,
particularly for smaller tables.
For fraction-based, we can also include something
called a seed.
Ordinarily, if you rerun one of these commands, it would
produce a different set of rows with each execution.
We can include a random integer called a seed to
generate the same results each time a sampling query is
repeated, making it deterministic.
We follow the probability with either the keyword SEED or
REPEATABLE, and then a positive integer in this range.
SELECT * FROM LINEITEM SAMPLE (50) REPEATABLE/SEED
(765);

Using a seed could produce different results between
executions if the table data is changed or you're sampling a
copy of a table.
2. The Second way we can determine how many
rows are produced is called fixed-size.
This method is a bit more straightforward. Here, we can
determine exactly how many rows we'd like to be produced
by including an integer between 0 and 1 million, and follow
that with the keyword ROWS.
SELECT * FROM LINEITEM TABLESAMPLE/SAMPLE (<num> ROWS);

SELECT L_TAX, L_SHIPMODE FROM LINEITEM SAMPLE
BERNOULLI/ROW (3 ROWS);
Output
--------------------------
| L_TAX  | L_SHIPMODE    |
--------------------------
| 0.02   | REG AIR       |
| 0.02   | TRUCK         |
| 0.06   | TRUCK         |
--------------------------
In this code example, we're sampling three rows from the
LINEITEM table.
Both the BLOCK sampling method and seed are not
supported with fixed-size and will result in an error
if included.
Unstructured Data File Functions

So far, we've had a look at how we load and transform
structured data like CSV and semi-structured data like
JSON, but how does Snowflake handle unstructured data?
Data that is not arranged according to any predefined data
model or schema?
This includes multimedia sources like image files, audio
files, and video, as well as documents like PDFs,
spreadsheets, or text-based like Word documents.
These file types either do not fit neatly or at all into the
data structure of a table.

Types of URLs available to access files


The following types of URLs are available to access files in
cloud storage:
Scoped URL
Encoded URL that permits temporary access to a staged
file without granting privileges to the stage.
The URL expires when the persisted query result
period ends (i.e. the results cache expires), which is
currently 24 hours.
File URL
URL that identifies the database, schema, stage, and file
path to a set of files. A role that has sufficient privileges on
the stage can access the files.
Pre-signed URL
Simple HTTPS URL used to access a file via a web browser.
A file is temporarily accessible to users via this URL using a
pre-signed access token. The expiration time for the access
token is configurable.

Snowflake have six file functions in the scalar
function category.

 GET_STAGE_LOCATION: Returns the URL for an external or
internal named stage, using the stage name as the input.
 GET_RELATIVE_PATH: Extracts the path of a staged file
relative to its location in the stage, using the stage name
and absolute file path in cloud storage as inputs.
 GET_ABSOLUTE_PATH: Returns the absolute path of a staged
file, using the stage name and the path of the file relative
to its location in the stage as inputs.
 GET_PRESIGNED_URL: Generates the pre-signed URL to a
staged file, using the stage name and relative file path as
inputs. Access files in an external stage using the function.
 BUILD_SCOPED_FILE_URL: Generates a scoped Snowflake
file URL to a staged file, using the stage name and relative
file path as inputs.
 BUILD_STAGE_FILE_URL: Generates a Snowflake file URL to
a staged file, using the stage name and relative file path
as inputs.

We're going to walk through three of these, which are
directly involved in accessing and using unstructured files.
So at a high level: firstly, we upload an unstructured data
file, like an image, to an internal or external named stage.
We can then leverage one of these three functions to
generate a URL from that file.
With each function having different requirements around
authorization and validity, this URL can then be used in
many different ways to access that staged unstructured
file.
Let's go through each function and walk through some
examples of how we can use the generated URLs.
1. Build scoped file URL
It takes two input parameters, the external or internal
named stage identifier and a relative path for a file in the
stage, and outputs an encoded URL granting the user that
requested it access to the file for 24 hours.
Syntax
BUILD_SCOPED_FILE_URL( @<stage_name> ,
'<relative_file_path>' )

stage_name
Name of the internal or external stage where the file is
stored.
relative_file_path
Path and filename of the file, relative to its location on
the stage.

This code example shows how we can use this function in a
query.

SELECT
BUILD_SCOPED_FILE_URL(@images_stage, '/us/yosemite/half_dome.jpg');

We're providing the name of an internal named stage
called images_stage, and then the relative stage path to a
specific JPEG file.

If we were to run this, we'd get a string: a Snowflake-
hosted URL in which the stage and file names are encoded.
If we're in Snowsight and we're the user that called the
function, we can click this URL to download the file.
We could also retrieve the image by making a request to
snowflake's API, and when this function is used directly in a
query like this, the currently active role must have
permissions on the stage.
However, we can give other users and roles the ability to
generate a file URL without requiring privileges on the
stage.
We do this by including the function call in a UDF, a stored
procedure, or as part of a view definition.
CREATE VIEW PRODUCT_SCOPED_URL_VIEW AS SELECT
build_scoped_file_url(@images_stage, 'prod_z1c.jpg') AS
scoped_file_url;
SELECT * FROM PRODUCT_SCOPED_URL_VIEW;

For example, selecting from this view returns the following
output:
-----------------------------------------------------------------------
|SCOPED_FILE_URL |
-----------------------------------------------------------------------
|https://go44755.eu-west-2.aws.snowflakecomputing.com/api/files/ |
|01a691df-0000-277e-0000-
9485000bc022/163298951696390/5fGgfDJX6kvA |
|qZx6tUJNjWDXEu%2f8%2b7a
%2fqQ5HFPCKKMs81o1MC5NSLKPzC6p2hy670VChIC7[…] |
-----------------------------------------------------------------------

The user that selects from this view can then download the
file we define in the view.

2. Build stage file URL
This is similar to the previous example; however, it's for
permanent access to a file. There is no validity period
attached to the type of URL generated by this function.
The function itself takes the same two input parameters,
however, the structure of the output URL is different.
Syntax
BUILD_STAGE_FILE_URL( @<stage_name> , '<relative_file_path>' )
Returns
The function returns a file URL in the following format:
https://<account_identifier>/api/files/<db_name>/
<schema_name>/<stage_name>/<relative_path>
account_identifier
Hostname of the Snowflake account for your stage.
db_name
Name of the database that contains the stage where
your files are located.
schema_name
Name of the schema that contains the stage where
your files are located.
stage_name
Name of the stage where your files are located.
relative_path
Path to the files to access using the file URL.

It's no longer encoded and includes the database, the
schema, the stage identifier, and the relative path of where
the file is located.
SELECT build_stage_file_url(@images_stage, 'prod_z1c.jpg');

And again, if you're in the Snowsight UI, not the classic
console, this URL can be clicked to download the file from
the stage.
This function also differs in that it requires privileges on the
underlying stage: that is, USAGE for external stages and
READ for internal stages.
This applies whether using it directly in a query or as part
of a view, UDF, or stored procedure.
CREATE STAGE MY_STAGE ENCRYPTION = (TYPE =
'SNOWFLAKE_SSE');
ALTER STAGE MY_STAGE SET ENCRYPTION = (TYPE =
'SNOWFLAKE_SSE');

If files downloaded from an internal stage have any issues
like corruption, it could be because client-side encryption is
set on the stage.
To fix this, we can set server side encryption with the
following command during stage creation or after the fact
with the alter command.
3. Pre-signed URL
This is similar to what you'd get directly from a cloud
provider's blob storage.
It's a URL that gives access to a file without needing to
authenticate with snowflake.
For this function, we have an additional input parameter
along with our stage and relative path.
It's an expiration time expressed in seconds after
which access to the file or files is revoked.
Let's include the function in a query and set our expiration
time to 600 seconds or 10 minutes.
GET_PRESIGNED_URL( @<stage_name> , '<relative_file_path>' ,
[ <expiration_time> ] )

SELECT get_presigned_url(@images_stage, 'prod_z1c.jpg', 600);
With this URL, we could go directly into a browser and
paste it into the address field to download our file.
We could also use it in a BI tool, for example, to display our
unstructured data. And again, to generate a pre-signed
URL, the role calling the function must have the usage
privilege on an external stage and the read privilege on an
internal stage.
Okay, let's take a look at a practical example of how we
could include unstructured data into our data analysis.
Using the pre-signed URL function as an example, our aim
is to present in one view a link to a PDF document
alongside some relevant metadata like author and the date
the document was published.
To achieve this, we'll create an internal name stage called
documents_stage in which we store the unstructured PDF
file itself as well as a JSON file describing the contents of
the document.
If we take a look in this file, we can see stored here, some
information describing the document, and crucially, we
store the relative path of the file in the stage.
The next step is to create a table we can store our
document metadata in. This is followed by a COPY INTO
statement to parse the metadata JSON file in the stage and
load it into our table.
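A rough sketch of those steps, where the metadata file name, column list, and format options are assumptions made to match the walkthrough rather than the course's exact code:

-- Internal stage holding the PDF and its JSON metadata file.
CREATE STAGE documents_stage;

-- Table for the document metadata described in the JSON file.
CREATE TABLE documents_table (
  author        STRING,
  published_on  DATE,
  relative_path STRING,
  topics        VARIANT
);

-- Parse the staged JSON metadata file and load it into the table.
COPY INTO documents_table
FROM @documents_stage/document_metadata.json
FILE_FORMAT = (TYPE = JSON STRIP_OUTER_ARRAY = TRUE)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;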
And finally here we create a view combining the metadata
and access to the unstructured data file by generating a
URL from the stage name and the relative path stored in
the metadata table.
If we select from this view, we now have access to the PDF
document sitting alongside relevant descriptive
information.
CREATE VIEW document_catalog AS
(
SELECT
author,
published_on,
get_presigned_url(@documents_stage, relative_path) as
presigned_url,
topics
FROM
documents_table
);

SELECT * FROM document_catalog

GET_STAGE_LOCATION
Retrieves the URL for an external or internal named stage
using the stage name as the input.
Syntax
GET_STAGE_LOCATION( @<stage_name> )

GET_RELATIVE_PATH
Extracts the path of a staged file relative to its location in
the stage using the stage name and absolute file path in
cloud storage as inputs.
Syntax
GET_RELATIVE_PATH( @<stage_name> ,
'<absolute_file_path>' )

GET_ABSOLUTE_PATH
Retrieves the absolute path of a staged file using the stage
name and path of the file relative to its location in the
stage as inputs.
Syntax
GET_ABSOLUTE_PATH( @<stage_name> ,
'<relative_file_path>' )

Directory Tables
A directory table is an implicit object layered on a stage
(not a separate database object) and is conceptually
similar to an external table because it stores file-level
metadata about the data files in the stage. A directory
table has no grantable privileges of its own.
A directory table is something you can enable on external
and internal named stages.
It's not actually a separate database object.
We can enable it on either of these stage types during
stage creation or after the fact with the ALTER STAGE
command.
But what are they for? They're simply an interface we can
query to retrieve metadata about the files residing in a
stage.
Their output is somewhat similar to the LIST command we
reviewed when looking at stages.
And here is the syntax for querying a directory table:
SELECT * FROM DIRECTORY(@INT_STAGE);

It includes an additional field relevant to unstructured data
access called FILE_URL.
This is a Snowflake hosted file URL identical to the URL
you'd get if you executed the build stage file URL function.
We also have additional information like the file size, the
file hash, and the last modified time of a stage file.
So in short, with directory tables, we get a queryable data
set, which automatically comes with permanent file URLs to
stage files.
Directory tables need to be refreshed, which describes the
process of synchronizing the metadata with the latest files
in the stage.
Directory Tables must be refreshed to reflect the most up-
to-date changes made to stage contents. This includes new
files uploaded, removed files and changes to files in the
path.
ALTER STAGE INT_STAGE REFRESH;
If we upload a file to a stage and then query the directory
table without first refreshing, we won't see the new file in
the output.
How we perform a refresh depends on if we've enabled
directory tables on an external or internal stage.
CREATE STAGE INT_STAGE DIRECTORY = (ENABLE = TRUE);

DESCRIBE STAGE EXT_STAGE;


One option, relevant for both types of stages, is to do a
manual refresh.
We can achieve this with the ALTER STAGE ... REFRESH
command shown above.
The second option, which is relevant for external stages
only, is to set up automated refreshers.
This is done by setting up event notifications in the cloud
provider in which your stage is hosted.
Let's run through at a high level the steps for an AWS
Snowflake account and an external S3 bucket.
The first thing we do is enable a directory table on an
external stage.
We then execute a describe command on that stage to
retrieve the Amazon resource name for a Snowflake
managed SQS Queue.
This queue allows the stage to receive notifications about
new files.
And lastly, we configure event notifications for the S3
bucket underpinning our external stage.
These will notify the SQS queue about new or updated
files, refreshing the directory table metadata without you
needing to manually execute a stage refresh statement.
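Assuming an S3-backed external stage, the whole flow could be sketched roughly like this; the bucket URL and storage integration names are placeholders, not the course's code:

-- 1. Enable a directory table on the external stage, with auto-refresh.
CREATE STAGE EXT_STAGE
  URL = 's3://my-bucket/files/'
  STORAGE_INTEGRATION = my_s3_integration
  DIRECTORY = (ENABLE = TRUE AUTO_REFRESH = TRUE);

-- 2. Retrieve the ARN of the Snowflake-managed SQS queue from the output.
DESCRIBE STAGE EXT_STAGE;

-- 3. In AWS, configure S3 event notifications on the bucket to send
--    object created/removed events to that SQS queue.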

File Support REST API

Snowflake exposes many different endpoints to interact
with our accounts.
We previously mentioned the Snowpipe REST API.
There's also the SQL REST API, but what we'll look at now is
the file support REST API.
This allows us to download data files from either an internal
stage or an external stage programmatically, ideal for
building custom applications like a custom dashboard.
To get a URL to pass to the API, we can:
 Generate a scoped URL by calling
the BUILD_SCOPED_FILE_URL SQL function.
 Generate a file URL by calling
the BUILD_STAGE_FILE_URL SQL function.
Alternatively, query the directory table for the stage, if
available.
import requests

# Placeholder values: the file URL or scoped URL generated in Snowflake,
# and an OAuth token used to authenticate with the account.
url = "<file_or_scoped_url>"
token = "<oauth_access_token>"

response = requests.get(url,
    headers={
        "User-Agent": "reg-tests",
        "Accept": "*/*",
        "X-Snowflake-Authorization-Token-Type": "OAUTH",
        "Authorization": """Bearer {}""".format(token)
    },
    allow_redirects=True)

print(response.status_code)
print(response.content)

This takes either a scoped URL or a file URL, like those we


previously generated. It doesn't allow for downloading files
for pre-signed URLs.
Depending on which type of URL we provide, the
authorization to access files will be different.
For scoped URLs, only the user who generated the scoped
URL can download the stage file, and for a file URL, any
role that has privileges on the underlying stage can access
the file, that's USAGE for external stages and READ for
internal stages.
Okay, let's take a look at a simple Python client that makes
an HTTP GET request to this REST API.
This bit of code is a slightly modified example from the
Snowflake documentation.
In the first line here,
we're importing the requests Python module. This includes
all the methods required to send HTTP requests easily.

Our first argument to the GET method is our URL.


Here I've hardcoded the value, but we could do something
like use the SQL REST API to make a call to a stored
procedure that generates the URL for us.
This is followed by a series of headers that go along with
the request.
We won't go too deep into this section, but here we're
providing an OAuth token to authenticate with Snowflake.
The response we get from the API is stored in an object
from which its contents can be accessed, like an image or a
PDF file.
Storage Layer Overview

Let's clarify what we mean by storage in Snowflake. When
we think storage, we should primarily have in mind the
centralized, scalable storage layer, which holds data for our
tables, organized by databases and schemas.
Under the hood, we're storing data in the scalable blob
storage of the cloud provider our Snowflake account is
deployed into.
For AWS, it'll be S3 storage. For Azure, it'll be Azure Blob
Storage. In GCP, it's simply called Cloud Storage.
When data is loaded into this layer, whether by the UI
INSERT statements or the COPY INTO <table> command, it
undergoes a reorganization process, transforming it into
Snowflake's proprietary columnar data format, which is
compressed, encrypted, and optimized for analytical
processing.
Structured and semi-structured data can be loaded into the
storage layer, and both undergo the same reorganization
process.
After we've ingested data into a table in the storage layer,
we no longer have access to the underlying files.
These are completely managed by Snowflake.
The main way to access the data is by SQL query
operations.
There is also an option to transform and download table
data as either CSV, JSON, or Parquet.
The optimized data files Snowflake stores in the storage
layer are partitioned transparently to the user into what
they call micro-partitions.
We don't need to specify a static partitioning key as we
load data. It's all done automatically.

Micro-partitions
The process of loading table data and how it's physically
stored in the underlying blob storage.
Let's say we want to ingest this very simple CSV file
comprised of three fields. To get this file into a Snowflake
table and therefore the storage layer, we need to execute
this copy into table command.
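A minimal sketch of such a load, assuming a hypothetical staged file and file format rather than the course's exact command:

-- Load a staged CSV file into a table; the load triggers Snowflake's
-- reorganization of the rows into micro-partitions.
COPY INTO MY_CSV_TABLE
FROM @my_csv_stage/orders.csv
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);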
When this command is executed, it kicks off the
reorganization process that transforms our input file into
Snowflake's own optimized file format.
At the same time, this process is also transparently
partitioning the input file rows, forming discrete files
Snowflake calls micro-partitions, each containing a subset
of the data.
And unlike traditional systems, the user doesn't define the
key by which the table is partitioned during loading. It's
done automatically by Snowflake.
So if we don't specify a key, what does Snowflake use to
partition our input data?
Snowflake automatically partitions based on the natural
ordering of input data as it's loaded, and they don't reorder
any of the input data.
They simply divide it based on the order in which it arrives.
These partitions are the physical files that are stored
together in the cloud blob storage.
They range in size from 50 to 500 megabytes of
uncompressed data, and they're named micro-partitions
because a large table could potentially have millions of
these relatively small partition files.
Micro-partitions undergo a reorganisation process into the
Snowflake columnar data format.

Our example here shows two micro-partitions being
generated from one very small CSV file. In reality, because
the minimum size for a micro-partition is 50 megabytes,
it would only form one micro-partition. I've split them here
just to make it a bit clearer.

Let's look at the data format Snowflake uses for its
micro-partitions.
As part of the reorganization process, column values in a
micro-partition are grouped together, as we can see in the
example on the right.
This means all the data for one attribute or column is
stored contiguously on disc, as opposed to a row store, in
which all the data of a row is stored together on disc.
This is called a columnar store and provides two main
benefits.
You can apply compression to a single data type for a
column, allowing the optimal compression scheme for that
data type, keeping storage down.
And when querying in Snowflake, only the columns in the
projection need to be retrieved from a file, allowing a level
of column pruning in a micro-partition.
Each micro-partition file is immutable, meaning once a
partition is written, it cannot be altered. they are write
once and read many.

For example, if you were to execute a DML update operation,
a completely new partition will be written, forming the
latest version of a table.
The metadata service in the global services layer of the
Snowflake architecture is responsible for keeping track of
the micro-partition versioning metadata.
It also keeps track of many different pieces of metadata,
including both at the table and micro-partition level.
Micro-partition Metadata
For tables, Snowflake maintained metadata on things like
the structure of the table, the row count, and the table size
along with much more.
Relevant here, though, is what metadata is maintained for
each micro-partition.
The MIN and MAX values for each of the columns in the
micro-partition are stored.
If we take our example here, it will know that the MIN order
ID for micro-partition 1 is 1, and the MAX order ID is 3.
Because these values are in the metadata store, if we were
to run a MIN or MAX function on a column of this table, it
won't require a running virtual warehouse. It will pull the
result from the metadata store.
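For example, a query like the following could be answered from this metadata alone, without spinning up a warehouse (using the same table and column as the pruning example below):

-- Served from micro-partition metadata; no running warehouse is needed.
SELECT MIN(ORDER_ID), MAX(ORDER_ID)
FROM MY_CSV_TABLE;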
The metadata store also stores the number of distinct
values in a micro-partition, as well as how many micro-
partitions make up a table.
Essentially, the system will maintain an understanding of
the distribution of the data in the micro-partitions.
Micro-partition Pruning
The MIN and MAX metadata is key to understanding an
important concept called micro partition pruning.
Micro-partition metadata allows Snowflake to optimize a
query by first checking the min-max metadata of a column
and discarding micro-partitions from the query plan that
are not required
So let's say we keep loading files into our CSV table. These
new micro-partitions are being created as more data is
added to the table.
On the right of each micro-partition, I'm showing the MIN
and MAX value of the Order ID column.
Snowflake can optimize queries that incorporate a
constraint on this column by first checking the MIN and
MAX values in the metadata store and discarding any
micro-partitions from the query plan.
The metadata is typically considerably smaller than the
actual data, speeding up query time.
The key benefit here is that we limit reading micro-
partitions from the storage layer to only those needed to
compute the result of the query.
SELECT ORDER_ID, ITEM_ID
FROM MY_CSV_TABLE
WHERE ORDER_ID > 360 AND ORDER_ID < 460;

For example, if I ran this query with a predicate including
order ID, in which I only wanted orders with an order ID
greater than 360 and less than 460, it could safely prune, or
ignore, micro-partitions 1, 2, 3, and 6.
As it knows, the values we want aren't in those micro-
partitions.
The metadata is typically quite a bit smaller than the actual
data, hence querying it is quicker than scanning the values
of an entire micro-partition.

Time Travel & Fail-Safe


Snowflake is a fully updateable relational database with a
complete set of DML SQL capabilities, allowing the user to
update rows, delete rows and drop objects, so there's a risk
of data loss.
To address this, Snowflake have implemented two unique
measures to safeguard against accidental errors, system
failures and malicious acts, what they call time travel and
fail-safe.
They form steps in the lifecycle of table data.
The first step is active data that makes up a table.
This comprises the most up-to-date micro partition
versions.
The second is time travel. Once the micro partitions are out
of date, either having been updated or deleted, they can sit
in storage for a configurable retention period, allowing us
to go back in time to view the data as it was in the past.
After the Time Travel retention period has elapsed,
the data enters the last stage, called Fail-safe. In Fail-safe,
Snowflake will maintain a copy of the data for seven days,
but it can only be recovered on request.
Time Travel
So let's take a look at time travel.
Time travel has three main functions.
The first: Time Travel enables users to restore objects such
as tables, schemas and databases that have been dropped,
by using the UNDROP command.
The second is to analyze table data at a point in the past
by querying it using special extensions to SQL: Time Travel
enables users to analyse historical data by querying it at
points in the past.
UNDROP DATABASE MY_DATABASE;

SELECT * FROM MY_TABLE
AT(TIMESTAMP => TO_TIMESTAMP('2021-01-01'));

CREATE TABLE MY_TABLE_CLONE CLONE MY_TABLE
AT(TIMESTAMP => TO_TIMESTAMP('2021-01-01'));

In this code example, we're selecting everything from a
table and using the AT keyword, choosing a specific point in
the past.
And lastly, we can also combine time travel and a feature
called cloning to create copies of objects from a point in the
past.
Time travel retention period.
This is a configurable number of days Snowflake will
maintain out-of-date micro partitions for a table, and
effectively defines a period in the past we can go back into
to perform the time travel data restoration task we just
mentioned.
The time travel retention period is configured with a
parameter called DATA_RETENTION_TIME_IN_DAYS.
ALTER DATABASE MY_DB
SET DATA_RETENTION_TIME_IN_DAYS=90;

Here we can see some sample code setting this parameter
at the database level.
This would mean any permanent table created in that
database would be able to restore data up to 90 days in the
past.
The default retention period on the account, database,
schema and table level is 1 day.
On the Standard edition of Snowflake the minimum value is
0 and the maximum is 1 day, and for Enterprise and higher
the maximum is increased from 1 to 90 days.
Temporary and transient objects can have a max retention
period of 1 day across all editions.
This parameter can also be set at the account level,
schema level, and individual table level.
For the account level, it would require the account admin
role.
If this parameter was set on all levels, the child object
takes precedence. So if I created a permanent table in
this database and set the data retention time in days to 10
days, I could only use time travel features up to 10 days in
the past, not the 90 days it would ordinarily inherit from
the database.
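A sketch of that table-level override, where MY_SCHEMA is a placeholder name:

-- Table-level setting overrides the 90 days inherited from the database.
CREATE TABLE MY_DB.MY_SCHEMA.MY_TABLE (ID NUMBER)
  DATA_RETENTION_TIME_IN_DAYS = 10;

-- Or change it after creation:
ALTER TABLE MY_DB.MY_SCHEMA.MY_TABLE
  SET DATA_RETENTION_TIME_IN_DAYS = 10;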
The default retention period for all editions of Snowflake on
the account, database, schema and table level is one day.
This means that if you never play around with this
parameter, you'll always have one day to undrop tables
and look at table data up to 24 hours in the past.
If we're on the standard edition of Snowflake, the absolute
maximum number of days we can set for the retention
period is one, the same as the default.
The minimum number of days, regardless of which
Snowflake edition you're on, is zero.
Setting it to zero would effectively turn off the time travel
feature for a specific object.
If you were to set the retention period as zero for the
account level, all objects that don't have the retention
period specifically defined would effectively have time
travel disabled.
For Snowflake editions Enterprise and higher, the max
retention period gets boosted quite a bit to 90 days.
However, for temporary and transient objects across all
editions of Snowflake, the max retention period can either
be set to zero or one.
Accessing Data In Time Travel
Now let's take a look at how we as users work with time
travel.
Snowflake have extended the SQL language to include
three keywords, AT, BEFORE and UNDROP.
Adding the AT clause to the end of a SELECT query, like in
the code example shown here, allows us to view a table as
it was at the time of running a statement.
SELECT * FROM MY_TABLE
AT(STATEMENT =>
'01a00686-0000-0c47');
This includes the changes made by the statement we're
using as input.
The AT keyword allows you to capture historical data
inclusive of all changes made by a statement or transaction
up until that point.
For example, if this statement inserted 10 new rows into a
table, it would contain those in the result.
Instead of using a statement ID as input, like we're doing
here,
we can use parameters to specify a point in the past.
 TIMESTAMP
 OFFSET
 STATEMENT
We could use OFFSET or TIMESTAMP. OFFSET specifies how
many seconds in the past to look back, expressed as a
negative number. TIMESTAMP allows us to choose an exact
time in the past.
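For example, a sketch using OFFSET to look five minutes into the past:

-- Query the table as it was 5 minutes (300 seconds) ago.
SELECT * FROM MY_TABLE AT(OFFSET => -60*5);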
The BEFORE keyword is very similar to AT.
The BEFORE keyword allows you to select historical data
from a table up to, but not including any changes made by
a specified statement or transaction.
One parameter is available to specify a point in the past:
• STATEMENT
This means that if we use a statement ID for a query that
inserted 10 rows into a table, the result of this code
example here would not include those 10 records.
SELECT * FROM MY_TABLE
BEFORE(STATEMENT =>
'01a00686-0000-0c47');

BEFORE has only one input parameter, a statement ID.


So third and final, we have UNDROP.
The UNDROP keyword can be used to restore the most
recent version of a dropped table, schema or database.
If an object of the same name already exists an error is
returned.
UNDROP TABLE MY_TABLE;
UNDROP SCHEMA MY_SCHEMA;
UNDROP DATABASE MY_DATABASE;
When a table, schema or database is dropped, it's kept for
the duration of its configured retention period.
During this time, the UNDROP keyword can be used to
restore the most recent version of a dropped table, schema
or database.
If a table, schema or database already exists with the name
of the dropped object being restored, an error will be
returned.
To check which objects have been dropped,
you can use a SHOW TABLES HISTORY; command.
Fail-safe
So once data has moved out of the time travel retention
period, it enters what Snowflake call fail-safe.
Fail-safe is a non-configurable period of 7 days in which
historical data can be recovered by contacting Snowflake
support.
Fail-safe is intended to ensure historical data is protected in
the event of a system failure or other catastrophic event.
For example, hardware failure or a security breach. It's
really reserved for last resort data recovery and not
intended to be frequently accessed like time travel.
It could take several hours or several days for Snowflake to
complete recovery.
The seven-day period only applies for permanent objects
and not transient or temporary.
Data in fail-safe and time travel both contribute to data
storage costs.

Cloning
Cloning is the process of creating a copy of an existing
object within a Snowflake account.
you can clone an object at a specific point within the time
travel retention period.
Users can clone:
• DATABASES
• SCHEMAS
• TABLES
• STREAMS
• STAGES
• FILE FORMATS
• SEQUENCES
• TASKS
• PIPES (reference external stage only)
Here is a code example showing how simple it is to clone a table.
CREATE TABLE MY_TABLE (
COL_1 NUMBER COMMENT 'COLUMN ONE',
COL_2 STRING COMMENT 'COLUMN TWO'
);

CREATE TABLE MY_TABLE_CLONE CLONE MY_TABLE;


CREATE STAGE MY_EXT_STAGE
URL='s3://raw/files/'
CREDENTIALS=(...);

CREATE STAGE MY_EXT_STAGE_CLONE CLONE MY_EXT_STAGE;

CREATE FILE FORMAT MY_FF
TYPE=JSON;

CREATE FILE FORMAT MY_FF_CLONE CLONE MY_FF;

As part of the DDL, we provide an identifier for the clone,
and then the keyword CLONE. This is then followed by an
identifier for the source object you'd like to clone.
You can also clone databases and schemas. Cloning
containers like these is recursive and could include
objects which can't be directly cloned using the
clone keyword.
For example, if you cloned a schema with a view in it, the
cloned schema would also contain a view, despite the fact
you can't directly clone a view.
However, you can directly clone tables, streams, stages,
and file formats.
We can also clone sequence, task, and pipe objects. With
pipes, there's a limitation to mention.
Only pipes that reference an external stage can be cloned,
not pipes referencing an internal stage. More on pipes in
the data loading section.
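For the container objects mentioned above, the syntax is the same as for tables; a sketch with placeholder names:

-- Recursively clones all schemas, tables, views, etc. within the database.
CREATE DATABASE MY_DATABASE_CLONE CLONE MY_DATABASE;

-- Recursively clones the objects within the schema.
CREATE SCHEMA MY_SCHEMA_CLONE CLONE MY_SCHEMA;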
And cloning is a metadata-only operation. It copies
the properties and structure of the source object.
Some objects, like file formats, are simpler to copy, as
they're just named wrappers for properties.
But what do we do with objects which manage
micropartitions?
When we execute a clone command on a table, for
example, we're not actually copying the data files of the
source table.
We're just making an identical metadata version of the
object, which points to the source table's micropartitions.
So really, we're not actually duplicating any data, and for
that reason, cloning does not contribute to storage costs
until data is modified, or new data is added to the clone.
Cloning does not contribute to storage costs until data is
modified or new data is added to the clone.
Zero-copy Cloning
The approach Snowflake takes when copying a table is called zero-copy cloning.
Changes made after the point of cloning then start
to create additional micro-partitions.
Changes made to the source or the clone are not
reflected between each other, they are completely
independent.
Clones can be cloned with nearly no limits.
Let's visualize how this works.

CREATE TABLE MY_CLONE CLONE MY_TABLE;


When we execute this clone table command, the clone will
initially just point to the existing micro partitions of the
source table.
If I were to query the clone table, it would return the results
pulled from the source table's micro partitions.
They would be identical in structure and data at this point.
Changes made to the clone, such as inserting new rows
after the point of cloning, then start to create additional
micro-partitions under the clone.

P4 and P5 micro-partitions are what would incur additional
storage cost.
So for a clone table, you end up with a mixture of shared
micro-partitions and independent micro-partitions.
And changes made to the source or the clone are not
reflected between each other.
They're completely independent objects. It's worth pointing
out too that you can clone an already cloned object with
nearly no limits.
Although the process of cloning is not instant, it depends
on what exactly you're cloning, it's very quick and allows
users to rapidly create a clone.
This is particularly good for testing purposes.
For example, it's useful if you want to quickly get a copy
from a live environment to perform some integration tests.
Cloning Rules
Firstly, a cloned object does not retain any privileges
granted on the source object unless the additional COPY
GRANTS clause is specified.
An exception to this is child objects, such as tables, within a
cloned database or schema, which do retain their privileges.
A system admin, or the owner of the cloned object, could
also manually grant privileges to the clone after it's created.
Cloning is recursive for databases and schemas. So cloning
a database will clone all the schemas and tables, but it also
clones other objects like views, pipes, stages, etc..
Each of these has their own specific rules when it comes to
cloning.
For example, like we've mentioned, when a pipe is cloned
as part of a clone database command, it will only copy
pipes, which reference an external stage, not an internal
one.
External tables and internal named stages are never cloned.
A clone table does not contain the load history of the
source table. In other words, the clone table has no
memory of what files were uploaded to make the source
table.
This could lead to a situation in which duplicate data could
be ingested into the clone.
Temporary and transient tables can only be cloned as
temporary or transient tables, not permanent tables.
Permanent tables, on the other hand, can be cloned as
temporary, transient, or permanent tables.

Cloning With Time Travel

Time Travel and Cloning features can be combined to
create a clone of an existing database, schema,
non-temporary table or stream at a point within their
retention period.
It's possible to combine time travel and cloning to create a
table, schema, database, or stream clone from a point in
the past within the object's retention period.
CREATE TABLE MY_TABLE_CLONE CLONE MY_TABLE
AT (TIMESTAMP => TO_TIMESTAMP('2022-01-01'));

We could also use the BEFORE clause here too.
If the source object did not exist at the time specified in the
AT | BEFORE parameter an error is thrown.

Replication
Replication is a feature in Snowflake, which enables
replicating databases between Snowflake accounts within a
single organization.
Typically, replication is used to ensure business continuity
and disaster recovery if we're unlucky enough that an
entire cloud provider region goes down because of an
outage.
We might be tempted to compare this to cloning.

However, replication differs from cloning in two ways.


Firstly, it's between accounts within an organization,
whereas cloning is within one account. And with replication,
the data is physically moved.
Whereas in cloning, the cloned object only references the
underlying storage via metadata.
How do we go about setting up replication?
If you have the Organizations feature enabled for your
account, a user with the ORGADMIN role can enable
replication by setting the account parameter
ENABLE_ACCOUNT_DATABASE_REPLICATION to true for each
source and target account.
And if you don't have the organization's feature enabled,
you'll have to contact Snowflake support.
So once it's enabled, the first step is to select a database to
serve as the primary database.
And in the target account, you then create what we call a
secondary database as a replica of the primary database.
This is a read-only replica of the primary database,
ALTER DATABASE DB_1 ENABLE REPLICATION TO ACCOUNTS
ORG1.account2;
CREATE DATABASE DB_1_REPLICA
AS REPLICA OF ORG_1.account1.DB_1;

ALTER DATABASE DB_1_REPLICA REFRESH;

Running the refresh command will kick off an initial process
to transfer database objects and data to the secondary
database. The secondary database can be refreshed
periodically with a snapshot of the primary database,
replicating all data, as well as the DDL operations on the
database objects.
There are some additional bits of information about
replication that's good to keep in mind.
When creating a secondary database, currently, some
database objects are not replicated.
This includes external tables. The command we saw on the
previous slide to perform an incremental refresh of the
secondary database can be automated by configuring a
task to run the refresh command on a schedule.
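A sketch of such a task, assuming a hypothetical warehouse MY_WH and a 10-minute schedule:

-- Periodically refresh the secondary database on a schedule.
CREATE TASK refresh_db_1_replica
  WAREHOUSE = MY_WH
  SCHEDULE = '10 MINUTE'
AS
  ALTER DATABASE DB_1_REPLICA REFRESH;

ALTER TASK refresh_db_1_replica RESUME;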
And the refresh operation fails if the primary database
contains an event table or an external table.
Privileges granted to database objects are not replicated to
the secondary database.
Billing for database replication is based on data transfer
and compute resources.
Initial database replication and subsequent synchronization
operations transfer data between regions, which Cloud
providers charge for. Compute resources refer to the
Snowflake resources required to copy the data between
accounts.
Storage Billing
It's a bit more straightforward than compute and cloud
services billing.
Data storage cost is calculated monthly based on the
average number of on-disk bytes for all data stored each
day in a Snowflake account.

When they say all data, they're referring to two areas.


The first is database table data, which comprises data
from each step in the data lifecycle.
The most current, up-to-date micro-partitions, older
versions of the micro partitions, enabling Time Travel, and
micro partitions Snowflake maintained to enable Fail-safe,
all of these contribute to cost.
If you have a 90-day retention period on all the tables you
create, you're going to have some very powerful data
recovery capabilities, but you could potentially also greatly
increase the amount you're being billed.
So as usual, when toggling the retention period,
it's good to weigh up the pros and cons.
The second area is the internal stage.
Anything we store in these stages will contribute to storage
costs. Unlike table data, we can decide how the files look in
stages.
We could upload a very large uncompressed file and forget
to delete it. This would contribute to our storage billing.
Now we know what we're billing, but how do we calculate
what we're storing if we can add or remove data at any
point during the billing window?
Snowflake will calculate every 24 hours the average
amount of data stored in these two areas.
At the end of a month, Snowflake will then take another
average across all the days in the month. That figure is
then billed based on a flat rate per terabyte.
The amount charged per terabyte can vary based on your
Region and whether you're on a Capacity or On-Demand
plan.

The amount, shown here in US dollars, is affected by the
cloud provider (AWS, Azure, or GCP), the region, and the
pricing plan (On-Demand or Capacity).
So if an average of 15 terabytes is stored during a month
on an account deployed into the AWS London Region, it
would cost $630 as of November 2021.
Let's look at a second example.
If we stored an average of 14 terabytes during a billing
month on an account deployed into AWS AP Mumbai
Region, it will cost $644.
In this case, we're being billed more money for less data,
purely for being in a different region.
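As a rough check on those figures, assuming the flat per-terabyte rates implied by them: 15 TB x $42 per TB per month = $630, while 14 TB x $46 per TB per month = $644, so the Mumbai region's higher per-terabyte rate outweighs the smaller volume.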
Storage Monitoring
Data storage usage can be monitored from the Classic and
Snowsight user interfaces.
Equivalent functionality can be achieved via the account
usage views and information schema views and table
functions.
Account Usage Views:
• DATABASE_STORAGE_USAGE_HISTORY
• STAGE_STORAGE_USAGE_HISTORY
• TABLE_STORAGE_METRICS
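For example, a sketch querying one of these views for a per-table breakdown:

-- Storage per table, including Time Travel and Fail-safe bytes.
SELECT TABLE_NAME, ACTIVE_BYTES, TIME_TRAVEL_BYTES, FAILSAFE_BYTES
FROM SNOWFLAKE.ACCOUNT_USAGE.TABLE_STORAGE_METRICS
ORDER BY ACTIVE_BYTES DESC;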

Data Sharing
Secure Data Sharing
Secure Data Sharing allows an account to provide read-only
access to selected database objects to other Snowflake
accounts without transferring data.
It's the ability for one Snowflake account to grant read-only
access to selected database objects like tables and views
to another Snowflake account without the need to transfer
any data between accounts.
The account sharing the objects, in this example account A,
is called the data provider, and the account accessing that
shared data, in this example account B, is called the data
consumer.
From account A's perspective, we might say we want
account B to have the ability to read MY_TABLE.
They simply manage which accounts have access to what
database objects and their job is done.
Account B would then have the ability to instantly read my
table from within their own account as if they'd created it.
However, they couldn't insert or update the data shared
with them.
The ability to share data instantly without the need to
transfer data between accounts is enabled by Snowflake's
Global Services layer, or essentially exposing some of our
own storage via the services layer to a different account
query.
A side effect of this is that the consumer account will only
have to pay for the compute resources required to execute
queries against the shared data, not any storage costs.
Secure data sharing opens up some interesting use cases.
For example, we could monetize our shared data by setting
up a pay for access model.
In fact, Snowflake have a portal in the Snowsight UI called
the Data Marketplace just for that kind of use case.
More on that shortly.
Sharing is enabled with an account level object called a
share.
A share object is created by the data provider and consists
of two configurations.
 Grants on database objects
 Consumer account definitions
The first are grants on objects we like to share, like in this
example a table. This decides what we want to share.
The second is the accounts we'd like to make the share
visible to, this decides who we want to share with.

Sharing is not available on the VPS edition of Snowflake.

Shortly we'll walk through a step-by-step guide on setting
up a share, and there are some limitations on which
database objects we can share.
An account can share the following database objects:
• Tables
• External tables
• Secure views
• Secure materialized views
• Secure UDFs
You'll notice the views and UDFs must be designated as
secure, as we've seen previously.
This is an extra layer of security and helps hide the
definition of the views and UDFs, as well as protect against
inadvertently displaying data from the source table.
This is a good use case for secure objects, as you're
potentially sharing with a completely different organization,
so you'd like to be stricter with your data protection.
In fact, Snowflake strongly recommends sharing secure
views over tables.
Once a data provider adds another account to a share, the
data consumers can see that share in their account.
From this point, a data consumer creates a database from
the share, inside which are the database objects the data
provider configured.
And just like that, the data consumer can then query
database objects from another account.
And lastly, secure data sharing is available on the Standard,
Enterprise and Business Critical editions of Snowflake.
VPS or Virtual Private Snowflake does not support secure
data sharing.
To make this process a bit more concrete,
let's now take a look at the commands we'd run.
On the left is the data provider account
listed is a database and some objects
we could potentially share.
Let's walk through the steps required
for a provider to start sharing.
The first step in sharing data is creating a share object.
It can be created through the UI or with SQL commands.
A share is an empty account level object when first created
is just a container waiting
for configuration to be applied to it.
To execute this command, a user must have a role
with the create share privilege.
By default this is the account admin system defined role,
but it can be delegated.
Next, we add which objects we'd like the consumer
accounts to be able to read.
These come in the form of grants applied
to the share object itself.
First, we add permission to use a certain database.
Only one database can be added per share.
We then go down the object hierarchy
and add permissions to one
or more schemas, which contain the
objects we'd like to share.
Next, we define grants for the database objects
themselves.
The third step is to add the account we'd like to make
this share visible to, this can be a virtually
unlimited number.
Access to a share,
or any objects in a share, can be revoked at any time
with the opposite commands: REVOKE for object grants and
ALTER SHARE ... REMOVE ACCOUNTS for accounts.
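A sketch of those provider-side commands, with placeholder object names:

-- 1. Create the empty share object.
CREATE SHARE MY_SHARE;

-- 2. Grant usage on the database and schema, then SELECT on the table.
GRANT USAGE ON DATABASE MY_DB TO SHARE MY_SHARE;
GRANT USAGE ON SCHEMA MY_DB.MY_SCHEMA TO SHARE MY_SHARE;
GRANT SELECT ON TABLE MY_DB.MY_SCHEMA.MY_TABLE TO SHARE MY_SHARE;

-- 3. Make the share visible to a consumer account.
ALTER SHARE MY_SHARE ADD ACCOUNTS = <consumer_account>;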
Now let's take a look at what we need
to do on the consumer side.
A user with the IMPORT SHARE privilege (by default, this is
the account admin, but it can be delegated) can list the
shares available to the account using the SHOW SHARES
command.
Once the name for the share has been found,
which is the account locator
and the share name, a database can be created from the
share
with the following command.
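A sketch of those consumer-side commands, again with placeholder names:

-- List shares visible to this account.
SHOW SHARES;

-- Create a database from the share (provider account locator + share name).
CREATE DATABASE SHARED_DB FROM SHARE <provider_account>.MY_SHARE;

-- Inspect what the share contains.
DESC SHARE <provider_account>.MY_SHARE;

-- Query the shared table like any other.
SELECT * FROM SHARED_DB.MY_SCHEMA.MY_TABLE;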
This database will contain all the objects
that were granted from the data provider to the share.
Only one database can be created from a share.
To see which objects you can query,
you can look in the UI under the newly created
database object or issue a DESCRIBE SHARE command.
The consumer account can now view
and query the table shared by the provider
as if it was any other table in their account.
This includes joining it with their existing data.
In the certification exam it is likely that some of the provider
or consumer behaviors will be asked about,
so let's call out some of these points.
Data Provider
Database objects added to a share become immediately
available to all consumers.
Any new object created within a shared database will not
automatically be added to a share.
You have to explicitly add the grant to a share to make it
accessible to consumers.
Future grants cannot be used in shares.
Only one database can be added per share.
Snowflake doesn't place any hard limits on the number of
shares a data provider can create or the number of
accounts a provider can add to a share.
And by default, a user must be assigned the account
admin role to create shares.
This responsibility can be delegated via the create share
privilege.
Access to a share or database objects in a share can be
revoked at any time.
If a data provider removes a grant from a share or drops a
shared object all data consumers access to that object will
be instantly removed.
It's currently not possible to directly share data between accounts
in different regions or on different cloud platforms.
If a provider would like to share with an account
outside of its region, for example, it would involve
replicating the database you'd like to share into an
account within the same region as the consumer
account and then creating a share from the
replicated database.
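A rough sketch of that replication-then-share workflow, with placeholder organization, account, and database names:
-- In the source account: allow the database to be replicated to another account
ALTER DATABASE sales_db ENABLE REPLICATION TO ACCOUNTS myorg.provider_eu;

-- In the target account (same region as the consumer): create and refresh the replica
CREATE DATABASE sales_db AS REPLICA OF myorg.provider_us.sales_db;
ALTER DATABASE sales_db REFRESH;

-- The replica can then be shared from the target account as shown earlier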
Data Consumer
Only one database can be created per share.
A data consumer cannot use the Time Travel feature on
shared database objects.
A data consumer cannot create a clone of a shared
database or any of its database objects.
To create a database from a share, a user must have a role
with the IMPORT SHARE privilege granted.
By default, this is granted to the account admin system
defined role.
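Delegating that privilege is a one-line grant; the role name here is just a placeholder:
-- Run as ACCOUNTADMIN in the consumer account
GRANT IMPORT SHARE ON ACCOUNT TO ROLE analytics_admin;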
Data consumers cannot create objects inside the shared
database.
Data consumers cannot reshare shared database objects.
Reader Account
A reader account provides the ability for non-Snowflake
customers to gain access to a provider's data.
Let's now consider a scenario in which the organization a
data provider would like to share data with doesn't have a
Snowflake account.
In this case, a data provider can provision something
called a reader account, also known as
a managed account.
This type of account is like a limited version of a standard
account whose sole purpose is to provide access to
database objects from the data provider to a non-Snowflake
user.
They wouldn't have to enter into an agreement with Snowflake.
They're simply provided an account URL, username and
password and can query the shared data as if they did
have a full account.
Reader accounts are quite limited in what they can do.
They can't insert new data or copy data into the reader
account.
They exist purely to query and view database objects
created from a share, and that share can only come from
the provider account which created the reader account.
The provider account assumes complete responsibility for
the reader accounts it creates, and all the compute
resources used to query the shared database objects are
billed back to it.
There's currently a soft limit of 20 reader accounts per
provider account.
However, this cap can be increased by contacting
Snowflake support.
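As a sketch, a provider can provision a reader account with a CREATE MANAGED ACCOUNT statement; the names and password below are placeholders:
-- Create the reader (managed) account; Snowflake returns its URL and locator
CREATE MANAGED ACCOUNT partner_reader
    ADMIN_NAME = 'partner_admin',
    ADMIN_PASSWORD = 'ChangeMe_Str0ng!',
    TYPE = READER;

-- The reader account's locator (see SHOW MANAGED ACCOUNTS) is then added
-- to a share just like any other consumer account
ALTER SHARE sales_share ADD ACCOUNTS = <reader_account_locator>;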

Data Marketplace
This is a public area only accessible in Snowsight in which
you can browse shares and import them into your account
as you would a direct share.
On the data marketplace, you can either be a data provider
or a data consumer.
As a provider, you can create two different types of listings:
standard or personalized.
A standard listing is when you as a provider grant access to
one of your raw datasets to any account that clicks Get on
the marketplace. It's an option which allows for out-of-the-box
access to anyone that requests it.
A personalized listing, on the other hand, allows for a bit
more control around the listing of a share. The data is
requested via a form in the UI, allowing customers to
request specific datasets, and the data provider processes
the requests in their own time.
This type of listing allows a data provider to charge for data
access.
Data Exchange
What if you'd like to share your data, but only with a
specific group of accounts?
A data exchange effectively allows you to create a private
version of a data marketplace on the Snowsight UI, and
invite other accounts to provide and consume data within
it.
That could be other departments in a company, suppliers,
any trusted party that wants to produce or consume data.
We can set up a data exchange by contacting Snowflake
Support and providing a name and description of the
exchange.
The data exchange feature is currently not available for
VPS editions of Snowflake as it uses secure data sharing
under the hood.
The individual Snowflake account that hosts the data
exchange becomes the data exchange admin account. Any
user with the account admin role can manage members'
access to the exchange and designate members as
providers and consumers.
Providers can list either personalized or standard shares,
just like in the data marketplace.
The data exchange admins also manage listing requests
from providers, having the ability to approve or reject
listings.
