What makes Snowflake's live data exchange so damn exciting?

Here is another episode of me explaining a revolutionary Snowflake feature. This one is pretty special because I initially did not grasp the full scope of what it really means to a business. After almost a year, this is the one feature that changes our customer conversations from a technology, speeds & feeds discussion into Snowflake becoming an integral part of how they want to run their business in the long term, with us as a business partner rather than just a technology partner.

So you ask: what is data exchange, how does it work, and why did it become such a business driver for most of our largest customers?

Let me start by drawing a picture of what a typical manufacturer may be dealing with in terms of exchanging data with 3rd parties.

[Image: data flows between a manufacturer and its 3rd-party partners]

Data may be shared like this:

  • The manufacturer may share their current demand data with their suppliers & distributors so they are always ready to provide enough raw material and stock enough goods to handle upcoming demand from the manufacturer.
  • On the other hand, suppliers & distributors may share their data with the manufacturer so the manufacturer is aware of what kind of raw material is available from the supplier & the stock levels of distributors across the country so they can plan what products to produce & how much.
  • If data is the new oil, mobile delivery apps are the gas stations for any CPG company. These guys have a gold mine of detailed customer data. They know every single customer in a very personal way: what they buy, when they buy, where they buy, where they go, their hobbies, GPS locations, down to their mobile battery levels throughout the day. So they not only have a very detailed purchasing history of their customers across all of the retailers they work with, they also have a treasure trove of personalized info on every customer from the app & IoT data their apps collect. Easy access to this kind of info would enable any CPG company to easily create a highly targeted, personalized shopping experience for their own customers.
  • Just like the suppliers & distributors, these mobile delivery apps would love to know current product availability & pricing info from the CPG company so they can provide much more accurate and timely information to their own customers through their apps. Knowing certain products are back-ordered, unavailable, or about to be discounted, using real-time CPG data, is hugely valuable info for app developers.
  • How about online, national & local brick & mortar retailers? These guys usually don't have much info on individual customers unless they track them with store cards, but what they do have is Point of Sale data from their cash registers. They have the pulse of customers shopping in a store and can provide a detailed list of products, the amount spent, store location, time, and payment method. Even if it lacks personal customer info, retailer POS data is a major source of retail analytics for any CPG company across many different use cases.
  • Again, just like the delivery apps, these retail stores would love timely information on product availability, stock levels, pricing & discounts from the CPG companies so they can place the right products in the right stores, at the right price, and not risk running out of stock.
  • What about value-add partners of the CPG company? They may receive industry-specific data from 3rd-party data providers to do analysis around new products, competitors, consumer behavior or category management. They may also send their purchasing history & customer details to highly specialized partners who run complex machine learning algorithms and return scoring, basket analysis or forecasting info to the CPG company. Essentially, the CPG company gives them its data and receives it back enriched with additional information from those ML processes.
  • I am sure there are more, but I will stop here because by now you should be fully versed in how much data can potentially be exchanged back & forth between so many parties in a given business. This was a CPG example, but swap CPG for another vertical like pharma, life sciences, or finance & wealth management, or something more generic like marketing, and you will see a very similar picture, with only the names of the data providers & consumers changing.

So, if a business tells you that they don't do any form of data exchange with anyone at all, you need to speak to someone else from that business. Every business deals with some form of 3rd-party data exchange, whether it is regulatory, HR, payment processing, supplier, customer, vendor, or marketing data.

Now that you see how important external data is for businesses, let's look at how the exchange is being done today. I'll keep it simple and use the example of a somewhat modern business that has fully moved to the cloud and has no on-prem sources. (On-prem sources would only make this more challenging.)

Again, let's start with the picture for traditional data exchange among just 3 parties (CPG, Supplier & Delivery Service App), and I'll explain.

[Image: traditional data exchange among 3 parties]

So if you wanted to set up bi-directional data sharing across only 3 business parties, where all three parties are fully on the cloud, you are looking at a bare minimum of 16 independent processes.

And each one of these 16 processes would most likely be composed of multiple steps handled by multiple teams, run on multiple different schedules, using multiple user credentials, and creating multiple copies of data in multiple formats.

If we zoom in on the ETL step of just one provider sharing data with only one of their partners, it would probably look like the picture below. You would likely see a series of processes running one after another, where each sub-process takes time to complete, consumes some amount of limited server resources, and adds another failure point that the data team has to monitor & fix. I would call this a very leaky data pipeline. Because it is such a time & resource drain to repeat for every partner, most organizations simply don't have the resources to share their data with all of their partners and end up doing it only with the biggest ones, if they do it at all. (If the business is successful and a leader in their market, trust me, they are definitely doing it at some scale.)

[Image: multi-step ETL pipeline for sharing data with one partner]

If you think this is complex, how about figuring out when to trigger a download & ingest process if you are the consumer, since you don't get a signal that the provider actually uploaded a new file? Or what if the provider uploads a file but then replaces it 5 minutes after you processed it because of some bad data in the initial file? How would you even know, who would fix it, and how quickly could it be fixed?
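To make the pain concrete, here is roughly what just the consumer-side ingest step of one such pipeline might look like if the consumer happened to be loading into a warehouse the traditional way. This is a sketch, not a recommendation; the bucket, credentials & table names are all hypothetical:

```sql
-- One of the many sub-processes above: the consumer points a stage at the
-- partner's drop bucket and loads files on a schedule, hoping the provider
-- actually uploaded something new. All names here are hypothetical.
CREATE STAGE partner_stage
  URL = 's3://partner-drop-bucket/daily/'
  CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');

-- Runs on a timer, not on a signal; a file the provider replaces 5 minutes
-- after this runs is silently missed until the next scheduled load.
COPY INTO raw.partner_demand
  FROM @partner_stage
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
```

And this is only one step for one partner; the export, upload & transform steps on the other side all need their own schedules, credentials & monitoring.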

This gets super complex, super fast, with many, many failure points along with time- & compute-sapping processes across the board.

As you can see, even if both parties are on the same cloud, they are still data silos: the data sits in a different data warehouse server and on a hard drive in a different VPC network that the receiving side has no way of accessing from their own VPC. That is why, even though there is plenty of data to go around to fuel businesses, the amount of data exchanged is still limited and the frequency is nowhere near real-time. In reality, most organizations are lucky if they can update their sources once a day, with only a limited number of partners.

I believe the technical term for this entire process is most commonly referred to as "This sucks!!!"

Now on to Snowflake's revolutionary direct data exchange.

[Image: Snowflake direct data exchange]

Snowflake Data Exchange was born to get rid of the complexity, delays & difficulty of exchanging data between multiple parties. It was designed to use a single copy of data, with near-zero maintenance and without any data movement, to exchange massive amounts of data between two or more parties.

You are probably thinking... Did he say data exchange using only a single copy of data and without any data movement? How on earth is that possible?

Well... it is possible if everything is on the cloud. Technically there are no different clouds if you are using the same cloud provider. If you are not, then Snowflake can use some of its magic to make it look like you are all in one big cloud.

Essentially, any party on the same cloud provider & region is within the same network and shares the same pool of resources. There is blob storage, a storage service available to anyone, and there are EC2 servers used for various things, whether as a DB server, a node in a Hadoop cluster, or a reserved machine with some apps installed. When you turn on a machine from AWS, Azure or GCP, it is not like they are fetching one from a closet and hooking it up to your network. Most of those servers sit in the same place without any real physical separation; it is the software, the network security protocols, and the use of local storage on the servers that separate an EC2 server Company A is running from one Company B is running.

An overly simplified view would look something like this: various organizations accessing their own compute & storage resources on a cloud provider, where resources are walled off from each other by the network security built into these providers.

[Image: organizations' compute & storage walled off from each other within a cloud provider]

This brings us to what Snowflake Data Exchange is & how it works. What if all of the parties were running within a single cloud VPC network, and security was controlled by a single provider to make sure one account can't see the resources of another? And what if all the storage used a common blob storage layer instead of local drives on these servers? Essentially, a shared SaaS environment where compute resources are dedicated to clients but permanent storage uses only a common blob layer.

This is essentially what Snowflake is. Snowflake manages the servers that its clients use, all the data is stored on the blob layer, and nothing gets permanently stored on the actual servers. Servers are just temporary compute resources that execute jobs when needed and do not store any data permanently. So a simplified Snowflake looks something like this.

A few things to note:

  1. All resources are located within a single virtual network that is fully managed by Snowflake with no access from outside (customers or otherwise).
  2. Servers are shown without local hard drives in the picture. This is for presentation purposes only; they actually do have SSD drives, but those are only used for storing temporary cache data and are wiped clean as soon as the server pauses. Remember, all permanent data gets stored in the blob layer. No permanent data lives on compute servers.
  3. The top Global Services (GS) layer is the brains of Snowflake and is the only layer that customers can connect to. This layer knows & tracks exactly which data & compute resources belong to each account and automatically restricts usage of those resources to the accounts they belong to. Furthermore, accounts never have direct access to either the servers in the compute layer or the actual data files within the storage layer. All access, security, authentication & authorization is controlled by the GS layer, and all communication uses SQL commands. Results, configuration & system statuses are always returned as SQL command results via Snowflake drivers. Essentially, all data & servers live within the same network controlled by Snowflake, and security is enforced by the GS layer without creating separate networks & physical silos between accounts.
[Image: simplified Snowflake architecture with GS, compute & storage layers]

So this was a picture of a typical Snowflake scenario where two parties use Snowflake through two different Snowflake accounts. Since security is controlled by the GS layer and not baked into actual VPNs, firewalls, networks & switches, Snowflake can do some pretty nifty stuff when two parties need to exchange data, whether they are using the same cloud provider (AWS, Azure, GCP) or their accounts are on different providers (cross-cloud). After all, Snowflake is cloud-agnostic, which means it looks, works & smells identical regardless of which cloud provider you are using.

Let's see what happens when Company A (Red) wants access to a specific dataset from the Company B (Green) account. Remember, on any other platform you would have to go through the pain of setting up a ton of complex processes: replicate, transform, export & upload for providers, then download, ingest & transform for consumers. Not with Snowflake!

This is how it works using Snowflake's live direct data exchange. First, the provider (Green Company B) allows access to one of their tables & adds the account ID of Red Company A to a list. At that moment, Red Company A gains the instant ability to use their own compute nodes to read & query those specific datasets from Company B's data stores. Just to be clear: we are not moving data files, we are simply reconfiguring the security within the GS layer to allow access to specific datasets from one account to another, so they can be queried using the consumer's own compute resources.
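On the provider side, that whole setup is a handful of SQL statements. A minimal sketch, assuming hypothetical database, table, share & account names:

```sql
-- Provider (Green Company B) side: create a share, grant it access to one
-- table, then add the consumer's account ID to the share's access list.
-- demand_share, sales_db, daily_demand & companya_acct are hypothetical names.
CREATE SHARE demand_share;
GRANT USAGE ON DATABASE sales_db TO SHARE demand_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE demand_share;
GRANT SELECT ON TABLE sales_db.public.daily_demand TO SHARE demand_share;

-- The moment this runs, the listed account can query the table live.
ALTER SHARE demand_share ADD ACCOUNTS = companya_acct;
```

No files are copied anywhere; the grants above are exactly the "reconfiguring the security within the GS layer" described in the paragraph.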

[Image: Red Company A querying Green Company B's shared dataset with its own compute]


Doesn't look very impressive, right? You may say, so what? I'll tell you what...

  1. Red Company can query those specific tables from Green Company using their own virtual warehouses (servers), at any size they want. They can use XS or XXL or bigger depending on their performance needs.
  2. The data Red Company has access to is live data. It is the same data that Green sees. Any time Green updates their data, those changes are live and immediate. No delays.
  3. Did I mention the Green company keeps the data? This means you don't need to move or replicate data to share it between Snowflake accounts. It also means consumer accounts (Red Company) do not pay a single penny for storage to access these datasets, whether the data is 1,000, 10 billion, or 100 billion rows. The Green company owns the data and pays for its storage, not Red. Red only pays for compute while they query the data, just like they would with their own internal data.
  4. Green does not have to run any additional processes (import, export, ETL, upload) to share their data. They just update it as they normally would for their own users, and Red sees the changes immediately. Major savings in time & money from not having to run & maintain those extra steps for each exchange partner.
  5. Guess what? Red also does absolutely nothing to read or refresh this data (major $$$ & time savings). The data is always fresh because Red is actually reading the other account's data store. For Company Red, these shared datasets look & behave exactly like any of their internal databases & tables. They can query them in isolation or easily join them to their own internal datasets using standard query joins. No different from any other query.
[Image: shared datasets queried alongside internal data]

  6. It is much more secure. Data is encrypted at all times, and the Green company can cut access at any time if they no longer wish to share, without a bunch of copies floating around after the fact.
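Cutting access is just as lightweight as granting it. A sketch, reusing the same hypothetical share & account names as before:

```sql
-- Provider side: pull the consumer off the share, or stop sharing the
-- table entirely. The consumer's access ends immediately, and there are
-- no exported copies left behind. Names are hypothetical.
ALTER SHARE demand_share REMOVE ACCOUNTS = companya_acct;

REVOKE SELECT ON TABLE sales_db.public.daily_demand FROM SHARE demand_share;
```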

So you don't need ETL, you don't need import & export, you don't have to deal with file format or data type issues, you don't have to pay for storage, and best of all, the data is always live and never has to be refreshed. Just hook up your analytical processes to these shared tables and you are ready to go. Set it & forget it. No more maintenance, failed transfers, or sync issues.
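On the consumer side, "hooking up" amounts to mounting the share as a read-only database and querying it like any other. A minimal sketch with hypothetical account, share, database & column names:

```sql
-- Consumer (Red Company A) side: mount the provider's share once as a
-- read-only database. greenco_acct, demand_share & the table/column names
-- below are hypothetical.
CREATE DATABASE shared_demand FROM SHARE greenco_acct.demand_share;

-- From here on, shared tables behave like internal ones and can be joined
-- to the consumer's own data with a standard query, using Red's own warehouse.
SELECT i.sku, i.on_hand, d.forecast_qty
FROM my_db.public.inventory AS i
JOIN shared_demand.public.daily_demand AS d
  ON d.sku = i.sku;
```

Note there is no refresh step anywhere: every query reads the provider's live data.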

Plus, because it is so quick, easy, cheap, and trouble-free, companies can now exchange data in a much broader scope, with many more companies at the same time, sharing many more datasets with each one, and all in real-time.

Now the nirvana state I originally described, of a CPG company and all their partners sharing mission-critical data with each other in real-time so they all operate better and are more profitable, is finally a reality.

[Image: CPG company and all its partners sharing data in real-time via Snowflake]


This was the technology bit. There is also the networking bit. As more and more companies flock to Snowflake, we become more & more of an integral business partner to our customers, helping connect them to other Snowflake customers who may end up being their partners, vendors, customers & suppliers. It is a pure networking effect: connecting clinical data providers or healthcare providers to life science companies, connecting retailers, suppliers & marketing companies to manufacturers, or connecting service providers to businesses.

This is super exciting & making a huge impact on how today's organizations accelerate & widen their data sharing capabilities, making much more informed decisions based not just on their own internal data but also on external data from their business partners and other public & private sources, all live & timely.

So the next time you hear someone talking about FTP or import & export to get or send data that runs their business, think Snowflake Data Exchange. You won't regret it.

As always, if you think this article is helpful, feel free to hit the like button and share it within your network to spread the word.


More articles by Nick Akincilar
