The 45c, Billion Row, Data Warehouse?
On 3rd December 2020 Microsoft announced General Availability of the Microsoft Synapse Analytics platform, Synapse is the "data lakehouse" offering that combines a traditional data warehouse capability with data lake, Apache Spark and a SQL serverless data processing service.
Not gonna lie, I've been a little hesitant to buy in on the whole "Azure Synapse" thing. When it was first announced we began building POC solutions on the platform and found the traditional data warehouse costs to be way higher than some of the other options available for similar functionality, the entry point for dedicated SQL on Synapse is around $1,600AUD per month for 100 DWUs (Dec 2020) so not exactly chicken feed to start out.
We found our customers (who are primarily coming from a Power BI viewpoint) struggled conceptually with the benefits of the service, and most others who "got it" already had Snowflake, if you are looking for a data warehouse solution right now, you may be experiencing the same when you look at Azure Synapse Analytics for your organisation.
In June 2020, Patrick LeBlanc from Guy in a Cube published to YouTube an impressive demo by Josh Caplan of Azure Synapse SQL Serverless processing a huge dataset with minimal ETL.
I have to admit, watching this was one of the first times I started to see the true cost saving possibilities of Synapse clearly... a solution where we could gain simple SQL based querying over sets of files stored in a data lake with minimal ETL Steps.
The fewer engineering steps means MUCH lower cost of implementation and operation, a simple flow means better throughput, less failure and more happy users... because #dataops
The thing that really got me though was when I looked at the cost... the solution in the video is 1.6B (that's a B!) rows or 50GB of data stored in Gen2 storage then processed by Azure Synapse in 30 secs... not sure what the exact total processing value was on that query but say it's 50GB, Synapse SQL Serverless is priced at $8.95 per TB of data processed... 9 bucks a terabyte!
When I pulled up a calculator and looked at what that demo would have cost... it's like 45c AUD for that one query on serverless... 45 Australian Cents to process 1.6 Billion rows... let that sink in.
So it seems to me that dedicated SQL is not mandatory on Synapse, and you could happily run SQL serverless as a low cost data "lakehouse" solution until it comes a time that, for performance reasons, you need a dedicated SQL server to materialise a large fact table/view or some other such scenario.
Worth looking into? In my opinion... yes, and I have already done some little experiments... more of that coming soon.
Let me know what you think! Does anyone know the true data processing cost of that query Josh Caplan runs? hit me up if you do!
I make music from the data🎵Data Mozart 🎵| MVP Data Platform | O'Reilly Author | Pluralsight Author | MCT
4yI've tested Synapse Serverless SQL pool in different scenarios for Power BI workloads and put everything in the blog post: https://data-mozart.com/power-bi-synapse-part-4-serverless-sql-how-much-will-it-cost-me/ My 2 cents: in general, be careful when using the Serverless SQL pool with Power BI! As Mimoune and Johannes already mentioned, lack of result set cache or any kind of materialization can potentially make this very expensive...I mean, yes, it looks awesome that you pay few cents for one query over 1.6B of rows...But, imagine running this query as a DQ from Power BI, with 20 users in parallel, 50 times a day...Every single time, you will pay the full price of the query (even more, as Mimoune said, you are paying for the amount of data processed, which includes data scanning and data movement). In my opinion, a Serverless SQL pool is a very cool feature and it definitely has its place for specific data workloads, like running some ad-hoc queries over huge and/or semi-structured data, without the need to transform and store the data in a "traditional" way. However, if your workload requires frequent querying of this data, I would definitely choose to materialize the results in a Dedicated SQL pool.
CDO & Co-Founder of Analytic Endeavors | Microsoft MVP | Owner of Sunny BI - Power BI Training
4yI’m excited for this. Going to open some huge possibilities!
for 50GB you will probably pay 100 GB, 50 GB to read the file and 50 GB to transfer the data to the end point, assuming you read everything using select *, other Serverless DW you pay only 50 GB, but that's not the issue really, Synapse does not support cache by default, if you run the same Query and the source data does not change, you pay again the full price, that's simply not good enough