Cost optimization best practices for BigQuery
TecHub
Data Analytics
Google BigQuery
Google Cloud Platform’s enterprise data warehouse for analytics
● Fully managed and serverless for maximum agility and scale (unique)
● Real-time insights from streaming data (unique)
● Exabyte-scale data warehousing
● Built-in ML and geospatial for predictive insights (unique)
● Encrypted, durable, secure, and highly available
● High-speed, in-memory BI Engine for faster reporting and analysis (unique)
BigQuery | Architectural Advantage
Decoupled storage and compute for maximum flexibility
SQL:2011 compliant
● Storage: replicated, distributed (99.9999999999% durability)
● Compute: BigQuery high-availability cluster (Dremel)
● Distributed memory shuffle tier
● Petabit network
● Ingest: streaming ingest and free bulk loading
● Access: REST API, web UI, CLI, and client libraries in 7 languages
BigQuery | Managed storage
Durable and persistent storage with automatic backup
● Tables are stored in an optimized columnar format
● Each table is compressed and encrypted on disk
● Storage is durable and each table is replicated across datacenters
● You can time travel on data within 7 days
[Figure: Tables 1 to 3 replicated across Zone A, Zone B, and Zone C within a region]
BigQuery | Large stateless compute
Modern architecture for scalability and performance
● Superlinear horizontal scalability
● Immune to node/rack downtime
● Seamless maintenance
● Pipelined execution, dynamic work
repartitioning, speculative execution
Cost optimization techniques
Query Processing
● On-demand pricing
○ Query the data you need
○ Query cost controls
○ Partition and cluster your tables (includes zero-maintenance auto-reclustering)
● Flat-rate pricing
Storage
● Data retention
● Long-term storage
● Avoid duplicate storage: use the federated data access model
● Streaming inserts
● Backup and recovery
01 Optimize Querying
Query the data you need
● Avoid SELECT * (use the preview option to explore your data; it's free!)
● Denormalize your data (nested fields)
● Filter your query as early and as often as possible to improve performance and reduce cost.
● Check how much your query is going to be charged
● Avoid SQL anti-patterns
*Bear in mind: BigQuery is a data warehouse.
Optimize querying:
1 Query required data
2 Enforce cost control
3 Partition and cluster
4 Flat-rate pricing
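Why avoiding SELECT * saves money can be illustrated with a small sketch. Because storage is columnar, an on-demand query is charged for the columns it references, not for the rows it returns. The table, its column sizes, and the price per TiB below are all hypothetical values chosen for illustration; check the current pricing page for real rates.

```python
# Hypothetical per-column sizes for an illustrative table, in bytes.
COLUMN_BYTES = {
    "order_id": 8 * 10**9,
    "customer_id": 8 * 10**9,
    "order_ts": 8 * 10**9,
    "payload_json": 400 * 10**9,
}

# Assumed on-demand price per TiB scanned (not an official figure).
ASSUMED_USD_PER_TIB = 5.0
TIB = 2**40


def bytes_scanned(selected_columns):
    """Sum the sizes of the referenced columns only."""
    return sum(COLUMN_BYTES[name] for name in selected_columns)


def estimated_cost_usd(selected_columns, usd_per_tib=ASSUMED_USD_PER_TIB):
    """Convert the scanned bytes into an estimated on-demand charge."""
    return bytes_scanned(selected_columns) / TIB * usd_per_tib


select_star = estimated_cost_usd(COLUMN_BYTES)         # every column
narrow = estimated_cost_usd(["order_id", "order_ts"])  # only what is needed
print(f"SELECT *: ${select_star:.2f}, two columns: ${narrow:.2f}")
```

In this toy table a single wide column dominates the bill, which is the typical reason a narrow column list is so much cheaper than SELECT *.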
Avoid human errors
● Enforce MAX limits on bytes processed at the query, user, and project level.
● Cancelling a query may still incur cost
● Use caching intelligently
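The three levels of limits can be sketched as a pre-flight check. This is a minimal illustration in plain Python, not BigQuery's own enforcement: the `Limits` class, the function name, and the byte figures are hypothetical. In practice the estimate would come from a dry run, and the per-query cap corresponds to the `maximum_bytes_billed` setting of a query job.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Hypothetical byte caps at each level."""
    per_query: int
    per_user_per_day: int
    per_project_per_day: int


def violated_limits(estimated_bytes, user_bytes_today, project_bytes_today, limits):
    """Return the levels whose cap this query would exceed."""
    violations = []
    if estimated_bytes > limits.per_query:
        violations.append("query")
    if user_bytes_today + estimated_bytes > limits.per_user_per_day:
        violations.append("user")
    if project_bytes_today + estimated_bytes > limits.per_project_per_day:
        violations.append("project")
    return violations


GIB = 2**30
limits = Limits(per_query=100 * GIB,
                per_user_per_day=500 * GIB,
                per_project_per_day=5000 * GIB)

# A 50 GiB query by a user who has already scanned 480 GiB today.
print(violated_limits(50 * GIB, 480 * GIB, 1000 * GIB, limits))
```

Running the check before submitting a query means an accidental full-table scan is rejected rather than billed.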
Partition & cluster your data
● Partition your table to reduce the data scanned
○ Enable the required partition filter
● Cluster to further prune your data blocks
[Figure: Partitioning vs. Clustering]
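The effect of partition pruning can be simulated in a few lines. The sketch below models a date-partitioned table as a mapping from day to partition size and counts only the partitions a date filter touches. The function names and sizes are hypothetical, and the `require_filter` flag only imitates the idea of a required partition filter.

```python
from datetime import date, timedelta


def build_partitions(start, days, bytes_per_day):
    """Model a date-partitioned table as {partition_day: size_in_bytes}."""
    return {start + timedelta(days=i): bytes_per_day for i in range(days)}


def bytes_scanned(partitions, first_day=None, last_day=None, require_filter=False):
    """Count bytes in the partitions that a date-range filter would read."""
    if first_day is None and last_day is None:
        if require_filter:
            raise ValueError("a partition filter is required for this table")
        return sum(partitions.values())  # full scan
    lo = first_day or min(partitions)
    hi = last_day or max(partitions)
    return sum(size for day, size in partitions.items() if lo <= day <= hi)


table = build_partitions(date(2024, 1, 1), days=365, bytes_per_day=10**9)
full = bytes_scanned(table)
one_week = bytes_scanned(table, date(2024, 3, 1), date(2024, 3, 7))
print(f"full scan: {full} bytes, one week: {one_week} bytes")
```

Clustering applies the same idea at a finer grain: blocks inside a partition are skipped when their value ranges cannot match the filter.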
Flat-rate & Reservations
● Think about flat-rate once your BigQuery processing cost exceeds $10K
○ Familiarize yourself with BigQuery costs using our pricing calculator
● How many slots should you buy? Visualize slot utilization in Stackdriver
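The flat-rate decision reduces to a break-even calculation. The sketch below assumes an on-demand price per TiB and a monthly flat-rate commitment; both numbers are placeholders for illustration (the 10,000 figure simply mirrors the threshold on the slide), so substitute the values from the pricing calculator.

```python
TIB = 2**40


def on_demand_monthly_usd(bytes_processed, usd_per_tib):
    """Monthly on-demand charge for the given scan volume."""
    return bytes_processed / TIB * usd_per_tib


def break_even_tib(flat_rate_monthly_usd, usd_per_tib):
    """Monthly scan volume (TiB) at which both pricing models cost the same."""
    return flat_rate_monthly_usd / usd_per_tib


def cheaper_option(bytes_processed, usd_per_tib, flat_rate_monthly_usd):
    """Name the cheaper pricing model for a month's scan volume."""
    on_demand = on_demand_monthly_usd(bytes_processed, usd_per_tib)
    return "flat-rate" if on_demand > flat_rate_monthly_usd else "on-demand"


ASSUMED_USD_PER_TIB = 5.0          # assumption, not an official price
ASSUMED_FLAT_RATE_USD = 10_000.0   # assumption, mirrors the slide's threshold

print(break_even_tib(ASSUMED_FLAT_RATE_USD, ASSUMED_USD_PER_TIB), "TiB per month")
print(cheaper_option(3000 * TIB, ASSUMED_USD_PER_TIB, ASSUMED_FLAT_RATE_USD))
```

The calculation covers cost only; slot utilization still determines whether the purchased capacity is enough for the workload.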
02 Optimizing Storage
How long are you keeping your data? (TTL)
[Figure: expiration settings at dataset level and table level]
*As with dataset-level and table-level expiration, you can also set up expiration at the partition level. Check out our public documentation for default behaviors.
Optimizing storage:
1 Data retention
2 Long-term storage
3 Avoid duplicate storage
4 Streaming inserts
5 Backup and recovery
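The three expiration levels can be expressed as DDL statements. The helpers below are a hedged sketch that only builds the SQL text; the dataset and table names are hypothetical, and the option names (`default_table_expiration_days`, `expiration_timestamp`, `partition_expiration_days`) should be checked against the public documentation before use.

```python
def _check_days(days):
    """Reject non-positive retention periods before building any SQL."""
    if days <= 0:
        raise ValueError("expiration must be a positive number of days")


def dataset_default_expiration_ddl(dataset, days):
    """Default TTL applied to tables created in the dataset afterwards."""
    _check_days(days)
    return (f"ALTER SCHEMA `{dataset}` "
            f"SET OPTIONS (default_table_expiration_days = {days})")


def table_expiration_ddl(table, days):
    """TTL for one table, counted from the moment the statement runs."""
    _check_days(days)
    return (f"ALTER TABLE `{table}` SET OPTIONS (expiration_timestamp = "
            f"TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL {days} DAY))")


def partition_expiration_ddl(table, days):
    """TTL for each partition of a partitioned table."""
    _check_days(days)
    return (f"ALTER TABLE `{table}` "
            f"SET OPTIONS (partition_expiration_days = {days})")


# Hypothetical names, for illustration only.
print(dataset_default_expiration_ddl("staging", 30))
print(table_expiration_ddl("staging.tmp_export", 7))
print(partition_expiration_ddl("analytics.events", 90))
```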
Be wary of how you edit your data
● If your table or partition has not been edited for 90 days, the storage price drops by 50% (long-term storage)
○ Watch out for any action that edits your table: loading into BQ, DML operations, streaming inserts, ...
● For long-term archives accessed at most once a year, leverage the Coldline class in GCS.
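The long-term storage rule can be worked through numerically. The sketch below applies the 90-day threshold and 50% discount stated above to a list of tables; the table sizes, dates, and the active-storage price per GiB are hypothetical values for illustration.

```python
from datetime import date

LONG_TERM_AFTER_DAYS = 90   # threshold stated on the slide
LONG_TERM_DISCOUNT = 0.5    # 50% price drop stated on the slide


def monthly_storage_usd(tables, today, active_usd_per_gib):
    """Sum monthly storage cost for (size_gib, last_modified) pairs."""
    total = 0.0
    for size_gib, last_modified in tables:
        idle_days = (today - last_modified).days
        rate = active_usd_per_gib
        if idle_days >= LONG_TERM_AFTER_DAYS:
            rate *= 1 - LONG_TERM_DISCOUNT  # untouched long enough: discounted
        total += size_gib * rate
    return total


ASSUMED_ACTIVE_USD_PER_GIB = 0.02  # assumption, not an official price
today = date(2024, 6, 1)

untouched = [(1000, date(2024, 1, 1)), (1000, date(2024, 5, 1))]
# The same tables after a DML statement touched the first one yesterday.
edited = [(1000, date(2024, 5, 31)), (1000, date(2024, 5, 1))]

print(monthly_storage_usd(untouched, today, ASSUMED_ACTIVE_USD_PER_GIB))
print(monthly_storage_usd(edited, today, ASSUMED_ACTIVE_USD_PER_GIB))
```

The second call shows the cost of an unnecessary edit: a single write resets the idle clock and the table returns to the active price.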
Avoid duplicate copies of data
Leverage BigQuery’s federated data access model for your data stored on:
● Google Drive
● Cloud Bigtable
● Cloud Storage
● Cloud SQL
Use cases:
● Frequently changing small side inputs
● Ingestion with cleanup that needs to be archived
● Querying of large archives
Gotcha: querying federated data is less performant.
Loading the data
● Batch upload is free. Use streaming inserts only if the data is consumed by downstream processes in real time.
Understanding DR and backup processes
● By default, your 7-day history is tracked by BigQuery at the service level.
○ You can find examples in our public documentation for point-in-time restore.
● If you delete your table, you cannot restore it after 2 days.
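A point-in-time read within the 7-day history can be sketched as a query builder. The helper below only constructs SQL text using the `FOR SYSTEM_TIME AS OF` clause and rejects ages outside the 7-day window; the function name and the table name are hypothetical, and the public documentation remains the reference for the full restore procedure.

```python
from datetime import timedelta

TIME_TRAVEL_WINDOW = timedelta(days=7)  # history window stated on the slide


def time_travel_query(table, age):
    """Build a query that reads `table` as it was `age` ago."""
    if age < timedelta(0) or age > TIME_TRAVEL_WINDOW:
        raise ValueError("age must be between 0 and 7 days")
    seconds = int(age.total_seconds())
    # A restore needs every column, so SELECT * is intentional here.
    return (f"SELECT * FROM `{table}` "
            f"FOR SYSTEM_TIME AS OF "
            f"TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {seconds} SECOND)")


# Hypothetical table name, for illustration only.
print(time_travel_query("analytics.orders", timedelta(hours=1)))
```

Writing the result of such a query to a new table is one way to recover from an accidental update or delete.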
BigQuery Materialized Views
● Zero maintenance: Automatically synchronizes data refreshes with data changes in base tables. No user inputs required.
● Always fresh: Always consistent with the base table. There will never be a situation when querying an MV results in stale data.
● Self-tuning: BigQuery will rewrite the query to use the MV for better performance and/or efficiency when querying the base table directly.
Flexibility and choice across the BI process
Introducing BigQuery BI Engine
Sub-second queries
Simplified architecture
Smart tuning
Visualize cost
● Create your own dashboard (example)
● Analyze spending trend & query trend over time
● Breakdown cost per project and per user
● Be proactive about tracking your expensive
queries and optimize them
● Resources: BQ audit logs, queries repository (GitHub), blog post
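The per-project and per-user breakdown behind such a dashboard can be sketched with a small aggregation. The record layout below (`project`, `user`, `billed_bytes`) is a hypothetical simplification of what audit logs provide, and the price per TiB is an assumed value.

```python
from collections import defaultdict

TIB = 2**40


def cost_breakdown(records, usd_per_tib):
    """Total estimated cost per (project, user) pair."""
    totals = defaultdict(float)
    for rec in records:
        key = (rec["project"], rec["user"])
        totals[key] += rec["billed_bytes"] / TIB * usd_per_tib
    return dict(totals)


def most_expensive(records, n=3):
    """The n queries that billed the most bytes, largest first."""
    return sorted(records, key=lambda r: r["billed_bytes"], reverse=True)[:n]


# Hypothetical audit-log extracts, for illustration only.
records = [
    {"project": "sales", "user": "ana", "billed_bytes": 2 * TIB},
    {"project": "sales", "user": "ana", "billed_bytes": 1 * TIB},
    {"project": "ops", "user": "raj", "billed_bytes": 4 * TIB},
]

print(cost_breakdown(records, usd_per_tib=5.0))
print(most_expensive(records, n=1))
```

Tracking the output of `most_expensive` over time is a simple way to find the queries worth optimizing first.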
For more details
bit.ly/gcp-co-bq
Thank you
Appendix
Ingestion formats
Load speed, from faster to slower:
● Avro (compressed)
● Avro (uncompressed)
● Parquet / ORC
● CSV
● JSON
● CSV (compressed)
● JSON (compressed)
Introducing BigQuery Omni
A flexible, fully-managed, multi-cloud
analytics solution that lets you analyze data
across public clouds without leaving the
familiar BigQuery user interface.
Data integration partners
SaaS Data Sources
Databases
Data warehouses
B2B, EDI data
Resource Optimizations
● BigQuery Partitioning & Clustering
● Federation: Avoid duplication of data
● Data retention and clean up for active storage
● BigQuery Caching
Pricing Efficiency
● Flex Slots
● BigQuery slot recommendations