BigQuery Cost Optimization Interview Questions

What are the various methods to reduce BigQuery costs?

All of the methods below are effective and contribute to reducing the overall cost

  • Choosing optimum partition keys
  • Choosing optimum cluster keys
  • Query tuning / rewriting to make sure the appropriate keys are used as filters
  • Materializing (physicalizing) views or query results
  • Reducing active storage costs by choosing the right partition keys

How do you choose the optimum partition key for a BigQuery table?

The partition key:

  • Should produce an optimum number of partitions. Ideally, avoid a large number of partitions that are small in size (and vice versa); e.g. make sure the median partition is at least 10MB in size
  • Should produce evenly sized chunks of data (choosing the wrong column might result in skew)
  • Should be a column frequently used in the filter conditions of reports
  • Should not result in a rapidly changing dimension. If we choose the wrong partition key, our daily data loads might end up updating every partition in the table; modified partitions are billed at the active storage rate (each modification resets the 90-day long-term storage clock), resulting in high active storage costs
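
The guidelines above can be sketched in DDL. A minimal example, assuming a hypothetical `my_dataset.events` table partitioned by event date:

```sql
-- Hypothetical events table. Partitioning by the event date keeps daily
-- loads confined to recent partitions, so older partitions stay untouched
-- and age into cheaper long-term storage.
CREATE TABLE my_dataset.events
(
  event_ts TIMESTAMP,
  user_id  STRING,
  country  STRING
)
PARTITION BY DATE(event_ts)
OPTIONS (
  require_partition_filter = TRUE  -- reject queries that would scan all partitions
);
```

`require_partition_filter` is optional, but it forces every query to include a partition filter, which guarantees partition pruning.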

What are the partition quotas?

The maximum number of partitions per partitioned table is 10,000 (this limit was previously 4,000)
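
You can check how close a table is to the quota, and how evenly sized its partitions are, with the `INFORMATION_SCHEMA.PARTITIONS` view. A sketch, with hypothetical dataset and table names:

```sql
-- Partition count and median partition size for a hypothetical table.
SELECT
  table_name,
  COUNT(*) AS partition_count,
  APPROX_QUANTILES(total_logical_bytes, 2)[OFFSET(1)] AS median_partition_bytes
FROM my_dataset.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'events'
GROUP BY table_name;
```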

How do you choose the optimum cluster key?

The cluster key:

  • Should be frequently referenced in the filter conditions of reports
  • You can choose up to four columns as cluster keys. Declare them in order of significance: the most frequently filtered column goes on the left, followed by the less frequently filtered ones
  • There is no limit on the number of unique values a cluster key can have (as opposed to partition keys, which are subject to the partition-count quota)
  • Clustering adds no cost or performance overhead (automatic re-clustering is free), so feel free to add cluster keys to all your tables
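
A minimal DDL sketch combining partitioning and clustering, again with hypothetical names. Here `country` is assumed to be the most frequently filtered column, so it is declared first:

```sql
-- Partitioned and clustered table; cluster column order matters.
CREATE TABLE my_dataset.events_clustered
(
  event_ts TIMESTAMP,
  user_id  STRING,
  country  STRING
)
PARTITION BY DATE(event_ts)
CLUSTER BY country, user_id;  -- most selective/popular filter column first
```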

How does query pruning work in BigQuery?

When you execute a SQL query in BigQuery, the query engine applies three levels of pruning:

  • Partition pruning - if the partition column is used as a filter, only the matching partitions are scanned
  • Cluster pruning - within each partition, the data is sorted by the cluster keys. So, if the query has a filter that references a cluster key, the engine doesn't have to scan the full partition; since the data is sorted, it reads only the relevant clusters
  • Columnar pruning - only the columns the query actually needs are fetched
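
A query against the hypothetical partitioned and clustered `events` table sketched earlier would benefit from all three levels:

```sql
-- The DATE filter prunes partitions, the country filter prunes clusters,
-- and selecting two named columns (instead of SELECT *) prunes columns.
SELECT user_id, event_ts
FROM my_dataset.events_clustered
WHERE DATE(event_ts) = '2024-01-15'
  AND country = 'US';
```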

Using metadata operations to reduce cost

There are certain metadata-only operations that don't cost anything. For example,

TRUNCATE TABLE my_dataset.events;          --> free of cost (metadata-only)
DELETE FROM my_dataset.events WHERE TRUE;  --> incurs a cost (scans the table)

So, when possible, prefer the metadata-only statements

Use the Query cost estimator

The BigQuery web UI has a built-in query validator. Highlight a SQL query and it shows the approximate number of bytes the query will process, before you run it; since on-demand pricing is based on bytes processed, this is a direct cost estimate. Make use of this tool to optimize your queries

Perform A/B testing

Always look for performance and cost optimization opportunities: write different versions of a SQL query, A/B test them, and keep the one with the better results

Use Snapshot tables wherever applicable

Snapshot tables are zero-copy tables: the underlying data is cloned (not copied), and storage is billed only for data that later diverges from the base table, thereby incurring reduced storage costs. So, it's a good practice to leverage snapshot tables wherever applicable
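
A minimal sketch, using hypothetical table names, of taking a point-in-time snapshot that expires on its own:

```sql
-- Snapshot of a hypothetical table; storage is billed only for data that
-- later diverges from the base table. The expiration avoids orphan snapshots.
CREATE SNAPSHOT TABLE my_dataset.events_20240115
CLONE my_dataset.events
OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
);
```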

Reduce Active storage costs

Data in a partition that has not been modified for 90 consecutive days is billed at the cheaper long-term storage rate; any modification moves it back to active storage. Periodically check your tables and make sure the active storage percentage is low (< 20% should be a good measure). In order to reduce active storage, you may have to play around with the partition keys so that daily loads touch only the most recent partitions
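
The active-storage percentage per table can be checked with the `INFORMATION_SCHEMA.TABLE_STORAGE` view. A sketch, assuming a hypothetical dataset in the `region-us` location:

```sql
-- Active vs long-term logical storage per table, worst offenders first.
SELECT
  table_name,
  active_logical_bytes,
  long_term_logical_bytes,
  SAFE_DIVIDE(active_logical_bytes,
              active_logical_bytes + long_term_logical_bytes) AS active_pct
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE table_schema = 'my_dataset'
ORDER BY active_pct DESC;
```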

Do periodic housekeeping

Drop unused tables and data, clean up historical data, and remove backup tables diligently. Avoid duplicating data in multiple places. Follow general housekeeping principles and periodically look for ways to cut costs
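
One way to make backup-table cleanup automatic is to set an expiration so the table drops itself. A sketch with a hypothetical table name:

```sql
-- Give a hypothetical backup table a 7-day lifetime so it cleans itself up.
ALTER TABLE my_dataset.events_backup
SET OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
);
```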