BigQuery Cost Optimization Interview Questions
What are the various methods to reduce BigQuery cost
All of the methods below are effective and contribute to reducing the overall cost:
- Choosing optimum partition keys
- Choosing optimum cluster keys
- Query tuning / rewrite to make sure appropriate keys are used as filter
- Materializing (physicalizing) views or query results
- Reducing active storage cost by choosing the right partition keys
How to choose the optimum partition key for a BigQuery table
The partition key,
- Should produce an optimum number of partitions. Ideally we want to avoid a large number of partitions that are small in size (and vice versa). e.g. Make sure the median partition is at least 10MB in size
- Should produce evenly sized chunks of data (choosing the wrong column might result in skew)
- Should be a column frequently used in the filter conditions in the reports
- Should not result in rapidly changing partitions. If we choose the wrong partition key, there is a chance that our daily data loads end up updating every partition in the table, resetting them all to active storage pricing and keeping active storage costs high
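As a sketch, a table partitioned on a date column looks like this (the table and column names here are hypothetical):

```sql
-- Hypothetical sales table, partitioned by order date.
-- Daily loads touch only the current day's partition, so older
-- partitions can age into long-term storage pricing untouched.
CREATE TABLE mydataset.sales
(
  order_id   STRING,
  order_date DATE,
  amount     NUMERIC
)
PARTITION BY order_date;
```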
What are the partition key quotas
The maximum number of partitions per table cannot exceed 4000
How to choose the optimum cluster key
The Cluster key,
- Should be frequently referenced in the filter conditions in the reports
- You can choose up to 4 columns as cluster keys. Declare them in order of significance: the most frequently filtered column first (leftmost), followed by progressively less frequently filtered columns
- There is no limit on the number of unique values a cluster key might have (as opposed to partition keys, which are subject to the 4000-partition limit)
- There is no cost or performance overhead due to clustering (BigQuery re-clusters tables in the background at no charge), so feel free to add cluster keys to your tables
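Continuing the hypothetical sales table from above, partitioning and clustering combine like this (column names are illustrative):

```sql
-- Most frequently filtered column (customer_id) listed first,
-- followed by the less frequently filtered store_id.
CREATE TABLE mydataset.sales
(
  order_id    STRING,
  order_date  DATE,
  customer_id STRING,
  store_id    STRING,
  amount      NUMERIC
)
PARTITION BY order_date
CLUSTER BY customer_id, store_id;
```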
How does the query pruning work in BigQuery
When you execute a SQL statement in BigQuery, the query engine applies several levels of pruning:
- Partition pruning - If the partition column is included as a filter, only the specified partitions are scanned.
- Cluster pruning - Within each partition, the data is sorted by the cluster keys. So, if the query has a filter that references a cluster key, the query engine doesn't have to scan the full partition; since the data is sorted, it reads only the relevant clusters
- Columnar pruning - Fetch only the columns required by the query (BigQuery stores data in columnar format)
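Against a table shaped like the hypothetical sales table above (partitioned on order_date, clustered on customer_id), a query like this benefits from all three pruning levels:

```sql
SELECT order_id, amount          -- columnar pruning: only 2 columns read
FROM mydataset.sales
WHERE order_date = '2024-01-15'  -- partition pruning: one partition scanned
  AND customer_id = 'C-1001';    -- cluster pruning: only matching blocks read
```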
Using metadata operations to reduce cost
There are certain metadata operations that don't cost anything. For example,
TRUNCATE TABLE --> is free of cost
DELETE FROM table WHERE TRUE; --> incurs a cost
So, when possible, prefer the metadata operations
Use the Query cost estimator
The BigQuery web UI has a built-in query cost estimator (the query validator). Highlight a SQL query and the tool shows approximately how much data the query will process, from which you can derive the cost. Make use of the tool to optimize your queries
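The validator reports bytes processed rather than dollars; converting is simple arithmetic. A minimal sketch, assuming the common on-demand rate of $6.25 per TiB (an assumption — check the current pricing for your edition and region):

```python
# Convert the query validator's "bytes processed" estimate into an
# approximate on-demand cost. The per-TiB rate is an assumption here;
# look up the current price for your region before relying on it.
PRICE_PER_TIB_USD = 6.25

def estimate_query_cost_usd(bytes_processed: int) -> float:
    """Approximate on-demand cost for a query scanning the given bytes."""
    return bytes_processed / (1024 ** 4) * PRICE_PER_TIB_USD

# Example: a query the validator says will process 512 GiB.
print(round(estimate_query_cost_usd(512 * 1024 ** 3), 2))
```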
Perform A/B testing
Always look for opportunities for performance and cost optimization: do some A/B testing by creating different versions of a SQL query and use the one with better results
Use Snapshot tables wherever applicable
Snapshot tables are zero-size at creation: the underlying data is cloned (not copied), and you pay only for data that subsequently changes in the base table, thereby incurring greatly reduced storage costs. So, it's a good practice to leverage snapshot tables wherever applicable
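Creating a snapshot is a single DDL statement; the names and expiration below are illustrative:

```sql
-- Snapshot of the hypothetical sales table; only data that later
-- changes in the base table incurs additional storage cost.
CREATE SNAPSHOT TABLE mydataset.sales_snap_20240115
CLONE mydataset.sales
OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
);
```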
Reduce Active storage costs
Periodically check your tables and keep the active storage percentage low (< 20% should be a good measure). To reduce active storage, you may have to play around with the partition keys
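One way to check the active vs long-term split is the `INFORMATION_SCHEMA.TABLE_STORAGE` view (a sketch; the region qualifier and dataset name are assumptions, substitute your own):

```sql
-- Share of logical bytes still billed at active-storage rates, per table.
SELECT
  table_name,
  ROUND(100 * active_logical_bytes
        / NULLIF(active_logical_bytes + long_term_logical_bytes, 0), 1)
    AS active_pct
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE table_schema = 'mydataset'
ORDER BY active_pct DESC;
```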
Do periodic housekeeping
Drop unused tables and data. Clean up historical data. Remove backup tables diligently. Avoid duplicating data in multiple places. Follow general housekeeping principles and periodically look for ways to cut costs
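One low-effort housekeeping lever is a default expiration on datasets that hold scratch or backup tables (the dataset name here is illustrative):

```sql
-- Tables created in this dataset are dropped automatically after
-- 30 days unless an explicit expiration overrides the default.
ALTER SCHEMA scratch_dataset
SET OPTIONS (default_table_expiration_days = 30);
```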