Caching is the result of Snowflake's Unique architecture which includes various levels of caching to help speed your queries. This tutorial provides an overview of the techniques used, and some best practice tips on how to maximize system performance using caching
Imagine executing a query that takes 10 minutes to complete. Now if you re-run the same query later in the day while the underlying data hasn’t changed, you are essentially doing again the same work and wasting resources
Instead Snowflake caches the results of every query you ran and when a new query is submitted, it checks previously executed queries and if a matching query exists and the results are still cached, it uses the cached result set instead of executing the query. This can greatly reduce query times because Snowflake retrieves the result directly from the cache.
Snowflake Cache results are global and can be used across users. eg. If User A executes a query and the results are cached, When User B executes the same query the results will be served from cache
The Results cache holds the results of every query executed in the past 24 hours. These are available across virtual warehouses, so query results returned to one user is available to any other user on the system who executes the same query, provided the underlying data has not changed.
Whenever data is needed for a given query it's retrieved from the Remote Disk storage, and cached in SSD and memory of the Virtual Warehouse. This data will remain until the virtual warehouse is active. When there is a subsequent query fired an if it requires the same data files as previous query, the virtual warhouse might choose to reuse the datafile instead of pulling it again from the Remote disk
This is not really a Cache. Instead, It is a service offered by Snowflake. Snowflake automatically collects and manages metadata about tables and micro-partitions
For Micro-Partitions, Snowflake stores:
For Clustering, Snowflake Stores:
All DML operations take advantage of micro-partition metadata for table maintenance. Some operations are metadata alone and require no compute resources to complete, like the query below
SELECT MIN(L_SHIP_DATE), MAX(L_SHIP_DATE) FROM LINE_ITEM;
Micro-partition metadata also allows for the precise pruning of columns in micro-partitions. When pruning, Snowflake does the following:
Snowflake Cache results are invalidated when the data in the underlying micro-partition changes. Although more information is available in the Snowflake Documentation, a series of tests demonstrated the result cache will be reused unless the underlying data (or SQL query) has changed. As a series of additional tests demonstrated inserts, updates and deletes which don't affect the underlying data are ignored, and the result cache is used, provided data in the micro-partitions remains unchanged
Finally, results are normally retained for 24 hours, although the clock is reset every time the query is re-executed, up to a limit of 30 days, after which results query the remote disk
To disable the Snowflake Results cache, run the below query. It should disable the query for the entire session duration
alter session set use_cached_result =false;
Lets go through a small example to notice the performace between the three states of the virtual warehouse. In this example we have a 60GB table and we are running the same SQL query but in different Warehouse states
Cold Warehouse | Warm Warehouse | Hot Warehouse | ||
---|---|---|---|---|
Run Time | 20 seconds | 1.2 seconds | 2 milliseconds | The lower the Query Run time, the better |
Remote Disk (Source) | 12.5 GB | 0 | 0 | This is the data that is being pulled from Snowflake Micro partition files (Disk) |
Local Disk Cache | 0 | 12.5 GB | 0 | This is the files that are stored in the Virtual Warehouse disk and SSD Memory. The more the local disk is used the better |
Results Cache | 0 | 0 | 100% | The results cache is the fastest way to fullfill a query |