Optimize Performance on Snowflake
2.05.2023
While Snowflake does all query tuning and optimization behind the scenes without DBA support, there are still situations, especially when dealing with huge amounts of data, when a little help from a Snowflake administrator is needed.
Query Performance Example
Here is a real-world example from our migration case study, where we did a performance comparison. For the initial data load, we took the tables we had in Oracle and migrated them into Snowflake; we used the exact same tables with the same number of records to have the same baseline. Then we executed a selected, very complicated query that the Oracle DBA had spent a long time optimizing on Oracle. This SQL query returns about 80 million records, joins 5 tables, and uses 6 subqueries, including some analytical functions. This was quite a challenge for the Oracle environment, and even after all the tuning effort the query took about 2 hours to complete, which was not ideal.
When the customer migrated from Oracle to Exadata, they spent additional time optimizing the query for Exadata. Unfortunately, some of the tuning that was done previously on Oracle, such as using indexes, was no longer efficient in Exadata. They then used parallel hints, since Exadata allowed parallel execution, and did some more fine-tuning. All of this meant a lot of manipulation and testing, and a senior DBA spent a substantial amount of time working on it. They managed to reduce the query execution time to 43 minutes, and that was the best they could do.
After the data was migrated to Snowflake, we took that same SQL query, as it was, and executed it on Snowflake. Very few adjustments were needed: we changed the TRUNC() date function to add an additional parameter, and we recreated in Snowflake a simple user-defined SQL function that the query used. Otherwise, the query was copied and executed without any tuning, as there is no tuning on Snowflake, using the smallest possible Snowflake warehouse, the XS size.
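As an illustration of the kind of adjustment needed (table and column names here are hypothetical): Oracle's single-argument TRUNC(date) defaults to truncating to the day, while Snowflake's TRUNC() expects the date part explicitly:

```sql
-- Oracle (truncates to day by default):
--   SELECT TRUNC(order_date) FROM orders;

-- Snowflake needs the date part as a second argument
-- (orders / order_date are illustrative names):
SELECT TRUNC(order_date, 'DAY') AS order_day
FROM orders;

-- DATE_TRUNC('DAY', order_date) is an equivalent alternative.
```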
We were astounded when we saw that the query took only 11 minutes to execute. This is where we could clearly see how well Snowflake performs without any additional tuning on the smallest possible virtual warehouse.
We also did some comparisons with larger warehouse sizes. On the Small virtual warehouse size, the query ran for 7 minutes, and on the Medium size it ran for 4 minutes. This illustrates that by increasing the virtual warehouse size, the query execution time is reduced. The following diagram summarizes our findings:
Query Tuning and Optimization
As you probably know, query tuning and optimization are very important in Oracle, and there are Oracle DBAs who do it all the time. They look at the execution plan, fine-tune queries, add hints, create indexes (and sometimes even disable indexes, as there are situations when an index can do more harm than good for certain query types), and so on. In Oracle, they use the explain plan functionality to understand what is going on in the query and look for ways to improve it.
In Snowflake, we use the query profile, which gives us a graphical representation of the query execution path along with all of the statistics, such as partition pruning, disk spilling, the amount of data in bytes, the number of records processed, etc. Although Snowflake shows us the explain plan, there is not much we can actually do with it. There is no tuning, no optimization, and no hints. Snowflake internally decides how it will execute and optimize the query; we have no way to influence it ourselves.
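For reference, the plan can still be inspected even though there is nothing to tune. A minimal sketch against hypothetical tables:

```sql
-- Show the logical execution plan as text (read-only; no hints can be applied)
EXPLAIN USING TEXT
SELECT c.customer_name, SUM(o.amount)
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
GROUP BY c.customer_name;
```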
Oracle Exadata is very performant, but then again you have a lot of DBA work to make the most of it. Oracle has really nailed down the partitioning, the indexing, the query hints and other possibilities. If you tune it well, you can have a lot of data and the performance is exceptional.
In Snowflake, you have exceptional performance as well, but no physical data organization steps are needed. There is no indexing, no tablespaces, and no table partitions, because Snowflake stores data in columnar format using internal micro-partitions that are not user-configurable.
If physical partitioning is required for very large tables where query performance has deteriorated, Snowflake offers the option to cluster data. This means that Snowflake will physically co-locate records with the same clustering key within the same micro-partitions, allowing for better partition pruning when executing queries and consequently better query performance. All clustering is done behind the scenes by Snowflake; no user intervention is needed, except to define the clustering key.
You may want to add clustering to a table in Snowflake, but only when the following conditions are met:
- A table must be large (multi TB range) to have sufficient micro-partitions.
- The column(s) defined in the clustering key must provide sufficient filtering.
Our experience shows that a 2TB fact table with a clustering key of 20 distinct values worked really well.
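Defining a clustering key is a single DDL statement; everything else happens automatically. A sketch with illustrative table and column names:

```sql
-- Co-locate rows with the same region_code in the same micro-partitions
ALTER TABLE sales_fact CLUSTER BY (region_code);

-- Inspect how well the table is clustered on that key
SELECT SYSTEM$CLUSTERING_INFORMATION('sales_fact', '(region_code)');
```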
If some type of indexing is required for very large tables where query performance has deteriorated, Snowflake offers the option to add search optimization to columns in a table. Like an index, search optimization creates an access path to quickly find records within micro-partitions based on the columns included in the search optimization. This is all done by Snowflake behind the scenes without any user intervention; no building or rebuilding of search optimization is needed. However, as a general rule, don't add any search optimization initially. Add it to selected columns later, but only if needed, depending on the types of queries that will be executed.
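Adding search optimization is likewise a single statement; the table and column names below are illustrative:

```sql
-- Speed up point lookups (equality predicates) on a selected column
ALTER TABLE sales_fact ADD SEARCH OPTIMIZATION ON EQUALITY(customer_id);

-- Enabling it for the entire table is possible too, but costs more:
-- ALTER TABLE sales_fact ADD SEARCH OPTIMIZATION;
```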
Choose Virtual Warehouse Size
Snowflake provides many different virtual warehouse sizes. The larger the warehouse size, the more resources it has available and the more performance benefits it offers. It also costs more. For that reason, we usually start with the smallest warehouse size, XS, and add larger warehouses as needed.
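Starting small and resizing later is straightforward; a sketch with a hypothetical warehouse name:

```sql
-- Start with the smallest size and let it suspend when idle to save credits
CREATE WAREHOUSE IF NOT EXISTS reporting_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND   = 60      -- seconds of inactivity before suspending
  AUTO_RESUME    = TRUE;

-- Resize later only if monitoring shows the workload needs it
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'SMALL';
```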
In our migration case study, the smallest warehouse size XS on Snowflake yielded better performance when compared to Oracle/Exadata for record counts in the 100 million range.
Snowflake offers the flexibility to choose the virtual warehouse size by subject area or even for specific SQL queries in the ETL process. This means that when one complex query requires a larger virtual warehouse to execute efficiently, we don't have to run the complete ETL process on the larger, more expensive warehouse; we can run just that particular step on a larger warehouse.
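In practice this is just a USE WAREHOUSE statement between ETL steps (warehouse and table names here are illustrative):

```sql
USE WAREHOUSE etl_xs_wh;                  -- XS warehouse for ordinary steps
INSERT INTO stg_customers SELECT * FROM src_customers;

USE WAREHOUSE etl_l_wh;                   -- larger warehouse for the heavy step only
INSERT INTO fact_sales
SELECT * FROM stg_sales JOIN dim_date USING (date_key);

USE WAREHOUSE etl_xs_wh;                  -- back to XS for the remaining steps
```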
To test query scalability in Snowflake, we take the most demanding queries and monitor how they scale on larger virtual warehouses. Typically, on a 2x larger warehouse, the query should execute 1.5 – 2 times faster. If it doesn’t scale, there are optimization possibilities:
- For large scans that are selective, clustering may be useful.
- We can also rewrite the query to take multiple smaller steps with intensive inserts (e.g., 25M+ records each step) into intermediate tables, for example by restructuring the query using WITH clauses.
- As a general rule, perform small inserts when possible.
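A sketch of the rewrite option, with hypothetical table names: materialize the expensive part into an intermediate table first, then run the final step against it.

```sql
-- Step 1: materialize the expensive aggregation into an intermediate table
CREATE TEMPORARY TABLE tmp_daily_totals AS
SELECT customer_id, order_day, SUM(amount) AS total_amount
FROM orders
GROUP BY customer_id, order_day;

-- Step 2: the final insert joins against the much smaller intermediate result
INSERT INTO customer_summary
SELECT c.customer_id, c.customer_name, t.order_day, t.total_amount
FROM customers c
JOIN tmp_daily_totals t ON t.customer_id = c.customer_id;
```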
Why Optimize if It Works?
In the cloud world, optimization is directly connected with cost consumption. For BI queries, even an XS or Small warehouse size can perform well, thanks to columnar storage. We recommend starting small and then selectively increasing the warehouse size.
Finally, never compare a single Snowflake virtual warehouse to a single Oracle/Exadata machine. Use multiple warehouses on Snowflake, because they allow workload isolation: they can be enabled or disabled instantly, they can be monitored independently, and they can provide more processing power when workloads increase.
Senior Consultant and Snowflake Data Superhero