
Snowflake Estimating Functions

3.08.2022

Estimating the similarity of two sets, estimating frequent values, and estimating percentile values

Snowflake announced an update to the SnowPro Core Certification earlier this week. A new topic in the exam domain breakdown that wasn’t there in the previous version of the exam is estimating functions. According to the documentation, the estimating functions available in Snowflake cover Similarity of Two or More Sets, Frequent Values, and Percentile Values.

Many Snowflake users probably don’t use estimating functions on a regular basis, as these functions are primarily meant for data scientists. In this post, I will explore the Snowflake estimating functions and attempt to explain them in layman’s terms without complicated mathematical equations.

Estimating Similarity of Two or More Sets

To estimate the similarity between two tables, we want a value between 0 and 1 that tells us how similar or dissimilar the tables are. A value of 0 means that the two tables are completely different, and a value of 1 means that they are exactly the same.

One way to calculate similarity is the Jaccard similarity coefficient (as promised, there will be no mathematical equations here, you can always explore further on your own if interested). The idea of this coefficient is to take the number of rows in the intersection of the two sets that we are comparing and divide it by the number of rows in the union of the two sets.

Let’s work with some examples.

We will create two tables named CUST1 and CUST2 and populate them both with the same data:

create or replace table cust1 
  (id integer, name varchar, birthday date);

insert into cust1 values(1, 'First Customer', '1988-12-05');
insert into cust1 values(2, 'Second Customer', '1992-02-19');
insert into cust1 values(3, 'Third Customer', '1989-07-29');

create or replace table cust2 
  (id integer, name varchar, birthday date);

insert into cust2 values(1, 'First Customer', '1988-12-05');
insert into cust2 values(2, 'Second Customer', '1992-02-19');
insert into cust2 values(3, 'Third Customer', '1989-07-29');

The intersection between tables CUST1 and CUST2 is:

select * from cust1 intersect select * from cust2;
Intersection between tables CUST1 and CUST2

The union between tables CUST1 and CUST2 is (remember that the union operator returns distinct rows):

select * from cust1 union select * from cust2;
Union between tables CUST1 and CUST2

The similarity coefficient between tables CUST1 and CUST2 is the number of rows in the intersection (which is 3) divided by the number of rows in the union (which is again 3), giving a result of 1. This is the expected result since the two tables match exactly.

Next, we will create table CUST3 and populate it with two rows that are the same as in table CUST1 and one row that is different:

create or replace table cust3 
  (id integer, name varchar, birthday date);

insert into cust3 values(1, 'First Customer', '1988-12-05');
insert into cust3 values(2, 'Second Customer', '1992-02-19');
insert into cust3 values(4, 'Fourth Customer', '1986-09-10');

To calculate the similarity coefficient between tables CUST1 and CUST3, we again need the intersection and the union between the two tables:

select * from cust1 intersect select * from cust3;
Intersection between tables CUST1 and CUST3

select * from cust1 union select * from cust3;
Union between tables CUST1 and CUST3

The similarity coefficient between tables CUST1 and CUST3 is the number of rows in the intersection (which is 2) divided by the number of rows in the union (which is 4), giving a result of 0.5.
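The same exact calculation can also be written as a single query that divides the two counts. Here is a simple sketch using the CUST1 and CUST3 tables from above:

select
  -- number of rows in the intersection of the two tables
  (select count(*) from (select * from cust1 intersect select * from cust3))
  /
  -- number of rows in the union of the two tables
  (select count(*) from (select * from cust1 union select * from cust3))
  as jaccard_similarity;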

In the real world, we rarely work with such small tables. When we have huge tables with many millions of rows, calculating the similarity using the intersect and union operators would be too time-consuming and costly. This is where we need an estimating function to calculate an approximate similarity.

Snowflake uses the MinHash algorithm for quickly estimating how similar two sets are, without computing the intersection or union (again, there will be no mathematical equations here, you can always explore further on your own if interested).

To calculate the MinHash of a table, we use the Snowflake MINHASH function. For example, to calculate the MinHash of table CUST1:

select minhash(5, *) from cust1;

The first parameter to the MINHASH function is the number of hash functions to be created. The larger the value, the better the approximation that will be calculated, but the longer the computation time. The suggested value is 100 and the maximum value is 1024. We are using 5 in this example for illustrative purposes.

The second parameter to the MINHASH function is the column list from the table for which we are calculating the MinHash. We are using * in this example to indicate all columns. The result of the above statement is:

MINHASH of table CUST1 with 5 hash functions
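Instead of *, we can also list specific columns if we want the similarity to be based on only a subset of columns. For example, a small sketch that hashes only the ID and NAME columns of table CUST1:

-- MinHash based on two columns only
select minhash(100, id, name) as mh from cust1;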

To calculate the approximate similarity between two tables, we have to first calculate the MinHash of each table and then use the Snowflake APPROXIMATE_SIMILARITY function. This function returns an estimation of the Jaccard coefficient of the input tables based on their MinHash states.

Here is how we would calculate the approximate similarity between tables CUST1 and CUST3:

select approximate_similarity(mh) 
from (
  select minhash(100, *) as mh from cust1
  union all
  select minhash(100, *) as mh from cust3
);

The result is 0.49, which is close to the value of 0.5 that we calculated previously using the intersection and union of the two tables.

We can experiment with different values for the number of hash functions parameter. With a parameter value of 1000, the approximate similarity becomes exactly 0.5. With a parameter value of 50, the approximate similarity is 0.52, and so on.

Let’s think about use cases where we might need similarity coefficients. In addition to what data scientists do, we could use similarity coefficients to quickly check the correctness of data ingestion when we are dealing with huge tables. We would calculate the similarity coefficient between the source and the target table to get prompt feedback on whether the table loaded successfully or whether anything major is amiss.
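As a sketch of this idea, assuming hypothetical tables named SRC_ORDERS and TGT_ORDERS, we could compare the two tables after each load and flag any suspiciously low similarity:

-- SRC_ORDERS and TGT_ORDERS are hypothetical table names used for illustration
select load_similarity,
       iff(load_similarity >= 0.99, 'OK', 'CHECK LOAD') as load_status
from (
  select approximate_similarity(mh) as load_similarity
  from (
    select minhash(100, *) as mh from src_orders
    union all
    select minhash(100, *) as mh from tgt_orders
  )
);

The threshold of 0.99 is arbitrary; in practice, we would choose it based on how much change we expect between the source and the target.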

Estimating Frequent Values

In data analysis, we often start by calculating the frequency of values in a column in a table. For example, using the Fantasy Football database from Snowflake Marketplace, this is how we might examine the number of players from the top 10 colleges in descending order:

select top 10 college, count(*) as player_count
from fantasy_football_2020.nfl2020.roster
group by college
order by 2 desc;
Top 10 colleges by number of players

In a real use case scenario, we would probably remove the value “None” from the analysis, but we will keep it here for illustration because we want some skew in our data.

While the SQL query to count the players by college is simple and useful, we want an alternative way of estimating approximate frequent values in large data sets that is space and time efficient. Snowflake uses the Space-Saving algorithm (still no mathematical equations here, you can always explore further on your own if you are interested), which is implemented in the APPROX_TOP_K function.

For example, to calculate an approximate number of players for the top 10 colleges, we can use the APPROX_TOP_K function:

select approx_top_k(college, 10, 70)
from fantasy_football_2020.nfl2020.roster;

The first parameter to the APPROX_TOP_K function is the column name for which we want to find the frequency.

The second parameter to the APPROX_TOP_K function is the number of values whose frequency we want approximated. We set it to 10 because we want the top 10 colleges.

The third parameter to the APPROX_TOP_K function is the maximum number of distinct values that can be tracked at a time during the estimation process. The approximation is more accurate if this number is large in comparison with the number of values whose counts we want approximated. We have set it to 70 in our example, but we could have set it lower for less accurate results or higher for more accurate results.

The APPROX_TOP_K function returns results in JSON format, so for a nicer display we use the LATERAL FLATTEN syntax to transform the results into tabular format:

with colleges as (
 select approx_top_k(college, 10, 70) as colleges_result
 from fantasy_football_2020.nfl2020.roster
)
select value[0]::varchar as college, 
  value[1]::integer as player_count
from colleges, lateral flatten(colleges_result)
order by 2 desc;
Top 10 colleges by player count (estimated)

We can see from the results that the estimated counts are close to the results that we retrieved using the first SQL query.
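To check the accuracy more systematically, we can put the exact and the estimated counts side by side. Here is a sketch that joins the two result sets, assuming (as in the data set above) that the COLLEGE column contains non-null strings:

with exact as (
  -- exact counts per college
  select college, count(*) as exact_count
  from fantasy_football_2020.nfl2020.roster
  group by college
),
estimated as (
  -- estimated top 10 colleges
  select approx_top_k(college, 10, 70) as colleges_result
  from fantasy_football_2020.nfl2020.roster
),
flattened as (
  select value[0]::varchar as college,
    value[1]::integer as approx_count
  from estimated, lateral flatten(colleges_result)
)
select f.college, e.exact_count, f.approx_count
from flattened f
join exact e on e.college = f.college
order by e.exact_count desc;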

Estimating Percentile Values

A percentile value is a number below which a given percentage of values falls. If a column has N values, then the K-th percentile (where K is expressed as a fraction between 0 and 1) is a value such that approximately N*K of the values are smaller than it.

Let’s work with an example. We will use a small sample table with exam scores:

select student_id, math_score
from exam_scores
order by math_score;
Sample table with exam scores
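To follow along, here is one possible way to create a comparable sample table. The ten scores are hypothetical and chosen so that they reproduce the results calculated below:

-- Hypothetical sample data, chosen so that the median math score is 66
create or replace table exam_scores 
  (student_id integer, math_score integer);

insert into exam_scores values
  (1, 45), (2, 52), (3, 58), (4, 61), (5, 66),
  (6, 66), (7, 71), (8, 78), (9, 84), (10, 93);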

In this example, the 50th percentile (or 0.5) is the score below which 50% of the scores fall. Since we have 10 rows in total, we want to find the value below which 5 rows are found (marked with a red line in the sample data above).

The PERCENTILE_CONT function in Snowflake is used to calculate a percentile value based on a continuous distribution of the input values. To find the 50th percentile (or 0.5) value in our sample data:

select percentile_cont(0.5) within group (order by math_score) 
from exam_scores;

The result is 66. To verify, when we look at the sample data above, we see that this value lies around the red line, meaning that 50% of our scores are below it.

Alternatively, in situations where we have huge amounts of data in our table, Snowflake has the APPROX_PERCENTILE function to estimate the percentile value:

select approx_percentile(math_score, 0.5) 
from exam_scores;

The result is again 66, which is the same as the result given by the PERCENTILE_CONT function. Since we are working with a very small data set, it’s not surprising that the estimating function APPROX_PERCENTILE gives the same result as PERCENTILE_CONT. In larger data sets, the estimate might differ somewhat from the actual value.
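When we need several different percentiles from the same column, Snowflake also offers APPROX_PERCENTILE_ACCUMULATE and APPROX_PERCENTILE_ESTIMATE, which let us save the internal approximation state once and estimate any percentile from it later. A small sketch using the same EXAM_SCORES table:

-- Save the approximation state once (it can also be merged with other
-- states using APPROX_PERCENTILE_COMBINE)
create or replace table exam_scores_state as
  select approx_percentile_accumulate(math_score) as state
  from exam_scores;

-- Estimate any percentile from the saved state without rescanning the data
select approx_percentile_estimate(state, 0.5) as median_estimate,
       approx_percentile_estimate(state, 0.9) as p90_estimate
from exam_scores_state;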

More information on Snowflake estimating functions

In addition to the examples and high-level explanations that I have shown in this post, the Snowflake documentation contains much more information as well as additional functions, such as estimating the number of distinct values in a data set.

Hopefully, this introduction has sparked interest in estimating functions. I will try to remember to use them more, especially when I need a quick estimate of a value when working with very large tables.


Maja Ferle
Senior Consultant and Snowflake Data Superhero