BigQuery count distinct vs count of group by colx

I have been under the impression that if you were to do a COUNT(DISTINCT xyz) on some column, it would be equal to the regular count of a GROUP BY that column.

\n\n

However, when I do that over a very large dataset in BigQuery, with the exact same conditions, it is showing a large difference in results:

\n\n

Query Type             Count\n----------------------------------\n- count(distinct ColX)   > 7 million\n- count(ColX)\n    ... GROUP BY ColX    ~ 6.5 million

In google bigquery - If you use the DISTINCT keyword, the function returns the number of distinct values for the specified field. Note that the returned value for DISTINCT is a statistical approximation and is not guaranteed to be exact - the documentation is also clear about this.

\n\n

Alternatives:

\n\n

To compute the exact number of distinct values, use EXACT_COUNT_DISTINCT
For a more scalable approach, consider using GROUP BY on the relevant field(s) and then applying COUNT(*). The GROUP BY approach is more scalable but might incur a slight up-front performance penalty

Mike-Barn

posted on

Enjoy great content like this and a lot more !

Signup for a free account to write a post / comment / upvote posts. Its simple and takes less than 5 seconds