BigQuery count distinct vs count of group by colx

I have been under the impression that if you were to do a COUNT(DISTINCT xyz) on some column, it would be equal to the regular count of a GROUP BY that column.

However, when I do that over a very large dataset in BigQuery, with the exact same conditions, it is showing a large difference in results:

Query Type             Count
----------------------------------
- count(distinct ColX)   > 7 million
- count(ColX)
    ... GROUP BY ColX    ~ 6.5 million

In google bigquery - If you use the DISTINCT keyword, the function returns the number of distinct values for the specified field. Note that the returned value for DISTINCT is a statistical approximation and is not guaranteed to be exact - the documentation is also clear about this.

Alternatives:

  • To compute the exact number of distinct values, use EXACT_COUNT_DISTINCT
  • For a more scalable approach, consider using GROUP BY on the relevant field(s) and then applying COUNT(*). The GROUP BY approach is more scalable but might incur a slight up-front performance penalty

Mike-Barn

posted on 08 Jun 19

Enjoy great content like this and a lot more !

Signup for a free account to write a post / comment / upvote posts. Its simple and takes less than 5 seconds