BigQuery count distinct vs count of group by colx
I have been under the impression that if you were to do a COUNT(DISTINCT xyz)
on some column, it would be equal to the regular count of a GROUP BY
that column.
However, when I do that over a very large dataset in BigQuery, with the exact same conditions, it is showing a large difference in results:
Query Type Count
----------------------------------
- count(distinct ColX) > 7 million
- count(ColX)
... GROUP BY ColX ~ 6.5 million
In google bigquery - If you use the DISTINCT keyword, the function returns the number of distinct values for the specified field. Note that the returned value for DISTINCT is a statistical approximation and is not guaranteed to be exact - the documentation is also clear about this.
Alternatives:
- To compute the exact number of distinct values, use
EXACT_COUNT_DISTINCT
- For a more scalable approach, consider using GROUP BY on the relevant field(s) and then applying COUNT(*). The GROUP BY approach is more scalable but might incur a slight up-front performance penalty
Mike-Barn
posted on 08 Jun 19Enjoy great content like this and a lot more !
Signup for a free account to write a post / comment / upvote posts. Its simple and takes less than 5 seconds
Post Comment