How to count distinct values over multiple columns using SQL
Often we want to count the number of distinct items from this table but the distinct is over multiple columns
Method-1 Using a derived table (subquery)
You can simply create a select distinct query and wrap it inside of a select count(*) sql, like shown below:
SELECT COUNT(*)
FROM ( SELECT DISTINCT DocumentId, DocumentSessionId
FROM Table ) AS internalQuery
Method-2 Using Concatenated columns
A simpler method would be to use concatenated columns
SELECT COUNT(DISTINCT(CONCAT(DocumentId,DocumentSessionId))) FROM Table;
Method-3 If performance is a factor
If you end up doing this often and over large tables, you can consider adding an additional column to the table itself (physicalize), which concatenates all the columns and computes a hash on it
Eg: you can add a new column to the table and store MD5(CONCAT(DocumentId,DocumentSessionId)), so you can easily count distinct on this new column going forward.
You can even consider creating indexes and or compute statistics on this new column in order to improve performance
You can use these techniques in,
- Snowflake Data Warehouse
- Google Bigquery
- Teradata
- Any Other SQL database
nVector
posted on 20 Jun 20Enjoy great content like this and a lot more !
Signup for a free account to write a post / comment / upvote posts. Its simple and takes less than 5 seconds
Post Comment