How to remove duplicate rows in BigQuery based on a unique identifier
Consider an example table:
id value
1 4 1 2 1 3
2 5
3 7
4 8
What would be the process of removing duplicate records based on a unique ID in BQ?. You can use the ROW_NUMBER function
Deleting the duplicates in the same table
delete from Dataset.Employee where struct(Id,value) in
(
select STRUCT(Id, value) from
(
select Id, value, ROW_NUMBER() over (Partition by id order by value desc) as rn
from Dataset.Employee
)a where a.rn <> 1
);
And the result:
id value
1 4
2 5
3 7
4 8
NOTE: The above method can be used only when the value column is different across rows. In the case of full-row duplicates, you will not be able to use the above method. Instead, create a new table and only insert the unique ids into the new table using ROW_NUMBER.
nVector
posted on 27 Sep 18Enjoy great content like this and a lot more !
Signup for a free account to write a post / comment / upvote posts. Its simple and takes less than 5 seconds
Post Comment