How to Remove Duplicate records from a Bigquery Table
Problem 🔍
Bigquery table has duplicates and it has to be cleaned up. There are three types of duplicates.
- Full row absolute duplicates (eg. The entire row is inserted twice accidentally)
- For the same Id there are multiple rows with different timestamps. Keep the latest record for each Id and discard the old ones
- Surgical deletes - perform complex grouping and only delete a subset of rows
Let's see how we can cleanup for each of the above scenarios
Solution 💡
Method #1 How to cleanup full row absolute duplicates that has been accidentally inserted twice
Use DISTINCT to fetch only the unique rows and recreate the table as follows.
For Partitioned tables, You need to use the same parameters (PARTITION BY) for the table
CREATE OR REPLACE TABLE `dataset.employee`
PARTITION BY date
AS SELECT DISTINCT * FROM `dataset.employee`;
For unpartitioned tables use the below syntax:
CREATE OR REPLACE TABLE `dataset.employee`
AS SELECT DISTINCT * FROM `dataset.employee`;
Method #2 How to keep the record with the latest timestamp and remove the older records for each Id in Bigquery
In the below example, For each employee Id, we are keeping the record with the latest load timestamp and deleting all the older duplicate records
DELETE FROM `yourdataset.employee`
WHERE STRUCT(id, LoadTimestamp)
NOT IN (
SELECT AS STRUCT id, MAX(LoadTimestamp) as LoadTimestamp
FROM `yourdataset.employee`
GROUP BY id)
Method #3 How to surgically delete only a subset group of data in Bigquery
Sometimes you want to perform WINDOW functions and only want to remove a sub-group of data.
Eg. For each Employee Id, keep the records with the latest value and delete all the other rows that belongs to that Id
delete from Dataset.Employee where struct(Id,value) in
(
select STRUCT(Id, value) from
(
select Id, value, ROW_NUMBER() over (Partition by id order by value desc) as rn
from Dataset.Employee
)a where a.rn <> 1
);
dan-irving
posted on 13 Feb 19Enjoy great content like this and a lot more !
Signup for a free account to write a post / comment / upvote posts. Its simple and takes less than 5 seconds
Post Comment