How to Remove Duplicate records from a Bigquery Table

Problem 🔍

Bigquery table has duplicates and it has to be cleaned up. There are three types of duplicates.

  • Full row absolute duplicates (eg. The entire row is inserted twice accidentally)
  • For the same Id there are multiple rows with different timestamps. Keep the latest record for each Id and discard the old ones
  • Surgical deletes - perform complex grouping and only delete a subset of rows

Let's see how we can cleanup for each of the above scenarios

Solution 💡

Method #1 How to cleanup full row absolute duplicates that has been accidentally inserted twice

Use DISTINCT to fetch only the unique rows and recreate the table as follows. 

For Partitioned tables, You need to use the same parameters (PARTITION BY) for the table

CREATE OR REPLACE TABLE `dataset.employee`
PARTITION BY date
AS SELECT DISTINCT * FROM `dataset.employee`;

For unpartitioned tables use the below syntax:

CREATE OR REPLACE TABLE `dataset.employee`
AS SELECT DISTINCT * FROM `dataset.employee`;

Method #2 How to keep the record with the latest timestamp and remove the older records for each Id in Bigquery

In the below example, For each employee Id, we are keeping the record with the latest load timestamp and deleting all the older duplicate records

DELETE FROM `yourdataset.employee` 
WHERE STRUCT(id, LoadTimestamp) 
NOT IN (
SELECT AS STRUCT id, MAX(LoadTimestamp) as LoadTimestamp
FROM `yourdataset.employee`
GROUP BY id)  

Method #3 How to surgically delete only a subset group of data in Bigquery

Sometimes you want to perform WINDOW functions and only want to remove a sub-group of data.

Eg. For each Employee Id, keep the records with the latest value and delete all the other rows that belongs to that Id

delete from Dataset.Employee where struct(Id,value) in 
(
select STRUCT(Id, value) from
(
select Id, value, ROW_NUMBER() over (Partition by id order by value desc) as rn
from Dataset.Employee
)a where a.rn <> 1
);

dan-irving

posted on 13 Feb 19

Enjoy great content like this and a lot more !

Signup for a free account to write a post / comment / upvote posts. Its simple and takes less than 5 seconds