How to DELETE duplicate records from BigQuery table

#1 If the table has multiple records for each Id / Key and you want to keep only the latest record

For example, your table has duplicate records and you want to keep one record and get rid of the duplicate records, You can use the below query to cleanup the table

Note in this example, Id is the Key and for each Id there has to different loadTime for each record for this method to work

DELETE FROM `yourproject.yourdataset.duplicates`
WHERE STRUCT(id, loadTime) NOT IN (
        SELECT AS STRUCT id, MAX(loadTime) loadTime 
        FROM `yourproject.yourdataset.duplicates` 
        GROUP BY id)

#2 How to remove absolute full row duplicates from Big query table

The easiest way is to re-create the whole table in place using DISTINCT. You need to use the same parameters (PARTITION BY) for the table.

CREATE OR REPLACE TABLE `transactions.testdata`
PARTITION BY date
AS SELECT DISTINCT * FROM `transactions.testdata`;

For unpartitioned table use the below query:

CREATE OR REPLACE TABLE `transactions.testdata`
AS SELECT DISTINCT * FROM `transactions.testdata`;

DataFreak

posted on

Enjoy great content like this and a lot more !

Signup for a free account to write a post / comment / upvote posts. Its simple and takes less than 5 seconds